All Questions
Tagged with web-scraping html
16 questions
2
votes
0
answers
68
views
Simplified HTML parsing for LEGO features
The goal is to extract the Features section from a Lego product page. In the Features section, usually there's a header (...
1
vote
1
answer
54
views
Extracting information from HTML with XSLT 3.0 when data is grouped visually as siblings in a td separated by blank lines
I have a work-in-progress where I'm using XSLT 3 to extract information from some preprocessed archaic HTML. I'd like to produce JSON showing the relationships between the various entities for further ...
2
votes
1
answer
2k
views
Parsing scraped data from HTML table
I've written a simple Python web scraper that parses text from an HTML table and stores the scraped data in a list of dictionaries. The code works and doesn't seem to have any glaring issues performance-...
2
votes
1
answer
1k
views
Python script to scrape titles of public YouTube playlist
Just started in Python; wrote a script to get the names of all the titles in a public YouTube playlist given as input, but it got messier than it might have to be.
I looked around online and found ...
3
votes
2
answers
1k
views
Performance and Readability Improvements for HTML Parser with BeautifulSoup
This function takes as an argument a JSON file (could contain anything in JSON format, since I scrape hundreds of random pages) and returns a list of dictionaries where a URL is mapped to its ...
1
vote
1
answer
14k
views
Extract HTML content based on tags, specifically headers
I want the function to take as input a JSON file containing html_body with its corresponding URL and return a list of tuples containing headers and their corresponding URL (so could be a tuple with one ...
3
votes
1
answer
749
views
Scraping data from a table in Python
I'm new to Python, and after doing a few tutorials, some about scraping, I've been trying some simple scraping on my own. Using BeautifulSoup I manage to get data from web pages where everything has ...
2
votes
1
answer
160
views
Optimizing Java HTML parser
I wrote a program that goes through a webpage and returns regex matches. I used it on my letterboxd.com account to go through all of my movies (over 900 entries) and then find the genres field for each ...
4
votes
1
answer
291
views
HTML Scraper for Plex downloads page
I have written a scraper in Python 3 using Beautiful Soup 4 to retrieve the latest version of Plex Media Server from https://plex.tv, and I'd like some feedback on how to improve it.
The HTML the ...
3
votes
0
answers
146
views
Using Nokogiri to scrape Oscars winners from Wikipedia
I am scraping a Wikipedia page, getting info from that page and instantiating a new object with the information collected:
...
5
votes
2
answers
289
views
Press any login button on any site
I'm working on a script that will be able to press the login button on any site for an app I'm working on. I have it working (still a few edge cases to work out such as multiple submit buttons and ...
1
vote
1
answer
173
views
Program to create list of all English Wikipedia articles
This program will scrape Wikipedia to create a list of all English Wikipedia articles.
How can I improve this program? It currently performs very badly. On my Internet connection ...
2
votes
0
answers
139
views
Compressing a blog into a preview using tumblr_api_read
Here is what I currently have working. I would like to make it look more aesthetically pleasing, so that it does not cut words off midway. I would also like to avoid one of the two previews being so much larger than the other.
...
5
votes
3
answers
111
views
Clean up repeated file.writes, if/elses when adding keys to a dict
I'm getting familiar with Python and I'm still learning its tricks and idioms.
Is there a better way to implement print_html() without the multiple calls to <...
4
votes
2
answers
10k
views
Scraping HTML using Beautiful Soup
I have written a script using Beautiful Soup to scrape some HTML, do some processing, and produce HTML back. However, I am not convinced by my code and I am looking for some improvements.
Structure of ...