Web scraping is the use of a program to simulate human interaction with a web server or to extract specific information from a web page.
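In its simplest form, a scraper fetches a page and parses the markup for the pieces it wants. A minimal sketch using only Python's standard library (the sample HTML and the link-extraction goal are illustrative assumptions, not taken from any question below — real scrapers usually reach for libraries like Beautiful Soup):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# In a real scraper the HTML would come from urllib.request.urlopen(url).read();
# a static snippet keeps the sketch self-contained and offline.
html = '<p>See <a href="/docs">docs</a> and <a href="/faq">faq</a>.</p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/docs', '/faq']
```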
5 votes, 0 answers, 57 views
Python program that scrapes my CS teacher's website
I am new to programming, and I'm looking forward to seeing what I can do to improve my code.
I've been working on creating an individual final project for my Python CS class that checks my teacher's ...
4 votes, 0 answers, 29 views
Crawling and parsing meteorological data from the web into R
I am interested in collecting, directly into R, data published by the Mexican Met Office. The data are spread across several URLs, but one can start here. There I can get the names and ...
4 votes, 2 answers, 288 views
Amazon web scraper
I am trying to improve my programming and design skills (poor at the moment), so I created a small, working Amazon scraper. I would be very grateful if you could ...
2 votes, 2 answers, 64 views
Web-scraper for a larger program
I have a web scraper that I use as part of a larger program. However, I feel like my code repeats itself a lot and takes up a lot of room. Is there any way I can condense this code?
...
2 votes, 1 answer, 32 views
Scraping through product pages
I'm working through a scraping function where pages of results lead to product pages. I've added a default maximum number of results pages, and pages per set of results, to prevent a simple mistake ...
4 votes, 2 answers, 90 views
Press any login button on any site
I'm working on a script that will be able to press the login button on any site for an app I'm working on. I have it working (still a few edge cases to work out such as multiple submit buttons and ...
3 votes, 0 answers, 76 views
Pure Python script that saves an HTML page with all images
Here is a pure Python script that saves an HTML page, without CSS but with all the images on it, and replaces each href with the image's path on the hard drive.
I know that there are great libraries like ...
4 votes, 3 answers, 52 views
Searching for a string in a downloaded PDF
This code goes to the website containing the PDF, downloads the PDF, and converts it to text. Finally, it reads this whole file (over 5,000 lines) into a list, line by line, and searches for ...
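The excerpt above describes converting a PDF to text and then searching it line by line. A minimal sketch of the search step (the download and the PDF-to-text conversion, which need external tools such as pdftotext, are assumed already done — the sample text here is illustrative):

```python
def find_matches(lines, needle):
    """Yield (line_number, line) pairs for lines containing needle.

    Streaming over the lines with a generator avoids holding the
    whole converted PDF text (5,000+ lines) in memory as one list.
    """
    for lineno, line in enumerate(lines, start=1):
        if needle in line:
            yield lineno, line.rstrip("\n")

# Plain text stands in for the output of the PDF-to-text conversion.
text = "alpha\nbeta needle\ngamma\nneedle delta\n"
matches = list(find_matches(text.splitlines(True), "needle"))
print(matches)  # [(2, 'beta needle'), (4, 'needle delta')]
```

With a real file, the same generator can consume `open(path)` directly, so the file is never fully read into a list.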
4 votes, 3 answers, 53 views
Displaying sorted results of a web crawl
The issue I have with this class is that most of the methods are almost the same. I would like this code to be more Pythonic.
Note: I plan on replacing all the ...
4 votes, 2 answers, 259 views
Trivago hotels price checker
I've decided to write my first project in Python, and I would like to hear some opinions from you.
Description of the script:
Generate Trivago URLs for 5 star hotels in specified city.
Scrape these URLs ...
5 votes, 1 answer, 45 views
Print the list of winter bash 2014 hats as a list of checkboxes in GFM format
In Winter Bash 2014,
since there is no easy way to see the hats I'm missing per site,
I decided to use Gists for that.
A perhaps not so well-known feature of the GitHub Flavored Markdown (GFM) format ...
4 votes, 2 answers, 92 views
Retrieving stock prices
It takes around 5-8 seconds for me to retrieve a previously-closed stock price and a dividend rate from US Yahoo! Finance. If I wanted to retrieve 10+ stock prices, it would take me more than a minute ...
4 votes, 5 answers, 122 views
Finding the occurrences of all words in movie scripts
I was wondering if someone could tell me things I could improve in this code. This is one of my first Python projects. This program gets the script of a movie (in this case Interstellar) and then ...
3 votes, 0 answers, 101 views
Scraping efficiently with mechanize and bs4
I have written some code that scrapes data on asteroids, but the problem is that it is super slow! I understand that it has a lot to scrape, but as of now it has been running for 5 days and is not even a ...
0 votes, 1 answer, 60 views
Program to create list of all English Wikipedia articles
This program will scrape Wikipedia to create a list of all English Wikipedia articles.
How can I improve this program? It currently performs very badly; on my Internet connection ...
7 votes, 3 answers, 130 views
RateBeer.com scraper
This was largely an exercise in making my code more Pythonic, especially in catching errors and doing things the right way.
I opted to make the PageNotFound ...
3 votes, 1 answer, 1k views
Refactoring a Crawler
I've recently ported an old project and made it object-oriented. However, I've noticed that RuboCop points out the following status: ...
1 vote, 1 answer, 263 views
Utilization of Steam APIs and web-scraping
Some background info here:
This is a small fun project I made utilizing Steam APIs and web-scraping
This is the first time I've ever used Python, so I'm not very familiar with the language
I used ...
5 votes, 1 answer, 91 views
Getting information of countries out of a website that isn't using consistent verbiage
From this website I needed to grab the information for each country and insert it into an Excel spreadsheet.
My original plan was to use my program and search each website for the text and later ...
2 votes, 0 answers, 28 views
Compressing a blog into a preview using tumblr_api_read
Here is what I have currently working. I would like to make it look more aesthetically pleasing: not cutting words off in mid-word, and not having the two previews be so much larger than the other.
...
1 vote, 1 answer, 209 views
Crawl multiple pages at once
This an update to my last question.
I want to process multiple pages at once pulling URLs from tier_list in the crawl_web ...
3 votes, 3 answers, 332 views
Implementing a POC Async Web Crawler
I've created a small proof of concept web crawler to learn more about asynchrony in .NET.
Currently, when run, it crawls Stack Overflow with a fixed number of concurrent requests (workers).
I was ...
1 vote, 2 answers, 137 views
Basic search engine
I want to improve the efficiency of this search engine. It works in about 10 seconds for a search depth of 1, but takes 4 minutes at depth 2, etc.
I tried to give straightforward comments and variable names, any ...
2 votes, 2 answers, 210 views
Phone Number Extracting using RegEx And HtmlAgilityPack
I've written this code to extract cell numbers from a website. It extracts the numbers perfectly, but very slowly, and it also hangs my form while extracting.
...
6 votes, 1 answer, 366 views
Clojure core.async web crawler
I'm currently a beginner with Clojure, and I thought I'd try building a web crawler with core.async.
What I have works, but I am looking for feedback on the following points:
How can I avoid using ...
2 votes, 1 answer, 144 views
Web scraper running extremely slow
I am making my first web scraper in Python. It works great, but it runs extremely slowly. The website loads in about 10 ms, but the scraper only processes about one page every couple of seconds. There are about 4-6 million ...
2 votes, 0 answers, 133 views
Rails app that scrapes forum using Nokogiri gem
I've built a website that scrapes a guitar forum's pages and populates a Rails model. I'm using a rake task along with the Heroku scheduler to run background scrapes every hour.
On the homepage, the forum ads ...
2 votes, 1 answer, 299 views
Getting rid of certain HTML tags
This code simply returns a small section of HTML code and then gets rid of all tags except for break tags.
It seems inefficient because you cannot search and replace with a Beautiful Soup object as ...
11 votes, 2 answers, 2k views
Scrape an HTML table with python
I think I'm on the right track, but any suggestions or critiques are welcome. The program just scrapes an HTML table and prints it to stdout.
...
2 votes, 0 answers, 299 views
Scraping HTML using PHP
Because a website I need data from doesn't have an API or RSS feed for their service status, I use a web scraper I built in PHP to grab the data I need and structure it as JSON. However, I want to ...
5 votes, 1 answer, 2k views
Instagram bot script
I'm very new to Python and would like some feedback on my script. I'm fairly clueless about best practices, code correctness, etc., so if there's anything at all that looks wrong, isn't 'Pythonic', or could ...
2 votes, 1 answer, 34 views
Find and select image files from webpage
For some reason, I feel like this is a bit messy and could be cleaner. Any suggestions?
I'm selecting any image files ending in .png or ...
13 votes, 2 answers, 238 views
Nokogiri crawler
The following code works, but it is a mess. Being totally new to Ruby, I have had big problems trying to refactor it into something resembling clean OOP code. Could you help with this and explain what ...
2 votes, 1 answer, 38 views
Cheat Code Scraper
During breaks, I find myself playing Emerald version a lot, and I was tired of having to use the school's slow Wi-Fi to access the internet. I wrote a scraper to obtain cheat codes and send them to my PSP ...
5 votes, 3 answers, 81 views
Clean up repeated file.writes, if/elses when adding keys to a dict
I'm getting familiar with Python, and I'm still learning its tricks and idioms.
Is there a better way to implement print_html() without the multiple calls to ...
3 votes, 1 answer, 55 views
Node PSP ISO Scraper
I recently bought a PSP and wanted to know the best ISO files, so I wrote a scraper to retrieve the titles of game ISOs that received a high rating and send them to a CSV. Any recommendations as to ...
7 votes, 1 answer, 152 views
Improved minimal webcrawler - why is it so slow?
I recently made a webcrawler that I submitted here for a review:
Minimal webcrawler - bad structure and error handling?
With that help, I've made a much cleaner and better(?) webcrawler.
The only ...
10 votes, 2 answers, 1k views
Spliterator implementation
I'm trying to post a little tutorial on the new Spliterator class. There are many tutorials these days on using streams starting from a standard Java collection, but ...
4 votes, 1 answer, 1k views
Web Crawler in Java
I've written a working web crawler in Java that finds the frequencies of words on web pages. I have two issues with it.
The organization of my code in WebCrawler.java is terrible. Is there a way I ...
5 votes, 1 answer, 87 views
Reverse-engineering with Filepicker API
I have this script to pull data out of the internal Filepicker API. It's mostly reverse-engineering, and the code seems ugly to me. How can this be improved?
...
2 votes, 0 answers, 96 views
Parsing a website
Following is the code I wrote to download the information for different items on a page.
I have one main website which has links to different items. I parse this main page to get the list. This is ...
1 vote, 0 answers, 86 views
scraping and saving using Arrays or Objects
I'm using Anemone to spider a website, and I am then using a set of rules specific to that website to find certain parameters.
I feel like it's simple enough, but any attempt I make to save the ...
11 votes, 3 answers, 880 views
Minimal webcrawler - bad structure and error handling?
I did this code over one day as a part of a job application, where they wanted me to make a minimal webcrawler in any language. The purpose was to crawl a site, find all of the URLs on that page, and ...
1 vote, 2 answers, 462 views
Number of Google search results over a period of time, saved to database
I am writing a Python script that scrapes data from Google search results and stores it in a database. I couldn't find any Google API for this, so I am just sending an HTTP GET request to Google's main ...
8 votes, 1 answer, 165 views
Optimize web-scraping of Moscow grocery website
This code works fine, but I believe it has optimization problems. Please review this.
Also, please keep in mind that it stops after each iteration of the loop ...
26 votes, 2 answers, 536 views
We'll be counting stars
Lately, I've been, I've been losing sleep
Dreaming about the things that we could be
But baby, I've been, I've been praying hard,
Said, no more counting dollars
We'll be counting stars, yeah we'll be ...
4 votes, 1 answer, 1k views
A simple little Python web crawler
The crawler is in need of a mechanism that will dispatch threads based on network latency and system load. How does one keep track of network latency in Python without using system tools like ping?
...
2 votes, 0 answers, 127 views
Prototype spider for indexing RSS feeds
This code is super slow. I'm looking for advice on how to improve its performance.
...
3 votes, 1 answer, 162 views
Crawling for emails on websites given by Google API
I'm trying to build an app which crawls a website to find the emails that it has and prints them. I also want to allow the user to type "false" into the console when they want to skip the website ...