Web scraping is the use of a program to simulate human interaction with a web server or to extract specific information from a web page.
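For instance, the extraction half of that definition can be sketched with nothing but Python's standard library; the inline HTML below stands in for a page body that would normally come from `urllib.request.urlopen(url).read()`:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every anchor tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# In a real scraper the markup would be the fetched response body;
# here a small inline document stands in for it.
html = '<p><a href="/one">one</a> <a href="/two">two</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/one', '/two']
```

Many of the questions below reach for third-party parsers such as Beautiful Soup or Nokogiri instead, which cope better with malformed real-world markup.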

3 votes · 5 answers · 92 views

Finding the occurrences of all words in movie scripts

I was wondering if someone could tell me things I could improve in this code. This is one of my first Python projects. This program gets the script of a movie (in this case Interstellar) and then ...
3 votes · 0 answers · 38 views

Scraping efficiently with mechanize and bs4

I have written some code that scrapes data on asteroids, but the problem is that it is super slow! I understand that it has a lot to scrape, but as of now it has been running for 5 days and is not even a ...
0 votes · 1 answer · 46 views

Program to create list of all English Wikipedia articles

This program will scrape Wikipedia to create a list of all English Wikipedia articles. How can I improve this program, as it currently performs very poorly? On my Internet connection ...
7 votes · 3 answers · 103 views

RateBeer.com scraper

This was largely an exercise in making my code more Pythonic, especially in catching errors and doing things the right way. I opted to make the PageNotFound ...
2 votes · 1 answer · 261 views

Refactoring a Crawler

I've recently ported an old project and made it object-oriented. However, I've noticed that rubocop points out the following status: ...
1 vote · 1 answer · 133 views

Utilization of Steam APIs and web-scraping

Some background info here: this is a small fun project I made utilizing Steam APIs and web-scraping. This is the first time I've ever used Python, so I'm not very familiar with the language I used ...
5 votes · 1 answer · 89 views

Getting information of countries out of a website that isn't using consistent verbiage

From this website I needed to grab the information for each country and insert it into an Excel spreadsheet. My original plan was to use my program and search each website for the text and later ...
2 votes · 0 answers · 18 views

Compressing a blog into a preview using tumblr_api_read

Here is what I have currently working. I would like to make it look more aesthetically pleasing, so that it doesn't cut words off mid-word, and so that the two previews aren't so much larger than the other. ...
1 vote · 1 answer · 85 views

Crawl multiple pages at once

This is an update to my last question. I want to process multiple pages at once, pulling URLs from tier_list in the crawl_web ...
3 votes · 3 answers · 164 views

Implementing a POC Async Web Crawler

I've created a small proof-of-concept web crawler to learn more about asynchrony in .NET. Currently, when run, it crawls Stack Overflow with a fixed number of concurrent requests (workers). I was ...
1 vote · 2 answers · 111 views

Basic search engine

I want to improve the efficiency of this search engine. It runs in about 10 seconds at a search depth of 1, but takes 4 minutes at a depth of 2, and so on. I tried to give straightforward comments and variable names; any ...
2 votes · 2 answers · 139 views

Phone Number Extracting using RegEx And HtmlAgilityPack

I've written this code to extract cell numbers from a website. It extracts numbers perfectly, but very slowly, and it also hangs my Form while extracting. ...
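The question above uses C# with HtmlAgilityPack, but the regex half of the task is language-neutral; here is a sketch in Python, where both the pattern and the sample text are illustrative and a real site would likely need a pattern tuned to its formatting:

```python
import re

# Illustrative pattern: 10-digit numbers with optional -, . or space separators.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

text = "Call 555-123-4567 or 555.987.6543; office code 12345 is not a number."
phones = PHONE_RE.findall(text)
print(phones)  # ['555-123-4567', '555.987.6543']
```

Running the pattern over extracted text nodes rather than raw markup also avoids matching digits inside attributes or scripts.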
1 vote · 0 answers · 60 views

Using URLs and RegEx for web scraper from a dictionary [closed]

I have dozens of functions which GET/POST to some URLs and extract data using RegEx. The URLs and regular expressions were hard-coded earlier but now I moved all of them to a dictionary. I then saw ...
5 votes · 0 answers · 228 views

Clojure core.async web crawler

I'm currently a beginner with Clojure and I thought I'd try building a web crawler with core.async. What I have works, but I am looking for feedback on the following points: How can I avoid using ...
2 votes · 1 answer · 88 views

Web scraper running extremely slow

I am making my first web scraper in Python. It works great, but it runs extremely slowly. The website loads in about 10 ms, but the scraper only handles about one page every couple of seconds. There are about 4-6 million ...
2 votes · 0 answers · 85 views

Rails app that scrapes forum using Nokogiri gem

I've built a website that scrapes a guitar forum's pages and populates a Rails model. I'm using a rake task along with the Heroku scheduler to run background scrapes every hour. On the homepage, the forum ads ...
2 votes · 1 answer · 131 views

Getting rid of certain HTML tags

This code simply returns a small section of HTML code and then gets rid of all tags except for break tags. It seems inefficient because you cannot search and replace with a Beautiful Soup object as ...
10 votes · 2 answers · 779 views

Scrape an HTML table with python

I think I'm on the right track, but ANY suggestions or critiques are welcome. The program just scrapes an HTML table and prints it to stdout. ...
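A table-to-stdout scrape of this kind can be sketched with the standard library's event-driven parser; the inline markup is illustrative, standing in for a fetched page:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Flattens <tr>/<td>/<th> structure into a list of rows of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

html = ("<table><tr><th>name</th><th>views</th></tr>"
        "<tr><td>scraper</td><td>92</td></tr></table>")
scraper = TableScraper()
scraper.feed(html)
for row in scraper.rows:          # print tab-separated rows to stdout
    print("\t".join(row))
```

For ragged real-world tables (nested tags inside cells, colspans), a tree-based parser is usually less fragile than this streaming approach.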
2 votes · 0 answers · 198 views

Scraping HTML using PHP

Because a website I need data from doesn't have any API or RSS feed for their service status, I use a web scraper I built using PHP to grab the data I need and structure it as JSON. However, I want to ...
2 votes · 1 answer · 29 views

Find and select image files from webpage

For some reason, I feel like this is a bit messy and could be cleaner. Any suggestions? I'm selecting any image files ending in .png or ...
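Filtering for particular image extensions tends to come down to one expression; a sketch in Python, where the list of src attributes is hypothetical:

```python
import re

# Hypothetical src attributes pulled from a page's <img> tags.
srcs = ["logo.png", "banner.jpg", "chart.svg", "photo.PNG", "style.css"]

# Case-insensitive match on the extensions of interest, anchored to the end.
image_files = [s for s in srcs if re.search(r"\.(png|jpe?g)$", s, re.IGNORECASE)]
print(image_files)  # ['logo.png', 'banner.jpg', 'photo.PNG']
```

Anchoring with `$` and using `re.IGNORECASE` avoids both mid-path false positives and missed uppercase extensions.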
13 votes · 2 answers · 183 views

Nokogiri crawler

The following code works, but it's a mess. Being totally new to Ruby, I have had big problems trying to refactor it into something resembling clean OOP code. Could you help with this and explain what ...
2 votes · 1 answer · 38 views

Cheat Code Scraper

During breaks, I find myself playing Emerald version a lot and was tired of having to use the school's slow Wi-Fi to access the internet. I wrote a scraper to obtain cheat codes and send them to my PSP ...
5 votes · 3 answers · 79 views

Clean up repeated file.writes, if/elses when adding keys to a dict

I'm getting familiar with Python and I'm still learning its tricks and idioms. Is there a better way to implement print_html() without the multiple calls to ...
3 votes · 1 answer · 48 views

Node PSP ISO Scraper

I recently bought a PSP and wanted to know the best ISO files, so I wrote a scraper to retrieve the titles of game ISOs that received a high rating and send them to a CSV. Any recommendations as to ...
7 votes · 1 answer · 133 views

Improved minimal webcrawler - why is it so slow?

I recently made a webcrawler that I submitted here for a review: Minimal webcrawler - bad structure and error handling? With that help, I've made a much cleaner and better(?) webcrawler. The only ...
10 votes · 2 answers · 882 views

Spliterator implementation

I'm trying to post a little tutorial on the new Spliterator class. There are many tutorials these days on using streams starting from a standard Java collection, but ...
4 votes · 1 answer · 670 views

Web Crawler in Java

I've written a working web crawler in Java that finds the frequencies of words on web pages. I have two issues with it. The organization of my code in WebCrawler.java is terrible. Is there a way I ...
5 votes · 1 answer · 75 views

Reverse-engineering with Filepicker API

I have this script to pull data out of the internal Filepicker API. It's mostly reverse-engineering, and the code seems ugly to me. How can this be improved? ...
2 votes · 0 answers · 92 views

Parsing a website

Following is the code I wrote to download the information of different items in a page. I have one main website which has links to different items. I parse this main page to get the list. This is ...
1 vote · 0 answers · 71 views

scraping and saving using Arrays or Objects

I'm using Anemone to spider a website; I am then using a set of rules specific to that website to find certain parameters. I feel like it's simple enough, but any attempt I make to save the ...
11 votes · 3 answers · 840 views

Minimal webcrawler - bad structure and error handling?

I did this code over one day as a part of a job application, where they wanted me to make a minimal webcrawler in any language. The purpose was to crawl a site, find all of the URLs on that page, and ...
1 vote · 2 answers · 302 views

Number of Google search results over a period of time, saved to database

I am writing a Python script that scrapes data from Google search results and stores it in a database. I couldn't find any Google API for this, so I am just sending an HTTP GET request to Google's main ...
8 votes · 1 answer · 143 views

Optimize web-scraping of Moscow grocery website

This code works fine, but I believe it has optimization problems. Please review this. Also, please keep in mind that it stops after each iteration of the loop ...
17 votes · 2 answers · 437 views

We'll be counting stars

Lately, I've been, I've been losing sleep Dreaming about the things that we could be But baby, I've been, I've been praying hard, Said, no more counting dollars We'll be counting stars, yeah we'll be ...
2 votes · 1 answer · 71 views

Scraping thefreedictionary.com

Scrape results from thefreedictionary.com ...
4 votes · 1 answer · 978 views

A simple little Python web crawler

The crawler is in need of a mechanism that will dispatch threads based on network latency and system load. How does one keep track of network latency in Python without using system tools like ping? ...
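One stdlib-only way to approximate latency without shelling out to `ping` is to time a bare TCP handshake; this is a sketch, and the measured time includes DNS resolution when a hostname (rather than an IP) is passed:

```python
import socket
import time

def tcp_latency(host, port=80, timeout=3.0):
    """Rough round-trip estimate: time a plain TCP connect to host:port.
    Less precise than ICMP ping, but needs no system tools or privileges.
    Returns seconds elapsed, or None if the host was unreachable in time."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

A dispatcher could sample this periodically per host and scale its thread count down as the measured latency climbs.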
2 votes · 0 answers · 111 views

Prototype spider for indexing RSS feeds

This code is super slow. I'm looking for advice on how to improve its performance. ...
3 votes · 1 answer · 144 views

Crawling for emails on websites given by Google API

I'm trying to build an app which crawls a website to find the emails that it has and prints them. I also want to allow the user to type "false" into the console when they want to skip the website ...
10 votes · 1 answer · 273 views

Is this the Clojure way to web-scrape a book cover image?

Is there a way to write this better, or in a more idiomatic Clojure way? Especially the last part with with-open and the let. Should I put the ...
5 votes · 1 answer · 3k views

Getting data correctly from <span> tag with beautifulsoup and regex

I am scraping an online shop page, trying to get the price mentioned in that page. In the following block the price is mentioned: ...
7 votes · 1 answer · 305 views

AngularJs and Google Bot experiment

I have studied the problem of optimizing an Angular app for search engines, and was frustrated that the most recommended option is prerendering HTML. After some time spent, I suggested to ...
6 votes · 3 answers · 165 views

HTTP scraper not clean and straightforwardly coded?

A job application of mine has been declined because the test project I submitted was not coded in a clean and straightforward way. Fine, but that's all the feedback I got. Since I like to ...
2 votes · 2 answers · 1k views

Scraping HTML using Beautiful Soup

I have written a script using Beautiful Soup to scrape some HTML and do some stuff and produce HTML back. However, I am not convinced with my code and I am looking for some improvements. Structure of ...
1 vote · 1 answer · 600 views

Script taking too long for curl request

The script below takes the list of provided URLs and scrapes the links present in each URL, and for each scraped link Facebook ...
5 votes · 2 answers · 270 views

Spreadsheet function that gives the number of Google indexed pages

I've developed this spreadsheet in order to scrape a website's number of indexed pages through Google and Google Spreadsheets. I'm not a developer, so how can I improve this code in order to have ...
4 votes · 1 answer · 345 views

Craigslist search-across-regions script

I'm a JavaScript developer. I'm pretty sure that will be immediately apparent in the below code if for no other reason than the level/depth of chaining that I'm comfortable with. However, I'm learning ...
3 votes · 4 answers · 4k views

Download an image from a webpage

I am trying to write a Python script that downloads an image from a webpage. On the webpage (I am using NASA's picture of the day page), a new picture is posted every day, with different file names. ...
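The download step itself is straightforward with the standard library; the commented-out URL is hypothetical, since the picture-of-the-day file name changes daily and a real script would first scrape the page for the current `<img>` src:

```python
import urllib.request
from pathlib import Path

def download_image(url, dest):
    """Fetch url and write the raw response body to dest; returns bytes written."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    Path(dest).write_bytes(data)
    return len(data)

# Hypothetical usage once the current image URL has been scraped from the page:
#   download_image("https://apod.nasa.gov/apod/image/<today>.jpg", "apod.jpg")
```

Writing the raw bytes rather than decoded text matters here, since image data is binary.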
1 vote · 1 answer · 222 views

URL and source page scraper

The code does seem a bit repetitive in places such as the parenturlscraper module and the childurlscraper module. Does anyone ...
4 votes · 1 answer · 224 views

Web scraper for job listings

Is there any room for improvement in this code? I use mechanize to get the links from a job-listing web site. There are pages with pagination (when jobs > 25) and pages without. If there is, then the ...
2 votes · 2 answers · 2k views

Download image links posted to reddit.com

This is a Python script to save imgur pictures posted to reddit.com forums. I'm looking for an assessment on the design of this script and any web security issues that might exist. Obvious ...