Web scraping is the use of a program to simulate human interaction with a web server or to extract specific information from a web page.
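Both halves of that definition can be illustrated with a minimal, dependency-free sketch. The HTML below is a made-up stand-in for a fetched page (in practice you would download it first, e.g. with `urllib.request`); the extraction step uses only the standard library's `html.parser`:

```python
from html.parser import HTMLParser

# Stand-in for a downloaded page; the URLs and markup are hypothetical.
SAMPLE_HTML = """
<html><body>
  <a href="/page1">First</a>
  <a href="/page2">Second</a>
  <p>Not a link</p>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
print(parser.links)  # ['/page1', '/page2']
```

Most of the questions below are variations on this same loop: fetch a page, walk its markup, keep the pieces you care about.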
2 votes | 0 answers | 6 views
Compressing a blog into a preview using tumblr_api_read
Here is what I currently have working. I would like to make it look more aesthetically pleasing, so that it does not cut words off mid-word, and so that two of the previews are not so much larger than the others.
...
1 vote | 1 answer | 41 views
Crawl multiple pages at once
This is an update to my last question.
I want to process multiple pages at once, pulling URLs from tier_list in the crawl_web ...
3 votes | 3 answers | 99 views
Implementing a POC Async Web Crawler
I've created a small proof-of-concept web crawler to learn more about asynchrony in .NET.
Currently, when run, it crawls Stack Overflow with a fixed number of concurrent requests (workers).
I was ...
1 vote | 2 answers | 91 views
Basic search engine
I want to improve the efficiency of this search engine. It runs in about 10 seconds for a search depth of 1, but takes 4 minutes at a depth of 2, and so on.
I tried to give straightforward comments and variable names, any ...
0 votes | 1 answer | 42 views
Improving Watir::Browser for my needs
I want to:
use Watir::Browser methods without the browser. instance prefix
expand abilities of ...
2 votes | 2 answers | 76 views
Phone Number Extracting using RegEx And HtmlAgilityPack
I've written this code to extract cell numbers from a website. It extracts numbers perfectly but very slowly, and it also hangs my Form while extracting.
...
1 vote | 0 answers | 34 views
Using URLs and RegEx for web scraper from a dictionary [closed]
I have dozens of functions which GET/POST to some URLs and extract data using RegEx. The URLs and regular expressions were hard-coded earlier but now I moved all of them to a dictionary. I then saw ...
5 votes | 0 answers | 106 views
Clojure core.async web crawler
I'm currently a beginner with Clojure, and I thought I'd try building a web crawler with core.async.
What I have works, but I am looking for feedback on the following points:
How can I avoid using ...
2 votes | 1 answer | 56 views
Web scraper running extremely slow
I am making my first web scraper in Python. It works well but runs extremely slowly. The website loads in about 10 ms, but the scraper only processes about one page every couple of seconds. There are about 4-6 million ...
2 votes | 0 answers | 55 views
Rails app that scrapes forum using Nokogiri gem
I've built a website that scrapes a guitar forum's pages and populates a Rails model. I'm using a rake task along with the Heroku scheduler to run background scrapes every hour.
On the homepage, the forum ads ...
2 votes | 1 answer | 90 views
Getting rid of certain HTML tags
This code simply returns a small section of HTML and then strips all tags except for break tags.
It seems inefficient because you cannot search and replace with a Beautiful Soup object as ...
10 votes | 2 answers | 406 views
Scrape an HTML table with python
I think I'm on the right track, but any suggestions or critiques are welcome. The program just scrapes an HTML table and prints it to stdout.
...
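The table-to-stdout pipeline that question describes can be sketched with the standard library alone. The table below is hypothetical input standing in for the scraped page (the original presumably fetches its table from a live site), and the parser accumulates cells row by row:

```python
from html.parser import HTMLParser

# Hypothetical input; a real scraper would fetch this HTML from a URL.
TABLE_HTML = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>alice</td><td>10</td></tr>
  <tr><td>bob</td><td>7</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Accumulates <td>/<th> cell text into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell belongs to the row; skip inter-tag whitespace.
        if self._in_cell:
            self._row.append(data.strip())

parser = TableParser()
parser.feed(TABLE_HTML)
for row in parser.rows:
    print("\t".join(row))
```

Printing tab-separated rows keeps the stdout output trivially greppable and pasteable into a spreadsheet.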
2 votes | 0 answers | 118 views
Scraping HTML using PHP
Because a website I need data from doesn't have an API or RSS feed for their service status, I use a web scraper I built in PHP to grab the data I need and structure it as JSON. However, I want to ...
2 votes | 1 answer | 25 views
Find and select image files from webpage
For some reason, I feel like this is a bit messy and could be cleaner. Any suggestions?
I'm selecting any image files ending in .png or ...
13 votes | 2 answers | 162 views
Nokogiri crawler
The following code works but is a mess. Being totally new to Ruby, I have had big problems trying to refactor it into something resembling clean OOP code. Could you help with this and explain what ...
2 votes | 1 answer | 37 views
Cheat Code Scraper
During breaks, I find myself playing Emerald version a lot, and I was tired of having to use the school's slow wifi to access the internet. I wrote a scraper to obtain cheat codes and send them to my PSP ...
5 votes | 3 answers | 78 views
Clean up repeated file.writes, if/elses when adding keys to a dict
I'm getting familiar with Python and I'm still learning its tricks and idioms.
Is there a better way to implement print_html() without the multiple calls to ...
3 votes | 1 answer | 46 views
Node PSP ISO Scraper
I recently bought a PSP and wanted to know the best ISO files, so I wrote a scraper to retrieve the titles of game ISOs that received a high rating and write them to a CSV. Any recommendations as to ...
7 votes | 1 answer | 121 views
Improved minimal webcrawler - why is it so slow?
I recently made a webcrawler that I submitted here for a review:
Minimal webcrawler - bad structure and error handling?
With that help, I've made a much cleaner and better(?) webcrawler.
The only ...
10 votes | 2 answers | 577 views
Spliterator implementation
I'm trying to post a little tutorial on the new Spliterator class. There are many tutorials these days on using streams starting from a standard Java collection, but ...
4 votes | 1 answer | 388 views
Web Crawler in Java
I've written a working web crawler in Java that finds the frequencies of words on web pages. I have two issues with it.
The organization of my code in WebCrawler.java is terrible. Is there a way I ...
5 votes | 1 answer | 68 views
Reverse-engineering with Filepicker API
I have this script to pull data out of the internal Filepicker API. It's mostly reverse-engineering, and the code seems ugly to me. How can it be improved?
...
2 votes | 0 answers | 86 views
Parsing a website
Following is the code I wrote to download the information of different items in a page.
I have one main website which has links to different items. I parse this main page to get the list. This is ...
1 vote | 0 answers | 64 views
scraping and saving using Arrays or Objects
I'm using Anemone to spider a website; I am then using a set of rules specific to that website to find certain parameters.
I feel like it's simple enough, but any attempt I make to save the ...
11 votes | 3 answers | 803 views
Minimal webcrawler - bad structure and error handling?
I did this code over one day as a part of a job application, where they wanted me to make a minimal webcrawler in any language. The purpose was to crawl a site, find all of the URLs on that page, and ...
1 vote | 2 answers | 200 views
Number of Google search results over a period of time, saved to database
I am writing a Python script that scrapes data from Google search results and stores it in a database. I couldn't find any Google API for this, so I am just sending an HTTP GET request to Google's main ...
7 votes | 1 answer | 117 views
Optimize web-scraping of Moscow grocery website
This code works fine, but I believe it has optimization problems. Please review this.
Also, please keep in mind that it stops after each iteration of the loop ...
16 votes | 2 answers | 232 views
We'll be counting stars
Lately, I've been, I've been losing sleep
Dreaming about the things that we could be
But baby, I've been, I've been praying hard,
Said, no more counting dollars
We'll be counting stars, yeah we'll be ...
4 votes | 1 answer | 759 views
A simple little Python web crawler
The crawler is in need of a mechanism that will dispatch threads based on network latency and system load. How does one keep track of network latency
in Python without using system tools like ping?
...
2 votes | 0 answers | 98 views
Prototype spider for indexing RSS feeds
This code is super slow. I'm looking for advice on how to improve its performance.
...
3 votes | 1 answer | 122 views
Crawling for emails on websites given by Google API
I'm trying to build an app which crawls a website to find the emails that it has and prints them. I also want to allow the user to type "false" into the console when they want to skip the website ...
10 votes | 1 answer | 193 views
Is this the Clojure way to web-scrape a book cover image?
Is there a way to write this better, or in a more Clojure-like way? Especially the last part with with-open and the let. Should I put the ...
5 votes | 1 answer | 2k views
Getting data correctly from <span> tag with beautifulsoup and regex
I am scraping an online shop page, trying to get the price mentioned in that page. In the following block the price is mentioned:
...
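The question's actual HTML block is elided above, but the BeautifulSoup-plus-regex pattern it describes is common enough to sketch. The markup, class name, and currency format below are made-up stand-ins; the sketch uses plain `re` for both steps to stay dependency-free, where the original narrows down to the tag with BeautifulSoup first:

```python
import re

# Hypothetical markup standing in for the elided block; the class name
# "price" and the currency format are assumptions.
snippet = '<span class="price">$1,299.00</span>'

# Step 1: isolate the tag's text (the original would use BeautifulSoup here).
match = re.search(r'<span class="price">([^<]+)</span>', snippet)
price_text = match.group(1)

# Step 2: strip currency formatting and convert to a number.
price = float(price_text.replace("$", "").replace(",", ""))
print(price)  # 1299.0
```

Doing the tag isolation with a real HTML parser and saving the regex for the final text-to-number step is generally more robust than running a regex over the whole page.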
7 votes | 1 answer | 259 views
AngularJs and Google Bot experiment
I have looked into the problem of optimizing an Angular app for search engines, and was frustrated that the most commonly recommended option is prerendering HTML.
After some time spent, I suggested to ...
6 votes | 3 answers | 159 views
HTTP scraper not clean and straightforwardly coded?
A job application of mine has been declined because the test project I submitted was not coded in a clean and straightforward way.
Fine, but that's all the feedback I got. Since I like to ...
2 votes | 2 answers | 1k views
Scraping HTML using Beautiful Soup
I have written a script using Beautiful Soup to scrape some HTML, do some processing, and produce HTML back. However, I am not happy with my code, and I am looking for some improvements.
Structure of ...
1 vote | 1 answer | 536 views
Script taking too long for curl request
The below script takes the list of provided URLs and scrapes the present links in each URL and for each scraped link Facebook ...
5 votes | 2 answers | 242 views
Spreadsheet function that gives the number of Google indexed pages
I've developed this spreadsheet in order to scrape a website's number of indexed pages through Google and Google Spreadsheets.
I'm not a developer, so how can I improve this code in order to have ...
4 votes | 1 answer | 339 views
Craigslist search-across-regions script
I'm a JavaScript developer. I'm pretty sure that will be immediately apparent in the below code if for no other reason than the level/depth of chaining that I'm comfortable with. However, I'm learning ...
3 votes | 4 answers | 3k views
Download an image from a webpage
I am trying to write a Python script that downloads an image from a webpage. On the webpage (I am using NASA's picture of the day page), a new picture is posted every day, with a different file name. ...
1 vote | 1 answer | 209 views
URL and source page scraper
The code does seem a bit repetitive in places such as the parenturlscraper module and the childurlscraper module.
Does anyone ...
4 votes | 1 answer | 219 views
Web scraper for job listings
Is there any room for improvement on this code?
I use mechanize to get the links of a job listing web site. There are pages with pagination (when jobs > 25) and pages without.
If there is, then the ...
2 votes | 2 answers | 2k views
Download image links posted to reddit.com
This is a Python script to save imgur pictures posted to reddit.com forums. I'm looking for an assessment on the design of this script and any web security issues that might exist.
Obvious ...
2 votes | 2 answers | 3k views
Beautifulsoup scraper for sport events
I've written a simple scraper that parses HTML using BeautifulSoup and collects the data (a schedule of sports events), then groups it together in a list of dicts.
The code works just fine, but the ...
2 votes | 2 answers | 348 views
CR Stack Exchange crawler
I am writing a program which automatically crawls code from this site.
Would you please review my code?
The required .jars: jsoup, org.apache.commons.io.
Main.java:
...
2 votes | 1 answer | 270 views
HTML downloader and parser for CR
This program downloads a Code Review HTML file and parses it.
Could you review my program?
Main.java
...
5 votes | 2 answers | 1k views
Slow web-scraping geolocator
How do I make my Python program faster?
I have 3 suspects right now for it being so slow:
Maybe my computer is just slow
Maybe my Internet is too slow (sometimes my program has to download the html ...
2 votes | 2 answers | 161 views
HNews “ask section” page scraping Python script
Here is a small script I wrote to fetch the HNews ask section and display the posts without using a web browser. I'm just looking for feedback on how to improve my style, coding logic, and overall code.
...
13 votes | 1 answer | 978 views
Simple xkcd comic downloader
I'd really appreciate some harsh/constructive criticism of what I would consider as my first program in Haskell. The program should download all of the xkcd comics into a folder in the current ...