6
votes
2 answers
230 views

Why does this regex take so long to find email addresses in certain files?

I have a regular expression that looks for email addresses (this was taken from another SO post that I can't find and has been tested on all kinds of email configurations ... changing this is not ...
4
votes
1 answer
524 views

How to perform web scraping to find specific linked pages in Java on Google App Engine?

I need to retrieve text from a remote web site that does not provide an RSS feed. What I know is that the data I need is always on pages linked to from the main page (http://www.example.com/) with a ...
3
votes
2 answers
203 views

Python find file download link on webpage

I need a regex that will return to me the text contained between double quotes that starts with a specified text block, and ends with a specific file extension (say .txt). I'm using urllib2 to get ...
3
votes
2 answers
74 views

How can I extract sentences with years in them with a regex?

I'm parsing Wikipedia articles. I want to extract every sentence with a year in it. The year can be anything from 1000 - 2012. Below is the regex I've been trying, but I can't quite get it right. ...
3
votes
3 answers
93 views

PHP Filtering an array for 1 url

I made a script that creates an array of urls scraped from a page and I want to filter the array for just 1 certain url. The array currently looks like this: Array ( [0] => index.jsp [1] ...
3
votes
3 answers
279 views

Regex pattern with subpattern exceptions (Python)

I am using BeautifulSoup to extract tabledata tags from a table. The TD's have a class of either 'a','u','e','available-unavailable' or 'unavailable-available'. (Yes, I know quirky class names but ...
2
votes
2 answers
258 views

Android/Java: Html scraping, regex album art from Spotify

I'm working on a project that requires me to scrape an image link to album art from open.spotify. Example: http://open.spotify.com/track/296mPMQavmf1vvxYrUvLN8 In this example I'm looking for this ...
1
vote
3 answers
370 views

PHP web scraping

I'm using PHP for web scraping, and I want to get the price (3.65) for Sunday from the HTML code below: <tr class="odd"> <td > <b>Sunday</b> Info ...
1
vote
4 answers
181 views

PHP regex to return <option> values

Just wondering if you can help me out a bit with a little task I'm trying to do in PHP. I have text that looks something like this in a file: (random html) ... <OPTION VALUE="195" ...
1
vote
1 answer
219 views

Cannot find data from string using regex while string.find() works just fine

import re import urllib p = urllib.urlopen("http://sprunge.us/QZhU") page = p.read() pos = page.find("<h2><span>") print page[pos:pos+48] c = ...
1
vote
3 answers
268 views

Web Scraping of Person Descriptions

I've attempted to build a program to scrape the web for company management teams. It's very accurate at obtaining many things, including: -names -job titles -images -emails -Qualifications (MD, ...
1
vote
1 answer
75 views

Inquiry: Why is my regex code not reading all characters?

I have the following description that I want to scrape using my program. <hr>Provides AFROTC cadets up to 13 options for practical leadership and specialized training through exposure to USAF ...
1
vote
2 answers
1k views

Need to ignore case in preg_match_all usage

I'm trying to scrape HTML and grab items between <tr> tags. Some of the tags are coming through as uppercase for some reason (<TR>) and are being ignored by my pattern. How can I tell my ...
1
vote
1 answer
964 views

How do I get rid of characters like &#x27; that appear instead of apostrophes? [duplicate]

Possible Duplicate: Convert XML/HTML Entities into Unicode String in Python I am attempting to scrape a website using Python. I import and use the urllib2, BeautifulSoup and re modules. ...
1
vote
3 answers
2k views

How can I parse specific info from html source code using Java

I know there are lots of topics on my question, but I couldn't find a helpful solution. I can connect to the website and read it line by line in Java; now here is my problem. I want to parse a ...
