Cannot find data from string using regex while string.find() works just fine

Question

import re
import urllib
p = urllib.urlopen("http://sprunge.us/QZhU")
page = p.read()
pos = page.find("<h2><span>")
print page[pos:pos+48]
c = re.compile(r'<h2><span>(.*)</span>')
print c.match(page).group(1)

When I run it:

shadyabhi@archlinux $ python2 temp.py 
<h2><span>House.S08E02.HDTV.XviD-LOL.avi</span> 
Traceback (most recent call last):
  File "temp.py", line 8, in <module>
    print c.match(page).group(1)
AttributeError: 'NoneType' object has no attribute 'group'
shadyabhi@archlinux $

If I can find the string using string.find, why does the regex fail to find it? I have looked through http://docs.python.org/howto/regex.html#regex-howto, but it did not help.

aix · Answer 1 · 2011-10-14 13:18:28Z

match only matches at the beginning of the string. Use search, finditer or findall to find the pattern anywhere in it.
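
As a rough illustration of the difference, here is a small sketch that needs no network access. The page string is a made-up stand-in (the real page content is not shown in the question), and the non-greedy pattern is used so the captured text stays short.

```python
import re

# Hypothetical stand-in for the downloaded page; the real content is not shown.
page = "<html><body>\n<h2><span>House.S08E02.HDTV.XviD-LOL.avi</span>\n</body></html>"

pattern = re.compile(r'<h2><span>(.*?)</span>')

# match() only tries position 0, where the text is "<html>", so it returns None.
print(pattern.match(page))

# search() scans forward and returns the first match anywhere in the string.
print(pattern.search(page).group(1))

# findall() returns the captured group of every occurrence as a list.
print(pattern.findall(page))

# finditer() yields one match object per occurrence, with position information.
for m in pattern.finditer(page):
    print("%d %s" % (m.start(), m.group(1)))
```

Calling .group(1) on the result of match() here would raise the same AttributeError as in the question, because the result is None.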

Also note that * is greedy. You might want to change your regex to r'<h2><span>(.*?)</span>'.
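
To see what greedy means in practice, consider a line that contains more than one closing tag. The line below is invented for the example and is not taken from the page in the question.

```python
import re

# Hypothetical input with two </span> tags on one line.
line = "<h2><span>first</span> and <span>second</span></h2>"

# Greedy: .* grabs as much as it can, so the match ends at the LAST </span>.
greedy = re.search(r'<h2><span>(.*)</span>', line).group(1)

# Non-greedy: .*? grabs as little as it can, so the match ends at the FIRST </span>.
lazy = re.search(r'<h2><span>(.*?)</span>', line).group(1)

print(greedy)  # first</span> and <span>second
print(lazy)    # first
```

Since . does not match a newline by default, the greedy version can only overreach within a single line, but the non-greedy form is still the safer choice for this kind of extraction.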

In summary, the following works for me:

import re
import urllib
p = urllib.urlopen("http://sprunge.us/QZhU")
page = p.read()
pos = page.find("<h2><span>")
print page[pos:pos+48]
c = re.compile(r'<h2><span>(.*?)</span>')
print c.search(page).group(1)
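
If the page might not contain the tag at all, search returns None just as match did, and .group(1) fails with the same AttributeError. A minimal sketch of a guard, assuming you want None back in that case (the helper name first_title and the sample strings are made up for illustration):

```python
import re

def first_title(page):
    """Return the text of the first <h2><span>...</span>, or None if absent."""
    m = re.search(r'<h2><span>(.*?)</span>', page)
    # search() returns None when nothing matches, so check before using group().
    return m.group(1) if m is not None else None

print(first_title("<p>x</p><h2><span>House.S08E02.HDTV.XviD-LOL.avi</span>"))
print(first_title("<p>no heading here</p>"))
```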

Does "beginning" mean "until the first newline"? Also, what happens when I add "?"? It matches zero or one repetition of the previous RE. What is the previous RE? Isn't the whole string an RE? – shadyabhi Oct 14 '11 at 13:23

@shadyabhi: No, it means that the first character of the string must be the first character of the match. Matches beginning at the second character and thereafter are not considered. In other words, for match to work, the HTML document must begin with <h2><span>..., not contain it somewhere in the middle. – aix Oct 14 '11 at 13:24

Also, what happens when I add "?"? It matches zero or one repetition of the previous RE. What is the previous RE? Isn't the whole string an RE? – shadyabhi Oct 14 '11 at 13:29

@shadyabhi: *? is a single operator. It matches as few characters as possible, whereas * matches as many as possible. See docs.python.org/library/re.html#regular-expression-syntax – aix Oct 14 '11 at 13:31

Thanks. Everything is clear now. :) – shadyabhi Oct 14 '11 at 13:33
