OK, please be gentle: this is my first Stack Overflow question, and I've struggled with this for a few hours. I'm sure the answer is something obvious staring me in the face, but I give up.

I'm trying to grab an element from a webpage (i.e., determine the gender of a name) on a name website.

The python code I've written is here:

import re
import urllib2

response = urllib2.urlopen("http://www.behindthename.com/name/janet")
html = response.read()
print html

patterns = ['Masculine', 'Feminine']

for pattern in patterns:
    print "Looking for %s in %s<<<" % (pattern, html)

    if re.findall(pattern, html):
        print "Found a match!"
        break  # a bare `exit` is a no-op; break actually leaves the loop
    else:
        print "No match!"

When I dump html I see Feminine there, but the re.findall isn't matching. What in the world am I doing wrong?

  • You know that with such a simple regex you can just do if pattern in html? – vaultah Commented Jul 29, 2014 at 20:27
  • Other answers notwithstanding, I don't see any reason why your code as given won't actually work. What do you get if you print re.findall(pattern, html) in the loop? Commented Jul 29, 2014 at 20:36
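To make the two comments above concrete, here is a minimal sketch that runs both checks side by side. It is written for Python 3 and uses a hard-coded string as a stand-in for the downloaded page, so the `html` value and the `find_first` helper are assumptions for illustration, not part of the original code. (In Python 3, `response.read()` returns bytes, so a real page would need to be decoded before either check.)

```python
import re

# Hypothetical stand-in for the downloaded page; the real markup may differ.
html = '<div class="nameinfo"><span class="info">Feminine</span></div>'
patterns = ["Masculine", "Feminine"]


def find_first(patterns, text):
    """Return the first pattern that occurs in text, or None if none do."""
    for pattern in patterns:
        if pattern in text:  # plain substring test, no regex needed
            return pattern
    return None


print(find_first(patterns, html))

# The regex version, printed per pattern as the second comment suggests
for pattern in patterns:
    print(pattern, re.findall(pattern, html))
```

If the substring test succeeds but `re.findall` returns an empty list on the real page, the difference is most likely in the data (encoding, or the text not being where you expect) and not in the loop itself.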

1 Answer

Do not parse HTML with regex; use a specialized tool, an HTML parser.

Example using BeautifulSoup:

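from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://www.behindthename.com/name/janet'
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Text of the first element matching the CSS selector
print soup.select('div.nameinfo span.info')[0].text  # prints "Feminine"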

Or, you can find an element by text:

gender = soup.find(text='Feminine')

Then check whether the result is None (not found): gender is None.
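If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser instead (a swapped-in technique, not what the snippets above use). This is Python 3, and it runs against a small hard-coded snippet: `SAMPLE_HTML`, `InfoSpanParser`, and `find_gender` are all hypothetical names, and the markup is only assumed from the `div.nameinfo span.info` selector, so the real page may look different.

```python
from html.parser import HTMLParser

# Hypothetical markup, modeled on the selector used above; the real page may differ.
SAMPLE_HTML = """
<div class="nameinfo">
  <span class="namesub">Gender</span>
  <span class="info">Feminine</span>
</div>
"""


class InfoSpanParser(HTMLParser):
    """Collect the text of every <span class="info"> element."""

    def __init__(self):
        super().__init__()
        self.in_info_span = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "info" in classes:
            self.in_info_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_info_span = False

    def handle_data(self, data):
        if self.in_info_span and data.strip():
            self.values.append(data.strip())


def find_gender(html):
    """Return the first info-span text, or None when nothing matches."""
    parser = InfoSpanParser()
    parser.feed(html)
    parser.close()
    return parser.values[0] if parser.values else None


gender = find_gender(SAMPLE_HTML)
if gender is None:
    print("No match!")
else:
    print(gender)
```

This sketch ignores the enclosing div and does not handle nested spans, which is exactly the kind of bookkeeping a full parser library takes care of for you.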

  • @alecxe yes, the top answer in that link was very clear and easy to understand. No confusion at all. (I'm bookmarking that for later use.) Commented Jul 29, 2014 at 20:37
