How to extract the following HTML snippet with Python

Question

I have the following Python code:

def getAddress(text):
    text = re.sub('\t', '', text)
    text = re.sub('\n', '', text)
    blocks = re.findall('<div class="result-box" itemscope itemtype="http://schema.org/LocalBusiness">([a-zA-Z0-9 ",;:\.#&_=()\'<>/\\\t\n\-]*)</span>Follow company</span>', text)
    name = ''
    strasse = ''
    locality = ''
    plz = ''
    region = ''
    i = 0

    for block in blocks:
        names = re.findall('class="url">(.*)</a>', block)
        strassen = re.findall('<span itemprop="streetAddress">([a-zA-Z0-9 ,;:\.&#]*)</span>', block)
        localities = re.findall('<span itemprop="addressLocality">([a-zA-Z0-9 ,;:&]*)</span>', block)
        plzs = re.findall('<span itemprop="postalCode">([0-9]*)</span>', block)
        regions = re.findall('<span itemprop="addressRegion">([a-zA-Z]*)</span>', block)

        try:
            for name in names:
                name = str(name)
                name = re.sub('<[^<]+?>', '', name)
                break

            for strasse in strassen:
                strasse = str(strasse)
                strasse = re.sub('<[^<]+?>', '', strasse)
                break

            for locality in localities:
                locality = str(locality)
                locality = re.sub('<[^<]+?>', '', locality)
                break

            for plz in plzs:
                plz = str(plz)
                plz = re.sub('<[^<]+?>', '', plz)
                break

            for region in regions:
                region = str(region)
                region = re.sub('<[^<]+?>', '', region)
                break
        except:
            continue
        print i
        i = i + 1

        if plz == '':
            plz = getZipCode(strasse, locality, region)
        address = '"' + name + '"' + ';' + '"' + strasse + '";' + locality + ';' + str(plz) + ';' + region + '\n'

        #saveToCSV(address)

I want to extract the data from the HTML snippet below. The snippet is repeated several times on the page, and I want the function to return one entry per snippet. Instead, it returns a single entry that spans both snippets. What do I have to change?

<div class="result-box" itemscope itemtype="http://schema.org/LocalBusiness">
        <div class="clear">
            <h2 itemprop="name"><a href="http://www.manta.com/c/mxlk5yt/belgium-jewelers-corp" class="url">Belgium Jewelers Corp</a></h2>           </div>
        <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">               <span itemprop="addressLocality">Lawrenceville</span> <span itemprop="addressRegion">NJ</span>
        </div>          <a href="#" class="followCompany" data-emid="mxlk5yt" data-companyname="Belgium Jewelers Corp" data-location="ListingFollowButton" data-location-page="Megabrowse">
            <span class="followMsg"><span class="followIcon mrs"></span>Follow company</span>
            <span class="followingMsg"><span class="followIcon mrs"></span>Following</span>
            <span class="unfollowMsg"><span class="followIcon mrs"></span>Unfollow company</span>
        </a>            <p class="type">Jewelry Stores</p>      </div>
    </li>       <li>        <div class="icons">
        <ul>            </ul>
    </div>

Why not use an HTML parser to extract that information instead? Regular expressions are not the tool to use here. - Martijn Pieters, May 21 '13 at 8:29

Martijn Pieters · Accepted Answer · 2013-05-21 08:37:53Z

Please put down that hammer; HTML is not a regular-expression shaped nail. Regular expressions that parse HTML get complicated fast and are very fragile; they break easily when the HTML changes even subtly.
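
To illustrate that fragility: the likely reason the pattern in the question returns one entry covering both snippets is that `*` is greedy. The character class in the question also matches `<`, `>` and `/`, so the match runs on to the last `Follow company</span>` on the page. Below is a minimal sketch of the effect; the `html` string is a made-up, shortened stand-in and not the real page.

```python
import re

# Hypothetical, shortened markup: two blocks, each closed by the same marker.
html = '<div class="box">A</span>Follow</span><div class="box">B</span>Follow</span>'

# Greedy: .* runs to the LAST end marker, so both blocks land in one match.
greedy = re.findall(r'<div class="box">(.*)</span>Follow</span>', html)

# Non-greedy: .*? stops at the FIRST end marker, giving one match per block.
lazy = re.findall(r'<div class="box">(.*?)</span>Follow</span>', html)

print(len(greedy))
print(lazy)
```

Switching to `*?` would address the immediate symptom, but the pattern would still break as soon as the markup changes, which is why a parser is the better tool.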

Use a proper HTML parser instead. BeautifulSoup would make your task trivial:

from bs4 import BeautifulSoup

# Name the parser explicitly; html.parser ships with Python.
soup = BeautifulSoup(text, 'html.parser')
for block in soup.find_all('div', class_="result-box", itemtype="http://schema.org/LocalBusiness"):
    # Each block is one result box, so every lookup below stays inside it.
    print(block.find('a', class_='url').string)

    street = block.find('span', itemprop="streetAddress")
    if street:
        print(street.string)

    locality = block.find('span', itemprop="addressLocality")
    if locality:
        print(locality.string)

    # .. etc. ..
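
To get one record per result box, as the question asks, the same loop can collect every field into a dictionary. This is only a sketch under assumptions: it targets Python 3 with Beautiful Soup 4 installed and uses the built-in `html.parser`. The helper names `text_of` and `extract_addresses` are made up for illustration. The first company in the sample is trimmed from the question; the second is invented so that there are two blocks to separate.

```python
from bs4 import BeautifulSoup

# First block trimmed from the question; second block is invented test data.
SAMPLE = """
<ul>
<li><div class="result-box" itemscope itemtype="http://schema.org/LocalBusiness">
  <h2 itemprop="name"><a href="#" class="url">Belgium Jewelers Corp</a></h2>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="addressLocality">Lawrenceville</span>
    <span itemprop="addressRegion">NJ</span>
  </div>
</div></li>
<li><div class="result-box" itemscope itemtype="http://schema.org/LocalBusiness">
  <h2 itemprop="name"><a href="#" class="url">Example Co</a></h2>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">1 Main St</span>
    <span itemprop="addressLocality">Trenton</span>
    <span itemprop="postalCode">08601</span>
    <span itemprop="addressRegion">NJ</span>
  </div>
</div></li>
</ul>
"""

FIELDS = ("streetAddress", "addressLocality", "postalCode", "addressRegion")


def text_of(tag):
    """Return the stripped text of a tag, or '' when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else ""


def extract_addresses(html):
    """Return one dict per result box (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for block in soup.find_all("div", class_="result-box"):
        # Searching from `block` keeps each lookup inside its own result box.
        record = {"name": text_of(block.find("a", class_="url"))}
        for field in FIELDS:
            record[field] = text_of(block.find("span", itemprop=field))
        records.append(record)
    return records


records = extract_addresses(SAMPLE)
for record in records:
    print(record)
```

Missing fields come back as empty strings, so a fallback such as the question's `getZipCode` could still be applied when `postalCode` is empty.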

Many thanks for all your help. I now use Beautiful Soup 4. It made the task much easier and reduced my code a lot. - infoBB, May 21 '13 at 10:37

Suhosin Pony · Answer 2 · 2013-05-21 08:32:22Z

You should look at Python's HTMLParser module and its documentation. Regular expressions are notoriously bad for parsing HTML.
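
As a rough sketch of what that could look like with only the standard library (in Python 3 the class lives in `html.parser`; in Python 2 it was the `HTMLParser` module): the subclass below starts a new record at each `result-box` div and stores the text of every `span` that has an `itemprop`. The class name `AddressParser` and its attributes are made up for illustration, and the markup fed to it is trimmed from the question.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class AddressParser(HTMLParser):
    """Collect itemprop span text, one dict per result-box div (illustrative)."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # itemprop of the span currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "result-box" in (attrs.get("class") or "").split():
            self.records.append({})  # a new result box starts here
        elif tag == "span" and attrs.get("itemprop") and self.records:
            self._field = attrs["itemprop"]

    def handle_data(self, data):
        # Only keep text that sits inside an itemprop span.
        if self._field is not None:
            record = self.records[-1]
            record[self._field] = record.get(self._field, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


parser = AddressParser()
parser.feed(
    '<div class="result-box" itemscope>'
    '<span itemprop="addressLocality">Lawrenceville</span> '
    '<span itemprop="addressRegion">NJ</span></div>'
)
print(parser.records)
```

This event-driven style needs more bookkeeping than Beautiful Soup, but it has no third-party dependency.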
