First program with scraping, lists, string manipulation

Question

I wanted to find out which states and cities the players on the USA hockey team are from, but I didn't want to count them manually from the roster page.

I'm really interested to see if someone has a more elegant way to do what I've done (which feels like glue and duct tape) for future purposes. I read about 12 different Stack Overflow questions to get here.

from bs4 import BeautifulSoup
from collections import Counter
import urllib2

url='http://olympics.usahockey.com/page/show/1067902-roster'
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())


locations = []
city = []
state = []

counter = 0
tables = soup.findAll("table", { "class" : "dataTable" })
for table in tables:
    rows = table.findAll("tr")
    for row in rows:
        entries = row.findAll("td")
        for entry in entries:
            counter = counter + 1
            if counter == 7:
                locations.append(entry.get_text().encode('ascii'))
        counter = 0


for i in locations:
    splitter = i.split(", ")
    city.append(splitter[0])
    state.append(splitter[1])

print Counter(state)
print Counter(city)

I essentially wrote a three-level loop over table -> tr -> td and used a counter to grab the 7th column, adding it to a list. Then I iterated through that list, splitting each entry on the comma and putting the city in one list and the state in a second. Finally, I ran both lists through Counter to print the counts per city and per state. I have a hunch this could be done much more simply, and I'm curious to hear opinions.

Joel Cornett · Accepted Answer · 2014-02-25 17:22:14Z

It looks like you're just trying to get the n-th column from a bunch of tables on the page. In that case, there's no need to iterate through all the find_all() results; just use list indexing:

for row in soup.find_all('tr'):
    myList.append(row.find_all('td')[n])
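
As a rough illustration of the indexing idea without BeautifulSoup, here is a minimal sketch on plain lists. The rows and place names are made-up sample data; each inner list stands in for what row.find_all('td') would return for one row.

```python
# Hypothetical sample data: each inner list stands in for the <td> cells of one row.
rows = [
    ["1", "Goalie", "Anytown, MN"],
    ["2", "Forward", "Springfield, MA"],
]
n = 2  # zero-based index of the wanted column

# Direct indexing replaces the inner loop and the manual counter.
column = [cells[n] for cells in rows]
print(column)
```

Note that indexing like this raises IndexError for any row with fewer than n + 1 cells, such as a header row.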

This is also a good use case for generators because you are iterating over the same set of data several times. Here is an example:

from bs4 import BeautifulSoup
from collections import Counter
from itertools import chain
import urllib2

url = 'http://olympics.usahockey.com/page/show/1067902-roster'
soup = BeautifulSoup(urllib2.urlopen(url).read())

def get_table_column(table, n):
    rows = (row.find_all("td") for row in table.find_all("tr"))
    return (cells[n-1] for cells in rows)

tables = soup.find_all("table", class_="dataTable")
column = chain.from_iterable(get_table_column(table, 7) for table in tables)
city, state = zip(*(cell.get_text().encode('ascii').split(', ') for cell in column))

print Counter(state)
print Counter(city)
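
The chain.from_iterable and zip(*...) steps can be tried in isolation on plain lists. This is only a sketch with made-up sample data (one list of "City, ST" strings per table), not the real roster:

```python
from collections import Counter
from itertools import chain

# Hypothetical sample data: one list of "City, ST" strings per table.
tables = [
    ["Anytown, MN", "Lakeside, MN"],
    ["Springfield, MA"],
]

# chain.from_iterable flattens the per-table columns into one lazy stream.
column = chain.from_iterable(tables)

# zip(*pairs) transposes [city, state] pairs into a tuple of cities
# and a tuple of states.
cities, states = zip(*(cell.split(", ") for cell in column))

print(Counter(states))
print(Counter(cities))
```

Because column is a lazy iterator, it can only be consumed once; here that is enough, since zip walks it a single time.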

While this works, it also might be a good idea to anticipate possible errors and validate your data:

To handle rows that have fewer than n td elements (n = 7 here), we would change the last line of get_table_column to:

return (cells[n-1] for cells in rows if len(cells) >= n)
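
To see what that length check does, here is a small sketch on plain lists. The get_column helper and the sample rows are hypothetical stand-ins for get_table_column and the parsed table rows:

```python
def get_column(rows, n):
    """Yield the n-th (1-based) cell of each row that has at least n cells."""
    return (cells[n - 1] for cells in rows if len(cells) >= n)

# Hypothetical sample rows; the empty list mimics a header row with no <td> cells.
rows = [
    [],
    ["a", "b", "c"],
    ["d"],
    ["e", "f", "g"],
]

# Rows that are too short are skipped instead of raising IndexError.
result = list(get_column(rows, 3))
print(result)
```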

We should also anticipate cases where the cell contents do not contain a comma. Let's expand the line where we split on a comma:

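# Split on ', ' (comma plus space), as before, so the state has no leading space.
splits = (cell.get_text().encode('ascii').split(', ') for cell in column)
# Keep only the cells that split into exactly a city and a state.
city, state = zip(*(split for split in splits if len(split) == 2))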
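
A minimal sketch of this filter on made-up strings follows. It splits on ', ' (comma plus space) and adds one guard that is my own assumption rather than part of the answer: if no cell passes the filter, zip(*...) yields nothing and the two-name unpacking would raise ValueError.

```python
# Hypothetical sample cells, including two that have no "City, ST" shape.
cells = ["Anytown, MN", "Unknown", "Springfield, MA", ""]

splits = (cell.split(", ") for cell in cells)

# Keep only the cells that split into exactly two parts.
pairs = [parts for parts in splits if len(parts) == 2]

# Guard the empty case, where zip(*pairs) would produce nothing to unpack.
cities, states = zip(*pairs) if pairs else ((), ())

print(cities)
print(states)
```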

Josay · Answer 2 · 2014-02-25 09:02:32Z

You can use enumerate to avoid maintaining the counter by hand.

for counter,entry in enumerate(entries):
    if counter == 6:
        locations.append(entry.get_text().encode('ascii'))
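
Here is the same idea as a small sketch on a plain list, where the entries are made-up placeholders for the cells of one row. enumerate counts from 0 by default, which is why the 7th column is index 6; passing a start value of 1 lets you keep the original "7".

```python
# Hypothetical cells of one roster row; the 7th one holds the location.
entries = ["12", "Jane Doe", "F", "5-10", "170", "1/1/90", "Anytown, MN", "Team"]

locations = []
for counter, entry in enumerate(entries):
    if counter == 6:  # zero-based index of the 7th column
        locations.append(entry)

# Equivalent, counting from 1 so the comparison matches the column number.
locations_from_one = [entry for counter, entry in enumerate(entries, 1) if counter == 7]

print(locations)
```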

states and cities would probably be better names for the collections of states and cities.
You can also use unpacking when splitting the string.

For instance:

for i in locations:
    city, state = i.split(", ")
    cities.append(city)
    states.append(state)
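
One thing to keep in mind with unpacking is that it raises ValueError when the split does not produce exactly two parts. A minimal sketch on made-up strings; the split_location helper and the skipped list are hypothetical names for illustration:

```python
def split_location(location):
    """Split 'City, ST' into (city, state); raises ValueError on any other shape."""
    city, state = location.split(", ")
    return city, state

cities, states, skipped = [], [], []
for location in ["Anytown, MN", "Unknown", "Springfield, MA"]:
    try:
        city, state = split_location(location)
    except ValueError:
        # Not exactly one ", " separator: remember the entry and move on.
        skipped.append(location)
        continue
    cities.append(city)
    states.append(state)

print(cities, states, skipped)
```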

You don't need to store locations in a list and then iterate over the locations, you can handle them directly without storing them.

This is pretty much what my code looks like at this stage:

#!/usr/bin/python

from bs4 import BeautifulSoup
from collections import Counter
import urllib2

url='http://olympics.usahockey.com/page/show/1067902-roster'
soup = BeautifulSoup(urllib2.urlopen(url).read())

cities = []
states = []

for table in soup.findAll("table", { "class" : "dataTable" }):
    for row in table.findAll("tr"):
        for counter,entry in enumerate(row.findAll("td")):
            if counter == 6:
                city, state = entry.get_text().encode('ascii').split(", ")
                cities.append(city)
                states.append(state)

print Counter(states)
print Counter(cities)

Now, because of the way American cities are named, it might be worth counting the cities while taking their state into account (because Portland, Maine and Portland, Oregon are not the same city). Thus, it might be worth storing the city and state together as a tuple.

This is how I'd do it:

#!/usr/bin/python

from bs4 import BeautifulSoup
from collections import Counter
import urllib2

url='http://olympics.usahockey.com/page/show/1067902-roster'
soup = BeautifulSoup(urllib2.urlopen(url).read())

cities = []
for table in soup.findAll("table", { "class" : "dataTable" }):
    for row in table.findAll("tr"):
        for counter,entry in enumerate(row.findAll("td")):
            if counter == 6:
                city, state = entry.get_text().encode('ascii').split(", ")
                cities.append((state,city))

print Counter(cities)
print Counter(state for state,city in cities)
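
To see why the (state, city) tuple matters, here is a small sketch on a made-up list of locations (not the real roster). Counting by city name alone merges the two Portlands, while counting by tuple keeps them apart:

```python
from collections import Counter

# Hypothetical sample data with two different cities that share a name.
locations = ["Portland, OR", "Portland, ME", "Portland, OR"]

pairs = [tuple(loc.split(", ")) for loc in locations]  # (city, state)

# Counting by name alone cannot tell the two cities apart.
by_name = Counter(city for city, state in pairs)

# Counting by (state, city) keeps them separate.
by_place = Counter((state, city) for city, state in pairs)

print(by_name)
print(by_place)
```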

Also, something I forgot to mention that might be useful to you: you can use defaultdict if all you want to do is count elements.

from bs4 import BeautifulSoup
from collections import defaultdict
import urllib2

url='http://olympics.usahockey.com/page/show/1067902-roster'
soup = BeautifulSoup(urllib2.urlopen(url).read())

cities = defaultdict(int)
for table in soup.findAll("table", { "class" : "dataTable" }):
    for row in table.findAll("tr"):
        for counter,entry in enumerate(row.findAll("td")):
            if counter == 6:
                city, state = entry.get_text().encode('ascii').split(", ")
                cities[state,city] += 1
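
For comparison, here is a minimal sketch on made-up (state, city) pairs showing that the defaultdict(int) loop ends up with the same counts as Counter. Either works for plain counting; Counter additionally offers helpers such as most_common.

```python
from collections import Counter, defaultdict

# Hypothetical (state, city) pairs standing in for the scraped data.
pairs = [("MN", "Anytown"), ("MA", "Springfield"), ("MN", "Anytown")]

# defaultdict(int) starts every missing key at 0, so += 1 needs no setup.
counts = defaultdict(int)
for state, city in pairs:
    counts[state, city] += 1

# Counter builds the same mapping in one call.
same_counts = Counter(pairs)

print(dict(counts))
```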

The unpacking is what I was looking for, that makes much more sense. — user3349351, Feb 25 at 14:00
