
I'm trying to scrape a table from the NYSE website (http://www1.nyse.com/about/listed/IPO_Index.html) into a pandas dataframe. In order to do so, I have a setup like this:

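import pandas
import requests
from bs4 import BeautifulSoup

def htmltodf(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')  # name the parser explicitly

    tables = soup.find_all('table')  # only tables present in the static HTML
    test = pandas.read_html(str(tables))

    return test  # list of DataFrames, one per table found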

However, when I run this on the page, all of the tables returned in the list are essentially empty. When I investigated further, I found that the table is generated by JavaScript. In my browser's developer tools, the table looks like any other HTML table, with the usual tags. However, viewing the page source reveals something like this instead:

<script language="JavaScript">

.
.
.

<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May  2014","/about/listed/icc.html" ],
... more entries here ...
,["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;

    if(year.length != 0) 
    {   

    document.write ("<table width='619' border='0' cellspacing='0' cellpadding='0'><tr><td><span class='fontbold'>");
    document.write ('2014' + " IPO Showcase"); 
    document.write ("</span></td></tr></table>"); 
    }  
</script>

Therefore, when my HTML parser looks for table tags, all it finds is the script with the if condition, and none of the tags that would hold the content. How can I scrape this table? Is there a tag other than table that I can search for to reveal the content? And since the data is not in a traditional HTML table, how do I read it into pandas? Do I have to parse the data manually?

I don't think you can with BS. Maybe try selenium? stackoverflow.com/questions/8960288/… –  fredtantini Jul 31 at 15:08
    
Did you use splinter? –  WannaBeCoder Jul 31 at 15:11
    
Maybe this will help: stackoverflow.com/questions/8143023/… –  WannaBeCoder Jul 31 at 15:16

1 Answer (accepted)

In this case, you need something to run that JavaScript code for you, because the table only exists after the script has executed in a browser.
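To see why a static parse comes up empty, here is a minimal sketch using only the standard library's HTMLParser. The page below is made up for illustration (it is not the NYSE markup), and TagCounter is a hypothetical helper name. A parser that does not execute scripts treats the contents of a script element as plain text, so a table written by document.write is never seen as a tag:

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags the way a static (non-JavaScript) parser sees them."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


# Made-up page: one table in the markup, one written by JavaScript.
PAGE = """
<html><body>
<table><tr><td>static</td></tr></table>
<script>
document.write("<table><tr><td>dynamic</td></tr></table>");
</script>
</body></html>
"""

counter = TagCounter()
counter.feed(PAGE)
counter.close()

# Only the table present in the markup is counted; the scripted one is text.
table_count = counter.counts.get("table", 0)
print(table_count)
```

Only the table that is literally in the markup is counted. The one inside the script would show up only after a browser, or something that behaves like one, runs the code.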

One option here would be to use selenium:

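from io import StringIO

from pandas import read_html
from selenium import webdriver
from selenium.webdriver.common.by import By

# Firefox runs the page's JavaScript, so the generated table exists in the DOM
driver = webdriver.Firefox()
driver.get('http://www1.nyse.com/about/listed/IPO_Index.html')

# select the parent of the nested table and take its rendered HTML
table = driver.find_element(By.XPATH, '//div[@class="sp5"]/table//table/..')
table_html = table.get_attribute('innerHTML')

# read_html returns a list of DataFrames; take the first one
df = read_html(StringIO(table_html))[0]
print(df)

driver.quit()  # end the browser session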

prints:

                                                    0        1          2   3
0                                                Name   Symbol        NaT NaN
1                       Performance Sports Group Ltd.      PSG 2014-06-20 NaN
2                           Century Communities, Inc.      CCS 2014-06-18 NaN
3                        Foresight Energy Partners LP     FELP 2014-06-18 NaN
...
79  EGShares TCW EM Long Term Investment Grade Bon...     LEMF 2014-01-08 NaN
80  EGShares TCW EM Short Term Investment Grade Bo...     SEMF 2014-01-08 NaN

[81 rows x 4 columns]
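
If you would rather not drive a browser, another possibility is to parse the data manually, as the question suggests: pull the `var year = [[...]]` literal out of the page source and decode it. This is only a sketch under some assumptions. It assumes the literal is valid JSON (double-quoted strings, as in the excerpt quoted in the question), the sample source below contains only the two entries shown there, and the column names and the function name extract_year_rows are my own labels, not anything defined by the page. The array may also not be the same table as the one printed above.

```python
import json
import re

# Sample of the page source, trimmed to the two entries quoted in the question.
SAMPLE_SOURCE = '''
<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May  2014","/about/listed/icc.html" ],
["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;
</script>
'''

# Hypothetical column labels; the page itself does not name the fields.
COLUMNS = ["symbol", "name", "date", "link"]


def extract_year_rows(page_source):
    """Return the rows of the `var year = [[...]]` literal as a list of dicts."""
    match = re.search(r"var\s+year\s*=\s*(\[\[.*?\]\])\s*;", page_source, re.DOTALL)
    if match is None:
        return []  # the literal was not found in this source
    # Assumption: the JavaScript literal happens to be valid JSON.
    rows = json.loads(match.group(1))
    return [dict(zip(COLUMNS, row)) for row in rows]


rows = extract_year_rows(SAMPLE_SOURCE)
print(rows[0]["symbol"], rows[1]["name"])
```

With the real page you would pass `requests.get(url).text` instead of the sample string, and the resulting list of dicts can be handed to `pandas.DataFrame(rows)`. If the literal ever stops being valid JSON (single-quoted strings, trailing commas), `json.loads` will raise and the selenium approach is the safer route.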
This works really well--thanks! –  user2643394 Jul 31 at 16:47
