
I can scrape the target values for a date-specific URL. How should I set up datetime and the scraping so that URLs which do not have the target table are skipped? This is the code I have so far:

import datetime

date = datetime.datetime.today()
url = "http://www.wsj.com/mdc/public/page/2_3022-mfsctrscan-moneyflow-{date}.html?mod=mdc_pastcalendar"

The {date} placeholder is substituted into the URL to make the date dynamic; a static URL would simply have a concrete date such as 20161205 in its place.

import requests
from bs4 import BeautifulSoup

address = url.format(date=date.strftime('%Y%m%d'))
print('Retrieving information from: ' + address)
print('\n')
soup = BeautifulSoup(requests.get(address).content, "lxml")

The scraping then proceeds as follows:

rows = soup.select('div#column0 table tr')[2:]

headers = ['name', 'last', 'chg', 'pct_chg',
           'total_money_flow', 'total_tick_up', 'total_tick_down', 'total_up_down_ratio',
           'block_money_flow', 'block_tick_up', 'block_tick_down', 'block_up_down_ratio']

for row in rows:
    # skip non-data rows (section-header rows carry a td with class b14)
    if row.find("td", class_="b14") is not None:
        continue
    print(dict(zip(headers, [cell.get_text(strip=True) for cell in row.find_all('td')])))
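
One way to skip a date that has no report, along the lines of the comments below, is to check whether the selector matched anything before reading rows. This is a minimal sketch, assuming a page for a missing date simply renders without the div#column0 table (if an error page comes back instead, a try/except around the request would serve the same purpose); get_rows is a hypothetical helper name, not something from the code above:

import datetime
import requests
from bs4 import BeautifulSoup

def get_rows(address):
    # Hypothetical helper: fetch one page and return its data rows;
    # the list is empty when the money-flow table is not present.
    soup = BeautifulSoup(requests.get(address).content, "lxml")
    return soup.select('div#column0 table tr')[2:]

url = ('http://www.wsj.com/mdc/public/page/'
       '2_3022-mfsctrscan-moneyflow-{date}.html?mod=mdc_pastcalendar')
address = url.format(date=datetime.datetime.today().strftime('%Y%m%d'))
rows = get_rows(address)
if not rows:
    # Assumption: a date with no report renders a page without the target table.
    print('No money-flow table at ' + address + ', skipping this date.')
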
  • Can't you use a try to skip the dates that don't have the target table?
    – Rafael
    Commented Dec 8, 2016 at 23:52
  • I think so, but what would the values look like to get the script to fetch retroactive dates? If the URL part were numeric, like URL.x, I could use n=1; for i in range(1, n+1), but I'm not sure how to do the same with datetime.
    – Derek_P
    Commented Dec 9, 2016 at 0:03
  • Off the top of my head, you could make a separate file with the dates and read them in as a list, then iterate over that list, or use timedelta stackoverflow.com/questions/1712116/… (see the sketch after these comments)
    – Rafael
    Commented Dec 9, 2016 at 0:08
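
Building on the timedelta suggestion above, here is a minimal sketch of generating the date strings to substitute into the URL; the 10-day look-back window is an arbitrary choice, not something from the question:

import datetime

url = ('http://www.wsj.com/mdc/public/page/'
       '2_3022-mfsctrscan-moneyflow-{date}.html?mod=mdc_pastcalendar')

# Walk backwards from today, one day at a time, and build each day's URL.
today = datetime.date.today()
for i in range(10):  # arbitrary 10-day look-back window
    day = today - datetime.timedelta(days=i)
    address = url.format(date=day.strftime('%Y%m%d'))
    print('Retrieving information from: ' + address)
    # ...fetch the page here and skip it when the target table is missing,
    # e.g. with the empty-selection check sketched after the question code.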
