I have strings scraped from irregular, poorly formatted HTML pages, and each one contains a timestamp. I want to remove the timestamp entirely and keep everything else.
from bs4 import BeautifulSoup
date1 = '<P><SPAN STYLE="font-family: Univers" STYLE="font-size: 11pt"><STRONG></STRONG></SPAN><SPAN STYLE="font-family: Univers" STYLE="font-size: 11pt">10:15 AM ARVIND KRISHNAMURTHY, Northwestern University</SPAN></P>'
date2 = """<tr><td style="width:1.2in;padding:0in 5.4pt 0in 5.4pt" valign="top" width="115"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:Univers"><span style="mso-spacerun: yes"> </span>8:45 a.m.<o:p></o:p></span></p></td><td style="width:5.45in;padding:0in 5.4pt 0in 5.4pt" valign="top" width="523"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:Univers">RICARDO CABALLERO, MIT and NBER<o:p></o:p></span></p></td></tr>"""
soup1 = BeautifulSoup(date1)
print repr(soup1.text.strip())
# u'10:15 AM ARVIND KRISHNAMURTHY, Northwestern University'
soup2 = BeautifulSoup(date2)
print repr(soup2.text.strip())
# u'8:45 a.m.RICARDO CABALLERO, MIT and NBER'
To get the text following the timestamp, I split on whitespace and join all elements except the first two:
def remove_date(aString):
    cleaned = aString.replace("\t", " ").replace(".m.", " ").strip()
    return " ".join(cleaned.split(" ")[2:]).strip()
string1 = remove_date(soup1.text.strip())
print repr(string1)
# u'ARVIND KRISHNAMURTHY, Northwestern University'
string2 = remove_date(soup2.text.strip())
print repr(string2)
# u'RICARDO CABALLERO, MIT and NBER'
This delivers the desired result, but it is arguably very ugly. Is there something better? Maybe something that works like dateutil.parser.parse(), but in reverse?
dateutil.parser.parse() might be enough with the fuzzy_with_tokens=True parameter. – 301_Moved_Permanently Oct 30 '15 at 12:32
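As a possible alternative to the split-and-join approach in the question, here is a minimal sketch that strips a leading time with a regular expression from the standard library. It is not the dateutil approach suggested in the comment. The names TIME_PREFIX and strip_leading_time are hypothetical, and the pattern assumes the timestamp always comes first and looks like "10:15 AM", "8:45 a.m." or a bare "14:30"; other formats would need a different pattern.

```python
import re

# Assumption: the timestamp is at the start of the text and has the form
# H:MM or HH:MM, optionally followed by AM/PM or a.m./p.m. (any case).
TIME_PREFIX = re.compile(
    r"""^\s*
        \d{1,2}:\d{2}                    # hours and minutes
        (?:\s*(?:[ap]\.m\.|[ap]m\b))?    # optional a.m./p.m. or AM/PM marker
        \s*""",
    re.IGNORECASE | re.VERBOSE,
)


def strip_leading_time(text):
    """Return text without a leading timestamp; unchanged if none matches."""
    return TIME_PREFIX.sub("", text, count=1).strip()


print(strip_leading_time("10:15 AM ARVIND KRISHNAMURTHY, Northwestern University"))
# ARVIND KRISHNAMURTHY, Northwestern University
print(strip_leading_time("8:45 a.m.RICARDO CABALLERO, MIT and NBER"))
# RICARDO CABALLERO, MIT and NBER
```

The word boundary after the undotted marker keeps a name such as "AMANDA" from being mistaken for "AM", while the dotted form is allowed to run straight into the next word, as it does in the second sample string.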