Remove (not extract) timestamp from HTML string

Question

I've got strings from irregularly and badly formatted HTML pages, and each one contains a timestamp. I want to remove the timestamp entirely and keep everything else.

from bs4 import BeautifulSoup

date1 = '<P><SPAN STYLE="font-family: Univers" STYLE="font-size: 11pt"><STRONG></STRONG></SPAN><SPAN STYLE="font-family: Univers" STYLE="font-size: 11pt">10:15 AM  ARVIND KRISHNAMURTHY, Northwestern University</SPAN></P>'
date2 = """<tr><td style="width:1.2in;padding:0in 5.4pt 0in 5.4pt" valign="top" width="115"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:Univers"><span style="mso-spacerun: yes"> </span>8:45 a.m.<o:p></o:p></span></p></td><td style="width:5.45in;padding:0in 5.4pt 0in 5.4pt" valign="top" width="523"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:Univers">RICARDO  CABALLERO, MIT and NBER<o:p></o:p></span></p></td></tr>"""

soup1 = BeautifulSoup(date1)
print repr(soup1.text.strip())
# u'10:15 AM  ARVIND KRISHNAMURTHY, Northwestern University'
soup2 = BeautifulSoup(date2)
print repr(soup2.text.strip())
# u'8:45 a.m.RICARDO  CABALLERO, MIT and NBER'

Now, to get the text following the timestamp, I split on whitespace and join all elements except the first two:

def remove_date(aString):
    cleaned = aString.replace("\t", " ").replace(".m.", " ").strip()
    return " ".join(cleaned.split(" ")[2:]).strip()

string1 = remove_date(soup1.text.strip())
print repr(string1)
# u'ARVIND KRISHNAMURTHY, Northwestern University'
string2 = remove_date(soup2.text.strip())
print repr(string2)
# u'RICARDO  CABALLERO, MIT and NBER'

It delivers the desired result, but this is arguably very ugly. Is there something better? Maybe something that works like dateutil.parser.parse(), but in reverse?

Using dateutil.parser.parse() with the fuzzy_with_tokens=True parameter might be enough. — 301_Moved_Permanently, Oct 30 '15 at 12:32
@MathiasEttinger: I was using an old version of dateutil, which is why it didn't work. Do you want to write a complete answer which I then accept? — MERose, Oct 30 '15 at 23:03

SuperBiasedMan · Accepted Answer · 2015-10-30 12:45:45Z

You're right that it's ugly. I thought it didn't work at first, until I realised that you're splitting in order to drop the first two whitespace-separated strings. It's very unclear and very specific to these formats. If you're going to be format specific, then shouldn't you just use a regex?

r'^\d+:\d+\s+(am|pm|a\.m\.|p\.m\.)' will match the format in both the provided cases.

A brief breakdown: it matches digits, then a colon, then digits again, then whitespace followed by one of am, pm, a.m. or p.m. You can make it case insensitive to avoid listing the upper-case variants by hand. The ^ at the start means it only matches when the pattern is at the beginning of the string, so strings that merely contain a similar format elsewhere in the text are left alone. I used this site for figuring out this regex and it's very helpful for stuff like this.

You can compile this regex simply with this:

import re

pattern = re.compile(ur'^\d+:\d+\s+(am|pm|a\.m\.|p\.m\.)', re.IGNORECASE)

Then just call it on your stripped soup text:

# Strip again afterwards: the whitespace that followed the timestamp is left behind.
string1 = pattern.sub("", soup1.text.strip()).strip()
print repr(string1)
string2 = pattern.sub("", soup2.text.strip()).strip()
print repr(string2)
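
For reuse, the same idea can be wrapped in a small function. The following is only a sketch: the names TIME_PREFIX and strip_leading_time are made up for illustration, the pattern is the one above written with re.VERBOSE so each part carries a comment, and a trailing \s* is added on the assumption that any spaces or tabs after the timestamp should go as well.

```python
import re

# The pattern from above, spelled out with re.VERBOSE.
# TIME_PREFIX and strip_leading_time are hypothetical names.
TIME_PREFIX = re.compile(
    r"""
    ^\d+:\d+                  # digits, a colon, digits
    \s+                       # whitespace between time and am/pm marker
    (?:am|pm|a\.m\.|p\.m\.)   # am, pm, a.m. or p.m.
    \s*                       # assumption: also drop spaces/tabs that follow
    """,
    re.IGNORECASE | re.VERBOSE,
)


def strip_leading_time(text):
    """Return text without a leading timestamp; other text is unchanged."""
    return TIME_PREFIX.sub("", text.strip())


print(repr(strip_leading_time("10:15 AM  ARVIND KRISHNAMURTHY, Northwestern University")))
print(repr(strip_leading_time("8:45 a.m.RICARDO  CABALLERO, MIT and NBER")))
# A time in the middle of the text is left alone because of the ^ anchor.
print(repr(strip_leading_time("Lunch starts at 12:30 pm today")))
```

Like the original pattern, this does not check what follows the am/pm marker, so a string such as "8:45 amber" would lose its first two letters. Whether that matters depends on the data.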

MERose · Accepted Answer · 2015-11-01 08:58:56Z

Based on MathiasEttinger's suggestions in the comments and my initial feeling, I tried using dateutil:

from bs4 import BeautifulSoup
from dateutil import parser

date1 = '<P><SPAN STYLE="font-family: Univers" STYLE="font-size: 11pt"><STRONG></STRONG></SPAN><SPAN STYLE="font-family: Univers" STYLE="font-size: 11pt">10:15 AM  ARVIND KRISHNAMURTHY, Northwestern University</SPAN></P>'
date2 = """<tr><td style="width:1.2in;padding:0in 5.4pt 0in 5.4pt" valign="top" width="115"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:Univers"><span style="mso-spacerun: yes"> </span>8:45 a.m.<o:p></o:p></span></p></td><td style="width:5.45in;padding:0in 5.4pt 0in 5.4pt" valign="top" width="523"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:Univers">RICARDO  CABALLERO, MIT and NBER<o:p></o:p></span></p></td></tr>"""

soup1 = BeautifulSoup(date1)
soup2 = BeautifulSoup(date2)

string1 = ' '.join(parser.parse(soup1.text, fuzzy_with_tokens=True)[1])
print repr(string1)
# u'    ARVIND KRISHNAMURTHY, Northwestern University'
string2 = ' '.join(parser.parse(soup2.text, fuzzy_with_tokens=True)[1])
print repr(string2)
# u'    .m.RICARDO  CABALLERO,   and NBER'

However, for some characters (e.g. tab) the algorithm is too greedy:

date3 = """<BR WP="BR1"><BR WP="BR2"><P><SPAN STYLE="font-size: 11pt"> 4:00 PM\tJOHN P. CONLEY, Northwestern University</SPAN></P>"""
soup3 = BeautifulSoup(date3)
print repr(soup3.text)
# u' 4:00 PM\tJOHN P. CONLEY, Northwestern University'
string3 = ' '.join(parser.parse(soup3.text, fuzzy_with_tokens=True)[1])
print repr(string3)
# u'        . CONLEY, Northwestern University'
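
One possible way around the greediness, sketched here without dateutil, is to interpret only the leading clock time and return the rest of the string untouched. This is not what fuzzy_with_tokens does internally; LEADING_TIME and split_leading_time are hypothetical names, and the sketch assumes every timestamp is a 12-hour time at the start of the text.

```python
import re
from datetime import time

# Hypothetical helper, not part of dateutil: only a leading 12-hour clock
# time is interpreted, so tokens such as "MIT" or "P." are never consumed.
LEADING_TIME = re.compile(
    r"^\s*(\d{1,2}):(\d{2})\s*([ap])\.?m\.?\s*", re.IGNORECASE
)


def split_leading_time(text):
    """Return (datetime.time or None, remaining text)."""
    match = LEADING_TIME.match(text)
    if match is None:
        return None, text.strip()
    hour, minute = int(match.group(1)), int(match.group(2))
    if not (1 <= hour <= 12 and minute <= 59):
        # Looks like a time but is not a valid 12-hour value: leave it alone.
        return None, text.strip()
    is_pm = match.group(3).lower() == "p"
    hour24 = hour % 12 + (12 if is_pm else 0)  # 12 AM -> 0, 12 PM -> 12
    return time(hour24, minute), text[match.end():].strip()


parsed, rest = split_leading_time(" 4:00 PM\tJOHN P. CONLEY, Northwestern University")
print(repr(rest))
# 'JOHN P. CONLEY, Northwestern University'
```

The parsed time comes back as well, in case it is needed later. The regex-based answer above remains the simpler choice if the time itself is to be thrown away.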
