Open Data Stack Exchange is a question and answer site for developers and researchers interested in open data. It's 100% free, no registration required.

I'm looking to parse a large number of lines of repetitive but unstructured data. This is a task that comes up at least once per project, in my experience, so I'm looking for a tool to transform fairly standard text into structured data. Right now I just use a combination of regex find-and-replace and one-off Python scripts.

Here's a clean example:

Distance: 25.903 miles*
Morgan Road Middle School Extension
    HEPHZIBAH GA, 30815
    Telephone: 706.504.4071
    A unit of: Boys & Girls Clubs of Augusta
        http://www.bgcaugusta.org

And here's a slightly messier example:

Maria Teresa’s Babies Early Enrichment Center/Daycare 
825 23rd Street South 
Arlington, VA 22202 
703-979-BABY (2229) 
22. 
Maria Teresa Desaba, Owner/Director; Tony Saba, Org. Director. 
Website: www.mariateresasbabies.com 
Serving children 6 wks to 5yrs full-time. 


National Science Foundation Child  Development Center  
23. 
4201 Wilson Blvd., Suite 180  22203 
703-292-4794 
Website:  www.brighthorizons.com 112 children, ages 6 wks - 5 yrs.   
7:00 a.m. – 6:00 p.m. Summer Camp for children 5 - 9 years. 

These are only examples. The issue of parsing unstructured but repetitive data is something I come across fairly often, especially when receiving text or Word documents in response to FOIA requests. Mostly I'm wondering if someone has written a tool or library that's good at converting these documents into structured data, or if I should be thinking about how to write something myself.
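To make the current regex-and-Python approach concrete, here is a minimal sketch for the clean example above. The field names and the assumption that every record follows that exact five-line layout are mine, not something the data guarantees:

```python
import re

# Hypothetical one-off parser for the clean sample; assumes every record
# has the same Distance / name / "CITY ST, zip" / Telephone layout.
RECORD_RE = re.compile(
    r"Distance:\s*(?P<distance>[\d.]+)\s*miles\*?\n"
    r"(?P<name>.+)\n"
    r"\s*(?P<city>.+?)\s+(?P<state>[A-Z]{2}),\s*(?P<zip>\d{5})\n"
    r"\s*Telephone:\s*(?P<phone>[\d.]+)"
)

def parse_clean(text):
    """Return a list of dicts, one per record matching the assumed layout."""
    return [m.groupdict() for m in RECORD_RE.finditer(text)]

sample = """Distance: 25.903 miles*
Morgan Road Middle School Extension
    HEPHZIBAH GA, 30815
    Telephone: 706.504.4071
"""
```

The obvious weakness, and the reason for this question, is that each new data dump needs a new hand-built pattern like `RECORD_RE`.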

Can you be more specific about how unstructured the data is? A small sample would be nice. – RSFalcon7 May 9 at 1:07
Thanks. Edited. – rcackerman May 9 at 1:19
Where is this data coming from? – johnthexiii May 11 at 2:05
There has been a discussion about whether this question is off-topic. To be on the safe side, you might want to extend your question and briefly explain how it is relevant to the Open Data community. – Patrick Hoefler May 12 at 17:36
Edited to make the problem a bit clearer. I don't feel like this is a Stack Overflow question because I'm not asking how to code a solution; I've got a lot of those. I'm wondering whether anyone knows of tools or libraries that generalize this problem. Obviously, I didn't ask that question very well. – rcackerman May 12 at 17:47

9 Answers

Accepted answer (7 votes)

OpenRefine can be used to parse semi-structured data into a table-like structure, where it can be operated on much like a spreadsheet and then exported.

The site features a tutorial on converting a list on Wikipedia into a table, which may be a good starting point.

The operations involved can also be exported, in case you need to perform the same cleanup again.


Python is the tool I would use.

import csv
from itertools import zip_longest  # izip_longest on Python 2

def grouper(iterable, n, fillvalue=None):
    '''Collect data into fixed-length chunks or blocks:
       grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx'''
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

with open("raw_data.txt") as f, open("out.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)

    for record in grouper(f, 4, fillvalue=""):
        record = [item.strip() for item in record]
        last = record[-1]            # e.g. "HEPHZIBAH GA, 30815"
        city = last.split(',')[0]
        state = last.split(' ')[1]
        zip_code = last.split(' ')[2]

        sanitized_record = record[:-1] + [city, state, zip_code]
        writer.writerow(sanitized_record)

The advantage this offers over grep is that if you later decide to put the data in a database or perform complex calculations on it, that will be much easier from Python than from grep.
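As a sketch of that advantage: once the records are parsed, a few lines of the standard library's sqlite3 module make them queryable with ordinary SQL. The column layout and sample row below are assumptions based on the script above:

```python
import sqlite3

# Hypothetical rows shaped like the sanitized_record output of the
# script above (name, phone, org, city, state, zip).
rows = [
    ("Morgan Road Middle School Extension", "Telephone: 706.504.4071",
     "A unit of: Boys & Girls Clubs of Augusta",
     "HEPHZIBAH GA", "GA,", "30815"),
]

conn = sqlite3.connect(":memory:")   # swap in a file path to persist
conn.execute("CREATE TABLE clubs (name, phone, org, city, state, zip)")
conn.executemany("INSERT INTO clubs VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Ordinary SQL now works, e.g. counting Georgia records:
count = conn.execute(
    "SELECT COUNT(*) FROM clubs WHERE state LIKE 'GA%'"
).fetchone()[0]
```

Doing the equivalent aggregation with grep and shell pipelines is possible but far more fragile.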

Do you know of any libraries for this kind of work? – rcackerman May 9 at 2:06
@rcackerman see Daniel Knell's answer. – johnthexiii May 13 at 8:29

It entirely depends on how unstructured your data is. You'll end up building a little "mini-domain language" to put the data into structured form.

The more unstructured it is, the more you're going to have to get into the minds of everyone who made the records. Shoot for the most common cases and put the last 20% in a "complete-by-hand" file.

Python, sed/awk, and Perl are the tools most commonly used for this. I use Python myself and it works quite well.
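A minimal sketch of such a mini-language in Python: a rule table of (pattern, field) pairs covers the common line shapes, and anything no rule recognizes lands in the complete-by-hand pile. The specific patterns here are assumptions modeled on the daycare sample in the question:

```python
import re

# Hypothetical rule table; extend it as you discover new line shapes.
RULES = [
    (re.compile(r"^Website:\s*(.+)", re.I), "website"),
    (re.compile(r"^(\d{3}-\d{3}-\w{4}.*)"), "phone"),
    (re.compile(r"^(.+,\s*[A-Z]{2}\s+\d{5})$"), "city_state_zip"),
]

def parse_record(lines):
    """Apply the first matching rule to each line; collect the rest."""
    record, by_hand = {}, []
    for line in lines:
        stripped = line.strip()
        for pattern, field in RULES:
            m = pattern.match(stripped)
            if m:
                record[field] = m.group(1)
                break
        else:
            by_hand.append(stripped)   # the ~20% to finish manually
    return record, by_hand
```

The by_hand list is the "complete-by-hand" file: you review it, either fix those lines or add a new rule, and rerun.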


I've always done work like this in Perl; my basic methodology goes something like the following. (Note: this is for dealing with multi-GB files; it can be simplified if you can load the whole thing into memory.)

I've denoted helper routines with &, although you probably don't want to use that old Perl 4 calling style, as it forces Perl to ignore function prototypes.

my $record = '';
while ( my $line = <> ) {
  if ( &test_for_record_separator ) {
     &parse_record( $record );
     $record = '';
     next;
  }
  $record .= $line;
}
&parse_record( $record );

sub parse_record { 
  my $record = shift;
  my %record = ();
  # exact logic depends on how variable the record structure is.  To handle labeled lines:
  foreach my $line ( split /\n/, $record ) {
    if ( $line =~ m/Some Label:\s*(.*)$/  ) { $record{'Label'}  = $1; next } 
    if ( $line =~ m/Other Label:\s*(.*)$/ ) { $record{'Label2'} = $1; next }
    ...
    &flag_unknown_lines( $line );
  }
  # if the record structure is fixed:
  @record{ qw( company street_num street city state zip phone index description ) } = (
    $record =~ m/(.+)\n(\d+)\s+(.+)\n(.+),\s+(\w[\w\s]+\w)\s+([-\d]+)\n(\d+)[.]\n([\s\S]+)/
  );
  # or some other more complicated process

  &throw_parse_error unless &validate(%record);
  &write_record_to_database_or_whatever(%record);
}

Obviously, knowledge of the input will let you select the best parsing routine: if you're only ever going to see state abbreviations, you won't need (\w[\w\s]+\w) (which cleanly matches New York or West Virginia). Bits of this (like looking for 'Key: Value' pairs) can easily be abstracted to handle arbitrary input, provided your writing routine can handle it.


It does seem like a scripting language is the proper approach. I'd read line by line, starting a new record at the top of the file or after an empty line, then apply some pattern matching; you'll be parsing away fairly quickly.
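The approach described above can be sketched in a few lines of Python. This handles only the record-splitting step (blank line starts a new record); per-record field extraction would follow, as in the other answers:

```python
def split_records(lines):
    """Group an iterable of lines into records separated by blank lines."""
    records, current = [], []
    for line in lines:
        if line.strip():             # non-empty: part of the current record
            current.append(line.rstrip("\n"))
        elif current:                # blank line: close the record
            records.append(current)
            current = []
    if current:                      # input may not end with a blank line
        records.append(current)
    return records
```

Because it consumes any iterable of lines, it works unchanged on an open file handle, so large files stream through without being loaded into memory.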

Which language would you recommend? Do you know of any libraries to help with the pattern recognition? – Patrick Hoefler May 9 at 1:55
The number of built-in string methods in Ruby may still be larger than in any other language (I haven't checked recently), but you ought to be able to do very similar work in Ruby, Perl, or Python. – Roger_S May 18 at 22:05

Python and some XSLT would do the trick.

Your answer is very short. Would you care to elaborate? – Patrick Hoefler May 13 at 7:26

The problem with the second example you posted is that there is almost no structure to it; it doesn't even have a consistent ordering of rows within each record.

I think this is where Perl shines, so I went ahead and wrote up a prototype (it took me about an hour).

I don't think a simpler tool exists that would get as decent a result. (There may be tools that are easier to learn, but they won't be any simpler to set up.)


All it does is parse the input file and emit computer-readable output.

You would still have to do some hand editing of either the input or the output, but it does a fairly decent job on your sample.

The code was designed not to lose a single piece of information, so the output would still need some processing before loading it into a database.

Anything it can't decipher gets thrown into a comment field. If you know a line should be tagged as hours (for example), you can just add "hours: " to the beginning of that line.

#! perl
use strict;
use warnings;

use 5.14.0;

# there were some wide chars in the input
use open ':encoding(utf8)';
use open (':std', ':encoding(utf8)');

$/ = "\n\n"; # input record separator

my @data;

RECORD: while( my $record = <> ){
  # split on newlines, ignore empty ones
  my @lines = grep { length } split /\r?\n/, $record;

  my %record = (
    name => shift @lines, # first line is the name
  );

  LINE: for( @lines ){ # for local $_ ( @lines ){...}
    s/^\s|\s+$//g; # remove leading and trailing spaces

    if( /^(\d+)\.$/a ){ # "22." "23." etc
      $record{number} = ''.$1;

    }elsif( /^\d{3}.\d{3}.[^\s]{4}/a ){ # loosely looks like a phone number
      push @{$record{phone}}, $_; # may be more than one phone number so put it in an array

    }elsif( /^(\w+):\s+(.*)/ ){
      push @{$record{lc $1}}, ''.$2; # handles website: ...

    }elsif( /\d+ .*? \b(?:st(?:reet)?|blvd|boul[ae]vard|ave(?:nue)?)\b/axi ){ # looks like a street address
      push @{$record{address}}, $_;

    }elsif( /,\s+\w\w\s+\d{5}/a ){ # looks like "city, state, zipcode"
      push @{$record{address}}, $_;

    }else{
      push @{ $record{comment} }, $_; # unknown line
    }
  }

  push @data, \%record;
}

use JSON;
# UTF8 is handled by "use open ..." above
# so we turn it off here
my $encode = JSON->new->utf8(0)->pretty;
print $encode->encode(\@data);

print "\n";

Output for the second (messier) sample:

[
   {
      "website" : [
         "www.mariateresasbabies.com"
      ],
      "comment" : [
         "Maria Teresa Desaba, Owner/Director; Tony Saba, Org. Director.",
         "Serving children 6 wks to 5yrs full-time."
      ],
      "number" : "22",
      "name" : "Maria Teresa’s Babies Early Enrichment Center/Daycare ",
      "address" : [
         "825 23rd Street South",
         "Arlington, VA 22202"
      ],
      "phone" : [
         "703-979-BABY (2229)"
      ]
   },
   {
      "website" : [
         "www.brighthorizons.com 112 children, ages 6 wks - 5 yrs."
      ],
      "comment" : [
         "7:00 a.m. – 6:00 p.m. Summer Camp for children 5 - 9 years."
      ],
      "number" : "23",
      "name" : "National Science Foundation Child  Development Center  ",
      "address" : [
         "4201 Wilson Blvd., Suite 180  22203"
      ],
      "phone" : [
         "703-292-4794"
      ]
   }
]

Open City's Dedupe Python library works along similar lines, though I haven't tried it.

Dedupe takes structured, but messy data. You'll need some other tool to parse an unstructured text file into different records and fields within records. – fgregg May 9 at 13:23

I think the best thing would be to use IntelliGet for this. It's a smart tool and has good support.

Your answer is very short. Could you please add a link to the tool and explain how to use it to solve the problem stated in the question? – Patrick Hoefler Aug 17 at 21:02
