Open Data Stack Exchange is a question and answer site for developers and researchers interested in open data. It's 100% free, no registration required.

I'm looking to parse a large number of lines of repetitive but unstructured data. This is a task that comes up at least once per project, in my experience, so I'm looking for a tool to transform fairly standard text into structured data. Right now I just use a combination of regex find-and-replace and one-off Python scripts.

Here's a clean example:

Distance: 25.903 miles*
Morgan Road Middle School Extension
    HEPHZIBAH GA, 30815
    Telephone: 706.504.4071
    A unit of: Boys & Girls Clubs of Augusta
        http://www.bgcaugusta.org

And here's a slightly messier example:

Maria Teresa’s Babies Early Enrichment Center/Daycare 
825 23rd Street South 
Arlington, VA 22202 
703-979-BABY (2229) 
22. 
Maria Teresa Desaba, Owner/Director; Tony Saba, Org. Director. 
Website: www.mariateresasbabies.com 
Serving children 6 wks to 5yrs full-time. 


National Science Foundation Child  Development Center  
23. 
4201 Wilson Blvd., Suite 180  22203 
703-292-4794 
Website:  www.brighthorizons.com 112 children, ages 6 wks - 5 yrs.   
7:00 a.m. – 6:00 p.m. Summer Camp for children 5 - 9 years. 

These are only examples. The issue of parsing unstructured but repetitive data is something I come across fairly often, especially when receiving text or Word documents in response to FOIA requests. Mostly I'm wondering if someone has written a tool or library that's good at converting these documents into structured data, or if I should be thinking about how to write something myself.
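To make the current regex-and-Python approach concrete, here is a minimal sketch for the clean example above. The field names and the assumption that every record follows that exact five-line layout are mine, not something the data guarantees:

```python
import re

# Hypothetical one-off parser for the clean sample; assumes every record
# has the same Distance / name / "CITY ST, zip" / Telephone layout.
RECORD_RE = re.compile(
    r"Distance:\s*(?P<distance>[\d.]+)\s*miles\*?\n"
    r"(?P<name>.+)\n"
    r"\s*(?P<city>.+?)\s+(?P<state>[A-Z]{2}),\s*(?P<zip>\d{5})\n"
    r"\s*Telephone:\s*(?P<phone>[\d.]+)"
)

def parse_clean(text):
    """Return a list of dicts, one per record matching the assumed layout."""
    return [m.groupdict() for m in RECORD_RE.finditer(text)]

sample = """Distance: 25.903 miles*
Morgan Road Middle School Extension
    HEPHZIBAH GA, 30815
    Telephone: 706.504.4071
"""
```

The obvious weakness, and the reason for this question, is that each new data dump needs a new hand-built pattern like `RECORD_RE`.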

Can you be more specific about how unstructured the data is? A small sample would be nice. – RSFalcon7 May 9 at 1:07
Thanks. Edited. – rcackerman May 9 at 1:19
Where is this data coming from? – johnthexiii May 11 at 2:05
There has been a discussion about whether this question is off-topic. To be on the safe side, you might want to extend your question and briefly explain how it is relevant to the Open Data community. – Patrick Hoefler May 12 at 17:36
Edited to make the problem a bit clearer. I don't feel like this is a Stack Overflow question because I'm not asking how to code a solution; I've got a lot of those. I'm wondering whether anyone knows of tools or libraries that generalize this problem. Obviously, I didn't ask that question very well. – rcackerman May 12 at 17:47

9 Answers

Accepted answer (7 votes)

OpenRefine can be used to parse semi-structured data into a table-like structure, where it can be operated on much like a spreadsheet and then exported.

The site features a tutorial on converting a list on Wikipedia into a table, which may be a good starting point.

The operations involved can also be exported, in case you need to perform the same cleanup again.


Python is the tool I would use.

import csv
from itertools import zip_longest  # izip_longest on Python 2

def grouper(iterable, n, fillvalue=None):
    '''Collect data into fixed-length chunks or blocks:
       grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx'''
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

with open("raw_data.txt") as f, open("out.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)

    for record in grouper(f, 4, fillvalue=""):
        record = [item.strip() for item in record]
        last = record[-1]            # e.g. "HEPHZIBAH GA, 30815"
        city = last.split(',')[0]
        state = last.split(' ')[1]
        zip_code = last.split(' ')[2]

        sanitized_record = record[:-1] + [city, state, zip_code]
        writer.writerow(sanitized_record)

The advantage this offers over grep is that if you later decide to put the data in a database or perform complex calculations on it, that will be much easier from Python than from grep.
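As a sketch of that advantage: once the records are parsed, a few lines of the standard library's sqlite3 module make them queryable with ordinary SQL. The column layout and sample row below are assumptions based on the script above:

```python
import sqlite3

# Hypothetical rows shaped like the sanitized_record output of the
# script above (name, phone, org, city, state, zip).
rows = [
    ("Morgan Road Middle School Extension", "Telephone: 706.504.4071",
     "A unit of: Boys & Girls Clubs of Augusta",
     "HEPHZIBAH GA", "GA,", "30815"),
]

conn = sqlite3.connect(":memory:")   # swap in a file path to persist
conn.execute("CREATE TABLE clubs (name, phone, org, city, state, zip)")
conn.executemany("INSERT INTO clubs VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Ordinary SQL now works, e.g. counting Georgia records:
count = conn.execute(
    "SELECT COUNT(*) FROM clubs WHERE state LIKE 'GA%'"
).fetchone()[0]
```

Doing the equivalent aggregation with grep and shell pipelines is possible but far more fragile.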

Do you know of any libraries for this kind of work? – rcackerman May 9 at 2:06
@rcackerman see Daniel Knell's answer. – johnthexiii May 13 at 8:29

It entirely depends on how unstructured your data is. You'll end up building a little "mini-domain language" to put the data into structured form.

The more unstructured it is, the more you're going to have to get into the minds of everyone who made the records. Shoot for the most common cases and put the last 20% in a "complete-by-hand" file.

Python, sed/awk, and Perl are the tools most commonly used for this. I use Python myself and it works quite well.
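A minimal sketch of such a mini-language in Python: a rule table of (pattern, field) pairs covers the common line shapes, and anything no rule recognizes lands in the complete-by-hand pile. The specific patterns here are assumptions modeled on the daycare sample in the question:

```python
import re

# Hypothetical rule table; extend it as you discover new line shapes.
RULES = [
    (re.compile(r"^Website:\s*(.+)", re.I), "website"),
    (re.compile(r"^(\d{3}-\d{3}-\w{4}.*)"), "phone"),
    (re.compile(r"^(.+,\s*[A-Z]{2}\s+\d{5})$"), "city_state_zip"),
]

def parse_record(lines):
    """Apply the first matching rule to each line; collect the rest."""
    record, by_hand = {}, []
    for line in lines:
        stripped = line.strip()
        for pattern, field in RULES:
            m = pattern.match(stripped)
            if m:
                record[field] = m.group(1)
                break
        else:
            by_hand.append(stripped)   # the ~20% to finish manually
    return record, by_hand
```

The by_hand list is the "complete-by-hand" file: you review it, either fix those lines or add a new rule, and rerun.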


I've always done work like this in Perl; my basic methodology goes something like the following. (Note: this is for dealing with multi-GB files; it can be simplified if you can load the whole thing into memory.)

I've denoted helper routines with &, although you probably don't want to use that old Perl 4 calling style, as it forces Perl to ignore function prototypes.

my $record = '';
while ( my $line = <> ) {
  if ( &test_for_record_separator ) {
     &parse_record( $record );
     $record = '';
     next;
  }
  $record .= $line;
}
&parse_record( $record );

sub parse_record { 
  my $record = shift;
  my %record = ();
  # exact logic depends on how variable the record structure is.  To handle labeled lines:
  foreach my $line ( split /\n/, $record ) {
    if ( $line =~ m/Some Label:\s*(.*)$/  ) { $record{'Label'}  = $1; next } 
    if ( $line =~ m/Other Label:\s*(.*)$/ ) { $record{'Label2'} = $1; next }
    ...
    &flag_unknown_lines( $line );
  }
  # if the record structure is fixed:
  @record{ qw( company street_num street city state zip phone index description ) } = (
    $record =~ m/(.+)\n(\d+)\s+(.+)\n(.+),\s+(\w[\w\s]+\w)\s+([-\d]+)\n(\d+)[.]\n([\s\S]+)/
  );
  # or some other more complicated process

  &throw_parse_error unless &validate(%record);
  &write_record_to_database_or_whatever(%record);
}

Obviously, knowledge of the input will let you select the best parsing routine: if you're only ever going to see state abbreviations, you won't need (\w[\w\s]+\w) (which cleanly matches New York or West Virginia). Bits of this (like looking for 'Key: Value' pairs) can easily be abstracted to handle arbitrary input, provided your writing routine can handle it.


It does seem like a scripting language is the proper approach. I'd read line by line, starting a new record at the top of the file or after an empty line, then apply some pattern matching; you'll be parsing away fairly quickly.
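The approach described above can be sketched in a few lines of Python. This handles only the record-splitting step (blank line starts a new record); per-record field extraction would follow, as in the other answers:

```python
def split_records(lines):
    """Group an iterable of lines into records separated by blank lines."""
    records, current = [], []
    for line in lines:
        if line.strip():             # non-empty: part of the current record
            current.append(line.rstrip("\n"))
        elif current:                # blank line: close the record
            records.append(current)
            current = []
    if current:                      # input may not end with a blank line
        records.append(current)
    return records
```

Because it consumes any iterable of lines, it works unchanged on an open file handle, so large files stream through without being loaded into memory.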

Which language would you recommend? Do you know of any libraries to help with the pattern recognition? – Patrick Hoefler May 9 at 1:55
The number of built-in string methods in Ruby may still be larger than in any other language (I haven't checked recently), but you ought to be able to do very similar work in Ruby, Perl, or Python. – Roger_S May 18 at 22:05

Python and some XSLT would do the trick.

Your answer is very short. Would you care to elaborate? – Patrick Hoefler May 13 at 7:26

The problem with the second example you posted is that there is almost no structure to it; it doesn't even have a consistent ordering of rows within each record.

I think this is where Perl shines, so I went ahead and wrote up a prototype (it took me about an hour).

I don't think a simpler tool exists that would get as decent a result. (There may be tools that are easier to learn, but they won't be any simpler to set up.)


All it does is parse the input file and emit computer-readable output.

You would still have to do some hand editing of either the input or the output, but it does a fairly decent job on your sample.

The code was designed not to lose a single piece of information, so the output would still need some processing before loading it into a database.

Anything it can't decipher gets thrown into a comment field. If you know a line should be tagged as hours (for example), you can just add "hours: " to the beginning of that line.

#! perl
use strict;
use warnings;

use 5.14.0;

# there were some wide chars in the input
use open ':encoding(utf8)';
use open (':std', ':encoding(utf8)');

$/ = "\n\n"; # input record separator

my @data;

RECORD: while( my $record = <> ){
  # split on newlines, ignore empty ones
  my @lines = grep { length } split /\r?\n/, $record;

  my %record = (
    name => shift @lines, # first line is the name
  );

  LINE: for( @lines ){ # for local $_ ( @lines ){...}
    s/^\s|\s+$//g; # remove leading and trailing spaces

    if( /^(\d+)\.$/a ){ # "22." "23." etc
      $record{number} = ''.$1;

    }elsif( /^\d{3}.\d{3}.[^\s]{4}/a ){ # loosely looks like a phone number
      push @{$record{phone}}, $_; # may be more than one phone number so put it in an array

    }elsif( /^(\w+):\s+(.*)/ ){
      push @{$record{lc $1}}, ''.$2; # handles website: ...

    }elsif( /\d+ .*? \b(?:st(?:reet)?|blvd|boul[ae]vard|ave(?:nue)?)\b/axi ){ # looks like a street address
      push @{$record{address}}, $_;

    }elsif( /,\s+\w\w\s+\d{5}/a ){ # looks like "city, state, zipcode"
      push @{$record{address}}, $_;

    }else{
      push @{ $record{comment} }, $_; # unknown line
    }
  }

  push @data, \%record;
}

use JSON;
# UTF8 is handled by "use open ..." above
# so we turn it off here
my $encode = JSON->new->utf8(0)->pretty;
print $encode->encode(\@data);

print "\n";

Output for the second (messier) sample:

[
   {
      "website" : [
         "www.mariateresasbabies.com"
      ],
      "comment" : [
         "Maria Teresa Desaba, Owner/Director; Tony Saba, Org. Director.",
         "Serving children 6 wks to 5yrs full-time."
      ],
      "number" : "22",
      "name" : "Maria Teresa’s Babies Early Enrichment Center/Daycare ",
      "address" : [
         "825 23rd Street South",
         "Arlington, VA 22202"
      ],
      "phone" : [
         "703-979-BABY (2229)"
      ]
   },
   {
      "website" : [
         "www.brighthorizons.com 112 children, ages 6 wks - 5 yrs."
      ],
      "comment" : [
         "7:00 a.m. – 6:00 p.m. Summer Camp for children 5 - 9 years."
      ],
      "number" : "23",
      "name" : "National Science Foundation Child  Development Center  ",
      "address" : [
         "4201 Wilson Blvd., Suite 180  22203"
      ],
      "phone" : [
         "703-292-4794"
      ]
   }
]

Open City's Dedupe Python library works along similar lines, though I haven't tried it.

Dedupe takes structured, but messy data. You'll need some other tool to parse an unstructured text file into different records and fields within records. – fgregg May 9 at 13:23

I think the best thing would be to use IntelliGet for this. It's a smart tool and has good support.

Your answer is very short. Could you please add a link to the tool and explain how to use it to solve the problem stated in the question? – Patrick Hoefler Aug 17 at 21:02
