Take the 2-minute tour ×
Code Review Stack Exchange is a question and answer site for peer programmer code reviews. It's 100% free, no registration required.

I'm trying to parse words from a badly garbled text file that contains many repeats. It's about 100k characters in length and was formed from joining many substrings in alphabetical order.

I'm curious about other methods for finding words without using whitespace.

def unique_words(string):
    words = dict()
    p1 = 0 # String slice position 1
    p2 = 1 # String slice position 2
    len_string = len(string)
    while p2 < len_string:
        p2 += 1
        sub1 = string[p1:p2] # A shorter sub
        sub2 = string[p1:(p2 + 1)] # A longer sub
        sub1_count = string.count(sub1) # Counts the frequency of the shorter sub
        sub2_count = string.count(sub2) # Counts the frequency of the longer sub
        if sub2_count * len(sub2) < sub1_count * len(sub1): # True if the frequency of sub1 * its length is greater
            words[sub1] = ('') # Add 
            p1 = p2
    return words

The above code works when the number of unique words is small but fails when it is large. I've used TextMechanic to generate a random string like

'updownleftupdowndownleftupleftrightupdownleftup'

and the above code returns a dictionary exactly as desired:

{'up': '', 'down': '', 'left': '', 'right': ''}

Here's the problem:

When the number of unique words increases, there is a point where the occurrence of single letters out numbers the total character count of any word in a string.

My current solution uses the algorithm on short slices of the original string, but this involves trial-and-error and has artifacts.

share|improve this question
    
how do you know they are words ? –  joaquin Sep 9 at 22:47
    
It was originally a log file with names and locations. If you look at the file you can clearly see the words are intact, they're just out of order, repeated many times, and have no white space. The original files are gone. I want to know for each of my files, what the original information was. –  mattkaeo Sep 9 at 22:54
1  
'you can clearly see the words are intact'. This is my question: how a computer could know that than you can see so clearly ?. Maybe you can use some initial seed to start breaking the text, using tools from natural language or simply a dictionary prepared with the more frequent words on a similar log file. Once you take worrds from the middle the others will shine. –  joaquin Sep 9 at 22:59

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Browse other questions tagged or ask your own question.