Take the 2-minute tour ×

Code Review Stack Exchange is a question and answer site for peer programmer code reviews. It's 100% free, no registration required.

A function for parsing words from a string without using whitespace

up vote 1 down vote favorite

I'm trying to parse words from a badly garbled text file that contains many repeats. It's about 100k characters in length and was formed from joining many substrings in alphabetical order.

I'm curious about other methods for finding words without using whitespace.

def unique_words(string):
    words = dict()
    p1 = 0 # String slice position 1
    p2 = 1 # String slice position 2
    len_string = len(string)
    while p2 < len_string:
        p2 += 1
        sub1 = string[p1:p2] # A shorter sub
        sub2 = string[p1:(p2 + 1)] # A longer sub
        sub1_count = string.count(sub1) # Counts the frequency of the shorter sub
        sub2_count = string.count(sub2) # Counts the frequency of the longer sub
        if sub2_count * len(sub2) < sub1_count * len(sub1): # True if the frequency of sub1 * its length is greater
            words[sub1] = ('') # Add 
            p1 = p2
    return words

The above code works when the number of unique words is small but fails when it is large. I've used TextMechanic to generate a random string like

'updownleftupdowndownleftupleftrightupdownleftup'

and the above code returns a dictionary exactly as desired:

{'up': '', 'down': '', 'left': '', 'right': ''}

Here's the problem:

When the number of unique words increases, there is a point where the occurrence of single letters out numbers the total character count of any word in a string.

My current solution uses the algorithm on short slices of the original string, but this involves trial-and-error and has artifacts.

edited Sep 9 at 22:47

Jamal♦
18.7k565145

asked Sep 9 at 22:39

mattkaeo
1666

how do you know they are words ? – joaquin Sep 9 at 22:47

It was originally a log file with names and locations. If you look at the file you can clearly see the words are intact, they're just out of order, repeated many times, and have no white space. The original files are gone. I want to know for each of my files, what the original information was. – mattkaeo Sep 9 at 22:54

'you can clearly see the words are intact'. This is my question: how a computer could know that than you can see so clearly ?. Maybe you can use some initial seed to start breaking the text, using tools from natural language or simply a dictionary prepared with the more frequent words on a similar log file. Once you take worrds from the middle the others will shine. – joaquin Sep 9 at 22:59

add a comment |

Your Answer

Sign up or log in

Post as a guest

Name

Email required, but not shown

Post as a guest

Name

Email required, but not shown

discard

By posting your answer, you agree to the privacy policy and terms of service.

Browse other questions tagged python strings parsing or ask your own question.

question feed

asked	1 month ago
viewed	49 times

current community

your communities

more stack exchange communities

A function for parsing words from a string without using whitespace

Your Answer

Browse other questions tagged python strings parsing or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

A function for parsing words from a string without using whitespace

Know someone who can answer? Share a link to this question via email, Google+, Twitter, or Facebook.

Your Answer

Sign up or log in

Post as a guest

Browse other questions tagged python strings parsing or ask your own question.

Related

Hot Network Questions