How to split a string and match its substrings to a list of substrings? - Python

Question

I need to split a string up into all possible ways without changing the order of the characters. I understand this task can be seen as tokenization or lemmatization in NLP but i am attempting it from a pure string search perspective which is simpler and more robust. Given,

dictionary = ['train','station', 'fire', 'a','trainer','in']
str1 = "firetrainstation"

Task 1: How do i generate all possible substrings such that i get:

all_possible_substrings = [['f','iretrainstation'],
['fo','retrainstation'], ...
['firetrainstatio','n'],
['f','i','retrainstation'], ... , ...
['fire','train','station'], ... , ...
['fire','tr','a','instation'], ... , ...
['fire','tr','a','in','station'], ... , ...
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']

Task 2: Then from all_possible_substring, how could i check through to see and say that the substring set that contains all elements from the dictionary is the correct output. The desired output would be a list of substrings from the dictionary that matches the most number of characters from left to right. Desired output is where:

"".join(desire_substring_list) == str1 and \
[i for i desire_substring_list if in dictionary] == len(desire_substring_list)
#(let's assume, the above condition can be true for any input string since my english
#language dictionary is very big and all my strings are human language 
#just written without spaces)

Desired output:

'fire','train','station'

What have I done?

For task 1, i have done this but i know it wouldn't give me all possible whitespace insertions:

all_possible_substrings.append(" ".join(str1))

I've done this but this only does task 2:

import re
seed = ['train','station', 'fire', 'a','trainer','in']
str1 = "firetrainstation"
all_possible_string = [['f','iretrainstation'],
['fo','retrainstation'],
['firetrainstatio','n'],
['f','i','retrainstation'], 
['fire','train','station'], 
['fire','tr','a','instation'], 
['fire','tr','a','in','station'], 
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']]
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
highest_match = ""
for i in all_possible_string:
  x = pattern.findall(" ".join(i))
  if "".join(x) == str1 and len([i for i in x if i in seed]) == len(x):
    print " ".join(x)

Also, I'm pretty sure you need to do a little more explanation. Why is `'foo','bar','bar','str' the desired output? — mgilson, Mar 19 '13 at 1:25
How do you get str1 from dictionary? And I might be misunderstanding, but wouldn't the "list of substrings from the dictionary that matches the most number of characters from left to right" always be str1 minus the last letter? (Assuming you don't want the whole string.) — toxotes, Mar 19 '13 at 1:55

jurgenreza · Accepted Answer · 2013-03-19 04:17:31Z

For the first part you could write a recursive generator similar to this:

>>> def all_substr(string):
    for i in range(len(string)):

        if i == len(string) - 1:
            yield string

        first_part = string[0:i+1]
        second_part = string[i+1:]

        for j in all_substr(second_part):
            yield ','.join([first_part, j])


>>> for x in all_substr('apple'):
    print(x)


a,p,p,l,e
a,p,p,le
a,p,pl,e
a,p,ple
a,pp,l,e
a,pp,le
a,ppl,e
a,pple
ap,p,l,e
ap,p,le
ap,pl,e
ap,ple
app,l,e
app,le
appl,e
apple

asked	3 years ago
viewed	841 times
active	3 years ago

current community

your communities

more stack exchange communities

How to split a string and match its substrings to a list of substrings? - Python

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged python string dictionary substring string-matching or ask your own question.

Linked

Hot Network Questions

current community

your communities

more stack exchange communities

How to split a string and match its substrings to a list of substrings? - Python

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python string dictionary substring string-matching or ask your own question.

Linked

Related

Hot Network Questions