I need to split a string up into all possible ways without changing the order of the characters. I understand this task can be seen as tokenization or lemmatization in NLP but i am attempting it from a pure string search perspective which is simpler and more robust. Given,
dictionary = ['train','station', 'fire', 'a','trainer','in']
str1 = "firetrainstation"
Task 1: How do i generate all possible substrings such that i get:
all_possible_substrings = [['f','iretrainstation'],
['fo','retrainstation'], ...
['firetrainstatio','n'],
['f','i','retrainstation'], ... , ...
['fire','train','station'], ... , ...
['fire','tr','a','instation'], ... , ...
['fire','tr','a','in','station'], ... , ...
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']
Task 2: Then from all_possible_substring
, how could i check through to see and say that the substring set that contains all elements from the dictionary is the correct output. The desired output would be a list of substrings from the dictionary that matches the most number of characters from left to right. Desired output is where:
"".join(desire_substring_list) == str1 and \
[i for i desire_substring_list if in dictionary] == len(desire_substring_list)
#(let's assume, the above condition can be true for any input string since my english
#language dictionary is very big and all my strings are human language
#just written without spaces)
Desired output:
'fire','train','station'
What have I done?
For task 1, i have done this but i know it wouldn't give me all possible whitespace insertions:
all_possible_substrings.append(" ".join(str1))
I've done this but this only does task 2:
import re
seed = ['train','station', 'fire', 'a','trainer','in']
str1 = "firetrainstation"
all_possible_string = [['f','iretrainstation'],
['fo','retrainstation'],
['firetrainstatio','n'],
['f','i','retrainstation'],
['fire','train','station'],
['fire','tr','a','instation'],
['fire','tr','a','in','station'],
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']]
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
highest_match = ""
for i in all_possible_string:
x = pattern.findall(" ".join(i))
if "".join(x) == str1 and len([i for i in x if i in seed]) == len(x):
print " ".join(x)
list
. – mgilson Mar 19 '13 at 1:24str1
fromdictionary
? And I might be misunderstanding, but wouldn't the "list of substrings from the dictionary that matches the most number of characters from left to right" always bestr1
minus the last letter? (Assuming you don't want the whole string.) – toxotes Mar 19 '13 at 1:55