First, going over your code
import itertools, re
re is not used
contentWords = content.split()
I would be more explicit on the split, calling split(' ') so your intentions are crystal clear
index = contentWords.index(term)
Each call to this is an O(n) operation on the list, m times where m is the number of times the element occurs, x times where x is the number of keywords. So in big O (and worst case) we have O(nmx). We can be smarter about this, and do it in O(n) and in one iteration. We will use extra memory to store the keywords in a dict
So replacing this:
termPosition = {}
for term in keywords:
index = 0
positions = []
while index is not None:
try:
index = contentWords.index(term)
positions.append(index)
contentWords[index] = None
except:
index = None
termPosition[term] = positions
With:
# Create an empty list for each keyword
# We could achieve the same result with a defaultdict
termPosition = {keyword: [] for keyword in keywords}
# enumerate gives us the indexes of the word
for index, word in enumerate(contentWords):
# if the word is a keyword, record the index
if word in termPosition:
termPosition[word].append(index)
for term in termPosition.keys():
if len(termPosition.get(term)) == 1:
singlePositions.append(termPosition.get(term)[0])
else:
multiplePositions.append(termPosition.get(term))
You don't need to work out the values from the keys. Also .append(x[0]) for a list of size one is the same as extending by that list. (You can concat lists with either += or .extend)
for positions in termPosition.values():
if len(positions) == 1:
singlePositions += positions
else:
multiplePositions.append(positions) # Keeping it as a list of lists
contentWords = content.split(' ')
No longer needed since we don't modify contentWords
for element in itertools.product(*multiplePositions):
tempList = list(element) + singlePositions
minPos = min(tempList)
maxPos = max(tempList) + 1
if len(' '.join(contentWords[minPos:maxPos])) < len(snippet):
snippet = ' '.join(contentWords[minPos:maxPos])
This is going to be the main slowdown of your code as product means factorial growth. I can't think of something that does this better right now, but I'll come back if I do find one. The issue at hand is that it isn't the shortest range of indexes which contain all the keywords, it is the shortest snippet. The words might vary in length, giving each index a weight. In addition, spaces mean it isn't the smallest sum of weights, it is the min sum of weights plus number of additions used.
You could create a list of the length of each word in contentWords, and find the min sum(contentWordsLens[minPos:maxPos] + (maxPos - minPos)). The should be more performant as you are adding numbers instead of strings
As for the code, there are a couple of tricks to squeeze down the time
# The min and max from singlePositions never change
# We convert the single value to a tuple to avoid doing repeatedly in the loop
minSinglePositions = tuple(min(singlePositions))
maxSinglePositions = tuple(max(singlePositions))
minSnippetLen = len(snippet)
for element in itertools.product(*multiplePositions):
minPos = min(element + minSinglePositions)
maxPos = max(element + maxSinglePositions) + 1
candidate = ' '.join(contentWords[minPos:maxPos])
if len(candidate) < minSnippetLen:
snippet = candidate
minSnippetLen = len(snippet)
The main improvements come from moving any work that is constant to outside the loop. We also avoid repeating work like working out the length of the snippet. For this kind of tricking about, profiling should be done to test the function.
A final note, there are two parts to the function, getting the indexes, and checking which combination is the shortest. As such, we can move them to two functions.
A cleaner version of the submitted code:
import itertools
def get_positions(content, keywords):
termPosition = {keyword: [] for keyword in keywords}
# enumerate gives us the indexes of the word
for index, word in enumerate(content):
# if the word is a keyword, record the index
if word in termPosition:
termPosition[word].append(index)
singlePositions = []
multiplePositions = []
# Note, if any of the positions are of len 0, the keyword isn't in the snippet
# This should be raised. As it is currently, it returns the input snippet
for positions in termPosition.values():
if len(positions) == 1:
singlePositions += positions
else:
multiplePositions.append(positions) # Keeping it as a list of lists
return singlePositions, multiplePositions
def get_shortest_snippet(content, contentWords, singlePositions, multiplePositions):
# The min and max from singlePositions never change
# We convert the single value to a tuple to avoid doing repeatedly in the loop
if singlePositions:
minSinglePositions = tuple(min(singlePositions))
maxSinglePositions = tuple(max(singlePositions))
else:
minSinglePositions = tuple()
maxSinglePositions = tuple()
snippet = content
minSnippetLen = len(snippet)
for element in itertools.product(*multiplePositions):
minPos = min(element + minSinglePositions)
maxPos = max(element + maxSinglePositions) + 1
candidate = ' '.join(contentWords[minPos:maxPos])
if len(candidate) < minSnippetLen:
snippet = candidate
minSnippetLen = len(snippet)
return snippet
def solve(content, keywords):
# Input
content = content.strip()
contentWords = content.split(' ')
singlePositions, multiplePositions = get_positions(contentWords, keywords)
shortestSnippet = get_shortest_snippet(content, contentWords, singlePositions, multiplePositions)
return shortestSnippet