Segmentation of strings using dynamic programming

Question

Suppose you have a function quality(x) that returns the quality of a sequence of letters x. Given a string such as "howareyoutoday," what is the most efficient way to determine that the segmentation is "how are you today" (i.e. quality(how)+quality(are)+quality(you)+quality(today) is the maximum quality possible)?

I was thinking that we could have something like the following:

A[0] = h, A[1] = o, ..., A[n] = y 
Q[0] = quality(A[0]), Q[1] = quality(A[0]A[1]), ..., Q[n] = quality(A[0]...A[n])

Now to determine the segmentation, we find max{Q[0], .., Q[n]} which will return some Q[i] (the first space is after this). Then, we find max{Q[i+1], .. Q[n]} which returns another Q[i] (second space is after this), etc. until max returns Q[n].

I have two questions: does this method use dynamic programming? It seems to me that it does, since we build the initial Q with subproblems to the original problem. Also, is this an optimal solution? To my understanding, the worst case would be O(n^2), which would be when max returns Q[0], then Q[1], then Q[2], etc.

This question might be more suitable for Theoretical CS Stack Exchange and CS Stack Exchange — isim, Oct 4 '14 at 3:49
There are a few problems here I think. First of all, after finding the first word (let's say it was "how"), you say you want to find max{Q[i+1], .., Q[n]} (I assume you meant i+1, not i+i) -- but this means quality() will be called on the entire string so far, i.e. quality("howa"), quality("howar"), quality("howare"), etc. — j_random_hacker, Oct 4 '14 at 3:57
Second, this appears to be a greedy approach that will not necessarily give you the optimal solution, since you decide where to make the first break without considering how this could affect downstream decisions. E.g. for the string "theredcar" you would want the first word to be "the", but for the string "thereitgoes" it should be "there". If you give "there" a lower quality score than "the", then your algo will always be wrong on the second input, while if you give it a higher quality score it will always be wrong on the first -- you must look at the rest of the string to decide. — j_random_hacker, Oct 4 '14 at 4:03
@j_random_hacker what kind of modification do you propose so that I can handle the case you considered? — Bob Jonas, Oct 4 '14 at 5:28

Zim-Zam O'Pootertoot · Answer 1 · 2014-10-04 03:59:40Z

You can use dynamic programming if you can make use of cached results of sub-problems. For example, if quality(A[0]A[1]) is a function of quality(A[0]) and quality(A[1]) (e.g. quality(A[0]A[1]) = quality(A[0]) + quality(A[1])) then you can benefit by caching the results of the sub-problems; if quality(A[0]A[1]) is independent of quality(A[0]) and quality(A[1]) then you won't benefit by caching the results of the sub-problems and you can't use dynamic programming.

The worst case of your algorithm is O(n^2). You might be able to make use of a potentially sub-optimal O(n) solution: test quality(A[0]), then quality(A[0]A[1]), then quality(A[0]A[1]A[2]), and so on; halt when the quality of the current solution is worse than the quality of the previous solution, so if quality(A[0]A[1]) > quality(A[0]A[1]A[2]) then partition at A[0]A[1] and begin the search for the next word at A[2]. You can improve the quality of the algorithm at the cost of a larger constant factor by increasing the lookahead; the algorithm just described has a lookahead(1), whereas a lookahead(2) algorithm would partition at A[0]A[1] if quality(A[0]A[1]) > quality(A[0]A[1]A[2]) and quality(A[0]A[1]) > quality(A[0]A[1]A[2]A[3]).

j_random_hacker · Answer 2 · 2014-10-06 00:52:35Z

One way to formulate this as a DP problem is to compute a function f(i):

f(i) = The highest sum of quality scores achievable on the first i characters.

What we notice is that the last segment must end at position i-1, and it could begin anywhere before it, right back to position 0 (assuming we number the positions starting at 0). Let's call the starting position of this segment j; we'll need to try all possible values of j, and find the one that's best. What should we combine this final segment A[j]..A[i-1] with? With the optimal solution to whatever remains to its left, which is conveniently given by f(j). So we get:

f(i) = max { f(j) + quality(A[j]..A[i-1]) } over all 0 <= j <= i.

with f(0) = 0 as a boundary case.

f(n+1) will calculate the optimal score for the entire string in quadratic time and using linear memory requirements. To actually calculate where the word breaks should appear to get this score, you need to walk backwards from f(n+1), determining which j value caused it to achieve its maximum (there may be several; in that case, any will work). Keep repeating this until you get j = 0. (Alternatively you can store these "maximising" values of j in a separate array as you compute f(). This array is often called "pred[]", for "predecessor".)

[EDIT]

The above is a recursion to solve the problem. It will solve it correctly, but very slowly for large n. All DP algorithms can be stated this way, but to actually turn it from a plain recursion into a DP algorithm you need to either memoize it, or figure out how to make a bottom-up table-filling iterative algorithm from it. Memoization of a recursion is trivial: all you need to do is, whenever the function computes the answer to a fresh value of i, store this answer (here, an array of size n+1 will do), and use it next time that particular value of f(i) is asked for. Bottom-up DP involves figuring out an order in which the table can be populated so that all dependencies are met. This can be (a constant factor) faster, but is sometimes harder to do.

Very concise. Thank you for the help! If it's not too much trouble, can you perhaps offer an example of the algorithm at work? — Bob Jonas, Oct 5 '14 at 1:05
@BobJonas: Sorry but I think it would take a lot of time and space to give a useful example, and it wouldn't show you anything more than if you implemented the recursion (just a few lines) and stepped through it in a debugger. The best thing, though, would be to convince yourself that (a) every valid way of breaking up the prefix consisting of the first i characters has a rightmost segment ending at position i-1, and (b) that at whatever position j this rightmost segment starts, the optimal overall score for a solution with this segment A[j]..A[i-1] is given by f(j) + quality(A[j]..A[i-1]). — j_random_hacker, Oct 5 '14 at 1:25
OK, that makes sense. Just one more thing: why do we compute the minimum between the two values as opposed to the maximum? — Bob Jonas, Oct 5 '14 at 23:04

asked	1 year ago
viewed	401 times
active	1 year ago

current community

your communities

more stack exchange communities

Segmentation of strings using dynamic programming

2 Answers 2

Your Answer

Not the answer you're looking for? Browse other questions tagged string algorithm or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Segmentation of strings using dynamic programming

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged string algorithm or ask your own question.

Related

Hot Network Questions