I'm facing an interesting issue at the moment:

My Situation:

I have String arrays in Java like the following (the real ones are more complicated, of course). Each array represents one sentence (I can't change the representation):

String[] tokens = {"This", "is", "just", "an", "example", "."};

My Problem:

I want to rebuild the original sentences from these String arrays. This doesn't sound hard at first, but it becomes really complex because sentence structure has many cases. Sometimes you need whitespace and sometimes you don't.

My Approach

Please don't hate me for this terrible code.

I've implemented a method that handles most of the cases, meaning it rebuilds a sentence from the original String array. As you can see, it's already very complicated, but it works "okay" for now. I don't know how to improve it.

public static String detokenize(String[] tokens) {
    StringBuilder sentence = new StringBuilder();
    boolean sentenceInQuotation = false; 
    boolean firstWordInQuotationSentence = false;
    boolean firstWordInParenthisis = false;
    boolean date = false;

    for (int i = 0; i < tokens.length; i++) {

        if (tokens[i].equals(".") || tokens[i].equals(";") || tokens[i].equals(",") || tokens[i].equals("?") || tokens[i].equals("!")) {
            sentence.append(tokens[i]);
        }
        else if(tokens[i].equals(":")){
            Pattern p = Pattern.compile("\\d");
            Matcher m = p.matcher(tokens[i-1]);
            if(m.find() == true){
                date = true;
            }
            sentence.append(tokens[i]);
        }
        else if(tokens[i].equals("(")){
            sentence.append(" ");
            sentence.append(tokens[i]);
            firstWordInParenthisis = true;
        }
        else if (tokens[i].equals(")")) {
            sentence.append(tokens[i]);
            firstWordInParenthisis = false;
        } 
        else if(tokens[i].equals("\"")){
            if(sentenceInQuotation == false){
                sentence.append(" ");
                sentence.append(tokens[i]);
                sentenceInQuotation = true;
                firstWordInQuotationSentence = true;
            }
            else if(sentenceInQuotation == true){
                sentence.append(tokens[i]);
                sentenceInQuotation = false;
            }
        }
        else if (tokens[i].equals("&") || tokens[i].equals("+") || tokens[i].equals("=")) {
            sentence.append(" ");
            sentence.append(tokens[i]);
        } 
        //words
        else {
            if(sentenceInQuotation == true){
                if(firstWordInQuotationSentence == true){
                    sentence.append(tokens[i]);
                    firstWordInQuotationSentence = false;
                }
                else if(firstWordInQuotationSentence == false){
                    if(firstWordInParenthisis == true){
                        sentence.append(tokens[i]);
                        firstWordInParenthisis = false;
                    }
                    else if(firstWordInParenthisis == false){
                        sentence.append(" ");
                        sentence.append(tokens[i]);
                    }
                }
            }
            else if(firstWordInParenthisis == true){
                sentence.append(tokens[i]);
                firstWordInParenthisis = false;
            }
            else if(date == true){
                sentence.append(tokens[i]);
                date = false;
            }
            else if(sentenceInQuotation == false){
                sentence.append(" ");
                sentence.append(tokens[i]);
            }
        }

    }

    return sentence.toString().replaceFirst(" ", "");
}

As I said, this works quite well, but not perfectly. I suggest you copy and paste my method and see for yourself.

My Question

Do you have ANY ideas for my problem or a better solution? Maybe you had a similar problem once and still have your solution (I can't really believe I'm the first one with this problem). Or maybe you're such a genius (I hope so :-) ) that the best solution comes to mind right away.

I'm grateful for any help and thoughts you share with me.

Thanks a lot

Update: Examples:

Well, for example, as I just tried out some texts I noticed that I don't yet check for tokens like "[" and "]", or for the different types of quotation marks, e.g. " or “. I also heard that it can make a difference whether I use ... (three periods) or the single … Unicode character (select it and you'll see).

So it becomes more and more complex. If you need further information, just let me know. Thanks a lot.

And that's it, though there's probably something I'm forgetting.

Can you provide more sample data to showcase the errors you are encountering? –  eabraham Apr 23 '12 at 13:14
 
updated my post, thanks for your comment –  user1293755 Apr 23 '12 at 13:37
 
You could do it more declaratively, that would make it less complex. Create a Set of tokens that have no WS in front and another Set of those with no WS after. Then you just check those sets for each token. –  Marko Topolnik Apr 23 '12 at 13:43

migrated from stackoverflow.com Apr 23 '12 at 14:09

This question came from our site for professional and enthusiast programmers.

4 Answers

Accepted answer (score 3)

OpenNLP will provide a more robust solution, but the following approximation may be good enough.

The general rule is to join the 'words' with a space in between. There are three exceptions:

  1. Special punctuation characters that should not have a space before them, e.g. . ; :
  2. Special punctuation characters that should not have a space after them, e.g. ( [
  3. Quoted sentences, in which case the " starts out in case 2 and then switches between case 1 and case 2 after each occurrence

The code underneath has been tested with this sentence:

A test, (string). Hello this is a 2nd sentence. Here is a quote: "This is the quote." Sentence 4.

import java.util.Arrays;
import java.util.List;
import java.util.LinkedList;

public class Detokenizer {
    public String detokenize(List<String> tokens) {

        // Tokens that should NOT have a space before them, and tokens that should NOT have a space after them
        List<String> noSpaceBefore = new LinkedList<String>(Arrays.asList(",", ".", ";", ":", ")", "}", "]"));
        List<String> noSpaceAfter = new LinkedList<String>(Arrays.asList("(", "[", "{", "\"", ""));

        StringBuilder sentence = new StringBuilder();

        // Add an empty token at the beginning because the loop checks position i - 1 and "" is in noSpaceAfter.
        // Note: this modifies the caller's list, so the list must support add().
        tokens.add(0, "");
        for (int i = 1; i < tokens.size(); i++) {
            if (noSpaceBefore.contains(tokens.get(i))
                    || noSpaceAfter.contains(tokens.get(i - 1))) {
                sentence.append(tokens.get(i));
            } else {
                sentence.append(" " + tokens.get(i));
            }

            // Assumption: opening double quotes are always followed by matching closing double quotes.
            // This block moves the " to the other list after each occurrence,
            // i.e. the first double quote should have no space after it, then the second should have no space before it.
            if ("\"".equals(tokens.get(i - 1))) {
                if (noSpaceAfter.contains("\"")) {
                    noSpaceAfter.remove("\"");
                    noSpaceBefore.add("\"");
                } else {
                    noSpaceAfter.add("\"");
                    noSpaceBefore.remove("\"");
                }
            }
        }
        return sentence.toString();
    }

}

And the test case...

import static org.junit.Assert.*;
import java.util.Arrays;
import java.util.List;
import org.junit.Test;
import java.util.LinkedList;

public class DetokenizerTest {
    @Test
    public void test() {
        List<String> tokens = new LinkedList<String>(Arrays.asList("A", "test", ",", "(", "string", ")", ".", "Hello", "this", "is", "a", "2nd", "sentence", ".", "Here", "is", "a", "quote", ":", "\"", "This", "is", "the", "quote", ".", "\"", "Sentence", "4", "."));
        String expected = "A test, (string). Hello this is a 2nd sentence. Here is a quote: \"This is the quote.\" Sentence 4.";
        String actual = new Detokenizer().detokenize(tokens);
        assertEquals(expected, actual);
        System.out.println(actual);
    }

}
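
Since the question works with String[] rather than List<String>, the same rules could also be applied directly to an array without modifying the input. The following is only a sketch of that idea, not part of the tested code above: the class name ArrayDetokenizer is made up for illustration, the sets are an assumption about which punctuation matters (they add "?" and "!" from the question's own code), and it keeps the assumption that double quotes come in matching pairs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical array-based variant; names and token sets are illustrative assumptions.
class ArrayDetokenizer {
    // Tokens that attach to the preceding token (no space before them).
    private static final Set<String> NO_SPACE_BEFORE =
            new HashSet<>(Arrays.asList(",", ".", ";", ":", "?", "!", ")", "]", "}"));
    // Tokens that attach to the following token (no space after them).
    private static final Set<String> NO_SPACE_AFTER =
            new HashSet<>(Arrays.asList("(", "[", "{"));

    static String detokenize(String[] tokens) {
        StringBuilder sentence = new StringBuilder();
        boolean insideQuote = false;   // true between an opening and a closing "
        boolean suppressSpace = true;  // no space before the very first token

        for (String token : tokens) {
            boolean isQuote = "\"".equals(token);
            boolean closingQuote = isQuote && insideQuote;

            if (!suppressSpace && !closingQuote && !NO_SPACE_BEFORE.contains(token)) {
                sentence.append(' ');
            }
            sentence.append(token);

            if (isQuote) {
                insideQuote = !insideQuote;
            }
            // An opening bracket or an opening quote glues to the next token.
            suppressSpace = NO_SPACE_AFTER.contains(token) || (isQuote && insideQuote);
        }
        return sentence.toString();
    }
}
```

Because the input array is only read, the caller can pass the original tokens without copying them into a modifiable list first.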
Thanks for this method! I've extended it with a few more punctuation marks and it works perfectly so far. A plus is that extending it is really easy, which will hopefully make later maintenance of my tool much easier. –  user1293755 Apr 24 '12 at 12:45

I'm not sure if you are allowed to leverage open source projects, but perhaps you could use something like OpenNLP. They have an interface opennlp.tools.tokenize.Detokenizer. An implementation of this looks like it would do what you need (see this forum post). It looks like there is at least one implementation: opennlp.tools.tokenize.DictionaryDetokenizer.

You could simply create a sentence by adding a space in-between each token, and then pass this to the detokenizer.

 
Looks great! Have a look at the testDetokenizeToString() method from this test class, it seems to me that it does exactly what you need. –  sp00m Apr 23 '12 at 14:23
 
That method would also be possible, but as far as I understand it so far, you only get commands like "MERGE_TO_LEFT" or "NO_OPERATION" to your tokens. So you would need to check every token for its position-command and from this decide if you need a whitespace or not. Maybe I'll try this later as well. Thanks –  user1293755 Apr 24 '12 at 12:42
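
To illustrate the point about per-token commands, here is a rough sketch of how a list of such commands could be turned into whitespace decisions. It does not use OpenNLP itself: the Op enum is a local stand-in that merely borrows the command names mentioned above, and the rule table and the class name OperationJoiner are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not OpenNLP code.
class OperationJoiner {
    // Local stand-in for the per-token commands; not OpenNLP's own type.
    enum Op { NO_OPERATION, MERGE_TO_LEFT, MERGE_TO_RIGHT }

    // Assumed rule table; a real detokenizer would load its rules from a dictionary.
    private static final Map<String, Op> RULES = new HashMap<>();
    static {
        for (String t : new String[] {".", ",", ";", ":", "?", "!", ")", "]"}) {
            RULES.put(t, Op.MERGE_TO_LEFT);
        }
        for (String t : new String[] {"(", "["}) {
            RULES.put(t, Op.MERGE_TO_RIGHT);
        }
    }

    // Assign one command to every token.
    static Op[] classify(String[] tokens) {
        Op[] ops = new Op[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            ops[i] = RULES.getOrDefault(tokens[i], Op.NO_OPERATION);
        }
        return ops;
    }

    // Insert a space before a token unless it merges left or its predecessor merges right.
    static String join(String[] tokens, Op[] ops) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            boolean mergeLeft = ops[i] == Op.MERGE_TO_LEFT;
            boolean prevMergesRight = i > 0 && ops[i - 1] == Op.MERGE_TO_RIGHT;
            if (i > 0 && !mergeLeft && !prevMergesRight) {
                sb.append(' ');
            }
            sb.append(tokens[i]);
        }
        return sb.toString();
    }
}
```

The whitespace decision then reduces to a single check per token, so the extra work of inspecting each command is small.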

I suspect a simpler approach would be to just join the tokens together to form a sentence as a string, and then use a list of regular expression patterns to substitute each case. This will reduce the complexity, and will be much more readable than your example. (I will flesh it out later.) Pseudocode:

mystring = join(words," ")
for ((e,s) : regexps_andreplacements)
     mystring = mystring.replaceAll(e,s)
return mystring
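
In Java, the pseudocode above might look roughly like the sketch below. The three rules are illustrative assumptions only, not a complete rule set, and the class name RegexDetokenizer is made up. A LinkedHashMap is used so the rules are applied in insertion order, which matters because the quote rule runs on the output of the earlier ones.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the join-then-substitute idea.
class RegexDetokenizer {
    // Ordered pattern -> replacement pairs; these particular rules are examples only.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(" ([.,;:?!)\\]])", "$1");      // drop the space before closing punctuation
        RULES.put("([(\\[]) ", "$1");             // drop the space after an opening bracket
        RULES.put("\" ([^\"]*?) \"", "\"$1\"");  // trim the spaces just inside a pair of quotes
    }

    static String detokenize(String[] tokens) {
        // Join everything with single spaces, then remove the spaces that should not be there.
        String sentence = String.join(" ", tokens);
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            sentence = sentence.replaceAll(rule.getKey(), rule.getValue());
        }
        return sentence;
    }
}
```

One caveat of this approach is that the rules see only the joined string, so a token that itself contains a space or a quote character could be rewritten unintentionally.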

I would suggest using the strategy pattern rather than code for specific tokens. Because some of the operations are state dependent, you would probably create some kind of DetokenizerResult object.

For example, your default strategy would append a space, then the token, except at the beginning.

interface Detokenizer {
    void apply(String token, DetokenizerResult result);
}

class DefaultDetokenizer implements Detokenizer {
    @Override
    public void apply(String token, DetokenizerResult result) {
        if (result.atBeginning()) result.append(token);
        else result.append(" ").append(token);
    }
}

Put all of the tokens that have special strategies into a HashMap<String, Detokenizer> and then you can reduce the loop to something like this:

DetokenizerResult result = new DetokenizerResult();
for (String token : tokens) {
    if (map.containsKey(token)) map.get(token).apply(token, result);
    else defaultStrategy.apply(token, result);
}
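
Putting the pieces together, one possible shape for the whole thing is sketched below. The answer leaves DetokenizerResult undefined, so its methods here (atBeginning, append, and the glueNext flag) are assumptions made for illustration, as are the strategy names ATTACH_LEFT and ATTACH_RIGHT, the token lists, and the wrapper class StrategyDemo.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical end-to-end sketch of the strategy approach.
class StrategyDemo {
    // Assumed result object: the text built so far plus one piece of state.
    static class DetokenizerResult {
        private final StringBuilder text = new StringBuilder();
        private boolean glueNext = false; // true when the next token must attach directly

        boolean atBeginning() { return text.length() == 0; }
        boolean glueNext() { return glueNext; }
        void setGlueNext(boolean value) { glueNext = value; }
        DetokenizerResult append(String s) { text.append(s); return this; }
        @Override public String toString() { return text.toString(); }
    }

    interface Detokenizer {
        void apply(String token, DetokenizerResult result);
    }

    // Default: a space, then the token, unless at the beginning or glued to the previous token.
    static final Detokenizer DEFAULT = (token, result) -> {
        if (!result.atBeginning() && !result.glueNext()) result.append(" ");
        result.append(token);
        result.setGlueNext(false);
    };
    // Closing punctuation: attach to whatever came before.
    static final Detokenizer ATTACH_LEFT = (token, result) -> {
        result.append(token);
        result.setGlueNext(false);
    };
    // Opening bracket: placed like a word, then the next token attaches to it.
    static final Detokenizer ATTACH_RIGHT = (token, result) -> {
        DEFAULT.apply(token, result);
        result.setGlueNext(true);
    };

    static String detokenize(String[] tokens) {
        Map<String, Detokenizer> strategies = new HashMap<>();
        for (String t : new String[] {".", ",", ";", ":", "?", "!", ")", "]"}) strategies.put(t, ATTACH_LEFT);
        for (String t : new String[] {"(", "["}) strategies.put(t, ATTACH_RIGHT);

        DetokenizerResult result = new DetokenizerResult();
        for (String token : tokens) {
            strategies.getOrDefault(token, DEFAULT).apply(token, result);
        }
        return result.toString();
    }
}
```

State that a flag-based loop would keep in local booleans lives in the result object here, so a quote-handling strategy could be added later by storing an "inside quote" flag in the same place.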
