I'm facing an interesting issue at the moment:
My Situation:
I have (in Java) String arrays like the following (more complicated in practice, of course). Each String array represents one sentence (I can't change the representation):
String[] tokens = {"This", "is", "just", "an", "example", "."};
My Problem:
I want to rebuild the original sentences from these String arrays. This doesn't sound that hard at first, but it becomes really complex because sentence structure has many cases. Sometimes you need whitespace between two tokens and sometimes you don't.
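To illustrate why a plain join is not enough, here is a minimal sketch. The class name NaiveJoinDemo and the helper naiveJoin are hypothetical names used only for this illustration, not part of my real code:

```java
public class NaiveJoinDemo {
    // Hypothetical helper: puts exactly one space between all tokens.
    static String naiveJoin(String[] tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        String[] simple = {"This", "is", "just", "an", "example", "."};
        // The period ends up detached from the last word.
        System.out.println(naiveJoin(simple)); // prints "This is just an example ."

        String[] withParens = {"He", "said", "(", "quietly", ")", "."};
        // Parentheses get unwanted spaces on the inside as well.
        System.out.println(naiveJoin(withParens)); // prints "He said ( quietly ) ."
    }
}
```

So every token needs a decision about whether a space belongs before it, and that decision depends on the neighbouring tokens.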
My Approach:
Please don't hate me for this terrible code.
I've implemented a method that should handle most cases, i.e. rebuild a sentence from the original String array. As you can see, it's already very complicated, but it works "okay" for now. I don't know how to improve it at the moment.
import java.util.regex.Pattern;

public static String detokenize(String[] tokens) {
    StringBuilder sentence = new StringBuilder();
    boolean sentenceInQuotation = false;
    boolean firstWordInQuotationSentence = false;
    boolean firstWordInParenthisis = false;
    boolean date = false;
    for (int i = 0; i < tokens.length; i++) {
        // closing punctuation: no space before it
        if (tokens[i].equals(".") || tokens[i].equals(";") || tokens[i].equals(",") || tokens[i].equals("?") || tokens[i].equals("!")) {
            sentence.append(tokens[i]);
        }
        // colon: if the previous token contains a digit, assume something like "12:30"
        else if (tokens[i].equals(":")) {
            // i > 0 avoids reading tokens[-1] when the sentence starts with ":"
            if (i > 0 && Pattern.compile("\\d").matcher(tokens[i - 1]).find()) {
                date = true;
            }
            sentence.append(tokens[i]);
        }
        // opening parenthesis: space before it, none after it
        else if (tokens[i].equals("(")) {
            sentence.append(" ");
            sentence.append(tokens[i]);
            firstWordInParenthisis = true;
        }
        else if (tokens[i].equals(")")) {
            sentence.append(tokens[i]);
            firstWordInParenthisis = false;
        }
        // straight double quote: alternates between opening and closing
        else if (tokens[i].equals("\"")) {
            if (!sentenceInQuotation) {
                sentence.append(" ");
                sentence.append(tokens[i]);
                sentenceInQuotation = true;
                firstWordInQuotationSentence = true;
            }
            else {
                sentence.append(tokens[i]);
                sentenceInQuotation = false;
            }
        }
        else if (tokens[i].equals("&") || tokens[i].equals("+") || tokens[i].equals("=")) {
            sentence.append(" ");
            sentence.append(tokens[i]);
        }
        // words
        else {
            if (sentenceInQuotation) {
                if (firstWordInQuotationSentence) {
                    sentence.append(tokens[i]);
                    firstWordInQuotationSentence = false;
                }
                else if (firstWordInParenthisis) {
                    sentence.append(tokens[i]);
                    firstWordInParenthisis = false;
                }
                else {
                    sentence.append(" ");
                    sentence.append(tokens[i]);
                }
            }
            else if (firstWordInParenthisis) {
                sentence.append(tokens[i]);
                firstWordInParenthisis = false;
            }
            else if (date) {
                sentence.append(tokens[i]);
                date = false;
            }
            else {
                sentence.append(" ");
                sentence.append(tokens[i]);
            }
        }
    }
    // strip only a leading space; replaceFirst(" ", "") would remove the first space anywhere
    return sentence.toString().replaceFirst("^ ", "");
}
As I said, this works quite well, but not perfectly. I suggest you copy and paste my method and try it yourself.
My Question:
Do you have ANY ideas for my problem or a better solution? Maybe you had a similar problem once and still have your solution (I can't really believe I'm the first one with this problem). Or maybe you're just such a genius (I hope so :-) ) that the best solution comes to mind right away.
I'm grateful for any help and thoughts you share with me.
Thanks a lot
Update: Examples:
Well, for example, while trying out some texts I noticed that I don't yet handle tokens like "[" and "]", or the different types of quotation marks such as " or “. I also heard that it can make a difference whether I use ... (three periods) or the single Unicode character … (select it and you'll see it).
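One idea I'm considering for these variants is to normalize each token before detokenizing. This is only a rough sketch under my own assumptions: TokenNormalizer, normalize and normalizeAll are hypothetical names, and mapping the curly quotes and the ellipsis character to their ASCII forms is a simplification that may not be what everyone wants:

```java
import java.util.Arrays;

public class TokenNormalizer {
    // Hypothetical helper: maps a few typographic variants to their ASCII form.
    static String normalize(String token) {
        switch (token) {
            case "\u201C": // left double quotation mark
            case "\u201D": // right double quotation mark
                return "\"";
            case "\u2026": // horizontal ellipsis, one single character
                return "...";
            default:
                return token;
        }
    }

    // Applies normalize to every token and returns a new array.
    static String[] normalizeAll(String[] tokens) {
        return Arrays.stream(tokens)
                .map(TokenNormalizer::normalize)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // The Unicode ellipsis is one char, three periods are three chars.
        System.out.println("\u2026".length()); // prints 1
        System.out.println("...".length());    // prints 3

        String[] tokens = {"\u201C", "Well", "\u2026", "\u201D"};
        System.out.println(Arrays.toString(normalizeAll(tokens)));
    }
}
```

After such a step, a method that only knows the straight quote " would at least see one quotation token type, although it loses the information about which quote was opening and which was closing.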
So it becomes more and more complex. If you need further information, just let me know. Thanks a lot.
And that's it, there's probably something I'm forgetting