Which one of these two would you prefer writing in your code?
This:
tweet = tweet.replaceAll("@\\w+|#\\w+|\\bRT\\b", "");
tweet = tweet.replaceAll("\n", " ");
tweet = tweet.replaceAll("[^\\p{L}\\p{N} ]+", " ");
tweet = tweet.replaceAll(" +", " ").trim();
Or this:
tweet = tweet.replaceAll("@\\w+|#\\w+|\\bRT\\b", "")
        .replaceAll("\n", " ")
        .replaceAll("[^\\p{L}\\p{N} ]+", " ")
        .replaceAll(" +", " ")
        .trim();
Which one looks cleaner, and which one should perform better? In my opinion the second one looks cleaner, although it makes it harder to comment every line if you wanted to. My guess is that there is no performance difference, on the assumption that the regex engine does everything internally and doesn't create a new string every time, if I am correct.
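One thing worth checking about that guess: as far as I understand the Javadoc, str.replaceAll(regex, repl) gives the same result as Pattern.compile(regex).matcher(str).replaceAll(repl), and since Java strings are immutable each call returns a new String. Both forms above therefore do the same work, and chaining only changes the layout. If the cost ever matters, the patterns can be compiled once and reused. A minimal sketch of that idea, where the class name TweetCleaner, the constant names, and the sample input are made up for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are not from the original code.
class TweetCleaner {
    // Compiled once and reused, instead of on every String.replaceAll call.
    private static final Pattern NOISE = Pattern.compile("@\\w+|#\\w+|\\bRT\\b");
    private static final Pattern NON_ALNUM = Pattern.compile("[^\\p{L}\\p{N} ]+");
    private static final Pattern SPACES = Pattern.compile(" +");

    static String clean(String tweet) {
        String s = NOISE.matcher(tweet).replaceAll("");   // drop mentions, hashtags, RT
        s = s.replace('\n', ' ');                         // plain char replace, no regex needed
        s = NON_ALNUM.matcher(s).replaceAll(" ");         // keep only letters, digits, spaces
        return SPACES.matcher(s).replaceAll(" ").trim();  // collapse runs of spaces
    }

    public static void main(String[] args) {
        System.out.println(clean("RT @user: Hello #tag world!\nBye"));
    }
}
```

Trailing // comments like the ones above also work at the end of each line of a chained call, so the chained form does not rule out per-line comments.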
Additional Information:
Using a single .replaceAll() is not possible, because if the content I want removed is not removed in this order, the output will be wrong.
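To illustrate why the order matters, here is a small sketch that runs the same replacements in two different orders. The class name OrderDemo, the method names, and the sample input are made up for illustration. If the punctuation is stripped first, the @ and # markers are already gone by the time the mention/hashtag pattern runs, so those words survive:

```java
// Hypothetical demo class; the names and sample input are made up for illustration.
class OrderDemo {
    // Mentions and hashtags are removed while their @ and # markers still exist.
    static String tagsFirst(String s) {
        return s.replaceAll("@\\w+|#\\w+|\\bRT\\b", "")
                .replaceAll("[^\\p{L}\\p{N} ]+", " ")
                .replaceAll(" +", " ")
                .trim();
    }

    // Punctuation is stripped first, so "@bob" has already become " bob"
    // and the mention/hashtag pattern no longer finds anything to remove.
    static String punctuationFirst(String s) {
        return s.replaceAll("[^\\p{L}\\p{N} ]+", " ")
                .replaceAll("@\\w+|#\\w+|\\bRT\\b", "")
                .replaceAll(" +", " ")
                .trim();
    }

    public static void main(String[] args) {
        String sample = "@bob hi #tag";
        System.out.println(tagsFirst(sample));        // prints: hi
        System.out.println(punctuationFirst(sample)); // prints: bob hi tag
    }
}
```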
Here is a short compilable example:
public class StringTest {
    public static void main(String[] args) {
        String text = "RT @AshStewart09: Vote for Lady Gaga for \"Best Fans\""
                + " at iHeart Awards\n"
                + "\n"
                + "RT!!\n"
                + "\n"
                + "My vote for #FanArmy goes to #LittleMonsters #iHeartAwards\n"
                + "for ART!"
                + "htt… è, é, ê, ë asdf324 ah_";
        System.out.println("Before: " + text + "\n");
        text = text.replaceAll("@\\w+|#\\w+|\\bRT\\b", "")
                .replaceAll("\n", " ")
                .replaceAll("[^\\p{L}\\p{N} ]+", " ")
                .replaceAll(" +", " ").trim();
        System.out.println("After: " + text + "\n");
    }
}
I have tried merging some of the .replaceAll() calls, but the output was always wrong if I did the operations in any order other than this one. In the end I want to be left with just the words of the tweet and nothing else. Bear in mind that this is the first time I'm using regex, so I am far from a pro at it; if there actually is a way to merge them, please do tell.
This is the closest I have come to merging the replaceAll() calls:
text = text
        .replaceAll("@\\w+|#\\w+|\\bRT\\b|\n|[^\\p{L}\\p{N} ]+", " ")
        .replaceAll(" +", " ")
        .trim();
I basically turned everything into a space and then removed all the extra spaces. Is that better than before? Is it even possible to merge the last replaceAll() in as well?
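On that last point, one idea that might work is to wrap the alternation in a repeated non-capturing group and let the character class match spaces too, so that each whole run of unwanted material becomes a single space in one pass. This is only a sketch under assumptions: the class name MergedTweetCleaner and the constant JUNK are made up, it follows the "replace everything with a space" variant, and it may not match the four-step version on every edge case (for example, a hashtag immediately followed by an accented letter). The trim() call still has to stay separate.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are not from the original code.
class MergedTweetCleaner {
    // One or more consecutive mentions, hashtags, standalone "RT" tokens,
    // or single characters that are neither letters nor digits (spaces included).
    private static final Pattern JUNK =
            Pattern.compile("(?:@\\w+|#\\w+|\\bRT\\b|[^\\p{L}\\p{N}])+");

    static String clean(String tweet) {
        // Each run of junk collapses to one space; trim drops the ends.
        return JUNK.matcher(tweet).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("RT @user: Hello #tag world!\nBye"));
    }
}
```

Because the newline is neither a letter nor a digit, the character class already covers it, so the separate \n alternative is no longer needed.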