Java String.split() with a regex

Question

I have a parsing question. I have sentences stored as Strings. I want to grab each word in each sentence, but I would like to filter which words I grab. For example, say I have a sentence like the following:

Hell0 3v3ryb0dy @ stackoverflow $people \implies queen$ equals ~queen --> ~people. /#logic

I would like to do the following:

grab 'Hell0'
grab '3v3ryb0dy'
throw away the @
grab 'people' from '$people'
grab 'implies' from '\implies'
grab 'queen' from 'queen$'
grab 'equals'
grab 'queen' from '~queen'
throw away -->
grab 'people' from '~people'
grab 'logic' from '/#logic'

Essentially, I want only alphanumeric characters. Whenever some other character, such as a \, appears before or after a word, I want to disregard it.

Currently I am doing: sentence.split(" ")

This gets the individual words from the sentence, but it grabs '$people' and '~people' and treats them as different words when I want them to be treated the same.

How can I achieve this?
Would a regex help me here?

Bhesh Gurung · Accepted Answer · 2012-11-05 17:21:56Z

Split the string with the regex \\W+, which splits at every run of one or more non-word characters.

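// Requires: import java.util.Arrays;
// The backslash before "implies" is doubled because this is a Java string literal.
String sentence = "Hell0 3v3ryb0dy @ stackoverflow $people \\implies queen$ equals ~queen --> ~people. /#logic";
// \W+ matches one or more characters outside [A-Za-z0-9_].
String[] split = sentence.split("\\W+");
System.out.println(Arrays.asList(split));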

Output

[Hell0, 3v3ryb0dy, stackoverflow, people, implies, queen, equals, queen, people, logic]
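Two edge cases may be worth keeping in mind with this approach. If the sentence starts with a non-word character, split("\\W+") returns an empty string as its first element, and \W treats the underscore as a word character, so it is not stripped. The following is only a minimal sketch of one way to drop empty tokens; the class name WordSplitter and the method words are hypothetical and not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;

class WordSplitter {
    // Hypothetical helper: split on runs of non-word characters
    // and skip any empty token produced by a leading delimiter.
    static List<String> words(String sentence) {
        List<String> result = new ArrayList<>();
        for (String token : sentence.split("\\W+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The input starts with '~', which would otherwise yield a leading "".
        System.out.println(words("~queen --> ~people. /#logic"));
        // prints [queen, people, logic]
    }
}
```

If underscores should also be discarded, the pattern [^A-Za-z0-9]+ could be used in place of \\W+.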

Srinivas · Answer 2 · 2012-11-11 08:09:58Z


I am using this regex: [^A-Za-z0-9 ]+ (edited). The output I get is:
Hell0 3v3ryb0dy stackoverflow people implies queen equals queen people logic

Is this what you are expecting?

Snippet from myregextester:

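import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] args) {
    // Placeholder input; replace it with the sentence to clean.
    String sourcestring = "source string to match with pattern";
    // Match runs of characters that are neither ASCII letters, digits, nor spaces.
    // The character class already lists both cases, so no CASE_INSENSITIVE flag is needed.
    Pattern re = Pattern.compile("[^A-Za-z0-9 ]+");
    Matcher m = re.matcher(sourcestring);
    // Delete every match, leaving only the words and the spaces between them.
    String result = m.replaceAll("");
    System.out.println(result);
  }
}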

$sourcestring after replacement:
Hell0 3v3ryb0dy stackoverflow people implies queen equals queen people logic
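To get an array of words out of this approach rather than one cleaned string, the replacement could be followed by a split on whitespace. This is a rough sketch under that assumption; the class name CleanThenSplit and the method words are hypothetical, and the input is the sentence from the question.

```java
import java.util.Arrays;

class CleanThenSplit {
    // Hypothetical helper: strip unwanted characters first, then split into words.
    static String[] words(String sentence) {
        // Remove every run of characters that is neither alphanumeric nor a space.
        String cleaned = sentence.replaceAll("[^A-Za-z0-9 ]+", "");
        // Removing a stand-alone symbol such as "@" leaves two adjacent spaces,
        // so trim and split on runs of whitespace rather than on a single space.
        return cleaned.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String sentence = "Hell0 3v3ryb0dy @ stackoverflow $people \\implies queen$ equals ~queen --> ~people. /#logic";
        System.out.println(Arrays.toString(words(sentence)));
    }
}
```

Splitting on "\\s+" instead of " " is a deliberate choice here: a plain split(" ") would return empty strings wherever a removed symbol left two spaces side by side.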

edited Nov 11 '12 at 8:09

answered Nov 5 '12 at 2:33

+1 for giving the same answer I was about to post – David Harris Nov 5 '12 at 2:35

I still want to split my sentence by spaces. Say I do the following: wordsInSentence = sentence.split(" "); I would like this code to also filter out non-word characters, as described in my question. – CodeKingPlusPlus Nov 5 '12 at 2:46

@CodeKingPlusPlus: Did you try my answer? – Bhesh Gurung Nov 5 '12 at 2:49

-1. With this I got this output:

[, H, e, l, l, 0,  , 3, v, 3, r, y, b, 0, d, y,  , ,  , s, t, a, c, k, o, v, e, r, f, l, o, w,  , , p, e, o, p, l, e,  , , i, m, p, l, i, e, s,  , q, u, e, e, n, ,  , e, q, u, a, l, s,  , , q, u, e, e, n,  , ,  , , p, e, o, p, l, e, ,  , , l, o, g, i, c]

– Bhesh Gurung Nov 5 '12 at 2:52

You need to change the * to +: [^A-Za-z0-9 ]+. Your regex can match nothing, meaning it will match at every character boundary regardless of what follows it. If it happens to see any of the unwanted characters it will consume them, but something or nothing, it will always match. The regex in your comment ([^A-Za-z0-9]*) has the same problem. It won't throw an exception or return incorrect results, but it's doing a lot of work it doesn't need to do. – Alan Moore Nov 5 '12 at 7:07
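The difference between the two quantifiers can be seen by passing both patterns to split(). This is only an illustrative sketch; the class name StarVersusPlus and its two methods are hypothetical, and the exact array produced by the * variant can differ between Java versions, so no output is shown for it.

```java
import java.util.Arrays;

class StarVersusPlus {
    // '*' lets the pattern match the empty string, so split() also finds
    // a match at the boundaries between ordinary characters.
    static String[] splitStar(String s) {
        return s.split("[^A-Za-z0-9]*");
    }

    // '+' requires at least one unwanted character, so split() only
    // cuts where such characters actually occur.
    static String[] splitPlus(String s) {
        return s.split("[^A-Za-z0-9]+");
    }

    public static void main(String[] args) {
        String s = "queen --> people";
        // Breaks the words apart into single characters.
        System.out.println(Arrays.toString(splitStar(s)));
        // prints [queen, people]
        System.out.println(Arrays.toString(splitPlus(s)));
    }
}
```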
