Code:
static short state = 0;
static int td_number = 0;

public static void main(String[] args) {
    final Pattern p = Pattern.compile("^[\\s]*?\\d+\\.\\d+[\\s]*?");
    final short TD_ENTRY = 0;
    final short NO_ENTRY = 1;

    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
        public void handleText(char[] data, int pos) {
            switch (state) {
                case NO_ENTRY:
                    break;
                case TD_ENTRY: {
                    // We are in the right table column
                    // Create string from char array
                    String s = new String(data);
                    Matcher m = p.matcher(s);
                    boolean b = m.matches();
                    // Check if data information has correct format (0.0)
                    if (b) {
                    }
                }
                    break;
                default:
                    break;
            }
            state = NO_ENTRY;
        }

        public void handleStartTag(HTML.Tag tag, MutableAttributeSet set, int pos) {
            if (tag == HTML.Tag.TD) {
                //[...]
            }
        }
    };

    Reader reader = new StringReader(html);
    try {
        new ParserDelegator().parse(reader, callback, false);
    } catch (IOException e) {
    }
}
I am trying to parse HTML with regular expressions. The program reads the content of td tags within an HTML table. The content of each table cell should match a specific pattern, defined in Pattern p.
The main problem is that the regex pattern does not match cell content like " 0.1".
But if I define the String s manually in the code with the value " 0.1", the pattern matches.
Furthermore, if I copy the content of char[] data in debug mode and define s with this copied content, the pattern does not match either, although it looks the same as the manually defined value above.
Is it possible to find out which whitespace characters are really read?
It seems that the whitespace is not always an ordinary whitespace character and is therefore not matched by the regex class [\s]. Is this possible?
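One way to see which characters are really read is to print the code point of each character instead of the character itself. The following is only a minimal sketch: the class name `CodePointDump` and the helper `describe` are made up for illustration, and the sample string assumes the cell starts with a no-break space.

```java
public class CodePointDump {

    // Hypothetical helper: lists every code point of s as "U+XXXX <Unicode name>".
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp ->
                sb.append(String.format("U+%04X %s\n", cp, Character.getName(cp))));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: a no-break space followed by "0.1".
        // Inside handleText, the input would be new String(data) instead.
        String s = new String(new char[] {0x00A0, '0', '.', '1'});
        System.out.print(describe(s));
    }
}
```

A character that looks like a blank but is reported with a code point other than U+0020 (for example U+00A0) would explain why the pattern fails on parsed content yet matches a hand-typed string.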
EDIT:
Thanks for the answers. It really was a whitespace character (\xA0) that is not recognized by the \s regex class.
To those who downvoted my question (really frustrating): you simply misunderstood me. Maybe the problem was the sentence "I want to parse HTML with regex", but in fact I simply have content from an HTML table cell with unknown whitespace characters ;-).
I think I would have had the same problem with a library like jsoup.
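For reference, here is a sketch of two ways the pattern could be adapted so that it also accepts \xA0. The class and constant names are illustrative only. The first variant adds U+00A0 to the character class explicitly. The second relies on the `Pattern.UNICODE_CHARACTER_CLASS` flag (Java 7 and later), under which `\s` follows the Unicode White_Space property.

```java
import java.util.regex.Pattern;

public class NbspPattern {

    // Pattern from the question: by default \s covers only [ \t\n\x0B\f\r].
    static final Pattern ORIGINAL =
            Pattern.compile("^[\\s]*?\\d+\\.\\d+[\\s]*?");

    // Variant 1: list the no-break space explicitly next to \s.
    static final Pattern WITH_NBSP =
            Pattern.compile("^[\\s\\u00A0]*\\d+\\.\\d+[\\s\\u00A0]*");

    // Variant 2: let \s use the Unicode White_Space property.
    static final Pattern UNICODE_WS =
            Pattern.compile("^\\s*\\d+\\.\\d+\\s*", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        // Assumed cell content: no-break space followed by "0.1".
        String cell = new String(new char[] {0x00A0, '0', '.', '1'});
        System.out.println(ORIGINAL.matcher(cell).matches());
        System.out.println(WITH_NBSP.matcher(cell).matches());
        System.out.println(UNICODE_WS.matcher(cell).matches());
    }
}
```

The explicit variant only covers U+00A0. If other exotic space characters can appear in the table, the flag-based variant is probably the more general choice.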
\s or by the definition of space used by String.trim(). \s matches [ \t\n\x0B\f\r] only, and trim treats only code points <= 0x20 as space. – Mike Samuel Jan 14 at 15:37
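The point about `String.trim()` can be checked with a small experiment. This is a sketch under the assumption that the cell content is padded with no-break spaces on both sides. The helper `normalize` is hypothetical and simply maps U+00A0 to an ordinary space before trimming.

```java
public class TrimNbsp {

    // Hypothetical helper: turn no-break spaces into plain spaces, then trim.
    static String normalize(String s) {
        return s.replace((char) 0x00A0, ' ').trim();
    }

    public static void main(String[] args) {
        // Assumed cell content: "0.1" with a no-break space on each side.
        String cell = new String(new char[] {0x00A0, '0', '.', '1', 0x00A0});

        // trim() leaves the no-break spaces in place, so the length stays 5.
        System.out.println(cell.trim().length());

        // After normalization only "0.1" remains, so the length is 3.
        System.out.println(normalize(cell).length());
    }
}
```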