
So, I have an issue that really bothers me. I have a simple parser that I wrote in Java. Here is the relevant piece of code:

while( (line = br.readLine())!=null)
{
    String splitted[] = line.split(SPLITTER);
    int docNum = Integer.parseInt(splitted[0].trim());
    //do something
}

The input file is a CSV file whose first field is an integer. When I start parsing, I immediately get this exception:

Exception in thread "main" java.lang.NumberFormatException: For input string: "1"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at dipl.parser.TableParser.parse(TableParser.java:50)
at dipl.parser.DocumentParser.main(DocumentParser.java:87)

I checked the file, and it does have 1 as its first value (there are no other characters in that field), but I still get the exception. I think it may be related to the file encoding: the file is UTF-8 with Unix line endings, and the program runs on Ubuntu 14.04. Any suggestions on where to look for the problem are welcome.

Nice one using copy and paste to put the error in the question! – T.J. Crowder 7 hours ago
Accepted answer (score 28):

You have a BOM in front of that number: if I copy what looks like "1" in your question and paste it into vim, I see an FE FF (i.e., a BOM) in front of it. To quote a description of the BOM:

The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format.
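As a rough illustration of that sentence, the sketch below encodes the single character U+FEFF with three of Java's standard charsets and prints the resulting bytes, then decodes a UTF-8 byte sequence that starts with a BOM. The class name `BomBytes` and the helper `hex` are made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what U+FEFF looks like in different encodings.
final class BomBytes {

    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF";

        // The same character becomes different bytes per transformation format.
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_8)));    // EF BB BF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16BE))); // FE FF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16LE))); // FF FE

        // Decoding UTF-8 bytes that start with a BOM: the BOM survives as U+FEFF.
        byte[] fileBytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '1'};
        String decoded = new String(fileBytes, StandardCharsets.UTF_8);
        System.out.println(decoded.length());               // 2
        System.out.println(decoded.charAt(0) == '\uFEFF');  // true
    }
}
```

The last two lines show why the string in the exception message looks like a plain "1" while actually being two characters long.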

So that's the issue: read the file with the appropriate reader for the transformation format (UTF-8, UTF-16 big-endian, UTF-16 little-endian, etc.) the file is encoded with. See also this question and its answers for more about reading Unicode files in Java.
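One caveat worth sketching: as far as I know, Java's UTF-8 decoder does not drop a leading BOM, so even with the right charset the first line can still begin with U+FEFF, and `trim()` will not remove it. A minimal workaround, assuming the file really is UTF-8, is to strip that character from the first line before parsing. The names `BomSafeParse`, `stripBom`, and `firstField` below are hypothetical, and a `StringReader` stands in for the real file so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class: parses the first CSV field after removing a BOM.
final class BomSafeParse {

    // U+FEFF is what a BOM looks like once decoded into a Java char.
    private static final char BOM = '\uFEFF';

    // Hypothetical helper: drops a single leading BOM, if present.
    static String stripBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == BOM) {
            return line.substring(1);
        }
        return line;
    }

    // Reads the first line (assumed to exist) and parses its first field as an int.
    static int firstField(BufferedReader br, String splitter) throws IOException {
        String line = stripBom(br.readLine());
        String[] fields = line.split(splitter);
        return Integer.parseInt(fields[0].trim());
    }

    public static void main(String[] args) throws IOException {
        // Simulated file content: a BOM, then two CSV lines.
        String content = "\uFEFF1,foo,bar\n2,baz,qux\n";
        try (BufferedReader br = new BufferedReader(new StringReader(content))) {
            System.out.println(firstField(br, ",")); // prints 1
        }
    }
}
```

For a real file, the reader could be created with `Files.newBufferedReader(path, StandardCharsets.UTF_8)` instead of the `StringReader`; only the first line of the file needs the BOM check.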

That's a UTF-16 BOM. UTF-8 doesn't need a BOM, but if you add one the byte sequence is EF BB BF. – Doval 51 mins ago
@Doval: Thank you, I was absolutely wrong to say it was a UTF-8 BOM, and you're quite right that on-the-wire, the BOM for UTF-8 is EF BB BF. But what we're looking at is the end result of reading the file and then seeing the output in the error message. The file might be in any transformation; all BOMs end up being FE FF once read. – T.J. Crowder 38 mins ago
    
But if it was read raw, then...oh, I don't know. :-) Could well have been UTF-16. :-) It'll all depend on how the file was read into the stream. – T.J. Crowder 21 mins ago
