Java: Parsing CSV files
As I mentioned in a previous post, I recently moved a bunch of neo4j data-loading code from Ruby to Java, and as part of that process I needed to parse some CSV files.
In Ruby I was using FasterCSV, which became the standard CSV library from Ruby 1.9 onwards, but it's been a while since I had to parse CSV files in Java so I wasn't sure which library to use.
I needed a library that could parse a comma-separated file where there might be commas inside the values of some fields. I think that's fairly standard behaviour for any CSV library, but my googling led me to OpenCSV.
It can be downloaded from here and so far seems to do the job!
This is an example of how I’m using it:
String filePath = "/Users/mneedham/data/awesome-csv-file.csv";
CSVReader reader = new CSVReader(new FileReader(filePath), ',');
List<String[]> csvEntries = reader.readAll();
Iterator<String[]> iterator = csvEntries.iterator();
while (iterator.hasNext()) {
    String[] row = iterator.next();
    System.out.println("field 1: " + row[0]);
}
There are more use cases described on the home page.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)
Comments
Lance Semmens replied on Fri, 2012/10/05 - 3:24am
As a side note, openCSV supports an iterator style reading pattern which can be handy with large files where you don't want to read the entire file into memory at once.
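The iterator-style pattern mentioned here uses OpenCSV's `readNext()` method, which returns one record at a time and `null` at end of input, so only a single row is held in memory. A minimal sketch (the file path is just an example, and the `au.com.bytecode.opencsv` package name assumes the 2012-era OpenCSV releases):

```java
import au.com.bytecode.opencsv.CSVReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingCsvExample {
    public static void main(String[] args) throws IOException {
        // readNext() streams one record at a time instead of
        // loading the whole file like readAll() does.
        CSVReader reader = new CSVReader(new FileReader("/Users/mneedham/data/awesome-csv-file.csv"));
        try {
            String[] row;
            while ((row = reader.readNext()) != null) {
                System.out.println("field 1: " + row[0]);
            }
        } finally {
            reader.close();
        }
    }
}
```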
Jean Said replied on Fri, 2012/10/05 - 9:44am
I beg to differ.
CSV parsing is in no way generic, and I think it's quite hard to build a generic library for it. Why? Because CSV is not a file format; it's a multitude of file formats with no clear standard. Even the basic functionality is in no way standard. I mean, if I use ';' as a field separator, do I accept trailing characters as a field? Do I ignore them? Which charset(s) do I accept? Do I use heuristics to detect charsets when there's no BOM? And then, which charsets are available for use? Unless you have a clear and flawlessly followed standard, CSV is a nightmare.
In my company we were forced to create our own smart, configurable CSV parser, as no library out there was up to our needs. Of course we could have used some shaky generic lib, but we didn't, and for sadly valid reasons.
CSV files are just as much of a problem as "text files". The term means basically nothing. Saying our input is CSV is almost as precise as saying our input is a binary file with records in it.
Even if it looks simple, don't assume you'll get a silver bullet lib to parse it...
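For what it's worth, some of these knobs do exist in OpenCSV: the field separator is a constructor argument, and the charset question is delegated to the `Reader` you wrap rather than the CSV library itself. A sketch, assuming a ';'-separated Latin-1 file (the file name is hypothetical):

```java
import au.com.bytecode.opencsv.CSVReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class SemicolonCsvExample {
    public static void main(String[] args) throws IOException {
        // The charset is decided by the InputStreamReader, not by OpenCSV;
        // the separator is the second constructor argument.
        CSVReader reader = new CSVReader(
                new InputStreamReader(new FileInputStream("data.csv"), "ISO-8859-1"),
                ';');
        try {
            String[] row;
            while ((row = reader.readNext()) != null) {
                System.out.println(row[0]);
            }
        } finally {
            reader.close();
        }
    }
}
```

This doesn't answer the deeper objection (there is no standard to conform to), but it covers the separator and charset cases.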
Cristofer Weber replied on Fri, 2012/10/05 - 10:27am
There's also a project for batch import CSV files into Neo4j
https://github.com/maxdemarzi/batch-import
malakooti replied on Fri, 2012/10/05 - 12:26pm
Parsing of CSV can be done in a generic manner using an array that holds the definition of each field. In fact, the best way is to create a CSV parser that:
1- Throws a new-document event when the file is opened.
2- Reads a line and splits it into an array of fields.
3- For each field, gets the field name from the field's definition and throws a new-element event with the value derived from the split fields.
In fact, you can extend this concept to reading fixed-length files, and you can treat all legacy data files as modern (XML) files. This also aligns you with the new XML specification, which has provision for transferring XML data in a non-XML format.
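The three steps above can be sketched as a small event-driven parser. This is only an illustration of the idea, not a library API; the `CsvEventListener` interface, `SimpleCsvEventParser` class, and the field "definition" list are all hypothetical names:

```java
import java.util.List;

// Hypothetical listener: one event per document, then one per field.
interface CsvEventListener {
    void newDocument();
    void newElement(String fieldName, String value);
}

class SimpleCsvEventParser {
    private final List<String> fieldDefinition; // ordered field names

    SimpleCsvEventParser(List<String> fieldDefinition) {
        this.fieldDefinition = fieldDefinition;
    }

    void parse(Iterable<String> lines, CsvEventListener listener) {
        listener.newDocument();                    // step 1: document event
        for (String line : lines) {
            String[] values = line.split(",", -1); // step 2: split into fields
            for (int i = 0; i < values.length && i < fieldDefinition.size(); i++) {
                // step 3: pair each value with its name from the definition
                listener.newElement(fieldDefinition.get(i), values[i]);
            }
        }
    }
}
```

The same listener could then be fed from a fixed-length-record reader instead of a line splitter, which is what would make the approach generic across legacy formats. (Note the naive `split(",")` here ignores quoting, the very problem the original post needed OpenCSV for.)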
Seb Cha replied on Fri, 2012/10/05 - 2:24pm
I agree with Jean Said
CSV is not a standard; it's not backed by the W3C. How do I declare a null field? How do I declare a blank field? How do I escape special characters? How do I know the character encoding?
And XML will only be verbose if you use elements instead of attributes.
Greg Brown replied on Fri, 2012/10/05 - 6:58pm
in response to Seb Cha:
It may not be a "formal" standard but it is a de facto standard.
"How do I declare a null field? How do I declare a blank field? How do I escape special characters? How do I know the character encoding?" These questions may be open to interpretation, but they are certainly not difficult to solve.
"XML will only be verbose if you use elements instead of attributes." It's true that XML will be less verbose with attributes, but it will still be far more verbose than CSV. In XML, the element and attribute names must be repeated for every record, in addition to the field values. In CSV, only the field values are required.
Don't get me wrong - I like XML and JSON, too. Each technology has its pros and cons - just choose the right tool for the job.
matt inger replied on Fri, 2012/10/05 - 8:48pm
There's also a handy StrTokenizer class in commons-lang3.
The tokenizer itself is quite configurable, allowing you to tell it what constitutes whitespace, what to do with empty tokens, what the delimiters are, etc. Though in most cases, just calling "getCSVInstance()" or "getTSVInstance()" is sufficient. It also implements the ListIterator interface.
The one downside is that it's line oriented, so it's up to the programmer to feed it a single record at a time (using a LineNumberReader or something of that nature).
Full disclosure: I am one of the authors of this class.
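Putting those two pieces together, a sketch of the line-at-a-time pattern described above, assuming `StrTokenizer` from `org.apache.commons.lang3.text` (the sample data is made up):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import org.apache.commons.lang3.text.StrTokenizer;

public class StrTokenizerCsvExample {
    public static void main(String[] args) throws IOException {
        String csv = "name,\"title, role\"\nmatt,developer";
        BufferedReader lines = new BufferedReader(new StringReader(csv));
        String line;
        // StrTokenizer is line oriented, so we feed it one record at a time.
        while ((line = lines.readLine()) != null) {
            // getCSVInstance handles quoted fields containing the delimiter.
            StrTokenizer tokenizer = StrTokenizer.getCSVInstance(line);
            System.out.println(tokenizer.getTokenList());
        }
    }
}
```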
Lund Wolfe replied on Sat, 2012/10/06 - 2:49pm
OpenCSV is an excellent choice. If there is an interoperable data standard, it is probably CSV or some text equivalent; that's usually what you see when a proprietary format offers an import/export option. CSV is often messy because it frequently originates from Excel, and you may get trailing commas for fake columns and trailing fake rows. There are lots of Excel users out there, so your expectations should be pretty low.
You do need to know your data, and clean it up and validate it accordingly. That's necessary for any conversion, and it's one of the biggest opportunities for improvement when moving data from one information system to another.