Java: Parsing CSV files
As I mentioned in a previous post, I recently moved a bunch of neo4j data-loading code from Ruby to Java, and as part of that process I needed to parse some CSV files.
In Ruby I was using FasterCSV, which became the standard CSV library from Ruby 1.9 onwards, but it's been a while since I had to parse CSV files in Java so I wasn't sure which library to use.
I needed a library that could parse a comma-separated file where there might be commas inside the values of some fields. I think that's fairly standard behaviour for any CSV library, but my googling led me to OpenCSV.
It can be downloaded from here and so far seems to do the job!
This is an example of how I’m using it:
String filePath = "/Users/mneedham/data/awesome-csv-file.csv";
CSVReader reader = new CSVReader(new FileReader(filePath), ',');
List<String[]> csvEntries = reader.readAll();
Iterator<String[]> iterator = csvEntries.iterator();
while (iterator.hasNext()) {
    String[] row = iterator.next();
    System.out.println("field 1: " + row[0]);
}
There are more use cases described on the home page.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)
Comments
Lance Semmens replied on Fri, 2012/10/05 - 3:24am
As a side note, openCSV supports an iterator style reading pattern which can be handy with large files where you don't want to read the entire file into memory at once.
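The iterator-style pattern mentioned here uses OpenCSV's `readNext()` method, which returns one record at a time and `null` at end of input, so only a single row is held in memory. A minimal sketch (the file path is just an example, and the `au.com.bytecode.opencsv` package name assumes the 2012-era OpenCSV releases):

```java
import au.com.bytecode.opencsv.CSVReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingCsvExample {
    public static void main(String[] args) throws IOException {
        // readNext() streams one record at a time instead of
        // loading the whole file like readAll() does.
        CSVReader reader = new CSVReader(new FileReader("/Users/mneedham/data/awesome-csv-file.csv"));
        try {
            String[] row;
            while ((row = reader.readNext()) != null) {
                System.out.println("field 1: " + row[0]);
            }
        } finally {
            reader.close();
        }
    }
}
```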
Jean Said replied on Fri, 2012/10/05 - 9:44am
I beg to differ.
CSV parsing is in no way generic, and I think it's quite hard to build a generic library for it. Why? Because CSV is not a file format; it's a multitude of file formats with no clear standard. Even the basic functionality is in no way standard. I mean, if I use ';' as a field separator, do I accept trailing characters as a field? Do I ignore them? Which charset(s) do I accept? Do I use heuristics to detect charsets when there's no BOM? And then, which charsets are available for use? Unless you have a clear and flawlessly followed standard, CSV is a nightmare.
In my company we were forced to create our own smart, configurable CSV parser, as no library out there was up to our needs. Of course we could have used some shaky generic lib, but we didn't, and for sadly valid reasons.
CSV files are just as much of a problem as "text files". The term means basically nothing. Saying our input is CSV is almost as precise as saying our input is a binary file with records in it.
Even if it looks simple, don't assume you'll get a silver bullet lib to parse it...
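For what it's worth, some of these knobs do exist in OpenCSV: the field separator is a constructor argument, and the charset question is delegated to the `Reader` you wrap rather than the CSV library itself. A sketch, assuming a ';'-separated Latin-1 file (the file name is hypothetical):

```java
import au.com.bytecode.opencsv.CSVReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class SemicolonCsvExample {
    public static void main(String[] args) throws IOException {
        // The charset is decided by the InputStreamReader, not by OpenCSV;
        // the separator is the second constructor argument.
        CSVReader reader = new CSVReader(
                new InputStreamReader(new FileInputStream("data.csv"), "ISO-8859-1"),
                ';');
        try {
            String[] row;
            while ((row = reader.readNext()) != null) {
                System.out.println(row[0]);
            }
        } finally {
            reader.close();
        }
    }
}
```

This doesn't answer the deeper objection (there is no standard to conform to), but it covers the separator and charset cases.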
Cristofer Weber replied on Fri, 2012/10/05 - 10:27am
There's also a project for batch import CSV files into Neo4j
https://github.com/maxdemarzi/batch-import
malakooti replied on Fri, 2012/10/05 - 12:26pm
Parsing of CSV can be done in a generic manner using an array that holds the definition of each field. In fact, the best way is to create a CSV parser that:
1- Throws a new-document event when the file is opened.
2- Reads a line and splits it into an array of fields.
3- For each field, gets the field name from the field's definition and throws a new-element event with the value derived from the split fields.
In fact, you can extend this concept to reading fixed-length files, and you can treat all legacy data files as modern (XML) files. This also aligns you with the new XML specification, which has provision for transferring XML data in a non-XML format.
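The three steps above can be sketched as a small event-driven parser. This is only an illustration of the idea, not a library API; the `CsvEventListener` interface, `SimpleCsvEventParser` class, and the field "definition" list are all hypothetical names:

```java
import java.util.List;

// Hypothetical listener: one event per document, then one per field.
interface CsvEventListener {
    void newDocument();
    void newElement(String fieldName, String value);
}

class SimpleCsvEventParser {
    private final List<String> fieldDefinition; // ordered field names

    SimpleCsvEventParser(List<String> fieldDefinition) {
        this.fieldDefinition = fieldDefinition;
    }

    void parse(Iterable<String> lines, CsvEventListener listener) {
        listener.newDocument();                    // step 1: document event
        for (String line : lines) {
            String[] values = line.split(",", -1); // step 2: split into fields
            for (int i = 0; i < values.length && i < fieldDefinition.size(); i++) {
                // step 3: pair each value with its name from the definition
                listener.newElement(fieldDefinition.get(i), values[i]);
            }
        }
    }
}
```

The same listener could then be fed from a fixed-length-record reader instead of a line splitter, which is what would make the approach generic across legacy formats. (Note the naive `split(",")` here ignores quoting, the very problem the original post needed OpenCSV for.)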
Seb Cha replied on Fri, 2012/10/05 - 2:24pm
I agree with Jean Said
CSV is not a standard; it's not backed by the W3C. How do I declare a null field? How do I declare a blank field? How do I escape special characters? How do I know the character encoding?
And XML will only be verbose if you use elements instead of attributes.
Greg Brown replied on Fri, 2012/10/05 - 6:58pm
in response to Seb Cha:
It may not be a "formal" standard but it is a de facto standard.
"How do I declare a null field? How do I declare a blank field? How do I escape special characters? How do I know the character encoding?" These questions may be open to interpretation, but they are certainly not difficult to solve.
"XML will only be verbose if you use elements instead of attributes." It's true that XML will be less verbose with attributes, but it will still be far more verbose than CSV. In XML, the element and attribute names must be repeated for every record, in addition to the field values. In CSV, only the field values are required.
Don't get me wrong - I like XML and JSON, too. Each technology has its pros and cons - just choose the right tool for the job.
matt inger replied on Fri, 2012/10/05 - 8:48pm
There's also a handy StrTokenizer class in commons-lang3.
The tokenizer itself is quite configurable, allowing you to tell it what constitutes whitespace, what to do with empty tokens, what the delimiters are, etc. Though in most cases, just calling "getCSVInstance()" or "getTSVInstance()" is sufficient. It also implements the ListIterator interface.
The one downside is that it's line oriented, so it's up to the programmer to feed it a single record at a time (using a LineNumberReader or something of that nature).
Full disclosure: I am one of the authors of this class.
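Putting those two pieces together, a sketch of the line-at-a-time pattern described above, assuming `StrTokenizer` from `org.apache.commons.lang3.text` (the sample data is made up):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import org.apache.commons.lang3.text.StrTokenizer;

public class StrTokenizerCsvExample {
    public static void main(String[] args) throws IOException {
        String csv = "name,\"title, role\"\nmatt,developer";
        BufferedReader lines = new BufferedReader(new StringReader(csv));
        String line;
        // StrTokenizer is line oriented, so we feed it one record at a time.
        while ((line = lines.readLine()) != null) {
            // getCSVInstance handles quoted fields containing the delimiter.
            StrTokenizer tokenizer = StrTokenizer.getCSVInstance(line);
            System.out.println(tokenizer.getTokenList());
        }
    }
}
```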
Lund Wolfe replied on Sat, 2012/10/06 - 2:49pm
OpenCSV is an excellent choice. If there is an interoperable data standard, it is probably CSV or some text equivalent; that's usually what you see when a proprietary format offers an import/export option. CSV is often messy because it frequently originates from Excel, and you may get trailing commas for fake columns and trailing fake rows. There are lots of Excel users out there, so your expectations should be pretty low.
You do need to know your data, and clean it up and validate it accordingly. That's necessary for any conversion, and it's one of the biggest opportunities for improvement when moving data from one information system to another.