I have the following problem scenario:

  • I have a text file and I have to read it and split it into lines.
  • Some lines might need to be dropped (according to criteria that are not fixed).
  • The lines that are not dropped must be parsed into some predefined records.
  • Records that are not valid must be dropped.
  • Duplicate records may exist and, in such a case, they are consecutive. If duplicate / multiple records exist, only one item should be kept.
  • The remaining records should be grouped according to the value contained in one field; all records belonging to the same group appear one after another (e.g. AAAABBBBCCDEEEFF and so on).
  • The records of each group should be numbered (1, 2, 3, 4, ...). For each group the numbering starts from 1.
  • The records must then be saved somewhere / consumed in the same order as they were produced.

I have to implement this in Java or C++.

My first idea was to define functions / methods like:

  • One method to get all the lines from the file.
  • One method to filter out the unwanted lines.
  • One method to parse the filtered lines into valid records.
  • One method to remove duplicate records.
  • One method to group records and number them.

The problem is that the data I am going to read can be too big and might not fit into main memory: so I cannot just construct all these lists and apply my functions one after the other.

On the other hand, I think I do not need to fit all the data in main memory at once because once a record has been consumed all its underlying data (basically the lines of text between the previous record and the current record, and the record itself) can be disposed of.

With the little knowledge I have of Haskell I have immediately thought about some kind of lazy evaluation, in which instead of applying functions to lists that have been completely computed, I have different streams of data that are built on top of each other and, at each moment, only the needed portion of each stream is materialized in main memory.

But I have to implement this in Java or C++. So my question is which design pattern or other technique can allow me to implement this lazy processing of streams in one of these languages.

link|improve this question

75% accept rate
If homework, please tag as such. – Thorbjørn Ravn Andersen Apr 14 at 23:57
Do you have a database? – Emmad Kareem Apr 15 at 0:22
1  
@ThorbjørnRavnAndersen We've blacklisted [homework], here's why... – Yannis Rizos Apr 15 at 7:15
@Thorbjørn Ravn Andersen: It is not homework, it is a project I am working on. This pattern is occurring again and again and I am trying to find a more general solution / approach. – Giorgio Apr 15 at 8:07
@Emmad Kareem: I do not have a database and even if I had I would like to see the data as streams and process the data by composing functions on streams. I find it is a very elegant and efficient approach, but I am still learning how it works. – Giorgio Apr 15 at 8:08
show 11 more comments
feedback

1 Answer

up vote 5 down vote accepted

You should look into iterators if deciding for Java.

Write an Iterator<String> that reads a line from the file. Write a filtering iterator that accepts the above iterator in the constructor and only generate those lines you are interested in. Write a splitting Iterator<Record> that accepts a string iterator in the constructor, and splits each line into a record. And so on.

You will most likely find that you will do the processing in the "are there more?" section to get the logic right.

link|improve this answer
Sounds like a good idea: I had been thinking about more complex solutions but probably iterators (possibly with small adaptations) are what I need. Thanks for the answer! +1. – Giorgio Apr 15 at 8:27
1  
When you have experiences to share, please do. – Thorbjørn Ravn Andersen Apr 15 at 10:37
I have the following solution (sketch): special generic iterator interface with methods hasNext(), next() [get and consume], peekNext() [get, do not consume], and a generic class IteratorCombinator that produces a new iterator from a source iterator. The above-mentioned methods are all generic and based on a protected method produceNext() that each custom combinator must implement. As a front-end to the outermost iterator I have a class that produces a list from the first n elements. In this way, I removed most of the boilerplate. – Giorgio Apr 16 at 20:41
In my experience you do not need even that. Just implement the interface and do the hard work in hasNext(). – Thorbjørn Ravn Andersen Apr 20 at 3:10
feedback

Your Answer

 
or
required, but never shown

Not the answer you're looking for? Browse other questions tagged or ask your own question.