Goal

I have a large string (~10MB in ~400k lines). I want to "efficiently" split it on a delimiter and map the resulting sets of lines. "Efficiently" here means fast in wall time, even when RAM may be limited.

The delimiter is always a fixed string comprising a full line of the input. The delimiter always appears as the last line of the string. When multiple delimiters occur in sequence, I want to skip the empty values between them. Each set of lines between delimiters is also processed one line at a time, though the sets are often split further.

For example, given this string:

a: 1
b: blah
c: cats
a: 2
b: burgers
c: canoodling
done
d: 3
done
d: 4
done
d: 5
done
done
done
q: 6
x: xiph
z: 42
q: 7
x: xena
q: 8
z: 17
done

…I want to process the lines between each done independently, and (based on logic outside the scope of this question) produce something like:

[
  [
    { a:1, b:'blah', c:'cats' },
    { a:2, b:'burgers', c:'canoodling' }
  ],
  { d:3 },
  { d:4 },
  { d:5 },
  [
    { q:6, x:'xiph', z:42 },
    { q:7, x:'xena' },
    { q:8, z:17 }
  ]
]

Possible Approaches

The simplest approach is something like:

# skipping empty chunks in the map
large_string.split("done\n").map{ ... }.compact

# pre-rejecting empty chunks
large_string.split("done\n").reject(&:empty?).map{ ... }

...but that creates all the sub-string results at once, and so may not work well when RAM is limited.

As an alternative, I have currently coded up:

[].tap do |result|
  # the /m flag lets . match newlines, so each capture can span a whole chunk
  large_string.scan(/(.*?)^done\n/m) do |(match)|
    result << match unless match.empty?
  end
end

This feels gross to me, creating unnecessary wrapper arrays.
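
If the wrapper arrays are the main objection, the block parameter can be dropped in favor of the per-match $1 that String#scan sets inside its block (a minor variation on the same approach; it should not change the performance picture):

result = []
large_string.scan(/(.*?)^done\n/m) { result << $1 unless $1.empty? }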

Further, benchmarking it (when plenty of RAM is available) shows it to be about 8–10× slower than just split/map/compact.

I've considered using strscan to walk through the string, but this seems unnecessary.
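
For reference, a minimal StringScanner version might look like the sketch below (untested and unbenchmarked here; it assumes the same large_string input and "done\n" delimiter):

require 'strscan'

scanner = StringScanner.new(large_string)
result  = []
# scan_until returns everything up to and including the delimiter match
while chunk = scanner.scan_until(/^done\n/)
  chunk = chunk[0...-5]  # drop the trailing "done\n" (5 characters)
  result << chunk unless chunk.empty?
end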

This code is only about 5× slower than split (and requires me to rewrite my chunk processor to handle an array of lines instead of a single string):

delim = "done\n"
[].tap do |result|
  accum = []
  str.each_line do |line|
    if line==delim
      result << process(accum) unless accum.empty?
      accum=[]
    else
      accum << line
    end
  end
end
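
For comparison, the same line-grouping idea can also be expressed with Enumerable#chunk (a sketch under the same assumptions, with process standing in for the chunk processor above); consecutive delimiter lines collapse into a single run, so empty sets never reach process:

delim  = "done\n"
result = large_string.each_line
                     .chunk { |line| line == delim }
                     .reject { |is_delim, _| is_delim }
                     .map { |_, lines| process(lines) }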

Is there a better, more elegant way to do this that is performant when RAM is plentiful, but that degrades gracefully when the data set is large?

For the curious, I am parsing the results of command lists in MPD, as part of the ruby-mpd library. The size of the response above assumes fetching metadata for 30k songs, which feels like a reasonable upper limit. Most command lists will be far, far smaller (which tempts me to just use a naïve split). – Phrogz Mar 7 at 20:52

To clarify: all 3 approaches work as intended (performance aside), but you're trying to decide which is best, or whether there is a better method? – Phrancis Mar 7 at 20:57

@Phrancis Correct. They all extract the data correctly. The simple split benchmarks by far the fastest, but I am concerned with how it will perform in embedded environments where RAM is less than the 16GB of my development laptop. – Phrogz Mar 7 at 20:58

You might want to clarify this in your post. Also, the tag comparative-review would be appropriate in a case like this where you are comparing different approaches. – Phrancis Mar 7 at 21:01

Your third approach is the one I would use: read line by line and process the data when you see "done". – Barry Carter Mar 8 at 15:39
