Goal
I have a large string (~10MB in ~400k lines). I want to "efficiently" split it on a delimiter and map the resulting sets of lines. Efficiently is defined as something that is fast (wall time), even when RAM may be limited.
The delimiter is always a fixed string comprising a full line of the input. The delimiter always appears as the last line of the string. When multiple delimiters occur in sequence, I want to skip the interceding values. Each set of lines between the delimiters is also is processed one line at a time, though they are often further split into sets.
For example, given this string:
a: 1
b: blah
c: cats
a: 2
b: burgers
c: canoodling
done
d: 3
done
d: 4
done
d: 5
done
done
done
q: 6
x: xiph
z: 42
q: 7
x: xena
q: 8
z: 17
done
…I want to process the lines between each done
independently, and (based on logic outside the scope of this question) produce something like:
[
[
{ a:1, b:'blah', c:'cats' },
{ a:2, b:'burgers', c:'canoodling' }
],
{ d:3 },
{ d:4 },
{ d:5 },
[
{ q:6, x:'xiph', z:42 },
{ q:7, x:'xena' },
{ q:8, z:17 }
]
]
Possible Approaches
The simplest approach is something like:
# skipping empty lines in the map
large_string.split("done\n").map{ ... }.compact
# pre-rejecting empty lines
large_string.split("done\n").reject(&:empty?).map{ ... }
...but that creates all the sub-string results and possibly will not work well in a limited RAM situation.
As an alternative, I have currently coded up:
[].tap do |result|
large_string.scan(/(.*?)(?:^done\n)/) do |(match)|
result << match unless match.empty?
end
end
This feels gross to me, creating unnecessary wrapper arrays.
Further, benchmarking it (when plenty of RAM is available) shows it to be about 8–10× slower than just split/map/compact.
I've considered using strscan
to walk through the string, but this seems unnecessary.
This code performs only about 5× slower than split (and requires me to rewrite my chunk processor to handle lines instead of strings):
delim = "done\n"
[].tap do |result|
accum = []
str.each_line do |line|
if line==delim
result << process(accum) unless accum.empty?
accum=[]
else
accum << line
end
end
end
Is there a better, more elegant way to do this that is performant when RAM is available, but that degrades gracefully when the data set is large
ruby-mpd
library. The size of the response above assumes fetching metadata for 30k songs, which feels like a reasonable upper limit. Most command lists will be far, far smaller (which tempts me to just use a naïve split). – Phrogz Mar 7 at 20:52comparative-review
would be appropriate to use in a case like this where you are comparing different approaches. – Phrancis Mar 7 at 21:01