I have a strange issue with large files and bash. This is the context:
- I have a large file: 75G and 400,000,000+ lines (it is a log file, my bad, I let it grow).
- The first 10 characters of each line are a timestamp in the format YYYY-MM-DD.
- I want to split that file: one file per day.
I tried with the following script that did not work. My question is about this script not working, not alternative solutions.
while read line; do
new_file=${line:0:10}_file.log
echo "$line" >> $new_file
done < file.log
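For reference, ${line:0:10} is bash's substring expansion (offset 0, length 10), which is how new_file gets the date prefix. A minimal check, with a made-up sample line:

```shell
# sample line is invented for illustration; the expansion is the same as above
line='2011-03-02 12:34:56 some log message'
new_file=${line:0:10}_file.log
echo "$new_file"    # prints 2011-03-02_file.log
```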
After debugging, I found the problem in the new_file variable. This script:
while read line; do
new_file=${line:0:10}_file.log
echo $new_file
done < file.log | uniq -c
gives the result below (I put x's to keep the data confidential; the other characters are real). Notice the dh and the shorter strings:
...
27402 2011-xx-x4
27262 2011-xx-x5
22514 2011-xx-x6
17908 2011-xx-x7
...
3227382 2011-xx-x9
4474604 2011-xx-x0
1557680 2011-xx-x1
1 2011-xx-x2
3 2011-xx-x1
...
12 2011-xx-x1
1 2011-xx-dh
1 2011-xx-x1
1 208--
1 2011-xx-x1
1 2011-xx-dh
1 2011-xx-x1
...
It is not a problem with the format of my file: the script cut -c 1-10 file.log | uniq -c gives only valid timestamps. Interestingly, the corresponding part of the above output becomes, with cut ... | uniq -c:
3227382 2011-xx-x9
4474604 2011-xx-x0
5722027 2011-xx-x1
We can see that after the uniq count 4474604, my initial script failed.
Did I hit a limit in bash that I do not know about, did I find a bug in bash (it seems unlikely), or have I done something wrong?
Update:
The problem happens after reading 2G of the file. It seems that read and redirection do not like files larger than 2G. But I am still searching for a more precise explanation.
Update2:
It definitely looks like a bug. It can be reproduced with:
yes "0123456789abcdefghijklmnopqrs" | head -n 100000000 > file
while read line; do file=${line:0:10}; echo $file; done < file | uniq -c
but this works fine as a workaround (it seems that I found a useful use of cat):
cat file | while read line; do file=${line:0:10}; echo $file; done | uniq -c
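On a small file the two forms agree; the misbehavior only appears once more than 2 GiB has been read through the redirection. A quick sanity check (demo.log and its contents are invented here):

```shell
# on a small file, the redirected loop and the cat-piped loop give
# identical counts; only past 2 GiB does the redirected form misbehave
printf '2011-01-01 a\n2011-01-01 b\n2011-01-02 c\n' > demo.log
while read line; do echo "${line:0:10}"; done < demo.log | uniq -c
cat demo.log | while read line; do echo "${line:0:10}"; done | uniq -c
```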
A bug has been filed with GNU and Debian. Affected versions are bash 4.1.5 on Debian Squeeze 6.0.2 and 6.0.4.
echo ${BASH_VERSINFO[@]}
4 1 5 1 release x86_64-pc-linux-gnu
Update3:
Thanks to Andreas Schwab, who reacted quickly to my bug report, here is the patch that fixes this misbehavior. The impacted file is lib/sh/zread.c, as Gilles pointed out sooner:
diff --git a/lib/sh/zread.c b/lib/sh/zread.c
index 0fd1199..3731a41 100644
--- a/lib/sh/zread.c
+++ b/lib/sh/zread.c
@@ -161,7 +161,7 @@ zsyncfd (fd)
      int fd;
 {
   off_t off;
-  int r;
+  off_t r;

   off = lused - lind;
   r = 0;
The r variable is used to hold the return value of lseek. As lseek returns the offset from the beginning of the file, when it is over 2GB, the int value is negative, which causes the test if (r >= 0) to fail where it should have succeeded.
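The overflow can be sketched in shell arithmetic (the offset value here is assumed for illustration): an off_t offset of 2 GiB is one past INT_MAX, so truncating it to a 32-bit two's-complement int yields a negative number, and the (r >= 0) test fails.

```shell
# simulate storing a 64-bit off_t into a 32-bit int, as the buggy zread.c did
off=2147483648                                    # 2 GiB = 2^31, one past INT_MAX
r=$(( (off & 0xFFFFFFFF) - ( (off & 0x80000000) ? 0x100000000 : 0 ) ))
echo "off_t offset : $off"
echo "as 32-bit int: $r"                          # -2147483648
if (( r >= 0 )); then echo "if (r >= 0): succeeds"; else echo "if (r >= 0): fails"; fi
```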