Remove extra header lines from file, except for the first line

Question

I have a file that looks like this toy example. My actual file has 4 million lines, about 10 of which I need to delete.

ID  Data1  Data2
1    100    100
2    100    200
3    200    100
ID  Data1  Data2
4    100    100
ID  Data1  Data2
5    200    200

I want to delete the lines that look like the header, except for the first line.

Final file:

ID  Data1  Data2
1    100    100
2    100    200
3    200    100
4    100    100
5    200    200

How can I do this?

Stéphane Chazelas · Accepted Answer · 2016-01-27 13:04:28Z

13 votes · accepted

header=$(head -n 1 file)
(printf "%s\n" "$header";
 tail -n +2 file|grep -vFxe "$header"
) > newfile

grab the header line into a variable
print the header
print everything except the first line, and omit lines that look like the header
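
The three steps above can be tried end to end on the toy data from the question. This is a minimal sketch, not part of the original answer: the scratch directory and the names file and newfile are placeholders, and it assumes mktemp is available.

```shell
# Scratch copy of the toy input from the question (paths are placeholders).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'ID  Data1  Data2' \
  '1    100    100' \
  '2    100    200' \
  '3    200    100' \
  'ID  Data1  Data2' \
  '4    100    100' \
  'ID  Data1  Data2' \
  '5    200    200' > "$tmpdir/file"

# Step 1: grab the header line into a variable.
header=$(head -n 1 "$tmpdir/file")

# Steps 2 and 3: print the header once, then every later line that is not
# an exact (-x), literal (-F) match for it.
{
  printf '%s\n' "$header"
  tail -n +2 "$tmpdir/file" | grep -vFxe "$header"
} > "$tmpdir/newfile"

cat "$tmpdir/newfile"
```

Because of -F and -x, a header containing regex metacharacters is still compared as plain text, and a line that merely contains the header as a substring is kept.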

edited 9 hours ago by Stéphane Chazelas

answered yesterday by Jeff Schaller

or perhaps { IFS= read -r head; printf '%s\n' "$head"; grep -vF "$head" ; } <file – 1_CR yesterday

or { head -n 1; grep -vF 'ID' ; } <infile – don_crissti 21 hours ago

Both good additions. Thanks to don_crissti for indirectly pointing out that POSIX recently removed the -1 syntax from head, in favor of -n 1. – Jeff Schaller 21 hours ago

@JeffSchaller, recently as in 12 years ago. And head -1 has been obsoleted for decades before that. – Stéphane Chazelas 9 hours ago

bkmoney · Answer 2 · 2016-01-26 19:09:16Z

24 votes

You can use

sed '2,${/ID/d;}'

This will delete lines with ID starting from line 2.
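
Since /ID/ matches anywhere in a line, it can also remove data rows. The sketch below contrasts it with a pattern anchored to the whole header line; the row containing VOID is a made-up example, not from the question's data.

```shell
# Hypothetical input: the last row contains "ID" inside a data field.
input=$(printf '%s\n' \
  'ID  Data1  Data2' \
  '1    100    100' \
  'ID  Data1  Data2' \
  '2    VOID   200')

# Unanchored: deletes every line from line 2 on that contains "ID" anywhere.
loose=$(printf '%s\n' "$input" | sed '2,${/ID/d;}')

# Anchored to the whole header line: only exact repeats are deleted.
strict=$(printf '%s\n' "$input" | sed '2,${/^ID  Data1  Data2$/d;}')

printf 'unanchored:\n%s\n\nanchored:\n%s\n' "$loose" "$strict"
```

For the question's purely numeric data the two commands give the same result; the anchored form only matters when data rows might contain the text ID.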

edited yesterday

answered yesterday by bkmoney

nice; or to be more specific with the pattern matching, sed '2,${/^ID Data1 Data2$/d;}' file (using the right number of spaces between the columns, of course) – Jeff Schaller yesterday

Hm I thought you could omit the semicolon for only 1 command, but ok. – bkmoney yesterday

Not w/ sane seds, no. – mikeserv yesterday

aaaand -i for the in-place edit win. – user2066657 23 hours ago

Or sed '1!{/ID/d;}' – Stéphane Chazelas 9 hours ago

val0x00ff · Answer 3 · 2016-01-26 19:34:38Z

7 votes

For those who do not like curly brackets

sed -e '1n' -e '/^ID/d'

1n prints line 1 unchanged and moves on to the next line
/^ID/d deletes every line that starts with ID (the ^ anchors the match to the start of the line)

edited yesterday by val0x00ff

answered yesterday by Costas

This can also be shortened to sed '1n;/^ID/d' filename. Just a suggestion. – val0x00ff yesterday

Note that this will also delete lines like IDfoo which are not the same as the header (unlikely to make a difference in this case, but you never know). – terdon♦ 9 hours ago

Wildcard · Answer 4 · 2016-01-27 03:25:40Z

Here's a fun one. You can use sed directly to strip all copies of the first line out and leave everything else in place (including the first line itself).

sed '1{h;n;};G;/^\(.*\)\n\1$/d;s/\n.*$//' input

1{h;n;} copies the first line into the hold space; n then prints it (the default behaviour) and reads the next line, so the remaining commands never run on the first line itself.

G appends a newline followed by the contents of the hold space to the pattern space.

/^\(.*\)\n\1$/d deletes the contents of the pattern space (thus skipping to the next line) if the portion after the newline (i.e. what was appended from the hold space) exactly matches the portion before the newline. This is where lines that duplicate the header will get deleted.

s/\n.*$// deletes the portion of text that was added by the G command, so that what gets printed is just the line of text from the file.
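
One consequence of comparing through the \1 back-reference is that the header is matched as literal text, not as a pattern. A small sketch with a made-up header containing a regex metacharacter illustrates this; the input lines are hypothetical.

```shell
# Hypothetical input whose header "a.c" contains the metacharacter ".".
input=$(printf '%s\n' 'a.c' 'abc' 'a.c' 'xyz')

# Same sed program as above, reading from a pipe instead of a file.
result=$(printf '%s\n' "$input" | sed '1{h;n;};G;/^\(.*\)\n\1$/d;s/\n.*$//')

printf '%s\n' "$result"
```

The line abc survives, whereas a command such as sed '2,${/a.c/d;}' would treat the dot as "any character" and delete it as well.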

Output when given your input is:

ID  Data1  Data2
1    100    100
2    100    200
3    200    100
4    100    100
5    200    200

Serg · Answer 5 · 2016-01-27 06:35:11Z

AWK is a decent tool for this purpose as well. Here's a sample run:

$ awk 'NR == 1 {print} NR != 1 && $0!~/ID  Data1  Data2/' rmLines.txt | head -n 10
ID  Data1  Data2
1    100    100
     100    200
3    200    100
1    100    100
     100    200
3    200    100
1    100    100
     100    200
3    200    100

Break down:

NR == 1 {print} tells AWK to print the first line of the text file.
NR != 1 && $0!~/ID  Data1  Data2/ uses the logical operator && to select every line whose line number is not 1 and which doesn't contain ID  Data1  Data2. Note the lack of a {print} part; in awk, if a test condition evaluates to true and no action is given, the line is printed.
| head -n 10 is just a small addition to limit the output to the first 10 lines. It is not relevant to the AWK part itself and is only used for demo purposes.
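
The regex test above matches any line that contains the header text. A possible variant, sketched here as an assumption rather than as the answer's own command, stores the first line in a variable (named header for illustration) and compares whole lines instead. It assumes the header is ordinary text and not a bare number, so that awk compares the two as strings.

```shell
# Hypothetical input: the last line starts with the header text but is longer.
input=$(printf '%s\n' \
  'ID  Data1  Data2' \
  '1    100    100' \
  'ID  Data1  Data2' \
  'ID  Data1  Data2 foo')

# Remember line 1, print it, then print only lines that differ from it.
result=$(printf '%s\n' "$input" |
  awk 'NR == 1 { header = $0; print; next } $0 != header')

printf '%s\n' "$result"
```

With this comparison the exact repeat is dropped, while the longer line ending in foo is kept.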

If you want that in a file, redirect the output of the command by appending > newFile.txt at the end of command, like so:

awk 'NR == 1 {print} NR != 1 && $0!~/ID  Data1  Data2/' rmLines.txt > newFile.txt

How does it hold up? Pretty well, actually:

$ time awk 'NR == 1 {print} NR != 1 && $0!~/ID  Data1  Data2/' rmLines.txt > /dev/null
    0m3.60s real     0m3.53s user     0m0.06s system

Side note

The sample file was generated with a for loop running from one to one million, printing four lines modelled on the start of your file on each iteration (4 lines times 1 million equals 4 million lines). Generating it took 0.09 seconds, by the way.

awk 'BEGIN{ for(i=1;i<=1000000;i++) printf("ID  Data1  Data2\n1    100    100\n     100    200\n3    200    100\n");  }' > rmLines.txt

Note that this will also delete lines like ID  Data1  Data2 foo which are not the same as the header (unlikely to make a difference in this case, but you never know). – terdon♦ 9 hours ago

@terdon yes, exactly right. The OP, however, specified only one pattern they want to remove, and their example appears to support that. – Serg 8 hours ago

terdon · Answer 6 · 2016-01-27 12:38:42Z

Here are a couple more choices that don't require you to know the first line in advance:

perl -ne 'print unless $_ eq $k; $k=$_ if $.==1;' file

The -n flag tells perl to loop over its input file, saving each line as $_. The $k=$_ if $.==1; saves the first line ($. is the line number, so $.==1 will only be true for the 1st line) as $k. The print unless $_ eq $k prints the current line if it isn't the same as the one saved in $k. On the first line $k is not yet set, so the header itself is printed before being saved.

Alternatively, the same thing in awk:

awk '(NR==1){a[$0]++; print}!a[$0]' file

Here, the 1st line is saved in the array a and printed. Then, all lines that aren't in a (!a[$0]) are printed because !a[$0] will evaluate to true for them and the default action for awk on true expressions is to print.
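
Because nothing about the header is hard-coded, the same idea can be wrapped in a small reusable filter. The sketch below is a variation, not the command above: the function name dedupe_header and the CSV sample are made up for illustration, and it tests membership with ($0 in seen) because merely referencing a[$0] creates an array element for every distinct line, which could use a lot of memory on a 4-million-line file.

```shell
# Hypothetical helper: keep the first line, drop later exact copies of it.
dedupe_header() {
  awk 'NR == 1 { seen[$0] = 1; print; next } !($0 in seen)'
}

# Works on any header, here a made-up CSV one.
result=$(printf '%s\n' 'name,value' 'a,1' 'name,value' 'b,2' | dedupe_header)

printf '%s\n' "$result"
```

It reads standard input, so it can be used as dedupe_header < file > newfile.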

I like the idea of not having to know the first line, since it makes this a generalized script for your toolbox. – Mark Stewart 30 mins ago
