Unix & Linux Stack Exchange is a question and answer site for users of Linux, FreeBSD and other Un*x-like operating systems. It's 100% free, no registration required.

I have a file that looks like this toy example. My actual file has 4 million lines, about 10 of which I need to delete.

ID  Data1  Data2
1    100    100
2    100    200
3    200    100
ID  Data1  Data2
4    100    100
ID  Data1  Data2
5    200    200

I want to delete the lines that look like the header, except for the first line.

Final file:

ID  Data1  Data2
1    100    100
2    100    200
3    200    100
4    100    100
5    200    200

How can I do this?

up vote 13 down vote accepted
header=$(head -n 1 file)
(printf "%s\n" "$header";
 tail -n +2 file|grep -vFxe "$header"
) > newfile
  1. grab the header line into a variable
  2. print the header
  3. print everything except the first line, and omit lines that look like the header
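To try the three steps on the toy data from the question, here is a minimal sketch. The scratch directory and the file names inside it are made up for the demo; only the head/tail/grep pipeline comes from the answer.

```shell
#!/bin/sh
# Scratch directory for the demo (hypothetical location, removed on exit).
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT

# Recreate the toy input from the question.
printf '%s\n' \
  'ID  Data1  Data2' \
  '1    100    100' \
  '2    100    200' \
  '3    200    100' \
  'ID  Data1  Data2' \
  '4    100    100' \
  'ID  Data1  Data2' \
  '5    200    200' > "$tmp/file"

# 1. grab the header line into a variable
header=$(head -n 1 "$tmp/file")
{
  # 2. print the header once
  printf '%s\n' "$header"
  # 3. print the rest, dropping lines identical to the header
  #    (-F fixed string, -x whole line, -v invert the match)
  tail -n +2 "$tmp/file" | grep -vFxe "$header"
} > "$tmp/newfile"

result=$(cat "$tmp/newfile")
printf '%s\n' "$result"
```

Because of -x, only lines that match the header in full are dropped; a line that merely contains the header text is kept.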
or perhaps { IFS= read -r head; printf '%s\n' "$head"; grep -vF "$head" ; } <file – 1_CR yesterday
or { head -n 1; grep -vF 'ID' ; } <infile – don_crissti 21 hours ago
    
Both good additions. Thanks to don_crissti for indirectly pointing out that posix recently removed -1 syntax from head, in favor of -n 1. – Jeff Schaller 21 hours ago
@JeffSchaller, recently as in 12 years ago. And head -1 has been obsoleted for decades before that. – Stéphane Chazelas 9 hours ago

You can use

sed '2,${/ID/d;}'

This will delete lines with ID starting from line 2.
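Since /ID/ matches any line containing ID, a stricter variant anchors the whole header. This is only a sketch: the sample input, including the IDfoo line, is invented to show the difference, and the spacing in the pattern has to match your real header.

```shell
#!/bin/sh
# Hypothetical sample: a header, a repeated header, and a data line
# that merely starts with "ID".
input=$(printf '%s\n' \
  'ID  Data1  Data2' \
  '1    100    100' \
  'ID  Data1  Data2' \
  'IDfoo  7  7' \
  '2    100    200')

# From line 2 to the last line ($), delete only lines that are
# exactly the header; ^ and $ anchor the match to the whole line.
sed_out=$(printf '%s\n' "$input" | sed '2,${/^ID  Data1  Data2$/d;}')
printf '%s\n' "$sed_out"
```

With the unanchored /ID/ the IDfoo line would be deleted as well; here it survives.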

nice; or to be more specific with the pattern matching, sed '2,${/^ID Data1 Data2$/d;}' file (using the right number of spaces between the columns, of course) – Jeff Schaller yesterday
    
Hm I thought you could omit the semicolon for only 1 command, but ok. – bkmoney yesterday
    
Not w/ sane seds, no. – mikeserv yesterday
    
aaaand -i for the in-place edit win. – user2066657 23 hours ago
Or sed '1!{/ID/d;}' – Stéphane Chazelas 9 hours ago

For those who do not like curly brackets

sed -e '1n' -e '/^ID/d'
  • 1n prints line 1 as-is and moves on to the next line, so the header itself is never tested against the pattern
  • /^ID/d deletes every later line that starts with ID
This can also be shortened to sed '1n;/^ID/d' filename. just a suggestion – val0x00ff yesterday
    
Note that this will also print lines like IDfoo which are not the same as the header (unlikely to make a difference in this case, but you never know). – terdon 9 hours ago

Here's a fun one. You can use sed directly to strip all copies of the first line out and leave everything else in place (including the first line itself).

sed '1{h;n;};G;/^\(.*\)\n\1$/d;s/\n.*$//' input

1{h;n;} copies the first line into the hold space; n then prints it (printing is the default) and reads the next line, so the remaining sed commands never run against the header itself.

G appends a newline followed by the contents of the hold space to the pattern space.

/^\(.*\)\n\1$/d deletes the contents of the pattern space (thus skipping to the next line) if the portion after the newline (i.e. what was appended from the hold space) exactly matches the portion before the newline. This is where lines that duplicate the header will get deleted.

s/\n.*$// deletes the portion of text that was added by the G command, so that what gets printed is just the line of text from the file.

Output when given your input is:

ID  Data1  Data2
1    100    100
2    100    200
3    200    100
4    100    100
5    200    200
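To check that only exact copies of the first line are removed, here is a small sketch with an invented input. The line ending in foo is not in the question's data; it is there only to show that a line which extends the header is kept.

```shell
#!/bin/sh
# Hypothetical sample: header, one exact repeat, one near-repeat.
input=$(printf '%s\n' \
  'ID  Data1  Data2' \
  '1    100    100' \
  'ID  Data1  Data2' \
  'ID  Data1  Data2 foo' \
  '2    100    200')

# Same sed program as above, reading from a pipe instead of a file.
hold_out=$(printf '%s\n' "$input" |
  sed '1{h;n;};G;/^\(.*\)\n\1$/d;s/\n.*$//')
printf '%s\n' "$hold_out"
```

The back-reference \1 has to match everything after the newline, so the line with the extra foo fails the comparison and is printed.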

AWK is also a decent tool for this purpose. Here's a sample run:

$ awk 'NR == 1 {print} NR != 1 && $0!~/ID  Data1  Data2/' rmLines.txt | head -n 10                                
ID  Data1  Data2
1    100    100
     100    200
3    200    100
1    100    100
     100    200
3    200    100
1    100    100
     100    200
3    200    100

Break down:

  • NR == 1 {print} prints the first line of the text file
  • NR != 1 && $0!~/ID  Data1  Data2/ uses the logical operator && to print every line whose line number is not 1 and which doesn't contain ID  Data1  Data2. Note the lack of a {print} part; in awk, if a condition evaluates to true and no action is given, the line is printed.
  • | head -n 10 just limits the output to the first 10 lines. It's not relevant to the AWK part itself and is only used for the demo.

If you want that in a file, redirect the output by appending > newFile.txt to the end of the command, like so:

awk 'NR == 1 {print} NR != 1 && $0!~/ID  Data1  Data2/' rmLines.txt > newFile.txt

How does it hold up? Pretty well, actually:

$ time awk 'NR == 1 {print} NR != 1 && $0!~/ID  Data1  Data2/' rmLines.txt > /dev/null                            
    0m3.60s real     0m3.53s user     0m0.06s system

Side note

The sample file was generated with a for loop running from one to a million, printing the first four lines of your file on each pass (so 4 lines times a million equals 4 million lines). That took 0.09 seconds, by the way.

awk 'BEGIN{ for(i=1;i<=1000000;i++) printf("ID  Data1  Data2\n1    100    100\n     100    200\n3    200    100\n");  }' > rmLines.txt
    
Note that this will also print lines like ID Data1 Data2 foo which are not the same as the header (unlikely to make a difference in this case, but you never know). – terdon 9 hours ago
    
@terdon yes, exactly right. OP however specified only one pattern they want to remove and his example appears to support that – Serg 8 hours ago

Here are a couple more choices that don't require you to know the first line in advance:

perl -ne 'print unless $_ eq $k; $k=$_ if $.==1' file

The -n flag tells perl to loop over its input file, saving each line as $_. The $k=$_ if $.==1 saves the first line ($. is the line number, so $.==1 will only be true for the 1st line) as $k. The print unless $_ eq $k prints the current line if it isn't the same as the one saved in $k. On the first line $k is still unset when the test runs, so the header itself is printed.

Alternatively, the same thing in awk:

awk '(NR==1){a[$0]++; print}!a[$0]' file 

Here, the 1st line is saved in the array a and printed. Then, all lines that aren't in a (!a[$0]) are printed because !a[$0] will evaluate to true for them and the default action for awk on true expressions is to print.
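A possible variant, not from the answer above, keeps the header in a plain variable instead of an array. Testing !a[$0] creates an array entry for every distinct line it sees, which may matter on a file with millions of lines. The variable name header and the sample input are made up for this sketch.

```shell
#!/bin/sh
# Hypothetical sample: header, one exact repeat, one near-repeat.
input=$(printf '%s\n' \
  'ID  Data1  Data2' \
  '1    100    100' \
  'ID  Data1  Data2' \
  'ID  Data1  Data2 foo' \
  '2    100    200')

# Line 1: remember it (the "" forces a string, so the later
# comparison is a string comparison), print it, skip to next line.
# Other lines: print when they differ from the stored header.
awk_out=$(printf '%s\n' "$input" |
  awk 'NR == 1 { header = $0 ""; print; next } $0 != header')
printf '%s\n' "$awk_out"
```

As with the array version, you don't need to know the header in advance, and only exact copies of it are dropped.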

    
I like not having to know the first line idea since it makes it a generalized script for your toolbox. – Mark Stewart 30 mins ago
