Unix & Linux Stack Exchange is a question and answer site for users of Linux, FreeBSD and other Un*x-like operating systems.

I have found a long list of free email providers that I want to remove from my email lists - https://gist.github.com/tbrianjones/5992856

Below are two commands I currently use that do the same job for a handful of domain entries. How can I convert these to import the domains from another file (remove.txt, for example) rather than adding them all manually?

ruby -rcsv -i -ne 'row = CSV::parse_line($_); puts $_ unless row[2] =~ /gmail|hotmail|qq.com|yahoo|live.com|comcast.com|icloud.com|aol.co/i' All.txt

sed -i '/^[^,]*,[^,]*hotmail/d' All.txt

Below is a line of the data we will be using this on

"fox*******","scott@sc***h.com","821 Ke****on Rd","Neenah","Wisconsin","54***6","UNITED STATES"

2 Answers

Accepted answer:

Two steps:

  1. create a remover script (AUX) with print unless m!gmail.com|hotmail.com|...! (the regular expression is huge, but that is not a problem)
  2. apply it to All.txt

Code:

perl -n0E 's/\n/|/g; s/\|$//; say "print unless m!\\b($_)\\b!"' remove.txt > AUX
perl -n AUX    All.txt > outfile

Update1: to be case-insensitive, add an i to the match operator:

perl -n0E 's/\n/|/g; s/\|$//; say "print unless m!\\b($_)\\b!i"' remove.txt > AUX

Update2: to remove extra domains, create a new file with the additional list (extra.txt) and:

cat remove.txt extra.txt | 
  perl -n0E 's/\n/|/g; s/\|$//; say "print unless m!\\b($_)\\b!i"' > AUX
perl -n AUX   All.txt > outfile
    
Thank you, this works great and ran instantly, but it is not matching capitals (YAhoO.cOm, for example). How can I handle that? –  Teddy291 Jul 18 at 21:28
    
@Teddy291, Please check the update. –  JJoao Jul 18 at 23:42
    
Thank you, sorry to bother you again but I have one more request. If I wanted to remove privacyprotect.org but only "privacy" is in the remove.txt file how would I include that? –  Teddy291 Jul 19 at 0:31
    
@Teddy291, in that case you could (1) add this new domain to remove.txt or (2) create an extra.txt and do cat remove.txt extra.txt | perl -n0E '.......\n' > AUX –  JJoao Jul 19 at 8:16
    
The line "privacy" is in remove.txt or extra.txt, yet emails containing domainnameprivacyemail.com (for example [email protected]) are not being removed. –  Teddy291 Jul 19 at 11:31
{   sed -ne's/./^[^,]*,[^,]*&/p' | 
    grep -vf- ./All.txt 
}   <./remove.txt >./outfile

Is what I think you are asking about. I'm not sure how it is relevant to ruby or to the line of data you're talking about...

If you want the matches to be case-insensitive then just add the -i (ignore case) option to grep like:

{   sed -ne's/./^[^,]*,[^,]*&/p' | 
    grep -ivf- ./All.txt 
}   <./remove.txt >./outfile
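A small self-contained run of this pipeline with invented sample files (GNU grep's -f - reads the patterns sed emits on the pipe; the real remove.txt and All.txt are of course much larger):

```shell
# Assumed sample inputs standing in for the real files.
printf 'gmail\nhotmail\n' > remove.txt
printf '"a","x@GMail.com","r1"\n"b","y@corp.example","r2"\n' > All.txt

# sed prefixes each pattern with ^[^,]*,[^,]* so it can only match inside
# the second comma-separated field; grep -v keeps the non-matching lines,
# and -i makes the match case-insensitive.
{   sed -ne's/./^[^,]*,[^,]*&/p' |
    grep -ivf- ./All.txt
}   <./remove.txt >./outfile
cat outfile
```

The GMail.com row is dropped despite its mixed case; only the corp.example row remains in outfile.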
    
Hi, this doesn't seem to work if they have capitals in the email (YAhoO.com for example) –  Teddy291 Jul 18 at 2:33
    
@Teddy291 - Use grep -ivf- –  mikeserv Jul 18 at 2:35
    
Thanks. Do you know if there are any quicker ways of doing this? The file is over 80MB in size and I expect it to be down to 20-30MB after processing, it's been running for 12 hours and has only outputted 1MB. There are 3618 keywords to delete. –  Teddy291 Jul 18 at 18:12
    
@Teddy291 - Wow. That's pretty awful. Yes. Quicker would be to use multiple greps on a subset of the match list. Quicker still would be to use -F (fixed-string) matches, but in that case you wouldn't be able to anchor and could only do: grep . remove.txt | grep -ivFf- All.txt, though in that case you'd drop all occurrences of any match in remove.txt no matter where it occurred in your string. More safely, you might split out your file to only the matchable bits: sed -ne's/[^@]*@\([^,]*\).*/\1/p' <All.txt | grep -iFxnf remove.txt | sed 's/:.*/d/' | sed -f- All.txt >out –  mikeserv Jul 18 at 18:23
    
If you did the above in 5-10 passes, say on 5-10 smaller chunks of remove.txt, you could probably finish the deal in 5 minutes or so. The more you can reduce the number of matches grep has to check against per line the better. –  mikeserv Jul 18 at 18:31
