Unix & Linux Stack Exchange is a question and answer site for users of Linux, FreeBSD and other Un*x-like operating systems.

I have found a long list of free email providers that I want to remove from my email lists - https://gist.github.com/tbrianjones/5992856

Below are two commands I currently use that do the same job for a handful of domain entries. How can I convert these to import the domains from another file (remove.txt, for example) rather than adding them all manually?

ruby -rcsv -i -ne 'row = CSV::parse_line($_); puts $_ unless row[2] =~ /gmail|hotmail|qq.com|yahoo|live.com|comcast.com|icloud.com|aol.co/i' All.txt

sed -i '/^[^,]*,[^,]*hotmail/d' All.txt

Below is a line of the data we will be using this on

"fox*******","scott@sc***h.com","821 Ke****on Rd","Neenah","Wisconsin","54***6","UNITED STATES"

2 Answers

Accepted answer:

Two steps:

  1. create a remover script (AUX) with print unless m!gmail.com|hotmail.com|...! (the regular expression is huge, but that is not a problem)
  2. apply it to All.txt

Code:

perl -n0E 's/\n/|/g; s/\|$//; say "print unless m!\\b($_)\\b!"' remove.txt > AUX
perl -n AUX    All.txt > outfile

Update1: to be case-insensitive, add an i to the match operator:

perl -n0E 's/\n/|/g; s/\|$//; say "print unless m!\\b($_)\\b!i"' remove.txt > AUX

Update2: to remove extra domains, create a new file with the additional list (extra.txt) and:

cat remove.txt extra.txt | 
  perl -n0E 's/\n/|/g; s/\|$//; say "print unless m!\\b($_)\\b!i"' > AUX
perl -n AUX   All.txt > outfile
    
Thank you, this works great and ran instantly, but it is not matching capitals (YAhoO.cOm, for example). How can I handle that? –  Teddy291 Jul 18 at 21:28
    
@Teddy291, Please check the update. –  JJoao Jul 18 at 23:42
    
Thank you, sorry to bother you again but I have one more request. If I wanted to remove privacyprotect.org but only "privacy" is in the remove.txt file how would I include that? –  Teddy291 Jul 19 at 0:31
    
@Teddy291, in that case you could (1) add this new domain to remove.txt or (2) create an extra.txt and do cat remove.txt extra.txt | perl -n0E '.......\n' > AUX –  JJoao Jul 19 at 8:16
    
The line "privacy" is in remove.txt or extra.txt, yet emails containing domainnameprivacyemail.com (for example [email protected]) are not being removed. –  Teddy291 Jul 19 at 11:31
{   sed -ne's/./^[^,]*,[^,]*&/p' | 
    grep -vf- ./All.txt 
}   <./remove.txt >./outfile

Is what I think you are asking about. I'm not sure how it is relevant to ruby or to the line of data you're talking about...

If you want the matches to be case-insensitive then just add the -i (ignore case) option to grep like:

{   sed -ne's/./^[^,]*,[^,]*&/p' | 
    grep -ivf- ./All.txt 
}   <./remove.txt >./outfile
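A small self-contained run of this pipeline with invented sample files (GNU grep's -f - reads the patterns sed emits on the pipe; the real remove.txt and All.txt are of course much larger):

```shell
# Assumed sample inputs standing in for the real files.
printf 'gmail\nhotmail\n' > remove.txt
printf '"a","x@GMail.com","r1"\n"b","y@corp.example","r2"\n' > All.txt

# sed prefixes each pattern with ^[^,]*,[^,]* so it can only match inside
# the second comma-separated field; grep -v keeps the non-matching lines,
# and -i makes the match case-insensitive.
{   sed -ne's/./^[^,]*,[^,]*&/p' |
    grep -ivf- ./All.txt
}   <./remove.txt >./outfile
cat outfile
```

The GMail.com row is dropped despite its mixed case; only the corp.example row remains in outfile.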
    
Hi, this doesn't seem to work if they have capitals in the email (YAhoO.com for example) –  Teddy291 Jul 18 at 2:33
    
@Teddy291 - Use grep -ivf- –  mikeserv Jul 18 at 2:35
    
Thanks. Do you know if there are any quicker ways of doing this? The file is over 80MB in size and I expect it to be down to 20-30MB after processing, it's been running for 12 hours and has only outputted 1MB. There are 3618 keywords to delete. –  Teddy291 Jul 18 at 18:12
    
@Teddy291 - Wow. That's pretty awful. Yes. Quicker would be to use multiple greps on a subset of the match list. Quicker still would be to use -F (fixed-string) matches, but in that case you wouldn't be able to anchor and could only do: grep . remove.txt | grep -ivFf- All.txt, though in that case you'd drop all occurrences of any match in remove.txt no matter where it occurred in your string. More safely, you might split out your file to only the matchable bits: sed -ne's/[^@]*@\([^,]*\).*/\1/p' <All.txt | grep -iFxnf remove.txt | sed 's/:.*/d/' | sed -f- All.txt >out –  mikeserv Jul 18 at 18:23
    
If you did the above in 5-10 passes, say on 5-10 smaller chunks of remove.txt, you could probably finish the deal in 5 minutes or so. The more you can reduce the number of matches grep has to check against per line the better. –  mikeserv Jul 18 at 18:31
