The duplicates are the same text written in different letter case.

I need to count the duplicates case-insensitively and then remove them, keeping the case variant that occurs most often.

Example input:

hot chocolate
hot chocolate
hot chocolate
Hot Chocolate
Hot Chocolate
Hot Chocolate
Hot Chocolate
Hot Chocolate
Xicolatada
Xicolatada
Xicolatada
Xicolatada
XICOLATADA
XICOLATADA

Should become:

Hot Chocolate, 8
Xicolatada, 6

This question is similar to this one, but I need to count case-insensitively and keep the case variant that occurs most often.

Just curious, has anyone here ever needed to search for a string but only return whichever version of that string has the most instances based on case?!? this just seems like purely academic hoops that people are made to jump through in school and maybe would never, ever be needed in the real world! –  Baazigar 16 hours ago

6 Answers

up vote 4 down vote accepted

There's also uniq --ignore-case --count | sort --numeric-sort --reverse:

uniq -ic /tmp/foo.txt | sort -nr
      8 hot chocolate
      6 Xicolatada

To swap the two fields and put a comma between them, append:

... | sed -e 's/^ *\([0-9]*\) \(.*\)/\2, \1/'
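
Put together, the pieces above might look like the following sketch. The function name count_ci is hypothetical, and the leading sort -f is an assumption added here: uniq only merges adjacent lines, so the input has to be grouped first unless it already is, as in the sample.

```shell
# count_ci FILE...: hypothetical helper, not part of the answer above.
count_ci() {
    # sort -f brings all case variants of a line together,
    # uniq -ic collapses each group case-insensitively and prefixes the count,
    # sort -nr puts the largest count first,
    # sed turns "      8 text" into "text, 8".
    LC_ALL=C sort -f "$@" |
        uniq -ic |
        sort -nr |
        sed -e 's/^ *\([0-9]*\) \(.*\)/\2, \1/'
}
```

On the sample input, count_ci /tmp/foo.txt reports the totals 8 and 6. The spelling shown for each group is whichever variant comes first after sorting, which is not necessarily the most frequent one.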

I would use tolower() to make all the items lowercase. Then it is a matter of storing them in an array a[] and then printing the results:

$ awk '{a[tolower($0)]++} END {for (i in a) print i, a[i]}' file
xicolatada 6
hot chocolate 8

To have the output in comma-separated format, add -v OFS=,.
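
As a sketch of that variant, here is the same one-liner with OFS set to a comma plus a space. The wrapper name count_lower and the trailing sort are assumptions added here; the sort is only there because awk does not guarantee any order for for (i in a).

```shell
# count_lower FILE...: hypothetical wrapper around the one-liner above.
count_lower() {
    # Count lines keyed by their lowercased text, then print "text, count".
    awk -v OFS=', ' '{ a[tolower($0)]++ } END { for (i in a) print i, a[i] }' "$@" |
        LC_ALL=C sort   # make the output order stable
}
```

Note that the text printed is always the lowercased form, so this counts correctly but does not keep the most frequent spelling.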


If that list of items is in a file named list.txt, you can do:

tr '[:upper:]' '[:lower:]' < list.txt | sort | uniq -c

...which will output:

8 hot chocolate
6 xicolatada

To pick the correct duplicate to output in the end you need to keep track of the global counts and unique (verbatim) counts; using awk:

#!/usr/bin/awk -f

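# l: case-insensitive counts, keyed by the lowercased line
# v: verbatim counts, keyed by the line exactly as written
{ l[tolower($0)]++; v[$0]++; }

END {
    OFS=", ";
    for (expr in l) {
        # find the verbatim spelling of expr with the highest count
        maxlc = 0;
        maxv = "";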
        for (verb in v) {
            if (tolower(verb) == expr && v[verb] > maxlc) {
                maxlc = v[verb];
                maxv = verb;
            }
        }
        print maxv, l[expr];
    }
}

This counts all the unique lines in v, and their lowercase variants in l. l thus gives the counts of de-duplicated lines; for each of these, the script finds the matching occurrence in v with the highest count, and outputs that.
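
The same bookkeeping can also be done in a single pass, without the nested loop in END. This is only a sketch of an alternative, not the script above: the function name most_common_case and the array names total, seen, best and best_n are made up for illustration, and when two spellings tie, the one that reaches the winning count first is kept.

```shell
# most_common_case FILE...: hypothetical single-pass variant.
most_common_case() {
    awk '
    {
        key = tolower($0)
        total[key]++                      # case-insensitive count
        if (++seen[$0] > best_n[key]) {   # this spelling now leads its group
            best_n[key] = seen[$0]
            best[key] = $0
        }
    }
    END {
        for (key in total)
            printf "%s, %d\n", best[key], total[key]
    }' "$@"
}
```

As with the script above, the order of the output lines is unspecified, so pipe the result through sort if a stable order matters.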


POSIXly:

<in dd conv=lcase|LC_ALL=C sort|uniq -c >out

...which prints...

8 hot chocolate
6 xicolatada

Or with GNU tools:

<in LC_ALL=C sort -f|uniq -ic >out

...which prints...

8 Hot Chocolate
6 XICOLATADA

You need GNU uniq there, or at least a uniq that supports the case-insensitive option -i. Any sort should support -f.

    
@cuonglm - is there any solution offered here which wouldn't need an explicit locale edited in? With the sample input given though, this one doesn't. –  mikeserv yesterday
    
Yeah, all the current answers need it. The locale is there for strictness. Feel free to revert it. –  cuonglm 23 hours ago
    
@cuonglm - nah. better is better. It's faster, anyway. So you know, uniq has the same problem. –  mikeserv 23 hours ago
GNU uniq does, but POSIX uniq doesn't. GNU uniq -i uses byte comparison instead of collation order, so it doesn't have the problem. –  cuonglm 23 hours ago
    
@cuonglm - I upvoted that comment - but maybe in haste? I don't read that anywhere in the spec. It doesn't seem to specify anywhere how the lines should be compared, only that they should be. –  mikeserv 23 hours ago

This will give you your desired output:

use List::Util qw(sum);

my %count;
while (<>) {
    chomp;
    $count{+lc}{$_}++;    # outer key: lowercased line; inner key: verbatim spelling
}

$,=", ";
$\="\n";

while (my ($key, $hash) = each %count) {
    my @labels = reverse 
                 map  { $_->[0] }
                 sort { $a->[1] <=> $b->[1] } 
                 map  { [ $_, $hash->{$_} ] } 
                 keys %$hash;
    my $sum = sum values %$hash;

    print $labels[0], $sum;
}

Then

$ perl count.pl data.txt 
Hot Chocolate, 8
Xicolatada, 6

The order of the output is indeterminate.

    
You can use $b->[1] <=> $a->[1] to sort in descending order, so the reverse can be dropped. –  cuonglm 18 hours ago
    
I find that less readable: the sort block already contains sufficient magic. –  glenn jackman 18 hours ago
