Count duplicate lines case-insensitively, keeping the case variant with the most duplicates

Question

The duplicates are the same text in different combinations of upper and lower case.

I need to count the duplicates case-insensitively, and then remove them by keeping only the case variant that occurs most often.

For example, this input:

hot chocolate
hot chocolate
hot chocolate
Hot Chocolate
Hot Chocolate
Hot Chocolate
Hot Chocolate
Hot Chocolate
Xicolatada
Xicolatada
Xicolatada
Xicolatada
XICOLATADA
XICOLATADA

Should become:

Hot Chocolate, 8
Xicolatada, 6

This question is similar to this one, but I need to count case-insensitively and keep the case variant with the most duplicates.

Just curious, has anyone here ever needed to search for a string but only return whichever version of that string has the most instances based on case?!? this just seems like purely academic hoops that people are made to jump through in school and maybe would never, ever be needed in the real world! — Baazigar, 16 hours ago

rocky · Accepted Answer · 2015-06-17 10:08:17Z


There is also uniq --ignore-case --count | sort --numeric-sort --reverse, or in short form:

uniq -ic /tmp/foo.txt | sort -nr
      8 hot chocolate
      6 Xicolatada

To swap the two columns and put a comma between them, append:

... | sed -e 's/^ *\([0-9]*\) \(.*\)/\2, \1/'
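
As a rough illustration, the pieces above can be combined and run on the question's sample. Note that uniq only merges adjacent lines, so this relies on the input already being grouped as in the question, and it reports the first spelling in each run rather than the most frequent one. The names sample and count_runs are hypothetical helpers introduced here, and -i is not a POSIX uniq option (GNU uniq accepts it).

```shell
#!/bin/sh
# Hypothetical helper: prints the sample lines from the question.
sample() {
    printf '%s\n' \
        'hot chocolate' 'hot chocolate' 'hot chocolate' \
        'Hot Chocolate' 'Hot Chocolate' 'Hot Chocolate' \
        'Hot Chocolate' 'Hot Chocolate' \
        'Xicolatada' 'Xicolatada' 'Xicolatada' 'Xicolatada' \
        'XICOLATADA' 'XICOLATADA'
}

# Hypothetical wrapper around the pipeline from this answer.
# Reads already-grouped lines on stdin and prints "text, count".
count_runs() {
    uniq -ic | sort -nr | sed -e 's/^ *\([0-9]*\) \(.*\)/\2, \1/'
}

sample | count_runs
```

On the sample this prints "hot chocolate, 8" and "Xicolatada, 6", which matches the counts above but not the requested spelling for the first line.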


fedorqui · Answer 2 · 2015-06-17 09:20:55Z

I would use tolower() to make all the items lowercase. Then it is a matter of storing them in an array a[] and then printing the results:

$ awk '{a[tolower($0)]++} END {for (i in a) print i, a[i]}' file
xicolatada 6
hot chocolate 8

To have the output in comma-separated format, add -v OFS=,.
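
A minimal sketch of that variant, assuming the question's sample as input. It uses OFS=', ' (comma plus space) rather than a bare comma so the spacing matches the question's format, and pipes through sort because the order of for (i in a) is unspecified. The names sample and count_lower are hypothetical. The spelling is lowercased, so this gives the counts but not the most common case variant.

```shell
#!/bin/sh
# Hypothetical helper: prints the sample lines from the question.
sample() {
    printf '%s\n' \
        'hot chocolate' 'hot chocolate' 'hot chocolate' \
        'Hot Chocolate' 'Hot Chocolate' 'Hot Chocolate' \
        'Hot Chocolate' 'Hot Chocolate' \
        'Xicolatada' 'Xicolatada' 'Xicolatada' 'Xicolatada' \
        'XICOLATADA' 'XICOLATADA'
}

# Hypothetical wrapper: count lines case-insensitively, print "text, count".
# The trailing sort only makes the output order repeatable.
count_lower() {
    awk -v OFS=', ' '{ a[tolower($0)]++ } END { for (i in a) print i, a[i] }' "$@" |
        LC_ALL=C sort
}

sample | count_lower
```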

Marios Zindilis · Answer 3 · 2015-06-17 09:21:40Z


If that list of items is in a file named list.txt, you can do:

tr '[:upper:]' '[:lower:]' < list.txt | sort | uniq -c

...which will output:

8 hot chocolate
6 xicolatada


Stephen Kitt · Answer 4 · 2015-06-17 09:41:24Z

To pick the correct duplicate to output in the end you need to keep track of the global counts and unique (verbatim) counts; using awk:

#!/usr/bin/awk -f

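# l: case-insensitive totals; v: counts of each verbatim spelling
{ l[tolower($0)]++; v[$0]++; }

END {
    OFS=", ";
    for (expr in l) {
        # find the verbatim spelling of expr that occurs most often
        maxlc = 0;
        maxv = "";
        for (verb in v) {
            if (tolower(verb) == expr && v[verb] > maxlc) {
                maxlc = v[verb];
                maxv = verb;
            }
        }
        # print that spelling together with the case-insensitive total
        print maxv, l[expr];
    }
}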

This counts all the unique lines in v, and their lowercase variants in l. l thus gives the counts of de-duplicated lines; for each of these, the script finds the matching occurrence in v with the highest count, and outputs that.
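
For comparison, here is a hedged single-pass variation on the same idea (not the script above): it remembers the leading spelling for each lowercase key while reading, so the END block needs no nested loop. The names sample, pick_majority_case, total, seen, best and best_n are all hypothetical. On a tie, the spelling that reached the top count first is kept. The trailing sort is only there to make the output order repeatable, and tolower() may only fold ASCII letters in some awk implementations.

```shell
#!/bin/sh
# Hypothetical helper: prints the sample lines from the question.
sample() {
    printf '%s\n' \
        'hot chocolate' 'hot chocolate' 'hot chocolate' \
        'Hot Chocolate' 'Hot Chocolate' 'Hot Chocolate' \
        'Hot Chocolate' 'Hot Chocolate' \
        'Xicolatada' 'Xicolatada' 'Xicolatada' 'Xicolatada' \
        'XICOLATADA' 'XICOLATADA'
}

# Hypothetical wrapper: print "most common spelling, total" per group.
pick_majority_case() {
    awk '
    {
        key = tolower($0)
        total[key]++              # case-insensitive count
        n = ++seen[$0]            # count of this exact spelling
        if (n > best_n[key]) {    # this spelling now leads its group
            best_n[key] = n
            best[key] = $0
        }
    }
    END {
        for (key in total) print best[key] ", " total[key]
    }' "$@" | LC_ALL=C sort
}

sample | pick_majority_case
```

On the sample this prints "Hot Chocolate, 8" followed by "Xicolatada, 6".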

cuonglm · Answer 5 · 2015-06-17 10:23:45Z


POSIXly:

<in dd conv=lcase|LC_ALL=C sort|uniq -c >out

...which prints...

8 hot chocolate
6 xicolatada

Or with GNU tools:

<in LC_ALL=C sort -f|uniq -ic >out

...which prints...

8 Hot Chocolate
6 XICOLATADA

You need GNU uniq there, or at least a uniq that supports the case-insensitive option. Any sort should support -f.
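
A small sketch of why the sort step matters, assuming a uniq that accepts -i: the lines below are the question's sample deliberately interleaved, so uniq alone would see only short runs. The names shuffled and count_folded are hypothetical. Which spelling survives depends on how sort orders lines that compare equal under -f, since uniq keeps the first line of each group, so this does not select the most frequent spelling.

```shell
#!/bin/sh
# Hypothetical helper: the question's 14 lines in interleaved order.
shuffled() {
    printf '%s\n' \
        'Xicolatada' 'hot chocolate' 'XICOLATADA' 'Hot Chocolate' \
        'Hot Chocolate' 'Xicolatada' 'hot chocolate' 'Hot Chocolate' \
        'XICOLATADA' 'Hot Chocolate' 'Xicolatada' 'hot chocolate' \
        'Hot Chocolate' 'Xicolatada'
}

# Hypothetical wrapper: fold case while sorting so that all variants of a
# line become adjacent, then count them case-insensitively.
# The sed call only strips the leading padding that uniq -c adds.
count_folded() {
    LC_ALL=C sort -f | uniq -ic | sed -e 's/^ *//'
}

shuffled | count_folded
```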

Answered by mikeserv; edited by cuonglm.

@cuonglm - is there any solution offered here which wouldn't need an explicit locale edited in? With the sample input given though, this one doesn't. – mikeserv yesterday

Yeah, all current answers need it. The locale is for strictness. Feel free to revert it. – cuonglm 23 hours ago

@cuonglm - nah. better is better. It's faster, anyway. So you know, uniq has the same problem. – mikeserv 23 hours ago

GNU uniq does but POSIX uniq doesn't. GNU uniq -i uses byte comparison instead of collation order, so it doesn't have that problem. – cuonglm 23 hours ago

@cuonglm - I upvoted that comment - but maybe in haste? I don't read that anywhere in the spec. It doesn't seem to specify anywhere how the lines should be compared, only that they should be. – mikeserv 23 hours ago


glenn jackman · Answer 6 · 2015-06-17 15:38:50Z


This Perl script will give you your desired output:

use List::Util qw(sum);

my %count;
while (<>) {
    chomp;
    $count{+lc}{$_}++; 
}

$,=", ";
$\="\n";

while (my ($key, $hash) = each %count) {
    my @labels = reverse 
                 map  { $_->[0] }
                 sort { $a->[1] <=> $b->[1] } 
                 map  { [ $_, $hash->{$_} ] } 
                 keys %$hash;
    my $sum = sum values %$hash;

    print $labels[0], $sum;
}

Then

$ perl count.pl data.txt 
Hot Chocolate, 8
Xicolatada, 6

The order of the output is indeterminate.


You can use $b->[1] <=> $a->[1] to sort reverse, so we can drop reverse. – cuonglm 18 hours ago

I find that less readable: the sort block already contains sufficient magic. – glenn jackman 18 hours ago
