Checking if two strings are an anagram

Question

This is my sample code to check if a given pair of strings are anagram or not.

      static bool AreAnagram(string s1, string s2)
      {

        if (s1.Length != s2.Length)
            return false;

        foreach (char c in s1)
        {
            int ix = s2.IndexOf(c);

            if (ix == -1)
                return false;
        }

        return true;
    }

The complexity of this approach is \$O(n)\$.

What boundary conditions am I missing? Can it be made better?

not meaning to brag, but I got something working in place ;) codereview.stackexchange.com/questions/59370/… — Mark, yesterday
The complexity of this code is not O(n), it is O(n^2), because there's a hidden n multiplier. — Mooing Duck, yesterday

Pimgd · Accepted Answer · 2014-08-07 10:44:37Z

up vote 12 down vote accepted

You're missing the case where a word has more than one of the same characters.

For instance, the strings "aba" and "bab" would be seen as an anagram by your algorithm.

You can fix this by removing the found character from the string.

Additionally, if you wanted to make your algorithm faster, you could check if both strings are the same (and return true if they are) before checking the length. This is only a valid option if this is a likely scenario, because otherwise you'll be slowing down each anagram check.

edited yesterday

answered yesterday

Pimgd
2,3341636

@kyle code on the question shouldn't be changed, because it invalidates the answers. If we allowed changing the code, there'd be Edit: Edit2: Edit5: blocks everywhere, and that's messy. If you wish to post the updated code up for review then you should accept this answer, change your code and write a new question with your updated code. Don't forget to link back to this question. – Pimgd yesterday

yes you are right...i realized that later :) – kyle yesterday

3

As an addition to this answer, this is where unit tests come in handy. You'll often find that determining some test cases up front will catch bugs like this. Sometimes you won't think of the case, but when it's discovered it gets added and any future changes to your code will not contain the same bug. – ckuhn203 yesterday

2

I think that adding a special case to check if the strings are equal is not good. You make 99% of cases slower to speedup 1%... in average this is a loss of performance. – Emanuele Paolini yesterday

@EmanuelePaolini You're right. I've added a note to that section of my answer. It basically depends on how much anagram checking is gonna happen, and how much of that will be done with strings that are the same. – Pimgd yesterday

add a comment |

Pimgd · Answer 2 · 2014-08-07 10:51:09Z

up vote 26 down vote

The complexity of your algorithm is \$O(n^2)\$, since IndexOf has complexity \$O(n)\$. You can get a \$O(n\log n)\$ by sorting the two strings and comparing them.

edited yesterday

Pimgd
2,3341636

answered yesterday

Emanuele Paolini
638116

1

Great idea! Is sorting performant enough? – Recipe yesterday

1

@Recipe Yes. \$\log n\$ grows very slowly; if you have 2 billion characters in your string, \$\log n\$ is only 31, which is the least of your concerns. – mjolka yesterday

2

If you just dump the characters from each string into a hash table (mapping characters to frequency), then you can do this in \$O(n)\$. – Joshua Taylor yesterday

1

@JoshuaTaylor Even better! – Ryan yesterday

1

You could even speed this up for many cases, by computing a position independent character-checksum/hash for both strings and comparing these before sorting and comparing the strings. Computing the checksums and comparing them is O(n) so won't add complexity, but in most cases this will give you an early return if you don't have a checksum collision! – Falco 11 hours ago

| show 1 more comment

Pimgd · Answer 3 · 2014-08-07 13:12:40Z

In addition to the fine points already made about alternative algorithms, and the incorrect assertion of the \$O(n)\$ performance, there's another condition the original code doesn't handle. And that's characters (really graphemes) that have multiple representations (that is, sequences of char values).

Since you're using C#, you're implicitly using Unicode. You're probably concerned only with the ASCII subset typically used in English, and if so you can ignore the rest of this answer. But if you're looking for a fully general answer, you have to pay attention to other characters. Some characters have multiple ways they can be represented. Some characters can only be represented with more than one sequential char value. For example, from String.Normalize, the character ắ can be represented in three different ways, requiring respectively 1, 2, or 3 char values.

Note that any char-based traversal will lose information; for example treating these the two-character / three char sequences as identical: combining accent + a, e and a, combining accent + e, and treating them differently from accented a, e and a, accented e.

Normalization can convert between these representations. However even normalization cannot turn every character into a single char. Because of this you need to enumerate the characters (which are by necessity themselves each represented as a String). Extract them with StringInfo.GetTextElementEnumerator, then normalize and store a Dictionary of Strings (counting occurrences) per mjolka's approach, or store a List of Strings that you sort and compare using true Unicode-aware sorting and comparisons.

+1 for picking that nit before I could. UTF-16 surrogate pairs are another unicode problem with naive implementations. Luckily the solution of using StringInfo.GetTextElementEnumerator remains the same. — CodesInChaos, yesterday

mjolka · Answer 4 · 2014-08-07 11:15:18Z

The complexity of this method is actually \$O(n^2)\$, where \$n\$ is s1.Length (which is equal to s2.Length). Let's expand out IndexOf and see why.

foreach (char c in s1)
{
    int ix = -1;
    for (var i = 0; i < s2.Length; i++)
    {
        if (s2[i] == c)
        {
            ix = i;
            break;
        }
    }

    if (ix == -1)
        return false;
}

return true;

As @Pimgd pointed out, it is also incorrect. So how can we fix it? Two strings are anagrams if each character occurs the same number of times, so that seems like a likely approach.

Let's write a method to count the occurrences of each character in a string. We'll use a Dictionary<char, int> to keep track.

private static IDictionary<char, int> GetCharacterCount(string input)
{
    var tally = new Dictionary<char, int>();
    foreach (var c in input)
    {
        int count = tally.TryGetValue(c, out count)
            ? count + 1
            : 1;
        tally[c] = count;
    }

    return tally;
}

Now we want to compare the results of this method

var s1Count = GetCharacterCount(s1);
var s2Count = GetCharacterCount(s2);

foreach (var kvp in s1Count)
{
    var c = kvp.Key;
    if (!s2Count.ContainsKey(c))
    {
        return false;
    }

    if (kvp.Value != s2Count[c])
    {
        return false;
    }
}

return true;

Well, that's one approach, but it seems a bit complicated.

Anagrams have another useful properly, which is that two strings are anagrams of each other if and only if they are equal when they are sorted. So let's convert that into code.

To sort a string, we first have to convert it into a character array, sort the array, and then convert back into a string.

private static string Sort(string input)
{
    var chars = input.ToCharArray();
    Array.Sort(chars);
    return new string(chars);
}

Now we can compare the two sorted sorted strings

var s1Sorted = Sort(s1);
var s2Sorted = Sort(s2);

return s1Sorted == s2Sorted;

Why bother with a for loop for comparing strings character by character? Can't you simply use a string equals function? — Pimgd, yesterday
Don't forget that this code will not check for uppercase / lowercase characters, as well as foreign characters (eg. ñ != n). You could replace the s1Sorted == s2Sorted with s1Sorted.Equals(s2Sorted, StringComparison) — Recipe, yesterday
The dictionary implementation seems to be \$O(n)\$ if the dictionary is implemented as an hashtable. In fact one could use a fixed array of size 256 instead of the dictionary. — Emanuele Paolini, yesterday
@EmanuelePaolini it is \$O(n)\$. Chars in .NET are 16 bits though, so you would need an array of size 65,536, which makes the dictionary a better option. — mjolka, yesterday

Mark · Answer 5 · 2014-08-08 04:50:34Z

up vote 4 down vote

Consider pooling the chars you have and check if you can assemble the second string from them. I am not 100% sure but this should be somewhere between linear and logarithmic in performance.

bool anagramChecker(string first, string second)
{
    if(first.Length != second.Length)
        return false;

    if(first == second)
        return true;//or false: Don't know whether a string counts as an anagram of itself

    Dictionary<char, int> pool = new Dictionary<char, int>();
    foreach(char element in first.ToCharArray()) //fill the dictionary with that available chars and count them up
    {
        if(pool.ContainsKey(element))
            pool[element]++;
        else
            pool.Add(element, 1);
    }
    foreach(char element in second.ToCharArray()) //take them out again
    {
        if(!pool.ContainsKey(element)) //if a char isn't there at all; we're out
            return false;
        if(--pool[element] == 0) //if a count is less than zero after decrement; we're out
            pool.Remove(element);
    }
    return pool.Count == 0;
}

edited 17 hours ago

answered yesterday

Mark
1413

as a side note, a python similar implementation is return Counter(first) == Counter(second) – njzk2 yesterday

also, the complexity for this is O(n), as getting and inserting in a Dictionary is O(1), which makes it the best solution so far – njzk2 yesterday

for shortening, you can use int value; pool[element] = pool.TryGetValue(element, out value) ? value + 1 : 1; and pool[element] = pool.TryGetValue(element, out value) ? value - 1 : -1; (early return is not entirely necessary, as it does not change the complexity). – njzk2 yesterday

@njzk2 it does not change the class of complexity, indeed, but in practice it will give an advantage on average. Also I think all the shortness you gain from using the ternay operator, you loose on the readability. The compiler even reverts it back do what I did there, so why bother? – Mark yesterday

on the contrary, I find that this ternary notation is clearer as it is a well-known form representing a Default Dict. – njzk2 yesterday

| show 4 more comments

Keith · Answer 6 · 2014-08-08 04:22:26Z

The Dictionary<> is probably a better solution, but if you are interested only in ASCII strings, an array approach is also possible.

 // Returns false if characters out of range.
        static public bool anagramCheckerAsciiOnly(string first, string second)
        {
            if (first.Length != second.Length)
                return false;

            if (first == second)
                return true;

            const int maxValue = 127;
            int[] count = new int[maxValue];

            for (int index = 0; index != first.Length; ++index)
            {
                char a = first[index];
                char b = second[index];
                if (a >= maxValue || b >= maxValue || a < 0 || b < 0)
                    return false;
                ++count[a];
                --count[b];
            }

            return !(count.Any(
                (element) =>
                {
                    return element != 0;
                }
            ));
        }

CJ Dennis · Answer 7 · 2014-08-08 04:51:10Z

"Simply sort the characters in both strings then check if they're equal."
"           '.Saaabcccceeeeeeefghhhhhhiiiikllmnnnoopqrrrrrsssstttttttuyy"

Use Unicode normalization on both first to make sure each character is encoded in a way that will match.

Edit (as a comment)

The OP asked "can it be made better". The OP seemed to be assuming that the provided function was the only (or best) method. I was providing a different way of achieving the same result with less complexity, also with an intuitive understanding of why it works. Are "doctor who" and "hotrod cow" anagrams of each other? " cdhooortw" == " cdhooortw". What about "doctor who" and "torchwood"? " cdhooortw" != "cdhooortw" (there is an extra space character).

This does not seem to be a code review. Rather it looks like a suggestion or a comment instead? — James Khoury, 18 hours ago

asked	yesterday
viewed	1825 times
active	today

current community

your communities

more stack exchange communities

Checking if two strings are an anagram

7 Answers 7

Your Answer

Not the answer you're looking for? Browse other questions tagged c# performance or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Checking if two strings are an anagram

7 Answers 7

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged c# performance or ask your own question.

Related

Hot Network Questions