Counting occurrences of values in C Array (Shannon Entropy)

Question

I have written the following C code for calculating the Shannon Entropy of a distribution of 8-bit ints. But obviously this is very inefficient for small arrays and won't work with say 32-bit integers, because that would require literally gigabytes of memory. I am not very experienced in C and don't know what would be the best approach here. If it would simplify things, I could use C++ or Objective-C...

Also, please tell me about any other issues with the code you may find :-)

double entropyOfDistribution(uint8_t *x, size_t length)
{
    double entropy = 0.0;

    //Counting number of occurrences of a number (using "buckets")
    double *probabilityOfX = calloc(sizeof(double), 256);
    for (int i = 0; i < length; i++)
        probabilityOfX[x[i]] += 1.0;

    //Calculating the probabilities
    for (int i = 0; i < 256; i++)
        probabilityOfX[x[i]] /= length;

    //Calculating the sum of p(x)*lg(p(x)) for all X
    double sum = 0.0;
    for (int i = 0; i < 256; i++)
        if (probabilityOfX[i] > 0.0) 
            sum += probabilityOfX[i] * log2(probabilityOfX[i]);

    entropy = -1.0 * sum;

    free(probabilityOfX);

    return entropy;
}

btw, this is the formula I implemented: $H(X) = -\sum_{x} p(x) \log_2 p(x)$

I believe your code currently contains a bug. Consider what happens in: for (int i = 0; i < 256; i++) probabilityOfX[x[i]] /= length;, if x has a length of 10 (or anything else less than 256). I believe here you want probabilityOfX[i] /= length; — Jerry Coffin, yesterday

Yuushi · Answer 1 · 2014-03-27 07:48:44Z

Firstly, you know the number of buckets you need at compile time, so there is no reason to use dynamic allocation (specifically, double *probabilityOfX = calloc(sizeof(double), 256);). This could simply be:

double probabilityOfX[256];
memset(probabilityOfX, 0, sizeof probabilityOfX);  // size in bytes, not in elements

You don't need to free this memory at the end then, either, reducing the possibility for memory leaks.

Of course, this is fine for small value ranges (such as what fits in a uint8_t). With a uint32_t (or larger), however, it would pre-allocate a huge array that could be very sparse. In that case, what you actually want is a dictionary data structure (such as a hash map). Since C doesn't have one built in, I'm going to switch over to C++ so we can use std::unordered_map and some other nice things like std::vector (instead of raw uintX_t pointers):

#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

double entropyOfDistribution(const std::vector<uint32_t>& vec)
{
    std::unordered_map<uint32_t, unsigned> counts;

    // Store the number of counts
    for(uint32_t value : vec) {
        ++counts[value];
    }

    double sum = 0.0;
    // Note the cast as otherwise we'll be doing integer
    // division and hence rounding to an int -
    // thanks @syb0rg for pointing that out.
    const double num_samples = static_cast<double>(vec.size());

    for(auto it = counts.begin(); it != counts.end(); ++it) {
        double probability = it->second / num_samples;
        sum += probability * log2(probability);
    }

    return -1.0 * sum;
}

If you switch to a static array, wouldn't it be better to zero-initialize it by using = { 0 }; rather than calling memset? — Morwenn, yesterday
@Morwenn Probably, it just didn't jump to my mind when I was writing it. — Yuushi, 23 hours ago

Jerry Coffin · Answer 2 · 2014-03-27 07:22:10Z

@Yuushi has already given some pretty good suggestions, but if I were doing this in C++, I'd do it just a bit differently.

Let's start by reviewing your code:

    //Counting number of occurrences of a number (using "buckets")
    double *probabilityOfX = calloc(sizeof(double), 256);

Although it's rarely likely to be a problem, I don't believe that memory allocated with calloc is guaranteed to contain 0.0 when viewed as doubles: calloc sets all bits to zero, and the C standard does not require that bit pattern to represent 0.0 (it does in most typical implementations, so you may not care).
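
If that guarantee matters, assigning 0.0 explicitly avoids relying on the bit pattern. A minimal sketch (the helper name alloc_zeroed_doubles is hypothetical, not from the original code):

```c
#include <stdlib.h>

/* Allocates n doubles and sets each one to 0.0 by assignment,
 * so the result does not depend on the floating-point representation.
 * Returns NULL if the allocation fails. */
double *alloc_zeroed_doubles(size_t n)
{
    double *values = malloc(n * sizeof *values);
    if (values == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++)
        values[i] = 0.0;

    return values;
}
```

For a fixed-size local array, an initializer such as double probabilityOfX[256] = { 0 }; gives the same guarantee, since initializers set arithmetic elements to zero by value rather than by bit pattern.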

    //Calculating the probabilities
    for (int i = 0; i < 256; i++)
        probabilityOfX[x[i]] /= length;

This is (I'm fairly sure) a bug. What was intended was probably:

    //Calculating the probabilities
    for (int i = 0; i < 256; i++)
        probabilityOfX[i] /= length;

As written, it reads the first 256 elements of the input, even if the input is much shorter than that. It also fails to normalize the counts correctly: any value that doesn't occur among the first 256 elements is never divided by the length, and any value that occurs there more than once is divided repeatedly.

As to how I'd do things: I think I'd at least consider implementing it as a (relatively) generic algorithm that could take (for example) any sequence-type container as its input. I'd implement it internally using std::accumulate to do the majority of the work:

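#include <cmath>
#include <cstddef>
#include <map>
#include <numeric>
#include <utility>

template <class Vector>
double entropy(Vector const &v) {
    typedef typename Vector::value_type v_t;
    std::map<v_t, size_t> counts;

    for (auto && c : v)
        ++counts[c];

    // Capture by reference: [=] would copy the whole container into the lambda.
    return -std::accumulate(counts.begin(), counts.end(), 0.0,
        [&](double a, std::pair<v_t const, size_t> const &t) {
            double x = t.second / static_cast<double>(v.size());
            return a + x * std::log2(x);
        });
}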

In this case, using accumulate doesn't really make the code drastically shorter, but it does (IMO) give a slightly clearer picture of what you're trying to do than using a generic for loop to sum the probabilities.

As specified by IEEE 754, a double whose bytes are all zero indeed represents +0.0, for obvious practical reasons. — 200_success, yesterday
I think what Jerry was getting at was that the C/C++ standards do not guarantee what format will be used for floating-point numbers. Sure, the vast majority of implementations use IEEE-754, which guarantees that a value of all zero bits corresponds to 0.0, but there's no explicit guarantee of this property. With that said, I wouldn't bat an eye at code that makes that assumption; it's simply too common to justify rewriting code that uses it. — Jason R, 21 hours ago
