C++ HyperLogLog Implementation

Question

I'm a scientific C programmer moving my way over to using Modern C++. I found myself needing a HyperLogLog implementation, and I wanted to use this for practice.

I plan to move a number of these functions over into an accompanying header file, but I do find it convenient at the moment to only need to include it (esp. in unit tests), and it was easier to post this to CodeReview as one chunk anyhow.

I know I'm not quite writing idiomatic C++, and that's why I've posted this here. I want to improve my code to be more modern, as well as just good code.

It does seem to work -- I have unit tests for random integers, all integers in given ranges (e.g., 1-100000), and real world examples, covering addition and set difference operations.

However, I'm particularly unsure about my copy and move constructors and my assignment operators. I have C++17 at my disposal, but I am currently only using features through C++14.

```
#ifndef _HLL_H_
#define _HLL_H_
#include <cstdlib>
#include <cstdio>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <cinttypes>
#include <algorithm>
#include <vector>
#include "logutil.h" // for LOG_DEBUG

namespace mystuff {

constexpr double make_alpha(size_t m) {
    switch(m) {
        case 16: return .673;
        case 32: return .697;
        case 64: return .709;
        default: return 0.7213 / (1 + 1.079/m);
    }
}

class hll_t {
// HyperLogLog implementation.
// To make it general, the actual point of entry is a 64-bit integer hash function.
// Therefore, you have to perform a hash function to convert various types into a suitable query.
// We could also cut our memory requirements by switching to only using 6 bits per element,
// (up to 64 leading zeros), though the gains would be relatively small
// given how memory-efficient this structure is.

// Attributes
    size_t np_;
    size_t m_;
    double alpha_;
    double relative_error_;
    std::vector<uint8_t> core_;
    double sum_;
    int is_calculated_;

public:
    // Constructor
    hll_t(size_t np=20): np_(np), m_(1 << np), alpha_(make_alpha(m_)),
                         relative_error_(1.03896 / std::sqrt(m_)),
                         core_(m_, 0),
                         sum_(0.), is_calculated_(0) {}

    // Call sum to recalculate if you have changed contents.
    void sum() {
        sum_ = 0;
        for(unsigned i(0); i < m_; ++i) sum_ += 1. / (1 << core_[i]);
        is_calculated_ = 1;
    }

    // Returns cardinality estimate. Sums if not calculated yet.
    double report() {
        if(!is_calculated_) sum();
        const double ret(alpha_ * m_ * m_ / sum_);
        // Correct for small values
        if(ret < m_ * 2.5) {
            int t(0);
            for(unsigned i(0); i < m_; ++i) t += (core_[i] == 0);
            if(t) return m_ * std::log((double)(m_) / t);
        }
        return ret;
        // We don't correct for too large just yet, but we should soon.
    }

    // Returns the size of a symmetric set difference.
    double operator^(hll_t &other) {
        hll_t tmp(*this);
        tmp += other;
        tmp.sum();
        return report() + other.report() - tmp.report();
    }

    // Returns error estimate
    double est_err() {
        return relative_error_ * report();
    }

    void add(uint64_t hashval) {
        const uint32_t index = hashval >> (64 - np_);
        const uint32_t lzt(__builtin_clzll(hashval << np_) + 1);
        if(core_[index] < lzt) core_[index] = lzt;
    }

    std::string to_string() {
        return std::to_string(report()) + ", +- " + std::to_string(est_err());
    }

    // Reset.
    void clear() {
         std::fill(core_.begin(), core_.end(), 0u);
         sum_ = is_calculated_ = 0;
    }

    // Assignment Operators
    hll_t &operator=(hll_t &other) {
        m_ = other.m_;
        np_ = other.np_;
        core_ = std::move(other.core_);
        alpha_ = other.alpha_;
        sum_ = other.sum_;
        relative_error_ = other.relative_error_;
        m_ = other.m_;
        return *this;
    }

    hll_t &operator=(const hll_t &other) {
        m_ = other.m_;
        np_ = other.np_;
        core_ = other.core_;
        alpha_ = other.alpha_;
        sum_ = other.sum_;
        relative_error_ = other.relative_error_;
        m_ = other.m_;
        return *this;
    }

    hll_t(const hll_t &other): hll_t(other.m_) {
        *this = other;
    }

    hll_t(hll_t &&other):
        np_(other.np_),
        m_(other.m_),
        alpha_(other.alpha_),
        relative_error_(other.relative_error_),
        core_(std::move(other.core_)),
        sum_(other.sum_),
        is_calculated_(other.is_calculated_) {
    }

    hll_t const &operator+=(const hll_t &other) {
        if(other.np_ != np_)
            LOG_EXIT("np_ (%zu) != other.np_ (%zu)\n", np_, other.np_);
         // If we ever find this to be expensive, this could be trivially implemented with SIMD.
        for(unsigned i(0); i < m_; ++i) core_[i] |= other.core_[i];
        return *this;
    }

    hll_t operator+(const hll_t &other) const {
        if(other.np_ != np_)
            LOG_EXIT("np_ (%zu) != other.np_ (%zu)\n", np_, other.np_);
        hll_t ret(*this);
        return ret += other;
    }

    // Clears, allows reuse with different np.
    void resize(size_t new_size) {
        new_size = roundup64(new_size);
        LOG_DEBUG("Resizing to %zu, with np = %zu\n", new_size, (size_t)std::log2(new_size));
        clear();
        core_.resize(new_size);
    }
    // Getter for is_calculated_
    bool is_ready() {
        return is_calculated_;
    }
};

} // namespace mystuff
```

Based on the feedback in a comment, here is my second attempt at the move and copy assignment and constructors:

```
// Assignment Operators
hll_t &operator=(const hll_t &other) {
    m_ = other.m_;
    np_ = other.np_;
    core_ = other.core_;
    alpha_ = other.alpha_;
    sum_ = other.sum_;
    relative_error_ = other.relative_error_;
    m_ = other.m_;
    return *this;
}

hll_t &operator=(hll_t &&other) {
    np_ = other.np_;
    m_ = other.m_;
    alpha_ = other.alpha_;
    relative_error_ = other.relative_error_;
    core_ = other.core_;
    is_calculated_ = other.is_calculated_;
    sum_ = other.sum_;
    return *this;
}

hll_t(const hll_t &other): hll_t(other.m_) {
    *this = other;
}

hll_t(hll_t &&other):
    np_(other.np_),
    m_(other.m_),
    alpha_(other.alpha_),
    relative_error_(other.relative_error_),
    core_(std::move(other.core_)),
    sum_(other.sum_),
    is_calculated_(other.is_calculated_) {
}
```

hll_t &operator=(hll_t &other) usually is a copy assignment operator which you implemented further below with the const argument. The move assignment operator should be hll_t &operator=(hll_t &&other). — Ratatwisker
– Ratatwisker, Commented Nov 6, 2016 at 8:41
So you would rename the const assignment operator, and delete the hll_t &operator=(hll_t &other)? How would you write the move assignment operator? — NoSeatbelts
– NoSeatbelts, Commented Nov 6, 2016 at 20:19
Can the "sketch-data-structures" tag be added? I wouldn't assume I'm the only one who's interested in and likes them. I also have a count-min-sketch implementation I might post later after testing. — NoSeatbelts
– NoSeatbelts, Commented Nov 6, 2016 at 21:07
For the copy/move constructor, I suggest copy and swap idiom — Danh
– Danh, Commented Nov 7, 2016 at 7:58
@NoSeatbelts If your program targets POSIX, it's reserved in the global scope only; otherwise it's not reserved. Names containing a double underscore, names starting with an underscore followed by a capital letter, and names starting with an underscore in the global scope are reserved by standard C++. — Danh
– Danh, Commented Nov 7, 2016 at 8:35

Community · Accepted Answer · 2017-05-23 12:40:49Z

I won't get into the details of the actual HyperLogLog algorithm since I'm not familiar with it, but I can certainly provide advice about modern C++. Here we go:

In your move assignment operator, core_ = other.core_ should be core_ = std::move(other.core_) if you really want to take advantage of move semantics.
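To see why the std::move matters, here is a minimal sketch. The tracked and holder types and the copies_made function are hypothetical stand-ins, not part of the reviewed code; tracked only counts how often it is copied. A named rvalue reference such as other is itself an lvalue, so assigning from other.core_ without std::move selects the vector's copy assignment.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical element type that counts how many times it gets copied.
struct tracked {
    static int copies;
    tracked() = default;
    tracked(const tracked&) { ++copies; }
    tracked& operator=(const tracked&) { ++copies; return *this; }
    tracked(tracked&&) noexcept = default;
    tracked& operator=(tracked&&) noexcept = default;
};
int tracked::copies = 0;

// Hypothetical stand-in for hll_t, reduced to its vector member.
struct holder {
    std::vector<tracked> core_;

    // Mirrors the posted move assignment: `other` has a name, so
    // other.core_ is an lvalue and the vector is copied.
    void assign_without_move(holder&& other) { core_ = other.core_; }

    // Mirrors the suggested fix: the vector's storage is handed over.
    void assign_with_move(holder&& other) { core_ = std::move(other.core_); }
};

// Returns the number of element copies performed by one assignment.
inline int copies_made(bool use_move) {
    holder src;
    src.core_.resize(8);
    holder dst;
    tracked::copies = 0;
    if (use_move) {
        dst.assign_with_move(std::move(src));
    } else {
        dst.assign_without_move(std::move(src));
    }
    return tracked::copies;
}
```

With the default allocator, copies_made(true) should report no element copies at all, while copies_made(false) copies every element.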
Your copy constructor delegates to hll_t(other.m_) even though it is implemented in terms of your copy assignment operator, which already copies other.m_, so the delegation is redundant. It is also harmful: the delegated constructor expects np rather than m, so it evaluates 1 << other.m_ (a shift far out of range for the default np of 20) before the assignment overwrites the result.
I see that you implement your copy constructor in terms of your copy assignment operator. In general the assignment operator is implemented in terms of the constructor, using the copy-and-swap idiom. It should look like this:
```
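hll_t &operator=(hll_t other) {
    // Needs a user-provided swap(hll_t&, hll_t&) found by ADL: the generic
    // std::swap would call this very operator again and recurse forever.
    using std::swap;
    swap(*this, other);
    return *this;
}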
```
The explanation is quite subtle, but basically it has to do with code factoring, correctness and exception safety. You should really look at the linked post, which explains the advantages of the idiom in great detail.
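For completeness, here is a minimal sketch of the whole idiom, including the custom swap it relies on. The class swappable_sketch and the function copy_and_swap_demo are hypothetical, with only two members standing in for the real hll_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for hll_t with just two members.
class swappable_sketch {
    std::size_t np_;
    std::vector<std::uint8_t> core_;

public:
    explicit swappable_sketch(std::size_t np = 4)
        : np_(np), core_(std::size_t{1} << np, 0) {}

    swappable_sketch(const swappable_sketch&) = default;
    swappable_sketch(swappable_sketch&&) noexcept = default;

    // Member-wise swap, found through argument-dependent lookup.
    friend void swap(swappable_sketch& a, swappable_sketch& b) noexcept {
        using std::swap;
        swap(a.np_, b.np_);
        swap(a.core_, b.core_);
    }

    // The parameter is taken by value, so the copy (or move) happens at the
    // call site; the operator only exchanges contents with that temporary.
    swappable_sketch& operator=(swappable_sketch other) noexcept {
        swap(*this, other);
        return *this;
    }

    std::size_t size() const { return core_.size(); }
};

// Exercises both the copy path and the move path of the single operator=.
inline bool copy_and_swap_demo() {
    swappable_sketch a(4);
    swappable_sketch b(6);
    a = b;                    // copies b into the parameter, then swaps
    const bool copied = a.size() == 64 && b.size() == 64;
    a = swappable_sketch(3);  // the temporary is moved or elided, then swapped
    return copied && a.size() == 8;
}
```

One operator covers both copy and move assignment, and if the copy throws, the left-hand side has not been touched yet.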
That said, since your copy/move constructor/assignment operator just copies/moves everything, the best way to implement it is still to tell the compiler to generate the correct implementations for you, avoiding errors altogether:
```
hll_t(const hll_t&) = default;
hll_t(hll_t&&) = default;
hll_t& operator=(const hll_t&) = default;
hll_t& operator=(hll_t&&) = default;
```
The default implementations will just copy or move every member of the class, which is exactly what your implementation does, so you can delete your manual implementation of these functions and let the compiler do the job :)
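A quick way to convince yourself that the defaulted members do the right thing is to check them on a reduced class. The sketch below uses a hypothetical defaulted_sketch type with only two of the original members, and a hypothetical defaulted_demo function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for hll_t; all special members are compiler-generated.
class defaulted_sketch {
    std::size_t np_;
    std::vector<std::uint8_t> core_;

public:
    explicit defaulted_sketch(std::size_t np)
        : np_(np), core_(std::size_t{1} << np, 0) {}

    defaulted_sketch(const defaulted_sketch&) = default;
    defaulted_sketch(defaulted_sketch&&) = default;
    defaulted_sketch& operator=(const defaulted_sketch&) = default;
    defaulted_sketch& operator=(defaulted_sketch&&) = default;

    std::size_t size() const { return core_.size(); }
};

// The generated members are usable, and the move constructor is noexcept
// because every member's move constructor is.
static_assert(std::is_copy_constructible<defaulted_sketch>::value,
              "copy constructor should be generated");
static_assert(std::is_copy_assignable<defaulted_sketch>::value,
              "copy assignment should be generated");
static_assert(std::is_nothrow_move_constructible<defaulted_sketch>::value,
              "move constructor should be generated and noexcept");

inline bool defaulted_demo() {
    defaulted_sketch a(4);
    defaulted_sketch b(a);             // member-wise copy
    defaulted_sketch c(std::move(a));  // member-wise move
    defaulted_sketch d(2);
    d = b;                             // member-wise copy assignment
    return b.size() == 16 && c.size() == 16 && d.size() == 16;
}
```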
You should mark the one-parameter constructor explicit to avoid implicit conversions from integers (or from anything else convertible to std::size_t, such as double) to hll_t; such conversions are unwanted most of the time. That said, because np has a default argument, the same constructor also serves as the default constructor, which would then be explicit too, and that isn't desirable. The best solution is to separate the two and use constructor delegation like this:
```
explicit hll_t(std::size_t np): np_(np), m_(1 << np), alpha_(make_alpha(m_)),
                                relative_error_(1.03896 / std::sqrt(m_)),
                                core_(m_, 0),
                                sum_(0.), is_calculated_(0) {}

hll_t(): hll_t(20) {}
```
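The effect of this split can be checked at compile time. The following is only a sketch: explicit_sketch is a hypothetical one-member stand-in for hll_t, and default_np is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for hll_t with the explicit/delegating constructor pair.
class explicit_sketch {
    std::size_t np_;

public:
    explicit explicit_sketch(std::size_t np) : np_(np) {}
    explicit_sketch() : explicit_sketch(20) {}  // delegates, stays non-explicit

    std::size_t np() const { return np_; }
};

// No implicit conversion from numbers to the sketch type...
static_assert(!std::is_convertible<std::size_t, explicit_sketch>::value,
              "size_t must not convert implicitly");
static_assert(!std::is_convertible<double, explicit_sketch>::value,
              "double must not convert implicitly");
// ...but direct construction still works.
static_assert(std::is_constructible<explicit_sketch, std::size_t>::value,
              "explicit construction must remain available");

inline std::size_t default_np() {
    explicit_sketch s = {};  // compiles only because the default constructor is not explicit
    return s.np();
}
```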
You should always use the type bool and the constants true and false for is_calculated_ since it's obviously a boolean variable (is_ready even returns a bool already). It will make things clearer for everyone at first glance.
When a method does not alter the instance of the class, you can mark it as const to make sure that you can call said method, even if the hll_t instance is const. The method is_ready can be const-qualified:
```
bool is_ready() const {
    return is_calculated_;
}
```
If you didn't have a two-phase initialization with is_calculated_ and computed everything at construction, then many more functions could be marked as const and hll_t could actually be usable as a const object.

You could make some variables mutable and const-qualify more methods, but I'd always think at least twice before making anything mutable. It's always like opening a strange door. I know that it's sometimes used when you have expensive computations and caches like you do. Please read a lot about it if you ever decide to perform such a change.
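If you do go down the mutable route, it could look roughly like the sketch below. The class cached_sum and its set member are hypothetical; only the cached sum of 2^-register is modelled, and the cache is not thread-safe as written.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reduction of hll_t to its registers and the cached sum.
class cached_sum {
    std::vector<std::uint8_t> core_;
    // mutable: the cache may be refreshed from const member functions.
    // Note that this makes concurrent calls to sum() on one object unsafe.
    mutable double sum_ = 0.0;
    mutable bool is_calculated_ = false;

public:
    explicit cached_sum(std::size_t m) : core_(m, 0) {}

    // Every mutation invalidates the cache.
    void set(std::size_t i, std::uint8_t value) {
        core_[i] = value;
        is_calculated_ = false;
    }

    // Logically const: the observable state does not change, only the cache.
    double sum() const {
        if (!is_calculated_) {
            sum_ = 0.0;
            for (std::uint8_t r : core_) {
                sum_ += std::ldexp(1.0, -static_cast<int>(r));  // adds 2^-r
            }
            is_calculated_ = true;
        }
        return sum_;
    }
};
```

The important part is that every mutating function resets the flag; otherwise a const accessor would happily return a stale value.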
While compound assignment operators like operator@= are normally implemented as members of the class, it's idiomatic to implement the corresponding binary operators like operator@ outside of the class, in terms of operator@=. In your case operator^= wouldn't make sense, but you could reimplement operator^ as a free function:
```
double operator^(hll_t& lhs, hll_t& rhs) {
    hll_t tmp(lhs);   // work on a copy so that lhs is left unchanged
    tmp += rhs;
    tmp.sum();
    return lhs.report() + rhs.report() - tmp.report();
}
```
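I guess that you could even drop the explicit call to tmp.sum(), since tmp.report() computes the sum when is_calculated_ is unset. Be careful, though: a defaulted copy constructor also copies that flag from lhs, and operator+= does not reset it, so the call is only redundant if operator+= clears is_calculated_.
It's good practice to always std::-qualify components of the standard library. You've done pretty well, but still forgot to fully qualify std::uint8_t and std::size_t.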
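The same member/free-function split applies to operator+ and operator+=. Below is a minimal sketch with a hypothetical registers class. Note one assumption that differs from the posted code: registers are merged with std::max, the usual way to take the union of two HyperLogLog sketches, whereas the question uses |=.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for hll_t holding only the register array.
class registers {
    std::vector<std::uint8_t> core_;

public:
    explicit registers(std::size_t m) : core_(m, 0) {}

    std::uint8_t& operator[](std::size_t i) { return core_[i]; }
    std::uint8_t operator[](std::size_t i) const { return core_[i]; }
    std::size_t size() const { return core_.size(); }

    // Compound assignment is a member and does the real work.
    // Assumption: register-wise maximum, not the |= of the posted code.
    registers& operator+=(const registers& other) {
        assert(other.size() == size());
        for (std::size_t i = 0; i < core_.size(); ++i) {
            core_[i] = std::max(core_[i], other.core_[i]);
        }
        return *this;
    }
};

// The binary operator is a free function written in terms of operator+=.
// Taking lhs by value makes the copy that becomes the result.
inline registers operator+(registers lhs, const registers& rhs) {
    lhs += rhs;
    return lhs;
}
```

Because operator+ takes its left operand by value, neither argument of a + b is modified.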
As others have said in the comments, identifiers starting with an underscore followed by a capital letter are reserved for the implementers of the compiler and standard library. _HLL_H_ can easily be renamed HLL_H_ to solve that problem.

It's true that POSIX officially reserves *_t names in the global namespace, but that's probably widely disregarded. I'd have to check but I believe that many popular C and/or C++ libraries simply don't care.

That said you could rename your class hyperloglog, which isn't that long, is really descriptive, and doesn't use any reserved identifier.
It's excellent practice to (almost) always use braces after control statements like if and for, even if they only exist for a single statement. It will avoid stupid problems like Apple's goto fail; bug in the long run. It also helps to better understand scopes when visual indentation is screwed (which occasionally happens when you have a mix of spaces and tabs and then switch to another editor).
As far as I can tell, your code doesn't use anything from <cstdlib>, <cstdio> and <cstring>, so you can remove those includes. std::uint8_t lives in <cstdint>, which you already included, so you don't need <cinttypes> either.

However, once you've removed all those includes, you've got no guarantee from the standard that std::size_t has been defined. It's defined in several headers, the most lightweight (and thus preferred) one being <cstddef>.
While it's not going to make a difference most of the time, using the free functions std::begin and std::end is generally better than the .begin and .end member functions of containers.
```
std::fill(std::begin(core_), std::end(core_), 0u);
```
That said, it's mostly true for generic code: the free functions work with more types (including fixed-size C arrays and std::initializer_list), and call the .begin and .end member functions if there are any. In code such as yours, it doesn't really matter, but it's always good to know.
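To illustrate the generic-code case, here is a small sketch. The function name reset_registers is hypothetical; it zeroes any range, whether it is a standard container or a plain C array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper: value-initializes every element of a range.
template <typename Range>
void reset_registers(Range& r) {
    // Bring the std versions into scope, then call unqualified so that
    // user-defined begin/end overloads can also be found by ADL.
    using std::begin;
    using std::end;
    auto first = begin(r);
    using value_type = typename std::iterator_traits<decltype(first)>::value_type;
    std::fill(first, end(r), value_type{});
}
```

The same call works for a std::vector<std::uint8_t> and for a std::uint8_t[16] array, which has no .begin() member.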
If you're going to use compiler intrinsics like __builtin_clzll, then maybe you'll want to wrap them in generic functions that fall back to a hand-rolled albeit probably less efficient implementation to make sure that your code runs everywhere.
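Such a wrapper could look like the sketch below. The names clz64 and clz64_fallback are hypothetical. The zero check is there because __builtin_clzll has an undefined result for an argument of 0; returning 64 in that case is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Portable loop: counts leading zero bits one at a time.
inline unsigned clz64_fallback(std::uint64_t x) {
    if (x == 0) {
        return 64;  // convention chosen for this sketch
    }
    unsigned n = 0;
    for (std::uint64_t mask = std::uint64_t{1} << 63; (x & mask) == 0; mask >>= 1) {
        ++n;
    }
    return n;
}

// Uses the intrinsic where it is known to exist, the loop everywhere else.
inline unsigned clz64(std::uint64_t x) {
    if (x == 0) {
        return 64;  // the intrinsic's result is undefined for 0
    }
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_clzll(x));
#else
    return clz64_fallback(x);
#endif
}
```

Other compilers have their own intrinsics that could be added as further branches; the loop keeps the code correct in the meantime.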
You could make hll_t a class template. That way it could operate with other types than double when needed. Of course float and long double come to my mind, but many more decimal types are being standardized (targeting the C2x standard, and maybe C++20), as well as a short float type. Moreover, the additional floating point types from Boost.Multiprecision could also surely be used out-of-the-box with a class template.
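As a rough idea of what the template could look like, here is a sketch. The name basic_hll and its members are hypothetical, and only the raw estimate (without the small-range correction) is modelled. The constants are the ones from make_alpha in the question.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical class template: Float is the type used for all arithmetic.
template <typename Float = double>
class basic_hll {
    std::size_t m_;
    std::vector<std::uint8_t> core_;

public:
    explicit basic_hll(std::size_t np)
        : m_(std::size_t{1} << np), core_(m_, 0) {}

    // Same constants as make_alpha in the question, converted to Float.
    Float alpha() const {
        switch (m_) {
            case 16: return Float(0.673);
            case 32: return Float(0.697);
            case 64: return Float(0.709);
            default: return Float(0.7213) / (Float(1) + Float(1.079) / Float(m_));
        }
    }

    // alpha * m^2 / sum(2^-register); assumes every register is below 64.
    Float raw_estimate() const {
        Float sum(0);
        for (std::uint8_t r : core_) {
            sum += Float(1) / Float(std::uint64_t{1} << r);
        }
        return alpha() * Float(m_) * Float(m_) / sum;
    }
};
```

basic_hll<> keeps the current behaviour with double, while basic_hll<float> trades precision for smaller intermediate values.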

And that's pretty much it. As you can see, it's mostly details, but details contribute to writing modern and idiomatic C++. I hope that it will help you in the future :)

This was very thorough and exactly what I needed. Thank you! The =default; options for constructors is a powerful feature I wish I'd known about! — NoSeatbelts
– NoSeatbelts, Commented Dec 12, 2016 at 1:04
@NoSeatbelts No problem. If you want a good overview of the C++11 features, the Wikipedia article is a good start. Then you can look for blog posts or articles if you want more details on a specific feature. Then there's always cppreference.com when you're unsure about something :) — Morwenn
– Morwenn, Commented Dec 12, 2016 at 20:28
