Custom memcmp function

Question

I have written my own memcmp() function for learning purposes.

bool memcmp( byte* _bufA, byte* _bufB, uint n )
{
    while ( n > 4 )
    {
        if ( (int)*_bufA != (int)*_bufB )
            return false;
        _bufA += 4, _bufB += 4;
        n -= 4;;
    }
    while( --n )
    {
        if ( *_bufA != *_bufB )
            return false;
        ++_bufA, ++_bufB;
    }
    return true;
}

My thought process was that whenever the byte stream is greater than 4 bytes I'd compare 4 bytes at a time instead of one, meaning less iterations which would mean it'd would be a little more faster.

And then just simply take care of the remaining 4 bytes or so in the second while statement.

Benchmarks:

Standard memcmp function comparing 1024 bytes: 585 Nanoseconds.

My memcmp function comparing 1024 bytes: 586 Nanoseconds.

Its pretty inconsistent but I figure its still a pretty good function since its 50/50 generally.

(int)*_bufA != (int)*_bufB doesn't do what you think it does - it only compares one byte! — immibis, 23 hours ago
Note that most names beginning with an underscore are reserved for the implementation. The full details are a bit more nuanced than that, but you won't go far wrong if you assume that. (See ISO/IEC 9899:2011, §7.1.3 Reserved identifiers: — All identifiers that begin with an underscore and either an uppercase letter or another underscore are always reserved for any use. — All identifiers that begin with an underscore are always reserved for use as identifiers with file scope in both the ordinary and tag name spaces.) — Jonathan Leffler, 22 hours ago
You should give your function a different name, since its return value is quite different from the standard function (yours only compares for equality returning true if they are, the original returns which one is bigger or 0 if they're equal. — CodesInChaos, 19 hours ago

5gon12eder · Answer 1 · 2016-02-05 02:49:42Z

Correctness

Your function does not work correctly. In my experience, memory manipulating functions almost never work correctly as they are written the first time. (At least when I'm writing them.) If you need to go down into that rabbit hole, be sure to do so with an extensive set of unit-tests. In your case, you are lucky. You have two functions you can compare your results with:

The existing memcmp function from the standard library and
the naïve implementation of a for loop that compares one byte after the other.

When I'm writing optimized functions, I usually start with the naïve implementation. Then I write a comprehensive set of tests for it to make sure I got it right. (For the naïve version, this is usually easy.) Then I start optimizing the function and after each change, run the test suite again. Should one of my optimizations introduce a correctness bug, I'll notice.

Also be sure to run your unit tests trough Valgrind and compile test programs with sanitizers added. For GCC and Clang, all you have to do is pass the -fsanitize=WHAT switch where WHAT is one of address, memory or undefined. (memory is only supported by Clang yet.)

Cast the pointer, not the value

Have a look at this piece.

if ((int) *_bufA != (int) *_bufB)
  return false;

What does it do? It dereferences the pointers _bufA and _bufB which are of type byte * (which I assume is an alias for char *). Now, only after that, you cast the result (which is of type byte) to an int. In effect, you're only looking at each forth byte.

What you have to do instead is casting the pointer before dereferencing it.

if (*((int *) _bufA) != *((int *) _bufB))
  return false;

Watch out for off-by-one errors

Your second loop is off-by-one.

while (--n) { … }

If I call your function with n == 1, it'll never look at any byte. The correct version would be to use post-decrement.

while (n--) { … }

On the other hand, your first loop could use

while (n >= 4) { … }

instead of

while (n > 4) { … }

even though this “bug” doesn't cause the function to produce a wrong answer, it is probably not what you've meant.

Do memory operations on unsigned types

While I don't think it would cause a problem in this particular case, using signed integer types can lead to all kinds of surprises. If you don't treat the values as numbers but as a bunch of bytes, stay with unsigned types.

Beware of undefined behavior

The C standard does not allow you to cast a pointer to another type (except char) to access the memory it points to. (This is a simplified explanation, search for “strict aliasing rules” for more information.) Therefore, your function invokes undefined behavior and this cannot be fixed other than by using the standard library functions.

That said, I think that this is a case where you can get pretty far with (extremely) carefully blundering the line of undefined behavior if you know what you are doing.

On many machines, memory must be aligned. That is, an object o of type T must have an address such that ((uintptr_t) &o) % alignof(T) == 0 where alignof(T) is called the alignment of type T. Unfortunately, alignof is not an operator in C (unlike in C++ since C++11). A safe estimate is to use the maximum of sizeof(T) and 8. Even on those machines where an unaligned memory access does not cause an exception, it might be slow.

Therefore, what you should do is check whether both pointers can be word-aligned by advancing both of them by the same number of bytes. If so, first do a short loop that aligns the buffers, then cast to a machine word type and compare the bulk. Finally, compare the remaining bytes with a byte-for-byte loop again.

-fsanitize=undefined can warn you about mis-aligned reads.

Portability

Don't make assumptions on `sizeof(int)`

Your function silently assumes that sizeof(int) == 4. This will not be true on all platforms. Instead of hard-coding such values, use sizeof(int). You can store it in a constant if repeating it seems too much typing.

Performance

Prefer `long`s over `int`s

In absence of any reason to do otherwise, unsigned long is your best bet to get a type that directly maps to a machine-word and will therefore give you optimal performance. Once I made this change to your code, I got significant speedup on my 64 bit machine from 2.9 GiB/s to 4.3 GiB/s!

Be sure to measure

Doing low-level optimization without measuring is a waste of time. You should compare your function with (at least) two alternatives:

the naïve byte-for-byte approach and
at least one highly optimized standard library implementation.

Before you do any optimization, be sure to compare those two approaches when compiled with all compiler optimizations turned on. If they don't differ much, you're probably done. Otherwise, deploy one optimization at a time and see how the performance of your implementation goes from “naïve” to “standard library”. If it goes below “naïve”, you're doing something wrong. If it goes above “standard library”, you did something awesome (or your function has a bug and doesn't do the same thing as the standard library version).

Writing good benchmarks is very hard. As a general rule, don't look at a single input but measure how the run-time scales with the input size. Also investigate the inevitable statistical errors your measurements will contain. I like to fit regressions to the data series and get a statistically more significant result from the fit parameters. Be sure to test large enough inputs so to reduce noise. Timings in the sub-millisecond range as you reported them are hardly significant.

I did this for your problem and got this result on my machine. Your version (with the fixes discussed above and using unsigned long) is not doing bad. Note, however, that the benchmark always tests the lucky case where the buffers can be aligned. I'll leave it as an exercise to you to test the more realistic case, too.

I'm dumping the code I've used to produce this data without any explanation here only so you can repeat the experiments on your hardware if you want to. It is a quick-and-dirty solution and I'm not particularly proud of the coding style but I believe that it tests accurately.

main.c

#define _XOPEN_SOURCE 500  // random, srandom

#include <assert.h>   // assert
#include <math.h>     // sqrt, isfinite
#include <stdbool.h>  // bool
#include <stddef.h>   // size_t
#include <stdio.h>    // stderr, FILE, fopen, fclose, fprintf, snprintf
#include <stdlib.h>   // random, srandom, malloc, free
#include <string.h>   // memcmp
#include <time.h>     // CLOCKS_PER_SEC, clock

#include "memcmp_bench.h"


typedef bool (* memcmp_type)(const void *, const void *, size_t);

typedef struct
{
  double mean;
  double stdev;
} timing_result;

static volatile int tally;
static const char * errmsg;

static double
square(const double x)
{
  return x * x;
}

static int
do_statistics(const double *const datapoints,
              const size_t n,
              timing_result *const resultptr)
{
  assert(n >= 3);
  double accu = 0.0;
  for (size_t i = 0; i < n; ++i)
    accu += datapoints[i];
  const double mean = accu / n;
  if (!isfinite(mean))
    {
      errmsg = "non-finite mean";
      return -1;
    }
  accu = 0.0;
  for (size_t i = 0; i < n; ++i)
    accu += square(datapoints[i] - mean);
  const double stdev = sqrt(accu / (n - 1));
  if (!isfinite(stdev))
    {
      errmsg = "non-finite standard deviation";
      return -1;
    }
  resultptr->mean = mean;
  resultptr->stdev = stdev;
  return 0;
}

typedef unsigned long word_type;
static const size_t word_size = sizeof(word_type);

bool
maybe_clobber(word_type *const wp)
{
  assert(wp != NULL);
  unsigned char bytes[sizeof(word_type)];
  const word_type before = *wp;
  memcpy(bytes, wp, word_size);
  const size_t index = random() % word_size;
  if ((random() % 3) != 0)
    bytes[index] = (unsigned char) random();
  memcpy(wp, bytes, word_size);
  return (*wp != before);
}

static int
run_benchmark(const memcmp_type funcptr,
              const size_t input_size,
              const size_t repetitions,
              word_type *const buff1st,
              word_type *const buff2nd,
              timing_result *const resultptr)
{
  assert(funcptr != NULL);
  assert(resultptr != NULL);
  assert(repetitions >= 3);
  assert(buff1st != NULL);
  assert(buff2nd != NULL);
  const size_t words = input_size / word_size;
  const size_t bytes = words * word_size;
  double timings[repetitions + 1];
  for (size_t i = 0; i < words; ++i)
    {
      const word_type w = random();
      buff1st[i] = w;
      buff2nd[i] = w;
    }
  for (size_t i = 0; i <= repetitions; ++i)
    {
      const clock_t no_clock = (clock_t) -1;
      const bool expected = (words >= 2)
        ? !maybe_clobber(buff2nd + words - 2)
        : true;
      const size_t skip = random() % word_size;
      const clock_t t0 = clock();
      const int actual = funcptr((const char *) buff1st + skip,
                                 (const char *) buff2nd + skip,
                                 bytes - skip);
      const clock_t t1 = clock();
      if (actual != expected)
        {
          errmsg = "function returned wrong result";
          goto label_catch;
        }
      tally += actual;
      if ((t0 == no_clock) || (t1 == no_clock))
        {
          errmsg = "cannot get CPU time";
          goto label_catch;
        }
      timings[i] = (((double) t1) - ((double) t0)) / CLOCKS_PER_SEC;
      if (words >= 2)
        buff2nd[words - 2] = buff1st[words - 2];
    }
  if (do_statistics(timings + 1, repetitions, resultptr) < 0)
    goto label_catch;
  goto label_finally;
 label_catch:
  assert(errmsg != NULL);
 label_finally:
  return (errmsg == NULL) ? 0 : -1;
}

static int
run_all_benchmarks()
{
  const size_t repetitions = 10;
  const size_t datapoints = 50;
  const size_t max_size = 99 * (1ULL << 20);  // 100 MiB
  const size_t candidates = 3;
  const memcmp_type cand_funcs[] = {memcmp_stdlib, memcmp_naive, memcmp_yamiez};
  const char *const cand_names[] = {"stdlib",      "naive",      "yamiez"};
  word_type *const buff1st = malloc(max_size);
  word_type *const buff2nd = malloc(max_size);
  if ((buff1st == NULL) || (buff2nd == NULL))
    {
      errmsg = "out of memory";
      goto label_catch_outer;
    }
  for (size_t candidx = 0; candidx < candidates; ++candidx)
    {
      FILE * fh = NULL;
      fprintf(stderr, "Running benchmarks for '%s'... ", cand_names[candidx]);
      char filename[128];
      const int status = snprintf(filename, sizeof(filename), "timing_%s.dat",
                                  cand_names[candidx]);
      if ((status < 0) || ((size_t) status >= sizeof(filename)))
        {
          errmsg = "error in snprintf";
          goto label_catch_inner;
        }
      fh = fopen(filename, "w");
      if (fh == NULL)
        {
          errmsg = "cannot open output file";
          goto label_catch_inner;
        }
      if (fprintf(fh, "# %22s %24s %24s\n\n", "n", "mean / s", "stdev / s") < 0)
        {
          errmsg = "I/O error";
          goto label_catch_inner;
        }
      for (size_t j = 0; j < datapoints; ++j)
        {
          size_t n = random() % max_size;
          timing_result result = {0.0, 0.0};
          if (run_benchmark(cand_funcs[candidx], n, repetitions, buff1st, buff2nd, &result) < 0)
            goto label_catch_inner;
          if (fprintf(fh, "%24zu %24.10e %24.10e\n", n, result.mean, result.stdev) < 0)
            {
              errmsg = "I/O error";
              goto label_catch_inner;
            }
        }
      goto label_finally_inner;
    label_catch_inner:
      assert(errmsg != NULL);
    label_finally_inner:
      if (fh != NULL)
        {
          if (fclose(fh) != 0)
            errmsg = "error closing output file";
        }
      fprintf(stderr, "%s\n", (errmsg != NULL) ? "failed" : "done");
      if (errmsg != NULL)
        goto label_finally_outer;
    }
  goto label_finally_outer;
 label_catch_outer:
  assert(errmsg != NULL);
 label_finally_outer:
  free(buff1st);
  free(buff2nd);
  return (errmsg != NULL) ? -1 : 0;
}


int
main()
{
  srandom((unsigned int) clock());
  if (run_all_benchmarks() < 0)
    {
      if (errmsg == NULL)
        errmsg = "unknown error";
      fprintf(stderr, "error: %s\n", errmsg);
      return EXIT_FAILURE;
    }
  return EXIT_SUCCESS;
}

memcmp_bench.h

#ifndef MEMCMP_BENCH_H
#define MEMCMP_BENCH_H

#include <stdbool.h>  // bool
#include <stddef.h>   // size_t

#ifdef __cplusplus
extern "C" {
#endif

bool
memcmp_stdlib(const void * s1, const void * s2, size_t n)
  __attribute__ ((hot));

bool
memcmp_naive(const void * s1, const void * s2, size_t n)
  __attribute__ ((hot));

bool
memcmp_yamiez(const void * s1, const void * s2, size_t n)
  __attribute__ ((hot));

#ifdef __cplusplus
}  // extern "C"
#endif

#endif  // #ifndef MEMCMP_BENCH_H

memcmp_stdlib.c

#include "memcmp_bench.h"

#include <string.h>  // memcmp


bool
memcmp_stdlib(const void *const s1, const void *const s2, const size_t n)
{
  return memcmp(s1, s2, n) == 0;
}

memcmp_naive.c

#include "memcmp_bench.h"


bool
memcmp_naive(const void *const s1, const void *const s2, const size_t n)
{
  const unsigned char *const c1 = s1;
  const unsigned char *const c2 = s2;
  for (size_t i = 0; i < n; ++i)
    {
      if (c1[i] != c2[i])
        return false;
    }
  return true;
}

memcmp_yamiez.c

#include "memcmp_bench.h"

#include <stdint.h>  // uintptr_t


bool
memcmp_yamiez(const void *const s1, const void *const s2, size_t n)
{
  typedef unsigned char byte_type;
  typedef unsigned long word_type;
  const size_t word_size = sizeof(word_type);
  const size_t word_align = (word_size >= 8) ? word_size : 8;
  const uintptr_t align_mask = word_align - 1;
  const byte_type * buf1 = s1;
  const byte_type * buf2 = s2;
  const uintptr_t addr1 = (uintptr_t) s1;
  const uintptr_t addr2 = (uintptr_t) s2;
  if ((addr1 & align_mask) == (addr2 & align_mask))
    {
      const uintptr_t skip = word_align - (addr1 & align_mask);
      for (uintptr_t i = 0; i < skip; ++i)
        {
          if (*buf1++ != *buf2++)
            return false;
          --n;
        }
      const word_type * wbuf1 = (const word_type *) buf1;
      const word_type * wbuf2 = (const word_type *) buf2;
      while (n >= word_size)
        {
          if (*wbuf1++ != *wbuf2++)
            return false;
          n -= word_size;
        }
      buf1 = (const byte_type *) wbuf1;
      buf2 = (const byte_type *) wbuf2;
    }
  while (n--)
    {
      if (*buf1++ != *buf2++)
        return false;
    }
  return true;
}

Here is the Makefile I have used to compile the benchmarks.

CC = gcc -std=c11
CPPFLAGS = -DNDEBUG
CFLAGS = -Wall -Wextra -Werror -pedantic -O3
LDFLAGS =
LIBS = -lm


all: main

main: main.o memcmp_stdlib.o memcmp_yamiez.o memcmp_naive.o
    ${CC} -o $@ ${CFLAGS} $^ ${LDFLAGS} ${LIBS}

main.o: main.c memcmp_bench.h
    ${CC} -c ${CPPFLAGS} ${CFLAGS} $<

memcmp_stdlib.o: memcmp_stdlib.c memcmp_bench.h
    ${CC} -c ${CPPFLAGS} ${CFLAGS} $<

memcmp_yamiez.o: memcmp_yamiez.c memcmp_bench.h
    ${CC} -c ${CPPFLAGS} ${CFLAGS} $<

memcmp_naive.o: memcmp_naive.c memcmp_bench.h
    ${CC} -c ${CPPFLAGS} ${CFLAGS} $<

clean:
    rm -f main *.o

.PHONY: all clean

And here is the Gnuplot script that produces the graphic shown above.

#! /usr/bin/gnuplot

set terminal png size 800,600
set output 'plot.png'

set key top left
set xlabel 'size / MiB'
set ylabel 'time / ms'

set xrange [0 : *]
set yrange [0 : *]

set format y '%.0f'
set ytics add ('0' 0)

kilo = 1000.0
mebi = 1048576.0

c2gibps(x) = kilo / (x * 1024.0)

regression_stdlib(x) = c1_stdlib * x
regression_naive(x)  = c1_naive  * x
regression_yamiez(x) = c1_yamiez * x

fit regression_stdlib(x) 'timing_stdlib.dat' using ($1 / mebi):(kilo * $2):(kilo * $3) yerror via c1_stdlib
fit regression_naive(x)  'timing_naive.dat'  using ($1 / mebi):(kilo * $2):(kilo * $3) yerror via c1_naive
fit regression_yamiez(x) 'timing_yamiez.dat' using ($1 / mebi):(kilo * $2):(kilo * $3) yerror via c1_yamiez

plot 'timing_stdlib.dat' using ($1 / mebi):(kilo * $2):(kilo * $3) with yerror \
         title sprintf('%s (%.2f GiB/s)', "stdlib", c2gibps(c1_stdlib))        \
         lc '#cc0000',                                                         \
     regression_stdlib(x) notitle lc '#cc0000',                                \
     'timing_naive.dat' using ($1 / mebi):(kilo * $2):(kilo * $3) with yerror  \
         title sprintf('%s (%.2f GiB/s)', "naive", c2gibps(c1_naive))          \
         lc '#3465a4',                                                         \
     regression_naive(x) notitle lc '#3465a4',                                 \
     'timing_yamiez.dat' using ($1 / mebi):(kilo * $2):(kilo * $3) with yerror \
         title sprintf('%s (%.2f GiB/s)', "yamiez", c2gibps(c1_yamiez))        \
         lc '#73d216',                                                         \
     regression_yamiez(x) notitle lc '#73d216'

Learn from the giants

There has been massive interest in optimizing memory functions and standard libraries have evolved highly optimized versions. If you want to learn from them, have a look at their code. For example, look at the version in the GNU C library. If reading that code gives you a headache, you are in good companion.

Coding Style

Use the proper types

In C, a read-only buffer of arbitrary data should be passed as const void *. The size of a buffer should be passed as size_t.

Be `const` correct

Since your function only ever reads the data from the buffer it is passed, it should be declared const.

Don't use identifiers that begin with an underscore

Identifiers that begin with an underscore followed by an upper-case letter and identifiers containing two adjacent underscores are reserved for the implementation. I recognize that your variables don't match this pattern but _bufA still looks bogus. Just drop the underscore.

Avoid confusing notation

I find this hard to read.

if (*bufA != *bufB)
  return false;
++bufA, ++bufB;

How about the commonly used form?

if (*bufA++ != *bufB++)
  return false;

Less typing and easier to understand. In general, I'd avoid putting more than one statement on a single line and not chain multiple statements together with the comma operator.

Fantastic write up. One of the best I have seen. Especially the use of the graph to show the relative speeds. But I disagree with Prefer longs over ints not that I prefer ints but I think it is another thing that can be explicitly tested to see if it improves performance (I suspect it will but am not 100% sure that it will help on all architectures). — Loki Astari, 23 hours ago
@LokiAstari Thank you ;-) I don't think that ints will be faster than longs on (m)any platforms. They might be the same, of course. As I've written in my answer, I've gained almost 50 % speedup just by changing the typedef but of course, that only reinforces your suggestion to always measure. If you need to optimize for platforms you don't have to experiment on, maybe using uint32_fast_t might be useful, hoping the standard library implementers knew what they did. — 5gon12eder, 23 hours ago
hoping the standard library implementers knew what they did :-) — Loki Astari, 22 hours ago

vnp · Answer 2 · 2016-02-04 19:38:59Z

The idea of comparing larger pieces of data is appealing, but requires great care to implement.

The code assumes that the integer is 4 bytes wide. c doesn't guarantee it. A simple fix is to replace your 4s with sizeof(int).
Alignment of buffers. If they happen to not be aligned on an int boundary anything may happen, from fault to slowdown to wrong results, depending on the underlying hardware architecture.

ChrisWue · Answer 3 · 2016-02-05 01:16:30Z

If speed is your goal then the next simple step would be to first use long long which in C99 or later is guaranteed to have at least 64bit.

The next step after this is using streaming intrinsics, for example _mm512_cmpeq_epu8_mask allows you to compare 64 bytes at once. However this is pretty much leaving the land of portability so depends on what exactly you want to do.

In general memcmp is probably by and large mostly memory bandwidth rather than CPU limited, so once the memory bandwidth is saturated comparing more stuff in parallel won't achieve much. For learning and educational purposes this might be a fun exercise though.

CiaPan · Answer 4 · 2016-02-05 09:24:40Z

It's really hard to add anything to the 5gon12eder's answer, but I'll try.

You tried to optimize the memory access by fetching 4 bytes at a time. That's a good approach. However be aware that the input buffers may be misaligned. If one of them starts at the dword (4-multiple) boundary and another one is one byte off (say, in case of memcmp(buf, buf+5) call), then accessing a misaligned 4-byte chunk may require two bus cycles per read. Of course a memory cache may help with minimizing that additional cost, but it's worth remembering the cost may appear.

Additionally in some hardware architectures accessing misaligned data is strictly forbidden – if you try to read a 16-bit word from odd address, or a 32-bit dword from an address not divisible by 4, an exception is raised.
Google: http://www.google.com/?#q=misaligned+access+exception

So to be fully portable and the fastest possible, your function should check the input pointers for the least common alignment factor and perform a loop with properly chosen chunk size (one- to four-, or even eight-byte, with appropriate type of char, int16, long, long long, depending on what types are available and what size they are).

Falling back to types with alignment between char and long long when the buffers are only “partially aligned” is indeed a good suggestion. — 5gon12eder, 9 hours ago

asked	yesterday
viewed	381 times
active	today

current community

your communities

more stack exchange communities

Custom memcmp function

4 Answers 4

Correctness

Cast the pointer, not the value

Watch out for off-by-one errors

Do memory operations on unsigned types

Beware of undefined behavior

Portability

Don't make assumptions on `sizeof(int)`

Performance

Prefer `long`s over `int`s

Be sure to measure

Learn from the giants

Coding Style

Use the proper types

Be `const` correct

Don't use identifiers that begin with an underscore

Avoid confusing notation

Your Answer

Not the answer you're looking for? Browse other questions tagged performance c reinventing-the-wheel or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Custom memcmp function

4 Answers 4

Correctness

Cast the pointer, not the value

Watch out for off-by-one errors

Do memory operations on unsigned types

Beware of undefined behavior

Portability

Don't make assumptions on sizeof(int)

Performance

Prefer longs over ints

Be sure to measure

Learn from the giants

Coding Style

Use the proper types

Be const correct

Don't use identifiers that begin with an underscore

Avoid confusing notation

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged performance c reinventing-the-wheel or ask your own question.

Related

Hot Network Questions

Don't make assumptions on `sizeof(int)`

Prefer `long`s over `int`s

Be `const` correct