You're stuck at a point where your readability is good, and any further performance improvement is going to come at the expense of making the code uglier. Additionally, you're already making architecture choices based on the compiler's interpretation of unsigned. It's already not pretty.
Oh, and the un-braced 1-line for-loop is a problem for readability too.
I don't know of any optimizations available at this point that won't come at a portability or readability cost. For a block size of 16 bytes, and with a likely 4-byte unsigned value, the odds are that your loop only iterates 4 times anyway.
The compiler may unroll that loop, and move on.
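For context, I'm assuming the loop in question looks roughly like the sketch below (my paraphrase of its shape, not your actual code; the name xor_block and BLOCKSIZE are placeholders). With a 4-byte unsigned this is the 4-iteration loop described above, and braces cost nothing:

// Hypothetical shape of the current code: XOR two 16-byte blocks word by word.
const unsigned BLOCKSIZE = 16;

void xor_block(unsigned *dst, const unsigned *a, const unsigned *b)
{
    // 4 iterations when unsigned is 4 bytes
    for (unsigned i = 0; i < BLOCKSIZE / sizeof(unsigned); ++i) {
        dst[i] = a[i] ^ b[i];
    }
}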
All the above waffle really means is: from here on, you cannot have all three of clean code, portability, and performance. You need to compromise somewhere.
Profiling would be the logical thing to do: profile your current code and establish a baseline.
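As a rough illustration, a minimal timing harness could look like the following. The xor_block call and its signature are my assumption (the sketch above), not your actual interface, and numbers from something this small are only a baseline figure to improve on:

#include <chrono>
#include <cstdio>

void benchmark()
{
    unsigned dst[4] = {0}, a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    const long iterations = 100000000L;
    unsigned sink = 0;

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        xor_block(dst, a, b);
        sink += dst[0];   // consume the result so the optimizer keeps the work
    }
    auto stop = std::chrono::steady_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::printf("checksum %u, %.2f ns per call\n", sink, double(ns) / iterations);
}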
I would then suggest that you investigate whether you can have a larger block size. Any alignment fiddling you may have to do will pay off better if you are working with larger blocks. Can you batch these blocks into 1 MB contiguous regions?
With larger contiguous blocks you could then consider vectorization, alignment, and other optimizations that have a higher, fixed setup cost but improved throughput.
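To make that concrete, here is the kind of shape I mean (the 1 MB figure and the word-at-a-time stride are assumptions, and xor_region is a name I made up). A plain loop over a large, contiguous, well-aligned region is exactly what auto-vectorizers handle well:

#include <stddef.h>

// Hypothetical: XOR an entire contiguous region (e.g. many blocks batched
// into ~1 MB) in one pass. At -O2/-O3 most compilers will vectorize this
// loop on their own.
void xor_region(unsigned long long *dst,
                const unsigned long long *a,
                const unsigned long long *b,
                size_t words)                 // region size in 64-bit words
{
    for (size_t i = 0; i < words; ++i) {
        dst[i] = a[i] ^ b[i];
    }
}

The per-call overhead (function call, loop setup, any alignment fix-up) is then amortised over the whole region instead of over 16 bytes.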
Additionally, I would consider forcing bigsize_t to be unsigned long long, and then manually unrolling the loop. If you can find an exact type that is 64 bits (there has to be one? and, if not, you can have an alternate implementation...), then you can force your type to match:
#include <limits.h>

// Pick whichever standard unsigned type is exactly 64 bits wide.
#if UINT_MAX == 18446744073709551615ULL
typedef unsigned int big64_t;
#elif ULONG_MAX == 18446744073709551615ULL
typedef unsigned long big64_t;
#elif ULLONG_MAX == 18446744073709551615ULL
typedef unsigned long long big64_t;
#else
#error "Cannot find an unsigned 64-bit integer."
#endif

// XOR the 16-byte block as two 64-bit words.
big64_t *s = reinterpret_cast<big64_t *>(dst);
const big64_t *au = reinterpret_cast<const big64_t *>(a);
const big64_t *bu = reinterpret_cast<const big64_t *>(b);

s[0] = au[0] ^ bu[0];
s[1] = au[1] ^ bu[1];
On systems that are not natively 64-bit, the compiler will still turn this into reasonably efficient code anyway.
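One caveat on the reinterpret_cast: if dst, a and b were not originally allocated as 64-bit objects, the casts technically violate strict aliasing and may be misaligned on some targets. A memcpy-based variant, sketched below using the same big64_t typedef (the function name xor16 and the unsigned char * parameters are my own), expresses the same two 64-bit XORs without those concerns:

#include <string.h>

// Same two 64-bit XORs, but without aliasing/alignment assumptions on dst, a, b.
// big64_t is the typedef selected by the preprocessor block above.
void xor16(unsigned char *dst, const unsigned char *a, const unsigned char *b)
{
    big64_t au[2], bu[2], s[2];
    memcpy(au, a, sizeof au);
    memcpy(bu, b, sizeof bu);
    s[0] = au[0] ^ bu[0];
    s[1] = au[1] ^ bu[1];
    memcpy(dst, s, sizeof s);
}

Mainstream compilers recognise these small, fixed-size memcpy calls and lower them to plain 64-bit loads and stores at normal optimization levels, so there is no copying penalty in practice.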
In the event that this is the year 2100 and all data types are 128 bits or more, then you could add a branch that does a single 128-bit XOR for your input.
I see the above as equally portable and equally readable. As for the performance, that will require a profile and a benchmark.
uint8_t to unsigned can be done by using bit and shift operations: for example, a = *a | (++a << 8); or some alike – MORTAL Jan 9 at 14:50