You're stuck at a point where your readability is good, and any further performance improvement is going to come at the expense of making the code uglier. Additionally, you're already making architecture choices based on the compiler's interpretation of unsigned. It's already not pretty.
Oh, and the un-braced 1-line for-loop is a problem for readability too.
I don't know of any optimizations available at this point that won't come at a portability or readability cost. For a block size of 16 bytes, and with a likely 4-byte unsigned value, the odds are that your loop only iterates 4 times anyway.
The compiler may unroll that loop, and move on.
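For context, I'm assuming the loop in question looks roughly like the sketch below (my paraphrase of its shape, not your actual code; the name xor_block and BLOCKSIZE are placeholders). With a 4-byte unsigned this is the 4-iteration loop described above, and braces cost nothing:

// Hypothetical shape of the current code: XOR two 16-byte blocks word by word.
const unsigned BLOCKSIZE = 16;

void xor_block(unsigned *dst, const unsigned *a, const unsigned *b)
{
    // 4 iterations when unsigned is 4 bytes
    for (unsigned i = 0; i < BLOCKSIZE / sizeof(unsigned); ++i) {
        dst[i] = a[i] ^ b[i];
    }
}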
All the above waffle really means is: from here on, you cannot have all three of clean code, portability, and performance. You need to compromise somewhere.
Profiling would be the logical thing to do: profile your current code and establish a baseline.
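As a rough illustration, a minimal timing harness could look like the following. The xor_block call and its signature are my assumption (the sketch above), not your actual interface, and numbers from something this small are only a baseline figure to improve on:

#include <chrono>
#include <cstdio>

void benchmark()
{
    unsigned dst[4] = {0}, a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    const long iterations = 100000000L;
    unsigned sink = 0;

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        xor_block(dst, a, b);
        sink += dst[0];   // consume the result so the optimizer keeps the work
    }
    auto stop = std::chrono::steady_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::printf("checksum %u, %.2f ns per call\n", sink, double(ns) / iterations);
}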
I would then suggest that you investigate whether you can have a larger block size. Any alignment fiddling you may have to do will pay off better if you are working with larger blocks. Can you batch these blocks into 1 MB contiguous regions?
With larger contiguous blocks you could then consider vectorization, alignment, and other optimizations that have a higher, fixed setup cost but improved throughput.
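To make that concrete, here is the kind of shape I mean (the 1 MB figure and the word-at-a-time stride are assumptions, and xor_region is a name I made up). A plain loop over a large, contiguous, well-aligned region is exactly what auto-vectorizers handle well:

#include <stddef.h>

// Hypothetical: XOR an entire contiguous region (e.g. many blocks batched
// into ~1 MB) in one pass. At -O2/-O3 most compilers will vectorize this
// loop on their own.
void xor_region(unsigned long long *dst,
                const unsigned long long *a,
                const unsigned long long *b,
                size_t words)                 // region size in 64-bit words
{
    for (size_t i = 0; i < words; ++i) {
        dst[i] = a[i] ^ b[i];
    }
}

The per-call overhead (function call, loop setup, any alignment fix-up) is then amortised over the whole region instead of over 16 bytes.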
Additionally, I would consider forcing bigsize_t to be unsigned long long, and then manually unrolling the loop. If you can find an exact type that is 64 bits (there has to be one? and, if not, you can have an alternate implementation...), then you can force your type to match:
#include <limits.h>

// Pick whichever standard unsigned type is exactly 64 bits wide.
#if UINT_MAX == 18446744073709551615ULL
typedef unsigned int big64_t;
#elif ULONG_MAX == 18446744073709551615ULL
typedef unsigned long big64_t;
#elif ULLONG_MAX == 18446744073709551615ULL
typedef unsigned long long big64_t;
#else
#error "Cannot find an unsigned 64-bit integer."
#endif

// XOR the 16-byte block as two 64-bit words.
big64_t *s = reinterpret_cast<big64_t *>(dst);
const big64_t *au = reinterpret_cast<const big64_t *>(a);
const big64_t *bu = reinterpret_cast<const big64_t *>(b);

s[0] = au[0] ^ bu[0];
s[1] = au[1] ^ bu[1];
On systems that are not natively 64-bit, the compiler will still turn this into reasonably efficient code anyway.
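One caveat on the reinterpret_cast: if dst, a and b were not originally allocated as 64-bit objects, the casts technically violate strict aliasing and may be misaligned on some targets. A memcpy-based variant, sketched below using the same big64_t typedef (the function name xor16 and the unsigned char * parameters are my own), expresses the same two 64-bit XORs without those concerns:

#include <string.h>

// Same two 64-bit XORs, but without aliasing/alignment assumptions on dst, a, b.
// big64_t is the typedef selected by the preprocessor block above.
void xor16(unsigned char *dst, const unsigned char *a, const unsigned char *b)
{
    big64_t au[2], bu[2], s[2];
    memcpy(au, a, sizeof au);
    memcpy(bu, b, sizeof bu);
    s[0] = au[0] ^ bu[0];
    s[1] = au[1] ^ bu[1];
    memcpy(dst, s, sizeof s);
}

Mainstream compilers recognise these small, fixed-size memcpy calls and lower them to plain 64-bit loads and stores at normal optimization levels, so there is no copying penalty in practice.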
In the event that this is the year 2100 and all data types are 128 bits or more, then you could add a branch that does a single 128-bit XOR for your input.
I see the above as equally portable and equally readable. As for the performance, that will require a profile and a benchmark.
uint8_t to unsigned can be done by using bit and shift operations: for example, a = *a | (++a << 8); or some alike – MORTAL Jan 9 at 14:50