I want to do a popcount of bytes in an array.
Not a total popcount of everything, but lots of little popcounts for every individual byte to determine the neighborhood count in GoL.
The following plain x64 code does the job nicely.
function PopCountPerByte(input: int64): int64;
asm
//RCX = input
mov RAX, $5555555555555555
mov RDX, RCX
shr RDX, 1
and RDX, RAX
and RCX, RAX
mov RAX, $3333333333333333
add RCX, RDX
mov RDX, RCX
shr RDX, 2
and RDX, RAX
and RCX, RAX
add RCX, RDX
mov RDX, RCX
shr RDX, 4
add RCX, RDX
mov RAX, $0f0f0f0f0f0f0f0f
and RAX, RCX
//RAX=output
end;
However I was hoping I could do better with SSE2, because every x64 CPU has SSE2.
function PopCountPerByteSSE2(input1, input2: int64): int128;
asm
//RCX = input1, RDX=input2
mov RAX, $5555555555555555
movq XMM5, RAX
movlhps XMM5, XMM5 //XMM5 all 5's
movq XMM1, RCX
movq XMM2, RDX
MOVLHPS XMM1, XMM2 //XMM1 = input
movdqa XMM2, XMM1
PSRLD XMM2, 1
pand XMM1, XMM5
pand XMM2, XMM5
mov RAX, $3333333333333333
movq XMM3, RAX
movlhps XMM3, XMM3 //XMM3 all 3's
paddq XMM1, XMM2
movdqa XMM2, XMM1
PSRLD XMM2, 2
pand XMM1, XMM3
pand XMM2, XMM3
paddq XMM1, XMM2
movdqa XMM2, XMM1
PSRLD XMM2, 4
paddq XMM1, XMM2
mov RAX, $0f0f0f0f0f0f0f0f
movq XMM0, RAX
movlhps XMM0, XMM0
pand XMM0,XMM1
//XMM0 is output
end;
The SSE2 routine works on double the amount of bits, but the timings don't bode well.
freq = 2143105
x64 popcount took 8795 Ticks 4 ms
sse popcount took 13873 Ticks 6 ms
Result: sse is 57% slower for the same amount of bytes processed.
I run the above code in a loop and take the lowest timing after running the loop 20x. Obviously the SSE loop has half as many iterations as the x64 loop because it does double the work.
Can the SSE loop be optimized, so that it outperforms straight x64 code?
The code should run well on any x64 processor and not just on the latest 10nm generations.
EDIT - lets lighten the loop
Moving the masks out of the SSE loop makes the code quite a bit faster
procedure InitPopcountSSE2;
asm
mov RAX,$5555555555555555
movq XMM5, RAX
PUNPCKLBW XMM5,XMM5
mov RAX,$3333333333333333
movq XMM3,RAX
PUNPCKLBW XMM3,XMM3
mov RAX,$0f0f0f0f0f0f0f0f
movq XMM0, RAX
PUNPCKLBW XMM0,XMM0
end;
The result is:
freq = 2143105
x64 popcount took 8797 Ticks 4 ms
sse old popcount took 13873 Ticks 6 ms
SSE improved popcount took 10284 Ticks 4 ms
Result: SSE improved is 15% slower for the same about of bytes processed.
EDIT2 - and break up the dependency chain
Running an extra count in parallel helps to relieve the long dependency chain in the SSE code. This makes it a paltry 18% faster than the plain x64 code.
function PopCountPerByteSSE2(i1,i2,i3,i4: nativeint): nativeint;
asm
//RCX, RDX, R8, R9 = input
movq XMM1, RCX
movq XMM2, RDX
MOVLHPS XMM1, XMM2 //XMM1 = input
movq XMM4, R8
movq XMM6, R9
MOVLHPS XMM4, XMM6
movdqa XMM2, XMM1
movdqa XMM6, XMM4
PSRLD XMM1, 1
pand XMM2,XMM5
pand XMM1,XMM5
PSRLD XMM4, 1
pand XMM4,XMM5
pand XMM6,XMM5
paddq XMM2,XMM1
paddq XMM6,XMM4
movdqa XMM1,XMM2
movdqa XMM4,XMM6
PSRLD XMM2,2
pand XMM1,XMM3
pand XMM2,XMM3
PSRLD XMM4,2
pand XMM4,XMM3
pand XMM6,XMM3
paddq XMM1,XMM2
paddq XMM4,XMM6
movdqa XMM2,XMM1
movdqa XMM6,XMM4
PSRLD XMM2,4
paddq XMM1,XMM2
pand XMM1,XMM0
PSRLD XMM6,4
paddq XMM4,XMM6
pand XMM4,XMM0
//XMM1, XMM4 = output
end;
freq = 2143105
x64 popcount took 8808 Ticks 4 ms
sse parallel popcount took 7208 Ticks 3 ms
Result: SSE is 18% Faster
Edit 3 - Parallelizing the x64 code
A simple interleaving of a parallel count in the x64 code speeds things up more than the SSE code: result:
freq = 2143105
x64 popcount took 9284 Ticks 4 ms
x64-per 2 popcount took 6850 Ticks 3 ms
sse parallel popcount took 7508 Ticks 3 ms
Result: SSE is 19% faster, x64-per 2 is 26% faster.
popcnt
would not make it faster. I'm running 16 popcounts in parallel. I'm only popcounting individual bytes, not the full thing. The popcount is for GoF BTW. – Johan Feb 23 at 0:55