Code Review Stack Exchange is a question and answer site for peer programmer code reviews. Join them; it only takes a minute:

Sign up

Here's how it works:

Anybody can ask a question
Anybody can answer
The best answers are voted up and rise to the top

Parallel popcount of packed bytes

up vote 4 down vote favorite

I want to do a popcount of bytes in an array.

Not a total popcount of everything, but lots of little popcounts for every individual byte to determine the neighborhood count in GoL.

The following plain x64 code does the job nicely.

function PopCountPerByte(input: int64): int64;
asm
  //RCX = input
  mov RAX, $5555555555555555
  mov RDX, RCX
  shr RDX, 1
  and RDX, RAX
  and RCX, RAX
  mov RAX, $3333333333333333
  add RCX, RDX
  mov RDX, RCX
  shr RDX, 2
  and RDX, RAX
  and RCX, RAX
  add RCX, RDX
  mov RDX, RCX
  shr RDX, 4
  add RCX, RDX
  mov RAX, $0f0f0f0f0f0f0f0f
  and RAX, RCX
  //RAX=output
end;

However I was hoping I could do better with SSE2, because every x64 CPU has SSE2.

function PopCountPerByteSSE2(input1, input2: int64): int128;
asm
  //RCX = input1, RDX=input2
  mov     RAX,  $5555555555555555
  movq    XMM5, RAX
  movlhps XMM5, XMM5 //XMM5 all 5's
  movq    XMM1, RCX
  movq    XMM2, RDX
  MOVLHPS XMM1, XMM2 //XMM1 = input
  movdqa  XMM2, XMM1
  PSRLD   XMM2, 1
  pand    XMM1, XMM5
  pand    XMM2, XMM5
  mov     RAX,  $3333333333333333
  movq    XMM3, RAX
  movlhps XMM3, XMM3 //XMM3 all 3's
  paddq   XMM1, XMM2
  movdqa  XMM2, XMM1
  PSRLD   XMM2, 2
  pand    XMM1, XMM3
  pand    XMM2, XMM3
  paddq   XMM1, XMM2
  movdqa  XMM2, XMM1
  PSRLD   XMM2, 4
  paddq   XMM1, XMM2
  mov     RAX,  $0f0f0f0f0f0f0f0f
  movq    XMM0, RAX
  movlhps XMM0, XMM0
  pand    XMM0,XMM1
//XMM0 is output
end;

The SSE2 routine works on double the amount of bits, but the timings don't bode well.

freq = 2143105
x64 popcount took 8795 Ticks 4 ms
sse popcount took 13873 Ticks 6 ms
Result: sse is 57% slower for the same amount of bytes processed.

I run the above code in a loop and take the lowest timing after running the loop 20x. Obviously the SSE loop has half as many iterations as the x64 loop because it does double the work.

Can the SSE loop be optimized, so that it outperforms straight x64 code?

The code should run well on any x64 processor and not just on the latest 10nm generations.

EDIT - lets lighten the loop
Moving the masks out of the SSE loop makes the code quite a bit faster

procedure InitPopcountSSE2;
asm
  mov     RAX,$5555555555555555
  movq    XMM5, RAX
  PUNPCKLBW XMM5,XMM5
  mov     RAX,$3333333333333333
  movq    XMM3,RAX
  PUNPCKLBW XMM3,XMM3
  mov     RAX,$0f0f0f0f0f0f0f0f
  movq    XMM0, RAX
  PUNPCKLBW XMM0,XMM0
end;

The result is:

freq = 2143105
x64 popcount took 8797 Ticks 4 ms
sse old popcount took 13873 Ticks 6 ms
SSE improved popcount took 10284 Ticks 4 ms
Result: SSE improved is 15% slower for the same about of bytes processed.

EDIT2 - and break up the dependency chain
Running an extra count in parallel helps to relieve the long dependency chain in the SSE code. This makes it a paltry 18% faster than the plain x64 code.

function PopCountPerByteSSE2(i1,i2,i3,i4: nativeint): nativeint;
asm
//RCX, RDX, R8, R9 = input
  movq    XMM1, RCX
  movq    XMM2, RDX
  MOVLHPS XMM1, XMM2 //XMM1 = input
  movq    XMM4, R8
  movq    XMM6, R9
  MOVLHPS XMM4, XMM6
  movdqa  XMM2, XMM1
  movdqa  XMM6, XMM4
  PSRLD   XMM1, 1
  pand    XMM2,XMM5
  pand    XMM1,XMM5
  PSRLD   XMM4, 1
  pand    XMM4,XMM5
  pand    XMM6,XMM5
  paddq   XMM2,XMM1
  paddq   XMM6,XMM4
  movdqa  XMM1,XMM2
  movdqa  XMM4,XMM6
  PSRLD   XMM2,2
  pand    XMM1,XMM3
  pand    XMM2,XMM3
  PSRLD   XMM4,2
  pand    XMM4,XMM3
  pand    XMM6,XMM3
  paddq   XMM1,XMM2
  paddq   XMM4,XMM6
  movdqa  XMM2,XMM1
  movdqa  XMM6,XMM4
  PSRLD   XMM2,4
  paddq   XMM1,XMM2
  pand    XMM1,XMM0
  PSRLD   XMM6,4
  paddq   XMM4,XMM6
  pand    XMM4,XMM0
  //XMM1, XMM4 = output
end;

freq = 2143105
x64 popcount took 8808 Ticks 4 ms
sse parallel popcount took 7208 Ticks 3 ms
Result: SSE is 18% Faster

Edit 3 - Parallelizing the x64 code

A simple interleaving of a parallel count in the x64 code speeds things up more than the SSE code: result:

freq = 2143105
x64 popcount took 9284 Ticks 4 ms
x64-per 2 popcount took 6850 Ticks 3 ms
sse parallel popcount took 7508 Ticks 3 ms
Result: SSE is 19% faster, x64-per 2 is 26% faster.

edited Feb 23 at 1:21

asked Feb 22 at 23:22

Johan

1765

You do realize that Intel added a dedicated popcnt instruction in Penryn (aka Core 2) circa 2007, right (i.e., what you're writing would probably only be useful on Pentium IV and the original Core processors)? – Jerry Coffin Feb 23 at 0:42

@JerryCoffin, popcnt would not make it faster. I'm running 16 popcounts in parallel. I'm only popcounting individual bytes, not the full thing. The popcount is for GoF BTW. – Johan Feb 23 at 0:55

Ah--fair enough (I guess I should have read more carefully before commenting). – Jerry Coffin Feb 23 at 2:12

add a comment |

Your Answer

Sign up or log in

Post as a guest

Name

Post as a guest

Name

discard

By posting your answer, you agree to the privacy policy and terms of service.

Browse other questions tagged performance bitwise assembly sse or ask your own question.

question feed

asked	7 months ago
viewed	50 times

current community

your communities

more stack exchange communities

Parallel popcount of packed bytes

Your Answer

Browse other questions tagged performance bitwise assembly sse or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Parallel popcount of packed bytes

Know someone who can answer? Share a link to this question via email, Google+, Twitter, or Facebook.

Your Answer

Sign up or log in

Post as a guest

Browse other questions tagged performance bitwise assembly sse or ask your own question.

Related

Hot Network Questions