Single Instruction, Multiple Data (SIMD) describes CPU instructions that apply one operation to several data elements in parallel.
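As a rough illustration of the idea, the sketch below adds two float arrays. It assumes an x86 compiler that defines `__SSE2__` (GCC and Clang do on x86-64) and falls back to a plain loop elsewhere. The function name `add_f32` is hypothetical and chosen only for this example.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE/SSE2 intrinsics */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
void add_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* A single _mm_add_ps performs four float additions at once. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when SSE2 is unavailable. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Calling `add_f32(a, b, out, 6)` on six-element arrays would handle the first four elements with one vector instruction and the remaining two in the scalar tail.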


4 votes, 0 answers, 37 views

Cache conscious SIMD matrix multiply of unsigned integers

The goals of the code review, in order of importance (i.e. what I hope to hear from you): I've verified correctness using a straightforward matrix multiply function though I am open to those who want ...
0 votes, 0 answers, 35 views

Multiplication of n-dimensional arrays with broadcasting

For an explanation of multiplication with broadcasting, see here. Problem: The nested loop of the simplified code is not vectorized. How to fix the simplified code so that its nested loop would ...
1 vote, 2 answers, 130 views

Hash calculation for array of long values in C#

Can the following function be improved in terms of performance? I am calculating millions of such hashes. The long array represents a record from a data table where all values are encoded as long ...
1 vote, 1 answer, 146 views

HPC kernel for DGEMM: compiler vs. assembly

This is a correct version, for computing a small matrix multiplication: C += A * B, where C is ...
2 votes, 0 answers, 155 views

FUTABA SBUS serial communication in C++

I would like to reimplement the current Futaba SBUS protocol in ArduPilot for Navio+. It seems to be a relatively expensive protocol, so I changed the code from an existing git project and to make it ...
7 votes, 2 answers, 189 views

SSE instruction to check if byte array is zeroes in C#

My fundamental problem is how to check whether a byte[] is full of zeroes. I posted a range of implementations (with timings) and one clearly beats the others. In fact, ...
22 votes, 1 answer, 962 views

SIMD matrix multiplication

I recently started toying with SIMD and came up with the following code for matrix multiplication. First I attempted to implement it using SIMD the same way I did in SISD, just using SIMD for things ...
6 votes, 0 answers, 272 views

Bilinear interpolation using Neon intrinsics

I'm trying to do bilinear interpolation on ARM Neon. However, I find that my vectorized code is slower than the regular one on a BeagleBone Black. Any idea why this could happen? I'm using ...
4 votes, 0 answers, 142 views

SSE optimisation for audio resampling

I'm learning SSE for the first time and trying to optimise some code. Using oprofile shows that the CPU usage in this function went down from 2.5% to 0.9% using the ...
0 votes, 1 answer, 321 views

Computing tangent space basis vectors for an arbitrary mesh

This is more of a share and a request than a question. I converted Eric Lengyel's code, which calculates tangents of a mesh for the purpose of texturing and normal mapping, to support SIMD. For this ...
7 votes, 1 answer, 385 views

Writing SIMD libraries for C++ on FASM in x86-64 Linux

I have recently started a project to develop SIMD libraries for C++ in FASM for x86-64 Linux. I would be glad to hear any opinion or feedback about the project, the cleanliness of the code and ...