Single Instruction, Multiple Data (SIMD) describes CPU instructions that apply one operation to multiple data elements in parallel.
4 votes, 0 answers, 37 views
Cache conscious SIMD matrix multiply of unsigned integers
The goals of the code review, in order of importance (i.e., what I hope to hear from you):
I've verified correctness using a straightforward matrix multiply function though I am open to those who want ...
0 votes, 0 answers, 35 views
Multiplication of n-dimensional arrays with broadcasting
For an explanation of multiplication with broadcasting, see here.
Problem: The nested loop of the simplified code is not vectorized.
How to fix the simplified code so that its nested loop would ...
1 vote, 2 answers, 130 views
Hash calculation for array of long values in C#
Can the following function be improved in terms of performance?
I am calculating millions of such hashes.
The long array represents a record from data table where all values are encoded as long ...
1 vote, 1 answer, 146 views
HPC kernel for DGEMM: compiler vs. assembly
This is a correct version, for computing a small matrix multiplication: C += A * B, where C is ...
2 votes, 0 answers, 155 views
FUTABA SBUS serial communication in C++
I would like to reimplement the current Futaba SBUS protocol in ArduPilot for Navio+. It seems to be a relatively expensive protocol, so I changed the code from an existing git project and to make it ...
7 votes, 2 answers, 189 views
SSE instruction to check if byte array is zeroes C#
My fundamental problem is how to check whether a byte[] is full of zeroes. I posted a range of implementations (with timings) and one clearly beats the others. In fact, ...
22 votes, 1 answer, 962 views
SIMD matrix multiplication
I recently started toying with SIMD and came up with the following code for matrix multiplication.
First I attempted to implement it using SIMD the same way I did in SISD, just using SIMD for things ...
6 votes, 0 answers, 272 views
Bilinear interpolation using Neon intrinsics
I'm trying to do bilinear interpolation on ARM Neon. However, I find that my vectorized code is slower than the regular one on a BeagleBone Black. Any idea why this could happen?
I'm using ...
4 votes, 0 answers, 142 views
SSE optimisation for audio resampling
I'm learning SSE for the first time and trying to optimise some code. Using oprofile shows that the CPU usage in this function went down from 2.5% to 0.9% using the ...
0 votes, 1 answer, 321 views
Computing tangent space basis vectors for an arbitrary mesh
This is more like a share and a request than a question. I converted Eric Lengyel's code, which calculates tangents of a mesh for the purpose of texturing and normal mapping, to support SIMD. For this ...
7 votes, 1 answer, 385 views
Writing SIMD libraries for C++ on FASM in x86-64 Linux
I have recently started a project to develop SIMD libraries for C++ in FASM for x86-64 Linux.
I would be glad to hear any opinion or feedback about the project, cleanness of the code and ...