Merge implementations from "missing SSE implementations" to NEON #855
These are missed optimizations in gcc, but clang has them.
Here is another one: (edit) more:
Nice, thanks. Those would be great for There are tons of these floating around the internet, and I'd like to try to get as many as possible merged into SIMDe. Sometimes they are for missing functions, sometimes for emulating a newer instruction using an older extension (like the min/max functions you mentioned). Both are very useful to us.
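To illustrate the "emulating a newer instruction using an older extension" case, here is a minimal sketch of an unsigned 16-bit minimum built from SSE2 only (the native `_mm_min_epu16` needs SSE4.1). The helper names `min_epu16_sse2` and `count_min_epu16_mismatches` are hypothetical and are not SIMDe's API, and an x86 target with SSE2 is assumed. The trick relies on saturating subtraction: `a - sat(a - b)` gives `b` when `a > b` and `a` otherwise.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics; assumes an x86 target */
#include <stdint.h>

/* Hypothetical helper: unsigned 16-bit minimum using only SSE2.
 * _mm_subs_epu16(a, b) computes max(a - b, 0) per lane, so subtracting
 * that from a yields min(a, b). */
static __m128i min_epu16_sse2(__m128i a, __m128i b) {
  return _mm_sub_epi16(a, _mm_subs_epu16(a, b));
}

/* Compares the vector result with a scalar reference on a few edge cases
 * and returns the number of lanes that disagree. */
static int count_min_epu16_mismatches(void) {
  const uint16_t xs[8] = {0, 1, 0x7FFF, 0x8000, 0xFFFF, 1234, 40000, 65535};
  const uint16_t ys[8] = {0, 2, 0x8000, 0x7FFF, 0, 1234, 39999, 65534};
  uint16_t out[8];
  int bad = 0;

  assert(sizeof(out) == sizeof(__m128i)); /* one full 128-bit vector */

  __m128i a = _mm_loadu_si128((const __m128i *)xs);
  __m128i b = _mm_loadu_si128((const __m128i *)ys);
  _mm_storeu_si128((__m128i *)out, min_epu16_sse2(a, b));

  for (int i = 0; i < 8; i++) {
    uint16_t expected = xs[i] < ys[i] ? xs[i] : ys[i];
    if (out[i] != expected) bad++;
  }
  return bad;
}
```

Calling `count_min_epu16_mismatches()` from a test harness is one way to check the emulation against the scalar definition before wiring a fallback like this into a real code path.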
Better lowering:
Might help with
collection so far: Avoids generating/loading constants which may not be desirable. cc @aklomp
These are based on implementations suggested by @aqrit at #855 (comment) I've just extended them to other types and added some similar implementations for POWER and WASM SIMD128.
https://godbolt.org/z/T73MbPEnh I agree, the throughput isn't quite as good, but the latency on that mov is painful, plus the memory to store the data… I'll go through your last comment soon, but I think I've got most of them in place (though not merged yet). Thanks for putting them together.
These are based on @aqrit's suggestions at #855 (comment) and #855 (comment)
FWIW, my "missing SSE intrinsics" project is now canonically hosted at https://github.com/aklomp/missing-sse-intrinsics.
nemequ commented Jul 10, 2021
http://www.alfredklomp.com/programming/sse-intrinsics/ has a great list of implementations of "missing" SSE instructions.
Unlike SSE, NEON isn't missing a lot of this functionality, so we should steal that code and use it to implement parts of the NEON API. For example:
- `_mm_cmple_epu8` → `vcleq_u8` (see 5906cc9)
- `_mm_cmpge_epu8` → `vcgeq_u8`
- `_mm_cmpgt_epu8` → `vcgtq_u8`
- `_mm_min_epu16` → `vminq_u16`
- `_mm_absdiff_epu8` → `vabdq_u8`
- `_mm_bswap_epi16` → `vrev16q_u16` / `vrev16q_s16`

We can also use the same techniques for a bunch of other functions which that page doesn't explicitly include (e.g., `vcleq_u16` / `vcleq_u32` / `vcleq_u64` can all use the same technique as `_mm_cmple_epu8`, though 16/32-bit versions require SSE4.1 and 64-bit requires AVX-512VL).

Many of the same implementations could also be used in WASM (`wasm_u8x16_le`, `wasm_u8x16_ge`, `wasm_u8x16_gt`, `wasm_u16x8_min`, etc.).

There are also a few functions which are present in later versions of SSE, but can be emulated with earlier versions. We should make sure our implementations of SSE also have these versions, too.

As an example, 5906cc9 implements `vcleq_u*` using the code from `_mm_cmple_epu8`.
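For readers unfamiliar with the `_mm_cmple_epu8` technique referenced above, here is a minimal sketch of the idea, assuming an x86 target with SSE2. The names `cmple_epu8_sse2` and `count_cmple_epu8_mismatches` are hypothetical, and this is not the code from 5906cc9. It only shows the general trick: SSE2 has no unsigned byte comparison, but `min(a, b) == a` holds exactly when `a <= b`.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics; assumes an x86 target */
#include <stdint.h>

/* Hypothetical helper: per-lane unsigned 8-bit a <= b using only SSE2.
 * Each result lane is 0xFF when the comparison holds and 0x00 otherwise,
 * the same all-ones/all-zeros mask convention NEON's vcleq_u8 uses. */
static __m128i cmple_epu8_sse2(__m128i a, __m128i b) {
  return _mm_cmpeq_epi8(_mm_min_epu8(a, b), a);
}

/* Compares the vector result with a scalar reference and returns the
 * number of lanes that disagree. */
static int count_cmple_epu8_mismatches(void) {
  uint8_t xs[16], ys[16], out[16];
  int bad = 0;

  assert(sizeof(out) == sizeof(__m128i)); /* one full 128-bit vector */

  /* Inputs span the whole unsigned range, including values >= 0x80 that
   * a signed comparison would get wrong. */
  for (int i = 0; i < 16; i++) {
    xs[i] = (uint8_t)(i * 17);
    ys[i] = (uint8_t)(255 - i * 17);
  }
  ys[15] = 255; /* force one equal pair: xs[15] is also 255 */

  __m128i a = _mm_loadu_si128((const __m128i *)xs);
  __m128i b = _mm_loadu_si128((const __m128i *)ys);
  _mm_storeu_si128((__m128i *)out, cmple_epu8_sse2(a, b));

  for (int i = 0; i < 16; i++) {
    uint8_t expected = xs[i] <= ys[i] ? 0xFF : 0x00;
    if (out[i] != expected) bad++;
  }
  return bad;
}
```

The 16-bit analogue would swap in `_mm_min_epu16` and `_mm_cmpeq_epi16`, which is why that variant needs SSE4.1, as noted above.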