Issue
I'm trying to run an AMM (approximate matrix multiplication) algorithm on Apple's M1. The algorithm is built entirely around speed and uses the x86 built-in functions listed below. Since running it in an x86 VM slows down several crucial steps of the algorithm, I was wondering whether there is another way to run it on ARM64.
I also could not find suitable documentation for the ARM64 built-in functions, which might help with mapping some of the x86-64 instructions.
Used built-in functions:
__builtin_ia32_vec_init_v2si
__builtin_ia32_vec_ext_v2si
__builtin_ia32_packsswb
__builtin_ia32_packssdw
__builtin_ia32_packuswb
__builtin_ia32_punpckhbw
__builtin_ia32_punpckhwd
__builtin_ia32_punpckhdq
__builtin_ia32_punpcklbw
__builtin_ia32_punpcklwd
__builtin_ia32_punpckldq
__builtin_ia32_paddb
__builtin_ia32_paddw
__builtin_ia32_paddd
Solution
Normally you'd use intrinsics instead of the raw GCC builtin functions, but see https://gcc.gnu.org/onlinedocs/gcc/ARM-C-Language-Extensions-_0028ACLE_0029.html. The __builtin_arm_... and __builtin_aarch64_... functions like __builtin_aarch64_saddl2v16qi don't seem to be documented in the GCC manual the way the x86 ones are, which is just another sign they're not intended for direct use.
See also https://developer.arm.com/documentation/102467/0100/Why-Neon-Intrinsics- re: intrinsics and #include <arm_neon.h>. GCC provides a version of that header, with the documented intrinsics API implemented using __builtin_aarch64_... GCC builtins.
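To make the mapping concrete, here is a minimal sketch of how two of the 64-bit MMX builtins from the question might be expressed with arm_neon.h intrinsics. The helper names add_bytes8 and interleave_low_bytes8 are hypothetical, and the scalar branch is only there so the file still builds on a non-ARM machine. vadd_s8 and vzip1_s8 are real ACLE intrinsics; vzip1_s8 is AArch64-only, hence the guard.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#define HAVE_NEON64 1
#else
#define HAVE_NEON64 0
#endif

/* Hypothetical helper: lane-wise wrapping add of eight bytes,
   roughly what __builtin_ia32_paddb does on a 64-bit MMX register. */
void add_bytes8(const int8_t a[8], const int8_t b[8], int8_t out[8])
{
#if HAVE_NEON64
    vst1_s8(out, vadd_s8(vld1_s8(a), vld1_s8(b)));
#else
    int8_t tmp[8];
    for (int i = 0; i < 8; i++) {
        /* Add as unsigned so the wraparound is well defined. */
        tmp[i] = (int8_t)(uint8_t)((uint8_t)a[i] + (uint8_t)b[i]);
    }
    memcpy(out, tmp, sizeof tmp);
#endif
}

/* Hypothetical helper: interleave the low four bytes of a and b
   (a0 b0 a1 b1 a2 b2 a3 b3), roughly __builtin_ia32_punpcklbw. */
void interleave_low_bytes8(const int8_t a[8], const int8_t b[8], int8_t out[8])
{
#if HAVE_NEON64
    vst1_s8(out, vzip1_s8(vld1_s8(a), vld1_s8(b)));
#else
    int8_t tmp[8];
    for (int i = 0; i < 4; i++) {
        tmp[2 * i]     = a[i];
        tmp[2 * i + 1] = b[i];
    }
    memcpy(out, tmp, sizeof tmp);
#endif
}
```

On AArch64 the high-half variants (punpckhbw and friends) would presumably map to vzip2_s8 / vzip2_s16 / vzip2_s32 in the same way, and paddw / paddd to vadd_s16 / vadd_s32.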
As for portability libraries: AFAIK there's nothing that maps the raw builtins, but SIMDe (https://github.com/simd-everywhere/simde) has portable implementations of immintrin.h Intel intrinsics like _mm_packs_epi16. Most code should be using that API instead of GNU C builtins, unless you're using GNU C native vectors (__attribute__((vector_size(16)))) for portable SIMD without any ISA-specific stuff. But that's not viable when you want to take advantage of special shuffles and the like.
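For the plain element-wise adds in the question (paddb, paddw, paddd), GNU C native vectors are probably enough on their own. The following is a sketch under the assumption that GCC or Clang is the compiler; the typedef and function names are made up for illustration. It uses 8-byte vectors to match the MMX register width, and unsigned lanes so that wraparound is well defined.

```c
#include <stdint.h>
#include <string.h>

/* 64-bit GNU C vectors, the same width as an MMX register.
   These typedef names are illustrative, not from any header. */
typedef uint8_t  u8x8  __attribute__((vector_size(8)));
typedef uint16_t u16x4 __attribute__((vector_size(8)));
typedef uint32_t u32x2 __attribute__((vector_size(8)));

/* Rough stand-in for __builtin_ia32_paddb: 8 wrapping byte adds. */
void paddb_portable(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    u8x8 va, vb, vr;
    memcpy(&va, a, sizeof va);   /* memcpy avoids alignment and aliasing issues */
    memcpy(&vb, b, sizeof vb);
    vr = va + vb;                /* the compiler picks the SIMD instruction */
    memcpy(out, &vr, sizeof vr);
}

/* Rough stand-in for __builtin_ia32_paddw: 4 wrapping 16-bit adds. */
void paddw_portable(const uint16_t a[4], const uint16_t b[4], uint16_t out[4])
{
    u16x4 va, vb, vr;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vr = va + vb;
    memcpy(out, &vr, sizeof vr);
}

/* Rough stand-in for __builtin_ia32_paddd: 2 wrapping 32-bit adds. */
void paddd_portable(const uint32_t a[2], const uint32_t b[2], uint32_t out[2])
{
    u32x2 va, vb, vr;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vr = va + vb;
    memcpy(out, &vr, sizeof vr);
}
```

The same source compiles for x86-64 and AArch64 without any ISA-specific header. The saturating packs and the interleaves have no equally simple operator form, which is where intrinsics or SIMDe come in.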
And yes, ARM does have narrowing with saturation, with instructions like vqmovn (https://developer.arm.com/documentation/dui0473/m/neon-instructions/vqmovn-and-vqmovun), so SIMDe can efficiently emulate pack instructions. That link documents AArch32, not AArch64, but AArch64 has equivalent instructions (SQXTN, UQXTN and SQXTUN).
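As an illustration of the saturating narrow, here is a sketch of how the 64-bit packsswb and packuswb builtins from the question could be written with the vqmovn_s16 and vqmovun_s16 intrinsics. The function names are hypothetical, and the scalar branch is only a fallback so the code also builds where NEON is unavailable.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#else
/* Scalar fallbacks, used only when NEON is not available. */
static int8_t sat_s16_to_s8(int16_t x)
{
    return (int8_t)(x > 127 ? 127 : (x < -128 ? -128 : x));
}
static uint8_t sat_s16_to_u8(int16_t x)
{
    return (uint8_t)(x > 255 ? 255 : (x < 0 ? 0 : x));
}
#endif

/* Hypothetical equivalent of __builtin_ia32_packsswb: four int16 from a
   go to the low half, four from b to the high half, each saturated to int8. */
void pack_s16_to_s8_sat(const int16_t a[4], const int16_t b[4], int8_t out[8])
{
#if defined(__ARM_NEON)
    int16x8_t wide = vcombine_s16(vld1_s16(a), vld1_s16(b));
    vst1_s8(out, vqmovn_s16(wide));      /* signed saturating narrow */
#else
    int8_t lo[4], hi[4];
    for (int i = 0; i < 4; i++) {
        lo[i] = sat_s16_to_s8(a[i]);
        hi[i] = sat_s16_to_s8(b[i]);
    }
    for (int i = 0; i < 4; i++) {
        out[i]     = lo[i];
        out[i + 4] = hi[i];
    }
#endif
}

/* Hypothetical equivalent of __builtin_ia32_packuswb: signed int16 inputs
   saturated to the unsigned range 0..255. */
void pack_s16_to_u8_sat(const int16_t a[4], const int16_t b[4], uint8_t out[8])
{
#if defined(__ARM_NEON)
    int16x8_t wide = vcombine_s16(vld1_s16(a), vld1_s16(b));
    vst1_u8(out, vqmovun_s16(wide));     /* signed-to-unsigned saturating narrow */
#else
    uint8_t lo[4], hi[4];
    for (int i = 0; i < 4; i++) {
        lo[i] = sat_s16_to_u8(a[i]);
        hi[i] = sat_s16_to_u8(b[i]);
    }
    for (int i = 0; i < 4; i++) {
        out[i]     = lo[i];
        out[i + 4] = hi[i];
    }
#endif
}
```

packssdw should follow the same pattern with vcombine_s32 and vqmovn_s32, narrowing four int32 values to int16.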
Answered By - Peter Cordes Answer Checked By - Mildred Charles (WPSolving Admin)