Issue
I am working on a SIMD wrapper for C++. The base type is the following union:
union u{
__m128d sse;
double c[2];
};
I am looking at the Linux ABI here. For example,
__m128d f(__m128d a, __m128d b){
return b;
}
compiles to
f(double __vector(2), double __vector(2)):
vmovaps xmm0, xmm1
ret
This passes each argument in a single packed XMM register (the __m128d ABI). If I use the union instead, the arguments are passed using the default floating-point ABI, with each double in its own register:
f(u, u):
vmovaps xmm1, xmm3
vmovaps xmm0, xmm2
ret
Here only one extra instruction is generated, but it can get worse: in some cases values go through the stack where registers alone should suffice.
Is there a way to select the __m128d ABI explicitly?
Solution
Take a step back for a moment, and compare this:
union u{
__m128d sse;
double c[2];
};
double getx(u a){
return a.c[0];
}
u add(u a, u b) {
return { _mm_add_pd(a.sse, b.sse) };
}
with this:
double getx(__m128d a){
return a[0];
}
__m128d add(__m128d a, __m128d b) {
return _mm_add_pd(a, b);
}
Which do you prefer?
If this is a Linux-based ABI and you're using clang or gcc, the latter will work just fine, so I'm not entirely sure what problem your union type is meant to solve.
As an aside, it's generally a good idea to encourage users of your SIMD types to avoid accessing individual elements of the vector. With the exception of element 0, it's always going to incur a runtime cost, so avoid it if possible.
The spanner in the works for the above is that Visual C++ doesn't define these operators :( In that particular case, I'd only bother with a wrapper for Visual C++, and leave Linux/Mac to use the native types, e.g.
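To make the element-access cost concrete, here is a minimal sketch using plain SSE2 intrinsics. The helper names lane0 and lane1 are made up for illustration and are not part of any library. Element 0 already sits in the low half of the register, whereas element 1 has to be shuffled down first:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: element 0 is already in the low half of the
// register, so extracting it is typically free or a single move.
inline double lane0(__m128d v) {
    return _mm_cvtsd_f64(v);
}

// Hypothetical helper: element 1 must first be moved into the low half
// (an extra shuffle/unpack) before it can be read as a scalar.
inline double lane1(__m128d v) {
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
}
```

Both helpers compile with gcc, clang and Visual C++ since they only use intrinsics. What the compiler actually emits will depend on the surrounding code, so treat the "extra shuffle" as the typical case rather than a guarantee.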
#ifdef _WIN32
// If you want decent performance for Windows :(
#define VCALL __vectorcall
struct d128 {
inline d128() = default;
inline d128(const d128&) = default;
inline d128(const __m128d v) { x = v; }
__m128d x;
inline VCALL operator __m128d() const { return x; }
inline double VCALL operator [](int i) const { return ((const double*)(this))[i]; }
inline double& VCALL operator [](int i) { return ((double*)(this))[i]; }
};
#else
#define VCALL
typedef __m128d d128;
#endif
And now this will work nicely on all platforms:
d128 VCALL add(d128 a, d128 b){
return _mm_add_pd(a, b);
}
As will this:
double VCALL getx(d128 a) {
return a[0];
}
(Well, under VC++ accessing individual elements is a little unpleasant, no matter which way you do it!)
If you still insist on having a specific type (because you want to overload the +, -, /, * operators), just be aware that gcc & clang have already overloaded all the common operators, so for gcc/clang I could have written:
d128 VCALL add(d128 a, d128 b){
return a + b;
}
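As a rough sketch of how far the built-in operators go on gcc/clang (this will not compile with Visual C++), here is a multiply-add written directly on __m128d. The function name muladd is just an illustrative choice:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// gcc/clang only: __m128d is a vector-extension type there, so the
// usual arithmetic operators apply element-wise without any wrapper.
inline __m128d muladd(__m128d a, __m128d b, __m128d c) {
    return a * b + c;  // same result as _mm_add_pd(_mm_mul_pd(a, b), c)
}
```

Because the arguments stay as native __m128d, they keep the __m128d calling convention discussed in the question, so no wrapper type is needed on those compilers just to get operators.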
Answered By - robthebloke