Description
For SSE targets, XMLoadFloat3A does the following:
__m128 V = _mm_load_ps(&pSource->x);
return _mm_and_ps(V, g_XMMask3);
When compiling for AVX, this generates code that looks like either of the following:
vmovups xmm2,xmmword ptr [DirectX::g_XMMask3 (07FF78A593DD0h)]
vandps xmm3,xmm2,xmmword ptr [rcx]
; or
vmovups xmm0, XMMWORD PTR [rcx]
vandps xmm0, xmm0, XMMWORD PTR XMVECTORU32 const g_XMMask3
Consider doing this instead (note that _mm_blend_ps requires SSE4.1, which every AVX target has):
__m128 V = _mm_load_ps(&pSource->x);
return _mm_blend_ps(_mm_setzero_ps(), V, 0b0111);
This avoids the memory load of g_XMMask3 and generates the slightly more efficient:
vxorps xmm0, xmm0, xmm0
vblendps xmm0, xmm0, XMMWORD PTR [rcx], 7
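As a sanity check on the suggestion above, here is a minimal sketch comparing the two approaches outside of DirectXMath. Float3A, LoadFloat3A_And, LoadFloat3A_Blend, LoadsAgree and the SKETCH_SSE41 macro are hypothetical names made up for this illustration (Float3A stands in for XMFLOAT3A, and the mask is built inline rather than taken from g_XMMask3); this is not the library's code.

```cpp
#include <immintrin.h>
#include <cstring>

// Assumption: on GCC/Clang, enable SSE4.1 per function so the sketch builds
// without -msse4.1; other compilers (e.g. MSVC) need no attribute.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_SSE41
#endif

// Hypothetical stand-in for XMFLOAT3A: three floats, 16-byte aligned,
// so a full 128-bit load stays inside the object.
struct alignas(16) Float3A { float x, y, z; };
static_assert(sizeof(Float3A) == 16, "padding must make the object 16 bytes");

// Current approach: aligned load, then AND with a mask that clears lane 3.
inline __m128 LoadFloat3A_And(const Float3A* p) {
    const __m128 mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    __m128 v = _mm_load_ps(&p->x);
    return _mm_and_ps(v, mask);
}

// Suggested approach: aligned load, then blend lanes 0..2 over a zero vector.
SKETCH_SSE41 inline __m128 LoadFloat3A_Blend(const Float3A* p) {
    __m128 v = _mm_load_ps(&p->x);
    return _mm_blend_ps(_mm_setzero_ps(), v, 0x7);
}

// Fills the padding lane with a non-zero value and checks that both loads
// zero it and otherwise produce bit-identical results.
inline bool LoadsAgree() {
    const float raw[4] = {1.0f, 2.0f, 3.0f, 99.0f};
    Float3A f;
    std::memcpy(&f, raw, sizeof raw);
    alignas(16) float a[4], b[4];
    _mm_store_ps(a, LoadFloat3A_And(&f));
    _mm_store_ps(b, LoadFloat3A_Blend(&f));
    return a[3] == 0.0f && std::memcmp(a, b, sizeof a) == 0;
}
```

Both functions should return the same bits for any input, since the blend takes lane 3 from the zero vector just as the AND clears it; only the instruction sequence differs.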
I would like to suggest the same for XMLoadFloat3, but there are edge cases where reading that extra float could cause an access violation (even though it is masked out in the blend), for example when the XMFLOAT3 sits at the very end of a mapped page. I would be fine with that tradeoff to replace the VMOVSD -> VINSERTPS sequence with VXORPS -> VBLENDPS, but I can imagine a general-purpose library erring on the side of caution.
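To make the tradeoff concrete, here is a rough sketch of a load that touches only the 12 bytes of the source, which is the property the cautious path preserves. Float3 and LoadFloat3_Safe are hypothetical names for this illustration, and the sketch uses _mm_loadl_pi, _mm_load_ss and _mm_movelh_ps (plain SSE) rather than the exact VMOVSD -> VINSERTPS sequence mentioned above, so treat it as an approximation of the idea and not as the library's implementation.

```cpp
#include <xmmintrin.h>

// Hypothetical stand-in for XMFLOAT3: three floats, no padding, 4-byte aligned.
struct Float3 { float x, y, z; };
static_assert(sizeof(Float3) == 12, "no trailing padding expected");

// Reads exactly 12 bytes: an 8-byte load for x,y and a 4-byte load for z.
// Nothing past the end of the object is touched, so this stays safe even
// when the object is the last thing on a mapped page.
inline __m128 LoadFloat3_Safe(const Float3* p) {
    // Lanes: [x, y, 0, 0]
    __m128 xy = _mm_loadl_pi(_mm_setzero_ps(),
                             reinterpret_cast<const __m64*>(&p->x));
    // Lanes: [z, 0, 0, 0]
    __m128 z = _mm_load_ss(&p->z);
    // Lanes: [x, y, z, 0]
    return _mm_movelh_ps(xy, z);
}
```

A blend-based XMLoadFloat3 would instead issue one 16-byte load starting at &p->x, which reads 4 bytes beyond the object; that extra read is the source of the possible access violation.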