Alternative XMLoadFloat3A implementation #231

Open
@MikeMarcin

For SSE targets, XMLoadFloat3A does the following:

__m128 V = _mm_load_ps(&pSource->x);
return _mm_and_ps(V, g_XMMask3);

When compiling for AVX, this generates code that looks like one of the following:

vmovups xmm2,xmmword ptr [DirectX::g_XMMask3 (07FF78A593DD0h)]  
vandps  xmm3,xmm2,xmmword ptr [rcx]  
; or 
vmovups xmm0, XMMWORD PTR [rcx]
vandps  xmm0, xmm0, XMMWORD PTR XMVECTORU32 const g_XMMask3

Consider doing this instead (note that _mm_blend_ps requires SSE4.1):

__m128 V = _mm_load_ps(&pSource->x);
return _mm_blend_ps(_mm_setzero_ps(), V, 0b0111);

This avoids the memory load of g_XMMask3 and generates the slightly more efficient:

vxorps  xmm0, xmm0, xmm0
vblendps xmm0, xmm0, XMMWORD PTR [rcx], 7
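
To make the comparison easy to try outside the library, here is a minimal standalone sketch of both variants. It is not DirectXMath code: the function names `LoadFloat3A_And` and `LoadFloat3A_Blend`, the `SKETCH_SSE41` macro, and the raw `const float*` parameter (standing in for `&pSource->x`) are hypothetical, and the mask is built inline rather than read from `g_XMMask3`. It assumes an x86/x64 target and a pointer that is 16-byte aligned with 16 readable bytes.

```cpp
#include <immintrin.h>

// Hypothetical helper macro: lets GCC/Clang compile the SSE4.1 intrinsic
// without -msse4.1 on the command line. MSVC needs no equivalent.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_SSE41
#endif

// Current approach: aligned load, then AND with a mask that clears lane 3.
// The mask here is a local stand-in for g_XMMask3.
inline __m128 LoadFloat3A_And(const float* p)
{
    const __m128 mask3 = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    __m128 v = _mm_load_ps(p);
    return _mm_and_ps(v, mask3);
}

// Proposed approach: blend the loaded value into a zero vector.
// Immediate 0x7 (0b0111) takes lanes 0..2 from v and lane 3 from zero.
SKETCH_SSE41 inline __m128 LoadFloat3A_Blend(const float* p)
{
    __m128 v = _mm_load_ps(p);
    return _mm_blend_ps(_mm_setzero_ps(), v, 0x7);
}
```

Both functions should return the same vector for any input, with lane 3 forced to zero. Whether the blend version is actually faster depends on the target microarchitecture and on what the compiler does with the mask load, so it is worth measuring rather than assuming.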

I would like to suggest the same for XMLoadFloat3, but there are edge cases where reading that extra float could cause an access violation (even though it is masked out by the blend), for example when the three floats sit at the very end of a mapped page. I would accept that tradeoff to replace VMOVSD -> VINSERTPS with VXORPS -> VBLENDPS, but I can understand a general-purpose library erring on the side of caution.
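
The tradeoff for the unaligned 3-float case can be sketched the same way. This is an illustration under assumptions, not the library's source: `LoadFloat3_Safe`, `LoadFloat3_Overread`, and `SKETCH_SSE41` are hypothetical names, the safe variant is written to produce a load-plus-insert sequence like the one described above, and it assumes an x86/x64 target. The safe version touches only 12 bytes; the blend version reads 16 bytes and so is only valid when the float after `z` is readable.

```cpp
#include <immintrin.h>
#include <cstring>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_SSE41
#endif

// Safe variant: reads exactly 12 bytes (8 for x,y and 4 for z).
SKETCH_SSE41 inline __m128 LoadFloat3_Safe(const float* p)
{
    double xyBits;
    std::memcpy(&xyBits, p, sizeof(xyBits));       // copy x,y without aliasing issues
    __m128 xy = _mm_castpd_ps(_mm_set_sd(xyBits)); // lanes: x, y, 0, 0
    __m128 z = _mm_load_ss(p + 2);                 // lanes: z, 0, 0, 0
    return _mm_insert_ps(xy, z, 0x20);             // put z[0] into lane 2
}

// Over-reading variant: reads 16 bytes, then zeroes lane 3 with a blend.
// Only valid if p[3] is readable memory; otherwise it can fault.
SKETCH_SSE41 inline __m128 LoadFloat3_Overread(const float* p)
{
    __m128 v = _mm_loadu_ps(p);
    return _mm_blend_ps(_mm_setzero_ps(), v, 0x7);
}
```

When the source is known to live inside a larger buffer (for example, an interior element of a vertex array), the over-reading variant produces the same result as the safe one. The risk is confined to the last element of an allocation that ends exactly on a page boundary.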
