Alternative XMLoadFloat3A implementation

For SSE targets, XMLoadFloat3A currently does:
```
__m128 V = _mm_load_ps(&pSource->x);
return _mm_and_ps(V, g_XMMask3);
```

When compiling for AVX, this generates code that looks like one of the following:
```
vmovups xmm2,xmmword ptr [DirectX::g_XMMask3 (07FF78A593DD0h)]  
vandps  xmm3,xmm2,xmmword ptr [rcx]  
; or 
vmovups xmm0, XMMWORD PTR [rcx]
vandps  xmm0, xmm0, XMMWORD PTR XMVECTORU32 const g_XMMask3
```

Consider doing this instead (note that `_mm_blend_ps` requires SSE4.1):
```
__m128 V = _mm_load_ps(&pSource->x);
return _mm_blend_ps(_mm_setzero_ps(), V, 0b0111);
```

This avoids the memory load of g_XMMask3 and generates the slightly more efficient
```
vxorps  xmm0, xmm0, xmm0
vblendps xmm0, xmm0, XMMWORD PTR [rcx], 7
```
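To check that the two variants agree, here is a minimal sketch that compares them bit for bit. `Float3A`, `LoadFloat3A_And`, `LoadFloat3A_Blend` and `SameBits` are hypothetical stand-ins written for this issue, not DirectXMath names, and the mask is rebuilt locally instead of using `g_XMMask3`. The `target` attribute is only there so GCC/Clang accept the SSE4.1 intrinsic without extra flags; it assumes the CPU running the code supports SSE4.1.

```cpp
#include <cstdint>
#include <cstring>
#include <smmintrin.h> // SSE4.1, provides _mm_blend_ps

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_SSE41
#endif

// Hypothetical stand-in for XMFLOAT3A: three floats, 16-byte aligned,
// so sizeof(Float3A) is 16 and a full aligned load stays inside the object.
struct alignas(16) Float3A { float x, y, z; };

// Current approach: aligned load, then AND with a mask that clears lane 3.
inline __m128 LoadFloat3A_And(const Float3A* p) {
    const __m128 mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    return _mm_and_ps(_mm_load_ps(&p->x), mask);
}

// Proposed approach: take lanes 0..2 from the load and lane 3 from zero.
SKETCH_SSE41 inline __m128 LoadFloat3A_Blend(const Float3A* p) {
    return _mm_blend_ps(_mm_setzero_ps(), _mm_load_ps(&p->x), 0x7);
}

// True when both variants produce identical bits and lane 3 is zero.
inline bool SameBits(const Float3A* p) {
    alignas(16) std::uint32_t a[4];
    alignas(16) std::uint32_t b[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(a), _mm_castps_si128(LoadFloat3A_And(p)));
    _mm_store_si128(reinterpret_cast<__m128i*>(b), _mm_castps_si128(LoadFloat3A_Blend(p)));
    return std::memcmp(a, b, sizeof(a)) == 0 && b[3] == 0;
}
```

Calling `SameBits` on a few `Float3A` values, including negative zero, should return true in every case, since both variants only differ in how lane 3 is cleared.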

I would like to suggest the same for XMLoadFloat3, but there are edge cases where reading that extra float could cause an access violation (even though it is masked out in the blend). I would be fine with that tradeoff to replace the VMOVSD -> VINSERTPS sequence with XORPS -> VBLENDPS, but I can imagine a general-purpose library erring on the side of caution.
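To make the edge case concrete, here is a rough sketch of when the extra 4-byte read would land on a page the float3 itself does not occupy. It assumes 4 KiB pages and a 4-byte-aligned, 12-byte float3; `kAssumedPageSize` and `OverreadTouchesNextPage` are hypothetical names for illustration only.

```cpp
#include <cstdint>

// Assumption: 4 KiB pages. Real page sizes vary by platform and configuration.
constexpr std::uintptr_t kAssumedPageSize = 4096;

// For a 12-byte, 4-byte-aligned float3 starting at `addr`, a 16-byte load also
// reads bytes addr+12 .. addr+15. Those bytes fall on a page the float3 does
// not already touch only when the float3 ends exactly on a page boundary.
inline bool OverreadTouchesNextPage(std::uintptr_t addr) {
    return ((addr + 12) % kAssumedPageSize) == 0;
}
```

If that next page is unmapped, the wide load faults even though the blend discards the fourth lane, which is why the 16-byte load is only unconditionally safe for the aligned XMFLOAT3A case.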
