
Flaky CSE around pdep mask #477

Open
@damageboy

Repro repo:

https://github.com/damageboy/coreclr-pdep-mask-flaky-cse

Relevant piece of code:

https://github.com/damageboy/coreclr-pdep-mask-flaky-cse/blob/d6bc610c1dd5416f717211676f2fb0b0ce42e3a2/Program.cs#L28-L41

```csharp
ulong t64;
t64 = P.AsUInt64().GetElement(0);
var p0 = ParallelBitDeposit(t64, 0x0707070707070707);
var p1 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
t64 = P.AsUInt64().GetElement(1);
var p2 = ParallelBitDeposit(t64, 0x0707070707070707);
var p3 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
var tmp128 = ExtractVector128(P, 1);
t64 = tmp128.AsUInt64().GetElement(0);
var p4 = ParallelBitDeposit(t64, 0x0707070707070707);
var p5 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
t64 = tmp128.AsUInt64().GetElement(1);
var p6 = ParallelBitDeposit(t64, 0x0707070707070707);
var p7 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
```
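For context on what each call in the snippet computes, a portable model of PDEP may help. The sketch below is a hypothetical helper (`PdepReference.Pdep` is not part of the repro or of .NET) intended to follow the documented PDEP semantics: bit k of the value is written to the position of the k-th lowest set bit of the mask, and value bits beyond popcount(mask) are ignored. With the mask `0x0707070707070707` (three set bits per byte, 24 in total), each call therefore spreads the low 24 bits of its argument into eight 3-bit fields, one per byte.

```csharp
// Hypothetical helper for illustration only; not part of the repro or of .NET.
public static class PdepReference
{
    // Software model of PDEP: bit k of `value` goes to the position of the
    // k-th lowest set bit of `mask`; bits of `value` above popcount(mask) are ignored.
    public static ulong Pdep(ulong value, ulong mask)
    {
        ulong result = 0;
        // `bit` walks the source bits of `value` from least significant upwards,
        // advancing once per set bit consumed from `mask`.
        for (ulong bit = 1; mask != 0; bit <<= 1)
        {
            ulong lowest = mask & (~mask + 1); // isolate lowest remaining set bit of mask
            if ((value & bit) != 0)
            {
                result |= lowest;
            }
            mask &= mask - 1; // clear that mask bit and move on to the next one
        }
        return result;
    }
}
```

For example, `PdepReference.Pdep(0b001_010, 0x0707070707070707)` yields `0x0102`: the lowest 3-bit field (`010`) lands in byte 0 and the next field (`001`) in byte 1. On hardware where `Bmi2.X64.IsSupported` is true, the model can be checked against `Bmi2.X64.ParallelBitDeposit`.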

Generated asm:

https://github.com/damageboy/coreclr-pdep-mask-flaky-cse/blob/d6bc610c1dd5416f717211676f2fb0b0ce42e3a2/listing.asm#L15-L66

```asm
;             t64 = P.AsUInt64().GetElement(0);
00007FC3A6AB07F5 C5FC28C8             vmovaps ymm1,ymm0
00007FC3A6AB07F9 C4E1F97EC8           vmovq   rax,xmm1

;             var p0 = ParallelBitDeposit(t64, 0x0707070707070707);
00007FC3A6AB07FE 48BF0707070707070707 mov     rdi,707070707070707h
00007FC3A6AB0808 C4E2FBF5FF           pdep    rdi,rax,rdi

;             var p1 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
00007FC3A6AB080D 48C1E820             shr     rax,20h
00007FC3A6AB0811 48BE0707070707070707 mov     rsi,707070707070707h
00007FC3A6AB081B C4E2FBF5F6           pdep    rsi,rax,rsi

;             t64 = P.AsUInt64().GetElement(1);
00007FC3A6AB0820 C5FC28C8             vmovaps ymm1,ymm0
00007FC3A6AB0824 C4E3F916C801         vpextrq rax,xmm1,1

;             var p2 = ParallelBitDeposit(t64, 0x0707070707070707);
00007FC3A6AB082A 48BA0707070707070707 mov     rdx,707070707070707h
00007FC3A6AB0834 C4E2FBF5D2           pdep    rdx,rax,rdx

;             var p3 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
00007FC3A6AB0839 48C1E820             shr     rax,20h
00007FC3A6AB083D 48B90707070707070707 mov     rcx,707070707070707h
00007FC3A6AB0847 C4E2FBF5C9           pdep    rcx,rax,rcx

;             var tmp128 = ExtractVector128(P, 1);
00007FC3A6AB084C C4E37D39C001         vextracti128 xmm0,ymm0,1

;             t64 = tmp128.AsUInt64().GetElement(0);
00007FC3A6AB0852 C4E1F97EC0           vmovq   rax,xmm0

;             var p4 = ParallelBitDeposit(t64, 0x0707070707070707);
00007FC3A6AB0857 49B80707070707070707 mov     r8,707070707070707h
00007FC3A6AB0861 C442FBF5C0           pdep    r8,rax,r8

;             var p5 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
00007FC3A6AB0866 48C1E820             shr     rax,20h
00007FC3A6AB086A 49B90707070707070707 mov     r9,707070707070707h
00007FC3A6AB0874 C442FBF5C9           pdep    r9,rax,r9

;             t64 = tmp128.AsUInt64().GetElement(1);
00007FC3A6AB0879 C4E3F916C001         vpextrq rax,xmm0,1

;             var p6 = ParallelBitDeposit(t64, 0x0707070707070707);
00007FC3A6AB087F 49BA0707070707070707 mov     r10,707070707070707h
00007FC3A6AB0889 C442FBF5D2           pdep    r10,rax,r10

;             var p7 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
00007FC3A6AB088E 48C1E820             shr     rax,20h
00007FC3A6AB0892 49BB0707070707070707 mov     r11,707070707070707h
00007FC3A6AB089C C4C2FBF5C3           pdep    rax,rax,r11
```

Issue

This is a very minor variation on the bug I opened yesterday (#442).

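Somehow, merely reordering a few of these expressions stops the JIT from reliably performing CSE on the PDEP mask constant: in the listing above, `0x0707070707070707` is re-materialized with a separate 10-byte `mov r64, imm64` before each of the eight `pdep` instructions, unlike the code I posted in the previous issue.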

I'm not sure why such a trivial change relative to the previous listing suddenly causes this.

category:cq
theme:cse
skill-level:intermediate
cost:medium
impact:small

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization