I'm consistently seeing the scalar code run faster than the zmath version on an M1 Mac when building with -Doptimize=ReleaseFast.
Example: in the cross3, dot3, scale, bias benchmark (AOS), the scalar version takes 0.9780s and the zmath version takes 1.0045s.
I noticed that the 'swizzle' function call generates extra CPU instructions: see the dot4Old function in this godbolt and try swapping the commented-out line with the one next to it.
Changing cross3 to use @shuffle directly seems to help the benchmark:
    pub inline fn cross3(v0: Vec, v1: Vec) Vec {
        // Rotate lanes with @shuffle directly instead of calling swizzle.
        var xmm0 = @shuffle(f32, v0, undefined, [4]i32{ 1, 2, 0, 3 }); // v0.yzxw
        var xmm1 = @shuffle(f32, v1, undefined, [4]i32{ 2, 0, 1, 3 }); // v1.zxyw
        var result = xmm0 * xmm1;
        xmm0 = @shuffle(f32, xmm0, undefined, [4]i32{ 1, 2, 0, 3 }); // v0.zxyw
        xmm1 = @shuffle(f32, xmm1, undefined, [4]i32{ 2, 0, 1, 3 }); // v1.yzxw
        result = result - xmm0 * xmm1;
        // Mask off the w lane.
        return andInt(result, f32x4_mask3);
    }
I recommend making this change everywhere. dot2 also looks odd, and there seem to be a lot of potential performance improvements across zmath.
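Before swapping swizzle for @shuffle across the library, it may be worth checking the shuffle-based cross3 against a plain scalar cross product. The following is a minimal, dependency-free sketch of such a check, not zmath code. It assumes `Vec` is `@Vector(4, f32)`. The names `cross3Scalar` and `cross3Shuffle` are hypothetical, and the w lane is zeroed with `@select` as a stand-in for zmath's `andInt(result, f32x4_mask3)`.

```zig
const std = @import("std");

// Assumption: Vec is a 4-lane f32 vector, as the cross3 snippet implies.
const Vec = @Vector(4, f32);

// Hypothetical scalar reference implementation (not taken from zmath).
fn cross3Scalar(a: [3]f32, b: [3]f32) [3]f32 {
    return .{
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    };
}

// Shuffle-based variant mirroring the structure of the proposed cross3.
fn cross3Shuffle(v0: Vec, v1: Vec) Vec {
    const a_yzx = @shuffle(f32, v0, undefined, [4]i32{ 1, 2, 0, 3 });
    const b_zxy = @shuffle(f32, v1, undefined, [4]i32{ 2, 0, 1, 3 });
    const a_zxy = @shuffle(f32, v0, undefined, [4]i32{ 2, 0, 1, 3 });
    const b_yzx = @shuffle(f32, v1, undefined, [4]i32{ 1, 2, 0, 3 });
    const r = a_yzx * b_zxy - a_zxy * b_yzx;
    // Stand-in for andInt(result, f32x4_mask3): keep x, y, z and zero w.
    const keep_xyz = @Vector(4, bool){ true, true, true, false };
    return @select(f32, keep_xyz, r, Vec{ 0.0, 0.0, 0.0, 0.0 });
}

test "shuffle-based cross3 matches the scalar reference" {
    const got = cross3Shuffle(Vec{ 1.0, 2.0, 3.0, 7.0 }, Vec{ 4.0, 5.0, 6.0, 9.0 });
    const want = cross3Scalar(.{ 1.0, 2.0, 3.0 }, .{ 4.0, 5.0, 6.0 });
    try std.testing.expectEqual(want[0], got[0]);
    try std.testing.expectEqual(want[1], got[1]);
    try std.testing.expectEqual(want[2], got[2]);
    // The w lane is expected to be zero regardless of the inputs' w values.
    try std.testing.expectEqual(@as(f32, 0.0), got[3]);
}
```

Saved to a file, this can be run with `zig test`. It only checks correctness; whether @shuffle is actually faster than swizzle still has to be measured with the benchmark and by inspecting the generated assembly.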
kodalli