Add ARM support to WEBP Utilities #2933
base: main
Conversation
Pull Request Overview
This pull request adds ARM support to the WebP utilities by replacing x86-specific intrinsics (e.g. Sse2/Avx2/Sse41) with the generalized Vector128/Vector256 APIs and updating the corresponding test names. Key changes include renaming test methods to reflect the new vector types, updating low-level conversion functions, and reworking helper methods in the SIMD utility classes.
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.
File | Description
---|---
tests/ImageSharp.Tests/Formats/WebP/ColorSpaceTransformUtilsTests.cs | Renamed tests to remove “SSE” references in favor of “Vector128/256”.
src/ImageSharp/Formats/Webp/*.cs | Replaced Sse2/Avx2/Sse41 calls with Vector128/256 equivalents and updated method names to reflect the generalized intrinsics.
src/ImageSharp/Common/Helpers/Vector*Utilities.cs, SimdUtils.HwIntrinsics.cs | Updated intrinsics helper functions to remove explicit support property checks and use operator overloads directly.
Comments suppressed due to low confidence (2)
src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs:112
- Removing property checks for SupportsShuffleNativeByte may work on most hardware, but please verify that on ARM devices the intrinsic methods used (e.g. ShuffleNative) behave as expected. Consider adding documentation or runtime tests to ensure that these calls are available and perform correctly on all target platforms.
if (Vector512.IsHardwareAccelerated || Vector256.IsHardwareAccelerated || Vector128.IsHardwareAccelerated)
src/ImageSharp/Formats/Webp/AlphaDecoder.cs:8
- The removal of the ARM-specific intrinsics import (System.Runtime.Intrinsics.Arm) is a key change intended to enable ARM support through Vector128 APIs; please ensure that the conversion logic and unfiltering methods have been thoroughly validated on ARM hardware to confirm correct behavior.
using System.Runtime.Intrinsics.X86;
@@ -21,24 +20,6 @@ namespace SixLabors.ImageSharp.Common.Helpers;
internal static class Vector256_
This could profit from the new C# extension constructs.
Yeah, I'm interested in them. I'm hoping many of these methods will eventually make it into the runtime so I don't actually need them though.
    return Vector128<short>.Zero;
}

if (AdvSimd.IsSupported)
These fallbacks seem to be unnecessary on .NET 8+. The `count >= 16` check causes `value << count` to emit just a `vpsllw` without `and` masking.
Right, I'd expect we can drop the `Sse2`, `AdvSimd`, and `PackedSimd` paths and instead just do:

if (count >= 16)
{
    return Vector128<short>.Zero;
}

return value << count;

Since `count` is expected to be constant, the `if (count >= 16)` path should typically be dropped and we just get `vpsllw` or the relevant platform-specific shift emitted by the xplat operator.
/// from <paramref name="left"/> and <paramref name="right"/>.
/// </returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<short> MultiplyLow(Vector128<short> left, Vector128<short> right)
`MultiplyLow` should be equivalent to `Vector128.Multiply`, i.e. `left * right`.
Yep, this can just be `left * right` in all cases.
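For illustration, a minimal sketch of the simplified helper under that suggestion (same signature as the quoted snippet; not the final PR code):

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<short> MultiplyLow(Vector128<short> left, Vector128<short> right)
    => left * right; // Element-wise multiply keeping the low 16 bits, on every platform.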
Nothing is actually using that... I must have added it in error.
    return PackedSimd.SubtractSaturate(left, right);
}

// Widen inputs to 32-bit signed
Maybe overkill, but I'd make these widen-saturate-narrow fallbacks into generic helpers (which may not be optimal on Mono) to reduce the amount of duplicated code, e.g.

interface IBinaryOp<T>
{
    static abstract T Apply(T left, T right);
}

struct SubtractOp : IBinaryOp<Vector128<int>>
{
    public static Vector128<int> Apply(Vector128<int> left, Vector128<int> right) => left - right;
}

// With further overloads for ushort/uint, byte/ushort, sbyte/short.
// Could be made generic over integers too if there was a generic Widen helper...
public static Vector128<short> SaturateOp<TOp>(
    Vector128<short> left, Vector128<short> right, Vector128<int> min, Vector128<int> max)
    where TOp : IBinaryOp<Vector128<int>>
{
    // Widen to 32-bit, apply the operation, clamp to the saturation range, then narrow back.
    (Vector128<int> leftLo, Vector128<int> leftHi) = Vector128.Widen(left);
    (Vector128<int> rightLo, Vector128<int> rightHi) = Vector128.Widen(right);
    Vector128<int> lo = TOp.Apply(leftLo, rightLo);
    Vector128<int> hi = TOp.Apply(leftHi, rightHi);
    lo = Clamp(lo, min, max);
    hi = Clamp(hi, min, max);
    return Vector128.Narrow(lo, hi);
}
Given that I'll be getting the runtime versions in the future, I'll probably not bother.
/// A vector containing the results of subtracting packed signed 16-bit integers
/// </returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<short> SubtractSaturate(Vector128<short> left, Vector128<short> right)
Add an `#if NET10_0_OR_GREATER` guard in all narrowing and saturating methods to utilize dotnet/runtime#115525.
I'm only targeting .NET 8 just now. By the time I finally get around to targeting .NET 10, I'll drop 8 support and use the runtime directly.
/// in <paramref name="value"/>.
/// </returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int MoveMask(Vector128<byte> value)
Pains me to say this, seeing this nice port, but `Vector128.ExtractMostSignificantBits` is equivalent to `MoveMask` (and that JIT intrinsic can benefit from AVX512 masking).
Yep, this can just be `value.ExtractMostSignificantBits()`, which does all the same handling here, but which can sometimes be optimized by the JIT for special scenarios and on platforms without a native instruction.
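A minimal sketch of what that simplification could look like, keeping the MoveMask name from the quoted snippet (not the final PR code):

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int MoveMask(Vector128<byte> value)
    => (int)value.ExtractMostSignificantBits(); // One bit per byte lane, matching pmovmskb.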
Ha! Fantastic!
int output0 = ref0.AsInt32().ToScalar();
int output1 = ref1.AsInt32().ToScalar();
int output2 = ref2.AsInt32().ToScalar();
int output3 = ref3.AsInt32().ToScalar();

Unsafe.As<byte, int>(ref outputRef) = output0;
These `Unsafe.As` calls could be replaced with `Unsafe.WriteUnaligned`.
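For example, the first store in the quoted snippet could become (a sketch; the remaining stores would follow the same pattern):

Unsafe.WriteUnaligned(ref outputRef, output0);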
ref1 = Sse2.ConvertScalarToVector128Int32(Unsafe.As<byte, int>(ref Unsafe.Add(ref referenceRef, WebpConstants.Bps))).AsByte();
ref2 = Sse2.ConvertScalarToVector128Int32(Unsafe.As<byte, int>(ref Unsafe.Add(ref referenceRef, WebpConstants.Bps * 2))).AsByte();
ref3 = Sse2.ConvertScalarToVector128Int32(Unsafe.As<byte, int>(ref Unsafe.Add(ref referenceRef, WebpConstants.Bps * 3))).AsByte();
Vector128<byte> ref0 = Vector128.CreateScalar(Unsafe.As<byte, int>(ref referenceRef)).AsByte();
These `Unsafe.As` calls could be replaced with `Unsafe.ReadUnaligned`.
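For example, the first load could become (a sketch against the quoted lines, not the final diff):

Vector128<byte> ref0 = Vector128.CreateScalar(Unsafe.ReadUnaligned<int>(ref referenceRef)).AsByte();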
[MethodImpl(MethodImplOptions.AggressiveInlining)]
get => Ssse3.IsSupported || AdvSimd.IsSupported;

// Portable fallback: (a + b + 1) >> 1
return (left + right + Vector128.Create((byte)1)) >> 1;
This isn't quite the same. `Sse2.Average` (pavg) and `AdvSimd.FusedAddRoundedHalving` (urhadd) are instead effectively doing:

return Vector128.Narrow(
    (Vector128.WidenLower(left) + Vector128.WidenLower(right) + Vector128<ushort>.One) >> 1,
    (Vector128.WidenUpper(left) + Vector128.WidenUpper(right) + Vector128<ushort>.One) >> 1);

Which is to say that it accounts for a potential 9th bit to ensure a correctly rounded result. This can be observed for `Average(255, 1)`, which should produce `128`, but where this fallback will produce `0` because the byte addition wraps before the shift.
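As a quick sanity check of the rounding behaviour described above (a standalone snippet, not part of the PR):

Vector128<byte> a = Vector128.Create((byte)255);
Vector128<byte> b = Vector128.Create((byte)1);

// Widening, rounded average: (255 + 1 + 1) >> 1 == 128.
Vector128<byte> rounded = Vector128.Narrow(
    (Vector128.WidenLower(a) + Vector128.WidenLower(b) + Vector128<ushort>.One) >> 1,
    (Vector128.WidenUpper(a) + Vector128.WidenUpper(b) + Vector128<ushort>.One) >> 1);

// Non-widening fallback: 255 + 1 + 1 wraps to 1 in a byte, so the shifted result is 0.
Vector128<byte> wrapped = (a + b + Vector128.Create((byte)1)) >> 1;

Console.WriteLine(rounded.GetElement(0)); // 128
Console.WriteLine(wrapped.GetElement(0)); // 0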
if (Sse.IsSupported)
{
    return Sse.Shuffle(vector, vector, control);
}
This is probably not needed. The xplat call should do the "right thing" and will allow the JIT to use something like `Avx.Permute(vector, control)` instead if it's a more efficient encoding.
How can it use Avx when we're targeting `Vector128` support only?
AVX isn't just 256-bit support. It includes a number of newer 128-bit instructions and a general new encoding as well. The same goes for AVX512, where it provides a new encoding and newer 128/256-bit instructions in addition to the 512-bit instruction support.
The JIT knows what the actual underlying hardware supports and will opportunistically light up for newer instruction sets where it will benefit perf or size. So it's already using some of these in various places where safe.
For NativeAOT or other scenarios where you aren't JITing, it won't use these newer instructions or encodings, but will still allow optimizations where feasible. For example, certain `control` inputs may allow you to emit `Sse.UnpackLow` instead and save 1 byte of encoding space, just because they are equivalent in functionality for that particular `control`.
Ah, but I only use this method where `Vector256.IsHardwareAccelerated` is false.
That can still be AVX-capable hardware; `Vector256.IsHardwareAccelerated` requires AVX2.
But even without that, the xplat API still gives more flexibility for the JIT to optimize, so it's often preferred where feasible. We tend to be very 1-to-1 for the platform-specific APIs.
OK, will update
 if (Sse2.IsSupported)
 {
-    return Sse.Shuffle(vector, vector, control);
+    return Sse2.Shuffle(vector, control);
 }
Similar thing here. I'd expect just the xplat call is sufficient and the JIT will continue emitting the optimal shuffle instruction for that control byte.
I can't see `Sse2.Shuffle(vector, control)` for float?
Right. I was rather trying to say that I expect you can just have this be `Vector128.Shuffle`, using the same general mechanism for creating the `Vector128` indices out of the `control`. It can have the same general benefits to codegen and makes it more portable.
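A rough sketch of that portable path for the float overload, assuming the same 2-bits-per-element control layout the existing helper uses (illustrative only, not the PR code):

Vector128<int> indices = Vector128.Create(
    control & 0x3,
    (control >> 2) & 0x3,
    (control >> 4) & 0x3,
    (control >> 6) & 0x3);

return Vector128.Shuffle(vector, indices);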
Vector64<short> indices = Vector64.Create(
    (short)(control & 0x3),
    (short)((control >> 2) & 0x3),
    (short)((control >> 4) & 0x3),
    (short)((control >> 6) & 0x3));

return Vector128.Create(value.GetLower(), Vector64.Shuffle(value.GetUpper(), indices));
A better fallback would be to do this:
-Vector64<short> indices = Vector64.Create(
-    (short)(control & 0x3),
-    (short)((control >> 2) & 0x3),
-    (short)((control >> 4) & 0x3),
-    (short)((control >> 6) & 0x3));
-return Vector128.Create(value.GetLower(), Vector64.Shuffle(value.GetUpper(), indices));
+Vector128<short> indices = Vector128.Create(
+    0,
+    1,
+    2,
+    3,
+    (short)((control & 0x3) + 4),
+    (short)(((control >> 2) & 0x3) + 4),
+    (short)(((control >> 4) & 0x3) + 4),
+    (short)(((control >> 6) & 0x3) + 4));
+return Vector128.Shuffle(value, indices);
The reason is that `V64` isn't accelerated on all platforms and you already have a `V128`, so this allows you to avoid decomposing the vector and doing more operations.
Vector64<short> indices = Vector64.Create(
    (short)(control & 0x3),
    (short)((control >> 2) & 0x3),
    (short)((control >> 4) & 0x3),
    (short)((control >> 6) & 0x3));

return Vector128.Create(Vector64.Shuffle(value.GetLower(), indices), value.GetUpper());
Similar comment
-Vector64<short> indices = Vector64.Create(
-    (short)(control & 0x3),
-    (short)((control >> 2) & 0x3),
-    (short)((control >> 4) & 0x3),
-    (short)((control >> 6) & 0x3));
-return Vector128.Create(Vector64.Shuffle(value.GetLower(), indices), value.GetUpper());
+Vector128<short> indices = Vector128.Create(
+    (short)(control & 0x3),
+    (short)((control >> 2) & 0x3),
+    (short)((control >> 4) & 0x3),
+    (short)((control >> 6) & 0x3),
+    4,
+    5,
+    6,
+    7);
+return Vector128.Shuffle(value, indices);
@@ -133,8 +198,7 @@ public static Vector128<byte> ShiftRightBytesInVector(Vector128<byte> value, [Co
     return AdvSimd.ExtractVector128(value, Vector128<byte>.Zero, numBytes);
 }

-ThrowUnreachableException();
-return default;
+return Vector128.Shuffle(value, Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) + Vector128.Create(numBytes));
In .NET 9+ the `Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)` can be `Vector128<byte>.Indices`.

The total `Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) + Vector128.Create(numBytes)` can itself become `Vector128.CreateSequence<byte>(start: numBytes, step: 1)`.
.NET 8 only, unfortunately. Good to see a new API though; writing these is cumbersome and not great to read.
@@ -158,8 +222,43 @@ public static Vector128<byte> ShiftLeftBytesInVector(Vector128<byte> value, [Con
 #pragma warning restore CA1857 // A constant is expected for the parameter
 }

-ThrowUnreachableException();
-return default;
+return Vector128.Shuffle(value, Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) - Vector128.Create(numBytes));
Similar thing; in .NET 9 this can become `Vector128.CreateSequence<byte>(start: numBytes, step: unchecked((byte)(-1)))`.
Vector128<long> v0 = AdvSimd.AddPairwiseWidening(prodLo);
Vector128<long> v1 = AdvSimd.AddPairwiseWidening(prodHi);

return Vector128.Narrow(v0, v1);
Why not just `AdvSimd.Arm64.AddPairwise(prodLo, prodHi)`?
I'll add that also. I want to be able to support as many Arm chipsets as possible.
    return Sse2.MultiplyHigh(left, right);
}

// Widen each half of the short vectors into two int vectors
For Arm64, you can use `AdvSimd.MultiplyWideningLower`/`Upper`, then shift, then narrow.
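A possible shape for that Arm64 path, assuming this is the signed 16-bit MultiplyHigh helper quoted above (a sketch, not the PR code):

if (AdvSimd.IsSupported)
{
    Vector128<int> lo = AdvSimd.MultiplyWideningLower(left.GetLower(), right.GetLower());
    Vector128<int> hi = AdvSimd.MultiplyWideningUpper(left, right);

    // Keep the upper 16 bits of each 32-bit product, then narrow back to 16-bit lanes.
    return Vector128.Narrow(lo >> 16, hi >> 16);
}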
@@ -73,8 +46,7 @@ public static Vector256<byte> ShuffleNative(Vector256<byte> vector, Vector256<by
    return Avx2.Shuffle(vector, indices);
`indices` works differently between `Avx2` and `Vector256`.

In particular, `Avx2` shuffles within 2x 128-bit lanes and so it expects `[0, 15]` for the lower and then `[0, 15]` for the upper (it's effectively doing `V256.Create(V128.Shuffle(vector.GetLower(), indices.GetLower()), V128.Shuffle(vector.GetUpper(), indices.GetUpper()))`).

While `V256` shuffles within a single 256-bit lane and so it expects `[0, 31]` across the whole thing. The perf is then dependent on what the hardware supports and whether the indices cross lanes at all. That is, it is fastest on all hardware if `indices.GetLower()` is only `[0, 15]` and `indices.GetUpper()` is only `[16, 31]`, just because AVX2 doesn't actually provide a full-width byte shuffle; only AVX512VBMI-capable hardware provides proper full-width support.
Ah... That's interesting. We're relying on the Avx2 behavior currently. I'll have a look at the current calls to see if I can adjust the indices, though maybe I should rename this `ShufflePerLane` or `Shuffle128`.
Went with ShufflePerLane for clarity.
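For reference, a hedged sketch of what such a per-lane helper might look like (the name follows the comment above; this is not the exact code from the PR):

public static Vector256<byte> ShufflePerLane(Vector256<byte> vector, Vector256<byte> indices)
{
    if (Avx2.IsSupported)
    {
        return Avx2.Shuffle(vector, indices);
    }

    // Portable equivalent: shuffle each 128-bit lane independently with indices in [0, 15].
    return Vector256.Create(
        Vector128.Shuffle(vector.GetLower(), indices.GetLower()),
        Vector128.Shuffle(vector.GetUpper(), indices.GetUpper()));
}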
if (AdvSimd.IsSupported)
{
    Vector128<int> prodLo = AdvSimd.MultiplyWideningLower(left.GetLower(), right.GetLower());
    Vector128<int> prodHi = AdvSimd.MultiplyWideningLower(left.GetUpper(), right.GetUpper());
-Vector128<int> prodHi = AdvSimd.MultiplyWideningLower(left.GetUpper(), right.GetUpper());
+Vector128<int> prodHi = AdvSimd.MultiplyWideningUpper(left, right);
Vector128<long> v0 = AdvSimd.AddPairwiseWidening(prodLo);
Vector128<long> v1 = AdvSimd.AddPairwiseWidening(prodHi);

return Vector128.Narrow(v0, v1);
Based on sse2neon it is probably cheaper to not widen. This applies to other helpers too.
-Vector128<long> v0 = AdvSimd.AddPairwiseWidening(prodLo);
-Vector128<long> v1 = AdvSimd.AddPairwiseWidening(prodHi);
-return Vector128.Narrow(v0, v1);
+Vector64<int> v0 = AdvSimd.AddPairwise(prodLo.GetLower(), prodLo.GetUpper());
+Vector64<int> v1 = AdvSimd.AddPairwise(prodHi.GetLower(), prodHi.GetUpper());
+return Vector128.Create(v0, v1);
Prerequisites
Description
Fixes #2125
Ports the `Sse`-specific code to general `Vector128` equivalents within all WebP utilities, adding ARM support.

@tannergooding I've tried my best to polyfill all the various methods in `Vector128Utilities`, but I would really appreciate any guidance you can provide to improve them (and fix any bugs 😛)