Add ARM support to WEBP Utilities #2933

JimBobSquarePants · 2025-06-04T02:33:21Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

Ports the SSE-specific code to general Vector128 equivalents within all WEBP utilities, adding ARM support.

@tannergooding I've tried my best to polyfill all the various methods in Vector128Utilities but I would really appreciate any guidance you can provide to improve them (and fix any bugs 😛)

Copilot

Pull Request Overview

This pull request adds ARM support for WEBP utilities by replacing x86-specific intrinsics (e.g. Sse2/Avx2/Sse41) with generalized Vector128/Vector256 APIs and updates the corresponding test names. Key changes include renaming test methods to reflect the new vector types, updating low-level shader conversion functions, and reworking helper methods in the SIMD utility classes.

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.

File	Description
tests/ImageSharp.Tests/Formats/WebP/ColorSpaceTransformUtilsTests.cs	Renamed tests to remove “SSE” references in favor of “Vector128/256”.
src/ImageSharp/Formats/Webp/*.cs	Replaced Sse2/Avx2/Sse41 calls with Vector128/256 equivalents and updated method names to reflect the generalized intrinsics.
src/ImageSharp/Common/Helpers/Vector*Utilities.cs, SimdUtils.HwIntrinsics.cs	Updated intrinsics helper functions to remove explicit support property checks and use operator overloads directly.

Comments suppressed due to low confidence (2)

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs:112

Removing property checks for SupportsShuffleNativeByte may work on most hardware, but please verify that on ARM devices the intrinsic methods used (e.g. ShuffleNative) behave as expected. Consider adding documentation or runtime tests to ensure that these calls are available and perform correctly on all target platforms.

if (Vector512.IsHardwareAccelerated || Vector256.IsHardwareAccelerated || Vector128.IsHardwareAccelerated)

src/ImageSharp/Formats/Webp/AlphaDecoder.cs:8

The removal of the ARM-specific intrinsics import (System.Runtime.Intrinsics.Arm) is a key change intended to enable ARM support through Vector128 APIs; please ensure that the conversion logic and unfiltering methods have been thoroughly validated on ARM hardware to confirm correct behavior.

using System.Runtime.Intrinsics.X86;

stefannikolei · 2025-06-04T18:57:21Z

src/ImageSharp/Common/Helpers/Vector256Utilities.cs

@@ -21,24 +20,6 @@ namespace SixLabors.ImageSharp.Common.Helpers;
 internal static class Vector256_


This could benefit from the new C# extension constructs.

Yeah, I'm interested in them. I'm hoping many of these methods will eventually make it into the runtime so I don't actually need them though.

TechPizzaDev · 2025-06-05T12:31:40Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+            return Vector128<short>.Zero;
+        }
+
+        if (AdvSimd.IsSupported)


These fallbacks seem to be unnecessary on .NET 8+. The count >= 16 check causes value << count to emit just a vpsllw without any masking.

Right, I'd expect we can drop the Sse2, AdvSimd, and PackedSimd path and instead just do:

if (count >= 16)
{
    return Vector128<short>.Zero;
}

return value << count;

Since count is expected to be constant, the if (count >= 16) path should typically be dropped and we just get vpsllw or the relevant platform specific shift emitted by the xplat operator.
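As a self-contained illustration of the simplification discussed above, here is a minimal sketch. The class and method names (`ShiftSketch`, `ShiftLeftLogical`) are hypothetical and are not taken from the PR; the body simply keeps the explicit guard for counts of 16 or more and otherwise defers to the cross-platform operator.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class ShiftSketch
{
    // Hypothetical helper name, used only for this illustration.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<short> ShiftLeftLogical(Vector128<short> value, byte count)
    {
        // Keep the SSE2-style result of all zeros for out-of-range counts.
        if (count >= 16)
        {
            return Vector128<short>.Zero;
        }

        // The cross-platform operator; the JIT picks the platform shift.
        return value << count;
    }
}
```

When `count` is a constant at the call site, the guard is expected to fold away after inlining, as described in the comment above.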

TechPizzaDev · 2025-06-05T12:42:37Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+    /// from <paramref name="left"/> and <paramref name="right"/>.
+    /// </returns>
+    [MethodImpl(MethodImplOptions.AggressiveInlining)]
+    public static Vector128<short> MultiplyLow(Vector128<short> left, Vector128<short> right)


MultiplyLow should be equivalent to Vector128.Multiply i.e. left * right.

Yep, this can just be left * right in all cases

Nothing is actually using that... I must have added it in error.

TechPizzaDev · 2025-06-05T13:18:36Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+            return PackedSimd.SubtractSaturate(left, right);
+        }
+
+        // Widen inputs to 32-bit signed


Maybe overkill, but I'd make these widen-saturate-narrow fallbacks into generic helpers (which may not be optimal on Mono) to reduce the amount of duplicated code, e.g.:

interface IBinaryOp<T>
{
    static abstract T Apply(T left, T right);
}

struct SubtractOp<T> : IBinaryOp<Vector128<T>>
{
    public static Vector128<T> Apply(Vector128<T> left, Vector128<T> right) => left - right;
}

// With further overloads for ushort@uint, byte@ushort, sbyte@short.
// Could be made generic over integers too if there was a generic Widen helper...
public static Vector128<short> SaturateOp<T>(
    Vector128<short> left,
    Vector128<short> right,
    Vector128<int> min,
    Vector128<int> max)
    where T : IBinaryOp<Vector128<int>>
{
    (Vector128<int> leftLo, Vector128<int> leftHi) = Vector128.Widen(left);
    (Vector128<int> rightLo, Vector128<int> rightHi) = Vector128.Widen(right);

    Vector128<int> lo = T.Apply(leftLo, rightLo);
    Vector128<int> hi = T.Apply(leftHi, rightHi);

    lo = Clamp(lo, min, max);
    hi = Clamp(hi, min, max);

    return Vector128.Narrow(lo, hi);
}

Given I'll be getting the runtime versions in the future I'll probably not bother.

TechPizzaDev · 2025-06-05T13:20:30Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+    /// A vector containing the results of subtracting packed signed 16-bit integers
+    /// </returns>
+    [MethodImpl(MethodImplOptions.AggressiveInlining)]
+    public static Vector128<short> SubtractSaturate(Vector128<short> left, Vector128<short> right)


Add a #if NET10_0_OR_GREATER guard to all narrowing and saturating methods to utilize dotnet/runtime#115525.

I'm only targeting .NET 8 just now. By the time I finally get around to targeting .NET 10, I'll drop 8 support and use the runtime directly.

TechPizzaDev · 2025-06-05T13:24:45Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+    /// in <paramref name="value"/>.
+    /// </returns>
+    [MethodImpl(MethodImplOptions.AggressiveInlining)]
+    public static int MoveMask(Vector128<byte> value)


Pains me to say this, seeing this nice port, but Vector128.ExtractMostSignificantBits is equivalent to MoveMask (and that JIT intrinsic can benefit from AVX512 masking).

Yep, this can just be value.ExtractMostSignificantBits() which does all the same handling here, but which can sometimes be optimized by the JIT for special scenarios and on platforms without a native instruction.

Ha! Fantastic!
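For reference, the replacement suggested above can be sketched as a one-line wrapper. The class name `MoveMaskSketch` is hypothetical; `ExtractMostSignificantBits` is the cross-platform API named in the review, and it returns a `uint` that is cast here to match the original `int` signature.

```csharp
using System.Runtime.Intrinsics;

internal static class MoveMaskSketch
{
    // Collects the top bit of each of the 16 bytes into the low 16 bits of the result.
    public static int MoveMask(Vector128<byte> value)
        => (int)value.ExtractMostSignificantBits();
}
```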

TechPizzaDev · 2025-06-05T13:27:04Z

src/ImageSharp/Formats/Webp/Lossy/Vp8Encoding.cs

+            int output0 = ref0.AsInt32().ToScalar();
+            int output1 = ref1.AsInt32().ToScalar();
+            int output2 = ref2.AsInt32().ToScalar();
+            int output3 = ref3.AsInt32().ToScalar();

            Unsafe.As<byte, int>(ref outputRef) = output0;


These Unsafe.As could be replaced with Unsafe.WriteUnaligned.

TechPizzaDev · 2025-06-05T13:28:11Z

src/ImageSharp/Formats/Webp/Lossy/Vp8Encoding.cs

-            ref1 = Sse2.ConvertScalarToVector128Int32(Unsafe.As<byte, int>(ref Unsafe.Add(ref referenceRef, WebpConstants.Bps))).AsByte();
-            ref2 = Sse2.ConvertScalarToVector128Int32(Unsafe.As<byte, int>(ref Unsafe.Add(ref referenceRef, WebpConstants.Bps * 2))).AsByte();
-            ref3 = Sse2.ConvertScalarToVector128Int32(Unsafe.As<byte, int>(ref Unsafe.Add(ref referenceRef, WebpConstants.Bps * 3))).AsByte();
+            Vector128<byte> ref0 = Vector128.CreateScalar(Unsafe.As<byte, int>(ref referenceRef)).AsByte();


These Unsafe.As could be replaced with Unsafe.ReadUnaligned.
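A minimal sketch of what the two suggestions about `Unsafe.As` could look like together. The helper names `Load4` and `Store4` are hypothetical and do not appear in the PR; the point is only that `Unsafe.ReadUnaligned` and `Unsafe.WriteUnaligned` avoid assuming 4-byte alignment when reinterpreting a `byte` reference as an `int`.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class UnalignedSketch
{
    // Loads 4 bytes starting at 'source' into the lowest 32-bit lane; the other lanes are zero.
    public static Vector128<byte> Load4(ref byte source)
        => Vector128.CreateScalar(Unsafe.ReadUnaligned<int>(ref source)).AsByte();

    // Stores the lowest 4 bytes of 'value' at 'destination' without an alignment assumption.
    public static void Store4(ref byte destination, Vector128<byte> value)
        => Unsafe.WriteUnaligned(ref destination, value.AsInt32().ToScalar());
}
```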

tannergooding · 2025-06-05T13:48:43Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

-        [MethodImpl(MethodImplOptions.AggressiveInlining)]
-        get => Ssse3.IsSupported || AdvSimd.IsSupported;
+        // Portable fallback: (a + b + 1) >> 1
+        return (left + right + Vector128.Create((byte)1)) >> 1;


This isn't quite the same.

Sse2.Average (pavg) and AdvSimd.FusedAddRoundedHalving (urhadd) are instead effectively doing:

return Vector128.Narrow(
    (Vector128.WidenLower(left) + Vector128.WidenLower(right) + Vector128<ushort>.One) >> 1,
    (Vector128.WidenUpper(left) + Vector128.WidenUpper(right) + Vector128<ushort>.One) >> 1);

Which is to say that it accounts for a potential 9th bit to ensure a correct rounded result. This can be observed for Average(255, 1), which should produce 128 but where this fallback will produce 0, because 255 + 1 + 1 wraps to 1 in 8 bits before the shift.
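The difference can be reproduced with a small side-by-side sketch. The names `AverageNaive` and `AverageWidened` are hypothetical; the second method follows the widen, add, shift, narrow shape quoted above, using `Vector128.Create((ushort)1)` in place of `Vector128<ushort>.One`.

```csharp
using System.Runtime.Intrinsics;

internal static class AverageSketch
{
    // Byte-wide fallback: the additions wrap at 8 bits before the shift.
    public static Vector128<byte> AverageNaive(Vector128<byte> left, Vector128<byte> right)
        => (left + right + Vector128.Create((byte)1)) >> 1;

    // Widened fallback: the sum is held in 16 bits, so the 9th bit survives until the shift.
    public static Vector128<byte> AverageWidened(Vector128<byte> left, Vector128<byte> right)
    {
        Vector128<ushort> one = Vector128.Create((ushort)1);
        Vector128<ushort> lo = (Vector128.WidenLower(left) + Vector128.WidenLower(right) + one) >> 1;
        Vector128<ushort> hi = (Vector128.WidenUpper(left) + Vector128.WidenUpper(right) + one) >> 1;
        return Vector128.Narrow(lo, hi);
    }
}
```

For inputs of 255 and 1 in every lane, the first method yields 0 and the second yields 128.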

tannergooding · 2025-06-05T13:50:43Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+        if (Sse.IsSupported)
+        {
+            return Sse.Shuffle(vector, vector, control);
+        }


This is probably not needed. The xplat call should do the "right thing" and will allow the JIT to use something like Avx.Permute(vector, control) instead if it's a more efficient encoding.

How can it use Avx when we're targeting Vector128 support only?

AVX isn't just 256-bit support. It includes a number of newer 128-bit instructions and a general new encoding as well. The same goes for AVX512, where it provides a new encoding and newer 128/256-bit instructions in addition to the 512-bit instruction support.

The JIT knows what the actual underlying hardware supports and will opportunistically light up for newer instruction sets where it will benefit perf or size. So it's already using some of these in various places where safe.

For NAOT or other scenarios where you aren't jitting it won't use these newer instructions or encodings, but will still allow optimizations where feasible. For example, certain control inputs may allow you to emit Sse.UnpackLow instead and save 1 byte of encoding space, just because they are equivalent in functionality for that particular control.

Ah, but I only use this method where Vector256.IsHardwareAccelerated is false.

That can still be AVX-capable hardware. V256.IsHardwareAccelerated requires AVX2.

But even without that, the xplat api still gives more flexibility for the JIT to optimize so it’s often preferred where feasible

We tend to be very 1-to-1 for the platform specific APIs

OK, will update

tannergooding · 2025-06-05T13:51:17Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+        if (Sse2.IsSupported)
        {
-            return Sse.Shuffle(vector, vector, control);
+            return Sse2.Shuffle(vector, control);
        }


Similar thing here. I'd expect just the xplat call is sufficient and the JIT will continue emitting the optimal shuffle instruction for that control byte.

I can't see Sse2.Shuffle(vector, control) for float?

Right. I was rather trying to say that I expect you can just have this be Vector128.Shuffle using the same general mechanism for creating the Vector128 indices out of the control. It can have the same general benefits to codegen and makes it more portable.
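One way to read the suggestion above is sketched below, assuming the control byte is decoded two bits per lane in the same way as the short-based helpers elsewhere in this review. The class name `ShuffleSketch` is hypothetical.

```csharp
using System.Runtime.Intrinsics;

internal static class ShuffleSketch
{
    // Decodes an SSE-style control byte (2 bits per lane) into per-lane indices
    // and defers to the cross-platform shuffle.
    public static Vector128<float> Shuffle(Vector128<float> vector, byte control)
    {
        Vector128<int> indices = Vector128.Create(
            control & 0x3,
            (control >> 2) & 0x3,
            (control >> 4) & 0x3,
            (control >> 6) & 0x3);

        return Vector128.Shuffle(vector, indices);
    }
}
```

With a constant `control`, the index vector is a constant too, which is what gives the JIT room to pick an instruction for that specific pattern.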

tannergooding · 2025-06-05T13:53:56Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+        Vector64<short> indices = Vector64.Create(
+            (short)(control & 0x3),
+            (short)((control >> 2) & 0x3),
+            (short)((control >> 4) & 0x3),
+            (short)((control >> 6) & 0x3));
+
+        return Vector128.Create(value.GetLower(), Vector64.Shuffle(value.GetUpper(), indices));


A better fallback would be to do this:

Suggested change

-        Vector64<short> indices = Vector64.Create(
-            (short)(control & 0x3),
-            (short)((control >> 2) & 0x3),
-            (short)((control >> 4) & 0x3),
-            (short)((control >> 6) & 0x3));
-
-        return Vector128.Create(value.GetLower(), Vector64.Shuffle(value.GetUpper(), indices));
+        Vector128<short> indices = Vector128.Create(
+            0,
+            1,
+            2,
+            3,
+            (short)((control & 0x3) + 4),
+            (short)(((control >> 2) & 0x3) + 4),
+            (short)(((control >> 4) & 0x3) + 4),
+            (short)(((control >> 6) & 0x3) + 4));
+
+        return Vector128.Shuffle(value, indices);

The reason is that V64 isn't accelerated on all platforms and you already have a V128, so this allows you to avoid decomposing the vector and doing more operations.

tannergooding · 2025-06-05T13:55:27Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+        Vector64<short> indices = Vector64.Create(
+            (short)(control & 0x3),
+            (short)((control >> 2) & 0x3),
+            (short)((control >> 4) & 0x3),
+            (short)((control >> 6) & 0x3));
+
+        return Vector128.Create(Vector64.Shuffle(value.GetLower(), indices), value.GetUpper());


Similar comment

Suggested change

-        Vector64<short> indices = Vector64.Create(
-            (short)(control & 0x3),
-            (short)((control >> 2) & 0x3),
-            (short)((control >> 4) & 0x3),
-            (short)((control >> 6) & 0x3));
-
-        return Vector128.Create(Vector64.Shuffle(value.GetLower(), indices), value.GetUpper());
+        Vector128<short> indices = Vector128.Create(
+            (short)(control & 0x3),
+            (short)((control >> 2) & 0x3),
+            (short)((control >> 4) & 0x3),
+            (short)((control >> 6) & 0x3),
+            4,
+            5,
+            6,
+            7);
+
+        return Vector128.Shuffle(value, indices);

tannergooding · 2025-06-05T13:57:09Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

@@ -133,8 +198,7 @@ public static Vector128<byte> ShiftRightBytesInVector(Vector128<byte> value, [Co
            return AdvSimd.ExtractVector128(value, Vector128<byte>.Zero, numBytes);
        }

-        ThrowUnreachableException();
-        return default;
+        return Vector128.Shuffle(value, Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) + Vector128.Create(numBytes));


In .NET 9+ the Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) can be Vector128<byte>.Indices

The total Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) + Vector128.Create(numBytes) can itself become Vector128.CreateSequence<byte>(start: numBytes, step: 1)

.NET 8 only unfortunately. Good to see a new API though, writing these is cumbersome and not great to read.

tannergooding · 2025-06-05T13:58:07Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

@@ -158,8 +222,43 @@ public static Vector128<byte> ShiftLeftBytesInVector(Vector128<byte> value, [Con
 #pragma warning restore CA1857 // A constant is expected for the parameter
        }

-        ThrowUnreachableException();
-        return default;
+        return Vector128.Shuffle(value, Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) - Vector128.Create(numBytes));


Similar thing: the index vector here starts at 0 - numBytes and increases by one per lane, so in .NET 9 it can become Vector128.CreateSequence<byte>(start: unchecked((byte)(-numBytes)), step: 1)
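The two shuffle-based fallbacks from these diffs can be exercised in isolation on .NET 8. The sketch below uses hypothetical names (`ByteShiftSketch`, `ShiftRightBytes`, `ShiftLeftBytes`) and assumes `numBytes` is in the range 0 to 16. It relies on `Vector128.Shuffle` returning zero for any index of 16 or more, which covers both the lanes shifted past the end and the indices that wrap below zero.

```csharp
using System.Runtime.Intrinsics;

internal static class ByteShiftSketch
{
    // The identity index vector 0..15 (Vector128<byte>.Indices on .NET 9+).
    private static Vector128<byte> Identity
        => Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

    // result[i] = value[i + numBytes]; indices of 16 or more select zero.
    public static Vector128<byte> ShiftRightBytes(Vector128<byte> value, byte numBytes)
        => Vector128.Shuffle(value, Identity + Vector128.Create(numBytes));

    // result[i] = value[i - numBytes]; indices that wrap below zero become large values and select zero.
    public static Vector128<byte> ShiftLeftBytes(Vector128<byte> value, byte numBytes)
        => Vector128.Shuffle(value, Identity - Vector128.Create(numBytes));
}
```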

tannergooding · 2025-06-05T14:18:13Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+            Vector128<long> v0 = AdvSimd.AddPairwiseWidening(prodLo);
+            Vector128<long> v1 = AdvSimd.AddPairwiseWidening(prodHi);
+
+            return Vector128.Narrow(v0, v1);


Why not just AdvSimd.Arm64.AddPairwise(prodLo, prodHi)?

I'll add that also. I want to be able to support as many Arm chipsets as possible.

tannergooding · 2025-06-05T14:21:11Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+            return Sse2.MultiplyHigh(left, right);
+        }
+
+        // Widen each half of the short vectors into two int vectors


For Arm64, you can use AdvSimd.MultiplyWideningLower/Upper, then shift, then narrow
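The widen, shift, narrow sequence mentioned above can also be written with cross-platform APIs only, which may help when checking a platform-specific path against a reference. This is a sketch under that assumption, with a hypothetical class name; it is not the code from the PR.

```csharp
using System.Runtime.Intrinsics;

internal static class MultiplyHighSketch
{
    // Returns the upper 16 bits of each signed 16x16 -> 32-bit product.
    public static Vector128<short> MultiplyHigh(Vector128<short> left, Vector128<short> right)
    {
        // Widen each half to 32-bit lanes so the full product is representable.
        Vector128<int> lo = Vector128.WidenLower(left) * Vector128.WidenLower(right);
        Vector128<int> hi = Vector128.WidenUpper(left) * Vector128.WidenUpper(right);

        // Arithmetic shift keeps the sign; the result then fits in 16 bits.
        return Vector128.Narrow(lo >> 16, hi >> 16);
    }
}
```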

tannergooding · 2025-06-05T14:30:33Z

src/ImageSharp/Common/Helpers/Vector256Utilities.cs

@@ -73,8 +46,7 @@ public static Vector256<byte> ShuffleNative(Vector256<byte> vector, Vector256<by
            return Avx2.Shuffle(vector, indices);


indices works differently between Avx2 and Vector256

In particular, Avx2 shuffles within 2x 128-bit lanes and so it expects [0, 15] for the lower and then [0, 15] for the upper. It's effectively doing V256.Create(V128.Shuffle(vector.GetLower(), indices.GetLower()), V128.Shuffle(vector.GetUpper(), indices.GetUpper())).

While V256 shuffles within a single 256-bit lane and so it expects [0, 31] across the whole thing. The perf is then dependent on what the hardware supports and if the indices cross lanes at all. That is, it is fastest on all hardware if indices.GetLower() is only [0, 15] and indices.GetUpper() is only [16, 31], just because AVX2 doesn't actually provide full-width byte shuffle; only AVX512VBMI capable hardware provides proper full width support.

Ah... That's interesting. We're relying on the Avx2 behavior currently. I'll have a look at the current calls to see if I can adjust the indices, though maybe I should rename this ShufflePerLane or Shuffle128

Went with ShufflePerLane for clarity.
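The per-lane behaviour described above can be sketched as a portable fallback. The method name follows the rename mentioned in the thread, but the body is an assumption for illustration and not the PR's implementation. It matches `Avx2.Shuffle` only when every index is in [0, 15]: `pshufb` zeroes a lane when the index's high bit is set and otherwise uses the low four bits, whereas `Vector128.Shuffle` zeroes a lane for any index of 16 or more.

```csharp
using System.Runtime.Intrinsics;

internal static class ShufflePerLaneSketch
{
    // Shuffles each 128-bit half independently, with indices relative to that half.
    public static Vector256<byte> ShufflePerLane(Vector256<byte> vector, Vector256<byte> indices)
        => Vector256.Create(
            Vector128.Shuffle(vector.GetLower(), indices.GetLower()),
            Vector128.Shuffle(vector.GetUpper(), indices.GetUpper()));
}
```

With an all-zero index vector, for example, the lower half is filled with element 0 and the upper half with element 16, rather than element 0 throughout.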

TechPizzaDev · 2025-06-06T14:31:02Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+        if (AdvSimd.IsSupported)
+        {
+            Vector128<int> prodLo = AdvSimd.MultiplyWideningLower(left.GetLower(), right.GetLower());
+            Vector128<int> prodHi = AdvSimd.MultiplyWideningLower(left.GetUpper(), right.GetUpper());


Suggested change

-            Vector128<int> prodHi = AdvSimd.MultiplyWideningLower(left.GetUpper(), right.GetUpper());
+            Vector128<int> prodHi = AdvSimd.MultiplyWideningUpper(left, right);

TechPizzaDev · 2025-06-06T14:32:39Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+            Vector128<long> v0 = AdvSimd.AddPairwiseWidening(prodLo);
+            Vector128<long> v1 = AdvSimd.AddPairwiseWidening(prodHi);
+
+            return Vector128.Narrow(v0, v1);


Based on sse2neon it is probably cheaper to not widen. This applies to other helpers too.

Suggested change

-            Vector128<long> v0 = AdvSimd.AddPairwiseWidening(prodLo);
-            Vector128<long> v1 = AdvSimd.AddPairwiseWidening(prodHi);
-
-            return Vector128.Narrow(v0, v1);
+            Vector64<int> v0 = AdvSimd.AddPairwise(prodLo.GetLower(), prodLo.GetUpper());
+            Vector64<int> v1 = AdvSimd.AddPairwise(prodHi.GetLower(), prodHi.GetUpper());
+
+            return Vector128.Create(v0, v1);
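The helper under discussion appears to compute a multiply-and-add of adjacent 16-bit pairs (the commit list names it MultiplyAddAdjacent). For comparison with the AdvSimd path, here is a cross-platform reference sketch that uses a different technique from the one in the diff: it reinterprets the inputs as 32-bit lanes and separates the even and odd 16-bit elements with shifts. It assumes a little-endian lane layout, and the class name is hypothetical.

```csharp
using System.Runtime.Intrinsics;

internal static class MultiplyAddAdjacentSketch
{
    // result[i] = left[2i] * right[2i] + left[2i + 1] * right[2i + 1]
    public static Vector128<int> MultiplyAddAdjacent(Vector128<short> left, Vector128<short> right)
    {
        Vector128<int> a = left.AsInt32();
        Vector128<int> b = right.AsInt32();

        // Even elements sit in the low 16 bits of each 32-bit lane: sign-extend them.
        Vector128<int> even = ((a << 16) >> 16) * ((b << 16) >> 16);

        // Odd elements sit in the high 16 bits: an arithmetic shift extracts them.
        Vector128<int> odd = (a >> 16) * (b >> 16);

        return even + odd;
    }
}
```

This avoids both widening to 64-bit and a pairwise add, at the cost of two 32-bit multiplies, so whether it is faster than the AdvSimd sequence on a given chip would need measuring.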

JimBobSquarePants added 18 commits May 30, 2025 18:22

Port ColorSpaceTransformUtils

82bc797

Port TTransformSse41

c490bc6

Use explicit type

dd9bd0a

Port TransformTwo

3be2b6a

Add explicit AdvSimd to MultiplyAddAdjacent

0a9c407

Add XPlat V128 SubtractSaturate

cfad39b

Port Vp8_Sse16x16

7223e90

Remove all v128 util restrictions

e616844

Port load/store

f0c6f4c

Port filters

f2e4257

Complete LossyUtils port

217450e

Port Vp8Encoding

85d6a2b

Port YuvConversion

b5fe86c

Port common utils and alpha decoder

1a91ec9

Remove restrictions from vector utilities

6b5392b

Add Arm64 movemask

1a63729

Port LosslessUtils V128

e553807

Update LosslessUtils.cs

0c0748e

JimBobSquarePants requested review from tocsoft, antonfirsov, dlemstra, brianpopow and Copilot June 4, 2025 02:33

JimBobSquarePants added area:performance formats:webp arch:arm64 labels Jun 4, 2025

Copilot AI reviewed Jun 4, 2025


JimBobSquarePants changed the title ~~Add ARM support to WEBP Utilties~~ Add ARM support to WEBP Utilities Jun 4, 2025

Merge branch 'main' into js/webp-arm

b29c25c

stefannikolei reviewed Jun 4, 2025


Merge branch 'main' into js/webp-arm

db28d22

TechPizzaDev reviewed Jun 5, 2025


tannergooding reviewed Jun 5, 2025


JimBobSquarePants added 2 commits June 6, 2025 11:03

Merge branch 'main' into js/webp-arm

62a0666

Update based on feedback

3627073

TechPizzaDev reviewed Jun 6, 2025
