Skip to content

Add ARM support to WEBP Utilities #2933

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 22 commits into
base: main
Choose a base branch
from
Open

Add ARM support to WEBP Utilities #2933

wants to merge 22 commits into from

Conversation

JimBobSquarePants
Copy link
Member

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Fixes #2125

Ports the Sse specific code to general Vector128 equivalents within all WEBP utilities adding ARM support.

@tannergooding I've tried my best to polyfil all the various methods in Vector128Utilties but I would really appreciate any guidance you can provide to improve them (and fix any bugs 😛)

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request ported ARM support for WEBP utilities by replacing X86-specific intrinsics (e.g. Sse2/Avx2/Sse41) with generalized Vector128/Vector256 APIs and updated corresponding test names. Key changes include renaming test methods to reflect the new vector types, updating low-level shader conversion functions, and reworking helper methods in the SIMD utility classes.

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.

File Description
tests/ImageSharp.Tests/Formats/WebP/ColorSpaceTransformUtilsTests.cs Renamed tests to remove “SSE” references in favor of “Vector128/256”.
src/ImageSharp/Formats/Webp/*.cs Replaced Sse2/Avx2/Sse41 calls with Vector128/256 equivalents and updated method names to reflect the generalized intrinsics.
src/ImageSharp/Common/Helpers/Vector*Utilities.cs, SimdUtils.HwIntrinsics.cs Updated intrinsics helper functions to remove explicit support property checks and use operator overloads directly.
Comments suppressed due to low confidence (2)

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs:112

  • Removing property checks for SupportsShuffleNativeByte may work on most hardware, but please verify that on ARM devices the intrinsic methods used (e.g. ShuffleNative) behave as expected. Consider adding documentation or runtime tests to ensure that these calls are available and perform correctly on all target platforms.
if (Vector512.IsHardwareAccelerated || Vector256.IsHardwareAccelerated || Vector128.IsHardwareAccelerated)

src/ImageSharp/Formats/Webp/AlphaDecoder.cs:8

  • The removal of the ARM-specific intrinsics import (System.Runtime.Intrinsics.Arm) is a key change intended to enable ARM support through Vector128 APIs; please ensure that the conversion logic and unfiltering methods have been thoroughly validated on ARM hardware to confirm correct behavior.
using System.Runtime.Intrinsics.X86;

@JimBobSquarePants JimBobSquarePants changed the title Add ARM support to WEBP Utilties Add ARM support to WEBP Utilities Jun 4, 2025
@@ -21,24 +20,6 @@ namespace SixLabors.ImageSharp.Common.Helpers;
internal static class Vector256_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could profit from the new c# extension constructs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm interested in them. I'm hoping many of these methods will eventually make it into the runtime so I don't actually need them though.

return Vector128<short>.Zero;
}

if (AdvSimd.IsSupported)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These fallbacks seem to be unnecessary on NET8+. The count >= 16 causes value << count to emit just a vpsllw without and masking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I'd expect we can drop the Sse2, AdvSimd, and PackedSimd path and instead just do:

if (count >= 16)
{
    return Vector128<short>.Zero;
}
return value << count;

Since count is expected to be constant, the if (count >= 16) path should typically be dropped and we just get vpsllw or the relevant platform specific shift emitted by the xplat operator.

/// from <paramref name="left"/> and <paramref name="right"/>.
/// </returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<short> MultiplyLow(Vector128<short> left, Vector128<short> right)
Copy link
Contributor

@TechPizzaDev TechPizzaDev Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MultiplyLow should be equivalent to Vector128.Multiply i.e. left * right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this can just be left * right in all cases

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing is actually using that... I must have added it in error.

return PackedSimd.SubtractSaturate(left, right);
}

// Widen inputs to 32-bit signed
Copy link
Contributor

@TechPizzaDev TechPizzaDev Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe overkill but I'd make these widen-saturate-narrow fallbacks into generic helpers (which may not be optimal on Mono) to reduce amount of duplicated code e.g.

interface IBinaryOp<T> { 
    static abstract T Apply(T left, T right); 
}
struct SubtractOp : IBinaryOp<Vector128<T>> { 
    static abstract Vector128<T> Apply(Vector128<T> left, Vector128<T> right) => left - right;
}
// With further overloads for ushort@uint, byte@ushort, sbyte@short.
// Could be made generic over integers too if there was a generic Widen helper...
public static Vector128<short> SaturateOp<T>(
    Vector128<short> left, Vector128<short> right, Vector128<int> min, Vector128<int> max) 
    where T : IOp<Vector128<int>> 
{
    (Vector128<int> leftLo, Vector128<int> leftHi) = Vector128.Widen(left);
    (Vector128<int> rightLo, Vector128<int> rightHi) = Vector128.Widen(right);

    Vector128<int> lo = T.Apply(leftLo, rightLo);
    Vector128<int> hi = T.Apply(leftHi, rightHi);

    lo = Clamp(lo, min, max);
    hi = Clamp(hi, min, max);

    return Vector128.Narrow(lo, hi);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given I'll be getting the runtime versions in the future I'll probably not bother.

/// A vector containing the results of subtracting packed signed 16-bit integers
/// </returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<short> SubtractSaturate(Vector128<short> left, Vector128<short> right)
Copy link
Contributor

@TechPizzaDev TechPizzaDev Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add #if NET10_0_OR_GREATER guard in all narrowing and saturating methods to utilize dotnet/runtime#115525.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm only targeting .NET 8 just now. By the time I finally get around to targeting .NET 10, I'll drop 8 support and use the runtime directly.

/// in <paramref name="value"/>.
/// </returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int MoveMask(Vector128<byte> value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pains me to say this, seeing this nice port, but Vector128.ExtractMostSignificantBits is equivalent to MoveMask (and that JIT intrinsic can benefit from AVX512 masking).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this can just be value.ExtractMostSignificantBits() which does all the same handling here, but which can sometimes be optimized by the JIT for special scenarios and on platforms without a native instruction.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha! Fantastic!

int output0 = ref0.AsInt32().ToScalar();
int output1 = ref1.AsInt32().ToScalar();
int output2 = ref2.AsInt32().ToScalar();
int output3 = ref3.AsInt32().ToScalar();

Unsafe.As<byte, int>(ref outputRef) = output0;
Copy link
Contributor

@TechPizzaDev TechPizzaDev Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These Unsafe.As could be replaced with Unsafe.WriteUnaligned.

ref1 = Sse2.ConvertScalarToVector128Int32(Unsafe.As<byte, int>(ref Unsafe.Add(ref referenceRef, WebpConstants.Bps))).AsByte();
ref2 = Sse2.ConvertScalarToVector128Int32(Unsafe.As<byte, int>(ref Unsafe.Add(ref referenceRef, WebpConstants.Bps * 2))).AsByte();
ref3 = Sse2.ConvertScalarToVector128Int32(Unsafe.As<byte, int>(ref Unsafe.Add(ref referenceRef, WebpConstants.Bps * 3))).AsByte();
Vector128<byte> ref0 = Vector128.CreateScalar(Unsafe.As<byte, int>(ref referenceRef)).AsByte();
Copy link
Contributor

@TechPizzaDev TechPizzaDev Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These Unsafe.As could be replaced with Unsafe.ReadUnaligned.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
get => Ssse3.IsSupported || AdvSimd.IsSupported;
// Portable fallback: (a + b + 1) >> 1
return (left + right + Vector128.Create((byte)1)) >> 1;
Copy link
Contributor

@tannergooding tannergooding Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't quite the same.

Sse2.Average (pavg) and AdvSimd.FusedAddRoundedHalving (urhadd) are instead effectively doing:

return Vector128.Narrow(
    (Vector128.WidenLower(left) + Vector128.WidenLower(right) + Vector128<ushort>.One) >> 1,
    (Vector128.WidenUpper(left) + Vector128.WidenUpper(right) + Vector128<ushort>.One) >> 1
);

Which is to say that, it accounts for a potential 9th bit to ensure a correct rounded result. This can be observed for Average(255, 1) which should produce 128 but where this fallback will produce 0

Comment on lines +63 to +66
if (Sse.IsSupported)
{
return Sse.Shuffle(vector, vector, control);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not needed. The xplat call should do the "right thing" and will allow the JIT to use something like Avx.Permute(vector, control) instead if its a more efficient encoding

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can it use Avx when we're targeting Vector128 support only?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AVX isn't just 256-bit support. It includes a number of newer 128-bit instructions and a general new encoding as well. The same goes for AVX512, where it provides a new encoding and newer 128/256-bit instructions in addition to the 512-bit instruction support.

The JIT knows what the actual underlying hardware supports and will opportunistically light up for newer instruction sets where it will benefit perf or size. So it's already using some of these in various places where safe.

For NAOT or other scenarios where you aren't jitting it won't use these newer instructions or encodings, but will still allow optimizations where feasible. For example, certain control inputs may allow you to emit Sse.UnpackLow instead and save 1 byte of encoding space, just because they are equivalent in functionality for that particular control.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah but I only use this method where Vector256.IsHardwareAccelrated is false.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That can still be AVX capable hardware. V256.IsHardwsreAccelerated requires AVX2

But even without that, the xplat api still gives more flexibility for the JIT to optimize so it’s often preferred where feasible

We tend to be very 1-to-1 for the platform specific APIs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will update

Comment on lines +87 to 90
if (Sse2.IsSupported)
{
return Sse.Shuffle(vector, vector, control);
return Sse2.Shuffle(vector, control);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar thing here. I'd expect just the xplat call is sufficient and the JIT will continue emitting the optimal shuffle instruction for that control byte.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see Sse2.Shuffle(vector, control) for float?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I was rather trying to say that I expect you can just have this be Vector128.Shuffle using the same general mechanism for creating the Vector128 indices out of the control. It can have the same general benefits to codegen and makes it more portable.

Comment on lines 120 to 126
Vector64<short> indices = Vector64.Create(
(short)(control & 0x3),
(short)((control >> 2) & 0x3),
(short)((control >> 4) & 0x3),
(short)((control >> 6) & 0x3));

return Vector128.Create(value.GetLower(), Vector64.Shuffle(value.GetUpper(), indices));
Copy link
Contributor

@tannergooding tannergooding Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A better fallback would be to do this:

Suggested change
Vector64<short> indices = Vector64.Create(
(short)(control & 0x3),
(short)((control >> 2) & 0x3),
(short)((control >> 4) & 0x3),
(short)((control >> 6) & 0x3));
return Vector128.Create(value.GetLower(), Vector64.Shuffle(value.GetUpper(), indices));
Vector128<short> indices = Vector128.Create(
0,
1,
2,
3,
(short)((control & 0x3) + 4),
(short)(((control >> 2) & 0x3) + 4),
(short)(((control >> 4) & 0x3) + 4),
(short)(((control >> 6) & 0x3) + 4));
return Vector128.Shuffle(value, indices));

The reason is that V64 isn't accelerated on all platforms and you already have a V128, so this allows you to avoid decomposing the vector and doing more operations.

Comment on lines 147 to 153
Vector64<short> indices = Vector64.Create(
(short)(control & 0x3),
(short)((control >> 2) & 0x3),
(short)((control >> 4) & 0x3),
(short)((control >> 6) & 0x3));

return Vector128.Create(Vector64.Shuffle(value.GetLower(), indices), value.GetUpper());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment

Suggested change
Vector64<short> indices = Vector64.Create(
(short)(control & 0x3),
(short)((control >> 2) & 0x3),
(short)((control >> 4) & 0x3),
(short)((control >> 6) & 0x3));
return Vector128.Create(Vector64.Shuffle(value.GetLower(), indices), value.GetUpper());
Vector128<short> indices = Vector128.Create(
(short)(control & 0x3),
(short)((control >> 2) & 0x3),
(short)((control >> 4) & 0x3),
(short)((control >> 6) & 0x3),
4,
5,
6,
7);
return Vector128.Shuffle(value, indices));

@@ -133,8 +198,7 @@ public static Vector128<byte> ShiftRightBytesInVector(Vector128<byte> value, [Co
return AdvSimd.ExtractVector128(value, Vector128<byte>.Zero, numBytes);
}

ThrowUnreachableException();
return default;
return Vector128.Shuffle(value, Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) + Vector128.Create(numBytes));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In .NET 9+ the Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) can be Vector128<byte>.Indices

The total Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) + Vector128.Create(numBytes) can itself become Vector128.CreateSequence<byte>(start: numBytes, step: 1)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.NET 8 only unfortunately. Good to see a new API though, writing these is cumbersome and not great to read.

@@ -158,8 +222,43 @@ public static Vector128<byte> ShiftLeftBytesInVector(Vector128<byte> value, [Con
#pragma warning restore CA1857 // A constant is expected for the parameter
}

ThrowUnreachableException();
return default;
return Vector128.Shuffle(value, Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) - Vector128.Create(numBytes));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar thing, in .NET 9 this can become Vector128.CreateSequence<byte>(start: numBytes, step: unchecked((byte)(-1)))

Comment on lines +539 to +542
Vector128<long> v0 = AdvSimd.AddPairwiseWidening(prodLo);
Vector128<long> v1 = AdvSimd.AddPairwiseWidening(prodHi);

return Vector128.Narrow(v0, v1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just AdvSimd.Arm64.AddPairwise(prodLo, prodHi)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add that also. I want to be able to support as many Arm chipsets as possible.

return Sse2.MultiplyHigh(left, right);
}

// Widen each half of the short vectors into two int vectors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Arm64, you can use AdvSimd.MultiplyWideningLower/Upper, then shift, then narrow

@@ -73,8 +46,7 @@ public static Vector256<byte> ShuffleNative(Vector256<byte> vector, Vector256<by
return Avx2.Shuffle(vector, indices);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indices works differently between Avx2 and Vector256

In particular, Avx2 shuffles within 2x 128-bit lanes and so it expects [0, 15] for the lower and then [0, 15] for the upper. (it's effectively doing V256.Create(V128.Shuffle(vector.GetLower(), indices.GetLower()), V128.Shuffle(vector.GetUpper(), indices.GetUpper())

While V256 shuffles within a single 256-bit lane and so it expects [0, 31] across the whole thing. The perf is then dependent on what the hardware supports and if the indices cross lanes at all. That is, it is fastest on all hardware if indices.GetLower() is only [0, 15] and indices.GetUpper() is only [16, 31], just because AVX2 doesn't actually provide full-width byte shuffle; only AVX512VBMI capable hardware provides proper full width support.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... That's interesting. We're relying on the Avx2 behavior currently. I'll have a look at the current calls to see if I can adjust the indices, though maybe I should rename this ShufflePerLane or Shuffle128

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with ShufflePerLane for clarity.

if (AdvSimd.IsSupported)
{
Vector128<int> prodLo = AdvSimd.MultiplyWideningLower(left.GetLower(), right.GetLower());
Vector128<int> prodHi = AdvSimd.MultiplyWideningLower(left.GetUpper(), right.GetUpper());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Vector128<int> prodHi = AdvSimd.MultiplyWideningLower(left.GetUpper(), right.GetUpper());
Vector128<int> prodHi = AdvSimd.MultiplyWideningUpper(left, right);

Comment on lines +539 to +542
Vector128<long> v0 = AdvSimd.AddPairwiseWidening(prodLo);
Vector128<long> v1 = AdvSimd.AddPairwiseWidening(prodHi);

return Vector128.Narrow(v0, v1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on sse2neon it is probably cheaper to not widen. This applies to other helpers too.

Suggested change
Vector128<long> v0 = AdvSimd.AddPairwiseWidening(prodLo);
Vector128<long> v1 = AdvSimd.AddPairwiseWidening(prodHi);
return Vector128.Narrow(v0, v1);
Vector64<int> v0 = AdvSimd.AddPairwise(prodLo.GetLower(), prodLo.GetUpper());
Vector64<int> v1 = AdvSimd.AddPairwise(prodHi.GetLower(), prodHi.GetUpper());
return Vector128.Create(v0, v1);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[WEBP] SaveImage is taking more than x100 times slower in AWS Instance than my local machine
4 participants