Releases: google/highway
1.3.0
Add:
- AddLower, PairwiseAdd/Sub, MaskedAbsOr, BitsFromMask
- AVX10_2 and Loongson LASX/LSX targets
- AVX3_SPR F16, WASM_EMU256 F64 types
- CeilInt/FloorInt, DemoteToNearestInt and F16/F64 NearestInt
- Complex number operations, F16/BF16 assignment operators
- emulated bf16/f16 Load/StoreInterleaved
- hwy::Warn/HWY_WARN, use instead of fprintf
- HWY_UNREACHABLE, HWY_VISIT_TARGETS
- i16 Dot, AverageRound, RoundingShiftRight/RoundingShr
- InterleaveEvenBlocks/InterleaveOddBlocks, MinMagnitude/MaxMagnitude
- masked comparisons, promote, round, GetBiasedExponent
- MulByPow2/MulByFloorPow2, MulRound, MulLower/MulAddLower
- PositiveInfOrHighestValue/NegativeInfOrLowestValue
- RVV groundwork for runtime dispatch, enable tuples
- spin wait, NanoSleep, Counter2/4 barrier, Divisor64, perf_counters
Improvements:
- dpbf16 WidenMulPairwiseAdd Exp2, AVX10.2 float->int, AVX3 GetExponent
- header-only abort.h/cc, tests runnable with Bazel8
- HWY_BROKEN_*: allow individual override
- Lanes: 'optional constexpr', AllBits1
- MaskedEq/Ne, NEON SumOfMulQuadAccumulate, MaskedReduceMin/Max, MulEven
- Profiler: report concurrency stats, 1.36x less overhead
- RVV various ops via superoptimizer
- SetThreadName: support more systems
- SVE2 SatWidenMulPairwiseAccumulate, SSE2/SSSE3 U16 Min/Max
- TargetName: no longer returns unknown for other arch
- ThreadPool autotune, avoid WakeAll
- topology: add NUMA node, support Windows/Apple
Fixes:
- avoid wraparound for -ftrapv, topology for offline CPUs/RVV
- warnings from -Wmissing-declarations/prototypes
- AdvSIMD_HPFPCvt on OSX
- f32->bf16 rounding: avoid unspecified built-in cast
- MSAN, PPC InvariantTicksPerSecond on QEMU, HWY_RCAST_ALIGNED, IsNaN
- vqsort for ascending order, add 8-bit test
Thanks to all contributors, especially johnplatts and eustas!
1.2.0
- Add InterleaveEven/InterleaveOdd, BitShuffle, GatherIndexNOr
- Add IsNegative, IfNegativeThenElseZero, IfNegativeThenZeroElse
- Add NEON_BF16, HWY_VERSION_GE/LT, HWY_EXPORT_T/HWY_DYNAMIC_DISPATCH_T
- Add PromoteInRangeTo/ConvertInRangeTo/DemoteInRangeTo
- Add Rol/Ror, RotateLeft/RotateLeftSame/RotateRightSame
- Add SatWidenMulPairwiseAccumulate, SatWidenMulAccumFixedPoint
- Add stats.h, bit_set.h, IsEitherNaN
- Add UI8/UI32/UI64 MulHigh, I64 MulEven/MulOdd/Mul128
- Add WidenMulAccumulate, MulEvenAdd, MulOddAdd
- contrib/bit_pack: support 32/64-bit lanes
- contrib/math: Add Exp2, Hypot
- contrib/matvec: Add MatVecAdd
- contrib/sort: Add VQ/HeapSelect, partial sort
- contrib/topology: add affinity, detect topology/cache size/CPU name
- Enable runtime dispatch for NEON/RVV, bazel modules, abort handler
- Remove DASSERT for negative Gather indices
- Support opting out of GUnit dependency
- Use SPR/ZEN4 bf16 dot product
- Known GCC 13 RVV issue: parts of sort_test and bit_pack_test disabled
- Known Clang RVV/QEMU issue: incorrect rounding mode in upper/lower halves
1.1.0
- Add BitCastScalar, DispatchedTarget, Foreach
- Add Div/Mod and MaskedDiv/ModOr, SaturatedAbs, SaturatedNeg
- Add InterleaveWholeLower/Upper, Dup128VecFromValues
- Add IsInteger, IsIntegerLaneType, RemoveVolatile, RemoveCvRef
- Add MaskedAdd/Sub/Mul/Div/Gather/Min/Max/SatAdd/SatSubOr
- Add MaskFalse, IfNegativeThenNegOrUndefIfZero, PromoteEven/OddTo
- Add ReduceMin/Max, 8-bit reductions, f16 <-> f64 conversions
- Add Span, AlignedArray, matrix-vector mul
- Add SumsOf2/4, I8 SumsOf8, SumsOfAdjQuadAbsDiff, SumsOfShuffledQuadAbsDiff
- Add ThreadPool, hierarchical profiler
- Build: use bazel_platforms
- Enable clang16 Arm/PPC runtime dispatch, F16 for GCC AVX3_SPR
- Extend Dot to f32*bf16, FMA to integer
- Fix: RVV 8-bit overflow, UB in vqsort, big-endian bugs, PPC HTM
- Improved codegen in various ops, fp16/bf16 tests and conversions
- New targets: HWY_Z14, HWY_Z15
- Test: add foreign_arch builders, CodeQL
1.0.7
- Add LoadNOr, GatherIndexN, ScatterIndexN
- Add additional float<->int conversions
- Codegen improvements for 8-bit shift, PPC Compress/Expand
- Fixes for MSVC, PPC, RVV, WASM, GCC 13, GCC 8.2, i686, f16 type, QEMU 7.2
- Support CMake args in Debian packaging
1.0.6
- Add MaskedGatherIndex, MaskedScatterIndex, LoadN, StoreN
- Add SatWidenMulPairwiseAdd, SumOfMulQuadAccumulate, PromoteUpperLowerTo
- Add F64 for Wasm, F64 AbsDiff
- Add F16 support to AVX3_SPR, RVV tuple (both not yet enabled)
- Validate all D args in x86 function signatures
- License: now dual Apache2/BSD3
- Doc: new users, vcpkg install instructions, AVX10 plans
- Doc: advice on dynamic dispatch plus -march flags
- Build: avoid installing hwy_test if !HWY_ENABLE_TESTS
- Codegen: improved PPC9 Find*True, variable-length CopyBytes
- Fix: GCC 8.2, MSVC, ICC, PPC9, SVE, arm64 MSVC issues
- Fix: IfNegativeThenElse, MulFixedPoint15, Debian changelog format
- Tests: faster builds (split up), use release builds
1.0.5
- Add Insert/ExtractBlock, BroadcastBlock/Lane, NumBlocks
- Add integer Le/Ge and [Neg]MulAdd, extend DemoteTo/PromoteTo
- Add Leading/TrailingZeroCount, HighestSetBitIndex, ReverseBits
- Add MaskedLoadOr, tuple Get/Set/Create, ReduceSum, WidenMulPairwiseAdd
- Add [ZeroExtend]ResizeBitCast, BitwiseIfThenElse, Find[Known]LastTrue
- Add AESRoundInv, AESKeyGenAssist
- Add contrib/math Atan2/SinCos, contrib/unroller
- Add fp16/bf16 support (Armv8, SVE, RVV), HWY_DYNAMIC_POINTER
- Add OrderedTruncate2To, Per4LaneBlockShuffle, TwoTablesLookupLanes
- Add SlideUp/Down[Blocks/Lanes], Slide1Up/Down, ReverseLaneBytes
- Add SetBeforeFirst, SetAtOrBefore/AfterFirst, SetOnlyFirst
- Add 8-bit Reverse2/4/8, Shl/Shr, RotateRight, Reverse, Mul
- Add 8/16-bit DupEven/Odd, TableLookupLanes
- Add F64 ApproximateReciprocal[Sqrt], 32/64-bit SaturatedAdd/Sub
- Build: Support Bazel modules
- Codegen improvements
- Compiler: support Clang 15/16
- Doc: add Github pages, support policy, evaluation
- Doc: publish AVX-512 throttling/startup findings
- Release: add signing
- Test: add GCC to Github Actions
- VQSort: small N speedups: fix seeding, func ptr, 8-wide network
- VQSort: add BenchAllColdSort, VQSortStatic
- VQSort: fix subnormal/inf/NaN, support fp16, fix KV types
- Workarounds: RVV VXRM, x87 excess precision, missing intrinsics
1.0.4
- Add PPC8..10, SSE2, AVX3_ZEN4, NEON_WITHOUT_AES targets
- Add Expand, LoadExpand, integer AbsDiff, SumsOf8AbsDiff
- Improved Half/Twice support, codegen for Shift*Same
- Support Wasm in Godbolt
- Faster KV128 sorting
- Fix armv7 build config, CMake config mode
- Update RVV intrinsics for 1.0-draft
1.0.3
- Add RearrangeToOddPlusEven, Xor3, 8-bit CompressStore, HWY_ASSUME
- Add contrib/bit_pack for 8/16-bit lanes
- Add WASM_EMU256 target
- Documentation improvements
- Allow opting out of C++ stdlib usage for Compiler Explorer
- Update for new RVV intrinsics; faster WASM min/max and extmul/q15mul
- Fix UB, GCC atomic
1.0.2
- Add ExclusiveNeither, FindKnownFirstTrue, Ne128
- Add 16-bit SumOfLanes/ReorderWidenMulAccumulate/ReorderDemote2To
- Faster sort for low-entropy input, improved pivot selection
- Add GN build system, Highway FAQ, k32v32 type to vqsort
- CMake: Support find_package(GTest), add rvv-inl.h, add HWY_ENABLE_TESTS
- Fix MIPS and C++20 build, Apple LLVM 10.3 detection, EMU128 AllTrue on RVV
- Fix missing exec_prefix, RVV build, warnings, libatomic linking
- Work around GCC 10.4 issue, disabled RDCYCLE, armv7 with vfpv3
- Documentation/example improvements
- Support static dispatch to SVE2_128 and SVE_256
1.0.1
- Add Eq128, i64 Mul, unsigned->float ConvertTo
- Faster sort for few unique keys, more robust pivot selection
- Fix: floating-point generator for sort tests, Min/MaxOfLanes for i16
- Fix: avoid always_inline in debug, link atomic
- GCC warnings: string.h, maybe-uninitialized, ignored-attributes
- GCC warnings: preprocessor int overflow, spurious use-after-free/overflow
- Doc: <=HWY_AVX3, Full32/64/128, how to use generic-inl