
feat: perf opt part5 #52


Merged
merged 55 commits into dev-refactoring on Jul 22, 2025

Conversation

chraac (Owner) commented on Jul 19, 2025

Overview

This pull request implements the fifth part of our ongoing performance optimization efforts for llama.cpp. It continues our work to improve model inference speed and resource utilization.

Key Changes

SIMD & Matrix Operations

  • Enhanced SIMD optimizations for matrix multiplication operations
  • Improved cache utilization within core operators
  • Batched tensor destroy operations to speed up memory cleanup and reduce per-call overhead (a short sketch follows this list)
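One of the ideas behind the batched destroy path is to replace per-tensor release calls with a single call into the device. Below is a minimal sketch of that idea, assuming hypothetical names (npu_tensor_handle_t, kInvalidHandle, batch_free) rather than the actual interface defined in hexagon_npu.idl and host/tensor.hpp:

```cpp
// Sketch only: release many tensor handles with one device call instead of
// one call per tensor, so the per-call (RPC) overhead is paid once.
#include <cstddef>
#include <cstdint>
#include <vector>

using npu_tensor_handle_t = uint64_t;              // assumed handle type
constexpr npu_tensor_handle_t kInvalidHandle = 0;  // assumed invalid-handle constant

// batch_free stands in for the device entry point that frees an array of handles.
void release_all(std::vector<npu_tensor_handle_t> & handles,
                 void (*batch_free)(const npu_tensor_handle_t *, size_t)) {
    std::vector<npu_tensor_handle_t> live;
    live.reserve(handles.size());
    for (npu_tensor_handle_t h : handles) {
        if (h != kInvalidHandle) {  // skip tensors that were never created on the device
            live.push_back(h);
        }
    }
    if (!live.empty()) {
        batch_free(live.data(), live.size());  // one round trip for the whole batch
    }
    handles.clear();
}
```

The point is simply that one device call amortizes the per-call cost over every tensor being released instead of paying it once per tensor.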

Memory & Attention Mechanisms

  • Reduced memory allocations during context processing
  • Optimized attention mechanism computation for better throughput
  • Streamlined the inference pipeline by eliminating redundant operations

Performance Impact

Test setup: 8gen2
Test suite: test-backend-ops
Baseline: c5187054b
Optimized: 74f28f53f

Matrix Multiplication Performance Comparison

Small to Medium Batch Operations (n=1-5)

| Operation Type | Batch Size | Baseline GFLOPS | Optimized GFLOPS | Improvement |
| --- | --- | --- | --- | --- |
| q4_0 | n=1 | 15.15 | 16.35 | +7.9% |
| f16 | n=1 | 9.86 | 12.17 | +23.4% |
| f32 | n=1 | 8.11 | 8.37 | +3.2% |
| q4_0 | n=2 | 24.34 | 26.21 | +7.7% |
| f16 | n=2 | 9.81 | 21.39 | +118.1% |
| f32 | n=2 | 8.41 | 11.06 | +31.5% |
| q4_0 | n=3 | 30.56 | 32.91 | +7.7% |
| f16 | n=3 | 9.88 | 27.68 | +180.1% |
| f32 | n=3 | 8.14 | 14.41 | +77.0% |
| q4_0 | n=4 | 35.07 | 37.64 | +7.3% |
| f16 | n=4 | 10.41 | 31.47 | +202.2% |
| f32 | n=4 | 8.69 | 17.79 | +104.7% |
| q4_0 | n=5 | 38.39 | 40.98 | +6.7% |
| f16 | n=5 | 10.41 | 35.74 | +243.3% |
| f32 | n=5 | 8.20 | 20.60 | +151.2% |

Large Batch Operations (n=8-512)

| Operation Type | Batch Size | Baseline GFLOPS | Optimized GFLOPS | Improvement |
| --- | --- | --- | --- | --- |
| q4_0 | n=8 | 44.61 | 47.86 | +7.3% |
| f16 | n=8 | 9.95 | 43.20 | +334.2% |
| f32 | n=8 | 8.23 | 27.81 | +238.0% |
| q4_0 | n=512 | 60.21 | 64.30 | +6.8% |
| f16 | n=512 | 10.41 | 64.17 | +516.6% |
| f32 | n=512 | 8.26 | 55.54 | +572.5% |

Flash Attention Performance

| Configuration | Baseline GFLOPS | Optimized GFLOPS | Improvement |
| --- | --- | --- | --- |
| hsk=128,hsv=128,kv=4096 | 4.00 | 4.12 | +3.0% |
| hsk=128,hsv=128,kv=8192 | 4.01 | 4.12 | +2.7% |
| hsk=64,hsv=64,kv=16384 | 2.34 | 2.38 | +1.7% |

Performance Summary

| Metric | Value |
| --- | --- |
| Best Single Improvement | +572.5% (f32, n=512) |
| Average q4_0 Improvement | +7.2% |
| Average f16 Improvement | +244.6% |
| Average f32 Improvement | +164.3% |

Key Insights

  • Dramatic scaling improvements: performance gains increase significantly with batch size, particularly for f16/f32 operations.

  • Consistent quantized gains: q4_0 operations show steady 6-8% improvements across all scenarios.

  • Memory bandwidth bottleneck confirmed: in the n=1 case, q4_0 (16.35 GFLOPS) outperforms f16 (12.17 GFLOPS), indicating a memory bandwidth limitation. Even though q4_0 has to be dequantized to f16 in VTCM before matrix multiplication, its smaller memory footprint still makes it faster than native f16 operations (see the back-of-the-envelope check below).
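As a rough back-of-the-envelope check of that footprint argument: ggml's q4_0 stores 32 weights per block as a 2-byte f16 scale plus 16 bytes of 4-bit quants (18 bytes per block), so a weight row is roughly 3.6x smaller than its f16 equivalent. A tiny sketch of the arithmetic (the row length 4096 is just an example):

```cpp
// Bytes fetched per weight row: f16 vs ggml's q4_0 block layout.
#include <cstdio>

int main() {
    const int k = 4096;                    // example row length (number of weights)
    const int f16_bytes  = k * 2;          // 2 bytes per f16 weight        -> 8192
    const int q4_0_bytes = (k / 32) * 18;  // 32 weights/block, 18 B/block  -> 2304
    std::printf("f16: %d B, q4_0: %d B, ratio: %.2fx\n",
                f16_bytes, q4_0_bytes, (double) f16_bytes / q4_0_bytes);
    return 0;
}
```

Roughly 3.6x less data per row has to cross the memory bus for q4_0, which is consistent with it beating native f16 at n=1 despite the extra dequantization step.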

Overall Performance Summary

  • Best case improvement: +573% (f32, n=512)
  • Average improvement across all operations: ~15-25%
  • Quantized operations: Consistent 6-8% gains
  • Large batch operations: 238-573% improvements

Unit tests

Test setup: 8gen2
Test suite: test-backend-ops

  FLASH_ATTN_EXT(hsk=576,hsv=512,nh=4,nr23=[4,1],kv=512,nb=35,mask=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=q8_0,permute=[0,1,2,3]): [hexagon-npu][FLASH_ATTN_EXT]unsupported f32(f32,q4_0), ret: 0x0, supported: 0
[hexagon-npu][FLASH_ATTN_EXT][out]unsupported, dst: f32[512x16x35], src0: f32[576x35x16], src1: q4_0[576x512x4], src2: q4_0[512x512x4], supported/unsupported: 949/5591
not supported [hexagon-npu] 
  FLASH_ATTN_EXT(hsk=576,hsv=512,nh=4,nr23=[4,1],kv=512,nb=35,mask=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=q4_0,permute=[0,1,2,3]): [hexagon-npu]Unsupported op: SOFT_MAX
[hexagon-npu][SOFT_MAX][labels_normalized]unsupported, dst: f32[10x5x4x3], src0: f32[10x5x4x3], supported/unsupported: 949/5592
not supported [hexagon-npu] 
  CROSS_ENTROPY_LOSS(type=f32,ne=[10,5,4,3]): [hexagon-npu]Unsupported op: SOFT_MAX
[hexagon-npu][SOFT_MAX][labels_normalized]unsupported, dst: f32[30000], src0: f32[30000], supported/unsupported: 949/5593
not supported [hexagon-npu] 
  CROSS_ENTROPY_LOSS(type=f32,ne=[30000,1,1,1]): [hexagon-npu]Unsupported op: SOFT_MAX
[hexagon-npu][SOFT_MAX][labels_normalized]unsupported, dst: f32[10x5x4x3], src0: f32[10x5x4x3], supported/unsupported: 949/5594
not supported [hexagon-npu] 
  CROSS_ENTROPY_LOSS_BACK(type=f32,ne=[10,5,4,3]): [hexagon-npu]Unsupported op: SOFT_MAX
[hexagon-npu][SOFT_MAX][labels_normalized]unsupported, dst: f32[30000], src0: f32[30000], supported/unsupported: 949/5595
not supported [hexagon-npu] 
  CROSS_ENTROPY_LOSS_BACK(type=f32,ne=[30000,1,1,1]): [hexagon-npu]Unsupported op: OPT_STEP_ADAMW
[hexagon-npu][OPT_STEP_ADAMW][out]unsupported, dst: f32[10x5x4x3], supported/unsupported: 949/5596
not supported [hexagon-npu] 
  OPT_STEP_ADAMW(type=f32,ne=[10,5,4,3]): unload rpcmem lib successfully
not supported [hexagon-npu] 
  6535/6535 tests passed
  Backend hexagon-npu: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

Full log:
test-backend-ops-all.debug.hexagon.74f28f53f.7z

chraac added 30 commits July 16, 2025 21:54
…ector operations and alignment checks

# Conflicts:
#	ggml/src/ggml-qnn/npu/device/type_traits.cpp
… adding parallel processing for multiple vector pairs
…mance by adding parallel processing for multiple vector pairs"

This reverts commit 78cc24ed2285002ca29d6189fa61ba4ce24f8d16.
This reverts commit bb1840876692a11511d5ab7828b8a707402e30b9.
This reverts commit ab442fa9f763b3873c929936e4cb739cb1c83850.
@chraac chraac self-assigned this Jul 19, 2025
@chraac chraac added the enhancement New feature or request label Jul 19, 2025
chraac (Owner, Author) commented on Jul 19, 2025

Related to #34
Related to #51

Copilot AI left a comment

Pull Request Overview

This PR implements the fifth part of the performance optimization effort for llama.cpp, focusing on SIMD vectorization improvements, memory management enhancements, and optimized tensor operations for the Hexagon NPU backend.

  • Enhanced SIMD vector operations with improved memory alignment handling and batch processing
  • Optimized tensor destruction with batch operations to reduce overhead
  • Streamlined matrix multiplication operations with better cache utilization and aligned memory access patterns

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| ggml/src/ggml-qnn/npu/idl/hexagon_npu.idl | Added batch tensor deallocation interface and invalid handle constants |
| ggml/src/ggml-qnn/npu/host/tensor.hpp | Implemented batch tensor destruction and updated handle validation |
| ggml/src/ggml-qnn/npu/host/graph.hpp | Updated graph handle initialization with proper constants |
| ggml/src/ggml-qnn/npu/host/graph.cpp | Minor formatting improvements in debug logging |
| ggml/src/ggml-qnn/npu/host/buffer.cpp | Added performance tracking and batch tensor cleanup |
| ggml/src/ggml-qnn/npu/device/vec_ops.inl | New comprehensive SIMD vector operations implementation |
| ggml/src/ggml-qnn/npu/device/vec_ops.hpp | Refactored vector operations with template-based dot product functions |
| ggml/src/ggml-qnn/npu/device/vec_ops.cpp | Removed old implementation (moved to .inl file) |
| ggml/src/ggml-qnn/npu/device/util.hpp | Fixed macro name collision in performance tracking |
| ggml/src/ggml-qnn/npu/device/type_traits.hpp | Enhanced type system with aligned vector operations support |
| ggml/src/ggml-qnn/npu/device/type_traits.cpp | Optimized quantization operations and added copy functions |
| ggml/src/ggml-qnn/npu/device/tensor.hpp | Updated tensor handle validation |
| ggml/src/ggml-qnn/npu/device/op_rope.cpp | Replaced memcpy with vectorized copy operations |
| ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp | Major optimization with template-based matrix multiplication and improved caching |
| ggml/src/ggml-qnn/npu/device/op_impl.cpp | Refactored vector operations to use new template system |
| ggml/src/ggml-qnn/npu/device/op_flash_attn.cpp | Enhanced flash attention with template specialization and vectorized operations |
| ggml/src/ggml-qnn/npu/device/device.cpp | Improved handle management and added batch tensor deallocation |
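To illustrate the template-based refactor described for vec_ops.hpp and op_mul_mat.cpp, here is a simplified, hedged sketch (DotFunc, mul_mat_impl, and the scalar kernel are illustrative names only, and the real code works on HVX vectors rather than plain floats): the dot-product kernel is passed as a non-type template parameter, so each instantiation resolves the inner-loop call at compile time instead of going through a runtime function pointer.

```cpp
// Simplified illustration of compile-time kernel selection; not the PR's actual code.
#include <cstddef>

float dot_f32(const float * a, const float * b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

// DotFunc is baked into each instantiation, so the call below can be inlined.
template <auto DotFunc>
void mul_mat_impl(const float * src0, const float * src1, float * dst,
                  size_t rows0, size_t rows1, size_t k) {
    for (size_t r1 = 0; r1 < rows1; ++r1) {
        for (size_t r0 = 0; r0 < rows0; ++r0) {
            dst[r1 * rows0 + r0] = DotFunc(src0 + r0 * k, src1 + r1 * k, k);
        }
    }
}

// Usage: mul_mat_impl<dot_f32>(a, b, out, rows_a, rows_b, k);
```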
Comments suppressed due to low confidence (3)

ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp:31

  • The template parameter name '_ShouldCacheSrc0' uses a leading underscore which is reserved for implementation. Consider renaming to 'ShouldCacheSrc0'.
template <auto _DotFunc, bool _ShouldCacheSrc0>

ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp:17

  • The template parameter name '_TRet' uses a leading underscore which is reserved for implementation. Consider renaming to 'TRet'.
template <typename _TRet> struct convert_vector {};

ggml/src/ggml-qnn/npu/device/op_flash_attn.cpp:16

  • The template parameter name '_IsKvF16' uses a leading underscore which is reserved for implementation. Consider renaming to 'IsKvF16'.
template <bool _IsKvF16>
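For reference, identifiers that begin with an underscore followed by an uppercase letter are reserved to the implementation in C++, so the renames Copilot suggests would look roughly like this (the struct names other than convert_vector are illustrative placeholders):

```cpp
// Before: _DotFunc, _ShouldCacheSrc0, _TRet, _IsKvF16 sit in the reserved-identifier space.
// After: the same templates with implementation-safe names.
template <auto DotFunc, bool ShouldCacheSrc0>
struct mul_mat_kernel;      // illustrative declaration only

template <typename TRet>
struct convert_vector {};

template <bool IsKvF16>
struct flash_attn_config;   // illustrative declaration only
```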

Comment on lines +63 to +64
// https://github.com/UbiquitousLearning/mllm/blob/babf4410352ce8730824c87699c025a0d4ce3a6f/src/backends/qnn/LLaMAOpPackageHtp/LLaMAPackage/src/ops/LLaMAMul.cpp#L147
// or qualcomm sdk libs\qhl_hvx\src\qhblas_hvx\qhblas_hvx_aw_vector_add_ah.c
Copilot AI commented on Jul 19, 2025

[nitpick] The comment contains a very long URL that reduces code readability. Consider using a shorter reference or moving to documentation.

Suggested change
// https://github.com/UbiquitousLearning/mllm/blob/babf4410352ce8730824c87699c025a0d4ce3a6f/src/backends/qnn/LLaMAOpPackageHtp/LLaMAPackage/src/ops/LLaMAMul.cpp#L147
// or qualcomm sdk libs\qhl_hvx\src\qhblas_hvx\qhblas_hvx_aw_vector_add_ah.c
// Refer to the documentation for related examples:
// - LLaMAMul.cpp implementation details
// - Qualcomm SDK: qhblas_hvx_aw_vector_add_ah.c


curr1 = Q6_V_valign_VVR(curr1, prev1, (size_t) src1);

HVX_Vector_Dual curr0_pair = _ExpandFunc(curr0, kOneV);

Copilot AI commented on Jul 19, 2025

The ternary operator logic for selecting between curr0_pair.first and curr0_pair.second based on leftover comparison is unclear. Consider adding a comment explaining the selection criteria.

Suggested change
// Select curr0_pair.first if leftover1 equals leftover0, otherwise select curr0_pair.second.
// This ensures the correct alignment of the remaining elements based on the leftover sizes.


@chraac chraac merged commit 2cd429c into dev-refactoring Jul 22, 2025
1 check passed
Labels: enhancement (New feature or request)
Projects: Status: Done
1 participant