
feat: perf opt part5 #52


Merged
merged 55 commits into dev-refactoring on Jul 22, 2025

Conversation

chraac (Owner) commented on Jul 19, 2025

Overview

This pull request implements the fifth part of our ongoing performance optimization efforts for llama.cpp. It continues our work to improve model inference speed and resource utilization.

Key Changes

SIMD & Matrix Operations

  • Enhanced SIMD optimizations for matrix multiplication operations
  • Improved cache utilization within core operators
  • Batched tensor destroy operations to speed up memory cleanup and reduce per-call overhead (a short sketch follows this list)
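One of the ideas behind the batched destroy path is to replace per-tensor release calls with a single call into the device. Below is a minimal sketch of that idea, assuming hypothetical names (npu_tensor_handle_t, kInvalidHandle, batch_free) rather than the actual interface defined in hexagon_npu.idl and host/tensor.hpp:

```cpp
// Sketch only: release many tensor handles with one device call instead of
// one call per tensor, so the per-call (RPC) overhead is paid once.
#include <cstddef>
#include <cstdint>
#include <vector>

using npu_tensor_handle_t = uint64_t;              // assumed handle type
constexpr npu_tensor_handle_t kInvalidHandle = 0;  // assumed invalid-handle constant

// batch_free stands in for the device entry point that frees an array of handles.
void release_all(std::vector<npu_tensor_handle_t> & handles,
                 void (*batch_free)(const npu_tensor_handle_t *, size_t)) {
    std::vector<npu_tensor_handle_t> live;
    live.reserve(handles.size());
    for (npu_tensor_handle_t h : handles) {
        if (h != kInvalidHandle) {  // skip tensors that were never created on the device
            live.push_back(h);
        }
    }
    if (!live.empty()) {
        batch_free(live.data(), live.size());  // one round trip for the whole batch
    }
    handles.clear();
}
```

The point is simply that one device call amortizes the per-call cost over every tensor being released instead of paying it once per tensor.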

Memory & Attention Mechanisms

  • Reduced memory allocations during context processing
  • Optimized attention mechanism computation for better throughput
  • Streamlined the inference pipeline by eliminating redundant operations

Performance Impact

Test setup: 8gen2
Test suite: test-backend-ops
Baseline: c5187054b
Optimized: 74f28f53f

Matrix Multiplication Performance Comparison

Small to Medium Batch Operations (n=1-5)

| Operation Type | Batch Size | Baseline GFLOPS | Optimized GFLOPS | Improvement |
| --- | --- | --- | --- | --- |
| q4_0 | n=1 | 15.15 | 16.35 | +7.9% |
| f16 | n=1 | 9.86 | 12.17 | +23.4% |
| f32 | n=1 | 8.11 | 8.37 | +3.2% |
| q4_0 | n=2 | 24.34 | 26.21 | +7.7% |
| f16 | n=2 | 9.81 | 21.39 | +118.1% |
| f32 | n=2 | 8.41 | 11.06 | +31.5% |
| q4_0 | n=3 | 30.56 | 32.91 | +7.7% |
| f16 | n=3 | 9.88 | 27.68 | +180.1% |
| f32 | n=3 | 8.14 | 14.41 | +77.0% |
| q4_0 | n=4 | 35.07 | 37.64 | +7.3% |
| f16 | n=4 | 10.41 | 31.47 | +202.2% |
| f32 | n=4 | 8.69 | 17.79 | +104.7% |
| q4_0 | n=5 | 38.39 | 40.98 | +6.7% |
| f16 | n=5 | 10.41 | 35.74 | +243.3% |
| f32 | n=5 | 8.20 | 20.60 | +151.2% |

Large Batch Operations (n=8-512)

| Operation Type | Batch Size | Baseline GFLOPS | Optimized GFLOPS | Improvement |
| --- | --- | --- | --- | --- |
| q4_0 | n=8 | 44.61 | 47.86 | +7.3% |
| f16 | n=8 | 9.95 | 43.20 | +334.2% |
| f32 | n=8 | 8.23 | 27.81 | +238.0% |
| q4_0 | n=512 | 60.21 | 64.30 | +6.8% |
| f16 | n=512 | 10.41 | 64.17 | +516.6% |
| f32 | n=512 | 8.26 | 55.54 | +572.5% |

Flash Attention Performance

| Configuration | Baseline GFLOPS | Optimized GFLOPS | Improvement |
| --- | --- | --- | --- |
| hsk=128,hsv=128,kv=4096 | 4.00 | 4.12 | +3.0% |
| hsk=128,hsv=128,kv=8192 | 4.01 | 4.12 | +2.7% |
| hsk=64,hsv=64,kv=16384 | 2.34 | 2.38 | +1.7% |

Performance Summary

| Metric | Value |
| --- | --- |
| Best Single Improvement | +572.5% (f32, n=512) |
| Average q4_0 Improvement | +7.2% |
| Average f16 Improvement | +244.6% |
| Average f32 Improvement | +164.3% |

Key Insights

  • Dramatic scaling improvements: performance gains increase significantly with batch size, particularly for f16/f32 operations.

  • Consistent quantized gains: q4_0 operations show steady 6-8% improvements across all scenarios.

  • Memory bandwidth bottleneck confirmed: in the n=1 case, q4_0 (16.35 GFLOPS) outperforms f16 (12.17 GFLOPS), indicating a memory bandwidth limitation. Even though q4_0 has to be dequantized to f16 in VTCM before matrix multiplication, its smaller memory footprint still makes it faster than native f16 operations (see the back-of-the-envelope check below).
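As a rough back-of-the-envelope check of that footprint argument: ggml's q4_0 stores 32 weights per block as a 2-byte f16 scale plus 16 bytes of 4-bit quants (18 bytes per block), so a weight row is roughly 3.6x smaller than its f16 equivalent. A tiny sketch of the arithmetic (the row length 4096 is just an example):

```cpp
// Bytes fetched per weight row: f16 vs ggml's q4_0 block layout.
#include <cstdio>

int main() {
    const int k = 4096;                    // example row length (number of weights)
    const int f16_bytes  = k * 2;          // 2 bytes per f16 weight        -> 8192
    const int q4_0_bytes = (k / 32) * 18;  // 32 weights/block, 18 B/block  -> 2304
    std::printf("f16: %d B, q4_0: %d B, ratio: %.2fx\n",
                f16_bytes, q4_0_bytes, (double) f16_bytes / q4_0_bytes);
    return 0;
}
```

Roughly 3.6x less data per row has to cross the memory bus for q4_0, which is consistent with it beating native f16 at n=1 despite the extra dequantization step.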

Overall Performance Summary

  • Best case improvement: +573% (f32, n=512)
  • Average improvement across all operations: ~15-25%
  • Quantized operations: Consistent 6-8% gains
  • Large batch operations: 238-573% improvements

Unit tests

Test setup: 8gen2
Test suite: test-backend-ops

  FLASH_ATTN_EXT(hsk=576,hsv=512,nh=4,nr23=[4,1],kv=512,nb=35,mask=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=q8_0,permute=[0,1,2,3]): [hexagon-npu][FLASH_ATTN_EXT]unsupported f32(f32,q4_0), ret: 0x0, supported: 0
[hexagon-npu][FLASH_ATTN_EXT][out]unsupported, dst: f32[512x16x35], src0: f32[576x35x16], src1: q4_0[576x512x4], src2: q4_0[512x512x4], supported/unsupported: 949/5591
not supported [hexagon-npu] 
  FLASH_ATTN_EXT(hsk=576,hsv=512,nh=4,nr23=[4,1],kv=512,nb=35,mask=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=q4_0,permute=[0,1,2,3]): [hexagon-npu]Unsupported op: SOFT_MAX
[hexagon-npu][SOFT_MAX][labels_normalized]unsupported, dst: f32[10x5x4x3], src0: f32[10x5x4x3], supported/unsupported: 949/5592
not supported [hexagon-npu] 
  CROSS_ENTROPY_LOSS(type=f32,ne=[10,5,4,3]): [hexagon-npu]Unsupported op: SOFT_MAX
[hexagon-npu][SOFT_MAX][labels_normalized]unsupported, dst: f32[30000], src0: f32[30000], supported/unsupported: 949/5593
not supported [hexagon-npu] 
  CROSS_ENTROPY_LOSS(type=f32,ne=[30000,1,1,1]): [hexagon-npu]Unsupported op: SOFT_MAX
[hexagon-npu][SOFT_MAX][labels_normalized]unsupported, dst: f32[10x5x4x3], src0: f32[10x5x4x3], supported/unsupported: 949/5594
not supported [hexagon-npu] 
  CROSS_ENTROPY_LOSS_BACK(type=f32,ne=[10,5,4,3]): [hexagon-npu]Unsupported op: SOFT_MAX
[hexagon-npu][SOFT_MAX][labels_normalized]unsupported, dst: f32[30000], src0: f32[30000], supported/unsupported: 949/5595
not supported [hexagon-npu] 
  CROSS_ENTROPY_LOSS_BACK(type=f32,ne=[30000,1,1,1]): [hexagon-npu]Unsupported op: OPT_STEP_ADAMW
[hexagon-npu][OPT_STEP_ADAMW][out]unsupported, dst: f32[10x5x4x3], supported/unsupported: 949/5596
not supported [hexagon-npu] 
  OPT_STEP_ADAMW(type=f32,ne=[10,5,4,3]): unload rpcmem lib successfully
not supported [hexagon-npu] 
  6535/6535 tests passed
  Backend hexagon-npu: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

Full log:
test-backend-ops-all.debug.hexagon.74f28f53f.7z

chraac added 30 commits July 16, 2025 21:54
…ector operations and alignment checks

# Conflicts:
#	ggml/src/ggml-qnn/npu/device/type_traits.cpp
… adding parallel processing for multiple vector pairs
…mance by adding parallel processing for multiple vector pairs"

This reverts commit 78cc24ed2285002ca29d6189fa61ba4ce24f8d16.
This reverts commit bb1840876692a11511d5ab7828b8a707402e30b9.
This reverts commit ab442fa9f763b3873c929936e4cb739cb1c83850.
@chraac chraac self-assigned this Jul 19, 2025
@chraac chraac added the enhancement New feature or request label Jul 19, 2025
chraac (Owner, Author) commented on Jul 19, 2025

Related to #34
Related to #51

Copilot AI left a comment

Pull Request Overview

This PR implements the fifth part of the performance optimization effort for llama.cpp, focusing on SIMD vectorization improvements, memory management enhancements, and optimized tensor operations for the Hexagon NPU backend.

  • Enhanced SIMD vector operations with improved memory alignment handling and batch processing
  • Optimized tensor destruction with batch operations to reduce overhead
  • Streamlined matrix multiplication operations with better cache utilization and aligned memory access patterns

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| ggml/src/ggml-qnn/npu/idl/hexagon_npu.idl | Added batch tensor deallocation interface and invalid handle constants |
| ggml/src/ggml-qnn/npu/host/tensor.hpp | Implemented batch tensor destruction and updated handle validation |
| ggml/src/ggml-qnn/npu/host/graph.hpp | Updated graph handle initialization with proper constants |
| ggml/src/ggml-qnn/npu/host/graph.cpp | Minor formatting improvements in debug logging |
| ggml/src/ggml-qnn/npu/host/buffer.cpp | Added performance tracking and batch tensor cleanup |
| ggml/src/ggml-qnn/npu/device/vec_ops.inl | New comprehensive SIMD vector operations implementation |
| ggml/src/ggml-qnn/npu/device/vec_ops.hpp | Refactored vector operations with template-based dot product functions |
| ggml/src/ggml-qnn/npu/device/vec_ops.cpp | Removed old implementation (moved to .inl file) |
| ggml/src/ggml-qnn/npu/device/util.hpp | Fixed macro name collision in performance tracking |
| ggml/src/ggml-qnn/npu/device/type_traits.hpp | Enhanced type system with aligned vector operations support |
| ggml/src/ggml-qnn/npu/device/type_traits.cpp | Optimized quantization operations and added copy functions |
| ggml/src/ggml-qnn/npu/device/tensor.hpp | Updated tensor handle validation |
| ggml/src/ggml-qnn/npu/device/op_rope.cpp | Replaced memcpy with vectorized copy operations |
| ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp | Major optimization with template-based matrix multiplication and improved caching |
| ggml/src/ggml-qnn/npu/device/op_impl.cpp | Refactored vector operations to use new template system |
| ggml/src/ggml-qnn/npu/device/op_flash_attn.cpp | Enhanced flash attention with template specialization and vectorized operations |
| ggml/src/ggml-qnn/npu/device/device.cpp | Improved handle management and added batch tensor deallocation |
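To illustrate the template-based refactor described for vec_ops.hpp and op_mul_mat.cpp, here is a simplified, hedged sketch (DotFunc, mul_mat_impl, and the scalar kernel are illustrative names only, and the real code works on HVX vectors rather than plain floats): the dot-product kernel is passed as a non-type template parameter, so each instantiation resolves the inner-loop call at compile time instead of going through a runtime function pointer.

```cpp
// Simplified illustration of compile-time kernel selection; not the PR's actual code.
#include <cstddef>

float dot_f32(const float * a, const float * b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

// DotFunc is baked into each instantiation, so the call below can be inlined.
template <auto DotFunc>
void mul_mat_impl(const float * src0, const float * src1, float * dst,
                  size_t rows0, size_t rows1, size_t k) {
    for (size_t r1 = 0; r1 < rows1; ++r1) {
        for (size_t r0 = 0; r0 < rows0; ++r0) {
            dst[r1 * rows0 + r0] = DotFunc(src0 + r0 * k, src1 + r1 * k, k);
        }
    }
}

// Usage: mul_mat_impl<dot_f32>(a, b, out, rows_a, rows_b, k);
```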
Comments suppressed due to low confidence (3)

ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp:31

  • The template parameter name '_ShouldCacheSrc0' uses a leading underscore which is reserved for implementation. Consider renaming to 'ShouldCacheSrc0'.
template <auto _DotFunc, bool _ShouldCacheSrc0>

ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp:17

  • The template parameter name '_TRet' uses a leading underscore which is reserved for implementation. Consider renaming to 'TRet'.
template <typename _TRet> struct convert_vector {};

ggml/src/ggml-qnn/npu/device/op_flash_attn.cpp:16

  • The template parameter name '_IsKvF16' uses a leading underscore which is reserved for implementation. Consider renaming to 'IsKvF16'.
template <bool _IsKvF16>
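For reference, identifiers that begin with an underscore followed by an uppercase letter are reserved to the implementation in C++, so the renames Copilot suggests would look roughly like this (the struct names other than convert_vector are illustrative placeholders):

```cpp
// Before: _DotFunc, _ShouldCacheSrc0, _TRet, _IsKvF16 sit in the reserved-identifier space.
// After: the same templates with implementation-safe names.
template <auto DotFunc, bool ShouldCacheSrc0>
struct mul_mat_kernel;      // illustrative declaration only

template <typename TRet>
struct convert_vector {};

template <bool IsKvF16>
struct flash_attn_config;   // illustrative declaration only
```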

Comment on lines +63 to +64
// https://github.com/UbiquitousLearning/mllm/blob/babf4410352ce8730824c87699c025a0d4ce3a6f/src/backends/qnn/LLaMAOpPackageHtp/LLaMAPackage/src/ops/LLaMAMul.cpp#L147
// or qualcomm sdk libs\qhl_hvx\src\qhblas_hvx\qhblas_hvx_aw_vector_add_ah.c
Copilot AI commented on Jul 19, 2025

[nitpick] The comment contains a very long URL that reduces code readability. Consider using a shorter reference or moving to documentation.

Suggested change
// https://github.com/UbiquitousLearning/mllm/blob/babf4410352ce8730824c87699c025a0d4ce3a6f/src/backends/qnn/LLaMAOpPackageHtp/LLaMAPackage/src/ops/LLaMAMul.cpp#L147
// or qualcomm sdk libs\qhl_hvx\src\qhblas_hvx\qhblas_hvx_aw_vector_add_ah.c
// Refer to the documentation for related examples:
// - LLaMAMul.cpp implementation details
// - Qualcomm SDK: qhblas_hvx_aw_vector_add_ah.c


curr1 = Q6_V_valign_VVR(curr1, prev1, (size_t) src1);

HVX_Vector_Dual curr0_pair = _ExpandFunc(curr0, kOneV);

Copilot AI commented on Jul 19, 2025

The ternary operator logic for selecting between curr0_pair.first and curr0_pair.second based on leftover comparison is unclear. Consider adding a comment explaining the selection criteria.

Suggested change
// Select curr0_pair.first if leftover1 equals leftover0, otherwise select curr0_pair.second.
// This ensures the correct alignment of the remaining elements based on the leftover sizes.


@chraac chraac merged commit 2cd429c into dev-refactoring Jul 22, 2025
1 check passed
Labels: enhancement (New feature or request)
Projects: Status: Done
1 participant