feat: perf opt part5 #52
Conversation
Commits (messages truncated by the GitHub UI):
- …r improved clarity and performance
- …ector operations and alignment checks (conflicts: ggml/src/ggml-qnn/npu/device/type_traits.cpp)
- … adding parallel processing for multiple vector pairs
- …mance by adding parallel processing for multiple vector pairs" (reverts commit 78cc24ed2285002ca29d6189fa61ba4ce24f8d16)
- …ta types and improved constexpr usage
- Reverts commit bb1840876692a11511d5ab7828b8a707402e30b9
- Reverts commit ab442fa9f763b3873c929936e4cb739cb1c83850
- …rsion and performance tracking
- …ved clarity and consistency
- …e memory initialization" (reverts commit e374326dc74d049e6603e393ade418d9ef2b83f3)
- … function for improved type handling
- …or each thread" (reverts commit 00cdd3f)
- …tants for better clarity
Pull Request Overview
This PR implements the fifth part of the performance optimization effort for llama.cpp, focusing on SIMD vectorization improvements, memory management enhancements, and optimized tensor operations for the Hexagon NPU backend.
- Enhanced SIMD vector operations with improved memory alignment handling and batch processing
- Optimized tensor destruction with batch operations to reduce overhead (a minimal sketch follows this list)
- Streamlined matrix multiplication operations with better cache utilization and aligned memory access patterns
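As a rough illustration of the batch-destruction idea, here is a minimal sketch of what the host-side cleanup path could look like. The handle type, the constant name, and the `npu_device_tensors_free` call are assumptions invented for the sketch; the actual interface is whatever this PR adds to `hexagon_npu.idl`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical handle type and invalid-handle constant, standing in for the
// ones added to hexagon_npu.idl.
using npu_tensor_handle_t = uint64_t;
constexpr npu_tensor_handle_t kInvalidTensorHandle = 0;

// Stand-in for the remote batch-free call; the real one would be a stub
// generated from the IDL.
static int npu_device_tensors_free(const npu_tensor_handle_t * /*handles*/, int count) {
    return count;  // pretend the device freed every tensor
}

// Free a whole set of tensors with one RPC round-trip instead of one per tensor.
void destroy_tensors(std::vector<npu_tensor_handle_t> & handles) {
    // Skip handles that were never successfully created on the device.
    handles.erase(std::remove(handles.begin(), handles.end(), kInvalidTensorHandle),
                  handles.end());
    if (handles.empty()) {
        return;
    }
    npu_device_tensors_free(handles.data(), static_cast<int>(handles.size()));
    handles.clear();
}
```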
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
Summary per file:
File | Description
---|---
ggml/src/ggml-qnn/npu/idl/hexagon_npu.idl | Added batch tensor deallocation interface and invalid handle constants
ggml/src/ggml-qnn/npu/host/tensor.hpp | Implemented batch tensor destruction and updated handle validation
ggml/src/ggml-qnn/npu/host/graph.hpp | Updated graph handle initialization with proper constants
ggml/src/ggml-qnn/npu/host/graph.cpp | Minor formatting improvements in debug logging
ggml/src/ggml-qnn/npu/host/buffer.cpp | Added performance tracking and batch tensor cleanup
ggml/src/ggml-qnn/npu/device/vec_ops.inl | New comprehensive SIMD vector operations implementation
ggml/src/ggml-qnn/npu/device/vec_ops.hpp | Refactored vector operations with template-based dot product functions
ggml/src/ggml-qnn/npu/device/vec_ops.cpp | Removed old implementation (moved to .inl file)
ggml/src/ggml-qnn/npu/device/util.hpp | Fixed macro name collision in performance tracking
ggml/src/ggml-qnn/npu/device/type_traits.hpp | Enhanced type system with aligned vector operations support
ggml/src/ggml-qnn/npu/device/type_traits.cpp | Optimized quantization operations and added copy functions
ggml/src/ggml-qnn/npu/device/tensor.hpp | Updated tensor handle validation
ggml/src/ggml-qnn/npu/device/op_rope.cpp | Replaced memcpy with vectorized copy operations
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp | Major optimization with template-based matrix multiplication and improved caching
ggml/src/ggml-qnn/npu/device/op_impl.cpp | Refactored vector operations to use new template system
ggml/src/ggml-qnn/npu/device/op_flash_attn.cpp | Enhanced flash attention with template specialization and vectorized operations
ggml/src/ggml-qnn/npu/device/device.cpp | Improved handle management and added batch tensor deallocation
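For the vec_ops.hpp / op_mul_mat.cpp rows above, the gist of "template-based" dispatch is that the dot-product routine becomes a non-type template parameter, so each instantiation inlines its kernel instead of calling through a function pointer. The following is a simplified, scalar stand-in with invented names, not the HVX code from this PR.

```cpp
#include <cstddef>

// Scalar stand-in for an HVX-vectorized dot product.
float dot_f32(const float * a, const float * b, size_t n) {
    float acc = 0.f;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

// The dot function is baked in at compile time, so the row loop is stamped
// out once per kernel with the call inlined.
template <auto DotFunc>
void mul_mat_row(const float * row, const float * cols, float * out,
                 size_t n_cols, size_t row_len) {
    for (size_t c = 0; c < n_cols; ++c) {
        out[c] = DotFunc(row, cols + c * row_len, row_len);
    }
}

// Usage: mul_mat_row<dot_f32>(row, cols, out, n_cols, row_len);
```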
Comments suppressed due to low confidence (3)
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp:31
- The template parameter name '_ShouldCacheSrc0' uses a leading underscore, which is reserved for the implementation. Consider renaming to 'ShouldCacheSrc0'.
template <auto _DotFunc, bool _ShouldCacheSrc0>
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp:17
- The template parameter name '_TRet' uses a leading underscore, which is reserved for the implementation. Consider renaming to 'TRet'.
template <typename _TRet> struct convert_vector {};
ggml/src/ggml-qnn/npu/device/op_flash_attn.cpp:16
- The template parameter name '_IsKvF16' uses a leading underscore, which is reserved for the implementation. Consider renaming to 'IsKvF16'.
template <bool _IsKvF16>
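For context on the three comments above: identifiers that begin with an underscore followed by an uppercase letter are reserved to the C++ implementation in any scope, so the fix is a straight rename. A simplified before/after sketch (the struct names other than convert_vector are placeholders; only the parameter names matter):

```cpp
// Before: reserved identifiers used as template parameter names.
// template <typename _TRet> struct convert_vector {};
// template <auto _DotFunc, bool _ShouldCacheSrc0> struct mul_mat_impl {};
// template <bool _IsKvF16> struct flash_attn_impl {};

// After: same declarations without the leading underscore.
template <typename TRet> struct convert_vector {};
template <auto DotFunc, bool ShouldCacheSrc0> struct mul_mat_impl {};
template <bool IsKvF16> struct flash_attn_impl {};
```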
// https://github.com/UbiquitousLearning/mllm/blob/babf4410352ce8730824c87699c025a0d4ce3a6f/src/backends/qnn/LLaMAOpPackageHtp/LLaMAPackage/src/ops/LLaMAMul.cpp#L147
// or qualcomm sdk libs\qhl_hvx\src\qhblas_hvx\qhblas_hvx_aw_vector_add_ah.c
[nitpick] The comment contains a very long URL that reduces code readability. Consider using a shorter reference or moving to documentation.
Suggested change:
// Refer to the documentation for related examples:
// - LLaMAMul.cpp implementation details
// - Qualcomm SDK: qhblas_hvx_aw_vector_add_ah.c
curr1 = Q6_V_valign_VVR(curr1, prev1, (size_t) src1);
HVX_Vector_Dual curr0_pair = _ExpandFunc(curr0, kOneV);
The ternary operator logic for selecting between curr0_pair.first and curr0_pair.second based on leftover comparison is unclear. Consider adding a comment explaining the selection criteria.
Suggested change:
// Select curr0_pair.first if leftover1 equals leftover0, otherwise select curr0_pair.second.
// This ensures the correct alignment of the remaining elements based on the leftover sizes.
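Restated as code, the selection the suggested comment describes could look roughly like this. The names come from the snippet above, but the helper function and the plain std::pair type are a paraphrase for illustration, not the actual HVX implementation:

```cpp
#include <cstddef>
#include <utility>

// curr0_pair holds the two expanded halves of the src0 vector; which half
// overlaps the tail depends on whether src0 and src1 have the same residual
// alignment.
template <typename Vec>
Vec select_leftover_half(const std::pair<Vec, Vec> & curr0_pair,
                         size_t leftover0, size_t leftover1) {
    // Select curr0_pair.first if leftover1 equals leftover0, otherwise
    // curr0_pair.second, so the remaining src0 elements line up with the
    // remaining src1 elements.
    return (leftover1 == leftover0) ? curr0_pair.first : curr0_pair.second;
}
```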
Overview
This pull request implements the fifth part of our ongoing performance optimization efforts for llama.cpp. It continues our work to improve model inference speed and resource utilization.
Key Changes
SIMD & Matrix Operations
Memory & Attention Mechanisms
Performance Impact
- Test setup: 8gen2
- Test suite: test-backend-ops
- Baseline: c5187054b
- Optimized: 74f28f53f
Matrix Multiplication Performance Comparison
Small to Medium Batch Operations (n=1-5)
Large Batch Operations (n=8-512)
Flash Attention Performance
Performance Summary
Key Insights
- Dramatic scaling improvements: performance gains increase significantly with batch size, particularly for f16/f32 operations
- Consistent quantized gains: q4_0 operations show steady 6-8% improvements across all scenarios
- Memory bandwidth bottleneck confirmed: in the n=1 case, q4_0 (16.35 GFLOPS) outperforms f16 (12.17 GFLOPS), indicating a memory bandwidth limitation. Despite q4_0 requiring dequantization to f16 in VTCM before matrix multiplication, its smaller memory footprint enables better performance than native f16 operations (a rough byte-count check is sketched below)
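To make the footprint argument concrete, here is a back-of-the-envelope byte count, assuming the standard ggml q4_0 layout (blocks of 32 weights, each a 2-byte f16 scale plus 16 bytes of packed 4-bit quants). It only illustrates the bandwidth ratio; it is not a model of the measured GFLOPS numbers above.

```cpp
#include <cstdio>

int main() {
    // q4_0: 18 bytes per 32 weights = 4.5 bits/weight (assuming the standard ggml block layout).
    const double q4_0_bytes_per_weight = 18.0 / 32.0;  // 0.5625
    const double f16_bytes_per_weight  = 2.0;

    // For n=1 (a GEMV) every weight is read once and barely reused, so runtime
    // is roughly proportional to the bytes streamed from memory.
    std::printf("q4_0 streams %.2fx less weight data than f16\n",
                f16_bytes_per_weight / q4_0_bytes_per_weight);  // ~3.56x
    return 0;
}
```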
Overall Performance Summary
Unit tests
- Test setup: 8gen2
- Test suite: test-backend-ops
- Full log: test-backend-ops-all.debug.hexagon.74f28f53f.7z