@Yssx-g Yssx-g commented Nov 27, 2025

Added Llama Mode matmul for the decode stage. Implemented the handwritten-style next-matmul-llama.mlir and the corresponding pass, MatMulLlamaOptimize.cpp. Also added the test case matmul-vectorization-llama.mlir. The pass is registered in buddy-opt and can be invoked directly with the -matmul-vectorization-llama option.
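For context, here is a minimal sketch of the kind of input such a matmul-vectorization pass typically targets: a plain linalg.matmul on memrefs. The function name and shapes below are illustrative only and are not taken from the actual next-matmul-llama.mlir test case.

```mlir
// Illustrative decode-stage-style matmul: a single token row (M = 1)
// multiplied against a square weight matrix. Shapes and the function
// name are made up for demonstration.
func.func @decode_matmul(%a: memref<1x4096xf32>,
                         %b: memref<4096x4096xf32>,
                         %c: memref<1x4096xf32>) {
  linalg.matmul
    ins(%a, %b : memref<1x4096xf32>, memref<4096x4096xf32>)
    outs(%c : memref<1x4096xf32>)
  return
}
```

An input like this would then be lowered with something along the lines of `buddy-opt input.mlir -matmul-vectorization-llama` (the option name is the one given in this PR; the file name is a placeholder).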
Below is the performance comparison for the DeepSeek-R1 decode-stage use case, with the upper part showing BLIS mode and the lower part showing Llama mode.
[Screenshot: decode-stage performance comparison, BLIS mode (top) vs. Llama mode (bottom)]
That's all.


Yssx-g commented Dec 24, 2025

Below are the DeepSeek end-to-end test results. Multiple tests were conducted, and two screenshots are selected here for demonstration: the left side shows the new matmul implementation, while the right side shows BLIS matmul. The average time per token was reduced by 0.05 seconds.
[Screenshots: end-to-end per-token latency, new matmul (left) vs. BLIS matmul (right)]
