
Significant performance drop in concurrent requests with 30B model vs 4B model on 2x4090 #14913

Answered by slaren
tylike asked this question in Q&A

Performance at small batch sizes with MoE models is not very good in the CUDA backend. There is a fast gemv implementation, but it only works with bs=1.

  MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=256,n_used=8,b=1,m=2048,n=1,k=7168):                14058 runs -    73.25 us/run - 234.88 MFLOP/run -   3.21 TFLOPS
  MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=256,n_used=8,b=1,m=2048,n=2,k=7168):                 2343 runs -   432.78 us/run - 469.76 MFLOP/run -   1.09 TFLOPS
  MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=256,n_used=8,b=1,m=2048,n=3,k=7168):                 1420 runs -   711.23 us/run - 704.64 MFLOP/run - 990.74 GFLOPS
  MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=256,n_used=8,b=1,m=2…
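
For context on reading the rows above: n is the batch size, and the TFLOPS column is simply FLOP-per-run divided by time-per-run, so the drop from n=1 to n=2 and n=3 is the fall from the fast gemv path to the general MUL_MAT_ID path. Output in this shape is what llama.cpp's test-backend-ops tool prints in perf mode (an assumption about how these numbers were collected, not stated in the answer). The sketch below is not part of the original answer; it just recomputes the throughput figures from the values copied out of the table.

```cpp
// Sketch (not from the original answer): recompute the throughput column of the
// MUL_MAT_ID rows above as FLOP-per-run divided by time-per-run.
// All constants are copied from the table; nothing here is measured.
#include <cstdio>

int main() {
    struct Row { int n; double us_per_run; double mflop_per_run; };
    const Row rows[] = {
        { 1,  73.25, 234.88 },  // fast gemv path (bs=1)
        { 2, 432.78, 469.76 },  // general path
        { 3, 711.23, 704.64 },  // general path
    };
    for (const Row & r : rows) {
        // FLOP/s = (MFLOP * 1e6) / (us * 1e-6); divide by 1e12 for TFLOPS
        const double tflops = r.mflop_per_run * 1e6 / (r.us_per_run * 1e-6) / 1e12;
        printf("n=%d: %.2f TFLOPS\n", r.n, tflops);
    }
    return 0;
}
```

Note that the work per run only doubles and triples with n, while the time per run grows by roughly 6x and 10x. That is the per-request slowdown the question observes once concurrent requests push the effective batch size above 1.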

Answer selected by tylike