When trying to spin up the recipe for inference on an H100 GPU, I hit the following GPU memory error indicating that the 128k max_model_len is too large for this model to fit into GPU memory:
```
(EngineCore_DP0 pid=121) ValueError: To serve at least one request with the models's max seq len (131072), (13.66 GiB KV cache is needed, which is larger than the available KV cache memory (8.85 GiB). Based on the available memory, the estimated maximum model length is 68000. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
```
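For reference, here is a minimal sketch of the two workarounds the error itself suggests, using vLLM's offline `LLM` entry point. The model name and exact values are placeholders, not the recipe's actual config:

```python
from vllm import LLM

# Minimal sketch, assuming a single H100. The model name is a
# placeholder; substitute the model the recipe serves.
llm = LLM(
    model="<model-from-recipe>",
    # Keep the context window below the 68000 estimate from the error.
    max_model_len=65536,
    # Or give the KV cache more headroom (vLLM's default is 0.90).
    gpu_memory_utilization=0.95,
)
```

The equivalent flags for `vllm serve` are `--max-model-len` and `--gpu-memory-utilization`. Neither restores the full 128k context on this hardware, so the real question is whether the recipe should ship with a smaller default or document the memory requirement.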