When trying to spin up the recipe for inference on an H100 GPU, I hit the following GPU memory error indicating that the 128k max_model_len is too large for this model to fit into GPU memory:
```
(EngineCore_DP0 pid=121) ValueError: To serve at least one request with the models's max seq len (131072), (13.66 GiB KV cache is needed, which is larger than the available KV cache memory (8.85 GiB). Based on the available memory, the estimated maximum model length is 68000. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
```
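For reference, here is a minimal sketch of the two workarounds the error itself suggests, using vLLM's offline `LLM` entry point. The model name and exact values are placeholders, not the recipe's actual config:

```python
from vllm import LLM

# Minimal sketch, assuming a single H100. The model name is a
# placeholder; substitute the model the recipe serves.
llm = LLM(
    model="<model-from-recipe>",
    # Keep the context window below the 68000 estimate from the error.
    max_model_len=65536,
    # Or give the KV cache more headroom (vLLM's default is 0.90).
    gpu_memory_utilization=0.95,
)
```

The equivalent flags for `vllm serve` are `--max-model-len` and `--gpu-memory-utilization`. Neither restores the full 128k context on this hardware, so the real question is whether the recipe should ship with a smaller default or document the memory requirement.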