Too large max_model_len for Gemma 27b on H100 #394

@damadei-google

Description

When trying to spin up the recipe for inference on an H100 GPU, I got the following error about GPU memory usage, indicating that the 128k max_model_len is too large for this model to fit in GPU memory:

```
(EngineCore_DP0 pid=121) ValueError: To serve at least one request with the model's max seq len (131072), 13.66 GiB KV cache is needed, which is larger than the available KV cache memory (8.85 GiB). Based on the available memory, the estimated maximum model length is 68000. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
```
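For context, the error itself points at two vLLM engine knobs: `--max-model-len` and `--gpu-memory-utilization`. A minimal sketch of a server launch with both applied, assuming the recipe serves via the vLLM CLI (the model id and the exact values are illustrative assumptions, not taken from the recipe):

```shell
# Illustrative workaround sketch -- flag names are vLLM's,
# but the model id and values below are assumptions.
# Cap the context length below the estimated maximum (68000) reported
# in the error, and give the KV cache a larger share of GPU memory.
vllm serve google/gemma-3-27b-it \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95
```

Either flag alone may be enough: lowering `max_model_len` shrinks the KV cache a single request can demand, while raising `gpu_memory_utilization` leaves more headroom for the KV cache after the weights are loaded.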
