All successful responses reporting an output length of 0 #189

@Bslabe123

Description

What happened:

Seeing the following request lifecycle metrics report: output_len and normalized_time_per_output_token are empty and output_tokens_per_sec is 0, even though 283 requests succeeded and the per-token latency stats are populated:

```json
{
  "load_summary": {
    "count": 300,
    "schedule_accuracy": {
      "mean": 41.392056,
      "min": 0.00021657249,
      "max": 190.29817,
      "p90": 149.24591
    },
    "send_duration": 300.10364,
    "requested_rate": 1,
    "achieved_rate": 0.9996547
  },
  "successes": {
    "count": 283,
    "latency": {
      "request_latency": {
        "mean": 13.698023,
        "min": 0.06531374,
        "max": 180.07655,
        "p90": 14.04373
      },
      "normalized_time_per_output_token": {},
      "time_per_output_token": {
        "mean": 0.042900182,
        "min": 0.002089543,
        "max": 3.380381,
        "p90": 0.04745394
      },
      "time_to_first_token": {
        "mean": 0.31647572,
        "min": 0.02519054,
        "max": 1.2869523,
        "p90": 0.94861716
      },
      "inter_token_latency": {
        "mean": 0.031301748,
        "min": 0.0000011720113,
        "max": 179.12752,
        "p90": 0.016776484
      }
    },
    "throughput": {
      "input_tokens_per_sec": 243.59926,
      "output_tokens_per_sec": 0,
      "total_tokens_per_sec": 243.59926,
      "requests_per_sec": 0.90133476
    },
    "prompt_len": {
      "mean": 270.265,
      "min": 2,
      "max": 1927,
      "p90": 715.4
    },
    "output_len": {}
  },
  "failures": {
    "count": 17,
    "request_latency": {
      "mean": 4.065165,
      "min": 0.030937817,
      "max": 67.88596,
      "p90": 0.18836944
    }
  }
}
```
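Note that the populated time_per_output_token and inter_token_latency stats imply output tokens were observed during streaming, so the empty output_len looks like a counting/recording bug rather than genuinely empty completions. A minimal sanity check over the summary JSON that flags this signature (field names taken from the report above; the file path is hypothetical):

```python
import json

def check_report(path: str) -> None:
    """Flag the inconsistency above: per-token latency stats are present,
    meaning at least one output token was streamed, yet output_len is empty."""
    with open(path) as f:
        successes = json.load(f)["successes"]
    has_token_stats = bool(successes["latency"]["time_per_output_token"])
    has_output_len = bool(successes["output_len"])
    if has_token_stats and not has_output_len:
        print("inconsistent: token latencies recorded but output_len is empty")

check_report("lifecycle_metrics_summary.json")  # hypothetical file name
```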

What you expected to happen:

  • output_len should never be empty for a successful request; either better error handling is needed or there's a bug in parsing the completion API responses (see the sketch below).
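One plausible failure mode (an assumption on my part, not confirmed against the code): if the streaming parser reads a chat-style delta.content field instead of the completion API's choices[0].text, every chunk contributes an empty string, timing still works, and the final token count comes out 0. A hedged sketch of counting output tokens from a vLLM OpenAI-compatible /v1/completions SSE stream, using the model and tokenizer named in the config below:

```python
import json
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")

def stream_output_len(base_url: str, prompt: str) -> int:
    """Stream a completion and count output tokens from the streamed text."""
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={
            "model": "google/gemma-3-4b-it",
            "prompt": prompt,
            "max_tokens": 64,
            "stream": True,
        },
        stream=True,
    )
    text = ""
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        # Completion chunks carry text in choices[0].text, not in a
        # chat-style delta.content; reading the wrong field yields "".
        text += chunk["choices"][0].get("text", "")
    return len(tokenizer.encode(text, add_special_tokens=False))
```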

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:
config.yaml

```yaml
config:
  api:
    streaming: true
    type: completion
  data:
    input_distribution: null
    output_distribution: null
    type: shareGPT
  load:
    stages:
    - duration: 300
      rate: 1
    - duration: 300
      rate: 12.02
    - duration: 300
      rate: 19.47
    - duration: 300
      rate: 24.5
    - duration: 300
      rate: 27.91
    - duration: 300
      rate: 30.21
    - duration: 300
      rate: 31.76
    - duration: 300
      rate: 32.81
    - duration: 300
      rate: 33.52
    - duration: 300
      rate: 34
    type: constant
  metrics:
    prometheus:
      filters:
      - namespace="default"
      google_managed: true
      scrape_interval: 15
      url: null
    type: prometheus
  report:
    prometheus:
      per_stage: true
      summary: true
    request_lifecycle:
      per_request: true
      per_stage: true
      summary: true
  server:
    base_url: http://gemma-3-4b-it-vllm-service.default.svc.cluster.local:8000
    ignore_eos: true
    model_name: google/gemma-3-4b-it
    type: vllm
  storage:
    google_cloud_storage:
      bucket_name: slabe-dev-bucket
      path: default
  tokenizer:
    pretrained_model_name_or_path: google/gemma-3-4b-it
```
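To rule the server out, here is a minimal check against the configured endpoint, assuming it is reachable (e.g., via kubectl port-forward); ignore_eos is a vLLM-specific extension mirroring server.ignore_eos above. Each printed chunk should contain a non-empty choices[0].text:

```python
import requests

BASE_URL = "http://gemma-3-4b-it-vllm-service.default.svc.cluster.local:8000"

resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={
        "model": "google/gemma-3-4b-it",
        "prompt": "Hello",
        "max_tokens": 16,
        "stream": True,
        "ignore_eos": True,  # vLLM extension, mirrors server.ignore_eos
    },
    stream=True,
)
for line in resp.iter_lines(decode_unicode=True):
    if line:
        print(line)
```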
