Releases: IBM/fmwork

v1.0.9

17 Oct 14:41
f241933

Bug Fix

infer/vllm/prometheus_metrics.py

  • Fixed an issue with post-processing of results when Prometheus metrics are enabled.
    • Removed extraneous debug print statements from the get_cpu_metrics, get_memory_metrics, and get_arguments functions to ensure the correct data flow after metric collection.

v1.0.8

01 Oct 14:28
c418191

infer/vllm/client

  • Added UTC timestamp logging at benchmark start and end (client_time_start, client_time_end)

infer/vllm/process

  • Added --enable_prom_metrics flag to collect CPU and memory metrics from Prometheus/Thanos
  • Added new output fields: req_thp, num_prompts, server_id, client_mode, prom_metrics
  • Added support for multi-client benchmark mode (detects client.log.* files)
  • Added extraction of warmup time from server.log in server mode
  • Improved fallback logic: average input/output sizes are now calculated from actual token counts when not explicitly configured
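
These notes don't show the processed output schema; as a rough illustration only, the new fields might appear as in the following Python-style dict (placeholder values, not actual fmwork output):

    # Illustration only -- field names are from the notes above, values are placeholders.
    new_fields = {
        "req_thp": 0.0,           # likely request throughput (placeholder value)
        "num_prompts": 10,
        "server_id": "server-0",  # placeholder
        "client_mode": "multi",   # placeholder
        "prom_metrics": {},       # populated when --enable_prom_metrics is set
    }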

infer/vllm/prometheus_metrics.py (new file)

  • Queries CPU and memory metrics from Prometheus/Thanos
  • Supports avg_over_time and max_over_time query functions
  • Requires environment variables: THANOS_API_TOKEN, THANOS_API_URL
  • Uses 5-minute step resolution for queries
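
For reference, a minimal sketch of a comparable query using prometheus-api-client is shown below; the metric name, labels, and time window are placeholders, not fmwork's actual queries.

    # Minimal sketch (not the fmwork implementation) of querying Thanos with
    # prometheus-api-client, using the environment variables and 5-minute step above.
    import os
    from datetime import datetime, timedelta, timezone
    from prometheus_api_client import PrometheusConnect

    prom = PrometheusConnect(
        url=os.environ["THANOS_API_URL"],
        headers={"Authorization": f"Bearer {os.environ['THANOS_API_TOKEN']}"},
        disable_ssl=True,
    )

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)  # placeholder benchmark window

    # avg_over_time (or max_over_time) over the window; metric and labels are hypothetical
    query = 'avg_over_time(container_cpu_usage_seconds_total{namespace="my-ns"}[5m])'
    result = prom.custom_query_range(query, start_time=start, end_time=end, step="5m")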

infer/vllm/runner

  • Changed timestamp generation to UTC format

Dependencies

  • Added prometheus-api-client

v1.0.7

17 Sep 23:43

Added support for multi-client runs for server-mode benchmarking.

Instead of providing a specific shape combo to the client section of runner -- e.g.,

...
    --
client
    --env PYTHONUNBUFFERED=1
    --dataset-name random
    --random-input-len 896
    --random-output-len 128
    --max-concurrency 1
    --num-prompts 10

Now we can provide a --multi parameter:

...
    --
client
    --env PYTHONUNBUFFERED=1
    --multi 896/128/1/10,1920/128/2/20,3968/128/1/10

--multi receives a comma-separated list of shape combos in the format:

input size / output size / batch size / num prompts

In the example above, three combos are specified:

  • 896/128/1/10 - input size = 896, output size = 128, batch size 1, num prompts = 10
  • 1920/128/2/20 - input size = 1920, output size = 128, batch size 2, num prompts = 20
  • 3968/128/1/10 - input size = 3968, output size = 128, batch size 1, num prompts = 10

If --multi is provided, the client script iterates over the combos and runs one vllm bench serve per combo. Each instance writes its output to its own file, client.log.${instance}. In the example above, there would be three instances and therefore three files: client.log.0, client.log.1, and client.log.2.
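
As a rough illustration of the format (not the actual client code), the combos from the example above could be parsed like this:

    # Hypothetical parsing of the --multi value; the real client script may differ.
    multi = "896/128/1/10,1920/128/2/20,3968/128/1/10"
    for instance, combo in enumerate(multi.split(",")):
        input_size, output_size, batch_size, num_prompts = map(int, combo.split("/"))
        # each instance writes its results to client.log.<instance>
        print(instance, input_size, output_size, batch_size, num_prompts)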

Pod / execution output:

waiting for server at PID 56 ...
done, server is ready!
...
Wed Sep 17 23:29:59 UTC 2025 -- starting 0
Wed Sep 17 23:29:59 UTC 2025 -- starting 1
Wed Sep 17 23:29:59 UTC 2025 -- clients started; waiting for completion ...
Wed Sep 17 23:29:59 UTC 2025 -- starting 2
Wed Sep 17 23:34:27 UTC 2025 -- finished 0
Wed Sep 17 23:34:49 UTC 2025 -- finished 1
Wed Sep 17 23:35:12 UTC 2025 -- finished 2
Wed Sep 17 23:35:12 UTC 2025 -- all done!
...

Each client.log.${instance} file will have the usual vllm bench output, including perf metrics.

v1.0.6

17 Sep 13:52
ef20a3b

infer/vllm/process

  • Added Error Type Support

    • REQ: assert req_index is not None
    • CGF: Failed to compile graphs: compile_graph failed
  • Model Configuration Enhancements

    • Added automatic precision parsing from model names (see the sketch after this list)
    • Introduced context_length field
  • Server Mode Improvements

    • Server logs are now appended to runner.log in server mode for better visibility
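
The notes don't show how precision is derived from the model name; a hypothetical sketch, assuming precision tags such as fp16 or fp8 appear as substrings of the name, might look like:

    # Hypothetical illustration only; the actual parsing in process may differ.
    import re

    def parse_precision(model_name, default="fp16"):
        match = re.search(r"(fp8|fp16|bf16|int8|int4)", model_name.lower())
        return match.group(1) if match else default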

v1.0.5

08 Sep 19:41
625332e

infer/vllm/process

  • Improved Error and Status Reporting

    • The output now includes status and error_msg fields, providing clarity on the run's outcome. Status can be OK (successful), PART_<err> (partially completed with a specific error), or ERR_<err> (failed with a specific error). A new function was also added to detect several predefined error patterns (see the sketch after this list):

      • OOR: terminate called after throwing an instance of 'std::out_of_range'
      • MFS: DtException: Must find space in DDR
      • UMG: DtException: Unable to map graph within architecture constraints
      • PVF: DtException: Program verification failed
      • VMS: DtException: Need to find a valid memory space
      • DIR: RuntimeError.*DDR init retried
      • RPC: TimeoutError: RPC call to execute_model timed out.
      • PLT: assert prompt_len <= self.tkv
      • CTL: Please reduce the length of the messages or completion
      • If a run fails without a recognizable error pattern, the status will be ERR_UNKNOWN.
  • Captured parameter information even for failed runs

    • In direct mode, used FMWORK ARG as a fallback to record values like batch_size, input_size, and output_size.
    • In server mode, parsed input and batch sizes from server.cmd and client.cmd.
  • Captured client request completion data, which is included in the notes field in the format successful_requests:<num> and num-prompts:<num>

  • Added --model to normalize the model name in the output. The script extracts the original model name from the logs and splits it into a standardized model name and a new model_version field.

    • For example, if the model in the log is ibm-granite/granite-3.3-8b-instruct/main and the --model argument is ibm-granite/granite-3.3-8b-instruct, the output will show model: "ibm-granite/granite-3.3-8b-instruct" and model_version: "main".
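
As referenced in the error-reporting note above, here is an illustrative sketch (not fmwork's actual code) mapping the predefined error patterns to their short codes via regex search over a log file:

    # Illustration only -- patterns are taken from the list above.
    import re

    ERROR_PATTERNS = {
        "OOR": r"terminate called after throwing an instance of 'std::out_of_range'",
        "MFS": r"DtException: Must find space in DDR",
        "UMG": r"DtException: Unable to map graph within architecture constraints",
        "PVF": r"DtException: Program verification failed",
        "VMS": r"DtException: Need to find a valid memory space",
        "DIR": r"RuntimeError.*DDR init retried",
        "RPC": r"TimeoutError: RPC call to execute_model timed out",
        "PLT": r"assert prompt_len <= self\.tkv",
        "CTL": r"Please reduce the length of the messages or completion",
    }

    def detect_error(log_text):
        for code, pattern in ERROR_PATTERNS.items():
            if re.search(pattern, log_text):
                return code
        return "UNKNOWN"  # yields status ERR_UNKNOWN when no pattern matches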

infer/vllm/runner

  • In server mode, the runner script now prints the contents of server.log directly to the console and appends them to runner.log after execution. This allows pipeline users, who may not have access to the file system, to easily view the complete server logs.

v1.0.4

25 Aug 14:49
13971cc

infer/vllm/client

  • Removed --base-url http://localhost:8000. This may require changes to
    downstream automation.

infer/vllm/process

  • Added --precision with an fp16 default value.
  • Added code to detect batch mode ('static' or 'continuous') for Spyre
    integration. This requires VLLM_SPYRE_USE_CB to be explicitly defined and
    printed in the server.log file. Note that this should be done automatically
    by the runner/server integration.
  • Changed the TTFT metric from the server's TTFT (via /metrics) to the client's.
  • Changed the ITL metric from Mean TPOT to Median ITL, as reported by vLLM's
    serving benchmark.
  • To better support experiments with datasets other than random (which
    explicitly allows the definition of shapes): if no such definition is found
    in the log files (e.g., if the sharegpt dataset was used), process will read
    the appropriate lines from client.log to get the average input / output
    sizes.

v1.0.3

13 Aug 06:57
87f7ae2

Finalized server-mode support for infer/vllm and added documentation.

v1.0.2

12 Aug 14:14

  • Finalize support for direct and server modes for infer/vllm, including process script.

Documentation pending — to be added momentarily.

v1.0.1

01 Aug 20:39
99d4ea6

General improvements to embed/tf.

  • Improved output formatting for arguments.
  • Added processing script.
  • Oh, and a README ☺️

v1.0.0

01 Aug 08:14

Still a partial release — but now with the latest scripts to run encoder models on CPUs / GPUs / Spyre. Subsequent releases will cover decoder models, as well as more options / different engines.