Releases: IBM/fmwork

v1.0.9

17 Oct 14:41
f241933

Bug Fix

infer/vllm/prometheus_metrics.py

  • Fixed an issue with post-processing of results when Prometheus metrics are enabled.
    • Removed extraneous debug print statements from the get_cpu_metrics, get_memory_metrics, and get_arguments functions to ensure the correct data flow after metric collection.

v1.0.8

01 Oct 14:28
c418191

infer/vllm/client

  • Added UTC timestamp logging at benchmark start and end (client_time_start, client_time_end)

infer/vllm/process

  • Added --enable_prom_metrics flag to collect CPU and memory metrics from Prometheus/Thanos
  • Added new output fields: req_thp, num_prompts, server_id, client_mode, prom_metrics
  • Added support for multi-client benchmark mode (detects client.log.* files)
  • Added extraction of warmup time from server.log in server mode
  • Improved fallback logic: average input/output sizes are now calculated from actual token counts when not explicitly configured
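
These notes don't show the processed output schema; as a rough illustration only, the new fields might appear as in the following Python-style dict (placeholder values, not actual fmwork output):

    # Illustration only -- field names are from the notes above, values are placeholders.
    new_fields = {
        "req_thp": 0.0,           # likely request throughput (placeholder value)
        "num_prompts": 10,
        "server_id": "server-0",  # placeholder
        "client_mode": "multi",   # placeholder
        "prom_metrics": {},       # populated when --enable_prom_metrics is set
    }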

infer/vllm/prometheus_metrics.py (new file)

  • Queries CPU and memory metrics from Prometheus/Thanos
  • Supports avg_over_time and max_over_time query functions
  • Requires environment variables: THANOS_API_TOKEN, THANOS_API_URL
  • Uses 5-minute step resolution for queries
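
For reference, a minimal sketch of a comparable query using prometheus-api-client is shown below; the metric name, labels, and time window are placeholders, not fmwork's actual queries.

    # Minimal sketch (not the fmwork implementation) of querying Thanos with
    # prometheus-api-client, using the environment variables and 5-minute step above.
    import os
    from datetime import datetime, timedelta, timezone
    from prometheus_api_client import PrometheusConnect

    prom = PrometheusConnect(
        url=os.environ["THANOS_API_URL"],
        headers={"Authorization": f"Bearer {os.environ['THANOS_API_TOKEN']}"},
        disable_ssl=True,
    )

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)  # placeholder benchmark window

    # avg_over_time (or max_over_time) over the window; metric and labels are hypothetical
    query = 'avg_over_time(container_cpu_usage_seconds_total{namespace="my-ns"}[5m])'
    result = prom.custom_query_range(query, start_time=start, end_time=end, step="5m")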

infer/vllm/runner

  • Changed timestamp generation to UTC format

Dependencies

  • Added prometheus-api-client

v1.0.7

17 Sep 23:43

Added support for multi-client runs for server-mode benchmarking.

Instead of providing a specific shape combo to the client section of runner -- e.g.,

...
    --
client
    --env PYTHONUNBUFFERED=1
    --dataset-name random
    --random-input-len 896
    --random-output-len 128
    --max-concurrency 1
    --num-prompts 10

Now we can provide a --multi parameter:

...
    --
client
    --env PYTHONUNBUFFERED=1
    --multi 896/128/1/10,1920/128/2/20,3968/128/1/10

--multi receives a comma-separated list of shape combos in the format:

input size / output size / batch size / num prompts

In the example above, three combos are specified:

  • 896/128/1/10 - input size = 896, output size = 128, batch size 1, num prompts = 10
  • 1920/128/2/20 - input size = 1920, output size = 128, batch size 2, num prompts = 20
  • 3968/128/1/10 - input size = 3968, output size = 128, batch size 1, num prompts = 10

If --multi is provided, the client script iterates over the combos and runs one vllm bench serve per combo. Each instance writes its output to its own file, client.log.${instance}. In the example above, there would be three instances and therefore three files: client.log.0, client.log.1, and client.log.2.
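
As a rough illustration of the format (not the actual client code), the combos from the example above could be parsed like this:

    # Hypothetical parsing of the --multi value; the real client script may differ.
    multi = "896/128/1/10,1920/128/2/20,3968/128/1/10"
    for instance, combo in enumerate(multi.split(",")):
        input_size, output_size, batch_size, num_prompts = map(int, combo.split("/"))
        # each instance writes its results to client.log.<instance>
        print(instance, input_size, output_size, batch_size, num_prompts)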

Pod / execution output:

waiting for server at PID 56 ...
done, server is ready!
...
Wed Sep 17 23:29:59 UTC 2025 -- starting 0
Wed Sep 17 23:29:59 UTC 2025 -- starting 1
Wed Sep 17 23:29:59 UTC 2025 -- clients started; waiting for completion ...
Wed Sep 17 23:29:59 UTC 2025 -- starting 2
Wed Sep 17 23:34:27 UTC 2025 -- finished 0
Wed Sep 17 23:34:49 UTC 2025 -- finished 1
Wed Sep 17 23:35:12 UTC 2025 -- finished 2
Wed Sep 17 23:35:12 UTC 2025 -- all done!
...

Each client.log.${instance} file will have the usual vllm bench output, including perf metrics.

v1.0.6

17 Sep 13:52
ef20a3b

infer/vllm/process

  • Added Error Type Support

    • REQ: assert req_index is not None
    • CGF: Failed to compile graphs: compile_graph failed
  • Model Configuration Enhancements

    • Added automatic precision parsing from model names (see the sketch after this list)
    • Introduced context_length field
  • Server Mode Improvements

    • Server logs are now appended to runner.log in server mode for better visibility
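
The notes don't show how precision is derived from the model name; a hypothetical sketch, assuming precision tags such as fp16 or fp8 appear as substrings of the name, might look like:

    # Hypothetical illustration only; the actual parsing in process may differ.
    import re

    def parse_precision(model_name, default="fp16"):
        match = re.search(r"(fp8|fp16|bf16|int8|int4)", model_name.lower())
        return match.group(1) if match else default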

v1.0.5

08 Sep 19:41
625332e

infer/vllm/process

  • Improved Error and Status Reporting

    • The output now includes status and error_msg fields, providing clarity on the run's outcome. Status can be OK (successful), PART_<err> (partially completed with a specific error), or ERR_<err> (failed with a specific error). A new function was also added to detect several predefined error patterns (see the sketch after this list):

      • OOR: terminate called after throwing an instance of 'std::out_of_range'
      • MFS: DtException: Must find space in DDR
      • UMG: DtException: Unable to map graph within architecture constraints
      • PVF: DtException: Program verification failed
      • VMS: DtException: Need to find a valid memory space
      • DIR: RuntimeError.*DDR init retried
      • RPC: TimeoutError: RPC call to execute_model timed out.
      • PLT: assert prompt_len <= self.tkv
      • CTL: Please reduce the length of the messages or completion
      • If a run fails without a recognizable error pattern, the status will be ERR_UNKNOWN.
  • Captured parameter information even for failed runs

    • In direct mode, used FMWORK ARG as a fallback to record values like batch_size, input_size, and output_size.
    • In server mode, parsed input and batch sizes from server.cmd and client.cmd.
  • Captured client request completion data, which is included in the notes field in the format successful_requests:<num> and num-prompts:<num>

  • Added --model to normalize the model name in the output. The script extracts the original model name from the logs and splits it into a standardized model name and a new model_version field.

    • For example, if the model in the log is ibm-granite/granite-3.3-8b-instruct/main and the --model argument is ibm-granite/granite-3.3-8b-instruct, the output will show model: "ibm-granite/granite-3.3-8b-instruct" and model_version: "main".
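
As referenced in the error-reporting note above, here is an illustrative sketch (not fmwork's actual code) mapping the predefined error patterns to their short codes via regex search over a log file:

    # Illustration only -- patterns are taken from the list above.
    import re

    ERROR_PATTERNS = {
        "OOR": r"terminate called after throwing an instance of 'std::out_of_range'",
        "MFS": r"DtException: Must find space in DDR",
        "UMG": r"DtException: Unable to map graph within architecture constraints",
        "PVF": r"DtException: Program verification failed",
        "VMS": r"DtException: Need to find a valid memory space",
        "DIR": r"RuntimeError.*DDR init retried",
        "RPC": r"TimeoutError: RPC call to execute_model timed out",
        "PLT": r"assert prompt_len <= self\.tkv",
        "CTL": r"Please reduce the length of the messages or completion",
    }

    def detect_error(log_text):
        for code, pattern in ERROR_PATTERNS.items():
            if re.search(pattern, log_text):
                return code
        return "UNKNOWN"  # yields status ERR_UNKNOWN when no pattern matches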

infer/vllm/runner

  • In server mode, the runner script now prints the contents of server.log directly to the console and appends them to runner.log after execution. This allows pipeline users, who may not have access to the file system, to easily view the complete server logs.

v1.0.4

25 Aug 14:49
13971cc

infer/vllm/client

  • Removed --base-url http://localhost:8000. This may require changes to
    downstream automation.

infer/vllm/process

  • Added --precision with an fp16 default value.
  • Added code to detect batch mode ('static' or 'continuous') for Spyre
    integration. This requires VLLM_SPYRE_USE_CB to be explicitly defined and
    printed in the server.log file. Note that this should be done automatically
    by the runner/server integration.
  • Changed the TTFT metric from the server's TTFT (via /metrics) to the client's.
  • Changed the ITL metric from Mean TPOT to Median ITL, as reported by vLLM's
    serving benchmark.
  • To better support experiments with datasets other than random (which
    explicitly allows the definition of shapes): if no such definition is found
    in the log files (e.g., if the sharegpt dataset was used), process will read
    the appropriate lines from client.log to get the average input / output
    sizes.

v1.0.3

13 Aug 06:57
87f7ae2

Finalized server-mode support for infer/vllm and added documentation.

v1.0.2

12 Aug 14:14

  • Finalize support for direct and server modes for infer/vllm, including process script.

Documentation pending — to be added momentarily.

v1.0.1

01 Aug 20:39
99d4ea6

General improvements to embed/tf.

  • Improved output formatting for arguments.
  • Added processing script.
  • Oh, and a README ☺️

v1.0.0

01 Aug 08:14

Still a partial release — but now with the latest scripts to run encoder models on CPUs / GPUs / Spyre. Subsequent releases will cover decoder models, as well as more options / different engines.