Releases: IBM/fmwork
v1.0.9
Bug Fix
infer/vllm/prometheus_metrics.py
- Fixed an issue with post-processing of results when Prom metrics are enabled.
- Removed extraneous debug print statements from the `get_cpu_metrics`, `get_memory_metrics`, and `get_arguments` functions to ensure the correct data flow after metric collection.
v1.0.8
infer/vllm/client
- Added UTC timestamp logging at benchmark start and end (`client_time_start`, `client_time_end`)
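For illustration only (this is not the client's actual code), UTC timestamps of this form can be produced with Python's standard library; the field names below simply mirror the ones mentioned above:

```python
# Minimal sketch: record UTC timestamps at benchmark start and end.
from datetime import datetime, timezone

def utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()

record = {"client_time_start": utc_now()}
# ... run the benchmark ...
record["client_time_end"] = utc_now()
```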
infer/vllm/process
- Added `--enable_prom_metrics` flag to collect CPU and memory metrics from Prometheus/Thanos
- Added new output fields: `req_thp`, `num_prompts`, `server_id`, `client_mode`, `prom_metrics`
- Added support for multi-client benchmark mode (detects `client.log.*` files)
- Extract warmup time from `server.log` for server mode
- Better fallback logic: calculates average sizes from actual token counts when not explicitly configured
infer/vllm/prometheus_metrics.py (new file)
- Query CPU and memory metrics from Prometheus/Thanos
- Supports `avg_over_time` and `max_over_time` query functions
- Requires environment variables: `THANOS_API_TOKEN`, `THANOS_API_URL`
- Uses 5-minute step resolution for queries (see the sketch below)
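As a rough sketch only (not the contents of `prometheus_metrics.py`), this is how such queries might be issued with `prometheus-api-client`; the metric names, namespace, and time window are placeholders, while the environment variables and 5-minute step follow the notes above.

```python
# Minimal sketch, not fmwork's actual code. Metric and label names are placeholders.
import os
from datetime import datetime, timedelta, timezone
from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(
    url=os.environ["THANOS_API_URL"],
    headers={"Authorization": f"Bearer {os.environ['THANOS_API_TOKEN']}"},
    disable_ssl=True,
)

# Aggregate over the run window with max_over_time / avg_over_time.
query = 'max_over_time(container_memory_working_set_bytes{namespace="my-ns"}[30m])'
for sample in prom.custom_query(query=query):
    print(sample["metric"], sample["value"])

# Or pull a time series at 5-minute (300 s) step resolution.
end = datetime.now(timezone.utc)
series = prom.custom_query_range(
    query='pod:container_cpu_usage:sum{namespace="my-ns"}',
    start_time=end - timedelta(hours=1),
    end_time=end,
    step="300",
)
```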
infer/vllm/runner
- Changed timestamp generation to UTC format
Dependencies
- Added `prometheus-api-client`
v1.0.7
Added support for multi-client runs in server-mode benchmarking.
Instead of providing a specific shape combo to the client section of runner -- e.g.,

```
...
--
client
--env PYTHONUNBUFFERED=1
--dataset-name random
--random-input-len 896
--random-output-len 128
--max-concurrency 1
--num-prompts 10
```

now we can provide a `--multi` parameter:
```
...
--
client
--env PYTHONUNBUFFERED=1
--multi 896/128/1/10,1920/128/2/20,3968/128/1/10
```
`--multi` receives a comma-separated list of shape combos in the format:
input size / output size / batch size / num prompts
In the example above, three combos are specified:
- `896/128/1/10` - input size = 896, output size = 128, batch size = 1, num prompts = 10
- `1920/128/2/20` - input size = 1920, output size = 128, batch size = 2, num prompts = 20
- `3968/128/1/10` - input size = 3968, output size = 128, batch size = 1, num prompts = 10
If `--multi` is provided, then the client script will iterate over combos and run one `vllm bench serve` for each combo. Each instance writes outputs to its own file, `client.log.${instance}`. Again, in the example above, there would be three instances -- therefore files `client.log.0`, `client.log.1`, `client.log.2`.
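The fan-out might look roughly like the sketch below (illustrative only, not the runner's actual implementation); the flags match the client example above, and batch size is mapped to `--max-concurrency` as in that example.

```python
# Minimal sketch of --multi handling; illustrative only, not fmwork's client script.
import subprocess

def parse_multi(spec: str):
    """'896/128/1/10,1920/128/2/20' -> [(896, 128, 1, 10), (1920, 128, 2, 20)]"""
    return [tuple(int(x) for x in combo.split("/")) for combo in spec.split(",")]

def start_client(instance, input_len, output_len, concurrency, num_prompts):
    # Plus --model / server URL flags as needed for a real run.
    cmd = ["vllm", "bench", "serve",
           "--dataset-name", "random",
           "--random-input-len", str(input_len),
           "--random-output-len", str(output_len),
           "--max-concurrency", str(concurrency),
           "--num-prompts", str(num_prompts)]
    log = open(f"client.log.{instance}", "w")
    return subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)

combos = parse_multi("896/128/1/10,1920/128/2/20,3968/128/1/10")
procs = [start_client(i, *combo) for i, combo in enumerate(combos)]
for p in procs:
    p.wait()  # wait for all client instances to finish
```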
Pod / execution output:
```
waiting for server at PID 56 ...
done, server is ready!
...
Wed Sep 17 23:29:59 UTC 2025 -- starting 0
Wed Sep 17 23:29:59 UTC 2025 -- starting 1
Wed Sep 17 23:29:59 UTC 2025 -- clients started; waiting for completion ...
Wed Sep 17 23:29:59 UTC 2025 -- starting 2
Wed Sep 17 23:34:27 UTC 2025 -- finished 0
Wed Sep 17 23:34:49 UTC 2025 -- finished 1
Wed Sep 17 23:35:12 UTC 2025 -- finished 2
Wed Sep 17 23:35:12 UTC 2025 -- all done!
...
```
Each `client.log.${instance}` file will have the usual `vllm bench` output, including perf metrics.
v1.0.6
infer/vllm/process
- Added Error Type Support
  - REQ: `assert req_index is not None`
  - CGF: `Failed to compile graphs: compile_graph failed`
- Model Configuration Enhancements
  - Added automatic precision parsing from model names (see the sketch at the end of this section)
  - Introduced `context_length` field
- Server Mode Improvements
  - Server logs now append to `runner.log` in server mode for better visibility
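The precision parsing mentioned under Model Configuration Enhancements might look roughly like this (illustrative only; the actual logic in `process` may differ, and the pattern list is a guess). The `fp16` fallback matches the `--precision` default introduced in v1.0.4 below.

```python
# Illustrative sketch only: infer precision from a model name, else use a default.
import re

def parse_precision(model_name: str, default: str = "fp16") -> str:
    m = re.search(r"(fp8|fp16|bf16|int8|int4)", model_name.lower())
    return m.group(1) if m else default

print(parse_precision("ibm-granite/granite-3.3-8b-instruct-FP8"))  # -> fp8
print(parse_precision("ibm-granite/granite-3.3-8b-instruct"))      # -> fp16 (default)
```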
v1.0.5
infer/vllm/process
- Improved Error and Status Reporting
  - The output now includes `status` and `error_msg` fields, providing clarity on the run's outcome. Status can be `OK` (successful), `PART_<err>` (partially completed with a specific error), or `ERR_<err>` (failed with a specific error). A new function was also added to detect several predefined error patterns (see the sketch after this list):
    - OOR: `terminate called after throwing an instance of 'std::out_of_range'`
    - MFS: `DtException: Must find space in DDR`
    - UMG: `DtException: Unable to map graph within architecture constraints`
    - PVF: `DtException: Program verification failed`
    - VMS: `DtException: Need to find a valid memory space`
    - DIR: `RuntimeError.*DDR init retried`
    - RPC: `TimeoutError: RPC call to execute_model timed out.`
    - PLT: `assert prompt_len <= self.tkv`
    - CTL: `Please reduce the length of the messages or completion`
  - If a run fails without a recognizable error pattern, the status will be `ERR_UNKNOWN`.
- Captured parameter information even for failed runs
  - In `direct` mode, used `FMWORK ARG` as a fallback to record values like `batch_size`, `input_size`, and `output_size`.
  - In `server` mode, parsed input and batch sizes from `server.cmd` and `client.cmd`.
- Captured client request completion data, which is included in the `notes` field in the format `successful_requests:<num>` and `num-prompts:<num>`
- Added `--model` to normalize the model name in the output. The script extracts the original model name from the logs and splits it into a standardized model name and a new `model_version` field.
  - For example, if the model in the log is `ibm-granite/granite-3.3-8b-instruct/main` and the `--model` argument is `ibm-granite/granite-3.3-8b-instruct`, the output will show `model: "ibm-granite/granite-3.3-8b-instruct"` and `model_version: "main"`.
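Here is a minimal sketch of how the status/error-pattern detection above could be structured. The codes and pattern strings are copied from the list, but the functions themselves are illustrative and are not the actual `process` implementation.

```python
# Minimal sketch of error-pattern detection; codes and patterns come from the
# list above, but the functions are illustrative, not fmwork's code.
import re

ERROR_PATTERNS = {
    "OOR": r"terminate called after throwing an instance of 'std::out_of_range'",
    "MFS": r"DtException: Must find space in DDR",
    "UMG": r"DtException: Unable to map graph within architecture constraints",
    "PVF": r"DtException: Program verification failed",
    "VMS": r"DtException: Need to find a valid memory space",
    "DIR": r"RuntimeError.*DDR init retried",
    "RPC": r"TimeoutError: RPC call to execute_model timed out\.",
    "PLT": r"assert prompt_len <= self\.tkv",
    "CTL": r"Please reduce the length of the messages or completion",
}

def detect_error(log_text: str) -> str | None:
    for code, pattern in ERROR_PATTERNS.items():
        if re.search(pattern, log_text):
            return code
    return None

def run_status(log_text: str, completed: bool, partial: bool) -> str:
    if completed:
        return "OK"
    err = detect_error(log_text) or "UNKNOWN"
    return f"PART_{err}" if partial else f"ERR_{err}"
```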
infer/vllm/runner
- In `server` mode, the `runner` script now prints the contents of `server.log` directly to the console and appends them to `runner.log` after execution. This allows pipeline users, who may not have access to the file system, to easily view the complete server logs.
v1.0.4
infer/vllm/client
- Removed `--base-url http://localhost:8000`. This may require changes to downstream automation.
infer/vllm/process
- Added `--precision` with a `fp16` default value.
- Added code to detect batch mode (`'static'` or `'continuous'`) for Spyre integration. This requires `VLLM_SPYRE_USE_CB` to be explicitly defined and printed in the `server.log` file. Note that this should be done automatically by the `runner`-`server` integration.
- Changed `TTFT` metric from the server's TTFT (via `/metrics`) to the client's.
- Changed `ITL` metric from `Mean TPOT` to `Median ITL`, as reported by vLLM's serving benchmark.
- To better support experiments with datasets other than `random` (which explicitly allows the definition of shapes): if such a definition is not found in the log files (e.g., if the `sharegpt` dataset was used), `process` will read the appropriate lines from `client.log` to get the average input/output sizes (see the sketch below).
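A rough sketch of that fallback (illustrative only): averages could be derived from the per-run token totals in `client.log`. The exact lines that `process` matches are not documented in these notes, so the patterns below are placeholders.

```python
# Illustrative sketch of the average-size fallback; the summary-line formats
# below are placeholders, not necessarily what `process` matches in client.log.
import re

def average_sizes(client_log: str) -> tuple[float, float]:
    text = open(client_log).read()
    # Assumes the log contains summary lines of roughly this shape.
    total_in = int(re.search(r"Total input tokens:\s+(\d+)", text).group(1))
    total_out = int(re.search(r"Total generated tokens:\s+(\d+)", text).group(1))
    num_reqs = int(re.search(r"Successful requests:\s+(\d+)", text).group(1))
    return total_in / num_reqs, total_out / num_reqs

avg_in, avg_out = average_sizes("client.log")
print(f"avg input size: {avg_in:.1f}, avg output size: {avg_out:.1f}")
```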
v1.0.3
Finalized server-mode support for infer/vllm and added documentation.
v1.0.2
- Finalize support for `direct` and `server` modes for `infer/vllm`, including the `process` script.

Documentation pending; to be added momentarily.
v1.0.1
General improvements to embed/tf.
- Improved output formatting for arguments.
- Added processing script.
- Oh, and a README ☺️
v1.0.0
Still a partial release — but now with the latest scripts to run encoder models on CPUs / GPUs / Spyre. Subsequent releases will cover decoder models, as well as more options / different engines.