v2.9.0
Overview
Core 2.9 is a feature-packed release that improves usability, simplifies operations through autoscaling and scheduling improvements, and supports streaming use cases (via model response streaming for both REST and gRPC clients).
Core 2 also ships new docs, with revamped content and structure. The documentation will continue to improve to address advanced configurations and use cases.
CRD Updates:
All CRD changes in this release maintain backward compatibility, so clusters with existing CRs can be migrated seamlessly.
- Add `status.availableReplicas` field to the Model CRD (#5873). This is part of the partial scheduling feature. The field is not set directly by end-users; it is updated by the Seldon k8s operator.
- Add `spec.llm` field to the Model CRD (#6234). The field is used by the PromptRuntime (in Seldon's LLM Module) to reference an LLM model. Only one of `spec.llm` and `spec.explainer` should be set at a given time. This allows the deployment of multiple "models" acting as prompt generators for the same LLM (see the sketch below).
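A minimal sketch of a Model using the new field; the model names, the artifact location, and the exact shape of the `llm` block (a `modelRef`, assumed here by analogy with `spec.explainer`) are illustrative assumptions rather than confirmed API:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: prompt-generator          # hypothetical prompt "model"
spec:
  storageUri: gs://example-bucket/prompt-generator  # hypothetical artifact
  llm:
    modelRef: my-llm              # assumed layout, mirroring spec.explainer;
                                  # references the LLM served by the LLM Module
  # spec.explainer must not be set on the same Model
```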
(Main) Features:
- We add inference response streaming support for REST (via SSE) and gRPC for MLServer models that have streaming support (#6293, #6292). This requires MLServer `>= 1.6.0`.
- We introduce partial scheduling for model replicas (#6221, docs), improving the behaviour of Core 2 during autoscaling. With this new feature, the Core 2 scheduler will try to load as many of the requested model replicas as possible, even when no inference server has sufficient replicas to meet the request. Partial scheduling is only active when end-users provide `spec.minReplicas` in a model manifest (a user-provided minimum for considering the model "available"), and takes effect when there is a suitable inference server with at least this number of replicas. With partial scheduling, a model can be (see the manifest sketch after this list):
  - Fully scheduled: `spec.replicas == status.availableReplicas`; the `ModelReady` condition is `True` with message `ModelAvailable`. All requested replicas serve inference requests.
  - Partially scheduled: `status.availableReplicas >= spec.minReplicas` but `status.availableReplicas < spec.replicas`; the `ModelReady` condition is `True` with message `ModelAvailable`. Core 2 was not able to find sufficient server replicas to load all requested replicas for this model. This state may be transitory, for example when new server replicas are being created but are not yet available. The available model replicas serve inference requests.
  - Not able to schedule: no suitable inference server with a number of replicas greater than or equal to the model's `spec.minReplicas` could be found. The `ModelReady` condition is `False` with message `ScheduleFailed`. Some model replicas may still be available for inference requests (for example, if the model was previously loaded on a server that was forced to scale down below the model's `spec.minReplicas`).
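As a minimal sketch, a model opting into partial scheduling looks as follows; the model name, artifact URI, and replica counts are illustrative:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris                            # hypothetical model name
spec:
  storageUri: gs://example-bucket/iris  # hypothetical artifact location
  requirements:
  - sklearn
  replicas: 4      # requested number of model replicas
  minReplicas: 2   # opting into partial scheduling: the model is considered
                   # "available" once at least 2 replicas are loaded
```

With this manifest, the model is fully scheduled when `status.availableReplicas` reaches 4, partially scheduled at 2 or 3 (with `ModelReady` still `True`), and fails to schedule if no suitable server with at least 2 replicas can be found.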
- We introduce mixed native/HPA autoscaling (#6218, #6222, #6235, with docs for model and server autoscaling) that:
  - enables end-users to configure a single HPA manifest, controlling model replicas;
  - works for multi-model serving (MMS) scenarios.

  When using this feature, servers are scaled up/down natively by Core 2 in response to changes in model replicas. If a model scales up and there aren't sufficient server replicas to host it, the number of server replicas is increased; if a model scales down and a server replica remains without any loaded models, the number of server replicas is reduced. A minimal HPA sketch is given below.

  We also introduce experimental functionality to pack models onto fewer inference servers on model scale-down, but this is disabled by default and will be improved in future releases. See the scale-down docs for details.
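A minimal HPA sketch for the single-manifest setup described above. It assumes a custom metric for Model objects (named `infer_rps` here) is already exposed, for example via prometheus-adapter; the metric name, target value, and model name are hypothetical, while targeting the Model resource itself is the pattern this feature enables:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-model-hpa
spec:
  scaleTargetRef:                  # the HPA scales the Model CR directly
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    name: iris
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps            # hypothetical custom metric
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: iris
      target:
        type: AverageValue
        averageValue: "10"         # illustrative target
```

Core 2 then scales the underlying servers natively in response to the resulting model replica changes, so no separate HPA needs to be configured for servers.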
- Model scheduling now takes into account model memory requirements based on the inference server config and how many in-memory copies of a model it creates (the `parallel_workers` MLServer setting and `instance_group` configurations in Triton). For Triton, only `KIND_CPU` instance groups are considered at this point (#6054).
- Log levels for all internal components (#6312) and the Envoy access log (#6295) can now be controlled in a consistent way.
Feature configuration & helm chart updates
- Server `spec.minReplicas` and `spec.maxReplicas` can be configured via helm (#6283), via the following values (see the sketch below):
  - `mlserver.minReplicas`
  - `mlserver.maxReplicas`
  - `triton.minReplicas`
  - `triton.maxReplicas`
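A sketch of the corresponding helm values; the replica bounds are illustrative:

```yaml
mlserver:
  minReplicas: 1   # bounds applied to the mlserver Server CR
  maxReplicas: 5
triton:
  minReplicas: 1   # bounds applied to the triton Server CR
  maxReplicas: 3
```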
- Native autoscaling feature control (#6301, #6286). All options here have corresponding command-line arguments that can be passed to `seldon-scheduler` when not using helm as the install method. The following helm values can be set (see the values sketch after this list):
  - `autoscaling.autoscalingModelEnabled`, with corresponding cmd line arg `--enable-model-autoscaling` (defaults to "false"): enables or disables native model autoscaling based on lag thresholds. Enabling this assumes that lag (the number of inference requests "in-flight") is a representative metric based on which to scale your models in a way that makes efficient use of resources.
  - `autoscaling.autoscalingServerEnabled`, with corresponding cmd line arg `--enable-server-autoscaling` (defaults to "true"): enables native server autoscaling, where the number of server replicas is set according to the number of replicas required by the models loaded onto that server.
  - `autoscaling.serverPackingEnabled`, with corresponding cmd line arg `--server-packing-enabled` (experimental, defaults to "false"): enables server packing to try and reduce the number of server replicas on model scale-down.
  - `autoscaling.serverPackingPercentage`, with corresponding cmd line arg `--server-packing-percentage` (experimental, defaults to "0.0"): controls the percentage of model replica removals (due to model scale-down or deletion) that should trigger packing.
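For reference, the same options expressed as a helm values sketch, using the defaults listed above:

```yaml
autoscaling:
  autoscalingModelEnabled: false   # --enable-model-autoscaling
  autoscalingServerEnabled: true   # --enable-server-autoscaling
  serverPackingEnabled: false      # --server-packing-enabled (experimental)
  serverPackingPercentage: 0.0     # --server-packing-percentage (experimental)
```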
- The inference server PVC retention policy can now be configured via helm (#6056), via the following values (see the sketch after this list):
  - `mlserver.statefulSetPersistentVolumeClaimRetentionPolicy.whenDeleted`
  - `mlserver.statefulSetPersistentVolumeClaimRetentionPolicy.whenScaled`
  - `triton.statefulSetPersistentVolumeClaimRetentionPolicy.whenDeleted`
  - `triton.statefulSetPersistentVolumeClaimRetentionPolicy.whenScaled`
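A sketch of the corresponding values; each field takes the standard StatefulSet retention values `Retain` or `Delete`, and the particular choices below are illustrative:

```yaml
mlserver:
  statefulSetPersistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # remove PVCs when the server StatefulSet is deleted
    whenScaled: Retain    # keep PVCs when the server scales down
triton:
  statefulSetPersistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete
    whenScaled: Retain
```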
- Logging options can be configured either via helm values or by passing arguments to individual Core 2 components. Docs for component log levels and the Envoy access log are available.

Please consult the helm chart docs for a full list of options.
(Main) Bugs Fixed:
- Transient NC 503 errors in Envoy logs when rolling out a new version of a model
  - fixed in #6082 by configuring Envoy to use the Aggregated Discovery Service (ADS) in order to add guarantees regarding the order of routing updates coming from the Core 2 scheduler
- Mislabelled operational metrics for experiments in Prometheus
  - fixed in #6118 by recording the actual model name in the `model` label rather than the experiment name
- Some errors encountered by `modelgateway` were silently ignored
  - fixed in #6014 by propagating those errors via Kafka, writing them into an error topic (DLQ)
- `seldon-scheduler` pod spec overrides (alongside any other stateful-set pod spec overrides) defined within the SeldonRuntime CR were ignored
  - fixed in #6349
Priority bugfixes scheduled for the next patch releases of Core 2.9
- [BUG] When there is a network partition between `dataflow-engine` and the Kafka cluster, and `dataflow-engine` is restarted, pipelines may sometimes be marked as `PipelineTerminated` with the message "pipeline removed" after the network partition is resolved. The current workaround is to delete any Pipeline in this state and re-deploy the same manifest into the Core 2 cluster.
Kudos:
We would like to highlight the exceptional contributions that the following team members have brought to this release and to Core 2 so far:
- Sherif Akoush (@sakoush)
With contributions from @sakoush, @lc525, @driev, @RobertSamoilescu, @Rajakavitha1, @paulb-seldon, @tyndria
Changelog
Dates are displayed in UTC. Generated by auto-changelog.
v2.9.0
7 April 2025
- fix(dataflow): Update default kafka log level for dataflow engine #6367
- feat(docs): Server native autoscaling #6356
- Bump ubi9/openjdk-17-runtime from 1.20 to 1.22 in /scheduler #6359
- docs(pipelines): Minor Pipelines doc improvements #6351
- fix: adjust logging level for dataflow #6350
- fix(operator): Apply scheduler runtime podSpec override #6349
- Spelling fix #6340
- Add a missing space #6338
- Re-generate license info #6333
- Re-generate license info #6332
- fix (docs) incorporate mark's suggestions #6313
- fix(scheduler): Do not try to unload versions that are not live #6331
- (re)allow triton rclone port to be set in compose #6330
- fix(gha): update upload-artifact version #6329
- feat: Config logging via helm #6312
- fix: Allow kafka client (`librdkafka`) to respect log level #6310
- docs: fix rewrite of About #6296
- Bump sigs.k8s.io/controller-runtime from 0.20.1 to 0.20.3 in /scheduler #6307
- Bump grafana/grafana from 11.5.1 to 11.5.2 in /scheduler #6289
- feat: add helm for disabling native autoscaling feature #6301
- feat(docs): add docs for access log and autoscaling helm config #6300
- Update SUMMARY.md #6303
- feat: implemented grpc model streaming #6293
- IA mapping #6236
- feat(envoy): enable accesslog configuration #6295
- [update] Operational Monitoring new IA #6257
- docs fix for API #6294
- test(scheduler): Included infer_stream test for REST #6292
- feat: Enable packing configuration via helm #6286
- fix(docs) istio.md #6287
- fix(operator): allow scaling requests for older generation #6285
- fix(scheduler): Remove all versions when model is deleted #6284
- Bump rclone/rclone from 1.69.0 to 1.69.1 in /scheduler #6272
- adding min and max replicas to the server helm chart #6283
- fix(scheduler): ignoring model runtime info in model equality check #6259
- feat(scheduler): Scale down server logic #6246
- bug(scheduler): do not scale to zero if max replicas is missing #6258
- feat(scheduler): mms send scaling request when model shceduling fails #6235
- Re-generate license info #6256
- [update] installation draft #6131
- feat(scheduler): Allow server stats to be returned #6253
- Bump github.com/envoyproxy/go-control-plane/envoy in /scheduler #6249
- Bump google.golang.org/protobuf from 1.36.4 to 1.36.5 in /operator #6250
- upgrade lint to v1.63.4 #6247
- fix(ci): builld k6 image - pin xk6 version to 0.13.4 #6245
- use go 1.22 for k6 image build #6244
- Re-generate license info #6243
- fix: Upgrade Go 1.23 and dependencies upgrade #6238
- Re-generate license info #6242
- fix(docs): Document Scheduling logic #6237
- Bump grafana/grafana from 11.4.0 to 11.5.1 in /scheduler #6240
- Bump rclone/rclone from 1.68.2 to 1.69.0 in /scheduler #6192
- fix(operator): Add Status.AvailableReplicas to Model CRD #5873
- feat(operator): Included LLM spec to CRD #6234
- feat(scheduler): add partial scheduling based on min replicas #6221
- feat(operator): adding a patch for server/spec/replicas upon scaling request #6222
- fix for envoy configs #6220
- Bump envoyproxy/envoy from v1.32.2 to v1.33.0 in /scheduler #6206
- feat: enable min/max replica for Server CR #6218
- new introduction to Core 2 #6195
- feat(envoy): use the healthcheck filter and a prestop hook to gracefully terminate Envoy #6194
- Re-generate license info #6184
- feat(scheduler): account for number of model instances when scheduling #6183
- Remove faulty link #6168
- Add Managed Kafka page to latest docs #6166
- feat(envoy): fixing a test #6163
- Mark kafka as recommended #6165
- fix port to 9004 in seldon cli deps #6164
- refactor(envoy): add clusters before updating routes (2) #6145
- fix(ansible): Upgrade deps in ansible install #6146
- feat(k6): add scenario with multiple stages ramping up/down RPS #6031
- fix(docs): Docs on upgrading from 2.7 - 2.8 #6143
- fix: Add timeout to contexts in client calls #6125
- Format spaces in install docs #6140
- fix(docs): add a table for core 2 dependencies in docs #6139
- feat(scheduler): account for multiple instances of a model per server when scheduling #6054
- Bump grafana/grafana from 11.3.1 to 11.4.0 in /scheduler #6133
- Bump envoyproxy/envoy from v1.32.1 to v1.32.2 in /scheduler #6134
- Bump google.golang.org/grpc from 1.68.0 to 1.68.1 in /hodometer #6136
- fix(docs): first draft of the securing endpoints #5991
- refactor(envoy): moving envoy/resources headers to util #6129
- fix(cli): Kafka inspect output formatting #6130
- feat(docs): improve HPA documentation #6091
- refactor(envoy): refactoring and optimising the components that build envoy config #6119
- Re-generate license info #6128
- change default k6 image in kustomize #6126
- fix(operator): regenerate CRDs #6124
- Bump grafana/grafana from 11.3.0 to 11.3.1 in /scheduler #6105
- feat(envoy): add an envoy config snapshot test #6121
- fix(envoy): use ADS in dynamic config config #6120
- fix(metrics): Fix model label metric in case of experiment #6118
- fix(cli): Add error topic to pipeline inspect #6117
- Re-generate license info #6116
- feat(cli): cli as k8s deployment for debugging #6090
- feat: Expose pvc retention policy via helm #6056
- feat(envoy): switch to ADS #6082
- Re-generate license info #6089
- fix(cli): Fix kafka topic assignment for cli #6085
- Re-generate license info #6083
- fix: Model gateway silently ignores errors #6014
- Re-generate license info #6080
- Re-generate license info #6076
- Bump github.com/go-playground/validator/v10 in /scheduler #6067
- Bump github.com/tidwall/gjson from 1.17.1 to 1.18.0 in /operator #6060
- Bump sigs.k8s.io/controller-runtime from 0.17.4 to 0.19.1 in /operator #6059
- Bump ubi9/ubi-micro from 9.4-15 to 9.5 in /operator #6058
- Bump rclone/rclone from 1.68.1 to 1.68.2 in /scheduler #6062
- Bump ubi9/ubi-minimal from 9.4-1227.1726694542 to 9.5 in /scheduler #6063
- Bump ubi9/ubi-micro from 9.4-15 to 9.5 in /scheduler #6064
- Bump ubi9/ubi-micro from 9.4-15 to 9.5 in /hodometer #6065
- Bump github.com/envoyproxy/go-control-plane in /scheduler #6068
- Bump github.com/rs/xid from 1.5.0 to 1.6.0 in /scheduler #6069
- fix(k6): use seldon-mesh svc for envoy k6 tests #6070
- fix mismatched dependencies #6057
- fixing the envoy dashboard #6055
- Re-generate license info #6052
- Bump github.com/onsi/gomega from 1.34.0 to 1.35.1 in /scheduler #6025
- Bump grafana/grafana from 11.2.0 to 11.3.0 in /scheduler #6000
- Bump github.com/onsi/gomega from 1.33.1 to 1.35.1 in /operator #6026
- Bump envoyproxy/envoy from v1.31.2 to v1.32.1 in /scheduler #6027
- fix(deps): Bump google.golang.org/grpc from 1.65.0 to 1.68.0 in /components/tls #6039
- Bump go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp #6041
- fix(deps): Bump google.golang.org/grpc from 1.65.0 to 1.68.0 in /apis/go #6042
- fix(deps): Bump google.golang.org/grpc from 1.65.0 to 1.68.0 in /operator #6043
- fix(deps): Bump google.golang.org/grpc from 1.66.0 to 1.68.0 in /hodometer #6037
- fix(docs) Fixed the rendering issues #6015
- allow model versions to increase #6038
- fix: Upgrade go 1.22 #5990
- Update Changelog #6035
- Generating changelog for v2.9.0 fe57037
- Generating changelog for v2.9.0-rc2 f5e47ed
- Setting version for helm charts 1ccc180
- Setting version for helm charts 4865356
- Setting version for yaml manifests 456036a