v2.9.0
Overview
Core 2.9 is a feature-packed release that improves usability, simplifies operations through autoscaling and scheduling improvements, and supports streaming use cases (via model response streaming for both REST and gRPC clients).
Core 2 also ships new docs, with revamped content and structure. The documentation will continue to improve to address advanced configurations and use cases.
CRD Updates:
All CRD changes in this release maintain backward compatibility, so clusters with existing CRs can be migrated seamlessly.
- Add `status.availableReplicas` field to the Model CRD (#5873). This is part of the partial scheduling feature. The field is not set directly by end-users; it is updated by the Seldon k8s operator.
- Add `spec.llm` field to the Model CRD (#6234). The field is used by the PromptRuntime (in Seldon's LLM Module) to reference an LLM model. Only one of `spec.llm` and `spec.explainer` should be set at a given time. This allows the deployment of multiple "models" acting as prompt generators for the same LLM (see the sketch below).
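A minimal sketch of a Model using the new field; the model names, the artifact location, and the exact shape of the `llm` block (a `modelRef`, assumed here by analogy with `spec.explainer`) are illustrative assumptions rather than confirmed API:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: prompt-generator          # hypothetical prompt "model"
spec:
  storageUri: gs://example-bucket/prompt-generator  # hypothetical artifact
  llm:
    modelRef: my-llm              # assumed layout, mirroring spec.explainer;
                                  # references the LLM served by the LLM Module
  # spec.explainer must not be set on the same Model
```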
(Main) Features:
- We add inference response streaming support for REST (via SSE) and gRPC for MLServer models that have streaming support (#6293, #6292). This requires MLServer `>= 1.6.0`.
- We introduce partial scheduling for model replicas (#6221, docs), improving the behaviour of Core 2 during autoscaling. With this new feature, the Core 2 scheduler will try to load as many of the requested model replicas as possible, even when no inference server has sufficient replicas to meet the request. Partial scheduling is only active when end-users provide `spec.minReplicas` in a model manifest (a user-provided minimum for considering the model "available"), and takes effect when there is a suitable inference server with at least this number of replicas. With partial scheduling, a model can be (see the manifest sketch after this list):
  - Fully scheduled: `spec.replicas == status.availableReplicas`; the `ModelReady` condition is `True` with message `ModelAvailable`. All requested replicas serve inference requests.
  - Partially scheduled: `status.availableReplicas >= spec.minReplicas` but `status.availableReplicas < spec.replicas`; the `ModelReady` condition is `True` with message `ModelAvailable`. Core 2 was not able to find sufficient server replicas to load all requested replicas for this model. This state may be transitory, for example when new server replicas are being created but are not yet available. The available model replicas serve inference requests.
  - Not able to schedule: no suitable inference server with a number of replicas greater than or equal to the model's `spec.minReplicas` could be found. The `ModelReady` condition is `False` with message `ScheduleFailed`. Some model replicas may still be available for inference requests (for example, if the model was previously loaded on a server that was forced to scale down below the model's `spec.minReplicas`).
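As a minimal sketch, a model opting into partial scheduling looks as follows; the model name, artifact URI, and replica counts are illustrative:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris                            # hypothetical model name
spec:
  storageUri: gs://example-bucket/iris  # hypothetical artifact location
  requirements:
  - sklearn
  replicas: 4      # requested number of model replicas
  minReplicas: 2   # opting into partial scheduling: the model is considered
                   # "available" once at least 2 replicas are loaded
```

With this manifest, the model is fully scheduled when `status.availableReplicas` reaches 4, partially scheduled at 2 or 3 (with `ModelReady` still `True`), and fails to schedule if no suitable server with at least 2 replicas can be found.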
- We introduce mixed native/HPA autoscaling (#6218, #6222, #6235, with docs for model and server autoscaling) that:
  - enables end-users to configure a single HPA manifest, controlling model replicas;
  - works for multi-model serving (MMS) scenarios.

  When using this feature, servers are scaled up/down natively by Core 2 in response to changes in model replicas. If a model scales up and there aren't sufficient server replicas to host it, the number of server replicas is increased; if a model scales down and a server replica remains without any loaded models, the number of server replicas is reduced. A minimal HPA sketch is given below.

  We also introduce experimental functionality to pack models onto fewer inference servers on model scale-down, but this is disabled by default and will be improved in future releases. See the scale-down docs for details.
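A minimal HPA sketch for the single-manifest setup described above. It assumes a custom metric for Model objects (named `infer_rps` here) is already exposed, for example via prometheus-adapter; the metric name, target value, and model name are hypothetical, while targeting the Model resource itself is the pattern this feature enables:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-model-hpa
spec:
  scaleTargetRef:                  # the HPA scales the Model CR directly
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    name: iris
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps            # hypothetical custom metric
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: iris
      target:
        type: AverageValue
        averageValue: "10"         # illustrative target
```

Core 2 then scales the underlying servers natively in response to the resulting model replica changes, so no separate HPA needs to be configured for servers.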
- Model scheduling now takes into account model memory requirements based on the inference server config and how many in-memory copies of a model it creates (the `parallel_workers` MLServer setting and `instance_group` configurations in Triton). For Triton, only `KIND_CPU` instance groups are considered at this point (#6054).
- Log levels for all internal components (#6312) and the Envoy access log (#6295) can now be controlled in a consistent way.
Feature configuration & helm chart updates
- Server `spec.minReplicas` and `spec.maxReplicas` can be configured via helm (#6283), via the following values (see the sketch below):
  - `mlserver.minReplicas`
  - `mlserver.maxReplicas`
  - `triton.minReplicas`
  - `triton.maxReplicas`
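A sketch of the corresponding helm values; the replica bounds are illustrative:

```yaml
mlserver:
  minReplicas: 1   # bounds applied to the mlserver Server CR
  maxReplicas: 5
triton:
  minReplicas: 1   # bounds applied to the triton Server CR
  maxReplicas: 3
```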
- Native autoscaling feature control (#6301, #6286). All options here have corresponding command-line arguments that can be passed to `seldon-scheduler` when not using helm as the install method. The following helm values can be set (see the values sketch after this list):
  - `autoscaling.autoscalingModelEnabled`, with corresponding cmd line arg `--enable-model-autoscaling` (defaults to "false"): enables or disables native model autoscaling based on lag thresholds. Enabling this assumes that lag (the number of inference requests "in-flight") is a representative metric based on which to scale your models in a way that makes efficient use of resources.
  - `autoscaling.autoscalingServerEnabled`, with corresponding cmd line arg `--enable-server-autoscaling` (defaults to "true"): enables native server autoscaling, where the number of server replicas is set according to the number of replicas required by the models loaded onto that server.
  - `autoscaling.serverPackingEnabled`, with corresponding cmd line arg `--server-packing-enabled` (experimental, defaults to "false"): enables server packing to try and reduce the number of server replicas on model scale-down.
  - `autoscaling.serverPackingPercentage`, with corresponding cmd line arg `--server-packing-percentage` (experimental, defaults to "0.0"): controls the percentage of model replica removals (due to model scale-down or deletion) that should trigger packing.
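For reference, the same options expressed as a helm values sketch, using the defaults listed above:

```yaml
autoscaling:
  autoscalingModelEnabled: false   # --enable-model-autoscaling
  autoscalingServerEnabled: true   # --enable-server-autoscaling
  serverPackingEnabled: false      # --server-packing-enabled (experimental)
  serverPackingPercentage: 0.0     # --server-packing-percentage (experimental)
```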
- The inference server PVC retention policy can now be configured via helm (#6056), via the following values (see the sketch after this list):
  - `mlserver.statefulSetPersistentVolumeClaimRetentionPolicy.whenDeleted`
  - `mlserver.statefulSetPersistentVolumeClaimRetentionPolicy.whenScaled`
  - `triton.statefulSetPersistentVolumeClaimRetentionPolicy.whenDeleted`
  - `triton.statefulSetPersistentVolumeClaimRetentionPolicy.whenScaled`
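A sketch of the corresponding values; each field takes the standard StatefulSet retention values `Retain` or `Delete`, and the particular choices below are illustrative:

```yaml
mlserver:
  statefulSetPersistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # remove PVCs when the server StatefulSet is deleted
    whenScaled: Retain    # keep PVCs when the server scales down
triton:
  statefulSetPersistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete
    whenScaled: Retain
```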
- Logging options can be configured either via helm values or by passing arguments to individual Core 2 components. Docs for component log levels and the Envoy access log are available.

Please consult the helm chart docs for a full list of options.
(Main) Bugs Fixed:
- Transient NC 503 errors in Envoy logs when rolling out a new version of a model
  - fixed in #6082 by configuring Envoy to use the Aggregated Discovery Service (ADS) in order to add guarantees regarding the order of routing updates coming from the Core 2 scheduler
- Mislabelled operational metrics for experiments in Prometheus
  - fixed in #6118 by recording the actual model name in the `model` label rather than the experiment name
- Some errors encountered by `modelgateway` were silently ignored
  - fixed in #6014 by propagating those errors via Kafka, writing them into an error topic (DLQ)
- `seldon-scheduler` pod spec overrides (alongside any other stateful-set pod spec overrides) defined within the SeldonRuntime CR were ignored
  - fixed in #6349
Priority bugfixes scheduled for the next patch releases of Core 2.9
- [BUG] When there is a network partition between `dataflow-engine` and the Kafka cluster, and `dataflow-engine` is restarted, pipelines may sometimes be marked as `PipelineTerminated` with the message "pipeline removed" after the network partition is resolved. The current workaround is to delete any Pipeline in this state and re-deploy the same manifest into the Core 2 cluster.
Kudos:
We would like to highlight the exceptional contributions that the following team members have brought to this release and to Core 2 so far:
- Sherif Akoush (@sakoush)
With contributions from @sakoush, @lc525, @driev, @RobertSamoilescu, @Rajakavitha1, @paulb-seldon, @tyndria
Changelog
Dates are displayed in UTC. Generated by auto-changelog.
v2.9.0
7 April 2025
- fix(dataflow): Update default kafka log level for dataflow engine #6367
- feat(docs): Server native autoscaling #6356
- Bump ubi9/openjdk-17-runtime from 1.20 to 1.22 in /scheduler #6359
- docs(pipelines): Minor Pipelines doc improvements #6351
- fix: adjust logging level for dataflow #6350
- fix(operator): Apply scheduler runtime podSpec override #6349
- Spelling fix #6340
- Add a missing space #6338
- Re-generate license info #6333
- Re-generate license info #6332
- fix (docs) incorporate mark's suggestions #6313
- fix(scheduler): Do not try to unload versions that are not live #6331
- (re)allow triton rclone port to be set in compose #6330
- fix(gha): update upload-artifact version #6329
- feat: Config logging via helm #6312
- fix: Allow kafka client (`librdkafka`) to respect log level #6310
- docs: fix rewrite of About #6296
- Bump sigs.k8s.io/controller-runtime from 0.20.1 to 0.20.3 in /scheduler #6307
- Bump grafana/grafana from 11.5.1 to 11.5.2 in /scheduler #6289
- feat: add helm for disabling native autoscaling feature #6301
- feat(docs): add docs for access log and autoscaling helm config #6300
- Update SUMMARY.md #6303
- feat: implemented grpc model streaming #6293
- IA mapping #6236
- feat(envoy): enable accesslog configuration #6295
- [update] Operational Monitoring new IA #6257
- docs fix for API #6294
- test(scheduler): Included infer_stream test for REST #6292
- feat: Enable packing configuration via helm #6286
- fix(docs) istio.md #6287
- fix(operator): allow scaling requests for older generation #6285
- fix(scheduler): Remove all versions when model is deleted #6284
- Bump rclone/rclone from 1.69.0 to 1.69.1 in /scheduler #6272
- adding min and max replicas to the server helm chart #6283
- fix(scheduler): ignoring model runtime info in model equality check #6259
- feat(scheduler): Scale down server logic #6246
- bug(scheduler): do not scale to zero if max replicas is missing #6258
- feat(scheduler): mms send scaling request when model shceduling fails #6235
- Re-generate license info #6256
- [update] installation draft #6131
- feat(scheduler): Allow server stats to be returned #6253
- Bump github.com/envoyproxy/go-control-plane/envoy in /scheduler #6249
- Bump google.golang.org/protobuf from 1.36.4 to 1.36.5 in /operator #6250
- upgrade lint to v1.63.4 #6247
- fix(ci): builld k6 image - pin xk6 version to 0.13.4 #6245
- use go 1.22 for k6 image build #6244
- Re-generate license info #6243
- fix: Upgrade Go 1.23 and dependencies upgrade #6238
- Re-generate license info #6242
- fix(docs): Document Scheduling logic #6237
- Bump grafana/grafana from 11.4.0 to 11.5.1 in /scheduler #6240
- Bump rclone/rclone from 1.68.2 to 1.69.0 in /scheduler #6192
- fix(operator): Add Status.AvailableReplicas to Model CRD #5873
- feat(operator): Included LLM spec to CRD #6234
- feat(scheduler): add partial scheduling based on min replicas #6221
- feat(operator): adding a patch for server/spec/replicas upon scaling request #6222
- fix for envoy configs #6220
- Bump envoyproxy/envoy from v1.32.2 to v1.33.0 in /scheduler #6206
- feat: enable min/max replica for Server CR #6218
- new introduction to Core 2 #6195
- feat(envoy): use the healthcheck filter and a prestop hook to gracefully terminate Envoy #6194
- Re-generate license info #6184
- feat(scheduler): account for number of model instances when scheduling #6183
- Remove faulty link #6168
- Add Managed Kafka page to latest docs #6166
- feat(envoy): fixing a test #6163
- Mark kafka as recommended #6165
- fix port to 9004 in seldon cli deps #6164
- refactor(envoy): add clusters before updating routes (2) #6145
- fix(ansible): Upgrade deps in ansible install #6146
- feat(k6): add scenario with multiple stages ramping up/down RPS #6031
- fix(docs): Docs on upgrading from 2.7 - 2.8 #6143
- fix: Add timeout to contexts in client calls #6125
- Format spaces in install docs #6140
- fix(docs): add a table for core 2 dependencies in docs #6139
- feat(scheduler): account for multiple instances of a model per server when scheduling #6054
- Bump grafana/grafana from 11.3.1 to 11.4.0 in /scheduler #6133
- Bump envoyproxy/envoy from v1.32.1 to v1.32.2 in /scheduler #6134
- Bump google.golang.org/grpc from 1.68.0 to 1.68.1 in /hodometer #6136
- fix(docs): first draft of the securing endpoints #5991
- refactor(envoy): moving envoy/resources headers to util #6129
- fix(cli): Kafka inspect output formatting #6130
- feat(docs): improve HPA documentation #6091
- refactor(envoy): refactoring and optimising the components that build envoy config #6119
- Re-generate license info #6128
- change default k6 image in kustomize #6126
- fix(operator): regenerate CRDs #6124
- Bump grafana/grafana from 11.3.0 to 11.3.1 in /scheduler #6105
- feat(envoy): add an envoy config snapshot test #6121
- fix(envoy): use ADS in dynamic config config #6120
- fix(metrics): Fix model label metric in case of experiment #6118
- fix(cli): Add error topic to pipeline inspect #6117
- Re-generate license info #6116
- feat(cli): cli as k8s deployment for debugging #6090
- feat: Expose pvc retention policy via helm #6056
- feat(envoy): switch to ADS #6082
- Re-generate license info #6089
- fix(cli): Fix kafka topic assignment for cli #6085
- Re-generate license info #6083
- fix: Model gateway silently ignores errors #6014
- Re-generate license info #6080
- Re-generate license info #6076
- Bump github.com/go-playground/validator/v10 in /scheduler #6067
- Bump github.com/tidwall/gjson from 1.17.1 to 1.18.0 in /operator #6060
- Bump sigs.k8s.io/controller-runtime from 0.17.4 to 0.19.1 in /operator #6059
- Bump ubi9/ubi-micro from 9.4-15 to 9.5 in /operator #6058
- Bump rclone/rclone from 1.68.1 to 1.68.2 in /scheduler #6062
- Bump ubi9/ubi-minimal from 9.4-1227.1726694542 to 9.5 in /scheduler #6063
- Bump ubi9/ubi-micro from 9.4-15 to 9.5 in /scheduler #6064
- Bump ubi9/ubi-micro from 9.4-15 to 9.5 in /hodometer #6065
- Bump github.com/envoyproxy/go-control-plane in /scheduler #6068
- Bump github.com/rs/xid from 1.5.0 to 1.6.0 in /scheduler #6069
- fix(k6): use seldon-mesh svc for envoy k6 tests #6070
- fix mismatched dependencies #6057
- fixing the envoy dashboard #6055
- Re-generate license info #6052
- Bump github.com/onsi/gomega from 1.34.0 to 1.35.1 in /scheduler #6025
- Bump grafana/grafana from 11.2.0 to 11.3.0 in /scheduler #6000
- Bump github.com/onsi/gomega from 1.33.1 to 1.35.1 in /operator #6026
- Bump envoyproxy/envoy from v1.31.2 to v1.32.1 in /scheduler #6027
- fix(deps): Bump google.golang.org/grpc from 1.65.0 to 1.68.0 in /components/tls #6039
- Bump go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp #6041
- fix(deps): Bump google.golang.org/grpc from 1.65.0 to 1.68.0 in /apis/go #6042
- fix(deps): Bump google.golang.org/grpc from 1.65.0 to 1.68.0 in /operator #6043
- fix(deps): Bump google.golang.org/grpc from 1.66.0 to 1.68.0 in /hodometer #6037
- fix(docs) Fixed the rendering issues #6015
- allow model versions to increase #6038
- fix: Upgrade go 1.22 #5990
- Update Changelog #6035
- Generating changelog for v2.9.0 fe57037
- Generating changelog for v2.9.0-rc2 f5e47ed
- Setting version for helm charts 1ccc180
- Setting version for helm charts 4865356
- Setting version for yaml manifests 456036a