
doc/design: Configurable cluster replica sizes #33120


Open
SangJunBak wants to merge 1 commit into main from jun/config-cluster-replica-sizes

Conversation

@SangJunBak (Contributor) commented Jul 23, 2025

Rendered: https://github.com/MaterializeInc/materialize/blob/ade1eb3a1501b4b74335e7774c1dc0f6d9450972/doc/developer/20250723_configurable_replica_sizes.md

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@SangJunBak marked this pull request as ready for review July 23, 2025 15:54
@SangJunBak force-pushed the jun/config-cluster-replica-sizes branch from 9c1b057 to d5a4478 on July 23, 2025 15:55
@SangJunBak force-pushed the jun/config-cluster-replica-sizes branch from d5a4478 to ade1eb3 on July 23, 2025 15:58
@SangJunBak (Contributor, Author) commented Jul 24, 2025

Talked more about this in cloud hangout. Main takeaways:

  • A cluster replica size CRD would be nice for validating the cluster replica size schema (right now we'd need to just look for logs in envd); see the sketch after this list.
  • Generations / hashes of cluster replica sizes would be nice and important if we ever wanted to roll out a modification of a cluster replica size to all clusters. This can be achieved manually by the user via a mechanism like blue/green in self-managed, but it would be useful if we ever wanted to do this in Cloud.
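
A minimal sketch of what such a CRD-backed size definition could look like, assuming a hypothetical ClusterReplicaSize kind and made-up field names (workers, scale, resource limits); the real schema would come out of the design doc:

```yaml
# Hypothetical CRD instance; the kind and all field names are illustrative only.
apiVersion: materialize.cloud/v1alpha1
kind: ClusterReplicaSize
metadata:
  name: "100cc-custom"
spec:
  workers: 4          # timely workers per process
  scale: 1            # processes per replica
  cpuLimit: "2"       # Kubernetes CPU limit per process
  memoryLimit: 16Gi   # Kubernetes memory limit per process
```

With an OpenAPI schema attached to the CRD, malformed size definitions would be rejected by the Kubernetes API server at apply time instead of only surfacing in envd logs.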

@antiguru (Member) left a comment

Not a full review, just some thoughts.

It feels to me that we're struggling with how to represent data in Materialize that is somewhere between configuration and system data. Configuration data in my eyes is somewhat static and doesn't change once set. System data is something that can evolve over time, especially without restarting envd. This design moves the cluster replica size information closer to system data away from configuration data.

I feel there might be a design that makes this clearer; I'm not sure if it's the right time to implement it, but hear me out. I think we could have the cluster replica sizes in a table, and then the question of applying configurations becomes a question of who gets to write what to the table. The current way would be that the cluster replica size configuration is blindly applied to the table. A CRD would also be appended to the table. We could imagine SQL commands to write entries to the table. RBAC would allow us to limit who can write to the table.

One wrinkle is that cluster replicas name the replica size, but we don't "depend" on it, in the sense that the cluster replica size is not a nameable object. We assume it exists, and crash if it doesn't. When we introduce a table with cluster replica sizes, we need to make sure that deleting an entry does not get us into a state where envd couldn't start anymore.

Member:

The file should be in the design doc location, one level down.


1. To verify the cluster replica sizes in the database itself, one can run `SHOW custom_user_cluster_replica_sizes`
1. If the configmap fails to sync, we’ll print out a warning in the logs of the environmentd pod on which field is causing the issue.
- If a cluster size is modified, any existing clusters with that size shouldn’t be affected. Only newly created cluster replicas with the modified cluster size will.
Member:

What would happen if envd restarts? At the moment, we'd use the new cluster replica size definition, but it's not clear to me what should happen in your design.

Contributor:

I imagine the statefulset needs to be redeployed for any changes to take effect. If envd diffs the sizes and redeploys, that could be a problem; otherwise it'd still be a manual process to redeploy replicas post size change.

Contributor (Author):

> What would happen if envd restarts? At the moment, we'd use the new cluster replica size definition, but it's not clear to me what should happen in your design.

We could potentially get the cluster replica size definition from load_remote_system_parameters (code pointer). However, it runs into the potential case where:

  1. We have a default param set for this new dyncfg which includes a size used to build a builtin cluster like mz_catalog_server
  2. User decides to get rid of it in their configmap, causing envd not to boot up

The main approach I thought of was to completely separate the set of replica sizes used to create builtin cluster replicas and keep them as CLI args / config variables, where users aren't expected to edit this set. Then user-defined custom ones can live in the dyncfg and merge into the pre-existing set in the catalog (sketched below). Another thing we could do is keep them all unified but document / log errors stating that the sizes specified to bootstrap builtin cluster replicas need to exist in the configmap.
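
A minimal sketch of that separation, assuming a hypothetical ConfigMap name, dyncfg key (cluster_replica_sizes), and illustrative size fields; builtin sizes would stay in CLI args / config variables and only the user-defined set would be synced from here:

```yaml
# Hypothetical ConfigMap; the key name and size fields are illustrative only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: materialize-replica-sizes
data:
  cluster_replica_sizes: |
    {
      "custom-small": {"workers": 1, "scale": 1, "memory_limit": "4Gi"},
      "custom-large": {"workers": 8, "scale": 2, "memory_limit": "64Gi"}
    }
```

Builtin sizes such as the one used for mz_catalog_server would not live in this map, so removing an entry here could never prevent envd from booting.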

Contributor (Author):

But assuming we keep all sizes unified in the dyncfg, as Justin said we'd either:

  • No longer watch for statefulset changes for cluster replica sizes and force users to manually update already-created clusters
  • Sync this dyncfg to make changes to the statefulset via the cloud resource controller.


1. To verify the cluster replica sizes in the database itself, one can run `SHOW custom_user_cluster_replica_sizes`
1. If the configmap fails to sync, we’ll print out a warning in the logs of the environmentd pod on which field is causing the issue.
Member:

Do we store a copy of the cluster replica size configuration in the catalog so we can restart envd even if the configmap contains errors or is absent?

Contributor (Author):

We don't store a copy of it in the catalog, but my thinking was that if we were to have it stored as a dyncfg, it'd be saved in the catalog as a system variable. Then if the configmap contains errors, we'd use the last-synced values or the default values, similar to what happens when LD fails to sync.


## Out of Scope

- Allow configurable sizes in Cloud
Contributor:

What's the reason for not wanting to consider cloud? For the rollout of swap, being able to specify the replica sizes in dyncfgs would be a huge boon. We could set up LD to perform a slow rollout of swap, based on the risk segment and the Mz version.

@SangJunBak (Contributor, Author) commented Jul 28, 2025

I think mainly because:

  • Only a single self-managed potential customer has asked for it
  • We don't want to support user-configurable cluster sizes for Cloud and the complexity that comes with it

But I didn't even consider the rollout of swap! I think it'd be nice to find a way to slowly roll out sizes in this design doc, but one pretty big piece of complexity is how we'd roll out modifications to existing customer clusters with the sizes we want to modify. One simple idea is to have it configurable through dyncfg but only roll it out on a new release using 0dt.

Contributor:

> Only a single self-managed potential customer has asked for it

I'd view asks from the team that helps our internal management of MZ Cloud at about the same level as a "customer ask" in this case. If the ways customers and MZ would use these are very different then we may need to separate them, but LD vs. file dyncfg are close enough. If customers would really want `create replica size`, that maybe takes on a life of its own.

Contributor:

The latest thinking is that we might introduce new replica sizes, instead of swapping out the spilling mechanism for cc sizes, in which case this might not be relevant for the swap rollout after all. It still seems useful to be able to configure replica sizes through LD, to make one-off adjustments without having to manually edit statefulsets and hope the changes stick around for long enough.


## The Problem

We want users to be able to safely configure cluster replica sizes at runtime for self-managed.
Contributor:

I think we could scope this more broadly to encompass everyone managing Materialize: both MZ and self-managed customers and community users. If those don't overlap, perhaps we just revisit the MZ Cloud use case later.


Comment on lines +52 to +53
- Allow orchestratord to create a configmap in a volume in environmentd. Then glue the path to the dyncfg synchronization.
- Create a custom cluster size CRD. This will allow orchestratord to handle statefulset creation in the future.
Contributor:

Yeah, for dyncfg file stuff for self-managed I was thinking that we'd need to add a sessionVarConfigMap or something like that so that we can create the configmap ahead of time and allow it to be configured per MZ (or shared between MZs).
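
A rough sketch of how that could surface in the Materialize CR, assuming a hypothetical sessionVarConfigMap field name and reusing the hypothetical ConfigMap from above; the actual field name and shape would be decided in the design doc:

```yaml
# Hypothetical Materialize CR excerpt; sessionVarConfigMap is illustrative only.
apiVersion: materialize.cloud/v1alpha1
kind: Materialize
metadata:
  name: my-environment
spec:
  # Name of a pre-created ConfigMap holding dyncfg overrides such as
  # cluster replica sizes; it could be shared between multiple Materialize CRs.
  sessionVarConfigMap: materialize-replica-sizes
```

Creating the ConfigMap ahead of time keeps it under the user's IaC, and referencing it by name is what would make sharing it across environments possible.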

Comment on lines +102 to +103
- What should the interface be in the Materialize CR? We have the following options:
- A boolean that signals if we want to create the configmap
Contributor:

We could have orchestratord create the configmap if one is not passed in, or we could require one to be passed in via the MZ CR.

Allowing it to be passed in is definitely easier to manage from an IaC perspective, and allows us to share the configmap across MZs if that's ever a thing someone wants to do.

Contributor (Author):

Passing in seems most reasonable!


1. To verify the cluster replica sizes in the database itself, one can run `SHOW custom_user_cluster_replica_sizes`
1. If the configmap fails to sync, we’ll print out a warning in the logs of the environmentd pod on which field is causing the issue.
Contributor:

It may be quite easy for users to miss if something won't parse, but that could be fine initially as long as we log it pretty loudly. If we wanted to make the process easier to debug, we could potentially pack dyncfg parsing issues into an internal table?

Contributor (Author):

That's a neat idea!
