Skip to content

Safe Deploys #3598

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 49 commits into
base: main
Choose a base branch
from
Open

Safe Deploys #3598

wants to merge 49 commits into from

Conversation

axfelix
Copy link
Contributor

@axfelix axfelix commented Jun 2, 2025

this won't build yet because it's missing the new safe deploys page, but this way you can see how I'm changing our existing SDK versioning pages before adding that content (all 6 plus the new section should be ready by Wednesday).

@axfelix axfelix requested a review from a team as a code owner June 2, 2025 17:51
@axfelix axfelix changed the title go versioning revisions Safe Deploys Jun 2, 2025
version "5.5.0"
resolved "https://registry.yarnpkg.com/react-icons/-/react-icons-5.5.0.tgz#8aa25d3543ff84231685d3331164c00299cdfaf2"
integrity sha512-MEFcXdkP3dLo8uumGI5xN3lDFNsRtrjbOEKDLD7yv76v4wpnEq2Lt2qeHaQOr34I/wPN3s3+N08WkQ+CW37Xiw==

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick review. Can do a deeper review next week. Overall, fantastic work! Thanks for spearheading this and keeping everyone organized with this work.

- Deployment
- Replay
---

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this file is mostly a copy/paste, but wanted to go ahead and make some flags.
I think we should link out to Worker Versioning here, probably upfront, and then talk about patching as a fallback plan.

(In general, if you want to stage the responses to these comments in different PRs or ask me to do some of it, I'm game for whatever!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this page was already there, and I was trying to restructure it to fit a bit better where it is (I'm not sure exactly why it was added in the first place). I think this is a good place for people who haven't taken the versioning course and maybe don't understand these concepts at a high level to understand why they need replay tests and things like that. I am also happy to add more prominent links to Worker Versioning here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to make sure I can track followups here.


## Use Replay Testing before and during your deployments

The best way to verify that your code won't cause non-determinism errors once deployed is to use [replay testing](https://docs.temporal.io/workflow-execution#replay).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be scoped to non-PINNED workflows

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it, but that assumes that every existing user is quickly going to migrate to Worker Versioning and understand pinning vs not pinning their existing Workflows, and I don't think we're there yet, so I'm reluctant to qualify this statement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replay testing is not "the best way" anymore now that there are two alternatives. We chatted about my stance over slack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I am not recommending patching as the best way, but I was not at all led to believe that we're deemphasizing Replay Testing, which is a very good practice that we try to advocate to all users as an addition to their test suite, whether they are using patch versioning or not. Wouldn't Replay Testing also be extremely important for deploying non-pinned Workflow revisions?

I'm trying to give all of our Versioning methods roughly equal attention so that people understand the space, and encourage them to adopt Worker Versioning. I'm very opposed to deemphasizing our existing methods any more as things stand, because we've been teaching and working with an entire community of users who understand those methods for years. This should be a gradual shift, and I think this framing reflects that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PINNED workflows don't need replay testing. I'm not asking for an editorial bias in favor of new stuff (which I agree is premature), I'm asking for it to be factual.


Because we design for potentially long running Workflows at scale, versioning with Temporal works differently. We explain more in this optional 30 minute introduction:
- [Workflow Type Versioning](#type-versioning). This is the simplest of the three, and acts more like a cutover than true versioning. It is suitable for short-running Workflows.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not very practical advice for most users:

  • It requires callsites to change, often impractical if those are deployed in a different service.
  • It duplicates code, makes it hard to code review, and ruins git blame.
  • It doesn't scale to complex workflows whose implementation spans multiple files.

Happily, we don't need it any more now that we have worker versioning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Counterargument: I was told this exact same thing 2 years ago (not joking). I think we need to describe the current state of the art as-is, and I've made clear that it's not a "real" method that we recommend.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've survived without documenting this approach for this long, and documenting it now, even if you change it to clarify that it's not real/supported, is really sending the wrong message that we don't believe in what we're shipping.

In rough order of frequency, non-scientifically, here's what I see from customers:

  • YOLO. Terminate workflows or reset them when something goes wrong.
  • Patching
  • Replay testing (either in tests or when deploying)
  • Versioning task queues
  • Versioning Workflow names

(Admittedly, lot depends on who's been giving them advice.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some updates here to deemphasize this without removing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Way back when I wrote a forum post about versioning, which this structure is I think still somewhat descended from, I gave context on task queue and type based versioning as a means of saying "here's something you might have discovered and it can work OK but probably you'd prefer something built-in".

Now, we have a better approach to that built-in thing. So, I think making a very brief reference to type based versioning is OK, but I wouldn't glorify it with a code sample, nor put that sample before the content we really want people to see (safe deploys and patching).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People are still using this right now as far as we know, though, and we do make clear that it doesn't use any platform features! I think I've cut its inclusion down as far as makes sense, honestly (and I do understand the push to do that).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, as discussed live, newest revision only presents this in the context of patching, doesn't refer to it as a top-level method, we now only present 2 of those. Hopefully this is a solid compromise!

temporal workflow update-options \
--workflow-id="MyWorkflowId" \
--versioning-override-behavior=pinned \
--versioning-override-pinned-version="$MY_DEPLOYMENT_VERSION"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sushisource here too

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't have changed until temporalio/cli#811 is merged and CLI is released

@axfelix
Copy link
Contributor Author

axfelix commented Jun 16, 2025

@drewhoskins-temporal think I've addressed most of your concerns here for now!

</SdkTabs>

This covers the complete lifecycle of working with Worker Versioning.
More features will be added here soon, including the ability to progammaticaly deploy and scale your Workers in Kubernetes pods, and checks to guarantee safe deploys.
Copy link
Contributor

@drewhoskins-temporal drewhoskins-temporal Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd shy away from forward-looking statements scattered around the docs. They are hard to maintain and can turn out wrong, and also distracting for users. I thought we had a policy not to put this kind of thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally yes, but I understand this roadmap to be very stable at this point, and I think this section ends too abruptly otherwise without being able to talk about programmatic scaling and deploys. I really am trying very hard not to oversell or diminish a public preview feature now that it's finally shipping.

- Deployment
- Replay
---

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to make sure I can track followups here.


## Use Replay Testing before and during your deployments

The best way to verify that your code won't cause non-determinism errors once deployed is to use [replay testing](https://docs.temporal.io/workflow-execution#replay).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replay testing is not "the best way" anymore now that there are two alternatives. We chatted about my stance over slack.


Temporal Workflows must be [deterministic](https://docs.temporal.io/workflow-definition#deterministic-constraints) in terms of the events that they generate server-side.
Changes to your Workflows that would create different server-side events, or events in a different order, need to be protected by using the [Versioning APIs](https://docs.temporal.io/workflow-definition#workflow-versioning) within your Workflow code.
You can also use [Worker Versioning](/production-deployment/worker-deployments/worker-versioning) to pin your deployments to specific code revisions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*pin your workflows

# implementation code omitted for this example
}
```
Since incompatible changes only affect open Workflow Executions of the same type, you can avoid determinism errors by creating a whole new Workflow when making changes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since incompatible changes only affect open Workflow Executions of the same type

I had a hard time following this at first. Same type as what?

Consider: "when making changes, you can avoid non-determinism errors by creating a whole new Workflow definition, avoiding changes to your original definition."

@@ -90,9 +79,9 @@ using var worker = new TemporalWorker(
The downside of this method is that it does not use any Temporal platform features.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't land for me. A user's goal isn't to "use Temporal platform features." Do you mean "benefit from" ? If so, which ones?

(The fact that there are so many caveats here is a smell that it might not be worth including.)


Since incompatible changes only affect open Workflow Executions of the same type, you can avoid this problem by changing the Workflow Type for the new version.
To do this, you can copy the Workflow Definition function, giving it a different name, and make sure that both names were registered with your Workers.
To understand why Patching is useful, it's helpful to first demonstrate cutting over an entire Workflow.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good framing.

@@ -90,9 +79,9 @@ using var worker = new TemporalWorker(
The downside of this method is that it does not use any Temporal platform features.
It requires you to duplicate code and to update any code and commands used to start the Workflow.
This can become impractical over time, depending on how you are providing configuration strings to your deployment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depending on how you are providing configuration strings to your deployment.

not sure what this is referring to.

- [Workflow Branching with GetVersion](#getversion). This method works by adding branches to your code tied to specific revisions. In most other SDKs, it is called patching.
- [Worker Versioning](/production-deployment/worker-deployments/worker-versioning). The Worker Versioning feature allows you to tag your Workers and programmatically roll them out in deployment versions, so that old Workers can run old code paths and new Workers can run new code paths. If you were using this method experimentally prior to summer 2025, refer to the [Worker Versioning Legacy](worker-versioning-legacy) docs.

## Workflow Type Versioning {#type-versioning}
## Workflow Branching {#getversion}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is Branching? I've never heard this term, and would like to stick to fewer terms. If you'd like to introduce this name, let's chat as a crew.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The education team has been referring to "getVersion" for Go and Java and "patching" for the other SDKs for the last 2 years -- there's a general consensus that we do not call Go and Java "patching" because the feature is implemented somewhat differently, and "branching" is a usefully generic term that captures both methods because it refers to code branches. If we want to call everything patching, that would be a change for us.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we call it "branching" everywhere, I can see that being useful as an umbrella term, but it looks like it's still just called patching in the others? ex: https://github.com/temporalio/documentation/pull/3598/files#diff-638a5973dbe00746fdcd0654427bacb760bab12d3e3713b6c346798005ee6df2R43

If we use this title everywhere, I can see that. But otherwise I'd agree with Drew we should err on the side of conservatism


## Use Replay Testing before and during your deployments

The best way to verify that your code won't cause non-determinism errors once deployed is to use [replay testing](https://docs.temporal.io/workflow-execution#replay).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PINNED workflows don't need replay testing. I'm not asking for an editorial bias in favor of new stuff (which I agree is premature), I'm asking for it to be factual.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants