-
Notifications
You must be signed in to change notification settings - Fork 253
Safe Deploys #3598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Safe Deploys #3598
Conversation
Co-authored-by: angelazhou32 <[email protected]>
docs/production-deployment/worker-deployments/kubernetes-controller.mdx
Outdated
Show resolved
Hide resolved
version "5.5.0" | ||
resolved "https://registry.yarnpkg.com/react-icons/-/react-icons-5.5.0.tgz#8aa25d3543ff84231685d3331164c00299cdfaf2" | ||
integrity sha512-MEFcXdkP3dLo8uumGI5xN3lDFNsRtrjbOEKDLD7yv76v4wpnEq2Lt2qeHaQOr34I/wPN3s3+N08WkQ+CW37Xiw== | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a quick review. Can do a deeper review next week. Overall, fantastic work! Thanks for spearheading this and keeping everyone organized with this work.
- Deployment | ||
- Replay | ||
--- | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this file is mostly a copy/paste, but wanted to go ahead and make some flags.
I think we should link out to Worker Versioning here, probably upfront, and then talk about patching as a fallback plan.
(In general, if you want to stage the responses to these comments in different PRs or ask me to do some of it, I'm game for whatever!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this page was already there, and I was trying to restructure it to fit a bit better where it is (I'm not sure exactly why it was added in the first place). I think this is a good place for people who haven't taken the versioning course and maybe don't understand these concepts at a high level to understand why they need replay tests and things like that. I am also happy to add more prominent links to Worker Versioning here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to make sure I can track followups here.
|
||
## Use Replay Testing before and during your deployments | ||
|
||
The best way to verify that your code won't cause non-determinism errors once deployed is to use [replay testing](https://docs.temporal.io/workflow-execution#replay). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be scoped to non-PINNED
workflows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get it, but that assumes that every existing user is quickly going to migrate to Worker Versioning and understand pinning vs not pinning their existing Workflows, and I don't think we're there yet, so I'm reluctant to qualify this statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replay testing is not "the best way" anymore now that there are two alternatives. We chatted about my stance over slack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, I am not recommending patching as the best way, but I was not at all led to believe that we're deemphasizing Replay Testing, which is a very good practice that we try to advocate to all users as an addition to their test suite, whether they are using patch versioning or not. Wouldn't Replay Testing also be extremely important for deploying non-pinned Workflow revisions?
I'm trying to give all of our Versioning methods roughly equal attention so that people understand the space, and encourage them to adopt Worker Versioning. I'm very opposed to deemphasizing our existing methods any more as things stand, because we've been teaching and working with an entire community of users who understand those methods for years. This should be a gradual shift, and I think this framing reflects that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PINNED workflows don't need replay testing. I'm not asking for an editorial bias in favor of new stuff (which I agree is premature), I'm asking for it to be factual.
docs/develop/dotnet/versioning.mdx
Outdated
|
||
Because we design for potentially long running Workflows at scale, versioning with Temporal works differently. We explain more in this optional 30 minute introduction: | ||
- [Workflow Type Versioning](#type-versioning). This is the simplest of the three, and acts more like a cutover than true versioning. It is suitable for short-running Workflows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not very practical advice for most users:
- It requires callsites to change, often impractical if those are deployed in a different service.
- It duplicates code, makes it hard to code review, and ruins git blame.
- It doesn't scale to complex workflows whose implementation spans multiple files.
Happily, we don't need it any more now that we have worker versioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Counterargument: I was told this exact same thing 2 years ago (not joking). I think we need to describe the current state of the art as-is, and I've made clear that it's not a "real" method that we recommend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've survived without documenting this approach for this long, and documenting it now, even if you change it to clarify that it's not real/supported, is really sending the wrong message that we don't believe in what we're shipping.
In rough order of frequency, non-scientifically, here's what I see from customers:
- YOLO. Terminate workflows or reset them when something goes wrong.
- Patching
- Replay testing (either in tests or when deploying)
- Versioning task queues
- Versioning Workflow names
(Admittedly, lot depends on who's been giving them advice.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made some updates here to deemphasize this without removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Way back when I wrote a forum post about versioning, which this structure is I think still somewhat descended from, I gave context on task queue and type based versioning as a means of saying "here's something you might have discovered and it can work OK but probably you'd prefer something built-in".
Now, we have a better approach to that built-in thing. So, I think making a very brief reference to type based versioning is OK, but I wouldn't glorify it with a code sample, nor put that sample before the content we really want people to see (safe deploys and patching).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
People are still using this right now as far as we know, though, and we do make clear that it doesn't use any platform features! I think I've cut its inclusion down as far as makes sense, honestly (and I do understand the push to do that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, as discussed live, newest revision only presents this in the context of patching, doesn't refer to it as a top-level method, we now only present 2 of those. Hopefully this is a solid compromise!
temporal workflow update-options \ | ||
--workflow-id="MyWorkflowId" \ | ||
--versioning-override-behavior=pinned \ | ||
--versioning-override-pinned-version="$MY_DEPLOYMENT_VERSION" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Sushisource here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't have changed until temporalio/cli#811 is merged and CLI is released
@drewhoskins-temporal think I've addressed most of your concerns here for now! |
</SdkTabs> | ||
|
||
This covers the complete lifecycle of working with Worker Versioning. | ||
More features will be added here soon, including the ability to progammaticaly deploy and scale your Workers in Kubernetes pods, and checks to guarantee safe deploys. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd shy away from forward-looking statements scattered around the docs. They are hard to maintain and can turn out wrong, and also distracting for users. I thought we had a policy not to put this kind of thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally yes, but I understand this roadmap to be very stable at this point, and I think this section ends too abruptly otherwise without being able to talk about programmatic scaling and deploys. I really am trying very hard not to oversell or diminish a public preview feature now that it's finally shipping.
- Deployment | ||
- Replay | ||
--- | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to make sure I can track followups here.
|
||
## Use Replay Testing before and during your deployments | ||
|
||
The best way to verify that your code won't cause non-determinism errors once deployed is to use [replay testing](https://docs.temporal.io/workflow-execution#replay). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replay testing is not "the best way" anymore now that there are two alternatives. We chatted about my stance over slack.
|
||
Temporal Workflows must be [deterministic](https://docs.temporal.io/workflow-definition#deterministic-constraints) in terms of the events that they generate server-side. | ||
Changes to your Workflows that would create different server-side events, or events in a different order, need to be protected by using the [Versioning APIs](https://docs.temporal.io/workflow-definition#workflow-versioning) within your Workflow code. | ||
You can also use [Worker Versioning](/production-deployment/worker-deployments/worker-versioning) to pin your deployments to specific code revisions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*pin your workflows
# implementation code omitted for this example | ||
} | ||
``` | ||
Since incompatible changes only affect open Workflow Executions of the same type, you can avoid determinism errors by creating a whole new Workflow when making changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since incompatible changes only affect open Workflow Executions of the same type
I had a hard time following this at first. Same type as what?
Consider: "when making changes, you can avoid non-determinism errors by creating a whole new Workflow definition, avoiding changes to your original definition."
@@ -90,9 +79,9 @@ using var worker = new TemporalWorker( | |||
The downside of this method is that it does not use any Temporal platform features. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This didn't land for me. A user's goal isn't to "use Temporal platform features." Do you mean "benefit from" ? If so, which ones?
(The fact that there are so many caveats here is a smell that it might not be worth including.)
|
||
Since incompatible changes only affect open Workflow Executions of the same type, you can avoid this problem by changing the Workflow Type for the new version. | ||
To do this, you can copy the Workflow Definition function, giving it a different name, and make sure that both names were registered with your Workers. | ||
To understand why Patching is useful, it's helpful to first demonstrate cutting over an entire Workflow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good framing.
@@ -90,9 +79,9 @@ using var worker = new TemporalWorker( | |||
The downside of this method is that it does not use any Temporal platform features. | |||
It requires you to duplicate code and to update any code and commands used to start the Workflow. | |||
This can become impractical over time, depending on how you are providing configuration strings to your deployment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
depending on how you are providing configuration strings to your deployment.
not sure what this is referring to.
- [Workflow Branching with GetVersion](#getversion). This method works by adding branches to your code tied to specific revisions. In most other SDKs, it is called patching. | ||
- [Worker Versioning](/production-deployment/worker-deployments/worker-versioning). The Worker Versioning feature allows you to tag your Workers and programmatically roll them out in deployment versions, so that old Workers can run old code paths and new Workers can run new code paths. If you were using this method experimentally prior to summer 2025, refer to the [Worker Versioning Legacy](worker-versioning-legacy) docs. | ||
|
||
## Workflow Type Versioning {#type-versioning} | ||
## Workflow Branching {#getversion} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is Branching? I've never heard this term, and would like to stick to fewer terms. If you'd like to introduce this name, let's chat as a crew.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The education team has been referring to "getVersion" for Go and Java and "patching" for the other SDKs for the last 2 years -- there's a general consensus that we do not call Go and Java "patching" because the feature is implemented somewhat differently, and "branching" is a usefully generic term that captures both methods because it refers to code branches. If we want to call everything patching, that would be a change for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we call it "branching" everywhere, I can see that being useful as an umbrella term, but it looks like it's still just called patching in the others? ex: https://github.com/temporalio/documentation/pull/3598/files#diff-638a5973dbe00746fdcd0654427bacb760bab12d3e3713b6c346798005ee6df2R43
If we use this title everywhere, I can see that. But otherwise I'd agree with Drew we should err on the side of conservatism
|
||
## Use Replay Testing before and during your deployments | ||
|
||
The best way to verify that your code won't cause non-determinism errors once deployed is to use [replay testing](https://docs.temporal.io/workflow-execution#replay). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PINNED workflows don't need replay testing. I'm not asking for an editorial bias in favor of new stuff (which I agree is premature), I'm asking for it to be factual.
this won't build yet because it's missing the new safe deploys page, but this way you can see how I'm changing our existing SDK versioning pages before adding that content (all 6 plus the new section should be ready by Wednesday).