Fix telemetry reporting 500 for successful activities/orchestrations with caught exceptions #1270
base: main
When an exception is caught and handled within user code (without causing the activity or orchestration to fail), the Activity span status was left in an Error state. This caused Application Insights telemetry to report Response Code 500 even though Succeeded was True. The fix makes the dispatchers explicitly set ActivityStatusCode.Ok for successful completions, overriding any stale error status left by intermediate exception handling or custom instrumentation.

Changes:
- TaskActivityDispatcher: set Ok status before stopping the span on success
- TaskOrchestrationDispatcher: add centralized status logic for orchestration completions (Ok for Completed/ContinuedAsNew, Error for Failed/Terminated)
- TraceHelper: entity invocation spans now get an explicit Ok/Error status

Fixes Azure/azure-functions-durable-extension#3199
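To make the activity path concrete, here is a minimal Python model of the behavior described above. It is not the DurableTask.Core code: Span is a hypothetical stand-in for System.Diagnostics.Activity, and dispatch_activity, the string event names, and the two user functions are illustrative assumptions. The model assumes, as the fix does, that a later status write overwrites an earlier one and that telemetry sees whatever status the span holds when it stops.

```python
from enum import Enum
from typing import Callable, Optional


class ActivityStatusCode(Enum):
    # Mirrors the Unset/Ok/Error members of .NET's System.Diagnostics.ActivityStatusCode.
    UNSET = 0
    OK = 1
    ERROR = 2


class Span:
    """Hypothetical stand-in for a System.Diagnostics.Activity span."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.status = ActivityStatusCode.UNSET
        self.exported_status: Optional[ActivityStatusCode] = None

    def set_status(self, code: ActivityStatusCode) -> None:
        # The last call wins: an earlier Error can be overwritten by a later Ok.
        self.status = code

    def stop(self) -> None:
        # Telemetry sees whatever status the span holds at this moment.
        self.exported_status = self.status


def dispatch_activity(span: Span, user_code: Callable[[Span], None]) -> str:
    """Hypothetical dispatcher returning the name of the completion event."""
    try:
        user_code(span)
        response = "TaskCompletedEvent"
    except Exception:
        # Unhandled failure: still reported as Error, exactly as before the fix.
        span.set_status(ActivityStatusCode.ERROR)
        response = "TaskFailedEvent"
    if response == "TaskCompletedEvent":
        # The fix: on success, set Ok before stopping so a stale Error is overridden.
        span.set_status(ActivityStatusCode.OK)
    span.stop()
    return response


def logs_and_continues(span: Span) -> None:
    try:
        raise ValueError("transient problem")
    except ValueError:
        # Custom instrumentation marks the span, but the activity still succeeds.
        span.set_status(ActivityStatusCode.ERROR)


def always_fails(span: Span) -> None:
    raise RuntimeError("unhandled")


handled = Span("HandledException")
result = dispatch_activity(handled, logs_and_continues)
print(result, handled.exported_status)

failed = Span("UnhandledException")
result = dispatch_activity(failed, always_fails)
print(result, failed.exported_status)
```

Setting the status only on the success path keeps the change small: the failure branch is untouched, so genuinely failed activities still end in Error.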
Pull request overview
This PR fixes a telemetry issue where successful activities and orchestrations incorrectly reported HTTP 500 response codes in Application Insights when Distributed Tracing v2 was enabled and intermediate exceptions were caught and handled. The fix ensures that ActivityStatusCode.Ok is explicitly set for successful completions, overriding any stale error status left by intermediate exception handling.
Key Changes:
- Explicitly set the status to ActivityStatusCode.Ok for successful completions in all dispatcher paths
- Introduced centralized status handling logic for orchestrations via the SetOrchestrationActivityStatus() helper
- Added comprehensive tests to verify status reset behavior for activities, orchestrations, and entities
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/DurableTask.Core.Tests/DispatcherMiddlewareTests.cs | Adds integration test ActivityAndOrchestrationSpansResetStatuses, with helper classes, to verify that activity/orchestration spans report Ok status when they succeed despite setting Error status during execution |
| src/DurableTask.Core/Tracing/TraceHelper.cs | Updates entity invocation handling to explicitly set ActivityStatusCode.Ok for successful operations and ActivityStatusCode.Error for failures |
| src/DurableTask.Core/TaskOrchestrationDispatcher.cs | Introduces a SetOrchestrationActivityStatus() helper that sets the appropriate status based on orchestration outcome (Ok for Completed/ContinuedAsNew, Error for Failed/Terminated); removes duplicate status-setting code |
| src/DurableTask.Core/TaskActivityDispatcher.cs | Adds explicit ActivityStatusCode.Ok status setting for TaskCompletedEvent so that successful executions override prior error statuses from custom instrumentation |
| Test/DurableTask.Core.Tests/TraceHelperTests.cs | New unit test file with tests for entity invocation status handling: EndActivitiesForEntityInvocationResetsSuccessfulStatus and EndActivitiesForEntityInvocationMarksFailures |
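The entity change and its two unit tests can be illustrated the same way. The sketch below is a hypothetical Python analogue of the status logic added to EndActivitiesForProcessingEntityInvocation(); the real method's signature and the shape of entity operation results are not shown in this PR, so OperationResult and end_entity_invocation_spans are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ActivityStatusCode(Enum):
    # Mirrors the Unset/Ok/Error members of .NET's ActivityStatusCode.
    UNSET = 0
    OK = 1
    ERROR = 2


@dataclass
class Span:
    name: str
    status: ActivityStatusCode = ActivityStatusCode.UNSET
    stopped: bool = False


@dataclass
class OperationResult:
    # Hypothetical shape: error_message is None when the entity operation succeeded.
    error_message: Optional[str] = None


def end_entity_invocation_spans(spans: List[Span], results: List[OperationResult]) -> None:
    """Hypothetical analogue of EndActivitiesForProcessingEntityInvocation()."""
    for span, result in zip(spans, results):
        # Every span gets an explicit outcome instead of keeping whatever it held.
        if result.error_message is None:
            span.status = ActivityStatusCode.OK
        else:
            span.status = ActivityStatusCode.ERROR
        span.stopped = True


# Mirrors the intent of the two TraceHelperTests cases.
reset = Span("entity-op-success", status=ActivityStatusCode.ERROR)  # stale Error
marked = Span("entity-op-failure")
end_entity_invocation_spans(
    [reset, marked],
    [OperationResult(), OperationResult(error_message="operation threw")],
)
assert reset.status is ActivityStatusCode.OK
assert marked.status is ActivityStatusCode.ERROR
```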
Co-authored-by: Copilot <[email protected]>
Summary
This PR fixes an issue where successful activities and orchestrations were incorrectly reporting HTTP 500 response codes in Application Insights telemetry when Distributed Tracing v2 was enabled.
Fixes: Azure/azure-functions-durable-extension#3199
Problem
When an exception is thrown within an activity function but caught and handled via try-catch (for example, logging it and then continuing), the System.Diagnostics.Activity span's status was left in an Error state. The telemetry module (WebJobsTelemetryModule/DurableTelemetryModule) translates ActivityStatusCode.Error into HTTP 500, resulting in misleading telemetry where:
- Response Code = 500
- Succeeded = True

This was confusing for users monitoring their applications, because the telemetry falsely indicated failures for operations that actually succeeded.
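The mismatch is easiest to see as two independently derived fields. In the hypothetical Python function below, only the rule "ActivityStatusCode.Error becomes HTTP 500" comes from this PR; mapping every other status to 200, and passing the success flag in separately from the span status, are assumptions for illustration and not the actual logic of WebJobsTelemetryModule or DurableTelemetryModule.

```python
from enum import Enum


class ActivityStatusCode(Enum):
    # Mirrors the Unset/Ok/Error members of .NET's ActivityStatusCode.
    UNSET = 0
    OK = 1
    ERROR = 2


def to_request_telemetry(span_status: ActivityStatusCode, operation_succeeded: bool) -> dict:
    """Hypothetical translation of a finished span into request telemetry fields."""
    # From the PR: an Error span status is reported as HTTP 500.
    # Assumption: any other status is reported as 200.
    response_code = 500 if span_status is ActivityStatusCode.ERROR else 200
    # Assumption: the success flag comes from the operation outcome, not the span status.
    return {"response_code": response_code, "succeeded": operation_succeeded}


# Before the fix: the activity succeeded, but its span still held a stale Error.
before = to_request_telemetry(ActivityStatusCode.ERROR, operation_succeeded=True)
# After the fix: the dispatcher sets Ok on success, so the two fields agree.
after = to_request_telemetry(ActivityStatusCode.OK, operation_succeeded=True)
print(before)
print(after)
```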
Root Cause
The dispatchers in DurableTask.Core were setting ActivityStatusCode.Error when exceptions occurred (which is correct), but they never reset the status to Ok when the activity or orchestration completed successfully. This meant any intermediate error status from: ... would persist to the final telemetry even when the overall operation succeeded.
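The root cause can be reproduced with a hypothetical pre-fix dispatcher written in Python. The Span class and function names are illustrative stand-ins rather than DurableTask.Core APIs; the sketch only shows that a dispatcher which writes Error on failure but never writes Ok on success lets a status set by user code survive to the end of the span.

```python
from enum import Enum
from typing import Callable


class ActivityStatusCode(Enum):
    # Mirrors the Unset/Ok/Error members of .NET's ActivityStatusCode.
    UNSET = 0
    OK = 1
    ERROR = 2


class Span:
    def __init__(self) -> None:
        self.status = ActivityStatusCode.UNSET


def dispatch_before_fix(span: Span, user_code: Callable[[Span], None]) -> bool:
    """Hypothetical pre-fix dispatcher: returns True if user code succeeded."""
    try:
        user_code(span)
    except Exception:
        span.status = ActivityStatusCode.ERROR  # correct for real failures
        return False
    # Missing step: no Ok is written here, so any earlier Error survives.
    return True


def logs_and_continues(span: Span) -> None:
    try:
        int("not a number")  # raises ValueError
    except ValueError:
        # Intermediate error status from custom instrumentation; the error is handled.
        span.status = ActivityStatusCode.ERROR


span = Span()
succeeded = dispatch_before_fix(span, logs_and_continues)
print(succeeded, span.status)
```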
Solution
The fix ensures that all dispatcher paths explicitly set ActivityStatusCode.Ok for successful completions, overriding any stale error status from intermediate exception handling.
Changes
- TaskActivityDispatcher.cs: Before stopping the trace activity, explicitly set ActivityStatusCode.Ok when the response is a TaskCompletedEvent (successful completion).
- TaskOrchestrationDispatcher.cs: Added a centralized SetOrchestrationActivityStatus() helper that sets the appropriate status based on orchestration outcome:
  - Completed/ContinuedAsNew → ActivityStatusCode.Ok
  - Failed/Terminated → ActivityStatusCode.Error
- TraceHelper.cs: Updated EndActivitiesForProcessingEntityInvocation() to set an explicit Ok/Error status for entity operation spans.

Why This Is The Right Solution
Follows OpenTelemetry semantics: The final span status should reflect the actual outcome of the operation, not intermediate states. Explicitly setting Ok on success is consistent with the OpenTelemetry status model, in which Ok marks an operation as having completed successfully.
Minimal and targeted: The fix only modifies the final status assignment at completion time, preserving all existing error-path behavior.
Preserves true failures: Activities and orchestrations that actually fail (unhandled exceptions, explicit failures) still correctly report Error status, because the success status reset only happens for TaskCompletedEvent and OrchestrationStatus.Completed/ContinuedAsNew.

Covers all dispatcher types: The fix addresses activities, orchestrations, and entities, which are all the paths that generate telemetry spans.
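On the orchestration side, the outcome is keyed on the final orchestration status. The Python sketch below is a hypothetical analogue of the SetOrchestrationActivityStatus() helper, not its actual code: the function name, its parameters, and the choice to leave other statuses (such as Running) untouched are assumptions. It illustrates why true failures are preserved: only Completed and ContinuedAsNew write Ok, while Failed and Terminated write Error.

```python
from enum import Enum
from typing import Optional


class ActivityStatusCode(Enum):
    # Mirrors the Unset/Ok/Error members of .NET's ActivityStatusCode.
    UNSET = 0
    OK = 1
    ERROR = 2


class OrchestrationStatus(Enum):
    # A subset of DurableTask.Core's OrchestrationStatus values.
    RUNNING = "Running"
    COMPLETED = "Completed"
    CONTINUED_AS_NEW = "ContinuedAsNew"
    FAILED = "Failed"
    TERMINATED = "Terminated"


class Span:
    def __init__(self, status: ActivityStatusCode = ActivityStatusCode.UNSET) -> None:
        self.status = status


SUCCESS_STATES = {OrchestrationStatus.COMPLETED, OrchestrationStatus.CONTINUED_AS_NEW}
FAILURE_STATES = {OrchestrationStatus.FAILED, OrchestrationStatus.TERMINATED}


def set_orchestration_activity_status(span: Optional[Span], status: OrchestrationStatus) -> None:
    """Hypothetical analogue of SetOrchestrationActivityStatus()."""
    if span is None:
        return  # no span to update (for example, tracing is off)
    if status in SUCCESS_STATES:
        span.status = ActivityStatusCode.OK  # overrides any stale Error
    elif status in FAILURE_STATES:
        span.status = ActivityStatusCode.ERROR
    # Assumption: any other status leaves the span as it is.


stale = Span(ActivityStatusCode.ERROR)
set_orchestration_activity_status(stale, OrchestrationStatus.COMPLETED)
terminated = Span()
set_orchestration_activity_status(terminated, OrchestrationStatus.TERMINATED)
print(stale.status, terminated.status)
```

Centralizing the mapping in one helper means every completion path agrees on which outcomes count as success.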
Testing
Added regression tests that simulate the exact issue scenario:
- ActivityAndOrchestrationSpansResetStatuses: Integration test in which the activity and orchestration set Error status during execution, then verifies that their spans end with Status == Ok.
- TraceHelperTests: Unit tests for entity invocation status handling:
  - EndActivitiesForEntityInvocationResetsSuccessfulStatus: Verifies Ok status on success
  - EndActivitiesForEntityInvocationMarksFailures: Verifies Error status on failure

All 94 tests pass.
Checklist
dotnet test