@eftiquar
Contributor

@eftiquar eftiquar commented Nov 7, 2025

Stack Capture Engine: Design Overview

Introduction

The Stack Capture Engine enables safe, deadlock-free continuous profiling of .NET Framework applications by capturing managed call stacks from running threads. This document describes the core design principles that make this possible, focusing on two critical mechanisms: the Canary Thread pattern for runtime safety detection and RTL-based Stack Seeding for accurate context preparation. The .NET Framework profiling API gives a profiler no way to suspend the runtime as a whole, so the engine suspends individual threads and must employ the safety mechanisms outlined below to prevent application deadlocks.

  1. First, the target thread is suspended so that it cannot progress during the snapshot operation.
  2. Then, a probe snapshot is taken of the Canary Thread, a designated, known thread that is not involved in the critical path of snapshotting.
  3. If the canary probe fails, the suspended target thread is likely holding a resource on the snapshot path, so the stack capture is aborted and the target thread is resumed to avoid deadlock.
  4. Conversely, if the canary probe succeeds while the target thread remains suspended, the engine infers that a snapshot of the target thread will also complete safely.
  5. The Canary Thread thus acts as a first line of defense: stack capture proceeds only when the probe shows it is safe to do so.

1. The Canary Thread Pattern

1.1 The Core Problem

The CLR's DoStackSnapshot API can cause deadlocks when called during unsafe runtime states, such as garbage collection, JIT compilation, or any point at which critical runtime locks are held. Traditional approaches that directly suspend and snapshot threads risk:

  • Suspending threads that hold critical CLR locks
  • Triggering CORPROF_E_STACKSNAPSHOT_UNSAFE errors
  • Causing application hangs or process termination

1.2 The Canary Solution

Rather than blindly attempting to capture stacks, the engine uses a dedicated canary thread as a safety sentinel. This is a known, controlled thread created by the application specifically for profiling purposes, identified by a configurable name prefix (default: "OpenTelemetry Continuous").

How It Works:

Before capturing stacks from any production threads, the engine performs a safety probe using the canary thread:

  1. Suspend the target thread. It is in an arbitrary state and may hold locks that the snapshotting API needs.
  2. Suspend the canary thread (which holds no application locks).
  3. Execute a safety probe on a dedicated worker thread, which keeps the continuous profiler thread itself from deadlocking. The probe:
    • Performs a heap allocation (tests whether heap locks are available)
    • Performs an RTL function lookup (tests whether loader locks are available)
    • Calls DoStackSnapshot on the canary thread (tests CLR profiling API safety)
  4. Resume the canary thread.
  5. Wait for the probe result, up to a timeout.
  6. If the canary probe succeeds, proceed with snapshotting the target thread.
  7. Otherwise, abort the snapshot operation. The target thread is resumed via RAII, so it is resumed on every exit path.

If all probe operations complete successfully within the timeout (default: 250 ms), the runtime is considered safe for capturing production thread stacks. If the probe fails or times out, the engine skips the current capture cycle and waits for the next interval.
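
The worker-plus-timeout step can be sketched in portable C++ as follows. This is a minimal illustration rather than the engine's code: RunProbeWithTimeout and ProbeResult are hypothetical names, and the probe callable stands in for the heap allocation, RTL lookup, and DoStackSnapshot call made against the canary.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical result type for this sketch.
enum class ProbeResult { Safe, Failed, TimedOut };

// Runs `probe` on a detached worker thread and waits at most `timeout`.
// The promise is shared so that a worker that finishes late (or never)
// cannot touch state owned by the caller.
ProbeResult RunProbeWithTimeout(std::function<bool()> probe,
                                std::chrono::milliseconds timeout) {
    assert(probe && "a probe callable is required");
    auto promise = std::make_shared<std::promise<bool>>();
    std::future<bool> result = promise->get_future();

    std::thread([promise, probe = std::move(probe)] {
        try {
            promise->set_value(probe());
        } catch (...) {
            promise->set_value(false);  // a throwing probe counts as a failure
        }
    }).detach();

    if (result.wait_for(timeout) != std::future_status::ready) {
        return ProbeResult::TimedOut;   // caller should skip this capture cycle
    }
    return result.get() ? ProbeResult::Safe : ProbeResult::Failed;
}
```

The worker is detached on purpose in this sketch: if the probe blocks indefinitely, the caller still returns after the timeout and can resume the suspended threads.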

Key Safety Properties:

  • Timeout Protection: The probe operations execute on a worker thread with a configurable timeout, preventing indefinite blocking
  • SEH Protection: All probe operations are wrapped in Structured Exception Handling (__try/__except) to catch access violations gracefully
  • RAII Guarantees: Thread suspension/resumption uses RAII patterns to ensure threads are always resumed, even in exception scenarios
  • Isolated Testing: The canary thread performs no application work and holds no locks, making it safe to suspend and test
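
The RAII guarantee can be illustrated with a small scope guard. This is a sketch under assumptions: ScopedSuspend is a hypothetical name, and the two callables are placeholders for whatever suspends and resumes the thread (on Windows they would typically wrap the Win32 SuspendThread and ResumeThread calls).

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical scope guard: suspends on construction, resumes on destruction.
class ScopedSuspend {
public:
    ScopedSuspend(const std::function<bool()>& suspend,
                  std::function<void()> resume)
        : resume_(std::move(resume)) {
        assert(suspend && resume_);
        suspended_ = suspend();  // remember whether there is anything to undo
    }

    // Runs on every exit path, including stack unwinding after an exception.
    ~ScopedSuspend() {
        if (suspended_) {
            resume_();
        }
    }

    ScopedSuspend(const ScopedSuspend&) = delete;
    ScopedSuspend& operator=(const ScopedSuspend&) = delete;

    bool suspended() const { return suspended_; }

private:
    std::function<void()> resume_;
    bool suspended_ = false;
};
```

Because the resume call lives in the destructor, an early return or a thrown exception inside the guarded scope still resumes the thread, and a failed suspension is never "undone" by a stray resume.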

1.3 Canary Thread Lifecycle

The engine tracks all managed threads through profiler callbacks (ThreadNameChanged, ThreadAssignedToOSThread). When a thread with the canary name prefix is detected:

  1. The engine registers both its managed thread ID and native OS thread ID
  2. The capture loop is notified via a condition variable
  3. All subsequent capture cycles use this thread for safety probes

If the canary thread is destroyed, the engine clears its registration and waits for a new canary to be designated before resuming captures.
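
A minimal sketch of this lifecycle, assuming a registry object fed by the profiler callbacks. CanaryRegistry, its method names, and the ID types are hypothetical, and the sketch assumes for simplicity that the OS thread assignment is seen before the name change.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

struct Canary {
    std::uint64_t managedId;   // managed thread ID reported by the runtime
    std::uint32_t osThreadId;  // native OS thread ID
};

class CanaryRegistry {
public:
    explicit CanaryRegistry(std::string prefix = "OpenTelemetry Continuous")
        : prefix_(std::move(prefix)) {
        assert(!prefix_.empty());
    }

    void OnThreadAssignedToOSThread(std::uint64_t managedId, std::uint32_t osId) {
        std::lock_guard<std::mutex> lock(mutex_);
        osIds_[managedId] = osId;
    }

    void OnThreadNameChanged(std::uint64_t managedId, const std::string& name) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (name.compare(0, prefix_.size(), prefix_) != 0) return;  // not a canary
        auto it = osIds_.find(managedId);
        if (it == osIds_.end()) return;  // simplification: OS ID must be known
        canary_ = Canary{managedId, it->second};
        cv_.notify_all();                // wake the capture loop
    }

    void OnThreadDestroyed(std::uint64_t managedId) {
        std::lock_guard<std::mutex> lock(mutex_);
        osIds_.erase(managedId);
        if (canary_ && canary_->managedId == managedId) canary_.reset();
    }

    // Blocks until a canary is registered or the timeout elapses.
    std::optional<Canary> WaitForCanary(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait_for(lock, timeout, [this] { return canary_.has_value(); });
        return canary_;
    }

private:
    std::string prefix_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::map<std::uint64_t, std::uint32_t> osIds_;
    std::optional<Canary> canary_;
};
```

The capture loop would call WaitForCanary at the start of each cycle and skip the cycle when it returns an empty value.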


2. RTL-Based Stack Seeding

2.1 The Context Preparation Challenge

The CLR's DoStackSnapshot API requires a valid starting context pointing to managed code. However, when a thread is suspended for profiling, its instruction pointer (RIP on x64) may be in:

  • Native Win32 APIs (e.g., Sleep, WaitForSingleObject)
  • P/Invoke transitions
  • COM/WinRT interop layers
  • Native library code

Passing a context pointing to native code causes DoStackSnapshot to fail or return incomplete stacks. The engine must walk the native stack frames to locate the first managed frame before invoking DoStackSnapshot.

2.2 The RTL Function Solution

Windows provides low-level Runtime Library (RTL) functions for exception handling and stack unwinding:

  • RtlLookupFunctionEntry: Retrieves unwind metadata for a given instruction pointer
  • RtlVirtualUnwind: Simulates stack unwinding using function metadata, updating the context to the caller's frame

These functions are the same primitives the CLR uses internally for exception handling, making them reliable and accurate.

The Seeding Algorithm:

  1. Quick Check: Test if the current instruction pointer is already managed code via GetFunctionFromIP
  2. Native Stack Walk: If in native code, iterate through stack frames:
    • Use RtlLookupFunctionEntry to get unwind metadata for the current RIP
    • If metadata exists (non-leaf function): Extract the function's begin address and use RtlVirtualUnwind to update the context to the caller frame
    • If no metadata (leaf function): Manually read the return address from the stack pointer ([RSP]) and adjust RSP
    • After each frame transition, test the instruction pointer with GetFunctionFromIP to check if we've reached managed code
  3. Termination: Stop when:
    • Managed code is found (success)
    • RIP becomes zero (end of stack)
    • Stack pointer stops progressing (corruption detection)
    • Maximum frame count exceeded (safety limit: 10,000 frames)
  4. Seed DoStackSnapshot: Pass the prepared context (now pointing to a managed frame) to DoStackSnapshot
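
The walk and its termination conditions can be sketched as a loop over two callbacks. This is an illustration under assumptions, not the engine's code: SeedContext, SeedResult, and the simplified Context are hypothetical, isManaged stands in for a GetFunctionFromIP check, and unwindOneFrame stands in for the RtlLookupFunctionEntry/RtlVirtualUnwind (or leaf-pop) step. For brevity the sketch tests the raw instruction pointer rather than a function begin address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Simplified register context: only the two registers the walk needs.
struct Context {
    std::uint64_t rip;
    std::uint64_t rsp;
};

enum class SeedResult {
    ManagedFound, EndOfStack, StackNotProgressing, FrameLimitExceeded, UnwindFailed
};

SeedResult SeedContext(Context& ctx,
                       const std::function<bool(std::uint64_t)>& isManaged,
                       const std::function<bool(Context&)>& unwindOneFrame,
                       std::size_t maxFrames = 10000) {
    assert(isManaged && unwindOneFrame);

    // Quick check: the thread may already be stopped in managed code.
    if (isManaged(ctx.rip)) return SeedResult::ManagedFound;

    for (std::size_t frame = 0; frame < maxFrames; ++frame) {
        const std::uint64_t previousRsp = ctx.rsp;
        if (!unwindOneFrame(ctx)) return SeedResult::UnwindFailed;
        if (ctx.rip == 0) return SeedResult::EndOfStack;
        // Unwinding must move RSP toward higher addresses; anything else
        // is treated as a corrupted stack.
        if (ctx.rsp <= previousRsp) return SeedResult::StackNotProgressing;
        if (isManaged(ctx.rip)) return SeedResult::ManagedFound;
    }
    return SeedResult::FrameLimitExceeded;
}
```

Only a ManagedFound result leaves ctx suitable as a seed; every other result means the thread is skipped for this cycle.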

2.3 Critical Design Details

Function Begin Address vs. Current RIP:

The CLR's metadata associates managed functions with their entry points (begin addresses), not arbitrary instruction pointers mid-function. When unwind metadata is available, the engine uses imageBase + runtimeFunction->BeginAddress for managed detection, not the current RIP. This ensures reliable GetFunctionFromIP lookups.
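
The address computation can be sketched with a simulated function table. FunctionEntry mirrors only the begin/end offsets of the Windows RUNTIME_FUNCTION record, and LookupFunctionEntry is a hypothetical stand-in for RtlLookupFunctionEntry over a table sorted by begin address; neither is the engine's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Offsets are relative to the image base, as in RUNTIME_FUNCTION.
struct FunctionEntry {
    std::uint32_t beginAddress;  // offset of the first instruction
    std::uint32_t endAddress;    // offset one past the last instruction
};

// Finds the entry whose [begin, end) range contains `rip`, if any.
std::optional<FunctionEntry> LookupFunctionEntry(
        const std::vector<FunctionEntry>& table,
        std::uint64_t imageBase, std::uint64_t rip) {
    assert(std::is_sorted(table.begin(), table.end(),
        [](const FunctionEntry& a, const FunctionEntry& b) {
            return a.beginAddress < b.beginAddress;
        }));
    if (rip < imageBase) return std::nullopt;
    const std::uint64_t offset = rip - imageBase;
    auto it = std::upper_bound(table.begin(), table.end(), offset,
        [](std::uint64_t value, const FunctionEntry& e) {
            return value < e.beginAddress;
        });
    if (it == table.begin()) return std::nullopt;
    --it;  // last entry starting at or before `offset`
    if (offset >= it->endAddress) return std::nullopt;
    return *it;
}

// Address to hand to the managed-code check: the function's begin address
// when metadata exists, otherwise the instruction pointer itself.
std::uint64_t AddressForManagedCheck(const std::vector<FunctionEntry>& table,
                                     std::uint64_t imageBase, std::uint64_t rip) {
    if (auto entry = LookupFunctionEntry(table, imageBase, rip)) {
        return imageBase + entry->beginAddress;
    }
    return rip;
}
```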

Leaf Function Handling:

Leaf functions (functions with no stack frame) lack unwind metadata. The engine detects this case and manually pops the return address from the stack:

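returnAddress = *(uint64_t*)RSP  // Return address sits at the top of the stack
RIP = returnAddress              // Continue the walk in the caller
RSP += 8                         // Advance past the return address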
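
The same pop can be sketched against simulated stack memory. StackMemory and PopLeafFrame are hypothetical names for this illustration; the bounds-checked Read stands in for an SEH-guarded read of the suspended thread's stack.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Context {
    std::uint64_t rip;
    std::uint64_t rsp;
};

// Simulated stack: words[i] lives at address base + 8 * i.
struct StackMemory {
    std::uint64_t base;
    std::vector<std::uint64_t> words;

    bool Read(std::uint64_t address, std::uint64_t& value) const {
        assert(base % 8 == 0);
        if (address < base || (address - base) % 8 != 0) return false;
        const std::uint64_t index = (address - base) / 8;
        if (index >= words.size()) return false;
        value = words[index];
        return true;
    }
};

// Pops one leaf frame: the return address at [RSP] becomes the new RIP.
// The context is left untouched when the read fails.
bool PopLeafFrame(Context& ctx, const StackMemory& stack) {
    std::uint64_t returnAddress = 0;
    if (!stack.Read(ctx.rsp, returnAddress)) return false;
    ctx.rip = returnAddress;
    ctx.rsp += sizeof(std::uint64_t);
    return true;
}
```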

Stack Corruption Detection:

The engine tracks the stack pointer (RSP) across frames. Each unwound frame must leave RSP at a higher address than before; if RSP stays the same or moves backward, the stack is considered corrupted and the walk terminates to prevent crashes.

SEH Protection:

All memory reads and RTL function calls are wrapped in Structured Exception Handling. If an access violation occurs (invalid memory, corrupted unwind metadata), the operation fails gracefully without crashing the application.

2.4 Why This Matters

Without accurate seeding:

  • P/Invoke-heavy applications would have incomplete stack traces
  • Threads blocked in native APIs would be skipped entirely
  • Async/await state machine transitions might be missed
  • Profiling data would have significant blind spots

With RTL-based seeding:

  • Stacks are accurately captured regardless of where threads are suspended
  • Native-to-managed transitions are handled correctly
  • The engine works reliably across diverse application patterns (sync, async, P/Invoke, COM)

3. How They Work Together

3.1 The Capture Flow

For each profiling cycle:

  1. Wait for Canary Availability: Block until a canary thread is registered (or timeout)
  2. For Each Target Thread:
    • Skip the canary thread itself
    • Suspend the target thread (RAII-based)
    • Safety Probe: Suspend canary → test operations → resume canary
    • If probe succeeds:
      • Capture target thread context
      • Seed the context (walk native frames to find managed code)
      • Invoke DoStackSnapshot with the seeded context
      • Collect managed stack frames
    • If probe fails: Skip this thread and continue to next
    • Resume the target thread (automatic via RAII)
  3. Repeat at the configured interval (default: 1 second)
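
One cycle of this flow can be sketched with injected hooks. CaptureHooks and CaptureCycle are hypothetical names; the hooks stand in for thread suspension, the canary probe, and the seed-then-DoStackSnapshot step, so the sketch shows only the control flow.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

using ThreadId = std::uint32_t;
using Stack = std::vector<std::uint64_t>;  // captured frame addresses

struct CaptureHooks {
    std::function<bool(ThreadId)> suspend;      // true if the thread was suspended
    std::function<void(ThreadId)> resume;
    std::function<bool(ThreadId)> probeCanary;  // suspend canary, test, resume canary
    std::function<Stack(ThreadId)> snapshot;    // seed context and walk the stack
};

std::map<ThreadId, Stack> CaptureCycle(const std::vector<ThreadId>& threads,
                                       ThreadId canary,
                                       const CaptureHooks& hooks) {
    assert(hooks.suspend && hooks.resume && hooks.probeCanary && hooks.snapshot);
    std::map<ThreadId, Stack> stacks;
    for (ThreadId target : threads) {
        if (target == canary) continue;         // never profile the canary itself
        if (!hooks.suspend(target)) continue;

        // Resumes the target on every path out of this iteration.
        struct ResumeOnExit {
            const CaptureHooks& hooks;
            ThreadId id;
            ~ResumeOnExit() { hooks.resume(id); }
        } guard{hooks, target};

        if (hooks.probeCanary(canary)) {
            stacks[target] = hooks.snapshot(target);
        }
        // Probe failed: skip this thread; the guard still resumes it.
    }
    return stacks;
}
```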

3.2 Layered Safety

The design provides defense in depth:

| Layer | Protection |
| --- | --- |
| Canary Probe | Detects unsafe runtime states before touching production threads |
| Worker Thread + Timeout | Prevents indefinite blocking in probe operations |
| SEH Wrappers | Catches memory access violations without process termination |
| RAII Thread Suspend | Guarantees thread resumption even on exceptions |
| Stack Walk Limits | Prevents runaway walks from corrupted stacks |
| Context Validation | Verifies managed code is found before calling DoStackSnapshot |

Each layer can independently fail without bringing down the application. The engine simply logs the failure and skips to the next capture cycle.

4. Platform & Technology Constraints

4.1 Windows x64 Only

The RTL functions (RtlLookupFunctionEntry, RtlVirtualUnwind) are Windows-specific APIs. The engine is currently limited to:

  • OS: Windows
  • Architecture: x64 (the x64 ABI requires unwind metadata for every non-leaf function)

@eftiquar eftiquar requested a review from a team as a code owner November 7, 2025 05:35
@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 7, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@pellared
Member

pellared commented Nov 7, 2025

From the contributing docs:

If you would like to work on something that is not listed as an issue, please request a feature first. It is best to do this in advance so that maintainers can decide if the proposal is a good fit for this repository. This will help avoid situations when you spend significant time on something that maintainers may decide this repo is not the right place for.

Given that this is a significant feature, it should be discussed in an issue before it can be accepted.

Changing this PR to draft so that you can reference it (e.g. as a PoC) in the issue.

PS. Please sign the CLA 😉

@pellared pellared marked this pull request as draft November 7, 2025 09:34
@Kielek Kielek marked this pull request as ready for review November 18, 2025 18:28
@nrcventura
Member

The approach looks good to me. It looks like there is some commented out code that could use some cleanup. There should also be a test implemented for the netfx side of things.

@eftiquar
Contributor Author

eftiquar commented Nov 20, 2025

> The approach looks good to me. It looks like there is some commented out code that could use some cleanup. There should also be a test implemented for the netfx side of things.

Thank you for the feedback. Integration tests are under development. There may be some dead or commented-out code; I will clean it up.

@Kielek
Member

Kielek commented Nov 20, 2025

@eftiquar, tests are almost in place. When #4631 is merged, you should be able just to remove conditional compilation from ContinuousProfilerContextTrackingTests, ContinuousProfilerSpanStoppageHandlingTests, and ContinuousProfilerTests.

For now, all of them are red on .NET Fx on your branch locally compiled.

@eftiquar
Contributor Author

> @eftiquar, tests are almost in place. When #4631 is merged, you should be able just to remove conditional compilation from ContinuousProfilerContextTrackingTests, ContinuousProfilerSpanStoppageHandlingTests, and ContinuousProfilerTests.
>
> For now, all of them are red on .NET Fx on your branch locally compiled.

Thanks @Kielek, I have merged your changes into my branch. I will verify the tests.
