Adding support for capturing NetFx call stacks #4591

eftiquar · 2025-11-07T05:35:20Z

Stack Capture Engine: Design Overview

Introduction

The Stack Capture Engine enables continuous profiling of .NET Framework applications by capturing managed call stacks from running threads without deadlocking the application. This document describes the two mechanisms that make this possible: the Canary Thread pattern, which detects whether the runtime is in a safe state, and RTL-based Stack Seeding, which prepares an accurate starting context. Because the .NET Framework runtime cannot be suspended by the profiler, the stack capture implementation must rely on the safety mechanisms outlined below to prevent application deadlocks.

First, the target thread is suspended to prevent it from progressing during the snapshot operation.
Then, a Canary Thread, which is a designated known thread not involved in the critical path of snapshotting, performs a probe (snapshot).
If the Canary Thread probe fails, it indicates that the target thread is likely in a critical path, so the stack capture is aborted, and the target thread is resumed to avoid deadlock.
Conversely, if the Canary Thread snapshot succeeds while the target thread remains suspended, it is inferred that the snapshot on the target thread will also succeed safely.
Thus, the Canary Thread acts as a first line of defense, performing safety checks to ensure that stack capture proceeds only when it is safe to do so.

1. The Canary Thread Pattern

1.1 The Core Problem

The CLR's DoStackSnapshot API can cause deadlocks when called during unsafe runtime states—such as during garbage collection, JIT compilation, or when critical runtime locks are held. Traditional approaches that directly suspend and snapshot threads risk:

Suspending threads that hold critical CLR locks
Triggering CORPROF_E_STACKSNAPSHOT_UNSAFE errors
Causing application hangs or process termination

1.2 The Canary Solution

Rather than blindly attempting to capture stacks, the engine uses a dedicated canary thread as a safety sentinel. This is a known, controlled thread created by the application specifically for profiling purposes, identified by a configurable name prefix (default: "OpenTelemetry Continuous").

How It Works:

Before capturing stacks from any production threads, the engine performs a safety probe using the canary thread:

1. Suspend the target thread. It is in an arbitrary state and may hold locks that the snapshotting API needs.
2. Suspend the canary thread (which holds no application locks).
3. Execute a safety probe on a dedicated worker thread:
- This isolates the continuous profiler thread itself from deadlocking
- A heap allocation is performed (tests whether heap locks are available)
- An RTL function lookup is performed (tests whether loader locks are available)
- DoStackSnapshot is called on the canary itself (tests CLR profiling API safety)
4. Resume the canary thread.
5. Analyze the results with a timeout.
6. If the canary probe succeeds, proceed with snapshotting the target thread. Otherwise, abort the snapshot operation; the target thread is resumed via RAII, so resumption cannot be skipped.

If all probe operations complete successfully within the timeout (default 250ms), the runtime is considered safe for capturing production thread stacks. If the probe fails or times out, the engine skips the current capture cycle and waits for the next interval.
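The timeout behaviour described above can be sketched in portable C++. This is a minimal illustration under stated assumptions, not the engine's code: `RunProbeWithTimeout` and `ProbeResult` are hypothetical names, and the `probe` callable stands in for the heap allocation, RTL lookup, and DoStackSnapshot checks.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical result type for one safety probe.
enum class ProbeResult { Safe, Failed, TimedOut };

// Runs `probe` on a detached worker thread and waits at most `timeout`.
// The promise is shared so it outlives this call if the worker never returns.
ProbeResult RunProbeWithTimeout(std::function<bool()> probe,
                                std::chrono::milliseconds timeout) {
    assert(probe != nullptr);
    auto done = std::make_shared<std::promise<bool>>();
    std::future<bool> outcome = done->get_future();

    std::thread([done, probe = std::move(probe)] {
        bool ok = false;
        try {
            ok = probe();
        } catch (...) {
            ok = false;  // a throwing probe counts as a failed probe
        }
        done->set_value(ok);
    }).detach();

    if (outcome.wait_for(timeout) != std::future_status::ready) {
        return ProbeResult::TimedOut;  // caller skips this capture cycle
    }
    return outcome.get() ? ProbeResult::Safe : ProbeResult::Failed;
}
```

The worker is detached on purpose: if the probe blocks on a runtime lock, joining it would block the profiler thread as well, which is the situation the probe is meant to avoid.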

Key Safety Properties:

Timeout Protection: The probe operations execute on a worker thread with a configurable timeout, preventing indefinite blocking
SEH Protection: All probe operations are wrapped in Structured Exception Handling (__try/__except) to catch access violations gracefully
RAII Guarantees: Thread suspension/resumption uses RAII patterns to ensure threads are always resumed, even in exception scenarios
Isolated Testing: The canary thread performs no application work and holds no locks, making it safe to suspend and test
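The RAII guarantee can be illustrated with a small guard class. This is a sketch, assuming the suspend and resume operations are supplied as callables; on Windows they would typically wrap SuspendThread and ResumeThread, but `ScopedSuspend` itself is a hypothetical name and not taken from the PR.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical guard: suspends in the constructor, resumes in the destructor.
class ScopedSuspend {
public:
    ScopedSuspend(std::function<bool()> suspend, std::function<void()> resume)
        : resume_(std::move(resume)), suspended_(suspend()) {
        assert(resume_ != nullptr);
    }

    // Runs on every exit path from the enclosing scope, including early
    // returns and C++ exceptions, so the thread is not left suspended.
    ~ScopedSuspend() {
        if (suspended_) {
            resume_();
        }
    }

    ScopedSuspend(const ScopedSuspend&) = delete;
    ScopedSuspend& operator=(const ScopedSuspend&) = delete;

    // False when the suspend call itself failed; nothing is resumed then.
    bool suspended() const { return suspended_; }

private:
    std::function<void()> resume_;
    bool suspended_;
};
```

Copying is disabled so that one suspension maps to exactly one resumption.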

1.3 Canary Thread Lifecycle

The engine tracks all managed threads through profiler callbacks (ThreadNameChanged, ThreadAssignedToOSThread). When a thread with the canary name prefix is detected:

The engine registers both its managed thread ID and native OS thread ID
The capture loop is notified via condition variable
All subsequent capture cycles use this thread for safety probes

If the canary thread is destroyed, the engine clears its registration and waits for a new canary to be designated before resuming captures.
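A possible shape for the canary registration is sketched below. It assumes the profiler callbacks forward the thread name and IDs to a small registry; `CanaryRegistry`, `CanaryInfo`, and the method names are hypothetical, and the real callbacks deliver names as wide strings rather than `std::string`.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Hypothetical record of the registered canary thread.
struct CanaryInfo {
    std::uintptr_t managedThreadId;  // runtime thread ID
    std::uint32_t osThreadId;        // native OS thread ID
};

class CanaryRegistry {
public:
    explicit CanaryRegistry(std::string prefix) : prefix_(std::move(prefix)) {
        assert(!prefix_.empty());
    }

    // Registers the thread if its name starts with the canary prefix.
    bool OnThreadNamed(const std::string& name, CanaryInfo info) {
        if (name.compare(0, prefix_.size(), prefix_) != 0) {
            return false;
        }
        {
            std::lock_guard<std::mutex> lock(mutex_);
            canary_ = info;
        }
        changed_.notify_all();  // wake the capture loop
        return true;
    }

    // Clears the registration when the canary thread goes away.
    void OnThreadDestroyed(std::uintptr_t managedThreadId) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (canary_ && canary_->managedThreadId == managedThreadId) {
            canary_.reset();
        }
    }

    // Blocks until a canary is registered or the timeout expires.
    std::optional<CanaryInfo> WaitForCanary(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait_for(lock, timeout, [this] { return canary_.has_value(); });
        return canary_;
    }

private:
    std::string prefix_;
    std::mutex mutex_;
    std::condition_variable changed_;
    std::optional<CanaryInfo> canary_;
};
```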

2. RTL-Based Stack Seeding

2.1 The Context Preparation Challenge

The CLR's DoStackSnapshot API requires a valid starting context pointing to managed code. However, when a thread is suspended for profiling, its instruction pointer (RIP on x64) may be in:

Native Win32 APIs (e.g., Sleep, WaitForSingleObject)
P/Invoke transitions
COM/WinRT interop layers
Native library code

Passing a context pointing to native code causes DoStackSnapshot to fail or return incomplete stacks. The engine must walk the native stack frames to locate the first managed frame before invoking DoStackSnapshot.

2.2 The RTL Function Solution

Windows provides low-level Runtime Library (RTL) functions for exception handling and stack unwinding:

RtlLookupFunctionEntry: Retrieves unwind metadata for a given instruction pointer
RtlVirtualUnwind: Simulates stack unwinding using function metadata, updating the context to the caller's frame

These functions are the same primitives the CLR uses internally for exception handling, making them reliable and accurate.

The Seeding Algorithm:

Quick Check: Test if the current instruction pointer is already managed code via GetFunctionFromIP
Native Stack Walk: If in native code, iterate through stack frames:
- Use RtlLookupFunctionEntry to get unwind metadata for the current RIP
- If metadata exists (non-leaf function): Extract the function's begin address and use RtlVirtualUnwind to update the context to the caller frame
- If no metadata (leaf function): Manually read the return address from the stack pointer ([RSP]) and adjust RSP
- After each frame transition, test the instruction pointer with GetFunctionFromIP to check if we've reached managed code
Termination: Stop when:
- Managed code is found (success)
- RIP becomes zero (end of stack)
- Stack pointer stops progressing (corruption detection)
- Maximum frame count exceeded (safety limit: 10,000 frames)
Seed DoStackSnapshot: Pass the prepared context (now pointing to a managed frame) to DoStackSnapshot
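The control flow of the seeding algorithm can be sketched as follows. To keep the example portable, the Windows and CLR calls are replaced by callables: `isManaged` stands in for GetFunctionFromIP, `virtualUnwind` for the RtlLookupFunctionEntry plus RtlVirtualUnwind pair (returning false when no unwind metadata exists), and `readU64` for a guarded memory read. All type and function names here are hypothetical, and the sketch tests the unwound instruction pointer directly rather than a function begin address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

struct Context {
    std::uint64_t rip;  // instruction pointer
    std::uint64_t rsp;  // stack pointer
};

enum class SeedStatus { Managed, EndOfStack, StackNotAdvancing, FrameLimit, ReadFailed };

struct Unwinder {
    std::function<bool(std::uint64_t)> isManaged;
    std::function<bool(Context&)> virtualUnwind;
    std::function<bool(std::uint64_t, std::uint64_t&)> readU64;
};

// Walks native frames until `ctx` points at managed code or a stop condition hits.
SeedStatus SeedContext(Context& ctx, const Unwinder& u, std::size_t maxFrames = 10000) {
    assert(u.isManaged != nullptr && u.virtualUnwind != nullptr && u.readU64 != nullptr);
    if (u.isManaged(ctx.rip)) {
        return SeedStatus::Managed;  // quick check: already in managed code
    }
    for (std::size_t frame = 0; frame < maxFrames; ++frame) {
        const std::uint64_t previousRsp = ctx.rsp;
        if (!u.virtualUnwind(ctx)) {
            // Leaf function: the return address sits at [RSP].
            std::uint64_t returnAddress = 0;
            if (!u.readU64(ctx.rsp, returnAddress)) {
                return SeedStatus::ReadFailed;
            }
            ctx.rip = returnAddress;
            ctx.rsp += 8;
        }
        if (ctx.rip == 0) {
            return SeedStatus::EndOfStack;
        }
        if (ctx.rsp <= previousRsp) {
            return SeedStatus::StackNotAdvancing;  // treated as corruption
        }
        if (u.isManaged(ctx.rip)) {
            return SeedStatus::Managed;
        }
    }
    return SeedStatus::FrameLimit;
}
```

Only the `Managed` outcome would lead to a DoStackSnapshot call; every other status means the thread is skipped for this cycle.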

2.3 Critical Design Details

Function Begin Address vs. Current RIP:

The CLR's metadata associates managed functions with their entry points (begin addresses), not arbitrary instruction pointers mid-function. When unwind metadata is available, the engine uses imageBase + runtimeFunction->BeginAddress for managed detection, not the current RIP. This ensures reliable GetFunctionFromIP lookups.
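To make the begin-address computation concrete, the sketch below models a function table lookup over a sorted array. The three-field layout mirrors the x64 RUNTIME_FUNCTION entry (addresses relative to the image base); the binary search is only an assumed stand-in for what RtlLookupFunctionEntry returns, and `LookupEntry` and `FunctionBegin` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Same field layout as an x64 RUNTIME_FUNCTION entry; all values are RVAs.
struct RuntimeFunction {
    std::uint32_t BeginAddress;
    std::uint32_t EndAddress;
    std::uint32_t UnwindData;
};

// Returns the entry whose [BeginAddress, EndAddress) range contains `ip`,
// or nullptr when there is none (for example, a leaf function).
const RuntimeFunction* LookupEntry(const std::vector<RuntimeFunction>& table,
                                   std::uint64_t imageBase, std::uint64_t ip) {
    assert(std::is_sorted(table.begin(), table.end(),
                          [](const RuntimeFunction& a, const RuntimeFunction& b) {
                              return a.BeginAddress < b.BeginAddress;
                          }));
    if (ip < imageBase) {
        return nullptr;
    }
    const std::uint64_t rva = ip - imageBase;
    auto it = std::upper_bound(table.begin(), table.end(), rva,
                               [](std::uint64_t value, const RuntimeFunction& fn) {
                                   return value < fn.BeginAddress;
                               });
    if (it == table.begin()) {
        return nullptr;
    }
    --it;
    return rva < it->EndAddress ? &*it : nullptr;
}

// The address used for managed detection: function entry, not the current RIP.
std::uint64_t FunctionBegin(std::uint64_t imageBase, const RuntimeFunction& fn) {
    return imageBase + fn.BeginAddress;
}
```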

Leaf Function Handling:

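Leaf functions (functions that neither call other functions nor allocate stack space or modify non-volatile registers) have no unwind metadata. The engine detects this case and manually pops the return address from the stack:

returnAddress = *(uint64_t*)RSP  // the return address sits at the top of the stack
RIP = returnAddress              // continue the walk in the caller
RSP += 8                         // advance past the return address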

Stack Corruption Detection:

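The engine tracks the stack pointer (RSP) across frames. Because the x64 stack grows downward, unwinding a frame must move RSP to a higher address. If RSP stays the same or decreases after a frame transition, the stack is considered corrupted and the walk terminates to prevent crashes.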

SEH Protection:

All memory reads and RTL function calls are wrapped in Structured Exception Handling. If an access violation occurs (invalid memory, corrupted unwind metadata), the operation fails gracefully without crashing the application.
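A guarded read of this kind might look like the sketch below. The `__try/__except` branch is compiled only with MSVC-compatible compilers; the fallback branch performs an unguarded copy and exists solely so the sketch builds on other toolchains. `TryReadU64` and `PopReturnAddress` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(_MSC_VER)
#include <windows.h>
#endif

// Reads 8 bytes from `address`; returns false instead of faulting (MSVC only).
bool TryReadU64(const void* address, std::uint64_t& out) {
    if (address == nullptr) {
        return false;
    }
#if defined(_MSC_VER)
    __try {
        out = *static_cast<const volatile std::uint64_t*>(address);
        return true;
    } __except (GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                    ? EXCEPTION_EXECUTE_HANDLER
                    : EXCEPTION_CONTINUE_SEARCH) {
        return false;  // invalid memory: report failure, do not crash
    }
#else
    // Assumption for non-Windows builds only: no SEH, so this copy is unguarded.
    std::memcpy(&out, address, sizeof(out));
    return true;
#endif
}

// Leaf-frame pop built on the guarded read: RIP <- [RSP], RSP += 8.
bool PopReturnAddress(std::uint64_t& rip, std::uint64_t& rsp) {
    std::uint64_t returnAddress = 0;
    const void* top = reinterpret_cast<const void*>(static_cast<std::uintptr_t>(rsp));
    if (!TryReadU64(top, returnAddress)) {
        return false;
    }
    rip = returnAddress;
    rsp += sizeof(std::uint64_t);
    return true;
}
```

The filter expression handles access violations only and lets any other exception continue its normal search.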

2.4 Why This Matters

Without accurate seeding:

P/Invoke-heavy applications would have incomplete stack traces
Threads blocked in native APIs would be skipped entirely
Async/await state machine transitions might be missed
Profiling data would have significant blind spots

With RTL-based seeding:

Stacks are accurately captured regardless of where threads are suspended
Native-to-managed transitions are handled correctly
The engine works reliably across diverse application patterns (sync, async, P/Invoke, COM)

3. How They Work Together

3.1 The Capture Flow

For each profiling cycle:

Wait for Canary Availability: Block until a canary thread is registered (or timeout)
For Each Target Thread:
- Skip the canary thread itself
- Suspend the target thread (RAII-based)
- Safety Probe: Suspend canary → test operations → resume canary
- If probe succeeds:
  - Capture target thread context
  - Seed the context (walk native frames to find managed code)
  - Invoke DoStackSnapshot with the seeded context
  - Collect managed stack frames
- If probe fails: Skip this thread and continue to next
- Resume the target thread (automatic via RAII)
Repeat at the configured interval (default: 1 second)
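One capture cycle can be sketched as a loop over thread IDs, with every platform operation injected as a callable. This is an illustration of the flow above under stated assumptions: `ThreadOps`, `CycleResult`, and `RunCaptureCycle` are hypothetical names, `probe` covers the whole canary suspend, test, and resume sequence, and `snapshot` covers context capture, seeding, and DoStackSnapshot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

struct ThreadOps {
    std::function<bool(std::uint32_t)> suspend;   // suspend target thread
    std::function<void(std::uint32_t)> resume;    // resume target thread
    std::function<bool(std::uint32_t)> probe;     // canary safety probe
    std::function<std::vector<std::uint64_t>(std::uint32_t)> snapshot;
};

struct CycleResult {
    std::map<std::uint32_t, std::vector<std::uint64_t>> stacks;
    std::size_t skipped = 0;
};

CycleResult RunCaptureCycle(const std::vector<std::uint32_t>& threads,
                            std::uint32_t canary, const ThreadOps& ops) {
    assert(ops.suspend != nullptr && ops.resume != nullptr);
    assert(ops.probe != nullptr && ops.snapshot != nullptr);
    CycleResult result;
    for (std::uint32_t tid : threads) {
        if (tid == canary) {
            continue;  // never sample the canary itself
        }
        if (!ops.suspend(tid)) {
            ++result.skipped;
            continue;
        }
        // Local guard: resumes the target on every path out of this iteration.
        struct ResumeOnExit {
            const ThreadOps& ops;
            std::uint32_t tid;
            ~ResumeOnExit() { ops.resume(tid); }
        } resumeOnExit{ops, tid};

        if (!ops.probe(canary)) {
            ++result.skipped;  // runtime looks unsafe: skip this thread
            continue;
        }
        result.stacks[tid] = ops.snapshot(tid);
    }
    return result;
}
```

The probe runs while the target is already suspended, which matches the ordering in the flow: a lock held by the suspended target is what the probe is meant to detect.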

3.2 Layered Safety

The design provides defense in depth:

Layer	Protection
Canary Probe	Detects unsafe runtime states before touching production threads
Worker Thread + Timeout	Prevents indefinite blocking in probe operations
SEH Wrappers	Catches memory access violations without process termination
RAII Thread Suspend	Guarantees thread resumption even on exceptions
Stack Walk Limits	Prevents runaway walks from corrupted stacks
Context Validation	Verifies managed code is found before calling DoStackSnapshot

Each layer can independently fail without bringing down the application. The engine simply logs the failure and skips to the next capture cycle.

4. Platform & Technology Constraints

4.1 Windows x64 Only

The RTL functions (RtlLookupFunctionEntry, RtlVirtualUnwind) are Windows-specific APIs. The engine is currently limited to:

OS: Windows
Architecture: x64 (unwind metadata availability is guaranteed by the x64 ABI)

linux-foundation-easycla · 2025-11-07T05:35:28Z

The committers listed above are authorized under a signed CLA.

✅ login: eftiquar / name: Eftiquar Shaikh (0b0b446, b7ee8a8)

pellared · 2025-11-07T09:33:18Z

From the contributing docs:

If you would like to work on something that is not listed as an issue, please request a feature first. It is best to do this in advance so that maintainers can decide if the proposal is a good fit for this repository. This will help avoid situations when you spend significant time on something that maintainers may decide this repo is not the right place for.

Given that this is a significant feature, it should be discussed in an issue before it can be accepted.

Changing this PR to draft so that you can reference it (e.g. as a PoC) in the issue.

PS. Please sign the CLA 😉

nrcventura · 2025-11-20T01:22:22Z

The approach looks good to me. It looks like there is some commented out code that could use some cleanup. There should also be a test implemented for the netfx side of things.

eftiquar · 2025-11-20T03:38:04Z

> The approach looks good to me. It looks like there is some commented out code that could use some cleanup. There should also be a test implemented for the netfx side of things.

Thank you for the feedback. Integration tests are under development. There may be some dead or commented-out code; I will clean it up.

Kielek · 2025-11-20T11:02:16Z

@eftiquar, tests are almost in place. When #4631 is merged, you should be able just to remove conditional compilation from ContinuousProfilerContextTrackingTests, ContinuousProfilerSpanStoppageHandlingTests, and ContinuousProfilerTests.

For now, all of them are red on .NET Fx when your branch is compiled locally.

eftiquar · 2025-11-21T04:16:51Z

> @eftiquar, tests are almost in place. When #4631 is merged, you should be able just to remove conditional compilation from ContinuousProfilerContextTrackingTests, ContinuousProfilerSpanStoppageHandlingTests, and ContinuousProfilerTests.
>
> For now, all of them are red on .NET Fx when your branch is compiled locally.

Thanks @Kielek, I have merged your changes into my branch. I will verify the tests.

NetFX-Stack-Capture Adding support for capturing NetFx call stacks

bfdd675

eftiquar requested a review from a team as a code owner November 7, 2025 05:35

Merge branch 'main' into NetFX-Stack-Capture

6acc2ff

pellared marked this pull request as draft November 7, 2025 09:34

eftiquar and others added 3 commits November 17, 2025 21:16

NetFX-Stack-Capture - use logger machinery; remove excessive debug logs done via OutputDebugString

54b0721

Merge branch 'NetFX-Stack-Capture' of github.com:eftiquar/opentelemetry-dotnet-instrumentation into NetFX-Stack-Capture

6e99887

Merge branch 'main' into NetFX-Stack-Capture

18eb604

Kielek marked this pull request as ready for review November 18, 2025 18:28

eftiquar and others added 5 commits November 18, 2025 14:06

NetFX-Stack-Capture merge main

058e872

NetFX-Stack-Capture removed unused variable, annotated the stack seeding function with detailed comments

ee32a92

NetFX-Stack-Capture add logs for critical failures; removed unused functions

f58ea70

NetFX-Stack-Capture - format native code as per the workflow requirements

80e89d0

Merge branch 'main' into NetFX-Stack-Capture

af7c764

Merge branch 'main' into NetFX-Stack-Capture

5a3467f

Kielek mentioned this pull request Nov 20, 2025

[Continuous Profiler] prepare test infrastructure for .NET Fx #4631

Merged

eftiquar added 2 commits November 20, 2025 09:35

Merge branch 'main' into NetFX-Stack-Capture

9171ea2

Merge branch 'main' into NetFX-Stack-Capture

c57c5c2

NetFX-Stack-Capture - enable netfx sampling tests

c7cd58b

Kielek reviewed Nov 21, 2025

test/IntegrationTests/ContinuousProfilerTests.cs (conversation resolved)

eftiquar and others added 2 commits November 21, 2025 10:20

NetFX-Stack-Capture adding test to verify empty allocation samples are reported if allocation sampling is enabled on Net FX

b7ee8a8

Merge branch 'main' into NetFX-Stack-Capture

0b0b446
