Skip to content

Conversation

@fenilfaldu
Copy link
Contributor

📥 Pull Request

📘 Description
While implementing mem0 instrumentation, discovered that multiple isolated traces were being created instead of proper parent-child span relationships within a session. This broke the trace hierarchy and made it difficult to understand the flow of operations.

Root Cause: mem0 (and potentially other LLM libraries) uses ThreadPoolExecutor for concurrent operations, but AgentOps lacked instrumentation for thread pool execution contexts. This caused trace context to be lost when operations moved to background threads, resulting in orphaned spans instead of a cohesive trace tree.

  • Added a new module for instrumenting the concurrent.futures.ThreadPoolExecutor to ensure proper OpenTelemetry context propagation across threads.
  • Introduced the ConcurrentFuturesInstrumentor class, which wraps the ThreadPoolExecutor's initialization and submission methods to maintain context.
  • Enhanced the _should_instrument_package function to include utility instrumentors in the instrumentation decision process.
  • Updated the TARGET_PACKAGES set to include utility instrumentors for comprehensive monitoring.
  • Added unit tests to validate context propagation behavior in concurrent execution scenarios, ensuring spans share trace IDs correctly across threads.

These changes improve the observability and reliability of concurrent operations within the AgentOps SDK.

🧪 Testing
https://gist.github.com/fenilfaldu/e3d2b8479e8a0eee3c10608878e49a2e

@codecov
Copy link

codecov bot commented May 29, 2025

Codecov Report

Attention: Patch coverage is 57.73196% with 41 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...trumentation/concurrent_futures/instrumentation.py 44.59% 41 Missing ⚠️

📢 Thoughts on this report? Let us know!

@bboynton97 bboynton97 self-requested a review May 29, 2025 15:49
tcdent
tcdent previously requested changes Jun 6, 2025
Copy link
Contributor

@tcdent tcdent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Can you merge in #1021 before I approve since there will likely be some conflicts.

I do like having the constants at the top of instrumentation/__init__.py

@fenilfaldu fenilfaldu force-pushed the add-concurrency-instrumentaion branch 3 times, most recently from 744ef47 to 397760c Compare June 6, 2025 21:22
@fenilfaldu fenilfaldu requested a review from tcdent June 7, 2025 19:30
@fenilfaldu fenilfaldu enabled auto-merge (squash) June 11, 2025 01:41
Copy link
Member

@dot-agi dot-agi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My primary concern is this should not be instrumented but rather be a part of the core SDK using proper context management and thread handling.

The SDK is responsible for creating the traces, instrumentation, span management and export so it does not make sense for it to be an instrumentation object. Also, monkey patching is for the external libraries and not for the default internal libraries.

Fundamental design issue here so this must be refactored into the core SDK.

@fenilfaldu
Copy link
Contributor Author

With this instrumentation, context propagation for concurrent.futures only activates when ThreadPoolExecutor is actually used. If we integrate this into the SDK core, the context management overhead would be present for all applications, even those that never use concurrent.futures threading.

The current approach follows the principle of lazy loading - context propagation is only applied when the specific threading mechanism is utilized. This keeps the core SDK lightweight while ensuring robust context handling only where needed.

@dot-agi dot-agi self-requested a review June 12, 2025 15:43
Copy link
Member

@dot-agi dot-agi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Using the code in the gist, I got only 2 LLM calls for the first 2 prompts. Validate this.
  2. with trace context management is now a feature but does not work with this.

@fenilfaldu
Copy link
Contributor Author

fenilfaldu commented Jun 12, 2025

On my end, the code correctly traces all LLM calls across prompt and also with trace context management is working fine might be the dependency issue from your side.
Everything is up and running :)

Screenshot 2025-06-13 at 1 02 11 AM

Screenshot 2025-06-13 at 1 00 35 AM

@dot-agi dot-agi self-requested a review June 12, 2025 20:38
Copy link
Member

@dot-agi dot-agi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We rolling @fenilfaldu 🔥

@dot-agi dot-agi dismissed tcdent’s stale review June 12, 2025 20:38

We need this again Travis

@fenilfaldu fenilfaldu merged commit d81108d into main Jun 12, 2025
10 checks passed
@fenilfaldu fenilfaldu deleted the add-concurrency-instrumentaion branch June 12, 2025 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants