
# Observe Workflows

The NeMo Agent Toolkit uses a flexible, plugin-based observability system that provides comprehensive support for configuring logging, tracing, and metrics for workflows. Users can configure multiple telemetry exporters simultaneously from the available options or create custom integrations. The observability system:

- Uses an event-driven architecture with `IntermediateStepManager` publishing workflow events to a reactive stream
- Supports multiple concurrent telemetry exporters processing events asynchronously
- Provides built-in exporters for popular observability platforms (LangSmith, Phoenix, Langfuse, Weave, etc.)
- Enables custom telemetry exporter development for any observability service

These features enable developers to test their workflows locally and integrate observability seamlessly with their preferred monitoring stack.

## Installation

The core observability features (console and file logging) are included by default. For advanced telemetry features like OpenTelemetry and Phoenix tracing, you need to install the optional telemetry extras.

You can install the telemetry extras with one of the following sets of commands, depending on whether you installed the NeMo Agent Toolkit from source or from a package.

::::{tab-set}
:sync-group: install-tool

:::{tab-item} source
:selected:
:sync: source

```bash
# Install specific telemetry extras
uv pip install -e ".[data-flywheel]"
uv pip install -e ".[opentelemetry]"
uv pip install -e ".[phoenix]"
uv pip install -e ".[weave]"
# Note: conflicts with .[strands] and .[adk]
uv pip install -e ".[ragaai]"
```

:::

:::{tab-item} package
:sync: package

```bash
# Install specific telemetry extras
uv pip install "nvidia-nat[data-flywheel]"
uv pip install "nvidia-nat[opentelemetry]"
uv pip install "nvidia-nat[phoenix]"
uv pip install "nvidia-nat[weave]"
# Note: conflicts with nvidia-nat[strands] and nvidia-nat[adk]
uv pip install "nvidia-nat[ragaai]"
```

:::

::::

## Available Tracing Exporters

The following table lists each exporter with its supported features and configuration guide:

| Provider | Integration Documentation | Supported Features |
|----------|---------------------------|--------------------|
| Catalyst | Observing with Catalyst{.external} | Logging, Tracing |
| NVIDIA Data Flywheel Blueprint | Observing with Data Flywheel{.external} | Logging, Tracing |
| DBNL | Observing with DBNL{.external} | Logging, Tracing |
| Dynatrace | Observing with Dynatrace{.external} | Logging, Tracing |
| Galileo | Observing with Galileo{.external} | Logging, Tracing |
| Langfuse | Refer to the `examples/observability/simple_calculator_observability` example for usage details | Logging, Tracing |
| LangSmith | Observing with LangSmith{.external} | Logging, Tracing, Evaluation Metrics |
| OpenTelemetry Collector | Observing with OTel Collector{.external} | Logging, Tracing |
| Patronus | Refer to the `examples/observability/simple_calculator_observability` example for usage details | Logging, Tracing |
| Phoenix | Observing with Phoenix{.external} | Logging, Tracing |
| W&B Weave | Observing with W&B Weave{.external} | Logging, Tracing, W&B Weave Redaction, Evaluation Metrics |

Additional options:

- File Export - Built-in file-based tracing for local development and debugging
- Custom Exporters - Refer to Adding Telemetry Exporters for creating custom integrations

For complete configuration examples and setup instructions, check the `examples/observability/` directory.

## Configurable Components

The observability system is configured using the `general.telemetry` section in the workflow configuration file. This section contains two subsections, `logging` and `tracing`, and each subsection can contain multiple telemetry exporters running simultaneously.

For a complete list of logging and tracing plugins and their corresponding configuration settings, use the following CLI commands:

```bash
# For all registered logging plugins
nat info components -t logging

# For all registered tracing plugins
nat info components -t tracing
```

Illustrated below is a sample configuration file demonstrating multiple exporters configured to run concurrently.

```yaml
general:
  telemetry:
    logging:
      console:
        _type: console
        level: WARN
      file:
        _type: file
        path: ./.tmp/workflow.log
        level: DEBUG
    tracing:
      # Multiple exporters can run simultaneously
      phoenix:
        _type: phoenix
        # ... configuration fields
      weave:
        _type: weave
        # ... configuration fields
      file_backup:
        _type: file
        # ... configuration fields
```

### Logging Configuration

The `logging` section contains one or more logging providers. Each provider has a `_type` and optional configuration fields. The following logging providers are supported by default:

- `console`: Writes logs to the console.
- `file`: Writes logs to a file.

Available log levels:

- DEBUG: Detailed information for debugging.
- INFO: General information about the workflow.
- WARNING: Potential issues that should be addressed.
- ERROR: Issues that prevent the workflow from running correctly.
- CRITICAL: Severe issues that prevent the workflow from continuing to run.

When a log level is specified, all messages at or above that level are logged. For example, setting the level to WARNING logs WARNING, ERROR, and CRITICAL messages, while setting it to ERROR logs only ERROR and CRITICAL messages.
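These at-or-above semantics match Python's standard `logging` module. The short sketch below uses only the standard library (not the toolkit's exporters) to show the same threshold behavior:

```python
import logging

# Standard-library illustration of threshold filtering; the toolkit's console
# and file logging providers use the same at-or-above semantics.
logger = logging.getLogger("workflow-demo")
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.WARNING)

logger.debug("dropped: below the WARNING threshold")
logger.warning("emitted: at the threshold")
logger.error("emitted: above the threshold")
```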

### Tracing Configuration

The `tracing` section contains one or more tracing providers. Each provider has a `_type` and optional configuration fields. The observability system supports multiple concurrent exporters.

## NeMo Agent Toolkit Observability Components

The NeMo Agent Toolkit observability system uses a generic, plugin-based architecture built on the Subject-Observer pattern. The system consists of several key components working together to provide comprehensive workflow monitoring:

### Event Stream Architecture

- `IntermediateStepManager`: Publishes workflow events (`IntermediateStep` objects) to a reactive event stream, tracking function execution boundaries, LLM calls, tool usage, and intermediate operations.
- Event Stream: A reactive stream that broadcasts `IntermediateStep` events to all subscribed telemetry exporters, enabling real-time observability.
- Asynchronous Processing: All telemetry exporters process events asynchronously in background tasks, keeping observability "off the hot path" for optimal performance.
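As a rough illustration of this Subject-Observer flow, the sketch below uses hypothetical, simplified stand-ins (a trimmed `IntermediateStep` and a synchronous `EventStream`) rather than the toolkit's real classes: a manager publishes one event, and every subscribed exporter receives it.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class IntermediateStep:
    # Trimmed stand-in for the toolkit's IntermediateStep payload.
    name: str
    event_type: str

@dataclass
class EventStream:
    # Subject: broadcasts every published event to all subscribers.
    _subscribers: list[Callable[[IntermediateStep], None]] = field(default_factory=list)

    def subscribe(self, callback: Callable[[IntermediateStep], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, step: IntermediateStep) -> None:
        for callback in self._subscribers:
            callback(step)

# Two "exporters" subscribe to the same stream and both see the event.
seen_by_console: list[IntermediateStep] = []
seen_by_file: list[IntermediateStep] = []
stream = EventStream()
stream.subscribe(seen_by_console.append)
stream.subscribe(seen_by_file.append)
stream.publish(IntermediateStep(name="llm_call", event_type="LLM_END"))
```

The real system delivers events asynchronously in background tasks; the synchronous loop here is only to keep the sketch self-contained.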

### Telemetry Exporter Types

The system supports multiple exporter types, each optimized for different use cases:

- Raw Exporters: Process `IntermediateStep` events directly for simple logging, file output, or custom event processing.
- Span Exporters: Convert events into spans with lifecycle management, ideal for distributed tracing and span-based observability services.
- OpenTelemetry Exporters: Specialized exporters for OTLP-compatible services with pre-built integrations for popular observability platforms.
- Advanced Custom Exporters: Support complex business logic, stateful processing, and enterprise reliability patterns with circuit breakers and dead letter queues.
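To make the raw-versus-span distinction concrete, here is a hedged sketch with hypothetical class names (not the toolkit's actual base classes): the raw exporter records every event as-is, while the span exporter pairs `*_START`/`*_END` events into finished spans.

```python
# Hypothetical sketch -- these classes and method names are illustrative,
# not the toolkit's real exporter API.
class RawExporter:
    """Receives every event as-is."""
    def __init__(self):
        self.events = []

    def export(self, event: dict) -> None:
        self.events.append(event)

class SpanExporter:
    """Pairs *_START/*_END events into spans with a lifecycle."""
    def __init__(self):
        self.open_spans = {}
        self.finished_spans = []

    def export(self, event: dict) -> None:
        key = event["id"]
        if event["type"].endswith("_START"):
            self.open_spans[key] = {"name": event["name"]}
        elif event["type"].endswith("_END"):
            self.finished_spans.append(self.open_spans.pop(key))

raw, spans = RawExporter(), SpanExporter()
for event in [
    {"id": "1", "type": "LLM_START", "name": "llm_call"},
    {"id": "1", "type": "LLM_END", "name": "llm_call"},
]:
    raw.export(event)
    spans.export(event)
# The raw exporter saw two events; the span exporter produced one span.
```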

### Processing Pipeline System

Each exporter can optionally include a processing pipeline that transforms, filters, batches, or aggregates data before export:

- Processors: Modular components for data transformation, filtering, batching, and format conversion.
- Pipeline Composition: Chain multiple processors together for complex data processing workflows.
- Type Safety: Generic type system ensures compile-time safety for data transformations through the pipeline.
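A minimal sketch of typed pipeline composition (illustrative only, not the toolkit's processor API): each processor is a typed transform, and chaining them preserves input/output types end to end.

```python
from typing import Callable, TypeVar

T = TypeVar("T")
U = TypeVar("U")
V = TypeVar("V")

def compose(first: Callable[[T], U], second: Callable[[U], V]) -> Callable[[T], V]:
    # Chaining two typed processors yields another typed processor.
    return lambda value: second(first(value))

# A filter-then-format pipeline over simple dict "events".
def drop_debug(events: list[dict]) -> list[dict]:
    return [e for e in events if e.get("level") != "DEBUG"]

def to_lines(events: list[dict]) -> list[str]:
    return [f'{e["level"]}: {e["name"]}' for e in events]

pipeline = compose(drop_debug, to_lines)
result = pipeline([
    {"level": "DEBUG", "name": "token_count"},
    {"level": "INFO", "name": "llm_call"},
])
# result == ["INFO: llm_call"]
```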

### Integration Components

- {py:class}`nat.plugins.profiler.decorators`: Decorators that wrap workflow and LLM framework context managers to inject usage-collection callbacks.
- {py:class}`~nat.plugins.profiler.callbacks`: Callback handlers that track usage statistics (tokens, time, inputs/outputs) and push them to the event stream. Supports LangChain/LangGraph, LlamaIndex, CrewAI, Semantic Kernel, and Google ADK frameworks.

## Registering a New Telemetry Provider as a Plugin

For complete information about developing and integrating custom telemetry exporters, including detailed examples, best practices, and advanced configuration options, refer to Adding Telemetry Exporters.

## Provider Integration Guides

::::{tab-set}
:sync-group: provider

:::{tab-item} Catalyst
:sync: Catalyst

```{include} ./observe-workflow-with-catalyst.md
```

:::

:::{tab-item} Data Flywheel
:sync: Data-Flywheel

```{include} ./observe-workflow-with-data-flywheel.md
```

:::

:::{tab-item} DBNL
:sync: DBNL

```{include} ./observe-workflow-with-dbnl.md
```

:::

:::{tab-item} Dynatrace
:sync: Dynatrace

```{include} ./observe-workflow-with-dynatrace.md
```

:::

:::{tab-item} Galileo
:sync: Galileo

```{include} ./observe-workflow-with-galileo.md
```

:::

:::{tab-item} LangSmith
:sync: LangSmith

```{include} ./observe-workflow-with-langsmith.md
```

:::

:::{tab-item} OTel Collector
:sync: OTel-collector

```{include} ./observe-workflow-with-otel-collector.md
```

:::

:::{tab-item} Phoenix
:sync: Phoenix

```{include} ./observe-workflow-with-phoenix.md
```

:::

:::{tab-item} W&B Weave
:sync: Wandb-Weave

```{include} ./observe-workflow-with-weave.md
```

:::

::::

## Cross-Workflow Observability

When one workflow invokes another (for example, by calling a remote workflow over HTTP or by running a child workflow programmatically), you can link the trace of the child workflow to the parent so that observability backends show a single, connected tree instead of separate traces.

### Specifying Parent When Running a Workflow Programmatically

If you run a workflow from code using a session, pass `parent_id` and `parent_name` into `session.run()`. The toolkit uses these to set the root of the intermediate steps of the child workflow so the first step has the correct parent.

```python
async with session_manager.session() as session:
    async with session.run(
        prompt,
        parent_id="parent-step-uuid",
        parent_name="Caller Workflow",
    ) as runner:
        result = await runner.result(to_type=str)
```

- `parent_id`: The step ID of the parent (for example, the current workflow step or span that is invoking the child). The root workflow step of the child run is emitted with this as its parent.
- `parent_name`: Optional display name for the parent (for example, the workflow or function name). The function ancestry of the root uses this as the parent name for observability.

### HTTP Headers When Triggering a Workflow

When a workflow is triggered over HTTP (such as a `POST` to `/generate/full`), the server reads request headers to set the parent for that run. If present, they are applied before the workflow starts so the root step has the correct parent.

| Header | Description |
|--------|-------------|
| `workflow-parent-id` | Step ID of the parent. The root workflow step is emitted with this as its parent. |
| `workflow-parent-name` | Optional display name for the parent (workflow or function name). |

Example with curl:

```bash
curl -X POST http://localhost:8000/generate/full \
  -H "workflow-parent-id: <parent-step-id>" \
  -H "workflow-parent-name: Parent Workflow Name" \
  -H "Content-Type: application/json" \
  -d '{"input_message": "..."}'
```

Use these headers when the caller (orchestrator, API gateway, or another workflow) has a step or span ID and wants the child workflow to appear under that step in traces.
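The same headers can be set programmatically. The sketch below builds the request with Python's standard library, mirroring the curl example above; it assumes a server is listening on localhost:8000, so the actual send is left commented out.

```python
import json
import urllib.request

# Mirror the curl example: the two workflow-parent-* headers link the child
# run to the caller's step so traces form one connected tree.
payload = json.dumps({"input_message": "..."}).encode("utf-8")
request = urllib.request.Request(
    "http://localhost:8000/generate/full",
    data=payload,
    headers={
        "workflow-parent-id": "<parent-step-id>",  # caller's step/span ID
        "workflow-parent-name": "Parent Workflow Name",
        "Content-Type": "application/json",
    },
    method="POST",
)
# With a running server, send it with:
# response = urllib.request.urlopen(request)
```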

### Replaying Intermediate Steps from a Remote Workflow

When your workflow calls a remote workflow (for example, by calling its /generate/full endpoint) and receives intermediate step data in the response, you can push those steps into the observability stream of the current run. That way, the steps of the remote workflow appear as part of the same trace tree.

Use the {py:meth}`~nat.builder.intermediate_step_manager.IntermediateStepManager.push_intermediate_steps` method from any code that runs inside the current workflow context. Pass the list of intermediate steps (for example, parsed from the remote response); they are injected into the event stream of the current run. The parent of the replayed root step is determined by how the remote workflow was invoked: set the `workflow-parent-id` and `workflow-parent-name` headers when calling the remote workflow, or use `session.run(parent_id=..., parent_name=...)` when running a child workflow programmatically, so the trace tree links correctly.

```python
from nat.builder.context import Context

# After calling a remote workflow (for example, /generate/full) and parsing
# the response into a list of IntermediateStep:
Context.get().intermediate_step_manager.push_intermediate_steps(remote_intermediate_steps)
```

This is useful when you call a remote workflow and want its steps to appear under the trace of the current workflow in your observability backend, so you get one connected tree for the full request.