[Tracing] ByteStream objects cause oversized payloads in tracing backends #10063

@LastRemote

Description

[Generated by Cursor]

Problem

When using Haystack's tracing feature with components that handle multimodal data (images, audio, video via ByteStream objects), the tracing system serializes the full binary data, causing:

  1. Oversized payloads that exceed backend limits (Langfuse, OpenTelemetry, etc.)
  2. Performance degradation due to serializing/transmitting megabytes of base64 data
  3. Tracing failures when backends reject large payloads
  4. No practical debugging value (you rarely need the full image in traces)

ImageContent may be affected in the same way.

Root Cause

In haystack/tracing/utils.py, the _serializable_value() function calls to_dict() on objects that have it:

def _serializable_value(value: Any) -> Any:
    if isinstance(value, list):
        return [_serializable_value(v) for v in value]

    if isinstance(value, dict):
        return {k: _serializable_value(v) for k, v in value.items()}

    if getattr(value, "to_dict", None):
        return _serializable_value(value.to_dict())  # ⚠️ Problem here

    return value

When a ByteStream (or any object containing one) is traced:

  • ByteStream.to_dict() converts binary data to a list of integers or a base64 string
  • A 1MB image becomes ~1.33MB of serialized data as base64, and larger still as a JSON list of integers
  • This gets multiplied across multiple spans/components
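
To make the size inflation concrete, here is a minimal sketch that measures both encodings mentioned above using only the standard library. The input is synthetic (evenly distributed byte values standing in for image data), so the exact ratio for the integer-list form will differ for real files; the base64 ratio of roughly 4/3 does not depend on the content.

```python
import base64
import json

# Synthetic stand-in for image bytes (assumption: real data would come from a file).
raw = bytes(range(256)) * 4096  # 1 MiB

# The two serialized forms mentioned above, as they would appear in a JSON payload.
as_base64 = json.dumps(base64.b64encode(raw).decode("ascii"))
as_int_list = json.dumps(list(raw))

print(f"raw:      {len(raw):>9,} bytes")
print(f"base64:   {len(as_base64):>9,} bytes ({len(as_base64) / len(raw):.2f}x)")
print(f"int list: {len(as_int_list):>9,} bytes ({len(as_int_list) / len(raw):.2f}x)")
```

The integer-list form is the more expensive of the two because every byte costs up to three digits plus a separator.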

Example Scenario

from haystack.dataclasses import ByteStream, ChatMessage
from haystack import Pipeline, tracing

# Create a message with an image
with open("large_image.png", "rb") as f:  # 5MB image
    image_data = f.read()
bytestream = ByteStream(data=image_data, mime_type="image/png")
message = ChatMessage.from_user(text="What's in this image?", media=[bytestream])

# When tracing is enabled
tracing.enable_tracing()
# `pipeline` is assumed to be an already-built Pipeline that accepts "messages"
pipeline.run({"messages": [message]})

# Result: ~6.7MB of base64 data (5MB x 4/3) in EACH span that touches this message
# Langfuse/OpenTelemetry may reject the payload or time out

The problem is recursive: ChatMessage → MediaContent → ByteStream means the serialization happens at multiple levels.


Proposed Solution

Add special handling for ByteStream objects before calling to_dict():

from typing import Any, Optional


def _serializable_value(value: Any) -> Any:
    # Special handling for ByteStream to avoid oversized payloads
    if type(value).__name__ == "ByteStream":
        return {
            "type": "ByteStream",
            "mime_type": getattr(value, "mime_type", None),
            "size_bytes": len(getattr(value, "data", b"")),
            "meta": getattr(value, "meta", {}),
            # Optional: small preview for text content
            "preview": _get_text_preview(value, max_bytes=100),
        }
    
    if isinstance(value, list):
        return [_serializable_value(v) for v in value]

    if isinstance(value, dict):
        return {k: _serializable_value(v) for k, v in value.items()}

    if getattr(value, "to_dict", None):
        return _serializable_value(value.to_dict())

    return value


def _get_text_preview(bytestream: Any, max_bytes: int = 100) -> Optional[str]:
    """Get a small preview of ByteStream data if it's text-like."""
    try:
        mime_type = getattr(bytestream, "mime_type", "")
        if mime_type and mime_type.startswith("text/"):
            data = getattr(bytestream, "data", b"")
            preview = data[:max_bytes].decode("utf-8", errors="ignore")
            return preview + "..." if len(data) > max_bytes else preview
    except Exception:
        pass
    return None
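
A quick way to sanity-check the proposal is to run a condensed copy of it against stand-in objects. Everything in the sketch below is hypothetical: `ByteStream` is a local dataclass (not `haystack.dataclasses.ByteStream`), `summarize` is a shortened version of the proposed function without the text preview, and `EagerContainer` is an invented wrapper whose `to_dict()` expands its children itself.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ByteStream:
    """Hypothetical stand-in, NOT haystack.dataclasses.ByteStream."""

    data: bytes
    mime_type: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # Assumed eager expansion, mirroring the behaviour described in this issue.
        return {"data": list(self.data), "mime_type": self.mime_type, "meta": self.meta}


@dataclass
class EagerContainer:
    """Hypothetical container whose to_dict() serializes its children itself."""

    blob: ByteStream

    def to_dict(self) -> Dict[str, Any]:
        return {"blob": self.blob.to_dict()}


def summarize(value: Any) -> Any:
    """Condensed copy of the proposed _serializable_value (no text preview)."""
    if type(value).__name__ == "ByteStream":
        return {
            "type": "ByteStream",
            "mime_type": getattr(value, "mime_type", None),
            "size_bytes": len(getattr(value, "data", b"")),
            "meta": getattr(value, "meta", {}),
        }
    if isinstance(value, list):
        return [summarize(v) for v in value]
    if isinstance(value, dict):
        return {k: summarize(v) for k, v in value.items()}
    if getattr(value, "to_dict", None):
        return summarize(value.to_dict())
    return value


stream = ByteStream(data=b"\x00" * 100_000, mime_type="image/png")

# ByteStreams nested in plain lists and dicts are replaced by the summary.
nested = summarize({"messages": [stream]})

# A container that expands its children inside its own to_dict() never hands
# a ByteStream instance to summarize(), so the full data still gets through.
bypassed = summarize(EagerContainer(blob=stream))
```

If any of Haystack's container classes behave like `EagerContainer` (not verified here), the special case alone would not cover them, and those classes would need their own lightweight serialization.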

Alternative: Add to_trace_dict() Method

Add a tracing-specific serialization protocol:

def _serializable_value(value: Any) -> Any:
    # ... existing code ...
    
    # Check for trace-specific serialization first
    if getattr(value, "to_trace_dict", None):
        return _serializable_value(value.to_trace_dict())
    
    if getattr(value, "to_dict", None):
        return _serializable_value(value.to_dict())
    
    # ... rest of code ...

Then ByteStream can implement to_trace_dict() that returns a lightweight summary.
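
As a rough illustration of this alternative, the sketch below pairs a stand-in class with a serializer that prefers `to_trace_dict()` over `to_dict()`. The `ByteStream` dataclass here is a hypothetical local stand-in, `serializable_value` is a simplified copy of the tracing helper, and the fields returned by `to_trace_dict()` are an assumption borrowed from the summary proposed earlier in this issue.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ByteStream:
    """Hypothetical stand-in, NOT haystack.dataclasses.ByteStream."""

    data: bytes
    mime_type: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # Full serialization, still available for persistence and round-trips.
        return {"data": list(self.data), "mime_type": self.mime_type, "meta": self.meta}

    def to_trace_dict(self) -> Dict[str, Any]:
        # Lightweight summary intended only for tracing backends.
        return {
            "type": "ByteStream",
            "mime_type": self.mime_type,
            "size_bytes": len(self.data),
            "meta": dict(self.meta),
        }


def serializable_value(value: Any) -> Any:
    """Simplified serializer that checks the trace-specific protocol first."""
    if isinstance(value, list):
        return [serializable_value(v) for v in value]
    if isinstance(value, dict):
        return {k: serializable_value(v) for k, v in value.items()}
    to_trace_dict = getattr(value, "to_trace_dict", None)
    if callable(to_trace_dict):
        return serializable_value(to_trace_dict())
    to_dict = getattr(value, "to_dict", None)
    if callable(to_dict):
        return serializable_value(to_dict())
    return value


result = serializable_value([ByteStream(data=b"abc", mime_type="text/plain")])
print(result)
```

One benefit of this design is that `to_dict()` stays untouched, so existing serialization round-trips are unaffected while each class decides what is worth showing in a trace.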


Impact

Affected Users:

  • Anyone using multimodal pipelines with tracing enabled
  • Vision/audio/video processing applications
  • RAG systems that index images/PDFs with media

Severity: High - This can make tracing completely unusable for multimodal applications

Workaround: Users must currently implement custom serializers or monkey-patch Haystack's tracing code


Environment

  • Haystack version: 2.x
  • Tracing backend: Langfuse, OpenTelemetry (affects all backends)
  • Python version: 3.9+

Additional Context

This issue is particularly problematic because:

  1. ByteStream is the recommended way to handle media in Haystack 2.x
  2. Multimodal LLMs are becoming increasingly common
  3. The serialization happens automatically and silently - users may not realize why tracing is failing
  4. The fix is straightforward and backward compatible

Related: The same issue could affect other large data structures in the future (embeddings, large documents, etc.)
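
For those other large structures, one possible direction is a generic size cap applied to already-serialized values. The following is only a sketch: `cap_size`, `MAX_ITEMS`, and `MAX_CHARS` are hypothetical names and limits, not part of Haystack.

```python
from typing import Any

# Hypothetical limits; suitable values would depend on the tracing backend.
MAX_ITEMS = 20
MAX_CHARS = 200


def cap_size(value: Any) -> Any:
    """Truncate long strings and lists and summarize raw bytes, recursively."""
    if isinstance(value, (bytes, bytearray)):
        return {"type": type(value).__name__, "size_bytes": len(value)}
    if isinstance(value, str) and len(value) > MAX_CHARS:
        return value[:MAX_CHARS] + f"... ({len(value)} chars total)"
    if isinstance(value, list):
        head = [cap_size(v) for v in value[:MAX_ITEMS]]
        if len(value) > MAX_ITEMS:
            # Keep a marker so the reader of the trace knows data was dropped.
            head.append(f"... ({len(value)} items total)")
        return head
    if isinstance(value, dict):
        return {k: cap_size(v) for k, v in value.items()}
    return value


capped = cap_size({"embedding": [0.0] * 1536, "content": "x" * 1000, "blob": b"\x00" * 4096})
print(capped["embedding"][-1])
print(capped["blob"])
```

A cap like this trades completeness for predictable payload sizes, so it would probably need to be configurable or opt-in.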

Metadata

Labels: P1 (High priority, add to the next sprint)
