-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Description
[Generated by Cursor]
Description
Problem
When using Haystack's tracing feature with components that handle multimodal data (images, audio, video via ByteStream objects), the tracing system serializes the full binary data, causing:
- Oversized payloads that exceed backend limits (Langfuse, OpenTelemetry, etc.)
- Performance degradation due to serializing/transmitting megabytes of base64 data
- Tracing failures when backends reject large payloads
- No practical debugging value (you rarely need the full image in traces)
Similarly, ImageContent may be affected as well.
Root Cause
In haystack/tracing/utils.py, the _serializable_value() function calls to_dict() on objects that have it:
def _serializable_value(value: Any) -> Any:
if isinstance(value, list):
return [_serializable_value(v) for v in value]
if isinstance(value, dict):
return {k: _serializable_value(v) for k, v in value.items()}
if getattr(value, "to_dict", None):
return _serializable_value(value.to_dict()) # ⚠️ Problem here
return valueWhen a ByteStream (or any object containing one) is traced:
ByteStream.to_dict()converts binary data to a list of integers or base64 string- A 1MB image becomes ~1.3MB of serialized data in the trace
- This gets multiplied across multiple spans/components
Example Scenario
from haystack.dataclasses import ByteStream, ChatMessage
from haystack import Pipeline, tracing
# Create a message with an image
image_data = open("large_image.png", "rb").read() # 5MB image
bytestream = ByteStream(data=image_data, mime_type="image/png")
message = ChatMessage.from_user(text="What's in this image?", media=[bytestream])
# When tracing is enabled
tracing.enable_tracing()
pipeline.run({"messages": [message]})
# Result: ~6.5MB of base64 data in EACH span that touches this message
# Langfuse/OpenTelemetry may reject the payload or timeoutThe problem is recursive: ChatMessage → MediaContent → ByteStream means the serialization happens at multiple levels.
Proposed Solution
Add special handling for ByteStream objects before calling to_dict():
def _serializable_value(value: Any) -> Any:
# Special handling for ByteStream to avoid oversized payloads
if type(value).__name__ == "ByteStream":
return {
"type": "ByteStream",
"mime_type": getattr(value, "mime_type", None),
"size_bytes": len(getattr(value, "data", b"")),
"meta": getattr(value, "meta", {}),
# Optional: small preview for text content
"preview": _get_text_preview(value, max_bytes=100),
}
if isinstance(value, list):
return [_serializable_value(v) for v in value]
if isinstance(value, dict):
return {k: _serializable_value(v) for k, v in value.items()}
if getattr(value, "to_dict", None):
return _serializable_value(value.to_dict())
return value
def _get_text_preview(bytestream: Any, max_bytes: int = 100) -> Optional[str]:
"""Get a small preview of ByteStream data if it's text-like."""
try:
mime_type = getattr(bytestream, "mime_type", "")
if mime_type and mime_type.startswith("text/"):
data = getattr(bytestream, "data", b"")
preview = data[:max_bytes].decode("utf-8", errors="ignore")
return preview + "..." if len(data) > max_bytes else preview
except Exception:
pass
return NoneAlternative: Add to_trace_dict() Method
Add a tracing-specific serialization protocol:
def _serializable_value(value: Any) -> Any:
# ... existing code ...
# Check for trace-specific serialization first
if getattr(value, "to_trace_dict", None):
return _serializable_value(value.to_trace_dict())
if getattr(value, "to_dict", None):
return _serializable_value(value.to_dict())
# ... rest of code ...Then ByteStream can implement to_trace_dict() that returns a lightweight summary.
Impact
Affected Users:
- Anyone using multimodal pipelines with tracing enabled
- Vision/audio/video processing applications
- RAG systems that index images/PDFs with media
Severity: High - This can make tracing completely unusable for multimodal applications
Workaround: Users must currently implement custom serializers or monkey-patch Haystack's tracing code
Environment
- Haystack version: 2.x
- Tracing backend: Langfuse, OpenTelemetry (affects all backends)
- Python version: 3.9+
Additional Context
This issue is particularly problematic because:
- ByteStream is the recommended way to handle media in Haystack 2.x
- Multimodal LLMs are becoming increasingly common
- The serialization happens automatically and silently - users may not realize why tracing is failing
- The fix is straightforward and backward compatible
Related: The same issue could potentially affect other large data structures in the future (embeddings, large documents, etc.)