Scenario SDK: emit audio refs instead of inline base64 #451

@drewdrewthis

Summary

With voice agents landing in the Scenario SDK (#355 against #350), audio bytes are currently emitted inline as base64 inside input_audio.data in trace events. Switch to uploading the audio to LangWatch's audio-asset endpoint (presigned PUT) and emitting a stable audio_ref content part in the trace event instead.

This implements the SDK-side of the contract defined in langwatch/langwatch#3964.

Reference shape to emit

Replace:

{
  "type": "input_audio",
  "input_audio": {"data": "<base64 wav>", "format": "wav"}
}

with:

{
  "type": "audio_ref",
  "key": "audio/<projectId>/<traceId>/<uuid>.wav",
  "contentType": "audio/wav",
  "durationMs": 4320,
  "format": "wav"
}
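A minimal sketch of how the audio_ref part above might be assembled from a finished WAV payload. The helper names (wav_duration_ms, build_audio_ref) are hypothetical, and generating the key on the client is an assumption made only for illustration: the contract in langwatch/langwatch#3964 may well have the backend mint and return the key.

```python
import io
import uuid
import wave


def wav_duration_ms(wav_bytes: bytes) -> int:
    """Return the duration of a WAV payload in whole milliseconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return round(wav.getnframes() * 1000 / wav.getframerate())


def build_audio_ref(project_id: str, trace_id: str, wav_bytes: bytes) -> dict:
    """Build an audio_ref content part in the shape shown above.

    Assumption: the key is generated client-side here; the real key
    may instead come from the presigned-put response.
    """
    return {
        "type": "audio_ref",
        "key": f"audio/{project_id}/{trace_id}/{uuid.uuid4()}.wav",
        "contentType": "audio/wav",
        "durationMs": wav_duration_ms(wav_bytes),
        "format": "wav",
    }
```

Deriving durationMs from the WAV header rather than from wall-clock capture time keeps the value consistent with the bytes that are actually stored.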

Acceptance Criteria

  1. When a voice scenario produces audio (TTS output, captured user simulator audio, recorded responses), Scenario SDK calls the LangWatch backend at POST /api/audio-assets/presigned-put to mint a URL, PUTs the bytes, and emits the audio_ref content part in the resulting trace event.
  2. Ship the behavior behind a flag (default on) so users can revert to inline base64 if LangWatch ingest is unreachable. The flag could be named something like LANGWATCH_AUDIO_INLINE=true to force the fallback.
  3. Both directions: user simulator audio and agent audio responses are both offloaded.
  4. Streaming: if the audio is being captured chunk-by-chunk (Pipecat, LiveKit, OpenAI Realtime), buffer + upload once per logical utterance, not per chunk. One ref per utterance.
  5. Failure mode: if presigned PUT fails (network, auth), fall back to inline base64 emission with a warning log.
  6. Tests: unit test for the upload path with a mocked LangWatch endpoint; integration test against staging LangWatch if feasible.
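Criteria 2, 4, and 5 above can be sketched together as follows. This is an illustration under stated assumptions, not the SDK's implementation: UtteranceBuffer, inline_part, force_inline, and emit_audio_part are hypothetical names, and the upload callable stands in for the presigned-PUT round trip, whose request and response shape this issue does not specify. The flag name is the one floated in criterion 2.

```python
import base64
import io
import logging
import os
import wave
from typing import Callable, List

logger = logging.getLogger("scenario.audio")

SAMPLE_RATE_HZ = 24_000  # assumes PCM16 @ 24kHz mono chunks
SAMPLE_WIDTH_BYTES = 2


class UtteranceBuffer:
    """Collects PCM16 chunks and produces one WAV per logical utterance."""

    def __init__(self) -> None:
        self._chunks: List[bytes] = []

    def add(self, pcm_chunk: bytes) -> None:
        self._chunks.append(pcm_chunk)

    def flush_wav(self) -> bytes:
        """Join buffered chunks into a single WAV payload and reset."""
        pcm = b"".join(self._chunks)
        self._chunks.clear()
        out = io.BytesIO()
        with wave.open(out, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(SAMPLE_WIDTH_BYTES)
            wav.setframerate(SAMPLE_RATE_HZ)
            wav.writeframes(pcm)
        return out.getvalue()


def inline_part(wav_bytes: bytes) -> dict:
    """Legacy shape: inline base64 inside input_audio.data."""
    data = base64.b64encode(wav_bytes).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": data, "format": "wav"}}


def force_inline() -> bool:
    """True when the (proposed) flag forces the inline fallback."""
    return os.environ.get("LANGWATCH_AUDIO_INLINE", "").lower() == "true"


def emit_audio_part(wav_bytes: bytes, upload: Callable[[bytes], dict]) -> dict:
    """Return an audio_ref part, or inline base64 if forced or on failure.

    `upload` is a hypothetical callable that mints the presigned URL,
    PUTs the bytes, and returns the audio_ref content part.
    """
    if force_inline():
        return inline_part(wav_bytes)
    try:
        return upload(wav_bytes)
    except Exception as exc:  # network, auth, or any other upload error
        logger.warning("audio upload failed, falling back to inline base64: %s", exc)
        return inline_part(wav_bytes)
```

Passing the uploader in as a callable keeps the fallback logic testable with a mocked LangWatch endpoint, as criterion 6 asks, and means the per-chunk capture path only ever calls add(); the upload happens once, at flush_wav() time.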

Out of scope

  • TS SDK parity for the Scenario adapters — that's covered in langwatch/scenario#372 (Voice TS SDK parity) and depends on this issue landing first.
  • Changing the internal AudioChunk format (still PCM16 @ 24kHz mono, per design lock #1 in Voice Agents #370).

Dependencies

Why this matters

Without ref-based emission, every voice scenario will either (a) bloat trace payloads with megabytes of base64 WAV, or (b) hit payload caps and ship truncated audio. The LangWatch backend already supports audio asset storage (#3964); this issue is the consumer-side change that makes voice scenarios actually usable for replay/debug.
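To put a rough number on the payload bloat, the following back-of-the-envelope calculation estimates the inline base64 size of one utterance. It assumes PCM16 @ 24kHz mono audio wrapped in a WAV container with the canonical 44-byte header; the function name is hypothetical.

```python
import math

SAMPLE_RATE_HZ = 24_000   # PCM16 @ 24kHz mono
BYTES_PER_SAMPLE = 2
WAV_HEADER_BYTES = 44     # canonical PCM WAV header size (assumed)


def inline_base64_chars(seconds: float) -> int:
    """Characters of base64 needed to inline `seconds` of audio as WAV."""
    raw = int(seconds * SAMPLE_RATE_HZ) * BYTES_PER_SAMPLE + WAV_HEADER_BYTES
    # Base64 encodes every 3 input bytes as 4 output characters, padded.
    return 4 * math.ceil(raw / 3)


print(inline_base64_chars(60))  # 3840060
```

Under these assumptions a single minute of audio costs about 3.8 MB of base64 in the trace event, against a few hundred bytes for the equivalent audio_ref part.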

Part of langwatch/langwatch#1727 (audio-storage epic). Related: langwatch/langwatch#3552 (current inline player), langwatch/scenario#370 (voice epic), langwatch/scenario#350 (voice foundation).
