Scenario SDK: emit audio refs instead of inline base64

## Summary

As voice agents land in the Scenario SDK (#355 against #350), audio bytes are emitted inline as base64 inside `input_audio.data` in trace events. Instead, upload the audio to LangWatch's audio-asset endpoint (presigned PUT) and emit a stable `audio_ref` content part in the trace event.

This implements the SDK side of the contract defined in `langwatch/langwatch#3964`.

## Reference shape to emit

Replace:

```python
{
  "type": "input_audio",
  "input_audio": {"data": "<base64 wav>", "format": "wav"}
}
```

with:

```python
{
  "type": "audio_ref",
  "key": "audio/<projectId>/<traceId>/<uuid>.wav",
  "contentType": "audio/wav",
  "durationMs": 4320,
  "format": "wav"
}
```
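A minimal sketch of building this part from WAV bytes. It assumes `durationMs` is derived from the WAV header (frame count divided by sample rate) and that `key` is whatever the backend hands back; `build_audio_ref` is a hypothetical helper name, not an existing SDK function.

```python
import io
import wave


def build_audio_ref(key: str, wav_bytes: bytes) -> dict:
    """Build an `audio_ref` content part for already-uploaded WAV bytes.

    Hypothetical helper: `key` is assumed to come from the backend's
    presigned-PUT response; the duration is read from the WAV header.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        duration_ms = round(wav.getnframes() * 1000 / wav.getframerate())
    return {
        "type": "audio_ref",
        "key": key,
        "contentType": "audio/wav",
        "durationMs": duration_ms,
        "format": "wav",
    }
```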

## Acceptance Criteria

1. When a voice scenario produces audio (TTS output, captured user simulator audio, recorded responses), Scenario SDK calls the LangWatch backend at `POST /api/audio-assets/presigned-put` to mint a URL, PUTs the bytes, and emits the `audio_ref` content part in the resulting trace event.
2. The offload sits behind a flag (default on) so users can revert to inline base64 if LangWatch ingest is unreachable. Flag name: something like `LANGWATCH_AUDIO_INLINE=true` to force the fallback.
3. **Both directions**: user simulator audio AND agent audio responses both get offloaded.
4. **Streaming**: if the audio is being captured chunk-by-chunk (Pipecat, LiveKit, OpenAI Realtime), buffer + upload once per logical utterance, not per chunk. One ref per utterance.
5. **Failure mode**: if presigned PUT fails (network, auth), fall back to inline base64 emission with a warning log.
6. **Tests**: unit test for the upload path with a mocked LangWatch endpoint; integration test against staging LangWatch if feasible.
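Criteria 2 and 5 could be sketched roughly as follows. This is an illustration under assumptions, not the SDK's implementation: `audio_content_part` and the injected `upload` callable are hypothetical names, the flag is assumed to be read from the environment, and the request/response shape of the presigned-PUT endpoint is deliberately left to `upload` because it is defined in #3964, not here.

```python
import base64
import logging
import os
from typing import Callable

logger = logging.getLogger(__name__)


def audio_content_part(wav_bytes: bytes, upload: Callable[[bytes], dict]) -> dict:
    """Return an `audio_ref` part, or inline base64 when offload is off or fails.

    `upload` is a hypothetical callable that mints the presigned URL, PUTs the
    bytes, and returns the finished `audio_ref` dict; it raises on failure.
    """

    def inline() -> dict:
        return {
            "type": "input_audio",
            "input_audio": {
                "data": base64.b64encode(wav_bytes).decode("ascii"),
                "format": "wav",
            },
        }

    # Assumed flag semantics: LANGWATCH_AUDIO_INLINE=true forces inline base64.
    if os.environ.get("LANGWATCH_AUDIO_INLINE", "").lower() == "true":
        return inline()
    try:
        return upload(wav_bytes)
    except Exception as exc:  # e.g. network or auth failure
        logger.warning("Audio upload failed, falling back to inline base64: %s", exc)
        return inline()
```

Injecting `upload` keeps the fallback logic testable with a mocked endpoint, which lines up with criterion 6.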

## Out of scope

- TS SDK parity for the Scenario adapters — that's covered in `langwatch/scenario#372` (Voice TS SDK parity) and depends on this issue landing first.
- Changing the internal `AudioChunk` format (still PCM16 @ 24kHz mono per #370 design lock #1).
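Because the chunk format stays PCM16 @ 24 kHz mono, collecting a streamed utterance into a single WAV for upload needs only the standard library. A rough sketch, where `UtteranceBuffer` is a hypothetical name and chunks are assumed to arrive as raw PCM bytes:

```python
import io
import wave

SAMPLE_RATE_HZ = 24_000  # PCM16 @ 24 kHz mono, as stated in this issue
SAMPLE_WIDTH_BYTES = 2
CHANNELS = 1


class UtteranceBuffer:
    """Hypothetical buffer: collects raw PCM16 chunks, yields one WAV per utterance."""

    def __init__(self) -> None:
        self._chunks: list[bytes] = []

    def add_chunk(self, pcm: bytes) -> None:
        self._chunks.append(pcm)

    def flush(self) -> bytes:
        """Return the buffered audio as WAV bytes and reset the buffer."""
        out = io.BytesIO()
        with wave.open(out, "wb") as wav:
            wav.setnchannels(CHANNELS)
            wav.setsampwidth(SAMPLE_WIDTH_BYTES)
            wav.setframerate(SAMPLE_RATE_HZ)
            wav.writeframes(b"".join(self._chunks))
        self._chunks.clear()
        return out.getvalue()
```

Calling `flush()` at the end of each logical utterance gives one upload, and therefore one ref, per utterance.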

## Dependencies

- **Blocked on #355** (voice agents foundation merging). Until voice is in the SDK, there's nothing to redirect to S3.
- **Hard dep on `langwatch/langwatch#3964`** for the endpoint and ref shape — but can develop against fixtures once #3964's PR description spec is published.

## Why this matters

Without ref-based emission, every voice scenario will either (a) bloat trace payloads with megabytes of base64 WAV, or (b) hit payload caps and ship truncated audio. The LangWatch backend already supports audio asset storage (#3964); this issue is the consumer-side change that makes voice scenarios actually usable for replay/debug.

Part of langwatch/langwatch#1727 (audio-storage epic). Related: langwatch/langwatch#3552 (current inline player), `langwatch/scenario#370` (voice epic), `langwatch/scenario#350` (voice foundation).

