fix: add SDK-level caching for Speech and Music generation #201

Merged: SecurityQQ merged 5 commits into main from fix/speech-caching on Apr 7, 2026

@SecurityQQ SecurityQQ commented Apr 7, 2026

Summary

  • Adds withCache() wrappers for Speech and Music in both the standalone (await Speech()/await Music()) and render pipeline paths
  • Uses computeCacheKey(element) consistently — the same canonical key format used by Image and Video
  • Adds pendingFiles deduplication for concurrent Speech/Music renders
  • Removes manual ctx.cache.get()/ctx.cache.set() + JSON.stringify cache keys from both renderers

Problem

Speech was completely missing SDK-level caching. Music had manual caching via ctx.cache.get()/ctx.cache.set() with JSON.stringify keys, which was inconsistent with Image/Video and produced different cache keys than the standalone await Music() path.
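
One way ad hoc JSON.stringify keys can drift apart between two code paths is property order: two call sites that build the same options in a different order get different strings. The sketch below illustrates that failure mode only. `canonicalKey` is a hypothetical stand-in written for this example, not the SDK's computeCacheKey, and the PR does not say that property order was the actual cause of the mismatch.

```typescript
// Hypothetical JSON value type for the sketch.
type Json = string | number | boolean | null | Json[] | { [k: string]: Json };

// Hypothetical canonical serializer: object keys are sorted before
// serializing, so property order no longer affects the resulting key.
function canonicalKey(value: Json): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalKey).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalKey(value[k]));
    return "{" + entries.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Same logical input, built in a different property order by two call sites.
const fromRenderPath: Json = { prompt: "lofi beat", duration: 30 };
const fromStandalonePath: Json = { duration: 30, prompt: "lofi beat" };

// Plain JSON.stringify preserves insertion order, so the keys differ.
const naiveMatch =
  JSON.stringify(fromRenderPath) === JSON.stringify(fromStandalonePath);

// The canonical form is identical for both inputs.
const canonicalMatch =
  canonicalKey(fromRenderPath) === canonicalKey(fromStandalonePath);
```

Routing every media type through one key function removes this class of mismatch regardless of how each call site assembles its arguments.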

Before:

| Layer | Image | Video | Music | Speech |
|---|---|---|---|---|
| Gateway (Redis/R2) | yes | yes | yes | yes |
| SDK withCache() render pipeline | yes | yes | manual get/set | no |
| SDK withCache() standalone await | yes | yes | yes | no |
| computeCacheKey(element) | yes | yes | no (JSON.stringify) | no (ignored) |
| pendingFiles dedup | yes | yes | no | no |

After:

| Layer | Image | Video | Music | Speech |
|---|---|---|---|---|
| Gateway (Redis/R2) | yes | yes | yes | yes |
| SDK withCache() render pipeline | yes | yes | yes | yes |
| SDK withCache() standalone await | yes | yes | yes | yes |
| computeCacheKey(element) | yes | yes | yes | yes |
| pendingFiles dedup | yes | yes | yes | yes |

Changes

src/react/resolve.ts — standalone await Speech():

  • Added getCachedGenerateSpeech() wrapping generateSpeechAI with withCache(), matching getCachedGenerateVideo()/getCachedGenerateMusic()
  • Replaced direct generateSpeechAI() call with cached wrapper
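
For readers unfamiliar with the wrapper pattern, here is a minimal sketch of what a cached generator can look like. Everything in it is an assumption made for illustration: `withCacheSketch`, `fakeGenerateSpeech`, the `SpeechAudio` shape, and the key format are hypothetical, and the SDK's real withCache() signature and storage backend may differ.

```typescript
// Hypothetical result shape for the sketch.
interface SpeechAudio {
  voice: string;
  text: string;
}

// Hypothetical cache wrapper: returns a function with the same call shape
// as `fn` that consults `store` before invoking `fn`.
function withCacheSketch<A, R>(
  fn: (args: A) => Promise<R>,
  store: Map<string, R>,
  keyOf: (args: A) => string,
): (args: A) => Promise<R> {
  return async (args: A): Promise<R> => {
    const key = keyOf(args);
    if (store.has(key)) {
      return store.get(key) as R; // cache hit: the provider is not called
    }
    const result = await fn(args);
    store.set(key, result);
    return result;
  };
}

// Stand-in for the provider call; counts invocations so hits are observable.
let providerCalls = 0;
async function fakeGenerateSpeech(args: {
  text: string;
  voice: string;
}): Promise<SpeechAudio> {
  providerCalls += 1;
  return { voice: args.voice, text: args.text };
}

// The cached wrapper is what callers use instead of the raw generator.
const cachedGenerateSpeech = withCacheSketch(
  fakeGenerateSpeech,
  new Map<string, SpeechAudio>(),
  (a) => JSON.stringify(["speech", a.voice, a.text]),
);
```

Calling `cachedGenerateSpeech` twice with the same text and voice invokes `fakeGenerateSpeech` once; changing either input produces a new key and a new provider call.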

src/react/renderers/context.ts:

  • Added generateSpeech and generateMusic fields to RenderContext, matching generateImage/generateVideo

src/react/renderers/render.ts (renderRoot()):

  • Creates cachedGenerateSpeech and cachedGenerateMusic via withCache(), same pattern as Image/Video
  • Passes them into RenderContext

src/react/renderers/speech.ts:

  • Rewrote to use computeCacheKey(element) + ctx.generateSpeech() + pendingFiles dedup
  • Removed manual ctx.cache.get()/ctx.cache.set() and JSON.stringify key

src/react/renderers/music.ts:

  • Same rewrite: computeCacheKey(element) + ctx.generateMusic() + pendingFiles dedup
  • Removed manual caching logic
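
The pendingFiles deduplication mentioned for both renderers can be sketched as a map from cache key to in-flight promise. This is an illustrative sketch under stated assumptions, not the renderer code: `slowGenerate`, `renderOnce`, and the string result are hypothetical, and the real pendingFiles structure lives on the render context.

```typescript
// Hypothetical in-flight registry: cache key -> promise of the generated file.
const pendingFiles = new Map<string, Promise<string>>();

// Stand-in for an expensive generation call; counts invocations.
let generateCalls = 0;
async function slowGenerate(key: string): Promise<string> {
  generateCalls += 1;
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  return `file-for-${key}`;
}

// Concurrent callers with the same key share one promise instead of each
// starting their own generation.
function renderOnce(key: string): Promise<string> {
  const inFlight = pendingFiles.get(key);
  if (inFlight) {
    return inFlight;
  }
  // Remove the entry once the work settles (success or failure), so a later
  // render can retry or go through the normal cache path.
  const promise = slowGenerate(key).finally(() => {
    pendingFiles.delete(key);
  });
  pendingFiles.set(key, promise);
  return promise;
}
```

The persistent cache only helps after the first call has finished; the in-flight map covers the window in which two identical elements are rendered at the same time and neither result has been stored yet.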

src/studio/step-renderer.ts + test fixtures:

  • Added generateSpeech/generateMusic to all RenderContext construction sites

Closes #200

Speech was the only media type missing withCache/ctx.cache caching at the
SDK level. Every await Speech() and <Speech> in render re-hit ElevenLabs
even with identical inputs, wasting API credits and adding latency.

- resolve.ts: add getCachedGenerateSpeech() wrapping generateSpeechAI with
  withCache(), matching getCachedGenerateVideo/Music pattern
- renderers/speech.ts: add manual ctx.cache get/set matching renderMusic()
  pattern for the render pipeline path

Closes #200

…nd Music renderers

Align Speech and Music render pipeline caching with the Image/Video
pattern:

- Use computeCacheKey(element) for canonical cache keys (captures model
  provider, settings, providerOptions, children structure)
- Route generation through ctx.generateSpeech/ctx.generateMusic which
  are withCache() wrappers created in renderRoot(), matching
  ctx.generateImage/ctx.generateVideo
- Add pendingFiles deduplication for concurrent renders
- Remove manual ctx.cache.get()/set() and JSON.stringify cache keys

Also adds generateSpeech/generateMusic to RenderContext and wires them
up in render.ts, step-renderer.ts, and test fixtures.

@SecurityQQ SecurityQQ changed the title from "fix: add SDK-level caching for Speech generation" to "fix: add SDK-level caching for Speech and Music generation" on Apr 7, 2026

… throwing stubs

Replace 'not implemented in test' stubs with real withCache-wrapped mock
functions for generateSpeech/generateMusic, matching how generateImage
and generateVideo are already mocked in the same test files.

Add 5 new tests verifying render-pipeline caching for Speech and Music:

- Speech: reuses cache when only volume/id differ (ignored props)
- Speech: does NOT reuse cache when text differs
- Speech: does NOT reuse cache when voice differs
- Music: reuses cache with identical prompt/model/duration
- Music: does NOT reuse cache when prompt differs

All tests follow the same pattern as the existing Image/Video cache tests:
create element, render in ctx1, render variant in ctx2, assert call count.
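
The test pattern described above (render in one context, render a variant in a second context that shares the cache, assert the generator call count) can be sketched without a test framework as follows. The names `SpeechProps`, `speechKey`, `makeContext`, and `runScenario` are hypothetical, and the key here only imitates the idea that volume and id do not affect the generated audio; it is not computeCacheKey.

```typescript
// Hypothetical props for a Speech element in this sketch.
interface SpeechProps {
  text: string;
  voice: string;
  volume?: number;
  id?: string;
}

// Assumed key: only props that change the generated audio participate.
function speechKey(p: SpeechProps): string {
  return JSON.stringify(["speech", p.voice, p.text]);
}

// Each call builds a fresh context; the cache and call counter are shared.
function makeContext(cache: Map<string, string>, counter: { calls: number }) {
  return {
    async generateSpeech(p: SpeechProps): Promise<string> {
      const key = speechKey(p);
      const hit = cache.get(key);
      if (hit !== undefined) {
        return hit;
      }
      counter.calls += 1; // only cache misses reach the mock generator
      const audio = `audio(${p.voice},${p.text})`;
      cache.set(key, audio);
      return audio;
    },
  };
}

// Render the base element in ctx1, then a variant in ctx2, and report how
// many times the mock generator actually ran.
async function runScenario(variant: Partial<SpeechProps>): Promise<number> {
  const cache = new Map<string, string>();
  const counter = { calls: 0 };
  const base: SpeechProps = { text: "hello", voice: "voice-a", volume: 1, id: "a" };
  await makeContext(cache, counter).generateSpeech(base);
  await makeContext(cache, counter).generateSpeech({ ...base, ...variant });
  return counter.calls;
}
```

With this setup, a variant that changes only `volume` or `id` leaves the count at 1, while a variant that changes `text` or `voice` raises it to 2.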

Previously only the raw TTS API call was cached via withCache. The
expensive post-processing (ffprobe duration, ffmpeg segment slicing,
S3 uploads) ran on every invocation even with identical inputs.

Now the entire resolved result — including segments with their sliced
audio bytes, word timings, and duration — is cached under a
'resolveSpeech:' key. On cache hit, segments are reconstructed from
cached binary data without calling ffmpeg or ElevenLabs.
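
As a rough illustration of caching a resolved result that carries binary segments, the sketch below converts segments to a JSON-safe form and back. The interfaces and field names are assumptions for this example; the actual CachedSegment/CachedSpeechResult types in resolve.ts are not shown in this PR text and may store bytes differently.

```typescript
// Hypothetical in-memory segment: timing metadata plus sliced audio bytes.
interface SegmentSketch {
  text: string;
  start: number;
  end: number;
  audio: Uint8Array;
}

// Hypothetical cache-friendly form: bytes as a plain number array so the
// whole result survives JSON serialization.
interface CachedSegmentSketch {
  text: string;
  start: number;
  end: number;
  audioBytes: number[];
}

function toCached(segment: SegmentSketch): CachedSegmentSketch {
  return {
    text: segment.text,
    start: segment.start,
    end: segment.end,
    audioBytes: Array.from(segment.audio),
  };
}

function fromCached(cached: CachedSegmentSketch): SegmentSketch {
  return {
    text: cached.text,
    start: cached.start,
    end: cached.end,
    audio: Uint8Array.from(cached.audioBytes),
  };
}

// Simulates a cache write followed by a cache hit: no slicing or provider
// call is needed to get the segments back.
function roundTrip(segments: SegmentSketch[]): SegmentSketch[] {
  const stored = JSON.stringify(segments.map(toCached));
  const parsed = JSON.parse(stored) as CachedSegmentSketch[];
  return parsed.map(fromCached);
}
```

A number array is simple but verbose; a real implementation would more likely use base64 or a binary-capable store.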

Also removes non-deterministic upload URLs (Date.now + Math.random)
from ResolvedElement serialization in computeCacheKey. These URLs were
causing downstream cache misses for Video elements that take speech
segments as audio input (e.g. VEED lip-sync videos).
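
The effect of dropping upload URLs from key serialization can be shown with a small sketch. `UploadedFileSketch`, `contentHash`, and both serializers are hypothetical names chosen for this example; which fields the real serializeValue() keeps is not stated here.

```typescript
// Hypothetical resolved file: stable identity fields plus an upload URL that
// changes on every upload (the PR describes URLs built from Date.now and
// Math.random).
interface UploadedFileSketch {
  mediaType: string;
  contentHash: string;
  url: string;
}

// Key material that includes the URL: differs for every upload, even when
// the underlying audio is identical.
function serializeWithUrl(file: UploadedFileSketch): string {
  return JSON.stringify([file.mediaType, file.contentHash, file.url]);
}

// Key material without the URL: identical audio yields an identical key.
function serializeStable(file: UploadedFileSketch): string {
  return JSON.stringify([file.mediaType, file.contentHash]);
}

// The same speech segment uploaded twice.
const firstUpload: UploadedFileSketch = {
  mediaType: "audio/mpeg",
  contentHash: "abc123",
  url: "https://example.invalid/uploads/1712500000000-k3f9",
};
const secondUpload: UploadedFileSketch = {
  mediaType: "audio/mpeg",
  contentHash: "abc123",
  url: "https://example.invalid/uploads/1712500004321-z81q",
};

const keysDifferWithUrl =
  serializeWithUrl(firstUpload) !== serializeWithUrl(secondUpload);
const keysMatchWithoutUrl =
  serializeStable(firstUpload) === serializeStable(secondUpload);
```

A downstream element that takes the segment as input inherits whatever instability its inputs carry, which is why a random URL in the audio input is enough to force a cache miss on the video.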

@SecurityQQ SecurityQQ merged commit d4acc2d into main Apr 7, 2026
1 check passed

coderabbitai Bot commented Apr 7, 2026

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between 1f4903d and 5d84741.

📒 Files selected for processing (10)
  • src/react/renderers/cache.test.ts
  • src/react/renderers/context.ts
  • src/react/renderers/music.ts
  • src/react/renderers/packshot.test.ts
  • src/react/renderers/render.ts
  • src/react/renderers/speech.ts
  • src/react/renderers/talking-head.test.ts
  • src/react/renderers/utils.ts
  • src/react/resolve.ts
  • src/studio/step-renderer.ts

Walkthrough

the pr adds speech and music generation caching across the sdk's render and resolve layers. it implements cached wrappers for generateSpeech and generateMusic, threads them through RenderContext, adds resolve-level caching for speech with segment serialization, and removes non-deterministic file urls from cache keys.

Changes

| Cohort / File(s) | Summary |
|---|---|
| context & core wiring (src/react/renderers/context.ts, src/react/renderers/render.ts, src/studio/step-renderer.ts) | added generateSpeech and generateMusic properties to RenderContext; wired cached versions in render setup and studio step sessions. |
| resolve-level speech caching (src/react/resolve.ts) | introduced getCachedGenerateSpeech() wrapper and resolve-level caching for resolveSpeechElement with CachedSegment/CachedSpeechResult serialization; reconstructs segments on cache hits. |
| renderer speech/music (src/react/renderers/speech.ts, src/react/renderers/music.ts) | swapped direct generation imports for context-based calls; added concurrent deduplication via pendingFiles tracking; refactored progress lifecycle and file construction. |
| cache serialization (src/react/renderers/utils.ts) | removed non-deterministic file upload urls from serializeValue() to stabilize downstream cache keys. |
| test coverage (src/react/renderers/cache.test.ts, src/react/renderers/packshot.test.ts, src/react/renderers/talking-head.test.ts) | extended renderer cache tests for speech/music with mock generators; added packshot renderer tests; updated test context mocks to include generateSpeech/generateMusic. |

Sequence Diagram(s)

sequenceDiagram
    participant Client as Render Pipeline
    participant RenderContext as RenderContext
    participant PendingFiles as pendingFiles<br/>(Dedup)
    participant Cache as CacheStorage
    participant Generator as ctx.generateSpeech
    participant FileStore as generatedFiles

    Client->>RenderContext: renderSpeech(element)
    RenderContext->>RenderContext: compute cacheKeyStr
    RenderContext->>PendingFiles: check if in-flight
    alt In-flight promise exists
        PendingFiles-->>Client: return existing promise
    else Cache miss or no pending
        RenderContext->>Cache: check cache.get(cacheKeyStr)
        alt Cache hit
            Cache-->>RenderContext: return cached audio
            RenderContext->>FileStore: push file metadata
            RenderContext-->>Client: resolve promise
        else Cache miss
            RenderContext->>Generator: call with params
            Generator-->>RenderContext: return audio object
            RenderContext->>Cache: cache.set(cacheKeyStr, audio)
            RenderContext->>FileStore: push generated file
            PendingFiles->>PendingFiles: clean up pending entry
            RenderContext-->>Client: resolve promise
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • Speech segments are not cached at the SDK level #200 — directly addresses the root cause of speech caching regression by implementing getCachedGenerateSpeech() and resolve-level caching for speech generation, matching the proposed fix for both resolveSpeechElement() and renderSpeech().

Poem

🎵 speech and music now cache their way,
through resolve and render they stay,
segments serialize, duplicates fade—
no more api calls for the same soundwave made 🎙️

Development

Successfully merging this pull request may close these issues.

Speech segments are not cached at the SDK level

1 participant