Caption rendering is slow — optimize font download and measurement pipeline #207

@SecurityQQ

Description

Problem

Caption rendering has become noticeably slower since adding the universal font support system. The pipeline now downloads font files from S3, loads them with opentype.js for measurement, computes per-character positions, and (for emoji) may spawn ffmpeg subprocesses for pixel scanning — all before the actual video burn even starts.

Where time is spent

  1. Font downloads (ensureLocalFonts) — downloads TTF/OTF files from s3.varg.ai/fonts/ on first use. Noto CJK fonts are 16-17MB each. Cached on disk after first download, but cold starts are slow.

  2. opentype.js font loading (loadFont) — opentype.loadSync() parses the full font file including all tables. CJK fonts with 60K+ glyphs are especially slow to parse.

  3. Per-character position computation (getCharXPositions) — iterates every character in every subtitle line, computing glyph advances via opentype.js. For long videos with many subtitle entries, this adds up.

  4. Arabic/complex script measurement (measureRenderedWidth) — spawns an ffmpeg subprocess per Arabic text segment to measure the actual rendered width via pixel scanning. Each subprocess takes ~100-200ms.

  5. Emoji gap detection (measureEmojiGapPositions) — spawns ffmpeg to render the full subtitle line and scan for gaps. ~200ms per line with emoji.

  6. ZIP creation for Rendi (runWithCompressedFolder) — downloads all font files again into memory to create the ZIP, even though they're already cached on disk locally.
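For step 3, the per-character cost is an accumulation of glyph advances across every character of every subtitle line. A minimal sketch of that loop, with `advanceOf` standing in for the opentype.js lookup (in the real code, roughly `font.charToGlyph(ch).advanceWidth` scaled by `fontSize / font.unitsPerEm`; the function names here are illustrative, not the actual pipeline API):

```typescript
// Stand-in for the opentype.js per-glyph advance lookup.
type AdvanceFn = (ch: string) => number;

// Accumulate an X position for each character by summing glyph advances.
// This is the shape of the work getCharXPositions repeats for every
// subtitle entry, which is why it adds up on long videos.
function charXPositions(text: string, advanceOf: AdvanceFn): number[] {
  const xs: number[] = [];
  let x = 0;
  for (const ch of text) { // for..of iterates by code point, not UTF-16 unit
    xs.push(x);
    x += advanceOf(ch);
  }
  return xs;
}

// With a fixed 10px advance, "abc" yields positions [0, 10, 20].
const positions = charXPositions("abc", () => 10);
```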

Optimization opportunities

Quick wins

  • Read cached fonts from disk for ZIP — runWithCompressedFolder downloads fonts from URLs even when they're already in /tmp/varg-caption-fonts/. Read from the local cache instead.
  • Lazy font loading — only call loadFont() when measurement is actually needed (i.e., when emoji are present). Plain text captions don't need opentype.js at all.
  • Cache getFontMetrics() results — metrics for a given font+size combo are deterministic. Cache them in a Map.
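The metrics cache in the last bullet is a straightforward memoization keyed on font + size. A sketch, assuming a `getFontMetrics`-shaped signature; `computeFontMetrics` and the metric fields are placeholders for the existing opentype.js computation:

```typescript
interface FontMetrics {
  ascent: number;
  descent: number;
  lineHeight: number;
}

const metricsCache = new Map<string, FontMetrics>();
let computeCalls = 0; // instrumentation for the sketch only

// Placeholder for the expensive, deterministic opentype.js work.
function computeFontMetrics(fontPath: string, size: number): FontMetrics {
  computeCalls++;
  return { ascent: size * 0.8, descent: size * 0.2, lineHeight: size * 1.2 };
}

// Metrics for a given font+size combo never change, so cache by key.
function getFontMetrics(fontPath: string, size: number): FontMetrics {
  const key = `${fontPath}@${size}`;
  let metrics = metricsCache.get(key);
  if (!metrics) {
    metrics = computeFontMetrics(fontPath, size);
    metricsCache.set(key, metrics);
  }
  return metrics;
}

getFontMetrics("NotoSansJP.otf", 48);
getFontMetrics("NotoSansJP.otf", 48); // cache hit: compute runs once
```

Because the key is just a string, the cache survives across subtitle entries within a render job for free; whether to scope it per-process or per-job depends on how fonts are invalidated.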

Medium effort

  • Parallel font downloads — ensureLocalFonts already uses Promise.all, but the ZIP builder downloads sequentially. Parallelize the ZIP builder's downloads as well.
  • Pre-warm font cache on service start — the render service could download common fonts (Montserrat, Noto CJK JP) at startup instead of on first request.
  • Batch measureRenderedWidth calls — instead of one ffmpeg subprocess per Arabic segment, render all segments in a single ffmpeg call with multiple drawtext filters.
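The batching idea in the last bullet amounts to building one filtergraph with a drawtext filter per segment, so a single ffmpeg invocation renders all segments (each on its own row) and pixel scanning then measures every row in one output frame. A sketch of the filter-string builder; the escaping and row layout are illustrative, not the pipeline's actual logic:

```typescript
// Build a single -vf value that draws each Arabic segment on its own row,
// replacing N ffmpeg subprocesses with one.
function buildBatchedDrawtextFilter(
  segments: string[],
  fontFile: string,
  fontSize: number
): string {
  // Minimal escaping for ffmpeg filter syntax (backslash, colon, quote).
  const esc = (s: string) =>
    s.replace(/\\/g, "\\\\").replace(/:/g, "\\:").replace(/'/g, "\\'");
  return segments
    .map(
      (text, i) =>
        `drawtext=fontfile='${esc(fontFile)}':text='${esc(text)}'` +
        `:fontsize=${fontSize}:x=0:y=${i * fontSize * 2}` // one row per segment
    )
    .join(",");
}

const filter = buildBatchedDrawtextFilter(["abc", "def"], "/tmp/f.ttf", 48);
```

At ~100-200ms per subprocess, collapsing ten segments into one call saves on the order of a second per affected subtitle batch.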

Larger changes

  • Pre-compute emoji positions in the ASS file — instead of computing X/Y positions at render time, embed position metadata in the ASS file or a sidecar JSON.
  • Use HarfBuzz WASM for text shaping — replace the opentype.js + ffmpeg pixel scanning approach with HarfBuzz compiled to WASM. This would give accurate text shaping for all scripts (including Arabic) without spawning ffmpeg subprocesses.
  • Font subsetting — for CJK fonts, subset to only the glyphs used in the caption text before including in the ZIP. This could reduce a 17MB font to <100KB for typical subtitle text.
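The first step of the subsetting bullet is computing the set of code points the captions actually use; that set is then handed to a subsetting tool (e.g. fonttools' pyftsubset or hb-subset — the tool choice is an assumption, not something this issue specifies). A sketch of the code-point collection:

```typescript
// Collect the sorted, unique code points used across all caption lines.
// This is the input a font subsetter needs to strip unused CJK glyphs.
function usedCodePoints(captionLines: string[]): number[] {
  const points = new Set<number>();
  for (const line of captionLines) {
    for (const ch of line) {
      points.add(ch.codePointAt(0)!); // for..of yields whole code points
    }
  }
  return [...points].sort((a, b) => a - b);
}

// A 17MB Noto CJK font only needs the handful of glyphs these lines use.
const cps = usedCodePoints(["こんにちは", "世界"]);
```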

Impact

Primarily affects:

  • Cloud rendering via the render service (every render job pays this cost)
  • Local rendering with <Captions> component
  • Worst case: Japanese/Korean/Chinese text with emoji = CJK font download (17MB) + opentype parsing + per-character measurement

Related files

  • src/react/renderers/captions.ts — main pipeline
  • src/react/renderers/burn-captions.ts — font download + ZIP creation
  • src/react/renderers/text-measure.ts — opentype.js measurement
  • src/react/renderers/fonts.ts — font resolution + S3 URLs
  • src/ai-sdk/providers/editly/rendi/index.ts — runWithCompressedFolder ZIP builder
