
[JS] Pipeline for transcription using the whisper model seems broken #3366

@nsrCodes

Description


Describe the bug
Trying to use the Whisper model for transcription has two problems:

  1. the underlying openai module expects the uploaded file's name to contain a supported audio extension (e.g. .webm), which the data-URL path does not provide (see the sketch after this list)
  2. even if that gets fixed, the raw timestamped segments/words cannot be extracted, because the raw response is not returned from the result of generate()
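
For context (not part of the original repro), here is a minimal sketch, assuming the standard openai Node SDK, of the upload shape the Whisper endpoint accepts; the key point is that the file handed to audio.transcriptions.create has to carry a name with a supported extension (transcribeBuffer is just an illustrative name):

import OpenAI, { toFile } from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function transcribeBuffer(audio: Uint8Array) {
  // Naming the buffer 'audio.webm' lets the API infer the container format;
  // an unnamed buffer (as produced from a bare data URL) gets rejected.
  const file = await toFile(Buffer.from(audio), 'audio.webm');
  return client.audio.transcriptions.create({
    file,
    model: 'whisper-1',
    response_format: 'verbose_json',
    timestamp_granularities: ['word', 'segment'],
  });
}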

To Reproduce

Sample flow to take an audio buffer and generate a transcript with Whisper:
import { genkit, z} from 'genkit';
import { openAI } from '@genkit-ai/compat-oai/openai';

const whisperModel = openAI.model('whisper-1', {
  response_format: 'verbose_json',
  timestamp_granularities: ['word', 'segment'],
});
const ai = genkit({
  plugins: [
    openAI({apiKey: process.env.OPENAI_API_KEY}),
  ],
})

const inputSchema = z.object({
  audioFile: z.instanceof(Uint8Array).describe('Audio file data as Uint8Array'),
})

const outputSchema = z.object({
  text: z.string().describe('Transcribed text from the audio'),
  segments: z.array(z.object({
    id: z.number().describe('id of the segment'),
    seek: z.number().describe('Seek time of the segment in seconds'),
    start: z.number().describe('Start time of the segment in seconds'),
    end: z.number().describe('End time of the segment in seconds'),
    text: z.string().describe('Text content of the segment'),
  })).describe('List of segments with their start and end times'),
  words: z.array(z.object({
    start: z.number().describe('Start time of the word in seconds'),
    end: z.number().describe('End time of the word in seconds'),
    word: z.string().describe('Text content of the word'),
  })).describe('List of words with their start and end times'),
  title: z.string().optional().describe('Optional title for the transcription'),
  description: z.string().optional().describe('Optional description for the transcription'),
})


export const transcribeAudio = ai.defineFlow({
  name: 'transcribeAudio',
  inputSchema: inputSchema,
  outputSchema: outputSchema,
}, async ({ audioFile }) => {
  const audioDataURL = `data:audio/webm;base64,${Buffer.from(audioFile).toString('base64')}`
  const transcriptionResponse = await ai.generate({
    prompt: [
      {
        media: {
          contentType: 'audio/webm;codecs=opus',
          url: audioDataURL,
        }
      }
    ],
    model: whisperModel,
    config: {
      response_format: 'verbose_json',
      timestamp_granularities: ['word', 'segment'],
    }
  })
  // The verbose_json fields (segments/words) are expected on the raw response here,
  // but the plugin does not return them from generate (the bug described above).
  // @ts-ignore
  const {segments, words, text} = transcriptionResponse.raw;
  console.log("Transcription response:", {segments, words, text});


  return {
    text,
    segments,
    words,
  }
})
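
Not part of the original report, but until the plugin passes the verbose response through, one possible workaround is to call the OpenAI SDK directly inside the flow instead of going through ai.generate for the transcription step, so segments and words survive. A sketch, assuming the same ai, inputSchema and outputSchema as above and the standard openai package (transcribeAudioDirect is an illustrative name):

import OpenAI, { toFile } from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export const transcribeAudioDirect = ai.defineFlow({
  name: 'transcribeAudioDirect',
  inputSchema: inputSchema,
  outputSchema: outputSchema,
}, async ({ audioFile }) => {
  // Give the buffer an explicit name so the API can infer the audio format.
  const file = await toFile(Buffer.from(audioFile), 'audio.webm');
  // The SDK's static return type only exposes text, so widen it here;
  // with response_format 'verbose_json' the payload also carries segments and words.
  const res: any = await client.audio.transcriptions.create({
    file,
    model: 'whisper-1',
    response_format: 'verbose_json',
    timestamp_granularities: ['word', 'segment'],
  });
  return {
    text: res.text,
    segments: (res.segments ?? []).map((s: any) => ({
      id: s.id, seek: s.seek, start: s.start, end: s.end, text: s.text,
    })),
    words: (res.words ?? []).map((w: any) => ({ start: w.start, end: w.end, word: w.word })),
  };
});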

Expected behavior
segments, words, and text should all be available from the generate result

Screenshots

  1. Error when webm is sent using a data URL: [screenshot]
  2. Genkit flow showing that the openai/whisper-1 plugin has the segments and words, but they are not passed up to the flow: [screenshots]

Runtime (please complete the following information):

  • OS: macOS 15.5

Node version
v22.14.0

Additional context
I am triggering this inside a Firebase function that I was testing locally in the Firebase emulator.

Edit:
Raised potential fixes for each part of the problem. I am patching these in for my own project. Looking forward to helping get this fixed. Thanks for Genkit.

Labels: bug (Something isn't working), js
