Describe the bug
Trying to use the Whisper model for transcription has two problems:
- the underlying `openai` module expects the file name to carry a recognized audio extension, so sending the audio as a data URL fails
- even if that gets fixed, the raw time-stamped segments/words cannot be extracted, because the raw response is not returned from the result of `generate`

A sketch of the underlying SDK call that illustrates both points follows this list.
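For context, here is a minimal sketch of the underlying OpenAI SDK call, assuming the `openai` Node package is used directly. The helper name `transcribeDirect` and the file name `audio.webm` are illustrative only, and depending on the SDK version the verbose fields may need a type cast:

```ts
import OpenAI, { toFile } from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Illustrative helper: upload the buffer with an explicit file name so the API
// can infer the audio format, then read the verbose_json fields.
async function transcribeDirect(audio: Uint8Array) {
  const file = await toFile(Buffer.from(audio), 'audio.webm'); // a name with an extension is required
  const transcription = await client.audio.transcriptions.create({
    file,
    model: 'whisper-1',
    response_format: 'verbose_json',
    timestamp_granularities: ['word', 'segment'],
  });
  // With verbose_json the payload also carries the time-stamped segments and words.
  return {
    text: transcription.text,
    segments: transcription.segments,
    words: transcription.words,
  };
}
```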
To Reproduce
Sample flow that takes an audio buffer and generates a transcript with Whisper:
```ts
import { genkit, z } from 'genkit';
import { openAI } from '@genkit-ai/compat-oai/openai';

const whisperModel = openAI.model('whisper-1', {
  response_format: 'verbose_json',
  timestamp_granularities: ['word', 'segment'],
});

const ai = genkit({
  plugins: [
    openAI({ apiKey: process.env.OPENAI_API_KEY }),
  ],
});

const inputSchema = z.object({
  audioFile: z.instanceof(Uint8Array).describe('Audio file data as Uint8Array'),
});

const outputSchema = z.object({
  text: z.string().describe('Transcribed text from the audio'),
  segments: z.array(z.object({
    id: z.number().describe('id of the segment'),
    seek: z.number().describe('Seek time of the segment in seconds'),
    start: z.number().describe('Start time of the segment in seconds'),
    end: z.number().describe('End time of the segment in seconds'),
    text: z.string().describe('Text content of the segment'),
  })).describe('List of segments with their start and end times'),
  words: z.array(z.object({
    start: z.number().describe('Start time of the word in seconds'),
    end: z.number().describe('End time of the word in seconds'),
    word: z.string().describe('Text content of the word'),
  })).describe('List of words with their start and end times'),
  title: z.string().optional().describe('Optional title for the transcription'),
  description: z.string().optional().describe('Optional description for the transcription'),
});

export const transcribeAudio = ai.defineFlow({
  name: 'transcribeAudio',
  inputSchema: inputSchema,
  outputSchema: outputSchema,
}, async ({ audioFile }) => {
  const audioDataURL = `data:audio/webm;base64,${Buffer.from(audioFile).toString('base64')}`;

  const transcriptionResponse = await ai.generate({
    prompt: [
      {
        media: {
          contentType: 'audio/webm;codecs=opus',
          url: audioDataURL,
        },
      },
    ],
    model: whisperModel,
    config: {
      response_format: 'verbose_json',
      timestamp_granularities: ['word', 'segment'],
    },
  });

  // @ts-ignore
  const { segments, words, text } = transcriptionResponse.raw;
  console.log('Transcription response:', { segments, words, text });

  return {
    text,
    segments,
    words,
  };
});
```
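To exercise the flow locally, a driver roughly like this can be used (the import path and the audio file path are placeholders, not from my setup):

```ts
import { readFile } from 'node:fs/promises';
import { transcribeAudio } from './transcribeAudio'; // placeholder path to the flow above

async function main() {
  // Any short webm/opus recording reproduces the problem.
  const audioFile = new Uint8Array(await readFile('sample.webm'));
  const result = await transcribeAudio({ audioFile });
  console.log(result.text, result.segments.length, result.words.length);
}

main().catch(console.error);
```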
Expected behavior
`segments`, `words`, and `text` should all be available from the result of `generate`.
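For reference, the fields the flow expects to pull out of the raw verbose_json response look roughly like this (the values are made up; only the shape matters):

```ts
// Abridged, illustrative shape of the verbose_json payload.
const exampleRaw = {
  text: 'hello world',
  segments: [
    { id: 0, seek: 0, start: 0.0, end: 1.2, text: 'hello world' },
  ],
  words: [
    { word: 'hello', start: 0.0, end: 0.5 },
    { word: 'world', start: 0.5, end: 1.2 },
  ],
};
```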
Screenshots
- Error when webm is sent using a data URL
- Genkit flow trace showing that the `openai/whisper-1` plugin has the segments and words, but they are not passed through to the result above
Runtime (please complete the following information):
- OS: macOS 15.5
- Node version: v22.14.0
Additional context
I am triggering this inside a Firebase function that I was testing locally in the Firebase emulator.
Edit:
I've raised potential fixes for each part of the problem and am patching them into my own project. Looking forward to helping get this fixed. Thanks for Genkit!
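In the meantime, a workaround along these lines works as a sketch: bypass `ai.generate` for the transcription step and call the `openai` SDK directly from the flow body. It reuses `ai`, `inputSchema`, and `outputSchema` from the repro above, and is an assumption on my part rather than what the proposed fixes actually do:

```ts
import OpenAI, { toFile } from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export const transcribeAudioWorkaround = ai.defineFlow({
  name: 'transcribeAudioWorkaround',
  inputSchema: inputSchema,
  outputSchema: outputSchema,
}, async ({ audioFile }) => {
  // Giving the upload an explicit name/extension avoids the file-format error.
  const file = await toFile(Buffer.from(audioFile), 'audio.webm');
  const res = await client.audio.transcriptions.create({
    file,
    model: 'whisper-1',
    response_format: 'verbose_json',
    timestamp_granularities: ['word', 'segment'],
  });
  return {
    text: res.text,
    segments: res.segments ?? [],
    words: res.words ?? [],
  };
});
```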