
[JS] Pipeline for transcription using the whisper model seems broken #3366

@nsrCodes

Description


Describe the bug
Trying to use the Whisper model for transcription has two problems:

  1. the underlying openai module expects the uploaded file's name to contain a supported audio extension (e.g. .webm), which the data-URL path does not provide (see the sketch after this list)
  2. even if that gets fixed, the raw timestamped segments/words cannot be extracted, because the raw response is not returned from the result of generate()
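
For context (not part of the original repro), here is a minimal sketch, assuming the standard openai Node SDK, of the upload shape the Whisper endpoint accepts; the key point is that the file handed to audio.transcriptions.create has to carry a name with a supported extension (transcribeBuffer is just an illustrative name):

import OpenAI, { toFile } from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function transcribeBuffer(audio: Uint8Array) {
  // Naming the buffer 'audio.webm' lets the API infer the container format;
  // an unnamed buffer (as produced from a bare data URL) gets rejected.
  const file = await toFile(Buffer.from(audio), 'audio.webm');
  return client.audio.transcriptions.create({
    file,
    model: 'whisper-1',
    response_format: 'verbose_json',
    timestamp_granularities: ['word', 'segment'],
  });
}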

To Reproduce

Sample flow to take an audio buffer and generate a transcript with Whisper:
import { genkit, z} from 'genkit';
import { openAI } from '@genkit-ai/compat-oai/openai';

const whisperModel = openAI.model('whisper-1', {
  response_format: 'verbose_json',
  timestamp_granularities: ['word', 'segment'],
});
const ai = genkit({
  plugins: [
    openAI({apiKey: process.env.OPENAI_API_KEY}),
  ],
})

const inputSchema = z.object({
  audioFile: z.instanceof(Uint8Array).describe('Audio file data as Uint8Array'),
})

const outputSchema = z.object({
  text: z.string().describe('Transcribed text from the audio'),
  segments: z.array(z.object({
    id: z.number().describe('id of the segment'),
    seek: z.number().describe('Seek time of the segment in seconds'),
    start: z.number().describe('Start time of the segment in seconds'),
    end: z.number().describe('End time of the segment in seconds'),
    text: z.string().describe('Text content of the segment'),
  })).describe('List of segments with their start and end times'),
  words: z.array(z.object({
    start: z.number().describe('Start time of the word in seconds'),
    end: z.number().describe('End time of the word in seconds'),
    word: z.string().describe('Text content of the word'),
  })).describe('List of words with their start and end times'),
  title: z.string().optional().describe('Optional title for the transcription'),
  description: z.string().optional().describe('Optional description for the transcription'),
})


export const transcribeAudio = ai.defineFlow({
  name: 'transcribeAudio',
  inputSchema: inputSchema,
  outputSchema: outputSchema,
}, async ({ audioFile }) => {
  const audioDataURL = `data:audio/webm;base64,${Buffer.from(audioFile).toString('base64')}`
  const transcriptionResponse = await ai.generate({
    prompt: [
      {
        media: {
          contentType: 'audio/webm;codecs=opus',
          url: audioDataURL,
        }
      }
    ],
    model: whisperModel,
    config: {
      response_format: 'verbose_json',
      timestamp_granularities: ['word', 'segment'],
    }
  })
  // The verbose_json fields (segments/words) are expected on the raw response here,
  // but the plugin does not return them from generate (the bug described above).
  // @ts-ignore
  const {segments, words, text} = transcriptionResponse.raw;
  console.log("Transcription response:", {segments, words, text});


  return {
    text,
    segments,
    words,
  }
})
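
Not part of the original report, but until the plugin passes the verbose response through, one possible workaround is to call the OpenAI SDK directly inside the flow instead of going through ai.generate for the transcription step, so segments and words survive. A sketch, assuming the same ai, inputSchema and outputSchema as above and the standard openai package (transcribeAudioDirect is an illustrative name):

import OpenAI, { toFile } from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export const transcribeAudioDirect = ai.defineFlow({
  name: 'transcribeAudioDirect',
  inputSchema: inputSchema,
  outputSchema: outputSchema,
}, async ({ audioFile }) => {
  // Give the buffer an explicit name so the API can infer the audio format.
  const file = await toFile(Buffer.from(audioFile), 'audio.webm');
  // The SDK's static return type only exposes text, so widen it here;
  // with response_format 'verbose_json' the payload also carries segments and words.
  const res: any = await client.audio.transcriptions.create({
    file,
    model: 'whisper-1',
    response_format: 'verbose_json',
    timestamp_granularities: ['word', 'segment'],
  });
  return {
    text: res.text,
    segments: (res.segments ?? []).map((s: any) => ({
      id: s.id, seek: s.seek, start: s.start, end: s.end, text: s.text,
    })),
    words: (res.words ?? []).map((w: any) => ({ start: w.start, end: w.end, word: w.word })),
  };
});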

Expected behavior
segments, words, and text should all be available from the generate result

Screenshots

  1. Error when webm is sent using a data URL: [screenshot]
  2. Genkit flow showing that the openai/whisper-1 plugin has the segments and words, but they are not passed up to the flow: [screenshots]

Runtime (please complete the following information):

  • OS: macOS 15.5

Node version
v22.14.0

Additional context
I am triggering this inside a Firebase function that I was testing locally in the Firebase emulator.

Edit:
Raised potential fixes for each part of the problem. I am patching these in for my own project. Looking forward to helping get this fixed. Thanks for Genkit.

Labels: bug (Something isn't working), js
