Feature Request: Native video file input support in Responses API (parity with Google Gemini & Volcengine) #1778
Description
Summary
The OpenAI Responses API currently supports images, PDFs, documents, spreadsheets, and code files as input_file, but does not support video files (mp4, webm, mov, etc.) as native input. Google Gemini and Volcengine Seed2 have supported native video input for over a year.
The gap
GPT-4o launch (May 2024) explicitly stated the model accepts "any combination of text, audio, image, and video inputs." The demo showed real-time video understanding. ChatGPT's Advanced Voice Mode supports live camera input today.
Nearly 2 years later, the API still has no video input support.
Current input_file accepted types (from docs):
- PDF, DOCX, PPTX, XLSX, CSV ✅
- Text, code, markdown ✅
- Images (via input_image) ✅
- Video: not listed, not supported ❌
- Audio files as input_file: not supported ❌ (only via the Realtime API or input_audio in Chat Completions)
The official cookbook recommends extracting video frames with ffmpeg and sending them as an image array — a workaround, not a solution. This loses temporal information, the audio track, and motion context.
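For context, the frame-extraction workaround looks roughly like this. This is a minimal sketch, not the cookbook's exact code: the sampling rate, frame format, and `frames_to_content` helper are illustrative, and the frames themselves would come from ffmpeg (e.g. `ffmpeg -i video.mp4 -vf fps=1 frame_%04d.jpg`).

```python
import base64


def frames_to_content(frames: list[bytes], prompt: str) -> list[dict]:
    """Build a Chat Completions-style content array from raw JPEG frames.

    Each frame becomes a separate base64 data-URL image entry, so the
    model sees a sequence of stills — no audio, no motion between frames.
    """
    content: list[dict] = [{"type": "text", "text": prompt}]
    for frame in frames:
        b64 = base64.b64encode(frame).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return content
```

Even at 1 fps, a 10-minute clip becomes 600 separate images, and everything between sampled frames is invisible to the model.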
What competitors offer
Google Gemini (available since Gemini 1.5, mid-2024)
```python
# Upload video via the File API
video_file = client.files.upload(file="video.mp4")

# Pass directly to generate_content
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[video_file, "Describe what happens in this video"],
)
```

- Native video input with audio
- Supports mp4, webm, mov, avi, etc.
- Model sees full temporal + audio information
- Up to 1 hour of video
Volcengine Seed2
- input_video field in the API request
- Native video understanding with audio
Proposed API
Following the existing input_file pattern:
```json
{
  "role": "user",
  "content": [
    {
      "type": "input_file",
      "file_id": "file-xxxxx"
    },
    {
      "type": "input_text",
      "text": "Describe what happens in this video"
    }
  ]
}
```

Alternatively, add video/mp4, video/webm, and video/quicktime to the accepted MIME types for input_file. The Files API already supports arbitrary uploads with purpose: "user_data".
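In SDK terms, the proposed flow would mirror the existing input_file pattern for PDFs. This is hypothetical — the Responses API does not accept video today — and the `build_video_input` helper is illustrative; only the upload step in the comments uses the real Files API:

```python
def build_video_input(file_id: str, prompt: str) -> list[dict]:
    """Build a Responses API input payload in the proposed shape.

    Upload would use the existing Files API, which already accepts
    arbitrary files:
        video = client.files.create(
            file=open("video.mp4", "rb"), purpose="user_data"
        )
    Then file_id = video.id would be passed here.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "input_file", "file_id": file_id},
                {"type": "input_text", "text": prompt},
            ],
        }
    ]
```

No new request shape is needed — the server would only have to accept video MIME types and process frames plus the audio track behind the existing input_file type.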
Why this matters
- Video editing / understanding products cannot use OpenAI as a provider for their core workflows
- Frame extraction workaround loses audio, temporal context, and motion — the model is literally blind to what happens between frames
- Competitive gap is widening — Gemini has had this for 1.5 years, and OpenAI's own ChatGPT product already has video understanding via Advanced Voice Mode
- The developer community has been asking for this since May 2024 with zero official response.
Scope
At minimum:
- Accept video files (mp4, webm) via input_file in the Responses API
- Extract and process both visual frames and the audio track natively
Stretch:
- Support in Realtime API for streaming video input
- Video file support in the Files API with purpose: "user_data"