
Feature Request: Native video file input support in Responses API (parity with Google Gemini & Volcengine) #1778

@leeclouddragon


Summary

The OpenAI Responses API currently supports images, PDFs, documents, spreadsheets, and code files as input_file, but does not support video files (mp4, webm, mov, etc.) as native input. Google Gemini and Volcengine Seed2 have supported native video input for over a year.

The gap

GPT-4o launch (May 2024) explicitly stated the model accepts "any combination of text, audio, image, and video inputs." The demo showed real-time video understanding. ChatGPT's Advanced Voice Mode supports live camera input today.

Nearly 2 years later, the API still has no video input support.

Current input_file accepted types (from docs):

  • PDF, DOCX, PPTX, XLSX, CSV ✅
  • Text, code, markdown ✅
  • Images (via input_image) ✅
  • Video: not listed, not supported ❌
  • Audio files as input_file: not supported ❌ (only via the Realtime API or input_audio in Chat Completions)

The official cookbook recommends extracting video frames with ffmpeg and sending them as an image array — a workaround, not a solution. This loses temporal information, audio track, and motion context.
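For reference, the workaround looks roughly like this — a sketch, assuming `ffmpeg` on PATH and the `openai` Python SDK; the sampling rate, file names, and `extract_frames`/`build_frame_content` helpers are illustrative, not from the cookbook verbatim:

```python
import base64
import glob
import subprocess

def extract_frames(video_path: str, fps: int = 1) -> list[str]:
    """Sample frames with ffmpeg and return them as base64-encoded JPEGs."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", f"fps={fps}", "frame_%04d.jpg"],
        check=True,
    )
    frames = []
    for path in sorted(glob.glob("frame_*.jpg")):
        with open(path, "rb") as f:
            frames.append(base64.b64encode(f.read()).decode("utf-8"))
    return frames

def build_frame_content(frames_b64: list[str], prompt: str) -> list[dict]:
    """Build a Responses API content array: one input_image per sampled frame."""
    content = [
        {"type": "input_image", "image_url": f"data:image/jpeg;base64,{b64}"}
        for b64 in frames_b64
    ]
    content.append({"type": "input_text", "text": prompt})
    return content

# frames = extract_frames("video.mp4")
# client.responses.create(
#     model="gpt-4o",
#     input=[{"role": "user",
#             "content": build_frame_content(frames, "Describe this video")}],
# )
```

Note what the model receives here: a bag of still images sampled at 1 fps, with no audio track and no notion of frame timing or ordering beyond array position.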

What competitors offer

Google Gemini (available since Gemini 1.5, mid-2024)

import time
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Upload video via the File API
video_file = client.files.upload(file="video.mp4")

# Video files are processed asynchronously; wait until the file is ready
while video_file.state.name == "PROCESSING":
    time.sleep(2)
    video_file = client.files.get(name=video_file.name)

# Pass the uploaded file directly to generate_content
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[video_file, "Describe what happens in this video"],
)
  • Native video input with audio
  • Supports mp4, webm, mov, avi, etc.
  • Model sees full temporal + audio information
  • Up to 1 hour of video

Volcengine Seed2

  • input_video field in API request
  • Native video understanding with audio

Proposed API

Following the existing input_file pattern:

{
  "role": "user",
  "content": [
    {
      "type": "input_file",
      "file_id": "file-xxxxx"
    },
    {
      "type": "input_text",
      "text": "Describe what happens in this video"
    }
  ]
}

Or add video/mp4, video/webm, video/quicktime to the accepted MIME types for input_file. The Files API already supports arbitrary uploads with purpose: "user_data".
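Under this proposal, Python SDK usage would mirror today's PDF flow. A sketch — the request shape below is hypothetical, since video MIME types are not accepted today, and the model name is illustrative:

```python
def build_video_request(file_id: str, prompt: str, model: str = "gpt-4o") -> dict:
    """Assemble the proposed Responses API request body for a video upload."""
    return {
        "model": model,
        "input": [{
            "role": "user",
            "content": [
                {"type": "input_file", "file_id": file_id},
                {"type": "input_text", "text": prompt},
            ],
        }],
    }

# With the openai SDK this would become (hypothetical — rejected today):
# from openai import OpenAI
# client = OpenAI()
# video = client.files.create(file=open("video.mp4", "rb"), purpose="user_data")
# resp = client.responses.create(**build_video_request(video.id, "Describe this video"))
# print(resp.output_text)
```

The upload step already works; it is only the `responses.create` call that rejects video file IDs, which is why this feels like a small lift on the API surface.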

Why this matters

  1. Video editing / understanding products cannot use OpenAI as a provider for their core workflows
  2. Frame extraction workaround loses audio, temporal context, and motion — the model is literally blind to what happens between frames
  3. Competitive gap is widening — Gemini has had this for 1.5 years, and OpenAI's own ChatGPT product already has video understanding via Advanced Voice Mode
  4. The developer community has been asking for this since May 2024 with zero official response.

Scope

At minimum:

  • Accept video files (mp4, webm) via input_file in the Responses API
  • Extract and process both visual frames and audio track natively

Stretch:

  • Support in Realtime API for streaming video input
  • Video file support in the Files API with purpose: "user_data"
