Chat with Image, Video, and Audio
The Chat API supports conversations that include images, short videos, and audio.
Our Chat API performs best with videos shorter than 30 seconds.
For longer videos, use our Vision API instead. After uploading your video, we run a sophisticated pipeline to process your video so that it is optimized for our models to query your video content.
For one-shot flows, see below on Working with video.
You can insert multimodal content in the conversation by using media content types. The supported types are: image_url, video_url, audio_url, and pdf_url.
Below is an example of sending an image of a cat by URL:
This will output a response like:
The animal in the image is a domestic cat. Specifically, it appears to be a ginger or orange tabby cat, which is characterized by its reddish-brown fur with darker stripes or patches. The cat is engaging in a common feline behavior of sniffing or licking objects, which in this case is a computer keyboard. Cats are known for their curiosity and often explore their environment by using their sense of smell, which is highly developed. The act of licking or sniffing can also be a way for cats to mark their territory with pheromones from their saliva.
Data URLs
The API supports sending media via data URLs, for example you could URL-encode a jpeg image and then set image_url to a value like "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAASABIAAD/4QmoRXhpZgAATU0AKgAAAAgADQEPAAIAA...".
Multiple media
You can send multiple media files in your request by appending them to the content array for a user message:
Working with video
Using the Chat API with a video URL
Note that this method only works if the video URL is unprotected - providers like Youtube employ defensive measures to prevent video download.
Our Vision API provides a managed service which helps you download and process videos from Youtube.
If you have a short video (less than 30 seconds), you can pass it into the Chat API using the video_url content type:
Longer videos using the Vision API (recommended)
For longer videos, use our Vision API. After uploading your video, we run a sophisticated pipeline to process and extract information from your video so that it is optimized for our models to query your video content. The Vision API supports downloads from Youtube so all you need to do is specify a video URL.
Longer videos using the Chat API - processing videos yourself
If you prefer to handle video processing of longer yourself (for e.g. to control frame sampling or reduce upload size), you can send pre-extracted frames to the Chat API using the video/jpeg MIME type.
First, extract frames from your video using ffmpeg. This command extracts one frame per second:
Then encode each frame as base64 and join them with commas:
In practice this is not too different from sending multiple image_url entries. However, currently there is a limit of 6 media items per user turn so video_url allows you to send more frames at once than if you used image_url.
Streaming, Async, and other advanced usage
Please see the guide for text-only chat for more guidance on the advanced features of the Chat API, which also work for multimodal inputs.