Audio Transcription
The Speech API provides audio transcription capabilities that convert speech to text with optional word-level timestamps. This is useful for generating captions, creating searchable text from audio, or analyzing spoken content.
Transcribe Audio
Convert speech to text with optional word-level timestamps.
Endpoint
- POST /v1/transcription_or_translation
Request
Parameters
- audio_url (required): URL of the audio file, or the base64-encoded audio as a data URI
- sampling_rate (required): Audio sampling rate in Hz (recommended: 16000)
- temperature (optional): Controls randomness in generation. Use 0.0 for deterministic output. Default: 0.0
- max_tokens (optional): Maximum number of tokens to generate. Default: 1024
Response
Returns a transcription result with:
- transcript: Complete transcribed text
- transcript_translation_with_timestamp: Array of word-level segments, each with:
  - start: Start time in seconds
  - end: End time in seconds
  - transcript: Transcribed word or phrase
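As an illustration of working with the response fields, the sketch below converts the timestamped segments into SubRip (SRT) caption cues. It assumes the response is a JSON object with exactly the fields listed above. The sample data is invented for illustration, and the helper names `format_srt_time` and `segments_to_srt` are hypothetical.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Turn a list of {start, end, transcript} segments into SRT text.

    Each segment becomes one numbered cue. Real captions would usually
    group several words per cue; this keeps one cue per segment.
    """
    cues = []
    for index, segment in enumerate(segments, start=1):
        start = format_srt_time(segment["start"])
        end = format_srt_time(segment["end"])
        cues.append(f"{index}\n{start} --> {end}\n{segment['transcript']}")
    return "\n\n".join(cues) + "\n"


# Hypothetical response, shaped after the documented fields.
sample_response = {
    "transcript": "Hello world",
    "transcript_translation_with_timestamp": [
        {"start": 0.0, "end": 0.42, "transcript": "Hello"},
        {"start": 0.42, "end": 0.9, "transcript": "world"},
    ],
}

if __name__ == "__main__":
    print(segments_to_srt(sample_response["transcript_translation_with_timestamp"]))
```

Working in integer milliseconds avoids floating-point drift when splitting a time into hours, minutes, and seconds.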
 
Use Cases
- Captioning: Generate accurate captions for videos with precise timing
- Transcription: Convert meetings, interviews, or podcasts to text
- Search: Make audio content searchable by text
- Analysis: Analyze speech patterns and content with timestamps