Audio Transcription
The Speech API provides audio transcription capabilities that convert speech to text with optional word-level timestamps. This is useful for generating captions, creating searchable text from audio, or analyzing spoken content.
Transcribe Audio
Convert speech to text with optional word-level timestamps.
Endpoint
POST /v1/transcription_or_translation
Request
Parameters
- audio_url (required): URL to the audio file, or base64-encoded audio as a data URI
- sampling_rate (required): Audio sampling rate in Hz (recommended: 16000)
- temperature (optional): Controls randomness in generation. Use 0.0 for deterministic output. Default: 0.0
- max_tokens (optional): Maximum number of tokens to generate. Default: 1024
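A minimal sketch of how a request with these parameters might be assembled and sent, using only the Python standard library. Only the path and the four parameter names come from this page. The host `api.example.com`, the JSON request body, the `Authorization: Bearer` header, and the helper names `build_payload`, `audio_bytes_to_data_uri`, and `transcribe` are assumptions for illustration; check your account documentation for the actual base URL and authentication scheme.

```python
import base64
import json
import urllib.request

# Assumption: placeholder host. This page documents only the path.
BASE_URL = "https://api.example.com"
ENDPOINT = "/v1/transcription_or_translation"


def build_payload(audio_url, sampling_rate=16000, temperature=0.0, max_tokens=1024):
    """Assemble the request body from the documented parameters."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be a positive number of Hz")
    return {
        "audio_url": audio_url,          # URL or base64 data URI
        "sampling_rate": sampling_rate,  # Hz; 16000 is the recommended value
        "temperature": temperature,      # 0.0 gives deterministic output
        "max_tokens": max_tokens,        # documented default is 1024
    }


def audio_bytes_to_data_uri(data, mime_type="audio/wav"):
    """Encode raw audio bytes as a base64 data URI usable as audio_url."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def transcribe(payload, api_key):
    """POST the payload as JSON and decode the JSON reply.

    Assumption: JSON body and Bearer-token auth; neither is stated on this page.
    """
    request = urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a real host and key in place, a call might look like `transcribe(build_payload("https://example.com/meeting.wav"), api_key)`. For local files, pass the output of `audio_bytes_to_data_uri` as `audio_url` instead.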
Response
Returns a transcription result with:
- transcript: Complete transcribed text
- transcript_translation_with_timestamp: Array of word-level segments, each with:
  - start: Start time in seconds
  - end: End time in seconds
  - transcript: Transcribed word or phrase
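As a sketch of how these fields fit together, the snippet below walks the segment array and prints each word with its time range. The `sample` dictionary is hand-written to match the field names listed above and is not real API output; `format_segments` is a hypothetical helper.

```python
def format_segments(result):
    """Return one '[start - end] text' line per timestamped segment."""
    lines = []
    for seg in result.get("transcript_translation_with_timestamp", []):
        lines.append(
            f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['transcript']}"
        )
    return lines


# Illustrative response shape only; values are made up.
sample = {
    "transcript": "hello world",
    "transcript_translation_with_timestamp": [
        {"start": 0.0, "end": 0.42, "transcript": "hello"},
        {"start": 0.5, "end": 0.9, "transcript": "world"},
    ],
}

for line in format_segments(sample):
    print(line)
```

The `.get(..., [])` fallback is a defensive choice so the helper still returns an empty list if a response carries no timestamped segments.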
Use Cases
- Captioning: Generate accurate captions for videos with precise timing
- Transcription: Convert meetings, interviews, or podcasts to text
- Search: Make audio content searchable by text
- Analysis: Analyze speech patterns and content with timestamps
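To illustrate the captioning and search use cases, here is a rough sketch that turns word-level segments into SubRip (SRT) caption cues and looks up when a word was spoken. It assumes segments shaped like the response fields described above (`start`, `end`, `transcript`); the helper names and the fixed words-per-cue grouping are illustrative choices, not part of the API.

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments, max_words=7):
    """Group consecutive segments into numbered caption cues."""
    cues = []
    for i in range(0, len(segments), max_words):
        group = segments[i:i + max_words]
        text = " ".join(seg["transcript"] for seg in group)
        start = srt_timestamp(group[0]["start"])
        end = srt_timestamp(group[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)


def find_word(segments, query):
    """Return the start times of segments whose text contains the query."""
    q = query.lower()
    return [seg["start"] for seg in segments if q in seg["transcript"].lower()]
```

Grouping by a fixed word count keeps the sketch short; a production captioner would more likely split cues on pauses or punctuation.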