The Speech API provides audio transcription capabilities that convert speech to text with optional word-level timestamps. This is useful for generating captions, creating searchable text from audio, or analyzing spoken content.
Convert speech to text with optional word-level timestamps.
POST /v1/transcription_or_translation

audio_url (required): URL to the audio file or base64-encoded audio as data URI
sampling_rate (required): Audio sampling rate in Hz (recommended: 16000)
temperature (optional): Controls randomness in generation. Use 0.0 for deterministic output. Default: 0.0
max_tokens (optional): Maximum number of tokens to generate. Default: 1024

Returns a transcription result with:
transcript: Complete transcribed text
transcript_translation_with_timestamp: Array of word-level segments
    start: Start time in seconds
    end: End time in seconds
    transcript: Transcribed word or phrase