Audio Transcription

The Speech API provides audio transcription capabilities that convert speech to text with optional word-level timestamps. This is useful for generating captions, creating searchable text from audio, or analyzing spoken content.

Transcribe Audio

Convert speech to text with optional word-level timestamps.

Endpoint

POST /v1/transcription_or_translation

Request

Bash

curl -X POST https://api.reka.ai/v1/transcription_or_translation \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "data:audio/wav;base64,<your_base64_encoded_audio>",
    "sampling_rate": 16000,
    "temperature": 0.0,
    "max_tokens": 1024
  }'

Python

import base64
import io

import httpx
import librosa
import soundfile

REKA_API_KEY = "YOUR_API_KEY"
SAMPLING_RATE = 16_000

# Load the audio and resample it to SAMPLING_RATE
with soundfile.SoundFile("/path/to/audio.wav") as sound_file:
    waveform, _ = librosa.load(
        sound_file,
        sr=SAMPLING_RATE,
    )

# Re-encode the resampled waveform as WAV in memory, then base64-encode it
cache = io.BytesIO()
soundfile.write(cache, waveform, SAMPLING_RATE, format="WAV")
audio_in_base64 = base64.b64encode(cache.getvalue()).decode("ascii")

# Wrap the base64 string in a data URI
audio_url = f"data:audio/wav;base64,{audio_in_base64}"

# Make request
with httpx.Client(timeout=180, follow_redirects=True) as client:
    response = client.post(
        "https://api.reka.ai/v1/transcription_or_translation",
        json={
            "audio_url": audio_url,
            "sampling_rate": SAMPLING_RATE,
            "temperature": 0.0,
            "max_tokens": 1024,
        },
        headers={
            "X-Api-Key": REKA_API_KEY,
        },
    )

# Raise an exception on HTTP errors instead of parsing an error body
response.raise_for_status()
result = response.json()
print(result["transcript"])

# Print word-level timestamps
for word in result["transcript_translation_with_timestamp"]:
    print(f"{word['start']:.2f}s -> {word['end']:.2f}s: {word['transcript']}")

Parameters

audio_url (required): URL of the audio file, or the base64-encoded audio as a data URI
sampling_rate (required): Audio sampling rate in Hz (recommended: 16000)
temperature (optional): Controls randomness in generation. Use 0.0 for deterministic output. Default: 0.0
max_tokens (optional): Maximum number of tokens to generate. Default: 1024
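
The parameters above can be assembled into a request body with a small helper. This is a minimal sketch using only the standard library: `build_transcription_payload` is a hypothetical name (not part of any Reka SDK), the input is assumed to be WAV bytes, and the validation checks are assumptions rather than documented API constraints. The default values mirror the ones listed above.

```python
import base64


def build_transcription_payload(
    audio_bytes: bytes,
    sampling_rate: int = 16_000,
    temperature: float = 0.0,
    max_tokens: int = 1024,
) -> dict:
    """Build a JSON body for POST /v1/transcription_or_translation.

    Hypothetical helper: the checks below are illustrative assumptions,
    not rules taken from the API documentation.
    """
    if not audio_bytes:
        raise ValueError("audio_bytes must not be empty")
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be a positive number of Hz")

    # Encode the raw WAV bytes and wrap them in a data URI.
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "audio_url": f"data:audio/wav;base64,{encoded}",
        "sampling_rate": sampling_rate,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


# Placeholder bytes stand in for the contents of a real WAV file.
payload = build_transcription_payload(b"RIFF-placeholder-WAVE")
print(sorted(payload))
```

The resulting dictionary can be passed as the `json` argument of an HTTP client call such as the one in the Python request example.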

Response

Returns a JSON object containing the full transcript and its word-level timestamps:

{
  "transcript": "Full transcribed text of the audio",
  "transcript_translation_with_timestamp": [
    {
      "start": 0.0,
      "end": 0.5,
      "transcript": "Hello"
    },
    {
      "start": 0.5,
      "end": 1.2,
      "transcript": "world"
    }
  ]
}

transcript: Complete transcribed text
transcript_translation_with_timestamp: Array of word-level segments
- start: Start time in seconds
- end: End time in seconds
- transcript: Transcribed word or phrase
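
As one way to consume these fields, the sketch below groups the timestamped segments into SubRip (SRT) caption cues. The helper names `format_srt_time` and `segments_to_srt` and the `words_per_cue` grouping rule are assumptions made for illustration; the API itself does not return SRT. The sample data is copied from the response example above.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict], words_per_cue: int = 7) -> str:
    """Group consecutive segments into numbered SRT cues.

    Each cue spans from the start of its first segment to the end of its last.
    """
    cues = []
    for index in range(0, len(segments), words_per_cue):
        chunk = segments[index : index + words_per_cue]
        start = format_srt_time(chunk[0]["start"])
        end = format_srt_time(chunk[-1]["end"])
        text = " ".join(segment["transcript"] for segment in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Sample response, matching the documented response shape.
result = {
    "transcript": "Hello world",
    "transcript_translation_with_timestamp": [
        {"start": 0.0, "end": 0.5, "transcript": "Hello"},
        {"start": 0.5, "end": 1.2, "transcript": "world"},
    ],
}

srt = segments_to_srt(result["transcript_translation_with_timestamp"])
print(srt)
```

For the two sample segments this produces a single cue running from 00:00:00,000 to 00:00:01,200 with the text "Hello world". A fixed word count per cue is the simplest grouping; splitting on pauses between `end` and the next `start` would be a reasonable alternative.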

Use Cases

Captioning: Generate accurate captions for videos with precise timing
Transcription: Convert meetings, interviews, or podcasts to text
Search: Make audio content searchable by text
Analysis: Analyze speech patterns and content with timestamps
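
The search use case can be sketched with the timestamped segments alone: look up a keyword and return the time ranges where it was spoken. `find_keyword` is a hypothetical helper, and case-insensitive substring matching is an assumption chosen for simplicity; a production search would likely normalize punctuation and match whole words.

```python
def find_keyword(segments: list[dict], keyword: str) -> list[tuple[float, float]]:
    """Return (start, end) times of segments whose text contains the keyword.

    Matching is case-insensitive and substring-based.
    """
    needle = keyword.casefold()
    return [
        (segment["start"], segment["end"])
        for segment in segments
        if needle in segment["transcript"].casefold()
    ]


# Segments in the documented response shape.
segments = [
    {"start": 0.0, "end": 0.5, "transcript": "Hello"},
    {"start": 0.5, "end": 1.2, "transcript": "world"},
]

matches = find_keyword(segments, "WORLD")
print(matches)  # [(0.5, 1.2)]
```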