Audio Transcription

The Speech API transcribes audio, converting speech to text with optional word-level timestamps. This is useful for generating captions, creating searchable text from audio, or analyzing spoken content.

Transcribe Audio

Convert speech to text with optional word-level timestamps.

Endpoint

  • POST /v1/transcription_or_translation

Request

Bash

curl -X POST https://api.reka.ai/v1/transcription_or_translation \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "data:audio/wav;base64,<your_base64_encoded_audio>",
    "sampling_rate": 16000,
    "temperature": 0.0,
    "max_tokens": 1024
  }'

Python

import base64
import io

import httpx
import librosa
import soundfile

REKA_API_KEY = "YOUR_API_KEY"
SAMPLING_RATE = 16_000

# Prepare audio: resample to SAMPLING_RATE (librosa also downmixes to mono
# by default), re-encode as WAV in memory, then base64-encode the bytes.
with soundfile.SoundFile("/path/to/audio.wav") as sound_file:
    waveform, _ = librosa.load(
        sound_file,
        sr=SAMPLING_RATE,
    )
    cache = io.BytesIO()
    soundfile.write(cache, waveform, SAMPLING_RATE, format="WAV")
    cache.seek(0)
    audio_in_base64 = base64.b64encode(cache.read()).decode("ascii")

# Wrap the encoded audio in a data URI
audio_url = f"data:audio/wav;base64,{audio_in_base64}"

# Make request
with httpx.Client(timeout=180, follow_redirects=True) as client:
    response = client.request(
        method="POST",
        url="https://api.reka.ai/v1/transcription_or_translation",
        json={
            "audio_url": audio_url,
            "sampling_rate": SAMPLING_RATE,
            "temperature": 0.0,
            "max_tokens": 1024,
        },
        headers={
            "X-Api-Key": REKA_API_KEY,
        },
    )
    # Raise on HTTP error status codes before parsing the body
    response.raise_for_status()
    result = response.json()
    print(result["transcript"])

    # Print word-level timestamps
    for word in result["transcript_translation_with_timestamp"]:
        print(f"{word['start']:.2f}s -> {word['end']:.2f}s: {word['transcript']}")

Parameters

  • audio_url (required): URL of the audio file, or base64-encoded audio as a data URI (for example, data:audio/wav;base64,...)
  • sampling_rate (required): Audio sampling rate in Hz (recommended: 16000)
  • temperature (optional): Controls randomness in generation. Use 0.0 for deterministic output. Default: 0.0
  • max_tokens (optional): Maximum number of tokens to generate. Default: 1024
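
The parameters above can be assembled into a request body with a small helper. This is a minimal sketch using only the standard library: the function name build_transcription_payload and its validation checks are illustrative assumptions, not part of the API, and it assumes the input bytes are already a WAV file at the stated sampling rate.

```python
import base64


def build_transcription_payload(
    audio_bytes: bytes,
    sampling_rate: int = 16_000,
    temperature: float = 0.0,
    max_tokens: int = 1024,
) -> dict:
    """Build a JSON body for POST /v1/transcription_or_translation.

    Hypothetical helper: the name and the validation rules are
    illustrative assumptions, not something the API defines.
    """
    if not audio_bytes:
        raise ValueError("audio_bytes is empty")
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be a positive number of Hz")

    # Encode the raw WAV bytes and wrap them in a data URI.
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "audio_url": f"data:audio/wav;base64,{encoded}",
        "sampling_rate": sampling_rate,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
```

The returned dictionary can be passed as the json argument of an HTTP client call; the optional fields fall back to the documented defaults when omitted.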

Response

Returns a transcription result with:

{
  "transcript": "Full transcribed text of the audio",
  "transcript_translation_with_timestamp": [
    {
      "start": 0.0,
      "end": 0.5,
      "transcript": "Hello"
    },
    {
      "start": 0.5,
      "end": 1.2,
      "transcript": "world"
    }
  ]
}

  • transcript: Complete transcribed text
  • transcript_translation_with_timestamp: Array of word-level segments
    • start: Start time in seconds
    • end: End time in seconds
    • transcript: Transcribed word or phrase
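
As an illustration of working with this response shape, the sketch below parses the timestamped segments into typed records and looks up where a word is spoken. The WordSegment class and the parse_segments and find_word functions are hypothetical names introduced here; only the field names come from the response described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordSegment:
    """One entry of transcript_translation_with_timestamp (hypothetical type)."""
    start: float  # seconds
    end: float    # seconds
    text: str


def parse_segments(result: dict) -> list[WordSegment]:
    """Convert the raw response dict into a list of WordSegment records."""
    return [
        WordSegment(float(item["start"]), float(item["end"]), item["transcript"])
        for item in result.get("transcript_translation_with_timestamp", [])
    ]


def find_word(segments: list[WordSegment], query: str) -> list[WordSegment]:
    """Return segments whose text contains the query, ignoring case."""
    needle = query.casefold()
    return [segment for segment in segments if needle in segment.text.casefold()]


# Sample response copied from the documentation above.
sample = {
    "transcript": "Full transcribed text of the audio",
    "transcript_translation_with_timestamp": [
        {"start": 0.0, "end": 0.5, "transcript": "Hello"},
        {"start": 0.5, "end": 1.2, "transcript": "world"},
    ],
}

segments = parse_segments(sample)
hits = find_word(segments, "World")
print([(hit.start, hit.end) for hit in hits])  # [(0.5, 1.2)]
```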

Use Cases

  • Captioning: Generate accurate captions for videos with precise timing
  • Transcription: Convert meetings, interviews, or podcasts to text
  • Search: Make audio content searchable by text
  • Analysis: Analyze speech patterns and content with timestamps
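
For the captioning use case, word-level timestamps can be grouped into caption cues. The following is a rough sketch that emits SubRip (SRT) text: the function names and the fixed words-per-cue grouping are assumptions made for illustration, and a production captioner would likely also split on pauses and line length.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group word segments into SRT cues of at most max_words words each.

    Each item in words is expected to have the start, end, and transcript
    keys of transcript_translation_with_timestamp.
    """
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        text = " ".join(word["transcript"] for word in chunk)
        # A cue spans from the first word's start to the last word's end.
        start = format_timestamp(chunk[0]["start"])
        end = format_timestamp(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


words = [
    {"start": 0.0, "end": 0.5, "transcript": "Hello"},
    {"start": 0.5, "end": 1.2, "transcript": "world"},
]
print(words_to_srt(words))
```

With the two sample words, this produces a single cue running from 00:00:00,000 to 00:00:01,200 with the text "Hello world".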