
Grok STT

Automatic Speech Recognition · xAI

xAI's Grok speech-to-text model. Transcribes audio files into text across 25 languages with word-level timestamps, multichannel transcription, speaker diarization, and key-term biasing.

Model Info
Terms and License
More information
Pricing: View pricing in the Cloudflare dashboard

Usage

TypeScript
const response = await env.AI.run(
  'xai/grok-stt',
  { url: 'https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3' },
)
console.log(response)
How old is the Brooklyn Bridge?

Examples

With Language and Formatting — Enable Inverse Text Normalization so spoken numbers become digits
TypeScript
const response = await env.AI.run(
  'xai/grok-stt',
  {
    url: 'https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3',
    language: 'en',
    format: true,
  },
)
console.log(response)
How old is the Brooklyn Bridge?
Speaker Diarization with Key Terms — Identify speakers and bias transcription toward proper nouns
TypeScript
const response = await env.AI.run(
  'xai/grok-stt',
  {
    url: 'https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3',
    language: 'en',
    diarize: true,
    keyterm: ['Brooklyn', 'Manhattan'],
  },
)
console.log(response)
How old is the Brooklyn Bridge?
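With `diarize: true`, each word carries a `speaker` integer, so a common follow-up step is to merge consecutive words from the same speaker into turns. The sketch below illustrates that step only. The `DiarizedWord` shape (in particular the `text` field name) and the `groupBySpeaker` helper are hypothetical; check the actual response schema before relying on them.

```typescript
// Hypothetical word shape: only the `speaker` integer is documented.
// The `text` field name is an assumption made for this sketch.
interface DiarizedWord {
  text: string;
  speaker: number;
}

interface SpeakerTurn {
  speaker: number;
  text: string;
}

// Merge consecutive words spoken by the same speaker into a single turn.
function groupBySpeaker(words: DiarizedWord[]): SpeakerTurn[] {
  const turns: SpeakerTurn[] = [];
  for (let i = 0; i < words.length; i++) {
    const word = words[i];
    if (word === undefined) continue;
    const last = turns[turns.length - 1];
    if (last !== undefined && last.speaker === word.speaker) {
      // Same speaker as the previous word: extend the current turn.
      last.text += ' ' + word.text;
    } else {
      // Speaker changed (or first word): start a new turn.
      turns.push({ speaker: word.speaker, text: word.text });
    }
  }
  return turns;
}
```

Given words tagged with speakers 0, 0, 1, this would yield two turns, one per speaker change.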
Filler Words Preserved — Keep filler words (uh, um, er) in the transcript instead of removing them
TypeScript
const response = await env.AI.run(
  'xai/grok-stt',
  {
    url: 'https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3',
    language: 'en',
    filler_words: true,
  },
)
console.log(response)
How old is the Brooklyn Bridge?
Data URI Upload — Pass the audio file directly as a base64 data URI (mutually exclusive with `url`)
TypeScript
const response = await env.AI.run(
  'xai/grok-stt',
  { file: 'data:audio/wav;base64,<...>' },
)
console.log(response)
{
  "state": "Completed",
  "result": {
    "text": "",
    "language": "",
    "duration": 1
  },
  "gatewayMetadata": {
    "keySource": "Unified"
  }
}
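The `file` value in the Data URI example can be built from raw bytes. The following is a minimal sketch, not part of the documented API: `toAudioDataUri` and `MAX_FILE_BYTES` are hypothetical names, and treating the 25 MB gateway limit as 25 × 1024 × 1024 bytes of decoded audio is an assumption.

```typescript
// Assumption: the 25 MB gateway-side limit is interpreted here as
// 25 * 1024 * 1024 bytes of decoded audio. The exact definition may differ.
const MAX_FILE_BYTES = 25 * 1024 * 1024;

// Hypothetical helper: encode audio bytes as a base64 data URI for `file`.
function toAudioDataUri(bytes: Uint8Array, mimeType: string): string {
  if (bytes.length > MAX_FILE_BYTES) {
    throw new Error(
      'Audio is ' + bytes.length + ' bytes; consider passing it via `url` instead',
    );
  }
  // btoa expects a binary string, so map each byte to one character.
  let binary = '';
  bytes.forEach((byte) => {
    binary += String.fromCharCode(byte);
  });
  return 'data:' + mimeType + ';base64,' + btoa(binary);
}
```

The result could then be passed as `{ file: toAudioDataUri(bytes, 'audio/wav') }`. Larger files would need to go through `url`, which has no gateway-side size limit.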

Parameters

audio_format
string · enum: pcm, mulaw, alaw
Format hint for raw/headerless audio. Required for pcm, mulaw, alaw. Omit for container formats (mp3, wav, etc.); xAI auto-detects them.
channels
integer · minimum: 2 · maximum: 8
Number of audio channels (2–8). Required only for multichannel raw audio; auto-detected for container formats.
diarize
boolean
When true, enables speaker diarization. Each word in the response includes a `speaker` integer identifying the detected speaker.
file
string
Audio file as a data URI (data:audio/...;base64,...) or an HTTPS URL the gateway fetches and uploads. Supported container formats: flac, mp3, mp4, m4a, mkv, ogg, opus, wav, aac. Raw formats (pcm, mulaw, alaw) are also accepted; supply audio_format and sample_rate. Gateway-side size limit: 25 MB. Mutually exclusive with `url`.
filler_words
boolean
When true, filler words (uh, um, er) are included in the transcript. Defaults to false, in which case filler words are removed.
format
boolean
When true, enables Inverse Text Normalization: spoken numbers and currencies are converted to written form (e.g. "one hundred dollars" → "$100"). Requires language to be set.
language
string
Language code (e.g. "en", "fr", "de"). Used with format=true to enable Inverse Text Normalization. xAI transcribes in any language regardless; supplying this enables number/currency formatting in the transcript.
multichannel
boolean
When true, each audio channel is transcribed independently. Results are returned in the `channels` array. Requires channels ≥ 2.
sample_rate
integer · minimum: -9007199254740991 · maximum: 9007199254740991
Sample rate in Hz. Required when audio_format is set.
url
string · format: uri
HTTPS URL of an audio file for xAI to fetch server-side. Mutually exclusive with `file`. No gateway-side size limit applies.
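Several of the parameters above depend on each other, so it can help to check an input object before calling the model. The sketch below encodes one reading of those rules. `GrokSttInput` and `validateInput` are hypothetical names, `keyterm` is taken from the earlier example rather than from this list, and the multichannel check applies only to raw audio on the assumption that container formats have their channels auto-detected.

```typescript
// Hypothetical input type mirroring the parameters listed above.
interface GrokSttInput {
  url?: string;
  file?: string;
  audio_format?: 'pcm' | 'mulaw' | 'alaw';
  sample_rate?: number;
  channels?: number;
  multichannel?: boolean;
  diarize?: boolean;
  filler_words?: boolean;
  format?: boolean;
  language?: string;
  keyterm?: string[]; // appears in the diarization example, not in the list
}

// Hypothetical helper: returns a list of problems; an empty list means
// none of the documented constraints were violated.
function validateInput(input: GrokSttInput): string[] {
  const problems: string[] = [];
  if (input.url !== undefined && input.file !== undefined) {
    problems.push('`url` and `file` are mutually exclusive.');
  }
  if (input.format === true && !input.language) {
    problems.push('`format` requires `language` to be set.');
  }
  if (input.audio_format !== undefined && input.sample_rate === undefined) {
    problems.push('`sample_rate` is required when `audio_format` is set.');
  }
  if (
    input.channels !== undefined &&
    (!Number.isInteger(input.channels) || input.channels < 2 || input.channels > 8)
  ) {
    problems.push('`channels` must be an integer from 2 to 8.');
  }
  // Assumption: `channels` must be given explicitly only for raw audio.
  if (
    input.multichannel === true &&
    input.audio_format !== undefined &&
    input.channels === undefined
  ) {
    problems.push('`channels` is required for multichannel raw audio.');
  }
  return problems;
}
```

Returning a list rather than throwing on the first problem lets a caller report every issue at once. The gateway may enforce additional rules that this sketch does not cover.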
