Grok STT
Automatic Speech Recognition • xAI

xAI's Grok speech-to-text model. Transcribes audio files into text across 25 languages with word-level timestamps, multichannel transcription, speaker diarization, and key-term biasing.

| Model Info | |
|---|---|
| Terms and License | link ↗ |
| More information | link ↗ |
| Pricing | View pricing in the Cloudflare dashboard ↗ |
Usage

Workers AI binding:

    const response = await env.AI.run(
      'xai/grok-stt',
      {
        url: 'https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3',
      },
    )
    console.log(response)

curl:

    curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
      --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
      --header "Content-Type: application/json" \
      --data '{
        "model": "xai/grok-stt",
        "input": {
          "url": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3"
        }
      }'

Transcript: How old is the Brooklyn Bridge?

Response:

    {
      "state": "Completed",
      "result": {
        "text": "How old is the Brooklyn Bridge?",
        "language": "English",
        "duration": 1.85,
        "words": [
          { "text": "How", "start": 0.14, "end": 0.28 },
          { "text": "old", "start": 0.4, "end": 0.6 },
          { "text": "is", "start": 0.65, "end": 0.75 },
          { "text": "the", "start": 0.81, "end": 0.89 },
          { "text": "Brooklyn", "start": 0.95, "end": 1.29 },
          { "text": "Bridge?", "start": 1.35, "end": 1.69 }
        ]
      },
      "gatewayMetadata": { "keySource": "Unified" }
    }

Examples
With Language and Formatting: enable Inverse Text Normalization so spoken numbers become digits

Workers AI binding:

    const response = await env.AI.run(
      'xai/grok-stt',
      {
        url: 'https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3',
        language: 'en',
        format: true,
      },
    )
    console.log(response)

curl:

    curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
      --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
      --header "Content-Type: application/json" \
      --data '{
        "model": "xai/grok-stt",
        "input": {
          "url": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3",
          "language": "en",
          "format": true
        }
      }'

Transcript: How old is the Brooklyn Bridge?

Response:

    {
      "state": "Completed",
      "result": {
        "text": "How old is the Brooklyn Bridge?",
        "language": "English",
        "duration": 1.85,
        "words": [
          { "text": "How", "start": 0.14, "end": 0.28 },
          { "text": "old", "start": 0.4, "end": 0.6 },
          { "text": "is", "start": 0.65, "end": 0.75 },
          { "text": "the", "start": 0.81, "end": 0.89 },
          { "text": "Brooklyn", "start": 0.95, "end": 1.29 },
          { "text": "Bridge?", "start": 1.35, "end": 1.69 }
        ]
      },
      "gatewayMetadata": { "keySource": "Unified" }
    }

Speaker Diarization with Key Terms: identify speakers and bias transcription toward proper nouns

Workers AI binding:

    const response = await env.AI.run(
      'xai/grok-stt',
      {
        url: 'https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3',
        language: 'en',
        diarize: true,
        keyterm: ['Brooklyn', 'Manhattan'],
      },
    )
    console.log(response)

curl:

    curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
      --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
      --header "Content-Type: application/json" \
      --data '{
        "model": "xai/grok-stt",
        "input": {
          "url": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3",
          "language": "en",
          "diarize": true,
          "keyterm": ["Brooklyn", "Manhattan"]
        }
      }'

Transcript: How old is the Brooklyn Bridge?

Response:

    {
      "state": "Completed",
      "result": {
        "text": "How old is the Brooklyn Bridge?",
        "language": "English",
        "duration": 1.85,
        "words": [
          { "text": "How", "start": 0.14, "end": 0.28, "speaker": 0 },
          { "text": "old", "start": 0.4, "end": 0.6, "speaker": 0 },
          { "text": "is", "start": 0.65, "end": 0.75, "speaker": 0 },
          { "text": "the", "start": 0.81, "end": 0.89, "speaker": 0 },
          { "text": "Brooklyn", "start": 0.95, "end": 1.29, "speaker": 0 },
          { "text": "Bridge?", "start": 1.35, "end": 1.69, "speaker": 0 }
        ]
      },
      "gatewayMetadata": { "keySource": "Unified" }
    }

Filler Words Preserved: keep filler words (uh, um, er) in the transcript instead of removing them

Workers AI binding:

    const response = await env.AI.run(
      'xai/grok-stt',
      {
        url: 'https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3',
        language: 'en',
        filler_words: true,
      },
    )
    console.log(response)

curl:

    curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
      --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
      --header "Content-Type: application/json" \
      --data '{
        "model": "xai/grok-stt",
        "input": {
          "url": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3",
          "language": "en",
          "filler_words": true
        }
      }'

Transcript: How old is the Brooklyn Bridge?

Response:

    {
      "state": "Completed",
      "result": {
        "text": "How old is the Brooklyn Bridge?",
        "language": "English",
        "duration": 1.85,
        "words": [
          { "text": "How", "start": 0.14, "end": 0.28 },
          { "text": "old", "start": 0.4, "end": 0.6 },
          { "text": "is", "start": 0.65, "end": 0.75 },
          { "text": "the", "start": 0.81, "end": 0.89 },
          { "text": "Brooklyn", "start": 0.95, "end": 1.29 },
          { "text": "Bridge?", "start": 1.35, "end": 1.69 }
        ]
      },
      "gatewayMetadata": { "keySource": "Unified" }
    }

Data URI Upload: pass the audio file directly as a base64 data URI (mutually exclusive with `url`)
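
The `file` value in this example is a base64 data URI. One way to build such a value from raw audio bytes is sketched below. The names `toAudioDataUri` and `MAX_FILE_BYTES` are hypothetical and not part of the Workers AI API, and treating the 25 MB gateway limit as 25,000,000 decoded bytes is an assumption; the page does not say how the limit is measured.

```typescript
// Hypothetical helper, not part of the Workers AI API.
// Assumption: the 25 MB gateway limit is checked against the decoded byte
// count and "MB" means 10^6 bytes. Adjust if the gateway measures differently.
const MAX_FILE_BYTES = 25 * 1000 * 1000;

function toAudioDataUri(bytes: Uint8Array, mimeType: string): string {
  if (bytes.byteLength > MAX_FILE_BYTES) {
    throw new RangeError(
      "audio is " + bytes.byteLength + " bytes; assumed limit is " + MAX_FILE_BYTES,
    );
  }
  // btoa expects a "binary string" in which each character code is one byte.
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return "data:" + mimeType + ";base64," + btoa(binary);
}

// Two bytes ("Hi") stand in for real audio content here.
console.log(toAudioDataUri(new Uint8Array([72, 105]), "audio/wav"));
// prints "data:audio/wav;base64,SGk="
```

The returned string would then be passed as the `file` property of the input object.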

Workers AI binding:

    const response = await env.AI.run(
      'xai/grok-stt',
      {
        file: 'data:audio/wav;base64,<...>',
      },
    )
    console.log(response)

curl:

    curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
      --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
      --header "Content-Type: application/json" \
      --data '{
        "model": "xai/grok-stt",
        "input": {
          "file": "data:audio/wav;base64,<...>"
        }
      }'

Response:

    {
      "state": "Completed",
      "result": {
        "text": "",
        "language": "",
        "duration": 1
      },
      "gatewayMetadata": { "keySource": "Unified" }
    }

Parameters
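
Several constraints in this parameter reference span more than one field: `url` and `file` are mutually exclusive, `format` requires `language`, and `audio_format` requires `sample_rate`. A minimal client-side check might look like the sketch below. The `GrokSttInput` type and `validateInput` helper are hypothetical names transcribed from this page, not an official SDK type, and the page does not confirm that the gateway enforces each rule in exactly this way.

```typescript
// Input shape transcribed from the parameter reference on this page.
interface GrokSttInput {
  url?: string;
  file?: string;
  language?: string;
  format?: boolean;
  diarize?: boolean;
  filler_words?: boolean;
  keyterm?: string[];
  multichannel?: boolean;
  channels?: number;
  audio_format?: "pcm" | "mulaw" | "alaw";
  sample_rate?: number;
}

// Hypothetical helper: returns a list of problems; an empty list means the
// input satisfies the documented constraints.
function validateInput(input: GrokSttInput): string[] {
  const errors: string[] = [];

  // `url` and `file` are documented as mutually exclusive.
  if (input.url !== undefined && input.file !== undefined) {
    errors.push("url and file are mutually exclusive");
  }
  // Assumption: every request needs one audio source (not stated explicitly).
  if (input.url === undefined && input.file === undefined) {
    errors.push("one of url or file is required");
  }
  // format=true (Inverse Text Normalization) requires `language`.
  if (input.format === true && !input.language) {
    errors.push("format requires language");
  }
  // sample_rate is required whenever audio_format is set.
  if (input.audio_format !== undefined && input.sample_rate === undefined) {
    errors.push("audio_format requires sample_rate");
  }
  // channels must be an integer from 2 to 8 when supplied.
  if (input.channels !== undefined) {
    const c = input.channels;
    if (c % 1 !== 0 || c < 2 || c > 8) {
      errors.push("channels must be an integer from 2 to 8");
    }
  }
  // channels is required only for multichannel raw audio.
  if (
    input.multichannel === true &&
    input.audio_format !== undefined &&
    input.channels === undefined
  ) {
    errors.push("multichannel raw audio requires channels");
  }
  // keyterm: at most 100 terms, each up to 50 characters.
  const terms = input.keyterm !== undefined ? input.keyterm : [];
  if (terms.length > 100) errors.push("keyterm allows at most 100 terms");
  for (const term of terms) {
    if (term.length > 50) errors.push("keyterm too long: " + term);
  }
  return errors;
}

const problems = validateInput({
  url: "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3",
  format: true,
});
console.log(problems); // one entry: "format requires language"
```

Returning a list rather than throwing on the first problem lets a caller report every violated constraint at once.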

Input

| Parameter | Type | Constraints | Description |
|---|---|---|---|
| `audio_format` | string | enum: pcm, mulaw, alaw | Format hint for raw (headerless) audio. Required for pcm, mulaw, and alaw. Omit for container formats (mp3, wav, etc.); xAI auto-detects them. |
| `channels` | integer | minimum: 2, maximum: 8 | Number of audio channels (2–8). Required only for multichannel raw audio; auto-detected for container formats. |
| `diarize` | boolean | | When true, enables speaker diarization. Each word in the response includes a `speaker` integer identifying the detected speaker. |
| `file` | string | | Audio file as a data URI (data:audio/...;base64,...) or an HTTPS URL the gateway fetches and uploads. Supported container formats: flac, mp3, mp4, m4a, mkv, ogg, opus, wav, aac. Raw formats (pcm, mulaw, alaw) are also accepted; supply audio_format and sample_rate. Gateway-side size limit: 25 MB. Mutually exclusive with `url`. |
| `filler_words` | boolean | | When true, filler words (uh, um, er) are included in the transcript. Defaults to false, in which case filler words are removed. |
| `format` | boolean | | When true, enables Inverse Text Normalization: spoken numbers and currencies are converted to written form (e.g. "one hundred dollars" → "$100"). Requires language to be set. |
| `keyterm[]` | array | maxItems: 100 | Key terms to bias transcription toward (e.g. product names, proper nouns). At most 100 terms, each up to 50 characters. Sent as repeated form fields: keyterm=Term+One&keyterm=Term+Two. |
| `language` | string | | Language code (e.g. "en", "fr", "de"). xAI transcribes in any language regardless of this setting; supplying it together with format=true enables Inverse Text Normalization (number and currency formatting) in the transcript. |
| `multichannel` | boolean | | When true, each audio channel is transcribed independently and results are returned in the `channels` array. Requires channels ≥ 2. |
| `sample_rate` | integer | minimum: -9007199254740991, maximum: 9007199254740991 | Sample rate in Hz. Required when audio_format is set. |
| `url` | string | format: uri | HTTPS URL of an audio file for xAI to fetch server-side. Mutually exclusive with `file`. No gateway-side size limit applies. |

Output

| Field | Type | Description |
|---|---|---|
| `channels[]` | array | Per-channel transcripts when multichannel=true. |
| `duration` | number | Audio duration in seconds, to two decimal places. |
| `language` | string | Detected language name (e.g. "English", "French"). |
| `text` | string | Full transcript text. |
| `words[]` | array | Word-level segments. Each entry has text, start, and end (in seconds), plus a speaker integer when diarize=true. |
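
Because each `words` entry carries `start`, `end`, and (with diarize=true) `speaker`, the array can be folded into per-speaker turns. The sketch below is one way to do that. The `Word` and `SpeakerTurn` types and the `groupBySpeaker` helper are hypothetical names based on the response shown on this page, and treating a missing `speaker` as speaker 0 is an assumption made for illustration.

```typescript
// Shapes based on the example responses on this page (not an official SDK type).
interface Word {
  text: string;
  start: number;
  end: number;
  speaker?: number; // present only when diarize=true
}

interface SpeakerTurn {
  speaker: number;
  start: number;
  end: number;
  text: string;
}

// Hypothetical helper: merge consecutive words from the same speaker into one turn.
function groupBySpeaker(words: Word[]): SpeakerTurn[] {
  const turns: SpeakerTurn[] = [];
  for (const w of words) {
    // Assumption: without diarization, treat all words as speaker 0.
    const speaker = w.speaker !== undefined ? w.speaker : 0;
    const last = turns.length > 0 ? turns[turns.length - 1] : undefined;
    if (last !== undefined && last.speaker === speaker) {
      // Same speaker continues: extend the current turn.
      last.end = w.end;
      last.text += " " + w.text;
    } else {
      // Speaker changed (or first word): start a new turn.
      turns.push({ speaker: speaker, start: w.start, end: w.end, text: w.text });
    }
  }
  return turns;
}

// The diarized words from the example response on this page.
const sampleWords: Word[] = [
  { text: "How", start: 0.14, end: 0.28, speaker: 0 },
  { text: "old", start: 0.4, end: 0.6, speaker: 0 },
  { text: "is", start: 0.65, end: 0.75, speaker: 0 },
  { text: "the", start: 0.81, end: 0.89, speaker: 0 },
  { text: "Brooklyn", start: 0.95, end: 1.29, speaker: 0 },
  { text: "Bridge?", start: 1.35, end: 1.69, speaker: 0 },
];

const sampleTurns = groupBySpeaker(sampleWords);
console.log(sampleTurns.length, sampleTurns[0].text);
// prints: 1 How old is the Brooklyn Bridge?
```

With a single speaker the result is one turn spanning 0.14 s to 1.69 s; audio with several speakers would produce one turn per uninterrupted stretch of speech.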