Grok STT

Automatic Speech Recognition • xAI

xAI's Grok speech-to-text model. Transcribes audio files into text across 25 languages with word-level timestamps, multichannel transcription, speaker diarization, and key-term biasing.

Model Info
Pricing: View pricing in the Cloudflare dashboard.

const response = await env.AI.run(
  'xai/grok-stt',
  { url: 'https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3' },
)
console.log(response)

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "xai/grok-stt",
  "input": {
    "url": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3"
  }
}'

Output

How old is the Brooklyn Bridge?

Raw response

{
  "state": "Completed",
  "result": {
    "text": "How old is the Brooklyn Bridge?",
    "language": "English",
    "duration": 1.85,
    "words": [
      {
        "text": "How",
        "start": 0.14,
        "end": 0.28
      },
      {
        "text": "old",
        "start": 0.4,
        "end": 0.6
      },
      {
        "text": "is",
        "start": 0.65,
        "end": 0.75
      },
      {
        "text": "the",
        "start": 0.81,
        "end": 0.89
      },
      {
        "text": "Brooklyn",
        "start": 0.95,
        "end": 1.29
      },
      {
        "text": "Bridge?",
        "start": 1.35,
        "end": 1.69
      }
    ]
  },
  "gatewayMetadata": {
    "keySource": "Unified"
  }
}
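
The raw response above can be modeled with a small type so that calling code reads the transcript safely. This is a minimal sketch inferred from the sample response only: the names `SttWord`, `SttResponse`, and `transcriptOrThrow` are hypothetical, and because `Completed` is the only state shown on this page, any other value is treated as a failure here.

```typescript
// Shapes inferred from the sample response on this page; not an official type.
interface SttWord {
  text: string;
  start: number; // seconds
  end: number; // seconds
  speaker?: number; // present only when diarize is true
}

interface SttResponse {
  state: string;
  result?: {
    text: string;
    language: string;
    duration: number;
    words?: SttWord[];
  };
}

// Hypothetical helper: returns the transcript, or throws if the call
// did not report the "Completed" state or carries no result.
function transcriptOrThrow(response: SttResponse): string {
  if (response.state !== "Completed" || !response.result) {
    throw new Error("Transcription not completed (state: " + response.state + ")");
  }
  return response.result.text;
}

// Abbreviated copy of the sample response shown above.
const sample: SttResponse = {
  state: "Completed",
  result: {
    text: "How old is the Brooklyn Bridge?",
    language: "English",
    duration: 1.85,
    words: [{ text: "How", start: 0.14, end: 0.28 }],
  },
};

const transcript = transcriptOrThrow(sample);
console.log(transcript);
```

In a Worker, the value returned by `env.AI.run('xai/grok-stt', ...)` would be passed in place of `sample`, assuming it has the shape shown in the raw response.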

Examples

With Language and Formatting — Enable Inverse Text Normalization so spoken numbers become digits

const response = await env.AI.run(
  'xai/grok-stt',
  {
    url: 'https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3',
    language: 'en',
    format: true,
  },
)
console.log(response)

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "xai/grok-stt",
  "input": {
    "url": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3",
    "language": "en",
    "format": true
  }
}'

Output

How old is the Brooklyn Bridge?

Raw response

{
  "state": "Completed",
  "result": {
    "text": "How old is the Brooklyn Bridge?",
    "language": "English",
    "duration": 1.85,
    "words": [
      {
        "text": "How",
        "start": 0.14,
        "end": 0.28
      },
      {
        "text": "old",
        "start": 0.4,
        "end": 0.6
      },
      {
        "text": "is",
        "start": 0.65,
        "end": 0.75
      },
      {
        "text": "the",
        "start": 0.81,
        "end": 0.89
      },
      {
        "text": "Brooklyn",
        "start": 0.95,
        "end": 1.29
      },
      {
        "text": "Bridge?",
        "start": 1.35,
        "end": 1.69
      }
    ]
  },
  "gatewayMetadata": {
    "keySource": "Unified"
  }
}

Speaker Diarization with Key Terms — Identify speakers and bias transcription toward proper nouns

const response = await env.AI.run(
  'xai/grok-stt',
  {
    url: 'https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3',
    language: 'en',
    diarize: true,
    keyterm: ['Brooklyn', 'Manhattan'],
  },
)
console.log(response)

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "xai/grok-stt",
  "input": {
    "url": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3",
    "language": "en",
    "diarize": true,
    "keyterm": [
      "Brooklyn",
      "Manhattan"
    ]
  }
}'

Output

How old is the Brooklyn Bridge?

Raw response

{
  "state": "Completed",
  "result": {
    "text": "How old is the Brooklyn Bridge?",
    "language": "English",
    "duration": 1.85,
    "words": [
      {
        "text": "How",
        "start": 0.14,
        "end": 0.28,
        "speaker": 0
      },
      {
        "text": "old",
        "start": 0.4,
        "end": 0.6,
        "speaker": 0
      },
      {
        "text": "is",
        "start": 0.65,
        "end": 0.75,
        "speaker": 0
      },
      {
        "text": "the",
        "start": 0.81,
        "end": 0.89,
        "speaker": 0
      },
      {
        "text": "Brooklyn",
        "start": 0.95,
        "end": 1.29,
        "speaker": 0
      },
      {
        "text": "Bridge?",
        "start": 1.35,
        "end": 1.69,
        "speaker": 0
      }
    ]
  },
  "gatewayMetadata": {
    "keySource": "Unified"
  }
}
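
Because each word carries a `speaker` integer when `diarize` is true, consecutive words can be merged into speaker turns. The following is a sketch under that assumption; `groupBySpeaker` and the `SpeakerTurn` type are hypothetical names, and the second speaker in the sample data is made up for illustration (the sample output above has a single speaker).

```typescript
interface DiarizedWord {
  text: string;
  start: number; // seconds
  end: number; // seconds
  speaker: number;
}

interface SpeakerTurn {
  speaker: number;
  start: number;
  end: number;
  text: string;
}

// Merge consecutive words from the same speaker into a single turn.
function groupBySpeaker(words: DiarizedWord[]): SpeakerTurn[] {
  const turns: SpeakerTurn[] = [];
  for (const word of words) {
    const last = turns[turns.length - 1];
    if (last !== undefined && last.speaker === word.speaker) {
      // Same speaker keeps talking: extend the current turn.
      last.text += " " + word.text;
      last.end = word.end;
    } else {
      // Speaker changed (or first word): start a new turn.
      turns.push({
        speaker: word.speaker,
        start: word.start,
        end: word.end,
        text: word.text,
      });
    }
  }
  return turns;
}

// Timestamps are from the sample output; speaker 1 is invented for this demo.
const diarizedWords: DiarizedWord[] = [
  { text: "How", start: 0.14, end: 0.28, speaker: 0 },
  { text: "old", start: 0.4, end: 0.6, speaker: 0 },
  { text: "is", start: 0.65, end: 0.75, speaker: 0 },
  { text: "the", start: 0.81, end: 0.89, speaker: 1 },
  { text: "Brooklyn", start: 0.95, end: 1.29, speaker: 1 },
  { text: "Bridge?", start: 1.35, end: 1.69, speaker: 1 },
];

const turns = groupBySpeaker(diarizedWords);
for (const turn of turns) {
  console.log("Speaker " + turn.speaker + ": " + turn.text);
}
```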

Filler Words Preserved — Keep filler words (uh, um, er) in the transcript instead of removing them

const response = await env.AI.run(
  'xai/grok-stt',
  {
    url: 'https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3',
    language: 'en',
    filler_words: true,
  },
)
console.log(response)

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "xai/grok-stt",
  "input": {
    "url": "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3",
    "language": "en",
    "filler_words": true
  }
}'

Output

How old is the Brooklyn Bridge?

Raw response

{
  "state": "Completed",
  "result": {
    "text": "How old is the Brooklyn Bridge?",
    "language": "English",
    "duration": 1.85,
    "words": [
      {
        "text": "How",
        "start": 0.14,
        "end": 0.28
      },
      {
        "text": "old",
        "start": 0.4,
        "end": 0.6
      },
      {
        "text": "is",
        "start": 0.65,
        "end": 0.75
      },
      {
        "text": "the",
        "start": 0.81,
        "end": 0.89
      },
      {
        "text": "Brooklyn",
        "start": 0.95,
        "end": 1.29
      },
      {
        "text": "Bridge?",
        "start": 1.35,
        "end": 1.69
      }
    ]
  },
  "gatewayMetadata": {
    "keySource": "Unified"
  }
}

Data URI Upload — Pass the audio file directly as a base64 data URI (mutually exclusive with `url`)

const response = await env.AI.run(
  'xai/grok-stt',
  { file: 'data:audio/wav;base64,<...>' },
)
console.log(response)

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "xai/grok-stt",
  "input": {
    "file": "data:audio/wav;base64,<...>"
  }
}'

{
  "state": "Completed",
  "result": {
    "text": "",
    "language": "",
    "duration": 1
  },
  "gatewayMetadata": {
    "keySource": "Unified"
  }
}
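
The `<...>` placeholder in the request above stands for base64-encoded audio. One way to build such a data URI from raw bytes is sketched below. The helper name `toDataUri` is hypothetical, and the size check assumes that the 25 MB gateway limit applies to the audio bytes and means 25,000,000 bytes; neither assumption is confirmed by this page. `btoa` is available in Workers, browsers, and recent Node.js versions.

```typescript
// Assumption: "25 MB" is read conservatively as 25,000,000 bytes.
const GATEWAY_LIMIT_BYTES = 25 * 1000 * 1000;

// Hypothetical helper: encode audio bytes as a base64 data URI.
function toDataUri(bytes: Uint8Array, mimeType: string): string {
  if (bytes.byteLength > GATEWAY_LIMIT_BYTES) {
    throw new Error("Audio exceeds the assumed 25 MB gateway limit; consider `url` instead.");
  }
  // btoa expects a binary string with one character per byte.
  let binary = "";
  bytes.forEach((byte) => {
    binary += String.fromCharCode(byte);
  });
  return "data:" + mimeType + ";base64," + btoa(binary);
}

// Two placeholder bytes (ASCII "Hi") stand in for real audio data.
const exampleUri = toDataUri(new Uint8Array([72, 105]), "audio/wav");
console.log(exampleUri);
```

The resulting string would be passed as the `file` field, with `url` omitted.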

Parameters

Input
Output

audio_format

string (enum: pcm, mulaw, alaw)
Format hint for raw/headerless audio. Required for pcm, mulaw, and alaw. Omit for container formats (mp3, wav, etc.); xAI auto-detects them.

channels

integer (minimum: 2, maximum: 8)
Number of audio channels (2–8). Required only for multichannel raw audio; auto-detected for container formats.

diarize

boolean
When true, enables speaker diarization. Each word in the response includes a `speaker` integer identifying the detected speaker.

file

string
Audio file as a data URI (data:audio/...;base64,...) or an HTTPS URL that the gateway fetches and uploads. Supported container formats: flac, mp3, mp4, m4a, mkv, ogg, opus, wav, aac. Raw formats (pcm, mulaw, alaw) are also accepted; supply audio_format and sample_rate for them. Gateway-side size limit: 25 MB. Mutually exclusive with `url`.

filler_words

boolean
When true, filler words (uh, um, er) are included in the transcript. Defaults to false, in which case filler words are removed.

format

boolean
When true, enables Inverse Text Normalization: spoken numbers and currencies are converted to written form (e.g. "one hundred dollars" → "$100"). Requires language to be set.

keyterm[]

array (maxItems: 100)
Key terms to bias transcription toward (e.g. product names, proper nouns). Each term may be up to 50 characters, with at most 100 terms. Sent as repeated form fields: keyterm=Term+One&keyterm=Term+Two.
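
The repeated form-field encoding mentioned above can be reproduced with the standard `URLSearchParams` API. This is only an illustration of that wire format; in the `env.AI.run` and cURL examples on this page, `keyterm` is passed as a plain JSON array.

```typescript
// Each term becomes its own `keyterm` field; spaces are encoded as "+".
const terms: string[] = ["Term One", "Term Two"];

const form = new URLSearchParams();
for (const term of terms) {
  form.append("keyterm", term);
}

const encodedTerms = form.toString();
console.log(encodedTerms); // keyterm=Term+One&keyterm=Term+Two
```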

language

string
Language code (e.g. "en", "fr", "de"). Used with format=true to enable Inverse Text Normalization. xAI transcribes the audio whether or not this is set; supplying it enables number/currency formatting in the transcript.

multichannel

boolean
When true, each audio channel is transcribed independently. Results are returned in the `channels` array. Requires channels ≥ 2.

sample_rate

integer (minimum: -9007199254740991, maximum: 9007199254740991)
Sample rate in Hz. Required when audio_format is set.

url

string (format: uri)
HTTPS URL of an audio file for xAI to fetch server-side. Mutually exclusive with `file`. No gateway-side size limit applies.
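
The constraints listed for the input parameters can be checked before a request is sent. The sketch below encodes only the rules stated on this page; `SttInput` and `validateSttInput` are hypothetical names, and treating a request with neither `url` nor `file` as invalid is an assumption (the page only says the two are mutually exclusive).

```typescript
// Hypothetical input type assembled from the parameter list on this page.
interface SttInput {
  url?: string;
  file?: string;
  language?: string;
  format?: boolean;
  diarize?: boolean;
  filler_words?: boolean;
  keyterm?: string[];
  multichannel?: boolean;
  channels?: number;
  audio_format?: "pcm" | "mulaw" | "alaw";
  sample_rate?: number;
}

// Returns a list of human-readable problems; an empty list means no rule was violated.
function validateSttInput(input: SttInput): string[] {
  const errors: string[] = [];

  // `url` and `file` are mutually exclusive; requiring one of them is an assumption.
  if ((input.url !== undefined) === (input.file !== undefined)) {
    errors.push("Provide exactly one of `url` or `file`.");
  }
  if (input.format === true && !input.language) {
    errors.push("`format` requires `language` to be set.");
  }
  if (input.audio_format !== undefined && input.sample_rate === undefined) {
    errors.push("`sample_rate` is required when `audio_format` is set.");
  }
  if (input.channels !== undefined) {
    const c = input.channels;
    if (!Number.isInteger(c) || c < 2 || c > 8) {
      errors.push("`channels` must be an integer from 2 to 8.");
    }
  }
  // Raw multichannel audio has no header to auto-detect the channel count from.
  if (input.multichannel === true && input.audio_format !== undefined && input.channels === undefined) {
    errors.push("`channels` is required for multichannel raw audio.");
  }
  if (input.keyterm !== undefined) {
    if (input.keyterm.length > 100) {
      errors.push("`keyterm` accepts at most 100 terms.");
    }
    if (input.keyterm.some((term) => term.length > 50)) {
      errors.push("Each `keyterm` entry may be at most 50 characters.");
    }
  }
  return errors;
}

const problems = validateSttInput({
  url: "https://storage.googleapis.com/cloud-samples-data/speech/brooklyn_bridge.mp3",
  format: true, // missing `language`, so one problem is reported
});
console.log(problems);
```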

Output

channels[]

array
Per-channel transcripts when multichannel=true.

duration

number
Audio duration in seconds, rounded to two decimal places.

language

string
Detected language name (e.g. "English", "French").

text

string
Full transcript text.

words[]

array
Word-level segments. Each entry has text, start, and end (in seconds), plus a speaker integer when diarize=true.
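
Word-level start and end times are enough to derive subtitle cues. The sketch below groups words into fixed-size cues and emits SubRip (SRT) text. The helper names and the three-words-per-cue default are arbitrary choices for illustration, not part of the API.

```typescript
interface TimedWord {
  text: string;
  start: number; // seconds
  end: number; // seconds
}

// Left-pad a non-negative integer with zeros to the given width (up to 3).
function pad(value: number, width: number): string {
  return ("000" + value).slice(-width);
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function formatTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + "," + pad(ms, 3);
}

// Group words into cues of at most `maxWordsPerCue` words each.
function wordsToSrt(words: TimedWord[], maxWordsPerCue: number = 3): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += maxWordsPerCue) {
    const group = words.slice(i, i + maxWordsPerCue);
    const first = group[0];
    const last = group[group.length - 1];
    if (first === undefined || last === undefined) continue;
    const text = group.map((w) => w.text).join(" ");
    cues.push(
      String(cues.length + 1) + "\n" +
        formatTimestamp(first.start) + " --> " + formatTimestamp(last.end) + "\n" +
        text,
    );
  }
  return cues.join("\n\n") + "\n";
}

// Words copied from the sample response on this page.
const sampleWords: TimedWord[] = [
  { text: "How", start: 0.14, end: 0.28 },
  { text: "old", start: 0.4, end: 0.6 },
  { text: "is", start: 0.65, end: 0.75 },
  { text: "the", start: 0.81, end: 0.89 },
  { text: "Brooklyn", start: 0.95, end: 1.29 },
  { text: "Bridge?", start: 1.35, end: 1.69 },
];

const srt = wordsToSrt(sampleWords);
console.log(srt);
```

A production captioner would more likely split cues on pauses or punctuation than on a fixed word count; the fixed count keeps the sketch short.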
