Grok TTS

Text-to-Speech • xAI

xAI's Grok text-to-speech model. It generates high-fidelity spoken audio in five expressive voices (eve, ara, rex, sal, leo) across 20+ supported languages, and supports inline speech tags for laughter, whispers, and pauses.

Model Info
Terms and License	link ↗
More information	link ↗
Pricing	View pricing in the Cloudflare dashboard ↗

TypeScript

const response = await env.AI.run(
  'xai/grok-tts',
  { text: 'Hello! Welcome to the xAI Text to Speech API.', language: 'en' },
)
console.log(response)

cURL

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "xai/grok-tts",
  "input": {
    "text": "Hello! Welcome to the xAI Text to Speech API.",
    "language": "en"
  }
}'

Output
Raw response

{
  "state": "Completed",
  "result": {
    "audio": "https://examples.aig.cloudflare.com/xai/grok-tts/simple-generation.mp3"
  },
  "gatewayMetadata": {
    "keySource": "Unified"
  }
}
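
The raw response above carries the audio as a URL rather than as inline bytes. Below is a minimal sketch of pulling that URL out defensively. It assumes the response always has the shape shown in this example (`state` and `result.audio`). The `GrokTtsResponse` type and the `extractAudioUrl` helper are illustrative names, not part of any Cloudflare SDK, and treating every state other than "Completed" as a failure is an assumption.

```typescript
// Hypothetical shape, inferred only from the sample response on this page.
interface GrokTtsResponse {
  state?: string;
  result?: { audio?: string };
}

// Returns the audio URL, or throws if the response does not look complete.
function extractAudioUrl(response: GrokTtsResponse): string {
  if (response.state !== 'Completed') {
    throw new Error(`Unexpected state: ${String(response.state)}`);
  }
  const url = response.result?.audio;
  if (typeof url !== 'string' || url.length === 0) {
    throw new Error('Response has no result.audio URL');
  }
  return url;
}

const sample: GrokTtsResponse = {
  state: 'Completed',
  result: { audio: 'https://examples.aig.cloudflare.com/xai/grok-tts/simple-generation.mp3' },
};
console.log(extractAudioUrl(sample));
```

In a Worker, the returned URL could then be passed to `fetch` to download the audio bytes.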

Examples

Different Voice — Use the warm, conversational `ara` voice

TypeScript

const response = await env.AI.run(
  'xai/grok-tts',
  { text: 'Thank you for calling. How can I help you today?', voice_id: 'ara', language: 'en' },
)
console.log(response)

cURL

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "xai/grok-tts",
  "input": {
    "text": "Thank you for calling. How can I help you today?",
    "voice_id": "ara",
    "language": "en"
  }
}'

Output
Raw response

{
  "state": "Completed",
  "result": {
    "audio": "https://examples.aig.cloudflare.com/xai/grok-tts/different-voice.mp3"
  },
  "gatewayMetadata": {
    "keySource": "Unified"
  }
}

High-Fidelity MP3 — 44.1 kHz / 192 kbps MP3 for production use

TypeScript

const response = await env.AI.run(
  'xai/grok-tts',
  {
    text: 'Crystal clear audio at maximum quality.',
    voice_id: 'rex',
    language: 'en',
    output_format: { codec: 'mp3', sample_rate: 44100, bit_rate: 192000 },
  },
)
console.log(response)

cURL

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "xai/grok-tts",
  "input": {
    "text": "Crystal clear audio at maximum quality.",
    "voice_id": "rex",
    "language": "en",
    "output_format": {
      "codec": "mp3",
      "sample_rate": 44100,
      "bit_rate": 192000
    }
  }
}'

Output
Raw response

{
  "state": "Completed",
  "result": {
    "audio": "https://examples.aig.cloudflare.com/xai/grok-tts/high-fidelity-mp3.mp3"
  },
  "gatewayMetadata": {
    "keySource": "Unified"
  }
}
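
To give a rough sense of what the 192 kbps setting costs in storage and bandwidth, the size of constant-bitrate audio is approximately bit_rate × duration / 8 bytes. The sketch below applies that arithmetic. It assumes constant-bitrate output and ignores headers and metadata; this page does not state whether the service encodes at a constant bitrate, so treat the numbers as estimates. `estimateAudioBytes` is an illustrative helper name.

```typescript
// Approximate size in bytes of constant-bitrate audio.
// bitRate is in bits per second, as in output_format.bit_rate.
function estimateAudioBytes(bitRate: number, seconds: number): number {
  return Math.round((bitRate / 8) * seconds);
}

// One minute at the 192 kbps used in this example.
console.log(estimateAudioBytes(192000, 60)); // 1440000

// One minute at the documented 128 kbps default.
console.log(estimateAudioBytes(128000, 60)); // 960000
```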

Telephony (mulaw) — G.711 μ-law at 8 kHz for SIP / PSTN integration

TypeScript

const response = await env.AI.run(
  'xai/grok-tts',
  {
    text: 'Hello, thank you for calling. How can I help you today?',
    voice_id: 'ara',
    language: 'en',
    output_format: { codec: 'mulaw', sample_rate: 8000 },
  },
)
console.log(response)

cURL

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "xai/grok-tts",
  "input": {
    "text": "Hello, thank you for calling. How can I help you today?",
    "voice_id": "ara",
    "language": "en",
    "output_format": {
      "codec": "mulaw",
      "sample_rate": 8000
    }
  }
}'

Output
Raw response

{
  "state": "Completed",
  "result": {
    "audio": "https://examples.aig.cloudflare.com/xai/grok-tts/telephony-law.mp3"
  },
  "gatewayMetadata": {
    "keySource": "Unified"
  }
}
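
Some downstream tools want linear PCM rather than μ-law. The sketch below shows the standard G.711 μ-law expansion to 16-bit samples. It assumes you already hold headerless μ-law bytes (one byte per sample, so 8,000 bytes per second at 8 kHz). This page does not say what container the returned file uses, so check that before applying it. `muLawToPcm16` is an illustrative name.

```typescript
const MULAW_BIAS = 0x84; // 132, the bias added during G.711 mu-law encoding

// Expand headerless G.711 mu-law bytes into 16-bit linear PCM samples.
function muLawToPcm16(encoded: Uint8Array): Int16Array {
  const pcm = new Int16Array(encoded.length);
  for (let i = 0; i < encoded.length; i++) {
    const u = ~encoded[i] & 0xff; // mu-law bytes are stored bit-inverted
    const exponent = (u >> 4) & 0x07; // 3-bit segment number
    const mantissa = u & 0x0f; // 4-bit step within the segment
    const magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS;
    pcm[i] = (u & 0x80) !== 0 ? -magnitude : magnitude; // top bit is the sign
  }
  return pcm;
}

// 0xff encodes silence; 0x00 and 0x80 are the negative and positive extremes.
const decoded = muLawToPcm16(new Uint8Array([0xff, 0x00, 0x80]));
console.log(Array.from(decoded)); // [ 0, -32124, 32124 ]
```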

Expressive Delivery — Inline speech tags for laughter, pauses, and whispers

TypeScript

const response = await env.AI.run(
  'xai/grok-tts',
  {
    text: 'So I walked in and [pause] there it was. [laugh] I honestly could not believe it! <whisper>It was a secret the whole time.</whisper>',
    voice_id: 'eve',
    language: 'en',
  },
)
console.log(response)

cURL

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "xai/grok-tts",
  "input": {
    "text": "So I walked in and [pause] there it was. [laugh] I honestly could not believe it! <whisper>It was a secret the whole time.</whisper>",
    "voice_id": "eve",
    "language": "en"
  }
}'

Output
Raw response

{
  "state": "Completed",
  "result": {
    "audio": "https://examples.aig.cloudflare.com/xai/grok-tts/expressive-delivery.mp3"
  },
  "gatewayMetadata": {
    "keySource": "Unified"
  }
}
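
When the text is assembled from dynamic content, small helpers can keep the tag syntax consistent. The sketch below uses only the three tags shown on this page ([pause], [laugh], and the whisper pair) and enforces the 1 to 15,000 character range documented for `text`. The names `PAUSE`, `LAUGH`, `whisper`, and `buildSpeechText` are hypothetical, and measuring the limit with JavaScript's string length, tags included, is an assumption.

```typescript
const PAUSE = '[pause]';
const LAUGH = '[laugh]';
const MAX_TEXT_LENGTH = 15000; // limit documented for the text parameter

// Wrap a phrase in the whisper tag pair.
function whisper(phrase: string): string {
  return `<whisper>${phrase}</whisper>`;
}

// Join fragments with single spaces and enforce the documented length range.
function buildSpeechText(...parts: string[]): string {
  const joined = parts.join(' ');
  if (joined.length < 1 || joined.length > MAX_TEXT_LENGTH) {
    throw new Error(`text must be 1 to ${MAX_TEXT_LENGTH} characters, got ${joined.length}`);
  }
  return joined;
}

const speech = buildSpeechText(
  'So I walked in and', PAUSE, 'there it was.', LAUGH,
  whisper('It was a secret the whole time.'),
);
console.log(speech);
```

The resulting string can be passed as the `text` field of the request shown above.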

Text Normalization — Convert written numbers and abbreviations to spoken form

TypeScript

const response = await env.AI.run(
  'xai/grok-tts',
  {
    text: 'The total is $1,234.56 and the meeting is at 3pm on Jan 15th.',
    voice_id: 'rex',
    language: 'en',
    text_normalization: true,
  },
)
console.log(response)

cURL

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run \
  --header "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "xai/grok-tts",
  "input": {
    "text": "The total is $1,234.56 and the meeting is at 3pm on Jan 15th.",
    "voice_id": "rex",
    "language": "en",
    "text_normalization": true
  }
}'

Output
Raw response

{
  "state": "Completed",
  "result": {
    "audio": "https://examples.aig.cloudflare.com/xai/grok-tts/text-normalization.mp3"
  },
  "gatewayMetadata": {
    "keySource": "Unified"
  }
}

Parameters

Input

language

string, required. BCP-47 language code (e.g. "en", "zh", "pt-BR") or "auto" for automatic language detection. xAI returns 400 if it is omitted. Supported codes: auto, en, ar-EG, ar-SA, ar-AE, bn, zh, fr, de, hi, id, it, ja, ko, pt-BR, pt-PT, ru, es-MX, es-ES, tr, vi.

optimize_streaming_latency

one of

output_format

object. Output audio format. Defaults to MP3 at 24 kHz / 128 kbps when omitted.

text

string, required, minLength: 1, maxLength: 15000. Text to convert to speech. Supports inline speech tags: [pause], [laugh], <whisper>…</whisper>, etc.

text_normalization

boolean. When true, normalizes written-form text into spoken form before synthesis (e.g. "Dr." → "Doctor", "100" → "one hundred"). Defaults to false.

voice_id

string, minLength: 1. Voice for synthesis. Defaults to "eve". Built-in voices: eve (energetic), ara (warm), rex (confident), sal (balanced), leo (authoritative). Custom voice IDs from /v1/tts/voices are also accepted. Case-insensitive: "Eve", "EVE", and "eve" are equivalent.

Output

audio

string. Presigned R2 URL for the generated audio file. MIME type reflects the requested codec (audio/mpeg for mp3, audio/wav for wav, etc.).
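
The input constraints listed above can be checked client-side before a request is sent, which avoids a round trip for mistakes such as a missing language. The sketch below covers only the constraints stated on this page. `GrokTtsInput` and `validateInput` are hypothetical names, and comparing language codes case-sensitively is an assumption, since the page only states case-insensitivity for `voice_id`.

```typescript
// Language codes copied from the language parameter description.
const SUPPORTED_LANGUAGES = new Set([
  'auto', 'en', 'ar-EG', 'ar-SA', 'ar-AE', 'bn', 'zh', 'fr', 'de', 'hi', 'id',
  'it', 'ja', 'ko', 'pt-BR', 'pt-PT', 'ru', 'es-MX', 'es-ES', 'tr', 'vi',
]);

// Hypothetical input type covering the fields validated here.
interface GrokTtsInput {
  text: string;
  language: string;
  voice_id?: string;
  text_normalization?: boolean;
}

// Returns a list of problems; an empty list means the input passed these checks.
function validateInput(input: GrokTtsInput): string[] {
  const problems: string[] = [];
  if (input.text.length < 1 || input.text.length > 15000) {
    problems.push('text must be between 1 and 15000 characters');
  }
  if (!SUPPORTED_LANGUAGES.has(input.language)) {
    problems.push(`unsupported language code: ${input.language}`);
  }
  // Custom voice IDs are accepted, so only the minLength rule is checked.
  if (input.voice_id !== undefined && input.voice_id.length < 1) {
    problems.push('voice_id must not be empty');
  }
  return problems;
}

console.log(validateInput({ text: 'Hello', language: 'en', voice_id: 'Ara' }));
console.log(validateInput({ text: '', language: 'english' }));
```

The first call returns an empty list; the second reports both the empty text and the unrecognized language code.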
