API Intermediate medium · 5 min

Voice options: alloy, echo, fable, onyx, nova, shimmer

What you will learn

Choose from six distinct voices when converting text to speech via the OpenAI API to match your application's tone and audience.

Why this matters

Voice selection directly impacts user experience and perceived quality of your speech output: the wrong voice can make professional content sound robotic or unsuitable for your use case, while the right choice builds trust and engagement.

Skip if: You need custom voice cloning, multilingual speakers, or voices optimized for accessibility (e.g., dyslexia-friendly). In those cases, consider dedicated TTS services like Google Cloud Text-to-Speech or ElevenLabs instead of OpenAI's fixed voice set.

Explanation

The OpenAI Text-to-Speech API offers six pre-built voices (alloy, echo, fable, onyx, nova, and shimmer), each with distinct tonal characteristics. You specify a voice when calling client.audio.speech.create(), and the API returns audio bytes in your chosen format (MP3, Opus, AAC, FLAC, WAV, or PCM).

Under the hood, the voices are produced by neural speech-synthesis models; OpenAI does not publish their architecture, so treat each voice as a fixed preset. The voice parameter applies to the whole request: you cannot blend voices or switch voices partway through the input, and the only vocal property you can adjust is playback rate via the speed parameter. If you need multiple voices in one audio file, make separate API calls and combine the audio downstream.
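The "separate calls, then combine" approach can be sketched as follows. This is a minimal illustration, not an official recipe: combine_voices, openai_synthesizer, and the Synthesizer type are hypothetical names, and the sketch assumes that response_format='pcm' returns headerless 24 kHz, 16-bit, mono samples, which you should confirm against the current API reference.

```python
import io
import wave
from typing import Callable, Iterable, Tuple

# Assumption: PCM output is headerless 24 kHz, 16-bit, mono, little-endian audio.
SAMPLE_RATE = 24_000
SAMPLE_WIDTH = 2  # bytes per sample (16-bit)
CHANNELS = 1

# A synthesizer takes (voice, text) and returns raw PCM bytes.
Synthesizer = Callable[[str, str], bytes]


def combine_voices(segments: Iterable[Tuple[str, str]], synthesize: Synthesizer) -> bytes:
    """Render each (voice, text) segment separately and return one WAV file as bytes."""
    pcm = b"".join(synthesize(voice, text) for voice, text in segments)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()


def openai_synthesizer() -> Synthesizer:
    """Build a synthesizer backed by the OpenAI SDK (one API call per segment)."""
    from openai import OpenAI  # imported lazily so combine_voices has no SDK dependency

    client = OpenAI()

    def synthesize(voice: str, text: str) -> bytes:
        response = client.audio.speech.create(
            model="tts-1", voice=voice, input=text, response_format="pcm"
        )
        return response.content

    return synthesize
```

Raw PCM is used here because headerless samples can be concatenated safely, whereas joining compressed files byte-for-byte is less reliable. A call might look like combine_voices([("onyx", "Chapter one."), ("nova", "Hello there.")], openai_synthesizer()).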

Use voice selection as your primary lever for brand consistency: alloy and nova project professional neutrality, echo and onyx work well for narrative/audiobook contexts, and fable and shimmer suit friendly, approachable applications. Test all six voices with your target demographic before shipping.
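To compare all six voices on your own copy, a small helper can render the same text once per voice. The sketch below is illustrative: render_voice_samples and the sample_<voice>.mp3 file names are hypothetical, and the synthesize argument is any function that turns (voice, text) into audio bytes, for example a thin wrapper around client.audio.speech.create().

```python
from pathlib import Path
from typing import Callable, Dict, Union

VOICES = ("alloy", "echo", "fable", "onyx", "nova", "shimmer")


def render_voice_samples(
    text: str,
    out_dir: Union[str, Path],
    synthesize: Callable[[str, str], bytes],
) -> Dict[str, Path]:
    """Write one audio file per voice for the same text and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: Dict[str, Path] = {}
    for voice in VOICES:
        path = out / f"sample_{voice}.mp3"
        path.write_bytes(synthesize(voice, text))  # one request per voice
        paths[voice] = path
    return paths


# Example wiring (requires the openai package and a valid API key):
#
#   from openai import OpenAI
#   client = OpenAI()
#   def synthesize(voice, text):
#       return client.audio.speech.create(
#           model="tts-1", voice=voice, input=text, response_format="mp3"
#       ).content
#   render_voice_samples("Your real product copy here.", "voice_samples", synthesize)
```

Feeding it real product copy rather than demo text keeps the comparison honest.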

Request code

python

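import os
from openai import OpenAI

# The SDK also falls back to the OPENAI_API_KEY environment variable if api_key is omitted.
client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))

# Request speech for one voice; the call returns a binary response wrapper.
response = client.audio.speech.create(
    model='tts-1',
    voice='nova',
    input='Welcome to our customer support portal. How can we assist you today?',
    response_format='mp3'
)

# Write the raw audio bytes in binary mode.
with open('output.mp3', 'wb') as f:
    f.write(response.content)

# The MIME type lives in the HTTP response headers, not in a content_type attribute.
content_type = response.response.headers.get('content-type')
print(f'Audio saved. Content-Type: {content_type}')
print(f'Audio size in bytes: {len(response.content)}')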

Authentication

Ensure your OPENAI_API_KEY environment variable is set before instantiating the OpenAI client. The SDK reads this key at initialization time, not at request time. Example: export OPENAI_API_KEY='sk-...' in your shell, then create the client with client = OpenAI().

Response shape

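Field	Description
`content`	bytes: raw audio data in your specified format (mp3, opus, aac, flac, wav, pcm)
`response.headers['content-type']`	string: MIME type reported by the underlying HTTP response, e.g. 'audio/mpeg' for mp3

Field guide

content

The binary audio payload you write directly to a file or stream to clients. Always open files in binary mode ('wb') when writing.

response.headers['content-type']

The object returned by the Python SDK has no content_type attribute of its own. The MIME type is a header on the underlying HTTP response, reachable through the returned object's response attribute. It confirms that your requested format was applied, which is useful when you proxy audio through your own HTTP response and need to set the Content-Type header correctly.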

Setup trap

The OpenAI SDK reads OPENAI_API_KEY at the moment you call OpenAI(), not when you make the API request. If the variable is unset at that point and you pass no api_key argument, the constructor raises an OpenAIError immediately; setting the variable afterwards does not update a client that already exists. Always set the key before creating the client instance.
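One way to surface a missing key early, with a message of your own, is to check the environment before constructing the client. The helper below is a hypothetical sketch (require_api_key is not part of the SDK); it only checks that the variable is present and non-empty, not that the key is valid.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OPENAI_API_KEY from the environment or raise a descriptive error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before creating the client."
        )
    return key


# Example wiring (requires the openai package):
#
#   from openai import OpenAI
#   client = OpenAI(api_key=require_api_key())
```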

Cost

TTS pricing is $0.015 per 1K characters for `tts-1` (faster) and $0.030 per 1K characters for `tts-1-hd` (higher quality). A typical email newsletter (500 chars) costs $0.0075 with `tts-1`; a short audiobook passage (~2,500 chars) costs $0.0375 with `tts-1` or $0.075 with `tts-1-hd`. Voice choice does not affect pricing: all six voices cost the same.
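The arithmetic above can be wrapped in a small estimator. This is a sketch using the per-1K-character prices quoted in this section, which are assumed to be current; estimate_cost and PRICE_PER_1K_CHARS are illustrative names, and Decimal is used to avoid floating-point rounding in currency values.

```python
from decimal import Decimal

# USD per 1,000 input characters, copied from the figures quoted in this section.
# Assumption: these prices are still current; check the pricing page before relying on them.
PRICE_PER_1K_CHARS = {
    "tts-1": Decimal("0.015"),
    "tts-1-hd": Decimal("0.030"),
}


def estimate_cost(num_chars: int, model: str = "tts-1") -> Decimal:
    """Estimate the USD cost of synthesizing num_chars characters with the given model."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    if model not in PRICE_PER_1K_CHARS:
        raise ValueError(f"unknown model: {model!r}")
    return PRICE_PER_1K_CHARS[model] * num_chars / 1000
```

For example, estimate_cost(500) works out to 0.0075 USD and estimate_cost(2500, "tts-1-hd") to 0.075 USD, matching the figures in the text.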

Rate limits

TTS requests are rate-limited per account. The figure used here is 500 requests per minute on standard accounts, but the exact limit depends on your usage tier, so check your account settings. If you're generating audio for high-volume use cases (e.g., automated customer support for 10K+ calls/day), request a higher quota from OpenAI support or implement a queue with exponential backoff.
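Exponential backoff with jitter can be sketched as a generic retry wrapper. The helper below is illustrative (with_backoff and its parameters are hypothetical names, not SDK features); it retries only the exception types you pass in and waits a random time up to a capped, doubling delay. The Python SDK also has its own max_retries client option, which may be enough for simple cases.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on the given exceptions with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise RuntimeError("unreachable")


# Example wiring (requires the openai package):
#
#   import openai
#   audio = with_backoff(
#       lambda: client.audio.speech.create(model="tts-1", voice="nova", input="Hi"),
#       retry_on=(openai.RateLimitError,),
#   )
```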

Common gotcha

Developers often assume all six voices sound equally natural at all speech rates and languages. In reality, alloy and nova handle rapid speech and technical jargon better, while fable and shimmer shine with slower, narrative-driven content. Test voice choice with actual input samples, not synthetic demo text.

Error recovery

BadRequestError: 'voice' field did not match enum values

You passed an invalid voice string (e.g., 'alto' or 'voice1'). In SDK 1.x this surfaces as BadRequestError (HTTP 400); InvalidRequestError was the name used by pre-1.0 SDKs. Use only: alloy, echo, fable, onyx, nova, shimmer. Check for typos and use lowercase.

AuthenticationError

The key sent with the request is invalid or revoked. A key that is missing or empty fails earlier, when OpenAI() is constructed. Verify the key is exported as an environment variable before running your script: echo $OPENAI_API_KEY should print a non-empty string.

RateLimitError

You've exceeded your requests-per-minute limit (500 requests/minute in the example above) or your account's usage quota. Implement exponential backoff with jitter, or batch requests during off-peak hours. Contact OpenAI support for a quota increase.

APIError with 'tts-1-hd not available'

The tts-1-hd model may be temporarily unavailable in your region. Fall back to tts-1 or retry after 60 seconds.
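The invalid-voice error in particular can be caught before any request is sent. The helper below is a hypothetical client-side check (normalize_voice is not part of the SDK) that trims and lowercases the input, then validates it against the six voice names listed in this section.

```python
VALID_VOICES = frozenset({"alloy", "echo", "fable", "onyx", "nova", "shimmer"})


def normalize_voice(voice: str) -> str:
    """Return a lowercase, trimmed voice name or raise ValueError if it is unknown."""
    candidate = voice.strip().lower()
    if candidate not in VALID_VOICES:
        options = ", ".join(sorted(VALID_VOICES))
        raise ValueError(f"unknown voice {voice!r}; expected one of: {options}")
    return candidate
```

Validating locally saves a round-trip and gives you control over the error message shown to users.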

Experienced dev note

Cache voice synthesis results aggressively. The API does not guarantee byte-identical output for identical input, but for most products any one rendering of a given text is good enough to reuse, so store the audio in Redis or S3 and skip redundant API calls. Key the cache on a hash of everything that affects the audio: at minimum voice + input_text, and ideally the model and response format too. How much this saves depends on how often your inputs repeat; the 60–80% cost reduction cited for a SaaS product with 100K users assumes heavily repeated text. A cache hit can return in under 100 ms, compared with roughly 3 s for an API round-trip. Also: tts-1 is the smaller, faster model suited to real-time chat; tts-1-hd is better suited to pre-recorded, polished content like marketing videos or audiobooks.
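A hash-keyed cache can be sketched in a few lines. This example is illustrative only: it uses the local filesystem as a stand-in for Redis or S3, cache_key and cached_speech are hypothetical names, and the key includes the model and response format in addition to voice and text so that different settings never collide.

```python
import hashlib
from pathlib import Path
from typing import Callable, Union


def cache_key(model: str, voice: str, text: str, response_format: str = "mp3") -> str:
    """Hash every input that affects the audio into a stable hex key."""
    payload = "\x1f".join((model, voice, response_format, text)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_speech(
    cache_dir: Union[str, Path],
    model: str,
    voice: str,
    text: str,
    synthesize: Callable[[str, str, str, str], bytes],
    response_format: str = "mp3",
) -> bytes:
    """Return cached audio if present; otherwise synthesize it, store it, and return it."""
    key = cache_key(model, voice, text, response_format)
    path = Path(cache_dir) / f"{key}.{response_format}"
    if path.exists():
        return path.read_bytes()  # cache hit: no API call
    audio = synthesize(model, voice, text, response_format)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

Here synthesize is any function taking (model, voice, text, response_format) and returning audio bytes, such as a thin wrapper around client.audio.speech.create().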

Check your understanding

You're building a chatbot that generates voice responses for both quick customer replies and longer FAQ explanations. Should you use the same voice for both response types, and if not, why?

Show answer hint

Consider latency expectations and content length. Short replies benefit from `tts-1` with a neutral voice like alloy; longer content benefits from `tts-1-hd` with a narrative-friendly voice like fable. Users perceive robotic synthesis more on longer outputs, making voice quality selection critical for FAQ responses but less noticeable on sub-10-second replies.

VERSION These examples target the OpenAI Python SDK 1.x, where speech synthesis is exposed through audio.speech.create() with the response_format parameter. Pin your SDK version and check the changelog before upgrading, since future behavior is not guaranteed.
