Voice options: alloy, echo, fable, onyx, nova, shimmer
Why this matters
Voice selection directly impacts user experience and perceived quality of your speech output: the wrong voice can make professional content sound robotic or unsuitable for your use case, while the right choice builds trust and engagement.
Explanation
The OpenAI Text-to-Speech API offers six pre-built voices (alloy, echo, fable, onyx, nova, and shimmer), each with distinct tonal characteristics. You specify a voice when calling the create() method, and the API returns audio bytes in your chosen format (MP3, Opus, AAC, FLAC, WAV, or PCM).
Under the hood, the voices are produced by neural speech-synthesis models trained on human speech. OpenAI does not document the internals, so treat each voice as a fixed preset tuned for natural prosody and clarity. The voice is set once per request: beyond the optional speed parameter, you cannot blend voices, adjust vocal properties, or switch voices partway through the input. If you need multiple voices in one audio file, make separate API calls and combine the audio streams downstream.
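One way to combine several voices into a single file is sketched below. It makes one request per segment in raw PCM and wraps the concatenated samples in a WAV container. The helper names (synthesize_pcm, combine_voices) are hypothetical, and the sample rate, width, and channel count are assumptions about what response_format='pcm' returns; verify them against the API reference before relying on this.

```python
import io
import wave
from typing import Callable, Iterable, Tuple

# Assumption: response_format="pcm" yields 24 kHz, 16-bit, mono samples with no header.
PCM_SAMPLE_RATE = 24_000
PCM_SAMPLE_WIDTH = 2  # bytes per sample (16-bit)
PCM_CHANNELS = 1


def synthesize_pcm(voice: str, text: str) -> bytes:
    """Hypothetical helper: one API call per segment, returning raw PCM bytes."""
    from openai import OpenAI  # imported lazily so the rest runs without the SDK

    client = OpenAI()
    response = client.audio.speech.create(
        model="tts-1", voice=voice, input=text, response_format="pcm"
    )
    return response.content


def combine_voices(
    segments: Iterable[Tuple[str, str]],
    synthesize: Callable[[str, str], bytes] = synthesize_pcm,
) -> bytes:
    """Synthesize each (voice, text) pair separately and return one WAV file."""
    # Raw PCM has no header, so segments can be joined byte-for-byte.
    pcm = b"".join(synthesize(voice, text) for voice, text in segments)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(PCM_CHANNELS)
        wav.setsampwidth(PCM_SAMPLE_WIDTH)
        wav.setframerate(PCM_SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

Usage: combine_voices([('nova', 'Welcome back.'), ('onyx', 'Chapter one.')]) returns WAV bytes you can write to disk. PCM is used here because compressed formats such as MP3 carry per-file headers and framing that make naive concatenation less reliable.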
Use voice selection as your primary lever for brand consistency: alloy and nova project professional neutrality, echo and onyx work well for narrative/audiobook contexts, and fable and shimmer suit friendly, approachable applications. Test all six voices with your target demographic before shipping.
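To compare all six voices on the same text, a small harness like the following may help. It is a sketch only: render_voice_samples and synthesize_mp3 are hypothetical names, and the synthesize callable is injected so you can swap in a stub while experimenting.

```python
from typing import Callable, Dict

# The six voices named in this section.
VOICES = ("alloy", "echo", "fable", "onyx", "nova", "shimmer")


def synthesize_mp3(voice: str, text: str) -> bytes:
    """Hypothetical helper: request one MP3 clip for the given voice."""
    from openai import OpenAI  # imported lazily so the rest runs without the SDK

    client = OpenAI()
    response = client.audio.speech.create(
        model="tts-1", voice=voice, input=text, response_format="mp3"
    )
    return response.content


def render_voice_samples(
    text: str,
    synthesize: Callable[[str, str], bytes] = synthesize_mp3,
) -> Dict[str, bytes]:
    """Return one audio clip per voice for the same input text."""
    return {voice: synthesize(voice, text) for voice in VOICES}
```

Usage: iterate over render_voice_samples(your_text).items() and write each clip to a file named after its voice (for example nova.mp3), then play them side by side with representative users.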
Request code
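import os
from openai import OpenAI

# Read the key explicitly; OpenAI() with no arguments also falls back to OPENAI_API_KEY.
client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))

# Request speech in the chosen voice and format.
response = client.audio.speech.create(
    model='tts-1',
    voice='nova',
    input='Welcome to our customer support portal. How can we assist you today?',
    response_format='mp3'
)

# The payload is binary, so the file must be opened in 'wb' mode.
with open('output.mp3', 'wb') as f:
    f.write(response.content)

# The SDK object has no content_type attribute; read the MIME type from the HTTP headers.
content_type = response.response.headers.get('content-type')
print(f'Audio saved. Content-Type: {content_type}')
print(f'Audio size in bytes: {len(response.content)}')

Authentication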
Ensure your OPENAI_API_KEY environment variable is set before instantiating the OpenAI client. The SDK reads this key at initialization time, not at request time. Example: export OPENAI_API_KEY='sk-...' in your shell, then create the client with client = OpenAI().
Response shape
| Field | Description |
|---|---|
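| content | bytes: raw audio data in your specified format (mp3, opus, aac, flac, wav, pcm) |
| Content-Type header | string: MIME type, e.g. 'audio/mpeg' for mp3; in the Python SDK, read it from response.response.headers rather than from an attribute |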
Field guide
content: The binary audio payload you write directly to a file or stream to clients. Always open files in binary mode ('wb') when writing.
Content-Type: Confirms your requested format was applied, which is useful for setting the Content-Type header correctly when proxying audio through an HTTP response. In the Python SDK this value lives in the HTTP response headers, not in a content_type attribute.
Setup trap
The OpenAI SDK reads OPENAI_API_KEY at the moment you call OpenAI(), not when you make the API request. If no key is available at that point, version 1.x of the Python SDK raises an OpenAIError from the constructor itself, and setting the environment variable afterwards does not change a client that already exists. Always set the key before creating the client instance.
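A fail-fast check along these lines can make the ordering problem obvious at startup. This is a minimal sketch; require_api_key and make_client are hypothetical helper names, not part of the SDK.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OPENAI_API_KEY from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before creating the client"
        )
    return key


def make_client():
    """Create the client only after the key has been confirmed present."""
    from openai import OpenAI  # imported lazily so the check runs without the SDK

    return OpenAI(api_key=require_api_key())
```

Calling make_client() once at application startup surfaces a missing key immediately, with a message that names the variable.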
Cost
TTS pricing is $0.015 per 1K characters for tts-1 (faster) and $0.030 per 1K characters for tts-1-hd (higher quality). A typical email newsletter (500 chars) costs $0.0075 with tts-1; a 5-minute audiobook chapter (~2,500 chars) costs $0.0375 with tts-1 or $0.075 with tts-1-hd. Voice choice does not affect pricing: all six voices cost the same.
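The arithmetic above can be wrapped in a small estimator. The prices are the per-1K-character figures quoted in this section and may change; estimate_tts_cost is a hypothetical helper name.

```python
# Prices per 1,000 input characters, as quoted in this section (subject to change).
PRICE_PER_1K_CHARS = {"tts-1": 0.015, "tts-1-hd": 0.030}


def estimate_tts_cost(char_count: int, model: str = "tts-1") -> float:
    """Estimate the cost in USD of synthesizing char_count characters."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    if model not in PRICE_PER_1K_CHARS:
        raise ValueError(f"unknown model: {model}")
    return char_count / 1000 * PRICE_PER_1K_CHARS[model]


if __name__ == "__main__":
    # The newsletter and audiobook examples from the text.
    print(f"${estimate_tts_cost(500):.4f}")
    print(f"${estimate_tts_cost(2500, 'tts-1-hd'):.4f}")
```

The voice is deliberately not a parameter, since it does not affect the price.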
Rate limits
TTS requests are rate-limited per account. A typical figure is 500 requests per minute on standard accounts, but the exact limit depends on your usage tier, so check the limits shown for your organization. If you're generating audio for high-volume use cases (e.g., automated customer support for 10K+ calls/day), request a higher quota from OpenAI support or implement a queue with exponential backoff.
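Exponential backoff can be sketched as a generic retry wrapper. The function name call_with_backoff and its parameters are hypothetical; the exception types to retry on are passed in, so with the Python SDK you would typically pass (openai.RateLimitError,).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on the given exceptions with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

In production you would usually add random jitter to each delay so that many workers do not retry in lockstep. The SDK client also accepts a max_retries option, which may be enough for light workloads.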
Common gotcha
Developers often assume all six voices sound equally natural at all speech rates and languages. In reality, alloy and nova handle rapid speech and technical jargon better, while fable and shimmer shine with slower, narrative-driven content. Test voice choice with actual input samples, not synthetic demo text.
Error recovery
InvalidRequestError: 'voice' field did not match enum values
AuthenticationError
RateLimitError
APIError with 'tts-1-hd not available'
Experienced dev note
Cache voice synthesis results aggressively. The same input text and voice yield equivalent audio, so once a clip has been generated you can serve it from Redis or S3 and skip redundant API calls. Do not rely on byte-for-byte identical output between calls; a cache does not need it. Key the cache on everything that changes the audio, for example hash(model + voice + response_format + input_text). For a SaaS product with 100K users, such a cache can reduce TTS costs by 60–80% when inputs repeat often, and can improve response latency from about 3s (API round-trip) to under 100ms (cache hit). Also: tts-1 is optimized for speed and suits real-time chat; tts-1-hd is better suited to pre-recorded, polished content like marketing videos or audiobooks.
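A hash-keyed cache might look like the sketch below. The names cache_key and cached_speech are hypothetical, and a plain dict stands in for Redis or S3; any mapping-like store with the same interface would work.

```python
import hashlib
from typing import Callable, MutableMapping


def cache_key(model: str, voice: str, response_format: str, text: str) -> str:
    """Hash every request field that changes the resulting audio."""
    # A separator that cannot appear in the short fields avoids ambiguous joins.
    payload = "\x1f".join((model, voice, response_format, text))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_speech(
    text: str,
    voice: str,
    synthesize: Callable[[str, str, str, str], bytes],
    cache: MutableMapping[str, bytes],
    model: str = "tts-1",
    response_format: str = "mp3",
) -> bytes:
    """Return cached audio if present; otherwise synthesize and store it."""
    key = cache_key(model, voice, response_format, text)
    if key not in cache:
        # synthesize is assumed to wrap client.audio.speech.create(...).content
        cache[key] = synthesize(model, voice, response_format, text)
    return cache[key]
```

If you also vary other request options such as speed, include them in the key as well, or two different clips will collide on one entry.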
Check your understanding
You're building a chatbot that generates voice responses for both quick customer replies and longer FAQ explanations. Should you use the same voice for both response types, and if not, why?
Answer hint
Consider latency expectations and content length. Short replies benefit from tts-1 with a neutral voice like alloy; longer content benefits from tts-1-hd with a narrative-friendly voice like fable. Users perceive robotic synthesis more on longer outputs, making voice quality selection critical for FAQ responses but less noticeable on sub-10-second replies.