Chinese TTS Voices

Chinese text-to-speech voices with accurate tonal precision

Providers: Telnyx, InWorld, MiniMax, Rime, Azure, AWS

Top 7 TTS for Chinese

Name	Provider
Tao - Lecturer	telnyx
Radio Host	minimax
Mei - Expressive Assistant	telnyx
Xiaoyin	inworld
Yunyi Multilingual	azure
Xinyi	inworld
Jing	inworld

Test Chinese voices

[ VOICE AI PLATFORM ]

From text to talk.
Pick your path.

Call our TTS & STT endpoints directly, wire voice into LiveKit rooms with one plug-in, or spin up an AI assistant on a real phone number.

TTS & STT Endpoints

Production-grade streaming and batch TTS/STT. Low latency, 50+ languages, customizable voices, and SDKs for Node/Python/Browser.

›Streaming for live apps
›Multi-speaker diarization & punctuation
›SDKs, code samples, and latency benchmarks

TTS — CURL
$ curl -X POST \
    ".../v1/tts" \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "voice": "alloy_female_v1",
      "language": "en-US",
      "format": "mp3",
      "text": "Hello, welcome..."
    }' --output speech.mp3

Sends text to the TTS endpoint and saves the synthesized audio as an MP3 file.
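The same request can be sketched in JavaScript. This is a minimal sketch rather than the documented SDK: the base URL is a placeholder because the page elides the host, the field names are copied from the curl example, and the `zh-CN` language tag and the sample text are assumptions about how a Chinese voice would be requested.

```javascript
// Build (but do not send) a TTS request. Field names mirror the curl example;
// baseUrl is a placeholder because the real host is not shown on this page.
function buildTtsRequest(baseUrl, apiKey, { voice, language, format = "mp3", text }) {
  if (!text || text.trim() === "") {
    throw new Error("text must be a non-empty string");
  }
  return {
    url: `${baseUrl}/v1/tts`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ voice, language, format, text }),
    },
  };
}

// Hypothetical values: the voice ID comes from the curl example, and
// "zh-CN" is an assumed language tag for Mandarin.
const ttsRequest = buildTtsRequest("https://api.example.invalid", "YOUR_API_KEY", {
  voice: "alloy_female_v1",
  language: "zh-CN",
  text: "你好，欢迎使用。",
});
console.log(ttsRequest.url);
```

Passing `ttsRequest.url` and `ttsRequest.options` to `fetch` would send the request. Response handling depends on the actual API and is left out.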

View TTS docs →

LiveKit Plug-in

Plug our real-time speech pipeline into LiveKit rooms — transcribe live sessions, synthesize responses and stream audio back into the room.

›One-line install, example room demo
›WebRTC + server bridge patterns
›Works in browser & mobile

LIVEKIT — NODE.JS
import { Room } from "livekit-client";
import { TelnyxSpeechPlugin } from "@telnyx/livekit-plugin";

// URL and token are placeholders for your LiveKit server URL and access token.
const room = new Room();
await room.connect(URL, token);

// Attach real-time TTS/STT to the connected room.
const plugin = new TelnyxSpeechPlugin({
  apiKey: process.env.TELNYX_API_KEY,
  voice: "alloy_female_v1",
});
plugin.attach(room);

Connects to a LiveKit room and attaches real-time TTS/STT — transcribes audio in, synthesizes audio out.

Try LiveKit demo →

AI-Assistants (Phone)

Deploy a phone-number based AI assistant in minutes — inbound/outbound calls, IVR, call recording, and DTMF support.

›Purchase & map a phone number
›Templates: Support Bot, Sales Assistant, Reminder Bot
›PSTN reliability & compliance tools

AI-ASSISTANT — CURL
$ curl -X POST \
    ".../v1/assistants" \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "name": "Support Bot",
      "phone_number": "+18005551234",
      "voice": "alloy_female_v1",
      "system_prompt": "You are a helpful support agent.",
      "capabilities": ["inbound", "recording", "dtmf"]
    }'

Creates an AI assistant bound to a phone number with inbound call handling, recording, and DTMF support.
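When the request body is assembled in code instead of typed by hand, serializing it with `JSON.stringify` keeps the payload valid even if the system prompt spans several lines, because a JSON string cannot contain a raw line break. The sketch below assumes the field names from the curl example; `buildAssistantBody` and the E.164 check are illustrative helpers, not part of any SDK.

```javascript
// Serialize an assistant definition. JSON.stringify escapes line breaks inside
// strings as \n, so a multi-line system prompt stays valid JSON.
function buildAssistantBody({ name, phoneNumber, voice, systemPrompt, capabilities }) {
  // E.164 shape: a plus sign, a non-zero digit, then 1 to 14 more digits.
  if (!/^\+[1-9]\d{1,14}$/.test(phoneNumber)) {
    throw new Error(`not an E.164 number: ${phoneNumber}`);
  }
  return JSON.stringify({
    name,
    phone_number: phoneNumber,
    voice,
    system_prompt: systemPrompt,
    capabilities,
  });
}

// Values taken from the curl example; the prompt is split over two lines
// here to show the escaping.
const assistantBody = buildAssistantBody({
  name: "Support Bot",
  phoneNumber: "+18005551234",
  voice: "alloy_female_v1",
  systemPrompt: "You are a\nhelpful support agent.",
  capabilities: ["inbound", "recording", "dtmf"],
});
console.log(assistantBody);
```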

Create your assistant →

Language	Native name	TTS voices
Spanish	Español	294
French	Français	98
German	Deutsch	82
Indonesian	Bahasa Indonesia	31
Italian	Italiano	51
Japanese	日本語	85
Korean	한국어	171
Portuguese	Português	277
Russian	Русский	34
Chinese	中文	189

Chinese phonology and prosody

Pitch that carries the dictionary

Mandarin is a tone language^[1]: nearly every syllable carries one of four fixed pitch contours (high, rising, dipping, or falling), and changing the contour changes the word. The syllable "ma" means mother (mā), hemp (má), horse (mǎ), or scold (mà), depending on its tone. English uses pitch to signal stress and focus^[2]; Mandarin pitch is built into the word itself^[3]. A TTS system that misshapes a single tone doesn't just sound unnatural; it says the wrong word. Producing accurate Mandarin requires inference that resolves tone at the syllable level, with no inter-provider routing degrading the pitch signal.

[1] “tone language.” frontiersin.org
[2] “pitch to signal stress and focus.” pmc.ncbi.nlm.nih.gov
[3] “built into the word itself.” taylorfrancis.com
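The "ma" example can be made concrete with a few lines of JavaScript. This is a minimal sketch that reads the tone off a Hanyu Pinyin syllable by looking for its diacritic; `toneOf`, `stripTone`, and the gloss table are illustrative names, and the sketch says nothing about how a synthesis engine represents tone internally.

```javascript
// Combining diacritics used by Hanyu Pinyin tone marks.
const TONE_BY_MARK = {
  "\u0304": 1, // macron: high level (mā)
  "\u0301": 2, // acute: rising (má)
  "\u030C": 3, // caron: dipping (mǎ)
  "\u0300": 4, // grave: falling (mà)
};

// Return the tone number of one pinyin syllable, or 0 when it carries no
// tone mark (the neutral tone).
function toneOf(syllable) {
  // NFD splits "ā" into "a" plus a combining macron, so the mark can be found.
  for (const ch of syllable.normalize("NFD")) {
    if (ch in TONE_BY_MARK) return TONE_BY_MARK[ch];
  }
  return 0;
}

// Remove the tone mark to get the bare syllable. Recomposing with NFC keeps
// "ü" (u plus diaeresis) as a single character.
function stripTone(syllable) {
  return syllable
    .normalize("NFD")
    .replace(/[\u0300\u0301\u0304\u030C]/g, "")
    .normalize("NFC");
}

// Glosses from the paragraph above: same syllable, four different words.
const MA_GLOSSES = { 1: "mother", 2: "hemp", 3: "horse", 4: "scold" };
for (const s of ["mā", "má", "mǎ", "mà"]) {
  console.log(`${stripTone(s)} + tone ${toneOf(s)} = ${MA_GLOSSES[toneOf(s)]}`);
}
```

All four inputs strip down to the same bare syllable, which is why the pitch contour alone has to carry the distinction.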

Evenly chopped, not bouncy

English is stress-timed^[1]: stressed syllables land at regular intervals while unstressed syllables compress between them, creating a strong-weak-weak bounce. Mandarin is syllable-timed^[2]: syllables stay closer to equal in duration with far less reduction, producing what sounds like a row of similarly sized beats^[3] carrying different pitch shapes. A voice engine trained on English stress-timing will squeeze and stretch syllables that should stay even. Getting this right requires models built for Mandarin rhythm, running co-located with the audio pipeline.

[1] “stress-timed.” files.eric.ed.gov
[2] “syllable-timed.” chinalinkesl.com
[3] “row of similarly sized beats.” reddit.com
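One way to put a number on "even" versus "bouncy" is the normalized pairwise variability index (nPVI), a rhythm metric that averages how much each duration differs from its neighbour. The sketch below applies it to syllable durations as a simplification; the two duration lists are made up for illustration and are not measurements of any language or voice.

```javascript
// Normalized pairwise variability index over a sequence of durations (ms).
// 0 means every neighbouring pair is equal; larger means more alternation.
function nPVI(durations) {
  if (durations.length < 2) throw new Error("need at least two durations");
  let sum = 0;
  for (let k = 0; k < durations.length - 1; k++) {
    const a = durations[k];
    const b = durations[k + 1];
    // Difference between neighbours, normalized by their mean.
    sum += Math.abs(a - b) / ((a + b) / 2);
  }
  return (100 * sum) / (durations.length - 1);
}

// Illustrative durations only: an even row of beats and a strong-weak-weak pattern.
const evenBeats = [200, 200, 200, 200, 200, 200];
const strongWeakWeak = [260, 110, 110, 260, 110, 110];

console.log(nPVI(evenBeats).toFixed(1));      // 0.0
console.log(nPVI(strongWeakWeak).toFixed(1)); // 48.6
```

A voice engine that imposes stress-timed compression on Mandarin would push its output toward the second pattern.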

Sentence melody on a tonal tightrope

English intonation is relatively free: pitch accents move around a sentence to mark focus or signal questions^[1] without changing word identity. Mandarin intonation must ride on top of lexical tones^[2], using post-focus pitch compression to convey emphasis while keeping each syllable's tone intact. The same discourse functions (focus, question, statement) are realized through different prosodic strategies^[3] than English uses. Imposing English-style rising question contours onto Mandarin warps lexical tones into wrong words. This two-layer pitch system demands synthesis infrastructure where tone and intonation are resolved together, not split across providers.

[1] “move around a sentence to mark focus or signal questions.” taylorfrancis.com
[2] “ride on top of lexical tones.” pmc.ncbi.nlm.nih.gov
[3] “realized through different prosodic strategies.” pmc.ncbi.nlm.nih.gov
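The two-layer idea can be illustrated with a toy model: each tone keeps its contour shape while syllables after the focused one get a narrower pitch range. The contours use the conventional five-level description of the four tones (55, 35, 214, 51); the compression factor, the midpoint, and the function names are assumptions for illustration, and this is not how any production synthesizer is claimed to work.

```javascript
// Five-level pitch targets per tone (1 = lowest, 5 = highest), following the
// conventional 55 / 35 / 214 / 51 description of the four Mandarin tones.
const TONE_CONTOURS = { 1: [5, 5], 2: [3, 5], 3: [2, 1, 4], 4: [5, 1] };

// Shrink a contour toward the middle of the pitch range. A factor between 0
// and 1 narrows the range but keeps every rise and fall pointing the same way.
function compress(contour, factor, mid = 3) {
  return contour.map((p) => mid + (p - mid) * factor);
}

// Toy model: syllables after the focused one get a compressed pitch range.
function applyFocus(tones, focusIndex, factor = 0.5) {
  return tones.map((tone, i) => {
    const contour = TONE_CONTOURS[tone];
    return i > focusIndex ? compress(contour, factor) : [...contour];
  });
}

// Tone sequence 4-2-4 with focus on the first syllable.
const focusedContours = applyFocus([4, 2, 4], 0);
console.log(JSON.stringify(focusedContours)); // [[5,1],[3,4],[4,2]]
```

The post-focus tone 2 still rises and the post-focus tone 4 still falls, only over a smaller range, so emphasis is signalled without turning either syllable into a different word.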

Chinese TTS Voices

Female Chinese TTS Voices

Male Chinese TTS Voices

Mainland China Chinese TTS Voices

Hong Kong Chinese TTS Voices

Taiwan Chinese TTS Voices
