Chinese TTS Voices
Chinese text-to-speech voices with accurate tones
Chinese phonology and prosody
Pitch that carries the dictionary
Mandarin is a tone language[1]: nearly every syllable carries one of four pitch contours (high, rising, dipping, or falling), and changing the contour changes the word. Depending on its tone, the syllable "ma" means mother (mā), hemp (má), horse (mǎ), or scold (mà). English uses pitch to signal stress and focus[2]; Mandarin pitch is built into the word itself[3]. A TTS system that misshapes a single tone doesn't just sound unnatural; it says the wrong word. Producing accurate Mandarin requires inference that resolves tone at the syllable level, with no inter-provider routing degrading the pitch signal.
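To make the "wrong tone, wrong word" point concrete, here is a minimal Python sketch. It is only an illustration, not how any particular TTS engine works. The pitch values follow the conventional five-level Chao notation for Mandarin citation tones (55, 35, 214, 51). The three-point sampling of each contour and the `nearest_tone` helper are assumptions made for this example.

```python
# Three-point pitch templates on a 1 (low) to 5 (high) scale, based on the
# conventional Chao values 55, 35, 214, 51. The midpoints for tones 1, 2 and 4
# are interpolated for this illustration.
TONE_TEMPLATES = {
    1: (5, 5, 5),  # high level
    2: (3, 4, 5),  # rising
    3: (2, 1, 4),  # dipping
    4: (5, 3, 1),  # falling
}

# The four "ma" words, keyed by tone number.
MA_WORDS = {
    1: ("mā", "mother"),
    2: ("má", "hemp"),
    3: ("mǎ", "horse"),
    4: ("mà", "scold"),
}


def nearest_tone(contour):
    """Return the tone whose template is closest (squared distance) to contour."""
    def distance(tone):
        return sum((a - b) ** 2 for a, b in zip(contour, TONE_TEMPLATES[tone]))
    return min(TONE_TEMPLATES, key=distance)


# A voice that aims for "mother" (high level) but lets the pitch sag
# is heard as a different word.
intended = nearest_tone((5, 5, 5))
produced = nearest_tone((5, 4, 2))
print(MA_WORDS[intended], "->", MA_WORDS[produced])
```

In this toy model, a high-level tone that drifts downward lands closest to the falling template, so the listener hears "scold" where "mother" was intended.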
Evenly chopped, not bouncy
English is stress-timed[1]: stressed syllables land at regular intervals while unstressed syllables compress between them, creating a strong-weak-weak bounce. Mandarin is syllable-timed[2]: syllables stay closer to equal in duration with far less reduction, producing what sounds like a row of similarly sized beats[3] carrying different pitch shapes. A voice engine trained on English stress-timing will squeeze and stretch syllables that should stay even. Getting this right requires models built for Mandarin rhythm, running co-located with the audio pipeline.
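One way to put a number on "even" versus "bouncy" rhythm is the normalized Pairwise Variability Index (nPVI), a metric from the speech-rhythm literature that the text above does not itself mention. The sketch below assumes per-syllable durations are already available; the millisecond values are invented purely for illustration and are not measurements.

```python
def npvi(durations):
    """Normalized Pairwise Variability Index over successive durations.

    Each adjacent pair contributes |a - b| divided by the pair's mean;
    the average over all pairs is scaled by 100. Lower means more even timing.
    """
    if len(durations) < 2:
        raise ValueError("need at least two durations")
    terms = [abs(a - b) / ((a + b) / 2) for a, b in zip(durations, durations[1:])]
    return 100 * sum(terms) / len(terms)


# Hypothetical syllable durations in milliseconds (made up for this example).
even_beats = [200, 200, 200, 200]    # similarly sized beats
bouncy_beats = [300, 100, 100, 300]  # strong-weak-weak-strong pattern

print(npvi(even_beats))    # perfectly even timing scores 0.0
print(npvi(bouncy_beats))  # alternating long and short syllables score higher
```

A check like this could flag synthesized Mandarin whose syllable durations vary more than expected, though any threshold would have to be calibrated against real recordings.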
Sentence melody on a tonal tightrope
English intonation is relatively free: pitch accents move around a sentence to mark focus or signal questions[1] without changing word identity. Mandarin intonation must ride on top of lexical tones[2], using post-focus pitch compression to convey emphasis while keeping each syllable's tone intact. The same discourse functions (focus, question, statement) are realized through different prosodic strategies[3] than in English. Imposing English-style rising question contours onto Mandarin warps lexical tones into wrong words. This two-layer pitch system demands synthesis infrastructure where tone and intonation are resolved together, not split across providers.
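The idea of intonation riding on top of lexical tone can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name `compress_post_focus`, the 0.5 compression factor, and the pitch values are all hypothetical, and the sketch only narrows pitch range after the focused syllable without modelling any accompanying lowering of overall pitch.

```python
def compress_post_focus(contours, focus_index, factor=0.5):
    """Narrow the pitch range of every syllable after the focused one.

    contours: list of per-syllable pitch lists (arbitrary relative units).
    Each post-focus contour is scaled toward its own mean, so its shape
    (rising, falling, level) is preserved while its excursion shrinks.
    """
    result = []
    for i, contour in enumerate(contours):
        if i <= focus_index:
            result.append(list(contour))  # focus and pre-focus: unchanged
        else:
            mean = sum(contour) / len(contour)
            result.append([mean + factor * (p - mean) for p in contour])
    return result


# Hypothetical three-syllable utterance: level, falling, rising.
utterance = [[5, 5, 5], [5, 3, 1], [3, 4, 5]]
shaped = compress_post_focus(utterance, focus_index=0)
print(shaped)
```

After compression, the falling tone still falls and the rising tone still rises, only over a narrower range. That is the property the paragraph describes: emphasis is signaled without overwriting the tones that identify each word.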