Harness the linguistic depth of the Chinese-speaking world with our China Speech Dataset collection. Spanning Mandarin, Cantonese, Wu, Min, Hakka, and dozens of regional topolects across mainland China, Taiwan, Hong Kong, and diaspora communities, these datasets are built for teams developing voice AI that reflects how over a billion people truly communicate.

Each recording is sourced from native speakers across diverse acoustic environments, from bustling megacities and rural villages to broadcast studios and spontaneous conversational settings. Meticulously annotated with tonal markers, character-level transcriptions, pinyin alignments, and speaker demographics, our Chinese collections are engineered for the phonetic precision and tonal complexity that Mandarin and its sister languages demand.

China Speech Datasets

12 December 2025

Yue Chinese Speech Dataset

speech_data_
