Cebuano Speech Dataset

The Cebuano Speech Dataset is a professionally compiled collection of high-fidelity audio recordings featuring native Cebuano speakers from Cebu, Mindanao, Bohol, and Leyte. This comprehensive dataset includes 108 hours of authentic Cebuano speech data, meticulously transcribed and structured for cutting-edge machine learning applications.

Cebuano is spoken by over 20 million people and is the most widely spoken regional language in the Philippines. The dataset captures the distinctive phonological features that speech recognition models need to handle, and it spans age groups and genders to cover Cebuano phonological variation across the Visayas and Mindanao.

Dataset General Info

Parameter	Details
Size	108 hours
Format	MP3/WAV
Tasks	Speech recognition, AI training, voice assistant development, natural language processing, acoustic modeling, speaker identification
File size	135 MB
Number of files	807 files
Gender of speakers	Female: 49%, Male: 51%
Age of speakers	18-30 years: 33%, 31-40 years: 28%, 40-50 years: 18%, 50+ years: 21%
Countries	Philippines (Cebu, Mindanao, Bohol, Leyte)
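
The file count and total duration listed above can be checked against a downloaded copy. The following is a minimal Python sketch, not part of the dataset's own tooling. It assumes the package has been extracted to a local folder (the folder name in the usage comment is hypothetical) and that the WAV files are uncompressed PCM, which is what the standard-library `wave` module can read. MP3 files are only counted, since measuring their duration needs a third-party decoder.

```python
import wave
from pathlib import Path


def summarize_audio(root):
    """Count audio files under `root` and total the duration of the WAV ones."""
    root = Path(root)
    # Match extensions case-insensitively so .WAV and .wav are both found.
    wav_paths = sorted(p for p in root.rglob("*") if p.suffix.lower() == ".wav")
    mp3_paths = sorted(p for p in root.rglob("*") if p.suffix.lower() == ".mp3")

    total_seconds = 0.0
    for path in wav_paths:
        with wave.open(str(path), "rb") as wav:
            # Duration in seconds = number of frames / frames per second.
            total_seconds += wav.getnframes() / wav.getframerate()

    return {
        "wav_files": len(wav_paths),
        "mp3_files": len(mp3_paths),
        "wav_hours": total_seconds / 3600,
    }


# Usage (hypothetical folder name):
# summary = summarize_audio("cebuano_speech_dataset")
# print(summary)
```

Comparing the returned counts and hours with the table is a quick way to confirm that the download and extraction completed.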

Use Cases

Regional Services and Commerce

Organizations in the Visayas and Mindanao can use the Cebuano Speech Dataset to develop regional business platforms and local government services. Voice interfaces in Cebuano make these services accessible to speakers of the most widely spoken regional language in the Philippines.

Cultural Preservation

Cultural organizations can use this dataset to create digital archives of Cebuano literature and traditions. Voice technology can help preserve Cebuano cultural heritage and maintain linguistic identity for over 20 million speakers.

Education Technology

Educational institutions can use this dataset to build Cebuano language learning applications and educational content delivery systems, supporting mother-tongue education in the Visayas and Mindanao.

FAQ

Q: What is included in the Cebuano Speech Dataset?

A: The dataset includes 108 hours of audio recordings from native Cebuano speakers. It contains 807 files in MP3/WAV format, totaling approximately 135 MB, with transcriptions, speaker demographics, and linguistic annotations.

Q: Why is Cebuano speech technology important?

A: Cebuano is spoken by a large linguistic community of over 20 million people. Speech technology enables voice interfaces that serve this population, supports linguistic rights and cultural preservation, and makes technology accessible in speakers' native language.

Q: How diverse is the speaker demographic?

A: The dataset features 49% female and 51% male speakers, with the following age distribution: 33% (18-30), 28% (31-40), 18% (40-50), 21% (50+).

How to Use the Speech Dataset

Step 1: Dataset Acquisition – Download the dataset package from the provided link.

Step 2: Extract and Organize – Extract to your storage and review the structured folder organization.

Step 3: Environment Setup – Install required ML framework dependencies and audio processing libraries.

Step 4: Data Preprocessing – Load audio files and apply preprocessing steps like resampling and feature extraction.

Step 5: Model Training – Split into training/validation/test sets and train your model.

Step 6: Evaluation and Fine-tuning – Evaluate performance and iterate on architecture.

Step 7: Deployment – Export and integrate your trained model into production.
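
The split in Step 5 can be sketched as follows. This is a minimal Python example under stated assumptions, not the dataset's prescribed workflow: it assumes each recording can be paired with a speaker ID taken from the speaker metadata, and the `(speaker_id, audio_path)` row layout, the function names, and the 10% validation and test fractions are all hypothetical choices for illustration. Splitting by speaker rather than by file keeps every speaker in exactly one set, so evaluation measures performance on unseen voices.

```python
import hashlib


def assign_split(speaker_id, val_fraction=0.1, test_fraction=0.1):
    """Map a speaker ID to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a roughly uniform value in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"


def split_manifest(rows, **fractions):
    """Group (speaker_id, audio_path) rows into splits with disjoint speakers."""
    splits = {"train": [], "val": [], "test": []}
    for speaker_id, audio_path in rows:
        split_name = assign_split(speaker_id, **fractions)
        splits[split_name].append((speaker_id, audio_path))
    return splits


# Usage with made-up rows (speaker IDs and file names are placeholders):
# rows = [("spk001", "clip_0001.wav"), ("spk002", "clip_0002.wav")]
# splits = split_manifest(rows, val_fraction=0.1, test_fraction=0.1)
```

Because the assignment depends only on a hash of the speaker ID, rerunning the split gives the same result without storing a random seed. With a small number of speakers the realized proportions can differ noticeably from the requested fractions.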