The Galician Speech Dataset is a professionally curated collection of audio recordings from native Galician speakers in Galicia and northern Portugal. It contains 158 hours of Galician speech, annotated for machine learning applications.

Galician is a Western Ibero-Romance language spoken by over 2.4 million people and has official status in Galicia. The recordings capture its distinctive phonological features, which bridge Spanish and Portuguese.

Dataset General Info

Parameter	Details
Size	158 hours
Format	MP3/WAV
Tasks	Speech recognition, AI training, voice assistant development, natural language processing, acoustic modeling, speaker identification
File size	276 MB
Number of files	718 files
Gender of speakers	Female: 48%, Male: 52%
Age of speakers	18-30 years: 35%, 31-40 years: 20%, 40-50 years: 18%, 50+ years: 27%
Countries	Spain (Galicia), Portugal (northern regions)

Use Cases

Regional Services and Linguistic Rights: The Galician government (Xunta de Galicia) can use the Galician Speech Dataset to develop voice-enabled regional services, administration platforms, and citizen communication systems in the co-official language. Voice technology helps implement Galician linguistic rights in the autonomous community, supports regional identity, enables digital services that respect linguistic preferences, and facilitates governance in Galician. Applications include Xunta digital services, municipal platforms, healthcare systems, education administration, and regional information services for Galicia’s predominantly Galician-speaking population.

Cultural Heritage and Celtic Identity: Cultural organizations can use this dataset to build platforms that preserve Galician linguistic heritage, document Celtic cultural traditions, and maintain regional identity. Voice technology helps preserve Galician, a language related to but distinct from Portuguese, supports cultural transmission including gaita music and other traditions, enables documentation of oral literature, and helps maintain Galician identity. Applications include cultural heritage platforms, traditional music archives, Celtic festival coordination, oral history projects, and resources promoting the vitality of Galician in the face of Spanish linguistic dominance.

Cross-Border Cooperation with Portugal: Organizations working across the Galicia-Portugal border can use this dataset to develop cross-border communication tools, regional cooperation platforms, and Galician-Portuguese linguistic bridge services. Voice technology facilitates communication between the closely related Galician and Portuguese languages, supports Euroregion cooperation, enables cultural and economic exchange, and strengthens Iberian linguistic connections. Applications include cross-border trade platforms, tourism cooperation systems, cultural exchange programs, and services that build on Galician-Portuguese linguistic proximity for regional integration.

FAQ

Q: What is included in this dataset?

A: The dataset includes 158 hours of audio recordings in 718 files totaling 276 MB, with transcriptions and linguistic annotations.

Q: How diverse is the speaker demographic?

A: The speakers are 48% female and 52% male, across four age groups: 18-30 (35%), 31-40 (20%), 40-50 (18%), and 50+ (27%).
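
The file count and total size quoted in the FAQ can be compared against an extracted copy of the package. The following is a minimal sketch, not part of the dataset's own tooling: the function names are hypothetical, and it assumes the audio files carry `.wav` or `.mp3` extensions somewhere under a single root folder. Whether the quoted 276 MB covers audio only or also transcriptions is not stated, so treat the size comparison as approximate.

```python
from pathlib import Path

# Figures quoted on this page; everything else below is an assumption.
EXPECTED_FILES = 718
EXPECTED_MB = 276

AUDIO_SUFFIXES = {".wav", ".mp3"}  # formats listed for the dataset


def inventory(root):
    """Count audio files under root and sum their sizes in megabytes."""
    files = [p for p in Path(root).rglob("*")
             if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES]
    total_mb = sum(p.stat().st_size for p in files) / 1_000_000
    return len(files), total_mb


def report(root):
    """Return a one-line comparison of found versus quoted figures."""
    count, total_mb = inventory(root)
    return (f"{count} audio files (page lists {EXPECTED_FILES}), "
            f"{total_mb:.1f} MB (page lists {EXPECTED_MB} MB)")
```

Calling `print(report("path/to/extracted/dataset"))` with your own extraction path prints the comparison; the path shown here is a placeholder.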

How to Use the Speech Dataset

Step 1: Dataset Acquisition – Download the dataset package from the provided link upon purchase.

Step 2: Extract and Organize – Extract the package to local storage and review the folder structure.

Step 3: Environment Setup – Install ML framework dependencies and audio processing libraries.

Step 4: Data Preprocessing – Load audio files and apply preprocessing steps like resampling and feature extraction.

Step 5: Model Training – Split the data into training, validation, and test sets and train your model.

Step 6: Evaluation and Fine-tuning – Evaluate performance and iterate on architecture.

Step 7: Deployment – Export and integrate your trained model into production systems.
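
The preprocessing in Step 4 can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the dataset's documented pipeline: it assumes 16-bit mono PCM WAV input, uses a 16 kHz target rate (a common choice for speech recognition, not something this page specifies), resamples by plain linear interpolation without an anti-aliasing filter, and uses per-frame log energy as a stand-in for richer features. All function names are hypothetical. In practice a dedicated audio library would normally replace each of these functions.

```python
import array
import math
import sys
import wave


def load_wav_mono16(path):
    """Read a 16-bit mono PCM WAV file; return (samples in [-1, 1), rate)."""
    with wave.open(str(path), "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        rate = wf.getframerate()
        pcm = array.array("h")
        pcm.frombytes(wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    return [s / 32768.0 for s in pcm], rate


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (no anti-aliasing filter)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the source signal
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1.0 - frac) + nxt * frac)
    return out


def frame_log_energy(samples, rate, frame_ms=25, hop_ms=10):
    """Log energy per frame, a very simple acoustic feature."""
    frame = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    feats = []
    for start in range(0, len(samples) - frame + 1, hop):
        energy = sum(x * x for x in samples[start:start + frame])
        feats.append(math.log(energy + 1e-10))  # offset avoids log(0)
    return feats
```

A typical call chain would be `samples, rate = load_wav_mono16(path)`, then `resample_linear(samples, rate, 16000)`, then `frame_log_energy(...)` on the result. MP3 files would need decoding to PCM first, which the standard library does not provide.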

For comprehensive documentation, refer to the guides included in the package.
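
For the split in Step 5, speech corpora are often divided by speaker rather than by file, so that no voice appears in both training and test data. The sketch below shows one way to do that deterministically. It assumes each recording can be paired with a speaker ID, which this page does not confirm; the function names, the ID format, and the 80/10/10 proportions are all illustrative.

```python
import hashlib


def assign_split(speaker_id, val_pct=10, test_pct=10):
    """Map a speaker ID to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def split_records(records):
    """Group (file_name, speaker_id) pairs into splits by speaker."""
    splits = {"train": [], "val": [], "test": []}
    for file_name, speaker_id in records:
        splits[assign_split(speaker_id)].append(file_name)
    return splits
```

Because the assignment depends only on a hash of the speaker ID, rerunning the split gives the same result without storing a random seed. With few speakers the realized proportions can drift from the nominal percentages, so it is worth checking the split sizes before training.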
