The Farsi Speech Dataset is a comprehensive collection of high-quality audio recordings featuring native Farsi (Persian) speakers from Iran, Afghanistan, Tajikistan, and diaspora communities worldwide. This professionally curated dataset contains 109 hours of authentic Farsi speech, meticulously annotated and structured for machine learning applications. Farsi, one of the world's major languages with over 110 million speakers and a literary tradition spanning millennia, is captured here with the distinctive phonological and linguistic characteristics essential for developing accurate speech recognition systems.
With balanced representation across gender and age groups, the dataset provides researchers and developers with a robust foundation for building Farsi language models, voice assistants, and conversational AI systems serving Persian-speaking populations across three countries and the global diaspora. The audio files are delivered in MP3/WAV format with consistent quality standards, making them immediately ready for integration into ML pipelines.
Dataset General Info
| Parameter | Details |
| --- | --- |
| Size | 109 hours |
| Format | MP3/WAV |
| Tasks | Speech recognition, AI training, voice assistant development, natural language processing, acoustic modeling, speaker identification |
| File size | 387 MB |
| Number of files | 707 files |
| Gender of speakers | Female: 45%, Male: 55% |
| Age of speakers | 18-30 years: 28%, 31-40 years: 23%, 41-50 years: 20%, 50+ years: 29% |
| Countries | Iran, Afghanistan, Tajikistan, Persian diaspora |
Use Cases
Digital Services and E-Commerce: Iranian technology companies and international businesses serving Persian-speaking markets can use the Farsi Speech Dataset to develop voice-enabled shopping platforms, customer service automation, and digital payment systems. Voice interfaces make e-commerce and digital services accessible across Iran, Afghanistan, and Tajikistan, help diaspora communities access Persian-language digital services, and extend user experience and market reach for businesses targeting over 110 million Persian speakers globally.
Educational Technology and Cultural Preservation: Educational institutions and cultural organizations can leverage this dataset to create interactive learning applications, digital libraries of Persian literature, and voice-enabled access to classical texts. Speech technology supports Persian language education, preserves Iran's rich literary heritage including the works of Rumi, Hafez, and Ferdowsi, and enables modern educational delivery while maintaining a connection to centuries of Persian cultural and intellectual traditions across multiple countries.
Media and Broadcasting Services: Broadcasting companies and content creators across Persian-speaking regions can employ this dataset to develop automatic transcription for Persian television and radio, voice-enabled content discovery platforms, and subtitle generation tools. These applications support the Persian media industry in serving audiences across Iran, Afghanistan, Tajikistan, and the global diaspora, while making cultural content more accessible and preserving Persian's linguistic presence in the digital media landscape.
FAQ
Q: What is included in the Farsi Speech Dataset?
A: The Farsi Speech Dataset includes 109 hours of audio recordings from native Farsi speakers across Iran, Afghanistan, Tajikistan, and diaspora communities. The dataset contains 707 files in MP3/WAV format, totaling approximately 387 MB. Each recording is professionally annotated with transcriptions in Persian script, speaker metadata including age, gender, and geographic origin, along with quality markers to ensure optimal performance for machine learning applications serving Persian-speaking populations worldwide.
Q: How does the dataset handle Farsi’s rich literary heritage?
A: Farsi has one of the world’s longest continuous literary traditions spanning over a millennium. While the dataset focuses on modern spoken Farsi, it captures phonological features and linguistic characteristics that connect contemporary speech to classical Persian. This supports development of applications that can handle both modern and classical Persian texts, important for educational and cultural preservation applications.
Q: What regional variations are captured in the dataset?
A: The dataset captures Farsi speakers from Iran (Tehran and regional varieties), Dari speakers from Afghanistan, and Tajik-influenced speech where applicable. With 707 recordings from diverse geographic areas, it helps models understand Persian speakers regardless of country or regional accent, which is important for applications serving the entire Persian-speaking world.
Q: Why is Farsi important for Middle Eastern technology?
A: Farsi is spoken by over 110 million people across Iran (80+ million), Afghanistan (as Dari), Tajikistan (as Tajik), and substantial diaspora communities. It is a major language of the Middle East and Central Asia with significant cultural influence. Speech technology in Farsi enables voice interfaces for large markets and supports digital services for Persian-speaking populations globally.
Q: Can this dataset support diaspora communities?
A: Yes, the dataset includes consideration of Persian diaspora speech patterns and supports development of applications serving global Persian-speaking communities. This enables heritage language learning tools, cultural content platforms, and communication services that maintain linguistic connections for millions of Persian speakers outside traditional Persian-speaking regions.
Q: How diverse is the speaker demographic?
A: The dataset features 45% female and 55% male speakers with an age distribution of 28% aged 18-30, 23% aged 31-40, 20% aged 41-50, and 29% aged 50+. Geographic diversity spans multiple countries, ensuring comprehensive representation of the global Persian-speaking community.
Q: What applications are common for Farsi speech technology?
A: Applications include voice assistants for Persian-speaking homes, e-commerce platforms for the Iranian market, customer service automation, educational technology for Persian language learning, digital libraries for Persian literature, media transcription services, and voice-enabled access to cultural and governmental services across Persian-speaking regions.
Q: What technical support is provided?
A: Comprehensive documentation includes guides for Persian script handling, right-to-left text processing, integration with ML frameworks, preprocessing pipelines optimized for Farsi phonology, code examples, and best practices. Technical support covers implementation questions, regional variation handling, and optimization strategies for Persian speech recognition systems.
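Persian script handling usually starts with orthographic normalization, since Persian text corpora commonly mix visually similar Arabic and Persian code points. A minimal stdlib-only sketch (the specific character mappings are common practice, not taken from the dataset's documentation):

```python
# Map Arabic-script code points to their Persian counterparts; these
# particular mappings are illustrative assumptions, not the dataset's own
# normalization table.
PERSIAN_CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH        -> FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF        -> KEHEH
    "\u0640": "",        # ARABIC TATWEEL (kashida) -> removed
})

def normalize_farsi(text: str) -> str:
    """Normalize a Persian transcription to a consistent character set."""
    return text.translate(PERSIAN_CHAR_MAP).strip()
```

Running transcriptions through a normalizer like this before training keeps the model's output vocabulary consistent regardless of which keyboard layout produced the text.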
How to Use the Speech Dataset
Step 1: Dataset Acquisition
Download the dataset package from the provided link. Upon purchase, you will receive access credentials and download instructions via email. The dataset is delivered as a compressed archive file containing all audio files, transcriptions, and metadata.
Step 2: Extract and Organize
Extract the downloaded archive to your local storage or cloud environment. The dataset follows a structured folder organization with separate directories for audio files, transcriptions, metadata, and documentation. Review the README file for detailed information about file structure and naming conventions.
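After extraction, a common first task is pairing each audio file with its transcription by filename stem. A small sketch, assuming `audio/` and `transcripts/` subdirectory names (the actual layout is documented in the dataset's README):

```python
from pathlib import Path

def pair_audio_with_transcripts(root):
    """Pair audio files with transcription files that share a filename stem.

    Assumes 'audio/' and 'transcripts/' subdirectories with .txt transcripts;
    check the dataset's README for the real directory names.
    """
    root = Path(root)
    transcripts = {p.stem: p for p in (root / "transcripts").glob("*.txt")}
    pairs = []
    for audio in sorted((root / "audio").iterdir()):
        if audio.suffix.lower() in {".mp3", ".wav"} and audio.stem in transcripts:
            pairs.append((audio, transcripts[audio.stem]))
    return pairs
```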
Step 3: Environment Setup
Install required dependencies for your chosen ML framework such as TensorFlow, PyTorch, Kaldi, or others. Ensure you have necessary audio processing libraries installed including librosa, soundfile, pydub, and scipy. Set up your Python environment with the provided requirements.txt file for seamless integration.
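A requirements file covering the audio libraries named above might look like the following (package names only; pin versions per your environment):

```
librosa
soundfile
pydub
scipy
```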
Step 4: Data Preprocessing
Load the audio files using the provided sample scripts. Apply necessary preprocessing steps such as resampling, normalization, and feature extraction including MFCCs, spectrograms, or mel-frequency features. Use the included metadata to filter and organize data based on speaker demographics, recording quality, or other criteria relevant to your application.
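The normalization and feature-extraction steps can be sketched with numpy alone; in practice librosa handles loading, resampling (e.g. to 16 kHz), and MFCC extraction, but this shows the idea of peak normalization followed by a windowed magnitude spectrogram:

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160):
    """Peak-normalize a waveform and compute a magnitude spectrogram.

    A numpy-only sketch: frame_len=400 / hop=160 correspond to 25 ms / 10 ms
    windows at 16 kHz, a typical (assumed) configuration.
    """
    signal = np.asarray(signal, dtype=np.float64)
    peak = np.max(np.abs(signal))
    if peak > 0:
        signal = signal / peak  # peak normalization to [-1, 1]
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum per frame: shape (n_frames, frame_len // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))
```

Mel filtering and the DCT step that turn this into MFCCs are what `librosa.feature.mfcc` adds on top.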
Step 5: Model Training
Split the dataset into training, validation, and test sets using the provided speaker-independent split recommendations to avoid data leakage. Configure your model architecture for the specific task whether speech recognition, speaker identification, or other applications. Train your model using the transcriptions and audio pairs, monitoring performance on the validation set.
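A speaker-independent split means partitioning speakers, not files, so no voice appears in both training and test data. An illustrative sketch (the dataset ships its own recommended split; the 80/10/10 ratios here are assumptions):

```python
import random

def speaker_independent_split(items, train=0.8, val=0.1, seed=0):
    """Split (file, speaker_id) pairs so that no speaker crosses subsets."""
    speakers = sorted({spk for _, spk in items})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_train = int(len(speakers) * train)
    n_val = int(len(speakers) * val)
    groups = {
        "train": set(speakers[:n_train]),
        "val": set(speakers[n_train:n_train + n_val]),
        "test": set(speakers[n_train + n_val:]),
    }
    # Assign every file to the subset that owns its speaker.
    return {name: [f for f, spk in items if spk in spks]
            for name, spks in groups.items()}
```

Splitting by file instead of by speaker would let the model memorize voices and inflate validation scores, which is the data leakage the step above warns about.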
Step 6: Evaluation and Fine-tuning
Evaluate model performance on the test set using standard metrics such as Word Error Rate for speech recognition or accuracy for classification tasks. Analyze errors and iterate on model architecture, hyperparameters, or preprocessing steps. Use the diverse speaker demographics to assess model fairness and performance across different groups.
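Word Error Rate is the Levenshtein (edit) distance between the reference and hypothesis word sequences, divided by the reference length. A minimal self-contained implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / max(1, len(ref))
```

Computing WER per demographic group (using the speaker metadata) is a straightforward way to run the fairness assessment mentioned above.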
Step 7: Deployment
Once satisfactory performance is achieved, export your trained model for deployment. Integrate the model into your application or service infrastructure. Continue monitoring real-world performance and use the dataset for ongoing model updates and improvements as needed.
For detailed code examples, integration guides, and troubleshooting tips, refer to the comprehensive documentation included with the dataset.