The Luri Speech Dataset offers an extensive collection of authentic audio recordings from native Luri speakers across western Iran. This specialized dataset comprises 111 hours of carefully curated Luri speech, professionally recorded and annotated for advanced machine learning applications. Luri, a Southwestern Iranian language spoken by several million people in Lorestan, Khuzestan, and surrounding provinces, is captured with the distinctive phonetic and linguistic features essential for developing robust speech recognition systems.
The dataset features diverse speakers across multiple age groups and balanced gender representation, providing comprehensive coverage of Luri phonetics and dialectal variations from Iran’s western regions. Formatted in MP3/WAV with high-quality audio standards, this dataset is optimized for AI training, natural language processing, voice technology development, and computational linguistics research focused on underrepresented Iranian languages and regional linguistic diversity.
Dataset General Info
| Parameter | Details |
| --- | --- |
| Size | 111 hours |
| Format | MP3/WAV |
| Tasks | Speech recognition, AI training, voice assistant development, natural language processing, acoustic modeling, speaker identification |
| File size | 100 MB |
| Number of files | 806 files |
| Gender of speakers | Female: 52%, Male: 48% |
| Age of speakers | 18-30 years: 28%, 31-40 years: 23%, 41-50 years: 22%, 50+ years: 27% |
| Countries | Iran (western provinces) |
Use Cases
Regional Cultural Preservation: Cultural organizations and linguistic institutions can utilize the Luri Speech Dataset to develop digital archives of Luri oral traditions, folk music, and traditional knowledge systems. Voice-enabled access to cultural resources preserves Luri linguistic heritage, supports documentation of endangered cultural practices, and maintains linguistic identity for western Iranian communities where Luri faces pressure from dominant Persian language and modernization.
Local Governance and Community Services: Regional government agencies in Lorestan, Khuzestan, and surrounding provinces can leverage this dataset to create voice-enabled local government services, community information systems, and public service platforms in Luri. Voice interfaces respect linguistic preferences of local populations, support multilingual governance alongside Persian, and ensure Luri speakers can access government services in their mother tongue, promoting linguistic inclusion in regional administration.
Agricultural Extension Services: Agricultural departments and rural development organizations can employ this dataset to create voice-based farming advisory systems, livestock management guidance, and market information platforms for Luri-speaking rural communities. Voice interfaces in Luri deliver agricultural information to farming populations in western Iran, support traditional pastoral practices, and improve access to modern agricultural techniques while respecting local linguistic and cultural contexts.
FAQ
Q: What does the Luri Speech Dataset contain?
A: The Luri Speech Dataset contains 111 hours of high-quality audio recordings from native Luri speakers across western Iran. The dataset includes 806 files in MP3/WAV format totaling approximately 100 MB, with transcriptions, speaker demographics, regional dialect information from Lorestan and surrounding areas, and linguistic annotations optimized for machine learning applications.
Q: Why is Luri speech technology important?
A: Luri is spoken by several million people in western Iran yet remains underrepresented in language technology. This dataset enables voice interfaces that serve Luri-speaking communities, supports linguistic rights and cultural preservation, and makes technology accessible in the mother tongue of populations in Lorestan, Khuzestan, and surrounding provinces.
Q: How does the dataset address Luri’s linguistic diversity?
A: Luri comprises several varieties; modern classifications commonly distinguish Northern Luri, Bakhtiari, and Southern Luri, groupings that roughly parallel the historical division into Greater and Lesser Lur. The dataset captures speakers representing major Luri varieties from different regions of western Iran, ensuring comprehensive coverage. Annotations indicate variety information where applicable, supporting development of applications that serve diverse Luri-speaking populations across western Iranian provinces.
Q: What makes Luri linguistically distinctive?
A: Luri is a Southwestern Iranian language with distinctive phonological and grammatical features that set it apart from Persian, its close relative. The dataset includes linguistic annotations marking Luri-specific characteristics, ensuring trained models recognize Luri as a distinct language rather than a Persian dialect. This respects Luri linguistic identity and cultural significance in western Iran.
Q: Can this dataset support cultural preservation?
A: Yes, Luri has rich oral traditions and cultural practices. The dataset supports development of applications that preserve Luri cultural heritage including folk music, traditional knowledge, and oral literature. Voice technology helps document and maintain these traditions for future generations as Luri faces pressure from dominant Persian language.
Q: What is the demographic distribution?
A: The dataset includes 52% female and 48% male speakers, with an age distribution of 28% aged 18-30, 23% aged 31-40, 22% aged 41-50, and 27% aged 50+. This representation ensures models serve diverse Luri-speaking populations across different age groups.
Q: What applications benefit from Luri speech technology?
A: Applications include cultural heritage documentation and digital archives, local government services in western provinces, agricultural advisory systems for rural communities, regional media transcription, educational tools supporting mother-tongue education, and community information systems serving Luri speakers in Lorestan and surrounding areas.
Q: How does this support regional linguistic diversity?
A: Iran has significant linguistic diversity beyond Persian. This dataset contributes to recognizing and supporting that diversity by enabling technology for Luri speakers. It promotes linguistic inclusion, respects cultural identity, and ensures technological development benefits all linguistic communities in Iran, not only Persian speakers.
How to Use the Speech Dataset
Step 1: Dataset Acquisition
Download the dataset package from the provided link. Upon purchase, you will receive access credentials and download instructions via email. The dataset is delivered as a compressed archive file containing all audio files, transcriptions, and metadata.
Step 2: Extract and Organize
Extract the downloaded archive to your local storage or cloud environment. The dataset follows a structured folder organization with separate directories for audio files, transcriptions, metadata, and documentation. Review the README file for detailed information about file structure and naming conventions.
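As a starting point, a minimal sketch like the following can inventory the extracted files and pair each recording with its transcription. The directory and file names here (`audio/`, `transcriptions/`, matching `.txt` files) are assumptions for illustration; the included README documents the authoritative layout.

```python
from pathlib import Path

# Hypothetical layout -- consult the dataset's README for the actual
# directory names and file naming conventions.
ROOT = Path("luri_speech_dataset")

def pair_files(audio_dir: Path, transcript_dir: Path):
    """Match each audio file to a same-named .txt transcription."""
    pairs = []
    for audio_path in sorted(audio_dir.glob("*.wav")):
        transcript_path = transcript_dir / (audio_path.stem + ".txt")
        if transcript_path.exists():
            pairs.append((audio_path, transcript_path))
    return pairs

pairs = pair_files(ROOT / "audio", ROOT / "transcriptions")
print(f"Found {len(pairs)} audio/transcription pairs")
```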
Step 3: Environment Setup
Install the dependencies required by your chosen ML framework, such as TensorFlow, PyTorch, or Kaldi. Ensure the necessary audio processing libraries are installed, including librosa, soundfile, pydub, and scipy. Set up your Python environment with the provided requirements.txt file for seamless integration.
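A quick sanity check along these lines confirms the audio stack is importable before you begin; the package list simply mirrors the libraries named above.

```python
import importlib

# Confirm the audio-processing stack is importable before proceeding.
for pkg in ["librosa", "soundfile", "pydub", "scipy"]:
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'version unknown')}")
    except ImportError:
        print(f"{pkg}: missing -- install it (e.g. pip install -r requirements.txt)")
```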
Step 4: Data Preprocessing
Load the audio files using the provided sample scripts. Apply the necessary preprocessing steps, such as resampling, normalization, and feature extraction (MFCCs, spectrograms, or mel filterbank features). Use the included metadata to filter and organize data based on speaker demographics, recording quality, or other criteria relevant to your application.
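The sketch below shows one common preprocessing pipeline using librosa: load, resample, peak-normalize, and extract MFCCs. The file path is a placeholder, and the 16 kHz sample rate and 13 coefficients are typical ASR defaults rather than values specified by the dataset.

```python
import librosa
import numpy as np

TARGET_SR = 16_000  # common ASR sample rate; adjust to your model

def preprocess(path: str) -> np.ndarray:
    # librosa resamples to TARGET_SR on load and returns mono float audio
    audio, sr = librosa.load(path, sr=TARGET_SR)
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak                # peak normalization
    # 13 MFCCs per frame; shape is (13, n_frames)
    return librosa.feature.mfcc(y=audio, sr=TARGET_SR, n_mfcc=13)

features = preprocess("luri_speech_dataset/audio/sample_0001.wav")  # placeholder path
print(features.shape)
```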
Step 5: Model Training
Split the dataset into training, validation, and test sets using the provided speaker-independent split recommendations to avoid data leakage. Configure your model architecture for the specific task whether speech recognition, speaker identification, or other applications. Train your model using the transcriptions and audio pairs, monitoring performance on the validation set.
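One way to realize a speaker-independent split is scikit-learn's GroupShuffleSplit, grouping by speaker ID so no speaker spans two partitions. The metadata filename and column name below are assumptions; substitute the dataset's actual schema.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Assumed metadata file with one row per audio file and a speaker_id column.
meta = pd.read_csv("luri_speech_dataset/metadata/metadata.csv")

# 80% of speakers for training, 20% held out.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, holdout_idx = next(splitter.split(meta, groups=meta["speaker_id"]))
holdout = meta.iloc[holdout_idx]

# Split the holdout speakers evenly into validation and test.
val_splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
val_idx, test_idx = next(val_splitter.split(holdout, groups=holdout["speaker_id"]))

train, val, test = meta.iloc[train_idx], holdout.iloc[val_idx], holdout.iloc[test_idx]
print(len(train), len(val), len(test))
```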
Step 6: Evaluation and Fine-tuning
Evaluate model performance on the test set using standard metrics such as Word Error Rate for speech recognition or accuracy for classification tasks. Analyze errors and iterate on model architecture, hyperparameters, or preprocessing steps. Use the diverse speaker demographics to assess model fairness and performance across different groups.
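For Word Error Rate, the jiwer package offers a one-line computation; the strings below are illustrative stand-ins for the dataset's reference transcriptions and your model's decoded hypotheses.

```python
from jiwer import wer

# References come from the dataset's transcriptions; hypotheses from
# your model's decoder. These toy strings are for illustration only.
references = ["first reference transcription", "second reference transcription"]
hypotheses = ["first reference transcription", "second hypothesis transcription"]

print(f"WER: {wer(references, hypotheses):.3f}")
```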
Step 7: Deployment
Once satisfactory performance is achieved, export your trained model for deployment. Integrate the model into your application or service infrastructure. Continue monitoring real-world performance and use the dataset for ongoing model updates and improvements as needed.
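As one example of an export path, assuming a PyTorch model, TorchScript serializes architecture and weights into a single deployable file; TensorFlow users would use SavedModel instead. The tiny placeholder network below stands in for your trained model.

```python
import torch

# Placeholder network standing in for your trained model.
model = torch.nn.Sequential(
    torch.nn.Linear(13, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32)
)
model.eval()

# Serialize architecture + weights into one deployable artifact.
scripted = torch.jit.script(model)
scripted.save("luri_asr_model.pt")

# Later, in the serving environment:
restored = torch.jit.load("luri_asr_model.pt")
```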
For detailed code examples, integration guides, and troubleshooting tips, refer to the comprehensive documentation included with the dataset.