IndicWav2Vec
IndicWav2Vec is a multilingual speech model pretrained on 40 Indian languages. Among multilingual speech models, it covers the most diverse set of Indian languages. We fine-tune this model for downstream ASR in 9 languages and obtain state-of-the-art results on 3 public benchmarks: MUCS, MSR, and OpenSLR.
As part of IndicWav2Vec, we created the largest publicly available corpora for 40 languages from 4 different language families. We also trained state-of-the-art ASR models for 9 Indian languages.
All resources are publicly available: (i) pretraining data, (ii) pretrained models, (iii) fine-tuned models, and (iv) language models.
Our paper is available on arXiv
We are currently facing issues with access to the data. We will resolve this ASAP.
All the code and scripts for data processing, pretraining and fine-tuning can be found in our GitHub repository here
The YouTube Video IDs we used for creating the pretraining data can be found here
The pipeline for data processing can be found here
Language-wise fine-tuned models are listed in the table below:
| Language | URL |
|---|---|
| bengali | download |
| gujarati | download |
| hindi | download |
| marathi | download |
| nepali | download |
| odia | download |
| sinhala | download |
| tamil | download |
| telugu | download |
Language-wise KenLM language models are listed in the table below:
| Language | URL |
|---|---|
| bengali | download |
| gujarati | download |
| hindi | download |
| marathi | download |
| nepali | download |
| odia | download |
| tamil | download |
| telugu | download |
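
For readers wondering how a language model is typically combined with a fine-tuned ASR model, the sketch below shows one common approach: rescoring an n-best list by adding a weighted language-model score to each hypothesis's acoustic score. It is only an illustration, not the decoding setup used in the paper. The function names, the weights, and the toy scorer are all hypothetical; with the `kenlm` Python bindings installed, a scorer such as `kenlm.Model(path).score` could likely be passed in place of the toy one, assuming the downloaded file is in a format `kenlm` can load.

```python
from typing import Callable, List, Tuple


def rescore(
    nbest: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    lm_weight: float = 0.5,
    word_bonus: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return (text, combined_score) pairs sorted best first.

    nbest holds (hypothesis text, acoustic log-probability) pairs.
    lm_weight and word_bonus are illustrative defaults, not tuned values.
    """
    rescored = []
    for text, acoustic_logprob in nbest:
        n_words = len(text.split())
        total = acoustic_logprob + lm_weight * lm_score(text) + word_bonus * n_words
        rescored.append((text, total))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


def toy_lm_score(text: str) -> float:
    """Stand-in for a real LM: penalises words outside a tiny vocabulary."""
    vocab = {"namaste", "duniya"}
    return sum(0.0 if word in vocab else -5.0 for word in text.split())


# Two hypothetical ASR hypotheses with made-up acoustic scores.
nbest = [("namaste dunia", -1.0), ("namaste duniya", -1.5)]
best_text, best_score = rescore(nbest, toy_lm_score)[0]
print(best_text)
```

In this toy run the language model penalises the misspelled hypothesis, so the second candidate wins even though its acoustic score is lower. Setting `lm_weight=0.0` falls back to the acoustic ranking.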
- Tahir Javed (IITM, AI4Bharat)
- Sumanth Doddapaneni (AI4Bharat, RBCDSAI)
- Abhigyan Raman (AI4Bharat)
- Kaushal Bhogale (AI4Bharat)
- Gowtham Ramesh (AI4Bharat, RBCDSAI)
- Anoop Kunchukuttan (Microsoft, AI4Bharat)
- Pratyush Kumar (Microsoft, AI4Bharat)
- Mitesh Khapra (IITM, AI4Bharat, RBCDSAI)
If you are using any of the resources, please cite the following article:
@inproceedings{javed2021building,
title = {Towards Building ASR Systems for the Next Billion Users},
author = {Tahir Javed and Sumanth Doddapaneni and Abhigyan Raman and Kaushal Santosh Bhogale and Gowtham Ramesh and Anoop Kunchukuttan and Pratyush Kumar and Mitesh M. Khapra},
booktitle = "Proceedings of the AAAI Conference on Artificial Intelligence",
year = "2022 (to appear)",
}
The pretraining data and YouTube videos are licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
IndicWav2Vec is MIT-licensed. The license applies to all pretrained, fine-tuned and language models as well.