IndicWav2Vec


IndicWav2Vec is a multilingual speech model pretrained on 40 Indian languages. This model covers the largest diversity of Indian languages in the pool of multilingual speech models. We fine-tune this model for downstream ASR in 9 languages and obtain state-of-the-art results on 3 public benchmarks, namely MUCS, MSR and OpenSLR.

As part of IndicWav2Vec, we create the largest publicly available pretraining corpora for 40 languages from 4 different language families. We also train state-of-the-art ASR models for 9 Indian languages.

All the resources, namely (i) pretraining data, (ii) pretrained models, (iii) fine-tuned models and (iv) language models, are made publicly available.

Our paper is available on arXiv

Update 05-01-2022

We are currently facing issues with access to the data and will resolve this as soon as possible.

Code

All the code and scripts for data processing, pretraining and fine-tuning can be found in our GitHub repository here

Downloads

Pretraining Data

The YouTube Video IDs we used for creating the pretraining data can be found here

The pipeline for data processing can be found here

Pretrained Models

  • IndicWav2Vec BASE can be downloaded from here
  • IndicWav2Vec LARGE can be downloaded from here

Fine-tuned Models

Language-wise fine-tuned models can be found in the table below

Language URL
bengali download
gujarati download
hindi download
marathi download
nepali download
odia download
sinhala download
tamil download
telugu download
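The fine-tuned models are CTC-based wav2vec 2.0 acoustic models, so their frame-level predictions must be collapsed into a transcript at decoding time. As a rough illustration only, here is a minimal greedy CTC decoder in Python; the vocabulary and blank index below are toy assumptions, not those of the released models, and in practice you would use the decoding scripts in our repository.

```python
# Minimal greedy CTC decoder: collapse repeated per-frame predictions,
# then drop the blank token. The vocabulary and blank id are illustrative.
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Map per-frame argmax token ids to a transcript string."""
    # 1. Collapse consecutive repeats (CTC emits one label per frame).
    collapsed = [token for token, _ in groupby(frame_ids)]
    # 2. Remove blanks, then map the remaining ids to characters.
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)

# Toy vocabulary: 0 is the CTC blank, the rest are characters.
vocab = {0: "<blank>", 1: "n", 2: "a", 3: "m", 4: "s", 5: "t", 6: "e"}
frames = [1, 1, 0, 2, 2, 3, 0, 0, 2, 4, 5, 5, 0, 6]
print(ctc_greedy_decode(frames, vocab))  # namaste
```

Note how the blank between the two "a" frames keeps them from being merged; this is the standard CTC collapsing rule, independent of any particular model.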

Language Models

Language-wise KenLM language models can be found in the table below

Language URL
bengali download
gujarati download
hindi download
marathi download
nepali download
odia download
tamil download
telugu download
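These KenLM models are typically combined with the acoustic model during beam-search decoding via a log-linear interpolation of scores (shallow fusion). The sketch below shows only the scoring rule; the weights alpha and beta and the toy candidates are hypothetical, and the actual decoder and tuned weights are in our repository.

```python
# Illustrative shallow-fusion scoring: acoustic score plus a weighted
# language-model score plus a word-insertion bonus. alpha/beta values
# here are placeholders, not the tuned weights of the released models.

def fused_score(acoustic_logprob, lm_logprob, n_words, alpha=0.5, beta=1.0):
    """score = log P_acoustic + alpha * log P_LM + beta * word_count"""
    return acoustic_logprob + alpha * lm_logprob + beta * n_words

# Toy candidates: (transcript, acoustic log-prob, LM log-prob).
candidates = [
    ("mera naam", -4.0, -2.0),   # slightly worse acoustically, better LM
    ("mere naam", -3.8, -6.0),   # better acoustically, worse LM
]
best = max(candidates,
           key=lambda c: fused_score(c[1], c[2], len(c[0].split())))
print(best[0])  # mera naam
```

Here the language model overrides a small acoustic preference, which is exactly the effect the KenLM models have during decoding: tuning alpha trades off acoustic evidence against LM fluency, while beta counteracts the LM's bias toward shorter outputs.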

Contributors

Citing

If you use any of these resources, please cite the following article:

@inproceedings{javed2021building,
    title = {Towards Building ASR Systems for the Next Billion Users},
    author = {Tahir Javed and Sumanth Doddapaneni and Abhigyan Raman and Kaushal Santosh Bhogale and Gowtham Ramesh and Anoop Kunchukuttan and Pratyush Kumar and Mitesh M. Khapra},
    booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
    year = {2022},
    note = {to appear},
}

License

The pretraining data and YouTube videos are licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

IndicWav2Vec is MIT-licensed. The license applies to all pretrained, fine-tuned and language models as well.