IndicXlit
IndicXlit is a Transformer-based multilingual transliteration model (~11M parameters) for romanised-to-Indic-script conversion, supporting 21 languages from the Indian subcontinent.
The English-Indic transliteration model can be downloaded here.
All information and instructions for using the IndicXlit model can be found in the IndicXlit GitHub repository.
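For running the actual model, follow the GitHub repository above. Purely to illustrate what the task's input and output look like, here is a toy rule-based romanised→Devanagari sketch. The rules are hand-written for this one example and are not part of IndicXlit, which learns the mapping from data:

```python
# Toy longest-match substitution rules for a single word.
# Real transliteration is ambiguous and context-dependent,
# which is why IndicXlit uses a learned Transformer instead.
RULES = {"na": "न", "ma": "म", "s": "स्", "te": "ते"}

def transliterate(word: str) -> str:
    out, i = [], 0
    while i < len(word):
        for n in range(min(3, len(word) - i), 0, -1):  # try longest match first
            if word[i:i + n] in RULES:
                out.append(RULES[word[i:i + n]])
                i += n
                break
        else:
            out.append(word[i])  # unknown character: pass through unchanged
            i += 1
    return "".join(out)

print(transliterate("namaste"))  # → नमस्ते
```

A rule table like this breaks down quickly (e.g. the same Latin letter can map to different Indic characters depending on context), which motivates the data-driven approach the model takes.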
IndicXlit is trained on the Aksharantar dataset, which covers 21 language pairs. The sizes of the training, validation, and test splits for each language pair are listed below:
Subset | as-en | bn-en | brx-en | gu-en | hi-en | kn-en | ks-en | kok-en | mai-en | ml-en | mni-en | mr-en | ne-en | or-en | pa-en | san-en | sd-en | ta-en | te-en | ur-en |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Training | 179K | 1231K | 36K | 1143K | 1299K | 2907K | 47K | 613K | 283K | 4101K | 10K | 1453K | 2397K | 346K | 515K | 1813K | 60K | 3231K | 2430K | 699K |
Validation | 4K | 11K | 3K | 12K | 6K | 7K | 4K | 4K | 4K | 8K | 3K | 8K | 3K | 3K | 9K | 3K | 8K | 9K | 8K | 12K |
Test | 5531 | 5009 | 4136 | 7768 | 5693 | 6396 | 7707 | 5093 | 5512 | 6911 | 4925 | 6573 | 4133 | 4256 | 4316 | 5334 | - | 4682 | 4567 | 4463 |
In total, the Aksharantar dataset contains ~26M word pairs.
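The split table is also handy to inspect programmatically, for instance to see how skewed the per-language training volumes are. A minimal sketch, with training counts (in thousands) copied from the table above:

```python
# Training-split sizes per language pair, in thousands of word pairs,
# copied from the Aksharantar split table above.
train_k = {"as": 179, "bn": 1231, "brx": 36, "gu": 1143, "hi": 1299,
           "kn": 2907, "ks": 47, "kok": 613, "mai": 283, "ml": 4101,
           "mni": 10, "mr": 1453, "ne": 2397, "or": 346, "pa": 515,
           "san": 1813, "sd": 60, "ta": 3231, "te": 2430, "ur": 699}

largest = max(train_k, key=train_k.get)
smallest = min(train_k, key=train_k.get)
print(largest, train_k[largest])    # → ml 4101
print(smallest, train_k[smallest])  # → mni 10
```

The spread is large: Malayalam has roughly 400x the training data of Manipuri, which is worth keeping in mind when interpreting per-language accuracy.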
We evaluate the IndicXlit model on the Dakshina test set and compare our results with those reported in the Dakshina paper. Results are shown below:
Language | bn | gu | hi | kn | ml | mr | pa | sd | si | ta | te | ur | Avg |
Dakshina | 49.40 | 49.50 | 50.00 | 66.20 | 58.30 | 49.70 | 40.90 | 33.20 | 54.70 | 65.70 | 67.60 | 36.70 | 51.83 |
IndicXlit | 55.49 | 62.02 | 60.56 | 77.18 | 63.56 | 64.85 | 47.24 | 48.56 | 63.91 | 68.10 | 73.38 | 42.12 | 60.58 |
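The Avg column appears to be the unweighted mean of the per-language scores, which is easy to verify. A quick check, with the numbers copied from the table above:

```python
# Per-language Dakshina test-set scores, copied from the table above
# (column order: bn, gu, hi, kn, ml, mr, pa, sd, si, ta, te, ur).
dakshina  = [49.40, 49.50, 50.00, 66.20, 58.30, 49.70,
             40.90, 33.20, 54.70, 65.70, 67.60, 36.70]
indicxlit = [55.49, 62.02, 60.56, 77.18, 63.56, 64.85,
             47.24, 48.56, 63.91, 68.10, 73.38, 42.12]

avg = lambda xs: sum(xs) / len(xs)
print(f"Dakshina avg:  {avg(dakshina):.2f}")
print(f"IndicXlit avg: {avg(indicxlit):.2f}")
print(f"Absolute gain: {avg(indicxlit) - avg(dakshina):.2f} points")
```

IndicXlit improves on the Dakshina baseline for every language, with an average gain of close to 9 points.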
The Aksharantar paper also reports strong baseline results on the Aksharantar test set.
Contributors:

- Yash Madhani (AI4Bharat, IITM)
- Sushane Parthan (AI4Bharat, IITM)
- Priyanka Bedekar (AI4Bharat, IITM)
- Ruchi Khapra (AI4Bharat)
- Anoop Kunchukuttan (AI4Bharat, Microsoft)
- Pratyush Kumar (AI4Bharat, IITM, Microsoft)
- Mitesh M. Khapra (AI4Bharat, IITM)
If you are using IndicXlit, please cite the following paper:
```
@misc{madhani2022aksharantar,
      title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users},
      author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
      year={2022},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
The IndicXlit code and model are released under the MIT License.