Aksharantar
Aksharantar is the largest publicly available transliteration dataset for 21 Indic languages. The corpus has 26M Indic language-English transliteration pairs.
- The Aksharantar dataset can be downloaded from the Aksharantar Hugging Face repository.
- Each language-pair corpus in the Aksharantar dataset is split into training, validation and test subsets. Each subset is a JSONL file consisting of individual data instances comprising a unique identifier, native word, English word, transliteration source and a score (if applicable).
- Download links for individual language pairs are provided in the data split table below.
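Since each subset is a JSONL file (one JSON object per line), it can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: the helper names `read_jsonl` and `count_by_field` are hypothetical, and because the exact key names inside each record are not listed here, the field to tally is passed in by the caller instead of being hard-coded.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record (a dict) per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


def count_by_field(records: Iterable[dict], field: str) -> Dict[str, int]:
    """Tally records by the value of `field`; records lacking it are skipped."""
    counts = Counter(
        str(record[field]) for record in records if field in record
    )
    return dict(counts)
```

To use it, pass the path of a downloaded subset to `read_jsonl`, inspect the keys of the first record, and then call `count_by_field` with whichever key holds the transliteration source. Reading lazily with a generator keeps memory use flat even for the larger training subsets.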
The language-wise splits for Aksharantar are shown in the table below, with the number of word pairs in each subset (K = thousand). Each column header links to the download for that language pair, with its size in parentheses.
Subset | as-en (4.72 MB) | bn-en (31.5 MB) | brx-en (0.933 MB) | gu-en (29.5 MB) | hi-en (31.4 MB) | kn-en (83.7 MB) | ks-en (1.1 MB) | kok-en (16.6 MB) | mai-en (6.74 MB) | ml-en (125 MB) | mni-en (0.313 MB) | mr-en (39.9 MB) | ne-en (67 MB) | or-en (9.09 MB) | pa-en (12.1 MB) | sa-en (56 MB) | sd-en (1.37 MB) | ta-en (92.7 MB) | te-en (69.1 MB) | ur-en (17 MB) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Training | 179K | 1231K | 36K | 1143K | 1299K | 2907K | 47K | 613K | 283K | 4101K | 10K | 1453K | 2397K | 346K | 515K | 1813K | 60K | 3231K | 2430K | 699K |
Validation | 4K | 11K | 3K | 12K | 6K | 7K | 4K | 4K | 4K | 8K | 3K | 8K | 3K | 3K | 9K | 3K | 8K | 9K | 8K | 12K |
Test | 5531 | 5009 | 4136 | 7768 | 5693 | 6396 | 7707 | 5093 | 5512 | 6911 | 4925 | 6573 | 4133 | 4256 | 4316 | 5334 | - | 4682 | 4567 | 4463 |
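The table mixes rounded counts such as `179K` with exact counts such as `5531`, and uses `-` where no subset is listed. The sketch below shows one way to normalize these cells to integers and total a row. The helper name `parse_count` is hypothetical, and the training row is copied from the table above. Because those values are rounded to the nearest thousand, the total is approximate.

```python
from typing import Optional


def parse_count(cell: str) -> Optional[int]:
    """Convert a table cell such as '179K', '5531' or '-' to an integer.

    Returns None for an empty cell or '-', meaning no count is listed.
    """
    cell = cell.strip()
    if cell in ("", "-"):
        return None
    if cell.upper().endswith("K"):
        return int(cell[:-1]) * 1000  # 'K' denotes thousands
    return int(cell)


# Training row copied from the table (20 language pairs, rounded to thousands).
TRAINING_ROW = (
    "179K | 1231K | 36K | 1143K | 1299K | 2907K | 47K | 613K | 283K | 4101K | "
    "10K | 1453K | 2397K | 346K | 515K | 1813K | 60K | 3231K | 2430K | 699K"
)

training_counts = [parse_count(cell) for cell in TRAINING_ROW.split("|")]
total_training = sum(count for count in training_counts if count is not None)
print(f"Approximate training pairs listed in the table: {total_training:,}")
```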
- 07 May 2022 - The Aksharantar dataset is now available for download.
- Yash Madhani (AI4Bharat, IITM)
- Sushane Parthan (AI4Bharat, IITM)
- Priyanka Bedekar (AI4Bharat, IITM)
- Ruchi Khapra (AI4Bharat)
- Anoop Kunchukuttan (AI4Bharat, Microsoft)
- Pratyush Kumar (AI4Bharat, IITM, Microsoft)
- Mitesh Shantadevi Khapra (AI4Bharat, IITM)
If you are using any of the resources, please cite the following article:
@misc{madhani2022aksharantar,
title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users},
author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
year={2022},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This data is released under the following licensing scheme:
- Manually collected data: Released under CC-BY license.
- Mined dataset (from Samanantar and IndicCorp): Released under CC0 license.
- Existing sources: Released under CC0 license.
CC-BY License
CC0 License Statement
- We do not own any of the text from which this data has been extracted.
- We license the actual packaging of the mined data under the Creative Commons CC0 license (“no rights reserved”).
- To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Aksharantar manually collected data and existing sources.
- This work is published from: India.