Aksharantar


Aksharantar is the largest publicly available transliteration dataset for 21 Indic languages. The corpus has 26M Indic language-English transliteration pairs.

Downloads

  • The Aksharantar dataset can be downloaded from the Aksharantar Hugging Face repository.
  • Each language-pair corpus in the Aksharantar dataset is split into training, validation and test subsets. Each subset is a JSONL file in which every record holds a unique identifier, the native word, the English word, the transliteration source and a score (if applicable).
  • Individual language-pair download links are provided in the Data Split section below.
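
A minimal sketch of reading one subset in Python, assuming each line of the file is a single JSON object. The key names used here (`unique_identifier`, `native word`, `english word`, `source`, `score`) are assumptions inferred from the field description above and should be checked against a downloaded file; the `Pair` class, the `read_pairs` helper and the sample records are made up for illustration.

```python
import io
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Pair:
    uid: str
    native: str
    english: str
    source: str
    score: Optional[float]  # None when no score applies


def read_pairs(lines: Iterable[str]) -> Iterator[Pair]:
    """Yield one Pair per non-empty JSONL line.

    The key names below are assumed, not confirmed by the dataset card.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        yield Pair(
            uid=rec["unique_identifier"],
            native=rec["native word"],
            english=rec["english word"],
            source=rec["source"],
            score=rec.get("score"),  # a missing or null score becomes None
        )


# Made-up records standing in for a real subset file.
SAMPLE = io.StringIO(
    '{"unique_identifier": "x1", "native word": "नमस्ते", '
    '"english word": "namaste", "source": "example", "score": null}\n'
    '{"unique_identifier": "x2", "native word": "भारत", '
    '"english word": "bharat", "source": "example", "score": -0.5}\n'
)

pairs = list(read_pairs(SAMPLE))
for p in pairs:
    print(p.uid, p.native, p.english, p.source, p.score)
```

To read a real subset, pass an open file handle instead of `SAMPLE`, for example `open(path, encoding="utf-8")`, where `path` points at a downloaded JSONL file.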

Data Split

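The table below shows the language-wise splits of Aksharantar, with the number of word pairs in each subset. Training and validation counts are given in thousands (K); test counts are exact. The download for each language pair is linked from its name in the table, alongside the file size.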

Language pair  Size (MB)  Training  Validation  Test
as-en          4.72       179K      4K          5531
bn-en          31.5       1231K     11K         5009
brx-en         0.933      36K       3K          4136
gu-en          29.5       1143K     12K         7768
hi-en          31.4       1299K     6K          5693
kn-en          83.7       2907K     7K          6396
ks-en          1.1        47K       4K          7707
kok-en         16.6       613K      4K          5093
mai-en         6.74       283K      4K          5512
ml-en          125        4101K     8K          6911
mni-en         0.313      10K       3K          4925
mr-en          39.9       1453K     8K          6573
ne-en          67         2397K     3K          4133
or-en          9.09       346K      3K          4256
pa-en          12.1       515K      9K          4316
sa-en          56         1813K     3K          5334
sd-en          1.37       60K       8K          -
ta-en          92.7       3231K     9K          4682
te-en          69.1       2430K     8K          4567
ur-en          17         699K      12K         4463
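
As a quick sanity check, the counts in the table can be tallied in a few lines of Python. The figures are transcribed from the table; training and validation counts are rounded to thousands there, so the total is only approximate and covers just the language pairs listed. The names `SPLITS` and `approximate_total` are illustrative.

```python
from typing import Dict, Optional, Tuple

# (training in thousands, validation in thousands, exact test count)
# None marks the pair for which the table lists no test set.
SPLITS: Dict[str, Tuple[int, int, Optional[int]]] = {
    "as-en": (179, 4, 5531),
    "bn-en": (1231, 11, 5009),
    "brx-en": (36, 3, 4136),
    "gu-en": (1143, 12, 7768),
    "hi-en": (1299, 6, 5693),
    "kn-en": (2907, 7, 6396),
    "ks-en": (47, 4, 7707),
    "kok-en": (613, 4, 5093),
    "mai-en": (283, 4, 5512),
    "ml-en": (4101, 8, 6911),
    "mni-en": (10, 3, 4925),
    "mr-en": (1453, 8, 6573),
    "ne-en": (2397, 3, 4133),
    "or-en": (346, 3, 4256),
    "pa-en": (515, 9, 4316),
    "sa-en": (1813, 3, 5334),
    "sd-en": (60, 8, None),
    "ta-en": (3231, 9, 4682),
    "te-en": (2430, 8, 4567),
    "ur-en": (699, 12, 4463),
}


def approximate_total(splits: Dict[str, Tuple[int, int, Optional[int]]]) -> int:
    """Sum all subsets, expanding the rounded K figures to word-pair counts."""
    total = 0
    for train_k, valid_k, test in splits.values():
        total += train_k * 1000 + valid_k * 1000
        if test is not None:
            total += test
    return total


total_pairs = approximate_total(SPLITS)
print(f"{len(SPLITS)} language pairs, about {total_pairs / 1e6:.1f}M word pairs")
```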

Change Log

  • 07 May 2022 - The Aksharantar dataset is now available for download.

Contributors

Citing

If you are using any of the resources, please cite the following article:

@misc{madhani2022aksharantar,
      title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users}, 
      author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
      year={2022},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

This data is released under the following licensing scheme:

  • Manually collected data: Released under CC-BY license.
  • Mined dataset (from Samanantar and IndicCorp): Released under CC0 license.
  • Existing sources: Released under CC0 license.

CC-BY License

CC-BY

CC0 License Statement

  • We do not own any of the text from which this data has been extracted.
  • We license the actual packaging of the mined data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Aksharantar manually collected data and existing sources.
  • This work is published from: India.