Aksharantar

Aksharantar is the largest publicly available transliteration dataset for 21 Indic languages. The corpus has 26M Indic language-English transliteration pairs.

Downloads

The Aksharantar dataset can be downloaded from the Aksharantar Hugging Face repository.
Each language-pair corpus in the Aksharantar dataset is split into training, validation and test subsets. Each subset is a JSONL file in which every data instance comprises a unique identifier, the native word, the English word, the transliteration source and a score (where applicable).
Individual language-pair download links are provided in the Data Split section below.
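
As a minimal sketch of how one subset file might be read, the snippet below parses JSONL text line by line with Python's standard `json` module. The field names (`unique_identifier`, `native word`, `english word`, `source`, `score`) and the two sample records are assumptions made for illustration; check them against the downloaded files before relying on them.

```python
import io
import json

# Two made-up records for illustration only. The field names are an
# assumption and should be verified against the actual JSONL files.
SAMPLE_JSONL = (
    '{"unique_identifier": "hin1", "native word": "नमस्ते", '
    '"english word": "namaste", "source": "Existing", "score": null}\n'
    '{"unique_identifier": "hin2", "native word": "भारत", '
    '"english word": "bharat", "source": "Existing", "score": null}\n'
)


def read_pairs(stream):
    """Yield one dict per non-empty line of a JSONL text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# io.StringIO stands in for an opened file here.
records = list(read_pairs(io.StringIO(SAMPLE_JSONL)))
for record in records:
    print(record["native word"], "->", record["english word"])
```

For a downloaded subset, the same function can be given a file object opened with `open(path, encoding="utf-8")`, where `path` points to whichever JSONL file was extracted locally.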

Data Split

The language-wise splits for Aksharantar are shown in the table below, with the number of word pairs in each subset (K = thousand). The figure in parentheses after each language pair is its download size, and each language-pair name links to the individual download.

Subset	as-en (4.72 MB)	bn-en (31.5 MB)	brx-en (0.933 MB)	gu-en (29.5 MB)	hi-en (31.4 MB)	kn-en (83.7 MB)	ks-en (1.1 MB)	kok-en (16.6 MB)	mai-en (6.74 MB)	ml-en (125 MB)	mni-en (0.313 MB)	mr-en (39.9 MB)	ne-en (67 MB)	or-en (9.09 MB)	pa-en (12.1 MB)	sa-en (56 MB)	sd-en (1.37 MB)	ta-en (92.7 MB)	te-en (69.1 MB)	ur-en (17 MB)
Training	179K	1231K	36K	1143K	1299K	2907K	47K	613K	283K	4101K	10K	1453K	2397K	346K	515K	1813K	60K	3231K	2430K	699K
Validation	4K	11K	3K	12K	6K	7K	4K	4K	4K	8K	3K	8K	3K	3K	9K	3K	8K	9K	8K	12K
Test	5531	5009	4136	7768	5693	6396	7707	5093	5512	6911	4925	6573	4133	4256	4316	5334	-	4682	4567	4463
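
The table mixes rounded counts (for example `179K`) with exact counts and a `-` for a missing split. The sketch below shows one way to turn such cells into integers and total a row. The helper name `parse_count` is hypothetical, and totals over rounded cells are only approximate.

```python
def parse_count(cell):
    """Convert a table cell such as '179K', '5531' or '-' to an int (or None)."""
    cell = cell.strip()
    if cell == "-":
        return None  # no data reported for this split
    if cell.endswith("K"):
        return int(cell[:-1]) * 1000  # rounded to the nearest thousand
    return int(cell)


# Rows copied from the table above, in column order.
training_row = (
    "179K 1231K 36K 1143K 1299K 2907K 47K 613K 283K 4101K "
    "10K 1453K 2397K 346K 515K 1813K 60K 3231K 2430K 699K"
).split()
test_row = (
    "5531 5009 4136 7768 5693 6396 7707 5093 5512 6911 "
    "4925 6573 4133 4256 4316 5334 - 4682 4567 4463"
).split()

training_total = sum(parse_count(c) for c in training_row)
# Skip language pairs whose test split is not reported.
test_total = sum(n for n in map(parse_count, test_row) if n is not None)

print(f"Training pairs (approx.): {training_total:,}")
print(f"Test pairs: {test_total:,}")
```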

Change Log

07 May 2022 - The Aksharantar dataset is now available for download.

Contributors

Yash Madhani (AI4Bharat, IITM)
Sushane Parthan (AI4Bharat, IITM)
Priyanka Bedekar (AI4Bharat, IITM)
Ruchi Khapra (AI4Bharat)
Anoop Kunchukuttan (AI4Bharat, Microsoft)
Pratyush Kumar (AI4Bharat, IITM, Microsoft)
Mitesh Shantadevi Khapra (AI4Bharat, IITM)

Citing

If you are using any of the resources, please cite the following article:

@misc{madhani2022aksharantar,
      title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users}, 
      author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
      year={2022},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

This data is released under the following licensing scheme:

Manually collected data: Released under the CC-BY license.
Mined dataset (from Samanantar and IndicCorp): Released under the CC0 license.
Existing sources: Released under the CC0 license.

CC-BY License

CC0 License Statement

We do not own any of the text from which this data has been extracted.
We license the actual packaging of the mined data under the Creative Commons CC0 license (“no rights reserved”).
To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Aksharantar manually collected data and existing sources.
This work is published from: India.