Samanantar
Samanantar is the largest publicly available collection of parallel corpora for 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. The corpus has 49.6M sentence pairs between English and Indian languages.
Samanantar v0.3, along with LaBSE score metadata, is available for download. Go to Downloads.
The publicly released version is randomly shuffled, untokenized, and deduplicated.
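The LaBSE scores in the metadata can be used to keep only higher-confidence pairs. The following is a minimal sketch, assuming the records have already been loaded as (English sentence, Indic sentence, score) tuples. The record layout, the `filter_by_labse` helper, and the 0.8 threshold are illustrative assumptions, not part of the release.

```python
from typing import Iterable, List, Tuple

# Assumed in-memory layout: (English sentence, Indic sentence, LaBSE score).
Pair = Tuple[str, str, float]


def filter_by_labse(pairs: Iterable[Pair], threshold: float = 0.8) -> List[Pair]:
    """Keep pairs whose score is at least `threshold` (a hypothetical cut-off)."""
    return [p for p in pairs if p[2] >= threshold]


# Toy records for illustration only; they are not taken from the corpus.
sample = [
    ("Hello", "नमस्ते", 0.91),
    ("Good morning", "शुभ रात्रि", 0.62),
]
kept = filter_by_labse(sample)
```

A suitable threshold depends on the downstream task; a higher value trades corpus size for cleaner alignments.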
The test sets used to benchmark IndicTrans can be found here.
The Semantic Textual Similarity (STS) benchmark can be downloaded from here.
The entire dataset can be downloaded from Samanantar v0.3.
The folder has two directories:
- existing: all existing data compiled before Samanantar
- created: data mined as part of Samanantar

There is a separate sub-directory for each source.
- Please use this mirror Google Drive link to download the source-wise splits of the v0.3 data
- Please use this mirror Google Drive link to download the v0.3 data
- Please use this mirror Google Drive link to download the benchmarks
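Once a language pair has been downloaded, the two sides can be read together. The sketch below assumes the data is stored as two plain-text files with one sentence per line, aligned by line number; the file layout, the example file names, and the `read_parallel` helper are assumptions for illustration and should be checked against the actual download.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple


def read_parallel(src_path: Path, tgt_path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned text files."""
    with src_path.open(encoding="utf-8") as src, tgt_path.open(encoding="utf-8") as tgt:
        for s, t in zip_longest(src, tgt):
            # A missing line on either side means the files are not aligned.
            if s is None or t is None:
                raise ValueError("files have different numbers of lines")
            yield s.rstrip("\n"), t.rstrip("\n")


# Hypothetical usage (paths are placeholders, not confirmed file names):
# for en, hi in read_parallel(Path("en-hi/train.en"), Path("en-hi/train.hi")):
#     ...
```

Reading lazily with a generator avoids loading millions of sentence pairs into memory at once.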
The language-wise splits for Samanantar v0.2 can be found in the table below. Each link is labeled with the number of sentence pairs in millions.
Language pair (download size) | Link (sentence pairs) |
---|---|
en-as (9 MB) | 0.14M |
en-bn (580 MB) | 8.52M |
en-gu (178 MB) | 3.05M |
en-hi (818 MB) | 8.56M |
en-kn (229 MB) | 4.07M |
en-ml (365 MB) | 5.85M |
en-mr (210 MB) | 3.32M |
en-or (65 MB) | 1.00M |
en-pa (175 MB) | 2.42M |
en-ta (350 MB) | 5.16M |
en-te (280 MB) | 4.82M |
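As a quick sanity check, the per-language counts in the table can be summed to get the total size of the v0.2 language-wise splits. The values below are copied from the table; the variable names are illustrative.

```python
# Sentence-pair counts in millions, copied from the v0.2 table above.
pairs_millions = {
    "en-as": 0.14, "en-bn": 8.52, "en-gu": 3.05, "en-hi": 8.56,
    "en-kn": 4.07, "en-ml": 5.85, "en-mr": 3.32, "en-or": 1.00,
    "en-pa": 2.42, "en-ta": 5.16, "en-te": 4.82,
}

# Round to two decimals to avoid floating-point noise in the sum.
total = round(sum(pairs_millions.values()), 2)
print(f"{total:.2f}M sentence pairs across {len(pairs_millions)} language pairs")
# prints "46.91M sentence pairs across 11 language pairs"
```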
The entire Indic-Indic data can be downloaded from here.
Language-wise splits for the Indic-Indic data can be downloaded from the table below. Each link is labeled with the number of sentence pairs in millions.
| | as | bn | gu | hi | kn | ml | mr | or | pa | ta | te |
|---|---|---|---|---|---|---|---|---|---|---|---|
| as | - | 0.36M | 0.14M | 0.16M | 0.19M | 0.23M | 0.16M | 0.07M | 0.11M | 0.22M | 0.21M |
| bn | 0.36M | - | 1.58M | 2.53M | 2.14M | 2.88M | 1.83M | 0.59M | 1.11M | 2.44M | 2.35M |
| gu | 0.14M | 1.58M | - | 1.86M | 2.07M | 2.36M | 1.76M | 0.53M | 1.13M | 2.07M | 2.31M |
| hi | 0.16M | 2.53M | 1.86M | - | 2.14M | 2.72M | 1.99M | 0.66M | 1.44M | 2.48M | 2.42M |
| kn | 0.19M | 2.14M | 2.07M | 2.14M | - | 2.89M | 1.82M | 0.54M | 1.12M | 2.52M | 2.81M |
| ml | 0.23M | 2.88M | 2.36M | 2.72M | 2.89M | - | 1.82M | 0.56M | 1.11M | 2.60M | 2.68M |
| mr | 0.16M | 1.83M | 1.76M | 1.99M | 1.82M | 1.82M | - | 0.58M | 1.06M | 2.12M | 2.23M |
| or | 0.07M | 0.59M | 0.53M | 0.66M | 0.54M | 0.56M | 0.58M | - | 0.50M | 1.09M | 1.12M |
| pa | 0.11M | 1.11M | 1.13M | 1.44M | 1.12M | 1.11M | 1.06M | 0.50M | - | 1.75M | 1.76M |
| ta | 0.22M | 2.44M | 2.07M | 2.48M | 2.52M | 2.60M | 2.12M | 1.09M | 1.75M | - | 2.61M |
| te | 0.21M | 2.35M | 2.31M | 2.42M | 2.81M | 2.68M | 2.23M | 1.12M | 1.76M | 2.61M | - |
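Each language pair appears in two cells of the table (for example, bn-gu and gu-bn), so the number of distinct Indic-Indic pairs is smaller than the number of cells. A short sketch that enumerates the distinct pairs from the language codes used in the table:

```python
from itertools import combinations

# Language codes as they appear in the table header.
langs = ["as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]

# Unordered pairs: (bn, gu) and (gu, bn) count as the same pair.
indic_pairs = list(combinations(langs, 2))
print(len(indic_pairs))  # 55
```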
- 06 July 2021, v0.2.1 data with metadata of source and LaBSE Alignment Score (LAS) was made available here
- 09 June 2021, The Semantic Textual Similarity (STS) benchmark is now available for download
- 05 June 2021, The benchmarking testsets are now available for download
- 15 May 2021, The language wise splits are now available for download
- 02 May 2021, Indic-Indic v0.2 data has been updated with super strict overlap removal
- 30 April 2021, v0.2 uses super strict overlap removal of validation and test data with train data
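The overlap removal mentioned in the updates above is not specified in detail here. As a rough illustration of the general idea only, the sketch below drops training pairs whose source sentence also occurs in held-out (validation or test) data after a simple normalization. The normalization rule and the `normalize` and `remove_overlap` helpers are assumptions, not the procedure actually used for v0.2.

```python
from typing import Iterable, List, Set, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed notion of 'same sentence')."""
    return " ".join(text.lower().split())


def remove_overlap(
    train: Iterable[Tuple[str, str]], heldout_sources: Iterable[str]
) -> List[Tuple[str, str]]:
    """Drop training pairs whose normalized source appears in the held-out data."""
    blocked: Set[str] = {normalize(s) for s in heldout_sources}
    return [(src, tgt) for src, tgt in train if normalize(src) not in blocked]


# Toy data for illustration only.
train = [("The cat sat.", "tgt-1"), ("A new sentence.", "tgt-2")]
heldout = ["the  cat sat."]
clean = remove_overlap(train, heldout)
```

A stricter variant could also compare target-side sentences or strip punctuation before matching.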
- Gowtham Ramesh, (RBCDSAI, IITM)
- Sumanth Doddapaneni, (RBCDSAI, IITM)
- Aravinth Bheemaraj, (Tarento, EkStep)
- Mayank Jobanputra, (IITM)
- Raghavan AK, (AI4Bharat)
- Ajitesh Sharma, (Tarento, EkStep)
- Sujit Sahoo, (Tarento, EkStep)
- Harshita Diddee, (AI4Bharat)
- Mahalakshmi J, (AI4Bharat)
- Divyanshu Kakwani, (IITM, AI4Bharat)
- Navneet Kumar, (Tarento, EkStep)
- Aswin Pradeep, (Tarento, EkStep)
- Srihari Nagaraj, (Tarento, EkStep)
- Kumar Deepak, (Tarento, EkStep)
- Vivek Raghavan, (EkStep)
- Anoop Kunchukuttan, (Microsoft, AI4Bharat)
- Pratyush Kumar, (RBCDSAI, AI4Bharat, IITM)
- Mitesh Shantadevi Khapra, (RBCDSAI, AI4Bharat, IITM)
If you are using any of the resources, please cite the following article:
@misc{ramesh2021samanantar,
title={Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
author={Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
year={2021},
eprint={2104.05596},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
The BibTeX entries for the existing data sources are available here.
This data is released under this licensing scheme:
- We do not own any of the text from which this data has been extracted.
- We license the actual packaging of this data under the Creative Commons CC0 license (“no rights reserved”).
- To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Samanantar.
- This work is published from: India.