Samanantar


Samanantar is the largest publicly available collection of parallel corpora for 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. The corpus contains 49.6M sentence pairs between English and these Indic languages.

Update 09-06-2021

The Semantic Textual Similarity (STS) benchmark is now available for download

Update 05-06-2021

The benchmarking test sets are now available for download

Dataset Format

The publicly released version is randomly shuffled, untokenized, and deduplicated.
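Because the release is untokenized and shuffled, a language-pair split can be read as plain sentence pairs with no preprocessing. The sketch below assumes that a split is stored as two line-aligned UTF-8 text files, one sentence per line; the file names `train.en` and `train.hi` and the helper `read_parallel` are hypothetical and not taken from this README.

```python
from itertools import zip_longest
from typing import Iterator, Tuple

_MISSING = object()  # sentinel marking the end of the shorter file


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumes line i of src_path is the translation of line i of tgt_path.
    Raises ValueError if the files do not have the same number of lines.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        pairs = zip_longest(src, tgt, fillvalue=_MISSING)
        for line_no, (s, t) in enumerate(pairs, start=1):
            if s is _MISSING or t is _MISSING:
                raise ValueError(f"files differ in length at line {line_no}")
            # Sentences are untokenized, so only the newline is stripped.
            yield s.rstrip("\n"), t.rstrip("\n")
```

For example, `for en, hi in read_parallel("train.en", "train.hi"): ...` would iterate over the pairs lazily, which matters for the larger splits that do not fit comfortably in memory.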

Downloads

Benchmarks

The test sets used to benchmark IndicTrans can be found here

STS Benchmark

The Semantic Textual Similarity (STS) benchmark can be downloaded from here

En-Indic

The entire dataset can be downloaded from here

The language-wise splits can be found in the table below. Each link is labelled with the number of sentence pairs, in millions.

Language Pair (Download Size) Link (Sentence Pairs)
en-as (9 MB) 0.14M
en-bn (580 MB) 8.52M
en-gu (178 MB) 3.05M
en-hi (818 MB) 8.56M
en-kn (229 MB) 4.07M
en-ml (365 MB) 5.85M
en-mr (210 MB) 3.32M
en-or (65 MB) 1.00M
en-pa (175 MB) 2.42M
en-ta (350 MB) 5.16M
en-te (280 MB) 4.82M

Indic-Indic

The entire Indic-Indic data can be downloaded from here

Language-wise splits for the Indic-Indic data can be downloaded from the table below. Each link is labelled with the number of sentence pairs, in millions.

     as     bn     gu     hi     kn     ml     mr     or     pa     ta     te
as   -      0.36M  0.14M  0.16M  0.19M  0.23M  0.16M  0.07M  0.11M  0.22M  0.21M
bn   0.36M  -      1.58M  2.53M  2.14M  2.88M  1.83M  0.59M  1.11M  2.44M  2.35M
gu   0.14M  1.58M  -      1.86M  2.07M  2.36M  1.76M  0.53M  1.13M  2.07M  2.31M
hi   0.16M  2.53M  1.86M  -      2.14M  2.72M  1.99M  0.66M  1.44M  2.48M  2.42M
kn   0.19M  2.14M  2.07M  2.14M  -      2.89M  1.82M  0.54M  1.12M  2.52M  2.81M
ml   0.23M  2.88M  2.36M  2.72M  2.89M  -      1.82M  0.56M  1.11M  2.60M  2.68M
mr   0.16M  1.83M  1.76M  1.99M  1.82M  1.82M  -      0.58M  1.06M  2.12M  2.23M
or   0.07M  0.59M  0.53M  0.66M  0.54M  0.56M  0.58M  -      0.50M  1.09M  1.12M
pa   0.11M  1.11M  1.13M  1.44M  1.12M  1.11M  1.06M  0.50M  -      1.75M  1.76M
ta   0.22M  2.44M  2.07M  2.48M  2.52M  2.60M  2.12M  1.09M  1.75M  -      2.61M
te   0.21M  2.35M  2.31M  2.42M  2.81M  2.68M  2.23M  1.12M  1.76M  2.61M  -

Change Log

  • 15 May 2021: The language-wise splits are now available for download
  • 02 May 2021: Indic-Indic v0.2 data has been updated with super strict overlap removal
  • 30 April 2021: v0.2 uses super strict removal of any overlap between the train data and the validation and test data
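This README does not define "super strict" overlap removal. One plausible reading, sketched below purely as an assumption, is that sentences are normalized (lowercased, with whitespace and punctuation removed) before comparison, so that a train pair is dropped if either side matches a validation or test sentence after normalization. The function names `normalize` and `remove_overlap` are hypothetical and do not come from the project's code.

```python
import unicodedata
from typing import Iterable, List, Tuple


def normalize(sentence: str) -> str:
    """Lowercase the sentence and drop whitespace and Unicode punctuation."""
    kept = []
    for ch in sentence.lower():
        if ch.isspace():
            continue
        # Unicode categories starting with "P" cover punctuation,
        # including Indic marks such as the danda.
        if unicodedata.category(ch).startswith("P"):
            continue
        kept.append(ch)
    return "".join(kept)


def remove_overlap(
    train_pairs: Iterable[Tuple[str, str]],
    heldout_sentences: Iterable[str],
) -> List[Tuple[str, str]]:
    """Keep only train pairs whose sides both differ from every held-out sentence.

    Assumption: matching is exact after normalization; the released data
    may have been filtered with a different rule.
    """
    blocked = {normalize(s) for s in heldout_sentences}
    return [
        (src, tgt)
        for src, tgt in train_pairs
        if normalize(src) not in blocked and normalize(tgt) not in blocked
    ]
```

Under this reading, a train sentence such as "Hello, world!" would be removed when the test set contains "hello world", since both normalize to the same string.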

Contributors

Citing

If you are using any of the resources, please cite the following article:

@misc{ramesh2021samanantar,
      title={Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages}, 
      author={Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
      year={2021},
      eprint={2104.05596},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

The BibTeX entries for the existing data sources are available here

License

Samanantar is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. This license applies to datasets created as part of the project. For external datasets in the IndicGLUE benchmark, please refer to the respective license terms.