IndicTrans


IndicTrans is a Transformer-4X model trained on the Samanantar dataset. Two models are available: one translates from Indic languages to English, and the other from English to Indic languages. The models support 11 languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

Update 30-04-2021

The models are now available for download.

Download Model

  • Indic-English model can be downloaded from here
  • English-Indic model can be downloaded from here

Usage

Instructions for running inference can be found in the IndicTrans GitHub repository.

Model Details

IndicTrans is trained on the Samanantar dataset, which covers 11 language pairs. The amount of training data for each language pair is listed below:

Language Pair # Sentence Pairs
en-as 0.14M
en-bn 8.52M
en-gu 3.05M
en-hi 8.56M
en-kn 4.07M
en-ml 5.85M
en-mr 3.32M
en-or 1.00M
en-pa 2.42M
en-ta 5.16M
en-te 4.82M

In total, the training data has 46.9M sentence pairs.
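The total can be checked directly from the per-pair counts. The following is a minimal Python sketch, not part of the IndicTrans code base: the names `SENTENCE_PAIRS_M`, `LANGUAGE_NAMES`, `total_pairs_m`, and `share_percent` are hypothetical, and the counts are the rounded values from the table above, so the sum is only accurate to about two decimal places.

```python
# Sentence-pair counts in millions, copied from the table above.
# All names in this sketch are illustrative, not part of IndicTrans.
SENTENCE_PAIRS_M = {
    "en-as": 0.14,
    "en-bn": 8.52,
    "en-gu": 3.05,
    "en-hi": 8.56,
    "en-kn": 4.07,
    "en-ml": 5.85,
    "en-mr": 3.32,
    "en-or": 1.00,
    "en-pa": 2.42,
    "en-ta": 5.16,
    "en-te": 4.82,
}

# Two-letter language codes used in the table, mapped to the language names
# listed in the introduction.
LANGUAGE_NAMES = {
    "as": "Assamese", "bn": "Bengali", "gu": "Gujarati", "hi": "Hindi",
    "kn": "Kannada", "ml": "Malayalam", "mr": "Marathi", "or": "Oriya",
    "pa": "Punjabi", "ta": "Tamil", "te": "Telugu",
}


def total_pairs_m(counts):
    """Return the total number of sentence pairs, in millions."""
    return round(sum(counts.values()), 2)


def share_percent(counts, pair):
    """Return the share of one language pair as a percentage of the total."""
    return 100.0 * counts[pair] / sum(counts.values())


if __name__ == "__main__":
    print(f"Total: {total_pairs_m(SENTENCE_PAIRS_M):.2f}M sentence pairs")
    # List pairs from largest to smallest to show how uneven the data is.
    for pair, count in sorted(SENTENCE_PAIRS_M.items(), key=lambda kv: kv[1], reverse=True):
        name = LANGUAGE_NAMES[pair.split("-")[1]]
        print(f"{pair} ({name}): {count:.2f}M, {share_percent(SENTENCE_PAIRS_M, pair):.1f}%")
```

The breakdown makes the imbalance visible: en-hi and en-bn together account for more than a third of the data, while en-as contributes well under 1%.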

Benchmarking

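We evaluate the IndicTrans models on the WAT2021, WAT2020, WMT, UFAL, and PMI test sets. IN-EN denotes Indic-to-English translation and EN-IN denotes English-to-Indic translation. Here are the results that we obtain: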

Benchmark Language IN-EN EN-IN
WAT2021 bn 28.4 14.7
WAT2021 gu 39.5 24.8
WAT2021 hi 43.2 37.9
WAT2021 kn 34.9 18.2
WAT2021 ml 33.4 14.4
WAT2021 mr 32.4 19.2
WAT2021 or 33.4 18.5
WAT2021 pa 42.0 31.4
WAT2021 ta 32.0 13.3
WAT2021 te 35.1 13.2
WAT2020 bn 19.2 10.2
WAT2020 gu 23.0 14.6
WAT2020 hi 23.5 19.4
WAT2020 ml 19.6 6.9
WAT2020 mr 19.6 12.5
WAT2020 ta 17.9 5.8
WAT2020 te 17.8 7.2
WMT hi 29.4 25.0
WMT gu 23.4 16.2
WMT ta 24.3 8.8
UFAL ta 30.1 11.6
PMI as 28.7 12.0
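For a rough per-benchmark summary, the scores above can be averaged across languages. The following Python sketch is illustrative only: `SCORES`, `DIRECTIONS`, and `mean_score` are hypothetical names, any value printed without a fractional digit is read as a whole number, and an unweighted mean across languages is a convenience for comparison, not a figure reported by the authors.

```python
from statistics import mean

# (IN-EN, EN-IN) scores per benchmark and language, copied from the table above.
# All names in this sketch are illustrative, not part of IndicTrans.
SCORES = {
    "WAT2021": {
        "bn": (28.4, 14.7), "gu": (39.5, 24.8), "hi": (43.2, 37.9),
        "kn": (34.9, 18.2), "ml": (33.4, 14.4), "mr": (32.4, 19.2),
        "or": (33.4, 18.5), "pa": (42.0, 31.4), "ta": (32.0, 13.3),
        "te": (35.1, 13.2),
    },
    "WAT2020": {
        "bn": (19.2, 10.2), "gu": (23.0, 14.6), "hi": (23.5, 19.4),
        "ml": (19.6, 6.9), "mr": (19.6, 12.5), "ta": (17.9, 5.8),
        "te": (17.8, 7.2),
    },
    "WMT": {"hi": (29.4, 25.0), "gu": (23.4, 16.2), "ta": (24.3, 8.8)},
    "UFAL": {"ta": (30.1, 11.6)},
    "PMI": {"as": (28.7, 12.0)},
}

# Position of each translation direction inside the score tuples.
DIRECTIONS = {"IN-EN": 0, "EN-IN": 1}


def mean_score(benchmark, direction):
    """Unweighted mean score over all languages of one benchmark."""
    idx = DIRECTIONS[direction]
    return mean(pair[idx] for pair in SCORES[benchmark].values())


if __name__ == "__main__":
    for benchmark in SCORES:
        in_en = mean_score(benchmark, "IN-EN")
        en_in = mean_score(benchmark, "EN-IN")
        print(f"{benchmark}: IN-EN {in_en:.2f}, EN-IN {en_in:.2f}")
```

On every benchmark in the table, the IN-EN scores are higher than the EN-IN scores for the same language.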