IndicTrans


IndicTrans is a Transformer-4X model trained on the Samanantar dataset. Two models are available: one translates from Indic languages to English, and the other from English to Indic languages. The models cover 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

Update 05-06-2021

The Indic-Indic model is now available for download

Update 30-04-2021

The models are now available for download

Download Model

  • Indic-English model can be downloaded from here
  • English-Indic model can be downloaded from here
  • Indic-Indic can be downloaded from here
  • Please use this Google Drive mirror link to download the models

Usage

Instructions for running inference can be found in the IndicTrans GitHub repository.

Model Details

IndicTrans is trained on the Samanantar dataset, which covers 11 language pairs. The amount of training data for each language pair is listed below:

Language Pair # Sentence Pairs
en-as 0.14M
en-bn 8.6M
en-gu 3.06M
en-hi 10.13M
en-kn 4.09M
en-ml 5.92M
en-mr 3.63M
en-or 1.00M
en-pa 2.98M
en-ta 5.26M
en-te 4.95M

In total, the training data has 49.7M sentence pairs.
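As a quick sanity check, the per-pair counts can be summed directly. The short Python sketch below copies the figures from the table above (the variable names are illustrative only). The rows add up to roughly 49.76M, which agrees with the stated total up to rounding of the per-pair figures.

```python
# Sentence-pair counts in millions, copied from the table above.
pairs_millions = {
    "en-as": 0.14,
    "en-bn": 8.6,
    "en-gu": 3.06,
    "en-hi": 10.13,
    "en-kn": 4.09,
    "en-ml": 5.92,
    "en-mr": 3.63,
    "en-or": 1.00,
    "en-pa": 2.98,
    "en-ta": 5.26,
    "en-te": 4.95,
}

total = sum(pairs_millions.values())
print(f"{len(pairs_millions)} language pairs, {total:.2f}M sentence pairs in total")

# Share of the training data per language pair, largest first.
for pair, count in sorted(pairs_millions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pair}: {count:5.2f}M ({100 * count / total:4.1f}%)")
```

The per-pair shares make the imbalance visible: en-hi and en-bn together account for well over a third of the data, while en-as contributes under half a percent.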

Benchmarking

We evaluate the IndicTrans models on the WAT2021, WAT2020, WMT, UFAL, PMI, and Flores benchmarks. The results are as follows:

Benchmark Language IN-EN EN-IN
WAT2021 bn 29.6 15.3
WAT2021 gu 40.3 25.6
WAT2021 hi 43.9 38.6
WAT2021 kn 36.4 19.1
WAT2021 ml 34.6 14.7
WAT2021 mr 33.5 20.1
WAT2021 or 34.4 18.9
WAT2021 pa 43.2 33.1
WAT2021 ta 33.2 13.5
WAT2021 te 36.2 14.1
WAT2020 bn 20.0 11.4
WAT2020 gu 24.1 15.3
WAT2020 hi 23.6 20.0
WAT2020 ml 20.4 7.2
WAT2020 mr 20.4 12.7
WAT2020 ta 18.3 6.2
WAT2020 te 18.5 7.6
WMT hi 29.7 25.5
WMT gu 25.1 17.2
WMT ta 24.1 9.9
UFAL ta 30.2 10.9
PMI as 29.9 11.6
Flores as 23.3 6.9
Flores bn 32.2 20.3
Flores gu 34.3 22.6
Flores hi 37.9 34.5
Flores kn 28.8 18.9
Flores ml 31.7 16.3
Flores mr 30.8 16.1
Flores or 30.1 13.9
Flores pa 35.8 26.9
Flores ta 28.6 16.3
Flores te 33.5 22.0
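
For readers who want to work with these numbers programmatically, the following Python sketch loads the scores above into a lookup keyed by direction, benchmark, and language. The helper name `build_scores` and the data layout are illustrative assumptions, not part of the IndicTrans code. The per-benchmark means it prints are simple unweighted averages computed here for convenience, not figures reported by the authors.

```python
from statistics import mean

# Benchmarks and the language codes evaluated on each, grouped by benchmark.
COLUMNS = [
    ("WAT2021", "bn gu hi kn ml mr or pa ta te"),
    ("WAT2020", "bn gu hi ml mr ta te"),
    ("WMT", "hi gu ta"),
    ("UFAL", "ta"),
    ("PMI", "as"),
    ("Flores", "as bn gu hi kn ml mr or pa ta te"),
]

# Scores per translation direction, in the same order as COLUMNS.
ROWS = {
    "IN-EN": "29.6 40.3 43.9 36.4 34.6 33.5 34.4 43.2 33.2 36.2 "
             "20.0 24.1 23.6 20.4 20.4 18.3 18.5 29.7 25.1 24.1 30.2 29.9 "
             "23.3 32.2 34.3 37.9 28.8 31.7 30.8 30.1 35.8 28.6 33.5",
    "EN-IN": "15.3 25.6 38.6 19.1 14.7 20.1 18.9 33.1 13.5 14.1 "
             "11.4 15.3 20.0 7.2 12.7 6.2 7.6 25.5 17.2 9.9 10.9 11.6 "
             "6.9 20.3 22.6 34.5 18.9 16.3 16.1 13.9 26.9 16.3 22.0",
}


def build_scores(columns, rows):
    """Return {direction: {(benchmark, language): score}} (hypothetical helper)."""
    keys = [(bench, lang) for bench, langs in columns for lang in langs.split()]
    scores = {}
    for direction, line in rows.items():
        values = [float(v) for v in line.split()]
        if len(values) != len(keys):
            raise ValueError(f"{direction}: expected {len(keys)} scores, got {len(values)}")
        scores[direction] = dict(zip(keys, values))
    return scores


scores = build_scores(COLUMNS, ROWS)

# Look up a single score, e.g. Hindi-to-English on WAT2021.
print(scores["IN-EN"][("WAT2021", "hi")])

# Unweighted mean per benchmark and direction (illustrative only).
for bench, _langs in COLUMNS:
    for direction in ROWS:
        vals = [s for (b, _lang), s in scores[direction].items() if b == bench]
        print(f"{bench:8s} {direction}: {mean(vals):.2f}")
```

Keying by `(benchmark, language)` keeps languages that appear in several benchmarks, such as `ta`, from overwriting one another.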

License

The IndicTrans code (and models) are released under the MIT License.