IndicTrans
IndicTrans is a Transformer-4X model trained on the Samanantar dataset. Two models are available: one translates from Indic languages to English, the other from English to Indic languages. The models support 11 languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
The models, including the new Indic-Indic model, are now available for download:
- Indic-English model can be downloaded from here
- English-Indic model can be downloaded from here
- Indic-Indic model can be downloaded from here
- Please use this mirror gdrive link to download the models
Instructions for running inference can be found in the IndicTrans GitHub repository.
IndicTrans is trained on the Samanantar dataset, which covers 11 language pairs. The amount of training data for each language pair is listed below:
Language Pair | # Sentence Pairs |
---|---|
en-as | 0.14M |
en-bn | 8.6M |
en-gu | 3.06M |
en-hi | 10.13M |
en-kn | 4.09M |
en-ml | 5.92M |
en-mr | 3.63M |
en-or | 1.00M |
en-pa | 2.98M |
en-ta | 5.26M |
en-te | 4.95M |
In total, the training data has 49.7M sentence pairs.
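As a quick sanity check on the table above, the per-pair counts can be summed in a few lines of Python. This is only an illustrative sketch: the variable names are made up for the example, and the values are the rounded figures from the table, so the sum (about 49.76M) matches the stated total only up to rounding.

```python
# Sentence-pair counts in millions, copied from the table above (rounded values).
pairs_millions = {
    "en-as": 0.14,
    "en-bn": 8.6,
    "en-gu": 3.06,
    "en-hi": 10.13,
    "en-kn": 4.09,
    "en-ml": 5.92,
    "en-mr": 3.63,
    "en-or": 1.00,
    "en-pa": 2.98,
    "en-ta": 5.26,
    "en-te": 4.95,
}

# Sum of the rounded per-pair figures.
total_millions = sum(pairs_millions.values())

# Language pairs with the most and the least training data.
largest = max(pairs_millions, key=pairs_millions.get)
smallest = min(pairs_millions, key=pairs_millions.get)

print(f"Total: {total_millions:.2f}M sentence pairs")
print(f"Largest pair: {largest}, smallest pair: {smallest}")
```

The spread is wide: en-hi has roughly 70 times as many sentence pairs as en-as, which is worth keeping in mind when comparing per-language results.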
We evaluate the IndicTrans model on the WAT2021, WAT2020, WMT, UFAL, PMI, and Flores benchmarks. Here are the results that we obtain:
Benchmark | Language | IN-EN | EN-IN |
---|---|---|---|
WAT2021 | bn | 29.6 | 15.3 |
WAT2021 | gu | 40.3 | 25.6 |
WAT2021 | hi | 43.9 | 38.6 |
WAT2021 | kn | 36.4 | 19.1 |
WAT2021 | ml | 34.6 | 14.7 |
WAT2021 | mr | 33.5 | 20.1 |
WAT2021 | or | 34.4 | 18.9 |
WAT2021 | pa | 43.2 | 33.1 |
WAT2021 | ta | 33.2 | 13.5 |
WAT2021 | te | 36.2 | 14.1 |
WAT2020 | bn | 20.0 | 11.4 |
WAT2020 | gu | 24.1 | 15.3 |
WAT2020 | hi | 23.6 | 20.0 |
WAT2020 | ml | 20.4 | 7.2 |
WAT2020 | mr | 20.4 | 12.7 |
WAT2020 | ta | 18.3 | 6.2 |
WAT2020 | te | 18.5 | 7.6 |
WMT | hi | 29.7 | 25.5 |
WMT | gu | 25.1 | 17.2 |
WMT | ta | 24.1 | 9.9 |
UFAL | ta | 30.2 | 10.9 |
PMI | as | 29.9 | 11.6 |
Flores | as | 23.3 | 6.9 |
Flores | bn | 32.2 | 20.3 |
Flores | gu | 34.3 | 22.6 |
Flores | hi | 37.9 | 34.5 |
Flores | kn | 28.8 | 18.9 |
Flores | ml | 31.7 | 16.3 |
Flores | mr | 30.8 | 16.1 |
Flores | or | 30.1 | 13.9 |
Flores | pa | 35.8 | 26.9 |
Flores | ta | 28.6 | 16.3 |
Flores | te | 33.5 | 22.0 |
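
To compare the two translation directions, the WAT2021 scores from the results table can be averaged with a short Python sketch. The variable names are hypothetical and chosen for this example; the numbers are copied from the table, and the averages are a convenience summary rather than figures reported by the authors.

```python
from statistics import mean

# WAT2021 scores copied from the results table: language -> (IN-EN, EN-IN).
wat2021 = {
    "bn": (29.6, 15.3),
    "gu": (40.3, 25.6),
    "hi": (43.9, 38.6),
    "kn": (36.4, 19.1),
    "ml": (34.6, 14.7),
    "mr": (33.5, 20.1),
    "or": (34.4, 18.9),
    "pa": (43.2, 33.1),
    "ta": (33.2, 13.5),
    "te": (36.2, 14.1),
}

# Unweighted mean score per direction across the ten languages.
in_en_mean = mean(scores[0] for scores in wat2021.values())
en_in_mean = mean(scores[1] for scores in wat2021.values())

# Per-language gap between the two directions (IN-EN minus EN-IN).
gaps = {lang: round(in_en - en_in, 1) for lang, (in_en, en_in) in wat2021.items()}

print(f"Mean IN-EN: {in_en_mean:.2f}, mean EN-IN: {en_in_mean:.2f}")
print(f"Smallest gap: {min(gaps, key=gaps.get)}, largest gap: {max(gaps, key=gaps.get)}")
```

On WAT2021 the Indic-to-English score is higher than the English-to-Indic score for every language listed.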
The IndicTrans code (and models) are released under the MIT License.