IndicTrans

IndicTrans is a Transformer-4X model trained on the Samanantar dataset. Two models are available: one translates from Indic languages to English, the other from English to Indic languages. The models support 11 languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

Update 05-06-2021

The Indic-Indic model is now available for download.

Update 30-04-2021

The models are now available for download.

Download Model

Indic-English model can be downloaded from here
English-Indic model can be downloaded from here
Indic-Indic model can be downloaded from here

Mirror Links

Please use this mirror Google Drive link to download the models.

Usage

Instructions for running inference can be found in the IndicTrans GitHub repository.

Model Details

IndicTrans is trained on the Samanantar dataset, which covers 11 language pairs. The amount of training data for each language pair is listed below:

Language Pair	# Sentence Pairs
en-as	0.14M
en-bn	8.6M
en-gu	3.06M
en-hi	10.13M
en-kn	4.09M
en-ml	5.92M
en-mr	3.63M
en-or	1.00M
en-pa	2.98M
en-ta	5.26M
en-te	4.95M

In total, the training data has 49.7M sentence pairs.
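As a quick sanity check on the table above, the per-pair counts can be summed directly. The snippet below is only a sketch: the dictionary name `pairs_millions` is illustrative, and the values are the rounded figures from the table, so the sum can differ slightly from a total computed on exact counts.

```python
# Sentence-pair counts in millions, copied from the table above.
# The variable name is illustrative, not part of any IndicTrans API.
pairs_millions = {
    "en-as": 0.14,
    "en-bn": 8.6,
    "en-gu": 3.06,
    "en-hi": 10.13,
    "en-kn": 4.09,
    "en-ml": 5.92,
    "en-mr": 3.63,
    "en-or": 1.00,
    "en-pa": 2.98,
    "en-ta": 5.26,
    "en-te": 4.95,
}

# Round to two decimals to avoid floating-point noise in the sum.
total = round(sum(pairs_millions.values()), 2)

# Identify the best- and least-resourced language pairs.
largest = max(pairs_millions, key=pairs_millions.get)
smallest = min(pairs_millions, key=pairs_millions.get)

print(f"Total: {total}M sentence pairs")
print(f"Largest pair: {largest} ({pairs_millions[largest]}M)")
print(f"Smallest pair: {smallest} ({pairs_millions[smallest]}M)")
```

The rounded figures sum to 49.76M, in line with the stated total. The data is heavily skewed: en-hi is the largest pair and en-as the smallest.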

Benchmarking

We evaluate the IndicTrans models on the WAT2021, WAT2020, WMT, UFAL, PMI, and Flores benchmarks. Here are the results that we obtain:

Benchmark	Language	IN-EN	EN-IN
WAT2021	bn	29.6	15.3
WAT2021	gu	40.3	25.6
WAT2021	hi	43.9	38.6
WAT2021	kn	36.4	19.1
WAT2021	ml	34.6	14.7
WAT2021	mr	33.5	20.1
WAT2021	or	34.4	18.9
WAT2021	pa	43.2	33.1
WAT2021	ta	33.2	13.5
WAT2021	te	36.2	14.1
WAT2020	bn	20.0	11.4
WAT2020	gu	24.1	15.3
WAT2020	hi	23.6	20.0
WAT2020	ml	20.4	7.2
WAT2020	mr	20.4	12.7
WAT2020	ta	18.3	6.2
WAT2020	te	18.5	7.6
WMT	hi	29.7	25.5
WMT	gu	25.1	17.2
WMT	ta	24.1	9.9
UFAL	ta	30.2	10.9
PMI	as	29.9	11.6
Flores	as	23.3	6.9
Flores	bn	32.2	20.3
Flores	gu	34.3	22.6
Flores	hi	37.9	34.5
Flores	kn	28.8	18.9
Flores	ml	31.7	16.3
Flores	mr	30.8	16.1
Flores	or	30.1	13.9
Flores	pa	35.8	26.9
Flores	ta	28.6	16.3
Flores	te	33.5	22.0
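
One way to summarize the results is to average the scores across languages for each translation direction. The sketch below does this for the WAT2021 scores reported above. The variable names are illustrative, and an unweighted mean is just one possible summary, not a figure reported by the authors.

```python
from statistics import mean

# WAT2021 scores copied from the results above, in the order
# bn, gu, hi, kn, ml, mr, or, pa, ta, te.
wat2021 = {
    "IN-EN": [29.6, 40.3, 43.9, 36.4, 34.6, 33.5, 34.4, 43.2, 33.2, 36.2],
    "EN-IN": [15.3, 25.6, 38.6, 19.1, 14.7, 20.1, 18.9, 33.1, 13.5, 14.1],
}

# Unweighted mean over the ten languages, per translation direction.
averages = {
    direction: round(mean(scores), 2)
    for direction, scores in wat2021.items()
}

for direction, avg in averages.items():
    print(f"{direction}: {avg}")
```

On these numbers the Indic-to-English direction averages 36.53 and the English-to-Indic direction 21.3.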

License

The IndicTrans code (and models) are released under the MIT License.