IndicFT


fastText is a subword-aware word embedding model. It is particularly well suited to Indian languages because of their highly agglutinative morphology. We train fastText models on our IndicNLP Corpora and evaluate them on a set of tasks to measure their performance.

Our fastText models are available for 11 Indian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

Usage

To use our fastText models, first download them. Next, install the fastText library:

pip3 install fasttext

and then load the models like this:

import fasttext
# path_to_binary_file: location of a downloaded binary model (see Downloads)
model = fasttext.load_model(path_to_binary_file)

For instructions on how to use these models, please refer to the official fastText documentation.
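
As a minimal sketch of what a loaded model can be used for, the snippet below compares two words by the cosine similarity of their vectors. It relies only on `get_word_vector`, which fastText models provide. The helper names `load_indicft`, `cosine`, and `word_similarity`, and the path in the usage comment, are illustrative and not part of IndicFT.

```python
import math


def load_indicft(path_to_binary_file):
    """Load a downloaded binary model (hypothetical helper name)."""
    import fasttext  # imported here so the helpers below work without it

    return fasttext.load_model(path_to_binary_file)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def word_similarity(model, word_a, word_b):
    """Compare two words using any object that exposes get_word_vector()."""
    return cosine(model.get_word_vector(word_a), model.get_word_vector(word_b))


# Example usage (the path is a placeholder for a downloaded model):
# model = load_indicft("path/to/model.bin")
# print(word_similarity(model, "भारत", "देश"))
```

Because fastText builds vectors from character n-grams, `get_word_vector` also returns a vector for words that were not seen during training.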

Downloads

Language  as    pa    hi    bn    or    gu    mr    kn    te    ml    ta
Vectors   link  link  link  link  link  link  link  link  link  link  link
Model     link  link  link  link  link  link  link  link  link  link  link
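
The files in the Vectors row can also be read without the fastText library. The sketch below assumes they use the standard fastText text format (a header line with the vocabulary size and dimension, then one word and its values per line). That is an assumption about the downloads, and `read_vectors` is a hypothetical helper name.

```python
import io


def read_vectors(lines, limit=None):
    """Parse fastText-style text vectors from an iterable of lines.

    Returns (dimension, {word: [float, ...]}). `limit` caps the number of
    words read, which helps with large files.
    """
    iterator = iter(lines)
    header = next(iterator).split()
    dimension = int(header[1])  # header is "<vocab_size> <dimension>"
    vectors = {}
    for line in iterator:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dimension + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dimension, vectors


# Tiny in-memory stand-in for a downloaded file.
sample = io.StringIO("2 3\nनमस्ते 0.1 0.2 0.3\nভাষা 0.4 0.5 0.6\n")
dim, vecs = read_vectors(sample)
print(dim, sorted(vecs))
```

For a real file, pass an open handle such as `open(path, encoding="utf-8")` in place of the in-memory sample.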

Evaluation

For the full evaluation results, see our paper. Here, we show a subset of them.

Word Similarity

Language  fastText wiki  fastText wiki+CC  Indic fastText
pa        0.467          0.384             0.445
hi        0.575          0.551             0.598
gu        0.507          0.521             0.600
mr        0.497          0.544             0.509
te        0.559          0.543             0.578
ta        0.439          0.438             0.422
Average   0.507          0.497             0.525
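
Word similarity scores of this kind are commonly obtained by correlating the cosine similarity of each word pair with human ratings. The sketch below illustrates that general recipe using Spearman correlation. The exact metric, datasets, and preprocessing behind the table above are described in the paper and may differ. `evaluate_word_similarity` and its inputs are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (raises) if either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def evaluate_word_similarity(get_vector, rated_pairs):
    """rated_pairs: iterable of (word_a, word_b, human_score)."""
    predicted, gold = [], []
    for word_a, word_b, score in rated_pairs:
        predicted.append(cosine(get_vector(word_a), get_vector(word_b)))
        gold.append(score)
    # Spearman correlation is Pearson correlation computed on ranks.
    return pearson(ranks(predicted), ranks(gold))
```

With a loaded fastText model, `get_vector` can simply be `model.get_word_vector`.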

News Genre Classification

Language  fastText wiki  fastText wiki+CC  Indic fastText
pa        97.12          95.53             96.47
bn        96.57          97.57             97.71
or        94.80          96.20             98.43
gu        95.12          94.63             99.02
mr        96.44          97.07             99.37
kn        95.93          96.53             97.43
te        98.67          98.08             99.17
ml        89.02          89.18             92.83
ta        95.99          95.90             97.26
Average   95.52          95.63             97.52

Citing

If you are using IndicFT, please cite the following paper:

@inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    booktitle={Findings of EMNLP},
}

License

The IndicFT embeddings are released under the MIT License.