IndicFT
fastText is a subword-aware word embedding model: it represents each word through its character n-grams. This makes it particularly well-suited to Indian languages, whose morphology is highly agglutinative. We train fastText models on our IndicNLP Corpora and evaluate them on a set of tasks to measure their performance.
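To illustrate what "subword-aware" means, the sketch below lists the character n-grams of a word after wrapping it in the `<` and `>` boundary markers that fastText uses. It is a simplified illustration, not the library's implementation: the helper name `char_ngrams` is hypothetical, and the defaults of 3 and 6 are assumed to mirror fastText's usual `minn`/`maxn` settings rather than the values used to train IndicFT.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, including boundary markers.

    Simplified sketch: n-grams are taken over Unicode code points, and the
    hashing step that fastText applies to each n-gram is left out.
    """
    marked = f"<{word}>"  # "<" and ">" mark the start and end of the word
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(marked) - n + 1):
            ngrams.append(marked[start:start + n])
    return ngrams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because inflected forms of a word share many of these n-grams, their vectors share parameters, and a vector can still be composed for a word that never appeared in the training corpus.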
Our fastText models are available for 11 Indian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
To use our fastText models, first download them. Next, install the fastText library:
pip3 install fasttext
and then load the models like this:
import fasttext
model = fasttext.load_model(path_to_binary_file)
For instructions on how to use these models, please refer to the official fastText documentation.
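As a starting point, the following sketch shows a few calls from the fastText Python module that are commonly used with a loaded model. The helper name `describe_word` is hypothetical; it only relies on `get_word_vector`, `get_nearest_neighbors`, and `get_word_id`, and it accepts any object that provides those methods.

```python
def describe_word(model, word, k=5):
    """Summarise what a loaded fastText model returns for `word`."""
    # get_word_vector also works for out-of-vocabulary words, because the
    # vector is composed from the word's character n-grams.
    vector = model.get_word_vector(word)
    # get_nearest_neighbors returns a list of (similarity, word) pairs.
    neighbours = model.get_nearest_neighbors(word, k=k)
    return {
        "dimension": len(vector),
        # get_word_id returns -1 when the word is not in the vocabulary.
        "in_vocabulary": model.get_word_id(word) != -1,
        "neighbours": [neighbour for _, neighbour in neighbours],
    }
```

With `model` loaded as shown earlier, `describe_word(model, "भारत")` would report the embedding dimension, whether the word is in the model's vocabulary, and its five nearest neighbours.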
Language | as | pa | hi | bn | or | gu | mr | kn | te | ml | ta |
---|---|---|---|---|---|---|---|---|---|---|---|
Vectors | link | link | link | link | link | link | link | link | link | link | link |
Model | link | link | link | link | link | link | link | link | link | link | link |
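The vectors can also be read without the fastText library. The sketch below assumes the vector files use fastText's plain-text `.vec` layout (a header line with the vocabulary size and dimension, then one word and its values per line); this is an assumption about the download, not something stated here. The function names and the three-word sample are hypothetical and stand in for a real file.

```python
import io
import math


def read_vectors(lines, limit=None):
    """Parse word vectors from an iterable of lines in the text .vec layout."""
    lines = iter(lines)
    _count, dim = map(int, next(lines).split())  # header: "<vocab size> <dim>"
    vectors = {}
    for line in lines:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:  # skip malformed lines
            continue
        vectors[word] = [float(value) for value in values]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Tiny made-up sample; for a real file, pass open(path, encoding="utf-8").
sample = io.StringIO("3 2\nराजा 1.0 0.0\nरानी 0.8 0.6\nपानी 0.0 1.0\n")
vectors = read_vectors(sample)
print(cosine(vectors["राजा"], vectors["रानी"]))
```

Unlike the binary model, a table of vectors read this way has no subword information, so it cannot produce a vector for a word that is missing from the file.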
For the full evaluation results, see our paper. Here, we show a subset of them.
Language | fastText wiki | fastText wiki+CC | Indic fastText |
---|---|---|---|
pa | 0.467 | 0.384 | 0.445 |
hi | 0.575 | 0.551 | 0.598 |
gu | 0.507 | 0.521 | 0.600 |
mr | 0.497 | 0.544 | 0.509 |
te | 0.559 | 0.543 | 0.578 |
ta | 0.439 | 0.438 | 0.422 |
Average | 0.507 | 0.497 | 0.525 |
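The Average row of the table above is the unweighted mean of the six per-language scores in each column. A short sketch that recomputes it from the values in the table (the variable names are illustrative):

```python
# Per-language scores copied from the table above, in column order.
columns = ("fastText wiki", "fastText wiki+CC", "Indic fastText")
scores = {
    "pa": (0.467, 0.384, 0.445),
    "hi": (0.575, 0.551, 0.598),
    "gu": (0.507, 0.521, 0.600),
    "mr": (0.497, 0.544, 0.509),
    "te": (0.559, 0.543, 0.578),
    "ta": (0.439, 0.438, 0.422),
}

# Unweighted mean over languages for each column, rounded to three decimals.
averages = {
    name: round(sum(row[i] for row in scores.values()) / len(scores), 3)
    for i, name in enumerate(columns)
}
print(averages)
# {'fastText wiki': 0.507, 'fastText wiki+CC': 0.497, 'Indic fastText': 0.525}
```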
Language | fastText wiki | fastText wiki+CC | Indic fastText |
---|---|---|---|
pa | 97.12 | 95.53 | 96.47 |
bn | 96.57 | 97.57 | 97.71 |
or | 94.80 | 96.20 | 98.43 |
gu | 95.12 | 94.63 | 99.02 |
mr | 96.44 | 97.07 | 99.37 |
kn | 95.93 | 96.53 | 97.43 |
te | 98.67 | 98.08 | 99.17 |
ml | 89.02 | 89.18 | 92.83 |
ta | 95.99 | 95.90 | 97.26 |
Average | 95.52 | 95.63 | 97.52 |
If you are using IndicFT, please cite the following paper:
@inproceedings{kakwani2020indicnlpsuite,
title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
year={2020},
booktitle={Findings of EMNLP},
}
The IndicFT embeddings are released under the MIT License.