IndicFT
fastText is a subword-aware word embedding model. It is particularly well-suited to Indian languages because of their highly agglutinative morphology. We train fastText models on our IndicNLP Corpora and evaluate them on a set of tasks to measure their performance.
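To illustrate what "subword-aware" means, here is a minimal sketch of the character n-gram decomposition that fastText builds on. It is a simplified illustration, not the library's implementation (which also hashes n-grams into buckets). The helper name `char_ngrams` is hypothetical, and the 3 to 6 range mirrors the library's defaults for unsupervised training, which is not necessarily what was used for IndicFT.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in boundary markers."""
    # fastText marks word boundaries with '<' and '>' before extracting n-grams.
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

A word's vector is composed from the vectors of such n-grams, so inflected or unseen forms that share n-grams with known words still receive meaningful embeddings.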
Our fastText models are available for 11 Indian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
To use our fastText models, first download them. Next, install the fastText library:
pip3 install fasttext
and then load the models like this:
import fasttext
model = fasttext.load_model(path_to_binary_file)
For further instructions on using these models, please refer to the official fastText documentation.
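As a rough sketch of typical usage once a model is loaded: `get_word_vector`, `get_nearest_neighbors`, and `get_dimension` are methods of the official fastText Python bindings, while the helper names `cosine_similarity`, `word_similarity`, and `demo` are hypothetical and only serve this illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity is undefined for a zero vector; return 0 by convention
    return dot / (norm_u * norm_v)


def word_similarity(model, word_a, word_b):
    """Compare two words using any object that exposes get_word_vector."""
    return cosine_similarity(model.get_word_vector(word_a),
                             model.get_word_vector(word_b))


def demo(path_to_binary_file, word_a, word_b):
    """Load a downloaded binary model and print a few basic queries."""
    import fasttext  # imported here so the helpers above work without the library

    model = fasttext.load_model(path_to_binary_file)
    print("dimension:", model.get_dimension())
    print("similarity:", word_similarity(model, word_a, word_b))
    # get_nearest_neighbors returns (score, word) pairs.
    for score, neighbour in model.get_nearest_neighbors(word_a, k=5):
        print(neighbour, score)
```

Call `demo` with the path to one of the downloaded binary models and two words in that model's language.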
| Language | as | pa | hi | bn | or | gu | mr | kn | te | ml | ta |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Vectors | link | link | link | link | link | link | link | link | link | link | link |
| Model | link | link | link | link | link | link | link | link | link | link | link |
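The "Model" files are the binaries expected by `fasttext.load_model`. If the "Vectors" files follow the standard fastText text format (an assumption here: a header line with the word count and dimension, then one word and its values per line), they can be read without the fastText library. The sketch below uses a hypothetical helper, `load_vec_file`, to show one way to do that.

```python
def load_vec_file(path, limit=None):
    """Read a fastText-style text vector file into a {word: [floats]} dict.

    Assumes the first line is "<word_count> <dimension>" and every later
    line is a word followed by <dimension> space-separated values.
    """
    vectors = {}
    with open(path, encoding="utf-8", errors="replace") as handle:
        header = handle.readline().split()
        dim = int(header[1])
        for line in handle:
            if limit is not None and len(vectors) >= limit:
                break
            # Split from the right so the last `dim` fields are the values.
            parts = line.rstrip().rsplit(" ", dim)
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

Text vectors only cover words seen during training. The binary model is needed to compose vectors for out-of-vocabulary words from their subwords.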
For the full evaluation results, see our paper. Here, we show a subset of the evaluations.
| Language | fastText wiki | fastText wiki+CC | Indic fastText |
|---|---|---|---|
| pa | 0.467 | 0.384 | 0.445 |
| hi | 0.575 | 0.551 | 0.598 |
| gu | 0.507 | 0.521 | 0.600 |
| mr | 0.497 | 0.544 | 0.509 |
| te | 0.559 | 0.543 | 0.578 |
| ta | 0.439 | 0.438 | 0.422 |
| Average | 0.507 | 0.497 | 0.525 |
| Language | fastText wiki | fastText wiki+CC | Indic fastText |
|---|---|---|---|
| pa | 97.12 | 95.53 | 96.47 |
| bn | 96.57 | 97.57 | 97.71 |
| or | 94.80 | 96.20 | 98.43 |
| gu | 95.12 | 94.63 | 99.02 |
| mr | 96.44 | 97.07 | 99.37 |
| kn | 95.93 | 96.53 | 97.43 |
| te | 98.67 | 98.08 | 99.17 |
| ml | 89.02 | 89.18 | 92.83 |
| ta | 95.99 | 95.90 | 97.26 |
| Average | 95.52 | 95.63 | 97.52 |
If you are using IndicFT, please cite the following paper:
@inproceedings{kakwani2020indicnlpsuite,
title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
year={2020},
booktitle={Findings of EMNLP},
}
The IndicFT embeddings are released under the MIT License.