IndicFT
fastText is a subword-aware word embedding model. Because it builds word vectors from character n-grams, it is particularly well suited to Indian languages, many of which have highly agglutinative morphology. We train fastText models on our IndicNLP Corpora and evaluate them on a set of tasks to measure their performance.
Our fastText models are available for 11 Indian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu) and for English.
To use our fastText models, first download them. Next, install the fastText library:
pip3 install fasttext
and then load the models like this:
import fasttext
model = fasttext.load_model(path_to_binary_file)
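Once a model is loaded, the standard fastText Python calls (`get_word_vector`, `get_dimension`, `get_nearest_neighbors`) can be used on it. The following is a minimal sketch rather than an official recipe: the helper names `cosine`, `similarity` and `demo` are illustrative, and the model path passed to `demo` is whatever file you downloaded.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity with a zero vector is undefined; report 0
    return dot / (norm_u * norm_v)


def similarity(model, word_a, word_b):
    """Compare two words through the model's get_word_vector() method."""
    return cosine(model.get_word_vector(word_a), model.get_word_vector(word_b))


def demo(model_path):
    """Load a downloaded binary model and print a few basic queries."""
    import fasttext  # imported here so the helpers above work without it

    model = fasttext.load_model(model_path)
    print(model.get_dimension())                      # embedding size
    print(similarity(model, "भारत", "दिल्ली"))          # cosine of two word vectors
    print(model.get_nearest_neighbors("भारत", k=5))   # list of (score, word) pairs
```

Because fastText composes vectors from character n-grams, `get_word_vector` also returns a vector for words that never appeared in the training corpus.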
For tutorials on using these models, please refer to the official fastText documentation.
Language | Vectors | Models |
---|---|---|
pa | link | link |
hi | link | link |
bn | link | link |
or | link | link |
gu | link | link |
mr | link | link |
kn | link | link |
te | link | link |
ml | link | link |
ta | link | link |
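The "Vectors" downloads can be read without the fastText library if they follow fastText's usual text format: a header line with the vocabulary size and dimension, then one word per line followed by its values. That format is an assumption here, since this page does not state it, and `load_vec` is an illustrative helper name. A minimal sketch:

```python
def load_vec(path, limit=None):
    """Read a text-format vector file into ({word: [floats]}, dimension)."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        # Header line: "<number of words> <dimension>"
        _count, dim = map(int, handle.readline().split())
        for line in handle:
            if limit is not None and len(vectors) >= limit:
                break
            tokens = line.rstrip().split(" ")
            if len(tokens) < dim + 1:
                continue  # skip blank or malformed lines
            # The last `dim` tokens are the values; the rest is the word.
            word = " ".join(tokens[:-dim])
            vectors[word] = [float(x) for x in tokens[-dim:]]
    return vectors, dim
```

Passing `limit` keeps memory use down when only the first entries of a large file are needed. Unlike the binary models, a plain vector file has no subword information, so words missing from it have no vector.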
For the full evaluation results, see our paper. A selection of the results is shown below.
Language | fastText wiki | fastText wiki+CC | Indic fastText |
---|---|---|---|
pa | 0.467 | 0.384 | 0.445 |
hi | 0.575 | 0.551 | 0.598 |
gu | 0.507 | 0.521 | 0.600 |
mr | 0.497 | 0.544 | 0.509 |
te | 0.559 | 0.543 | 0.578 |
ta | 0.439 | 0.438 | 0.422 |
Average | 0.507 | 0.497 | 0.525 |
Language | fastText wiki | fastText wiki+CC | Indic fastText |
---|---|---|---|
pa | 97.12 | 95.53 | 96.47 |
bn | 96.57 | 97.57 | 97.71 |
or | 94.80 | 96.20 | 98.43 |
gu | 95.12 | 94.63 | 99.02 |
mr | 96.44 | 97.07 | 99.37 |
kn | 95.93 | 96.53 | 97.43 |
te | 98.67 | 98.08 | 99.17 |
ml | 89.02 | 89.18 | 92.83 |
ta | 95.99 | 95.90 | 97.26 |
Average | 95.52 | 95.63 | 97.52 |
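The "Average" rows in both tables are consistent with an unweighted mean over the listed languages. The short sketch below recomputes them for the Indic fastText column; the variable and function names are illustrative.

```python
# Indic fastText column of the first table
first_table = {"pa": 0.445, "hi": 0.598, "gu": 0.600,
               "mr": 0.509, "te": 0.578, "ta": 0.422}

# Indic fastText column of the second table
second_table = {"pa": 96.47, "bn": 97.71, "or": 98.43, "gu": 99.02,
                "mr": 99.37, "kn": 97.43, "te": 99.17, "ml": 92.83,
                "ta": 97.26}


def macro_average(scores):
    """Unweighted mean: every language counts equally."""
    return sum(scores.values()) / len(scores)


print(round(macro_average(first_table), 3))   # 0.525
print(round(macro_average(second_table), 2))  # 97.52
```

Note that the two tables cover different sets of languages, so their averages are not taken over the same rows.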