IndicNLP Corpora
The IndicNLP corpora were built by discovering and scraping thousands of web sources, primarily news sites, magazines, and books, over several months. They were used to train our released models.
Language | Sentences | Tokens | Types | Vocab | Corpus |
---|---|---|---|---|---|
pa | 6.5M | 179.4M | 0.5M | link | link |
hi | 62.9M | 1199.8M | 5.3M | link | link |
bn | 7.2M | 100.1M | 1.5M | link | link |
or | 3.5M | 51.5M | 0.7M | link | link |
gu | 7.8M | 129.7M | 2.4M | link | link |
mr | 9.9M | 142.4M | 2.6M | link | link |
kn | 14.7M | 174.9M | 3.0M | link | link |
te | 15.1M | 190.2M | 4.1M | link | link |
ml | 11.6M | 167.4M | 8.8M | link | link |
ta | 20.9M | 362.8M | 9.4M | link | link |
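
To make the figures above easier to compare across languages, here is a minimal sketch that derives average tokens per sentence and overall totals. The numbers are copied from the table (in millions, already rounded there, so derived values are approximate). The names `STATS`, `tokens_per_sentence`, and `total` are illustrative and not part of any released tooling.

```python
# Corpus statistics copied from the table above, in millions:
# language code -> (sentences, tokens, types)
STATS = {
    "pa": (6.5, 179.4, 0.5),
    "hi": (62.9, 1199.8, 5.3),
    "bn": (7.2, 100.1, 1.5),
    "or": (3.5, 51.5, 0.7),
    "gu": (7.8, 129.7, 2.4),
    "mr": (9.9, 142.4, 2.6),
    "kn": (14.7, 174.9, 3.0),
    "te": (15.1, 190.2, 4.1),
    "ml": (11.6, 167.4, 8.8),
    "ta": (20.9, 362.8, 9.4),
}


def tokens_per_sentence(lang):
    """Approximate average sentence length (in tokens) for one language."""
    sentences, tokens, _types = STATS[lang]
    return tokens / sentences


def total(column):
    """Sum one column across all languages: 0 = sentences, 1 = tokens."""
    return sum(row[column] for row in STATS.values())


if __name__ == "__main__":
    for lang in STATS:
        print(f"{lang}: {tokens_per_sentence(lang):.1f} tokens/sentence")
    print(f"total sentences: {total(0):.1f}M")
    print(f"total tokens: {total(1):.1f}M")
```

The types column is not summed here: types are distinct word forms per language, and the table does not say how they were counted, so only per-language comparisons are safe.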