IndicNLP Corpora
The IndicNLP corpora were built by discovering and scraping thousands of web sources, primarily news, magazines, and books, over several months. They were used to train our released models.
| Language | Sentences | Tokens | Types | Vocab | Corpus |
|---|---|---|---|---|---|
| pa | 6.5M | 179.4M | 0.5M | link | link |
| hi | 62.9M | 1199.8M | 5.3M | link | link |
| bn | 7.2M | 100.1M | 1.5M | link | link |
| or | 3.5M | 51.5M | 0.7M | link | link |
| gu | 7.8M | 129.7M | 2.4M | link | link |
| mr | 9.9M | 142.4M | 2.6M | link | link |
| kn | 14.7M | 174.9M | 3.0M | link | link |
| te | 15.1M | 190.2M | 4.1M | link | link |
| ml | 11.6M | 167.4M | 8.8M | link | link |
| ta | 20.9M | 362.8M | 9.4M | link | link |
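
As a rough illustration of what the Sentences, Tokens, and Types columns measure, the sketch below counts them for a small in-memory sample. It assumes one sentence per line and whitespace-separated tokens. Neither the file layout nor the tokenizer behind the table is stated here, so treat both as assumptions; `corpus_stats` is a hypothetical helper, not part of any released tooling.

```python
from collections import Counter
from typing import Dict, Iterable


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count sentences, tokens, and types (distinct tokens).

    Assumption: one sentence per line, tokens separated by whitespace.
    """
    sentences = 0
    counts: Counter = Counter()
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        sentences += 1
        counts.update(tokens)
    return {
        "sentences": sentences,
        "tokens": sum(counts.values()),  # total token occurrences
        "types": len(counts),            # distinct tokens
    }


# Tiny Hindi sample; the trailing empty string stands in for a blank line.
sample = ["यह एक वाक्य है", "यह दूसरा वाक्य है", ""]
stats = corpus_stats(sample)
print(stats)
```

Because the function accepts any iterable of strings, an open text file (opened with `encoding="utf-8"`) can be passed in directly, so a large corpus is streamed line by line rather than loaded into memory.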