IndicNLP Corpora


The IndicNLP corpora were built by discovering and scraping thousands of web sources, primarily news, magazines, and books, over several months. They were used to train our released models.

Language  Sentences  Tokens    Types  Vocab  Corpus
pa        6.5M       179.4M    0.5M   link   link
hi        62.9M      1199.8M   5.3M   link   link
bn        7.2M       100.1M    1.5M   link   link
or        3.5M       51.5M     0.7M   link   link
gu        7.8M       129.7M    2.4M   link   link
mr        9.9M       142.4M    2.6M   link   link
kn        14.7M      174.9M    3.0M   link   link
te        15.1M      190.2M    4.1M   link   link
ml        11.6M      167.4M    8.8M   link   link
ta        20.9M      362.8M    9.4M   link   link
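
For a quick sanity check on the figures above, the following sketch derives average sentence length and combined corpus size from the table. It is illustrative only: the names `STATS`, `tokens_per_sentence`, and `total_tokens_millions` are hypothetical and not part of the release, and because the table values are rounded, the ratios are approximate.

```python
# Statistics copied from the table above, all in millions:
# language code -> (sentences, tokens, types).
# STATS and the helpers below are hypothetical names for this sketch.
STATS = {
    "pa": (6.5, 179.4, 0.5),
    "hi": (62.9, 1199.8, 5.3),
    "bn": (7.2, 100.1, 1.5),
    "or": (3.5, 51.5, 0.7),
    "gu": (7.8, 129.7, 2.4),
    "mr": (9.9, 142.4, 2.6),
    "kn": (14.7, 174.9, 3.0),
    "te": (15.1, 190.2, 4.1),
    "ml": (11.6, 167.4, 8.8),
    "ta": (20.9, 362.8, 9.4),
}


def tokens_per_sentence(lang: str) -> float:
    """Approximate average sentence length (in tokens) for one language."""
    sentences, tokens, _types = STATS[lang]
    return tokens / sentences


def total_tokens_millions() -> float:
    """Sum of the token counts across all languages, in millions."""
    return sum(tokens for _sentences, tokens, _types in STATS.values())


if __name__ == "__main__":
    # List languages from longest to shortest average sentence.
    for lang in sorted(STATS, key=tokens_per_sentence, reverse=True):
        print(f"{lang}: {tokens_per_sentence(lang):.1f} tokens/sentence")
    print(f"total: {total_tokens_millions():.1f}M tokens")
```

The statistics are kept as a plain dictionary so the same helpers can be reused if the table is updated.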