IndicNLP Corpora

The IndicNLP corpora were built by discovering and scraping thousands of web sources (primarily news, magazines, and books) over several months. They have been used to train our released models.

Download Links

Language	Sentences	Tokens	Types	Vocab	Corpus
pa	6.5M	179.4M	0.5M	link	link
hi	62.9M	1199.8M	5.3M	link	link
bn	7.2M	100.1M	1.5M	link	link
or	3.5M	51.5M	0.7M	link	link
gu	7.8M	129.7M	2.4M	link	link
mr	9.9M	142.4M	2.6M	link	link
kn	14.7M	174.9M	3.0M	link	link
te	15.1M	190.2M	4.1M	link	link
ml	11.6M	167.4M	8.8M	link	link
ta	20.9M	362.8M	9.4M	link	link
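
As a rough illustration of what the Sentences, Tokens, and Types columns measure, here is a minimal sketch that computes the same three counts for a text corpus. It assumes one sentence per line and whitespace-separated tokens. Both are assumptions, since this page does not say how the published counts were produced, and the `corpus_stats` function and `CorpusStats` type are hypothetical names used only for this example.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class CorpusStats(NamedTuple):
    sentences: int  # number of non-empty lines
    tokens: int     # total whitespace-separated words
    types: int      # number of distinct words


def corpus_stats(lines: Iterable[str]) -> CorpusStats:
    """Count sentences, tokens, and types in an iterable of lines.

    Assumption: one sentence per line, tokens split on whitespace.
    """
    vocab = Counter()
    sentences = 0
    for line in lines:
        words = line.split()
        if not words:  # skip blank lines
            continue
        sentences += 1
        vocab.update(words)
    return CorpusStats(sentences, sum(vocab.values()), len(vocab))


if __name__ == "__main__":
    sample = ["यह एक वाक्य है", "यह दूसरा वाक्य है"]
    print(corpus_stats(sample))
    # CorpusStats(sentences=2, tokens=8, types=5)
```

Because the function accepts any iterable of strings, an open file handle for a downloaded corpus can be passed directly, so the file is streamed line by line rather than loaded into memory.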