IndicBERT
IndicBERT is a multilingual ALBERT model trained on large-scale corpora, covering 12 major Indian languages.
The easiest way to use IndicBERT is through the Hugging Face transformers library. It can be loaded as follows:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
model = AutoModel.from_pretrained('ai4bharat/indic-bert')
```
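As a quick sanity check, here is a minimal sketch that encodes a sentence and inspects the contextual embeddings; the example sentence is arbitrary and only illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
model = AutoModel.from_pretrained('ai4bharat/indic-bert')

# Encode a sample Hindi sentence and run a forward pass without gradients
inputs = tokenizer("यह एक उदाहरण वाक्य है", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```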
If you want to quickly experiment with IndicBERT, we suggest checking out our tutorials and fine-tuning notebooks that run on Google Colab:
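For orientation, the sketch below shows what fine-tuning could look like with the generic transformers Trainer API. This is not one of the official notebooks; the label count, dataset, and hyperparameters are placeholders you would replace with your own task setup.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
# num_labels=2 is a placeholder, e.g. for binary sentiment classification
model = AutoModelForSequenceClassification.from_pretrained(
    'ai4bharat/indic-bert', num_labels=2)

def tokenize(batch):
    # Tokenize a batch of raw texts; max_length is an illustrative choice
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

args = TrainingArguments(output_dir="indic-bert-finetuned",
                         per_device_train_batch_size=32,
                         num_train_epochs=3)

# train_dataset / eval_dataset would come from your own labelled data,
# tokenized with the function above:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```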
IndicBERT is pre-trained on the IndicNLP corpus, which covers 12 languages (11 Indian languages plus English): Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu. The amount of pretraining data for each language is listed below:
| Language | as | bn | en | gu | hi | kn | ml | mr | or | pa | ta | te | all |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No. of Tokens | 36.9M | 815M | 1.34B | 724M | 1.84B | 712M | 767M | 560M | 104M | 814M | 549M | 671M | 8.9B |
In total, the pretraining corpus has a size of 120GB and contains 8.9B tokens.
- Kakwani, D., Kunchukuttan, A., Golla, S., N.C., G., Bhattacharyya, A., Khapra, M.M. and Kumar, P., 2020. IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages. In Findings of EMNLP 2020. pdf
- IndicBERT zip