IndicNLP


We are working towards building a better ecosystem for Indian languages while also keeping up with the recent advancements in NLP. To this end, we are releasing IndicNLPSuite, which is a collection of various resources and models for Indian languages:

  • IndicCorp: Many NLP models require large amounts of training data, which most Indian languages lack. In this project, we build a large-scale Indic corpus by intensively crawling the web. The resulting corpus has a total of 8.9 billion tokens and covers 12 major Indian languages, making it the largest public corpus for most of these languages.
  • IndicFT: fastText is well suited to Indian languages because of their rich morphological structure. We pre-train and benchmark fastText embeddings on our corpora, producing embeddings that outperform the official fastText embeddings for Indian languages on a variety of tasks (a loading sketch follows this list).
  • IndicBERT: To improve the performance and coverage of Indian languages on a wide variety of tasks, we also develop and evaluate IndicBERT. IndicBERT is a multilingual ALBERT model (a lighter variant of BERT) pre-trained on 12 major Indian languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. It provides state-of-the-art performance on several of these tasks (a usage sketch follows this list).
  • IndicBART: IndicBART is a multilingual, sequence-to-sequence pre-trained model focusing on Indic languages and English. It currently supports 12 languages and is based on the mBART architecture (a loading sketch follows this list).
  • IndicGLUE: This is a benchmark containing various tasks to evaluate the natural language understanding capabilities of language models for Indian languages.
  • Samanantar: Samanantar is the largest publicly available parallel corpora collection for Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. The collection has 49.6M sentence pairs between English and the Indian languages.
  • IndicTrans: IndicTrans is a Transformer-XL model trained on the Samanantar dataset. Two models are available, one translating from Indic languages to English and the other from English to Indic languages. The models cover 11 languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.

You can read more about IndicNLPSuite in this paper. This is a pre-print version of our upcoming paper at Findings of EMNLP. The camera-ready copy will be available soon.
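The following is a minimal sketch of how the IndicFT embeddings can be loaded with the official fastText Python library. The file name used here is only a placeholder for whichever per-language .bin file you download, not the actual distributed file name.

import fasttext

# Load a downloaded IndicFT binary model (placeholder file name; Hindi assumed here).
model = fasttext.load_model("indicft-hi.bin")

# Look up the 300-dimensional vector for a word and its nearest neighbours.
vector = model.get_word_vector("भाषा")
print(vector.shape)
print(model.get_nearest_neighbors("भाषा", k=5))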
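IndicBERT can also be used through the Hugging Face transformers library. The sketch below assumes the hub identifier ai4bharat/indic-bert and a sentencepiece installation; verify the exact model name on the hub before use.

from transformers import AutoModel, AutoTokenizer

# Load the IndicBERT (ALBERT) tokenizer and encoder from the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

# Encode a Hindi sentence and take the first token's representation as a sentence embedding.
inputs = tokenizer("यह एक उदाहरण वाक्य है।", return_tensors="pt")
outputs = model(**inputs)
sentence_embedding = outputs.last_hidden_state[:, 0, :]
print(sentence_embedding.shape)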
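IndicBART can likewise be loaded as a sequence-to-sequence model through transformers. The hub identifier ai4bharat/IndicBART and the tokenizer options below (slow tokenizer, accents kept) are assumptions to be checked against the model card, which also documents the language-tag format required for inputs and targets.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the IndicBART tokenizer and model (assumed hub identifier and options).
tokenizer = AutoTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True
)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBART")

# Inputs must carry the language tags expected by the model (see the model card)
# before calling model.generate() for translation or generation.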

Citing

If you are using any of the resources, please cite the following article:

@inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    booktitle={Findings of EMNLP},
}

If you are using IndicGLUE and the additional evaluation datasets in your work, please use the following detailed citation text so that the original authors of the datasets also get credit for their work. As more authors contribute to this benchmark, we will add their references to the text below.

We use the IndicGLUE dataset \cite{kakwani2020indicnlpsuite} which is an evaluation benchmark containing datasets for NLU tasks in Indian languages. Some of these datasets were built from Wikipedia and IndicCorp \cite{kakwani2020indicnlpsuite}. In addition, it also contains other publicly available datasets for cross-lingual similarity \cite{siripragrada-etal-2020-multilingual}, named entity recognition \cite{pan-etal-2017-cross}, paraphrase detection \cite{Kumar2016DPILFIRE2016OO}, discourse analysis \cite{Dhanwal2020AnAD}, sentiment analysis \cite{cicling/Akhtar16}, \cite{DBLP:conf/coling/Akhtar0EB16}, \cite{mukku-mamidi-2017-actsa} and genre classification \footnote{https://github.com/goru001/inltk} \footnote{https://www.kaggle.com/csoham/classification-bengali-news-articles-indicnlp} \footnote{https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1}. It also contains translations of the original WNLI \cite{Levesque2011TheWS} and COPA \cite{Gordon2011SemEval2012T7} datasets in 3 Indian languages.

The bibtex entries for the above sources are available here.


IndicNLP Catalog

In an effort to help discoverability of Indian language resources, we have started a collaborative catalog of known NLP resources for Indic languages. Check out the IndicNLP Catalog if you are looking for Indian NLP resources, and add to the catalog any resources you may know of or have created.


License


IndicNLP Suite is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. This license applies to datasets created as part of the project. For external datasets in the IndicGLUE benchmark, please look at the respective license terms.

The IndicBERT code (and model) are released under the MIT License.