IndicNLG Suite


IndicNLG Suite is a collection of datasets for benchmarking Natural Language Generation (NLG) for 11 Indic languages, spanning five diverse NLG tasks. The datasets were created using a combination of website crawling, machine translation, and n-gram-count- and regular-expression-based cleaning. Overall, the suite contains about 8.5M examples across all languages and tasks, making it the largest multilingual NLG dataset to date and the first of its kind for Indic languages. You can use these datasets to benchmark your own NLG systems.

  • Supported languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Punjabi, Kannada, Malayalam, Tamil, and Telugu.
  • Supported NLG tasks and datasets: Biography generation using Wikipedia infoboxes (WikiBio), news headline generation, sentence summarization, question generation, and paraphrase generation.
  • Datasets are available in JSON and HuggingFace datasets formats.

You can read more about IndicNLG Suite in this paper. We have benchmarked our own monolingual and multilingual models based on IndicBART and found that they perform on par with or better than baseline models such as mT5.

Downloads

The datasets and models are available on the HuggingFace hub. A minimal loading example follows the table below.

Task                     Dataset                       Model
Biography Generation     IndicWikiBio                  Coming Soon
Headline Generation      IndicHeadlineGeneration       Coming Soon
Sentence Summarization   IndicSentenceSummarization    Coming Soon
Paraphrase Generation    IndicParaphrase               Coming Soon
Question Generation      IndicQuestionGeneration       Coming Soon
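
The snippet below is a minimal sketch of loading one of the datasets with the HuggingFace datasets library. The hub ID ("ai4bharat/IndicParaphrase"), the language config ("hi"), and the field layout are assumptions here; check the dataset cards on the hub for the exact identifiers.

    # Minimal sketch of loading one IndicNLG Suite dataset from the HuggingFace hub.
    # The hub ID and language config below are assumptions; check the dataset cards
    # on the hub for the exact identifiers and field names.
    from datasets import load_dataset

    paraphrase_hi = load_dataset("ai4bharat/IndicParaphrase", "hi")  # hypothetical ID/config
    print(paraphrase_hi)              # available splits and their sizes
    print(paraphrase_hi["train"][0])  # inspect the fields of one example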

IndicBART fine-tuning and decoding

  • Follow the setup instructions here.
    • We use the YANMTT toolkit for fine-tuning IndicBART.
  • Extract the input and target text from the jsonl format files or HuggingFace format files.
    • For question generation, concatenate the question and context into a single line.
    • Convert the text in the extracted files into Devanagari script using the Indic Script Converter (see the first sketch after this list).
  • Here is a command for fine-tuning IndicBART for summarization.
    • The correct input and output file paths should be provided.
    • Use appropriate hyperparameters according to the paper.
  • Decode the test set using the fine-tuned model after modifying this command.
    • Map the output to the original script using the script converter.
  • Alternatively: IndicBART is available on the HuggingFace hub here.
    • Modify the HuggingFace summarization script to use the IndicBART model (see the second sketch after this list).
    • The script can consume both the JSON and the HuggingFace format files.
    • Ensure that script mapping is done before training and after decoding.
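
The following is a minimal sketch of the extraction and script-conversion step above. The field names ("input", "target") and file names are hypothetical placeholders, and the converter used is indic-nlp-library's UnicodeIndicTransliterator, which we assume is what the Indic Script Converter refers to; adapt both to the actual dataset schema and tooling.

    # Sketch: read a jsonl file, pull out the input/target text, map both to
    # Devanagari, and write plain-text source/target files for YANMTT fine-tuning.
    # Field names ("input", "target") and file names are hypothetical; the converter
    # is indic-nlp-library's UnicodeIndicTransliterator (assumed).
    import json

    from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

    SRC_LANG = "ta"  # ISO code of the language the data is written in
    DEV_LANG = "hi"  # Hindi/Devanagari, the shared script IndicBART is trained on

    with open("train.jsonl", encoding="utf-8") as fin, \
         open("train.src", "w", encoding="utf-8") as fsrc, \
         open("train.tgt", "w", encoding="utf-8") as ftgt:
        for line in fin:
            record = json.loads(line)
            src = UnicodeIndicTransliterator.transliterate(record["input"], SRC_LANG, DEV_LANG)
            tgt = UnicodeIndicTransliterator.transliterate(record["target"], SRC_LANG, DEV_LANG)
            fsrc.write(src.replace("\n", " ").strip() + "\n")
            ftgt.write(tgt.replace("\n", " ").strip() + "\n")

The same transliterate call, with the language codes swapped, maps decoded outputs back to the original script after decoding.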
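
For the HuggingFace route, here is a sketch of loading the IndicBART checkpoint directly. The hub ID and tokenizer options follow our reading of the public model card; verify them, along with the exact language-tag format the model expects for inputs and decoding, on the hub before training.

    # Sketch: use the IndicBART checkpoint from the HuggingFace hub directly.
    # The hub ID and tokenizer options follow our reading of the public model card;
    # verify them, and the language-tag format the model expects, on the hub.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True
    )
    model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBART")

    # The input must already be mapped to Devanagari (see the previous sketch) and
    # carry whatever language tags the model card prescribes.
    text = "<placeholder Devanagari input>"
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=64, num_beams=4)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))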

Contributors

  • Aman Kumar
  • Prachi Sahu
  • Himani Shrotriya
  • Raj Dabre
  • Ratish Puduppully
  • Anoop Kunchukuttan
  • Amogh Mishra
  • Mitesh M. Khapra
  • Pratyush Kumar

Citing

If you use IndicNLG Suite, please cite the following paper:

@misc{kumar2022indicnlg,
      title={IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages}, 
      author={Aman Kumar and Himani Shrotriya and Prachi Sahu and Raj Dabre and Ratish Puduppully and Anoop Kunchukuttan and Amogh Mishra and Mitesh M. Khapra and Pratyush Kumar},
      year={2022},
      eprint={2203.05437},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}   

License

Datasets

Different datasets are released under different licenses:

IndicHeadlineGeneration, IndicSentenceSummarization and IndicParaphrase are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

IndicWikiBio and IndicQuestionGeneration are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Models

All models are released under the MIT license.