fine-tuned-IndicNER / README.md
techysanoj's picture
Update README.md
08f1d7c verified
---
language:
- as
- bn
- gu
- hi
- kn
- ml
- mr
- or
- pa
- ta
- te
license: mit
datasets:
- Samanantar
tags:
- ner
- Pytorch
- transformer
- multilingual
- nlp
- indicnlp
---
# fine-tuned IndicNER
fine-tuned IndicNER is a model trained to complete the task of identifying named entities from sentences in Indian languages. Our model is specifically fine-tuned to the 11 Indian languages mentioned above over millions of sentences. The model is then benchmarked over a human annotated testset and multiple other publicly available Indian NER datasets.
The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.
## Training Corpus
Our model was trained on a [dataset](https://huggingface.co/datasets/ai4bharat/naamapadam) which we mined from the existing [Samanantar Corpus](https://huggingface.co/datasets/ai4bharat/samanantar). We used a bert-base-multilingual-uncased model as the starting point and then fine-tuned it to the NER dataset mentioned previously.
## Downloads
Download from this same Huggingface repo.
Update 20 Dec 2022: We released a new paper documenting IndicNER and Naamapadam. We have a different model reported in the paper. We will update the repo here soon with this model.
## Usage
You can use [this Colab notebook](https://colab.research.google.com/drive/1sYa-PDdZQ_c9SzUgnhyb3Fl7j96QBCS8?usp=sharing) for samples on using IndicNER or for finetuning a pre-trained model on Naampadam dataset to build your own NER models.
<!-- License -->
## License
The fine-tuned-IndicNER code (and models) are released under the MIT License.