---
language:
- as
- bn
- gu
- hi
- kn
- ml
- mr
- or
- pa
- ta
- te
license: mit
datasets:
- Samanantar
tags:
- ner
- pytorch
- transformer
- multilingual
- nlp
- indicnlp
---
|
|
|
|
|
# Fine-tuned IndicNER
|
|
IndicNER is a model trained to identify named entities in sentences in Indian languages. It is fine-tuned on millions of sentences across the 11 Indian languages listed above, and is benchmarked on a human-annotated test set as well as several other publicly available Indian NER datasets.
|
|
The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
|
|
|
|
|
## Training Corpus |
|
|
Our model was trained on a [dataset](https://huggingface.co/datasets/ai4bharat/naamapadam) that we mined from the existing [Samanantar Corpus](https://huggingface.co/datasets/ai4bharat/samanantar). We used bert-base-multilingual-uncased as the starting point and fine-tuned it on this NER dataset.
|
|
|
|
|
## Downloads |
|
|
The model can be downloaded from this same Hugging Face repository.
|
|
|
|
|
**Update (20 Dec 2022):** We released a new paper documenting IndicNER and Naamapadam. The paper reports a different model, and we will update this repository with that model soon.
|
|
|
|
|
## Usage |
|
|
|
|
|
You can use [this Colab notebook](https://colab.research.google.com/drive/1sYa-PDdZQ_c9SzUgnhyb3Fl7j96QBCS8?usp=sharing) for examples of using IndicNER, or for fine-tuning a pre-trained model on the Naamapadam dataset to build your own NER models.
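Token-classification models like IndicNER typically emit one BIO tag per token (e.g. `B-PER`, `I-LOC`, `O`), which you then group into entity spans yourself. The helper below is a generic BIO-decoding sketch, not code from the IndicNER repository; the example tokens and tags stand in for what the model would predict:

```python
def bio_to_spans(tokens, tags):
    """Group per-token BIO tags into (entity_text, entity_type) spans.

    Generic BIO decoding for post-processing token-classification output;
    this is an illustrative sketch, not code from the IndicNER repository.
    """
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continue the current entity only if the type matches.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) ends the current entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical model output for the sentence "Mohan visited New Delhi":
tokens = ["Mohan", "visited", "New", "Delhi"]
tags = ["B-PER", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))  # [('Mohan', 'PER'), ('New Delhi', 'LOC')]
```

The Colab notebook above shows the full pipeline end to end; a helper like this sits at the final step, after the model's per-token label IDs have been mapped to tag strings.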
|
|
|
|
|
|
|
|
|
|
|
<!-- License --> |
|
|
## License |
|
|
|
|
|
The IndicNER code and models are released under the MIT License.
|
|
|