SriLankan_Tamil_NER

This model is a fine-tuned version of ai4bharat/IndicNER on the Srilankan-Tamil-NER dataset. It achieves the following results on the evaluation set:

Loss: 0.2084
Precision: 0.6023
Recall: 0.7065
F1: 0.6503
Accuracy: 0.9604
F1 Per: 0.7212
Precision Per: 0.6727
Recall Per: 0.7773
F1 Loc: 0.6983
Precision Loc: 0.6548
Recall Loc: 0.7481
F1 Org: 0.4835
Precision Org: 0.4347
Recall Org: 0.5448
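
As a quick sanity check on the numbers above, F1 is the harmonic mean of precision and recall, so each reported F1 can be recomputed from its precision/recall pair. The sketch below does this; the short class keys (`PER`, `LOC`, `ORG`) are shorthand introduced here, and small differences in the last decimal are expected because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the evaluation results above.
reported = {
    "overall": (0.6023, 0.7065, 0.6503),
    "PER": (0.6727, 0.7773, 0.7212),
    "LOC": (0.6548, 0.7481, 0.6983),
    "ORG": (0.4347, 0.5448, 0.4835),
}

# Recompute F1 for every row from its precision and recall.
recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}

for name, (_, _, f1) in reported.items():
    # Inputs are rounded to four decimals, so compare with a small tolerance.
    print(f"{name}: reported F1={f1:.4f}, recomputed F1={recomputed[name]:.4f}")
```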

Model description

This model is a fine-tuned version of IndicNER specifically adapted for Sri Lankan Tamil Named Entity Recognition (NER). The model was developed under the Center for Tamil Natural Language Processing Research (CTNLPR) to improve entity recognition performance for low-resource Sri Lankan Tamil linguistic contexts.

The model was fine-tuned on a custom-annotated Sri Lankan Tamil NER dataset of approximately 10K samples, with BIO tags for:

PERSON entities
LOCATION entities
ORGANIZATION entities
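
To make the BIO scheme concrete, here is a minimal sketch of how a tagged token sequence can be collapsed into entity spans. The tag names (`B-PER`, `I-ORG`, and so on) are an assumption based on the metric names in this card, the helper `bio_to_spans` is hypothetical, and the example sentence is illustrative rather than taken from the dataset.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse BIO tags into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def close_span() -> None:
        # Emit the open span, if any, and reset the state.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            close_span()
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity, ends the span.
            close_span()
    close_span()
    return spans


# Illustrative sentence (not from the dataset):
# "Kamala studies at the University of Jaffna."
tokens = ["கமலா", "யாழ்ப்பாணப்", "பல்கலைக்கழகத்தில்", "படிக்கிறார்"]
tags = ["B-PER", "B-ORG", "I-ORG", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

This sketch treats a stray `I-` tag (one that does not continue an open entity of the same type) as outside any entity; other decoders are more lenient and start a new span instead.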

The model focuses on handling:

Sri Lankan Tamil vocabulary
regional organization names
Tamil morphological variations

Intended uses & limitations

Intended Uses

This model is intended for:

Sri Lankan Tamil Named Entity Recognition (NER)
Tamil document intelligence systems
Semantic search systems
Retrieval-Augmented Generation (RAG)
Tamil chatbot pipelines
Knowledge graph generation
Government and institutional document processing
Low-resource multilingual NLP research
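
For the uses listed above, inference would typically go through the Transformers token-classification pipeline. The following is a sketch rather than an official usage snippet: it assumes the checkpoint is published as `exentai/SriLankan_Tamil_NER` (the repository ID shown on the model page) and loads with the standard pipeline; `load_ner_pipeline`, `extract_entities`, and the `min_score` filter are hypothetical helpers added for illustration.

```python
from typing import Any, Callable, Dict, List

# ASSUMPTION: repository ID as shown on the model page.
MODEL_ID = "exentai/SriLankan_Tamil_NER"


def load_ner_pipeline(model_id: str = MODEL_ID):
    """Build a token-classification pipeline that merges sub-word pieces."""
    # Imported lazily so extract_entities can be used without transformers.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # group tokens into whole entities
    )


def extract_entities(
    ner: Callable[[str], List[Dict[str, Any]]],
    text: str,
    min_score: float = 0.0,
) -> List[Dict[str, Any]]:
    """Run the tagger and keep entities scoring at least min_score."""
    results = []
    for entity in ner(text):
        score = float(entity["score"])
        if score >= min_score:
            results.append(
                {
                    "label": entity["entity_group"],
                    "text": entity["word"],
                    "score": round(score, 4),
                }
            )
    return results
```

With Transformers installed and network access available, `extract_entities(load_ner_pipeline(), text)` would download the checkpoint on first use and return one dictionary per detected entity. Raising `min_score` is one possible way to trade recall for precision, which may be relevant given the lower precision reported for organization entities.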

Limitations

Although the model improves contextual understanding of Sri Lankan Tamil, several limitations remain:

Organization entities remain challenging due to naming variability and contextual ambiguity.
Performance may degrade on heavily noisy OCR outputs.
The model may struggle with highly code-mixed Tamil-English content.
The model may not generalize well to rare or unseen regional entities.
The model is optimized primarily for Sri Lankan Tamil and may behave differently on Indian Tamil corpora.

This model should therefore be considered a research-oriented low-resource Tamil NLP system rather than a fully production-optimized NER solution.

Training and evaluation data

The model was fine-tuned using the Srilankan-Tamil-NER Dataset, a manually curated Sri Lankan Tamil NER corpus developed under CTNLPR.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
optimizer: OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 0.1
num_epochs: 10
mixed_precision_training: Native AMP
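
To illustrate what the learning-rate settings above imply, the sketch below traces a linear-warmup, cosine-decay schedule in plain Python. Two things are assumptions, not statements from this card: that the warmup value of 0.1 is a fraction of the total number of steps, and that the total is the 5100 steps shown in the training results. The function `lr_at` is a hypothetical helper that mirrors the usual shape of such a schedule and is not the Trainer's own code.

```python
import math

BASE_LR = 3e-5        # learning_rate from the list above
TOTAL_STEPS = 5100    # final step in the training results table
# ASSUMPTION: the warmup value 0.1 is a fraction of total steps.
WARMUP_STEPS = round(0.1 * TOTAL_STEPS)


def lr_at(
    step: int,
    base_lr: float = BASE_LR,
    warmup: int = WARMUP_STEPS,
    total: int = TOTAL_STEPS,
) -> float:
    """Learning rate at a given optimizer step."""
    if step < warmup:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup)
    # Half-cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 255, 510, 2805, 5100):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Under these assumptions the rate peaks at 3e-05 at the end of the first epoch (step 510) and decays to zero by the final step.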

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy	F1 Per	Precision Per	Recall Per	F1 Loc	Precision Loc	Recall Loc	F1 Org	Precision Org	Recall Org
0.1970	1.0	510	0.1850	0.3393	0.4712	0.3945	0.9393	0.4767	0.4227	0.5467	0.4695	0.3983	0.5718	0.0689	0.0590	0.0828
0.1368	2.0	1020	0.1457	0.4919	0.6357	0.5546	0.9492	0.6049	0.5391	0.6889	0.6142	0.5550	0.6876	0.3365	0.2834	0.4139
0.1005	3.0	1530	0.1412	0.5917	0.6344	0.6123	0.9570	0.6473	0.6502	0.6444	0.6631	0.6479	0.6791	0.4399	0.3947	0.4967
0.0659	4.0	2040	0.1506	0.5685	0.6850	0.6213	0.9589	0.6284	0.5588	0.7178	0.6880	0.6335	0.7527	0.4224	0.3977	0.4503
0.0384	5.0	2550	0.1582	0.5952	0.6863	0.6375	0.9601	0.6583	0.6213	0.7000	0.6941	0.6512	0.7431	0.4583	0.4162	0.5099
0.0325	6.0	3060	0.1595	0.6090	0.7034	0.6528	0.9611	0.6835	0.6487	0.7222	0.7050	0.6621	0.7539	0.4744	0.4252	0.5364
0.0250	7.0	3570	0.1863	0.6022	0.7008	0.6478	0.9589	0.6902	0.6667	0.7156	0.6971	0.6536	0.7467	0.4691	0.4073	0.5530
0.0136	8.0	4080	0.1987	0.6041	0.7141	0.6545	0.9598	0.6900	0.6430	0.7444	0.7074	0.6663	0.7539	0.4747	0.4122	0.5596
0.0152	9.0	4590	0.2023	0.6088	0.7116	0.6562	0.9604	0.7034	0.6721	0.7378	0.7052	0.6652	0.7503	0.4743	0.4081	0.5662
0.0106	10.0	5100	0.2024	0.6134	0.7097	0.6581	0.9606	0.6998	0.6673	0.7356	0.7070	0.6703	0.7479	0.4817	0.4191	0.5662
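
The table can also be read programmatically, for example to see which epoch would be chosen under different checkpoint-selection criteria. The sketch below copies four columns from the table and compares the epoch with the lowest validation loss against the epoch with the highest overall F1. Which criterion was actually used to pick the released checkpoint is not stated in this card.

```python
from typing import List, NamedTuple


class EpochResult(NamedTuple):
    epoch: int
    train_loss: float
    val_loss: float
    f1: float


# Epoch, training loss, validation loss, and overall F1 from the table above.
history: List[EpochResult] = [
    EpochResult(1, 0.1970, 0.1850, 0.3945),
    EpochResult(2, 0.1368, 0.1457, 0.5546),
    EpochResult(3, 0.1005, 0.1412, 0.6123),
    EpochResult(4, 0.0659, 0.1506, 0.6213),
    EpochResult(5, 0.0384, 0.1582, 0.6375),
    EpochResult(6, 0.0325, 0.1595, 0.6528),
    EpochResult(7, 0.0250, 0.1863, 0.6478),
    EpochResult(8, 0.0136, 0.1987, 0.6545),
    EpochResult(9, 0.0152, 0.2023, 0.6562),
    EpochResult(10, 0.0106, 0.2024, 0.6581),
]

# Two common selection criteria give different answers here.
best_by_loss = min(history, key=lambda r: r.val_loss)
best_by_f1 = max(history, key=lambda r: r.f1)

print(f"lowest validation loss: epoch {best_by_loss.epoch} ({best_by_loss.val_loss:.4f})")
print(f"highest F1: epoch {best_by_f1.epoch} ({best_by_f1.f1:.4f})")
```

Validation loss bottoms out at epoch 3 and then rises while training loss keeps falling, a pattern usually associated with overfitting, yet overall F1 continues to improve slowly through epoch 10.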

Framework versions

Transformers 5.0.0
Pytorch 2.10.0+cu128
Datasets 4.0.0
Tokenizers 0.22.2

Model details

Repository: exentai/SriLankan_Tamil_NER
Base model: ai4bharat/IndicNER
Model size: 0.2B params
Tensor type: F32 (Safetensors)