SriLankan_Tamil_NER

This model is a fine-tuned version of ai4bharat/IndicNER on the Srilankan-Tamil-NER dataset. It achieves the following results on the evaluation set:

  • Loss: 0.2084
  • Precision: 0.6023
  • Recall: 0.7065
  • F1: 0.6503
  • Accuracy: 0.9604
  • F1 Per: 0.7212
  • Precision Per: 0.6727
  • Recall Per: 0.7773
  • F1 Loc: 0.6983
  • Precision Loc: 0.6548
  • Recall Loc: 0.7481
  • F1 Org: 0.4835
  • Precision Org: 0.4347
  • Recall Org: 0.5448
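
As a quick sanity check on the figures above, F1 is the harmonic mean of precision and recall, so each reported F1 can be recomputed from its precision/recall pair. The sketch below does that arithmetic; the helper name `f1_score` and the dictionary layout are illustrative, and small differences in the last digit are expected because the inputs are already rounded to four decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the evaluation results above.
reported = {
    "overall": (0.6023, 0.7065, 0.6503),
    "PER": (0.6727, 0.7773, 0.7212),
    "LOC": (0.6548, 0.7481, 0.6983),
    "ORG": (0.4347, 0.5448, 0.4835),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.4f}, reported = {f1:.4f}")
```

The recomputed values agree with the reported ones to within rounding, which suggests the per-class figures are internally consistent.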

Model description

This model adapts IndicNER to Sri Lankan Tamil Named Entity Recognition (NER). It was developed under the Center for Tamil Natural Language Processing Research (CTNLPR) to improve entity recognition on low-resource Sri Lankan Tamil text.

The model was fine-tuned on a custom-annotated Sri Lankan Tamil NER dataset of approximately 10K samples, annotated with BIO tags for:

  • PERSON entities
  • LOCATION entities
  • ORGANIZATION entities
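
To illustrate what BIO tagging means in practice, here is a minimal sketch that groups BIO-tagged tokens into entity spans. The label strings (`B-PER`, `I-ORG`, and so on) and the example sentence are assumptions made for illustration; the actual label names are defined by the dataset and the model configuration, which this card does not list.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close the open span (if any) and ignore this token.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Made-up example sentence with hypothetical label names.
tokens = ["அருண்", "கொழும்பு", "பல்கலைக்கழகத்தில்", "படிக்கிறார்"]
tags = ["B-PER", "B-ORG", "I-ORG", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

In this sketch a stray `I-` tag with no matching open span is dropped; other decoders treat it as the start of a new entity, so the choice is a design decision rather than part of the BIO scheme itself.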

The model focuses on handling:

  • Sri Lankan Tamil vocabulary
  • regional organization names
  • Tamil morphological variations

Intended uses & limitations

Intended Uses

This model is intended for:

  • Sri Lankan Tamil Named Entity Recognition (NER)
  • Tamil document intelligence systems
  • Semantic search systems
  • Retrieval-Augmented Generation (RAG)
  • Tamil chatbot pipelines
  • Knowledge graph generation
  • Government and institutional document processing
  • Low-resource multilingual NLP research
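
For most of these uses, the model would typically be called through the Transformers `pipeline` API. The following is a minimal sketch, assuming the model is hosted on the Hub under the id `exentai/SriLankan_Tamil_NER` and that the `transformers` package is installed. The helper names `load_ner_pipeline` and `summarize_entities` and the `min_score` threshold are illustrative, not part of the model.

```python
from typing import Any, Dict, List

MODEL_ID = "exentai/SriLankan_Tamil_NER"  # assumed Hub repository id


def load_ner_pipeline(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads weights on first use)."""
    from transformers import pipeline  # imported lazily; requires `transformers`

    # aggregation_strategy="simple" merges word pieces into whole entities.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def summarize_entities(
    results: List[Dict[str, Any]], min_score: float = 0.5
) -> List[Dict[str, Any]]:
    """Keep aggregated entities scoring at least min_score, as plain dicts."""
    return [
        {
            "type": r["entity_group"],
            "text": r["word"],
            "span": (r["start"], r["end"]),
            "score": round(float(r["score"]), 4),
        }
        for r in results
        if r["score"] >= min_score
    ]


# Usage (needs network access to fetch the model):
# ner = load_ner_pipeline()
# print(summarize_entities(ner("<Sri Lankan Tamil text here>")))
```

Given the lower precision reported for organization entities, a downstream system may want a stricter `min_score` for that class than for persons or locations; the right threshold would need to be tuned on held-out data.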

Limitations

Although fine-tuning adapts the model to Sri Lankan Tamil, several limitations remain:

  • Organization entities remain challenging due to naming variability and contextual ambiguity (F1 of 0.4835, compared with 0.7212 for persons and 0.6983 for locations on the evaluation set).
  • Performance may degrade on heavily noisy OCR outputs.
  • The model may struggle with highly code-mixed Tamil-English content.
  • The model may not generalize well to rare or unseen regional entities.
  • The model is optimized primarily for Sri Lankan Tamil and may behave differently on Indian Tamil corpora.
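
One way to act on the code-mixing limitation is to flag inputs that contain a large share of Latin-script letters before sending them to the model. The sketch below is a simple heuristic, not something the model or dataset provides: the function name `tamil_script_ratio` and the 0.8 threshold are hypothetical, and the check only compares code points in the Tamil Unicode block (U+0B80 to U+0BFF) against ASCII letters.

```python
from typing import Optional


def tamil_script_ratio(text: str) -> Optional[float]:
    """Share of Tamil-block code points among Tamil and ASCII-letter characters.

    Returns None when the text contains neither.
    """
    tamil = sum(1 for ch in text if "\u0b80" <= ch <= "\u0bff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = tamil + latin
    return tamil / total if total else None


def looks_code_mixed(text: str, threshold: float = 0.8) -> bool:
    """Flag text whose Tamil share falls below a (hypothetical) threshold."""
    ratio = tamil_script_ratio(text)
    return ratio is not None and ratio < threshold


print(tamil_script_ratio("கொழும்பு"))
print(looks_code_mixed("Colombo கொழும்பு"))
```

Flagged inputs could be routed to manual review or to a different model; the threshold would need tuning on real data.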

This model should therefore be considered a research-oriented low-resource Tamil NLP system rather than a fully production-optimized NER solution.

Training and evaluation data

The model was fine-tuned using the Srilankan-Tamil-NER Dataset, a manually curated Sri Lankan Tamil NER corpus developed under CTNLPR.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH_FUSED) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
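  • lr_scheduler_warmup_steps: 0.1 (a fractional value, which suggests a warmup over 10% of the training steps rather than an absolute step count)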
  • num_epochs: 10
  • mixed_precision_training: Native AMP
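
To make the learning-rate settings concrete, the sketch below computes a cosine schedule with linear warmup in plain Python. It assumes that the warmup value of 0.1 means 10% of the run, and it uses the 5,100 total optimizer steps reported in the training results; the function `lr_at` is illustrative and follows the usual half-cosine decay formula rather than reproducing the trainer's exact internals.

```python
import math

PEAK_LR = 3e-05                 # learning_rate from the list above
TOTAL_STEPS = 5100              # final step reported in the training results
WARMUP_STEPS = TOTAL_STEPS // 10  # assumption: 0.1 means 10% of total steps


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    # Half-cosine decay from the peak down to 0.
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 255, 510, 2805, 5100):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Under these assumptions the warmup ends exactly at the close of the first epoch (step 510), and the learning rate has fallen to half its peak by step 2805.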

Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy | F1 Per | Precision Per | Recall Per | F1 Loc | Precision Loc | Recall Loc | F1 Org | Precision Org | Recall Org |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.1970 | 1.0 | 510 | 0.1850 | 0.3393 | 0.4712 | 0.3945 | 0.9393 | 0.4767 | 0.4227 | 0.5467 | 0.4695 | 0.3983 | 0.5718 | 0.0689 | 0.0590 | 0.0828 |
| 0.1368 | 2.0 | 1020 | 0.1457 | 0.4919 | 0.6357 | 0.5546 | 0.9492 | 0.6049 | 0.5391 | 0.6889 | 0.6142 | 0.5550 | 0.6876 | 0.3365 | 0.2834 | 0.4139 |
| 0.1005 | 3.0 | 1530 | 0.1412 | 0.5917 | 0.6344 | 0.6123 | 0.9570 | 0.6473 | 0.6502 | 0.6444 | 0.6631 | 0.6479 | 0.6791 | 0.4399 | 0.3947 | 0.4967 |
| 0.0659 | 4.0 | 2040 | 0.1506 | 0.5685 | 0.6850 | 0.6213 | 0.9589 | 0.6284 | 0.5588 | 0.7178 | 0.6880 | 0.6335 | 0.7527 | 0.4224 | 0.3977 | 0.4503 |
| 0.0384 | 5.0 | 2550 | 0.1582 | 0.5952 | 0.6863 | 0.6375 | 0.9601 | 0.6583 | 0.6213 | 0.7 | 0.6941 | 0.6512 | 0.7431 | 0.4583 | 0.4162 | 0.5099 |
| 0.0325 | 6.0 | 3060 | 0.1595 | 0.6090 | 0.7034 | 0.6528 | 0.9611 | 0.6835 | 0.6487 | 0.7222 | 0.7050 | 0.6621 | 0.7539 | 0.4744 | 0.4252 | 0.5364 |
| 0.0250 | 7.0 | 3570 | 0.1863 | 0.6022 | 0.7008 | 0.6478 | 0.9589 | 0.6902 | 0.6667 | 0.7156 | 0.6971 | 0.6536 | 0.7467 | 0.4691 | 0.4073 | 0.5530 |
| 0.0136 | 8.0 | 4080 | 0.1987 | 0.6041 | 0.7141 | 0.6545 | 0.9598 | 0.6900 | 0.6430 | 0.7444 | 0.7074 | 0.6663 | 0.7539 | 0.4747 | 0.4122 | 0.5596 |
| 0.0152 | 9.0 | 4590 | 0.2023 | 0.6088 | 0.7116 | 0.6562 | 0.9604 | 0.7034 | 0.6721 | 0.7378 | 0.7052 | 0.6652 | 0.7503 | 0.4743 | 0.4081 | 0.5662 |
| 0.0106 | 10.0 | 5100 | 0.2024 | 0.6134 | 0.7097 | 0.6581 | 0.9606 | 0.6998 | 0.6673 | 0.7356 | 0.7070 | 0.6703 | 0.7479 | 0.4817 | 0.4191 | 0.5662 |
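
The training results show validation loss and overall F1 moving in different directions in later epochs, which matters when choosing a checkpoint. The short sketch below picks the best epoch under each criterion; the `history` list simply copies the Epoch, Validation Loss, and F1 columns from the results above, and the variable names are illustrative.

```python
# (epoch, validation_loss, overall_f1) copied from the training results.
history = [
    (1, 0.1850, 0.3945),
    (2, 0.1457, 0.5546),
    (3, 0.1412, 0.6123),
    (4, 0.1506, 0.6213),
    (5, 0.1582, 0.6375),
    (6, 0.1595, 0.6528),
    (7, 0.1863, 0.6478),
    (8, 0.1987, 0.6545),
    (9, 0.2023, 0.6562),
    (10, 0.2024, 0.6581),
]

# Lowest validation loss versus highest overall F1.
best_by_loss = min(history, key=lambda row: row[1])
best_by_f1 = max(history, key=lambda row: row[2])

print(f"lowest validation loss: epoch {best_by_loss[0]} ({best_by_loss[1]:.4f})")
print(f"highest overall F1:     epoch {best_by_f1[0]} ({best_by_f1[2]:.4f})")
```

Validation loss is lowest at epoch 3 (0.1412), while overall F1 peaks at epoch 10 (0.6581). The rising validation loss alongside a training loss that keeps falling is a common sign of overfitting at the token-probability level, even though entity-level F1 still improves slightly.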

Framework versions

  • Transformers 5.0.0
  • Pytorch 2.10.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2

Model size

  • Format: Safetensors
  • Parameters: 0.2B
  • Tensor type: F32