---
language: en
license: cc-by-nc-4.0
tags:
- token-classification
- ner
- northeast-india
- low-resource
- xlm-roberta
metrics:
- f1
- precision
- recall
model-index:
- name: MWirelabs/NortheastNER
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Custom Northeast India Gazetteers + News Corpus
      type: custom
      split: dev
    metrics:
    - name: Overall F1
      type: f1
      value: 0.964
    - name: Precision
      type: precision
      value: 0.962
    - name: Recall
      type: recall
      value: 0.967
---
# MWirelabs/NortheastNER
**NortheastNER** is a Named Entity Recognition (NER) model fine-tuned by [MWirelabs](https://huggingface.co/MWirelabs) to recognize entities specific to **Northeast India**.
It is based on [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) and trained on a mix of gazetteers, curated news, and domain-specific data (tribes, villages, flora, fauna, festivals, tourist places).
---
## 🔎 What it can recognize
* **PLACES** → States, districts, villages, regions (e.g., *Shillong*, *Tura*, *Ri-Bhoi*)
* **TRIBES** → Indigenous tribes & sub-tribes (e.g., *Khasi*, *Nyishi*, *Wancho*)
* **FESTIVALS** → Local festivals (e.g., *Wangala*, *Losar*, *Nyokum Yullo*)
* **TOURIST** → Landmarks & tourist spots (e.g., *Tawang Monastery*, *Umiam Lake*)
* **FLORA** → Plants & crops of the Himalayan / NE region
* **FAUNA** → Animals, birds, and wildlife of the NE region
---
## 📊 Evaluation
Evaluated on a 5k-sentence dev set:

| Entity | Precision | Recall | F1 |
| ----------- | --------- | --------- | --------- |
| PLACES | 0.963 | 0.969 | 0.966 |
| TRIBES | 0.927 | 0.927 | 0.927 |
| FESTIVALS | (coming soon, fewer examples) | | |
| TOURIST | 0.167 | 0.125 | 0.143 |
| FLORA | 1.000 | 0.800 | 0.889 |
| FAUNA | 0.000 | 0.000 | 0.000 |
| **Overall** | **0.962** | **0.967** | **0.964** |

⚠️ Scores for **TOURIST** and **FAUNA** are low because these classes have very few training examples; performance is expected to improve with more labeled data.

Note: The current evaluation set does not include enough examples of **NAMES**, so that category is not reported in the table. The training data did include a small gazetteer of Khasi and regional names (~81 entries), but more labeled examples are needed for a meaningful evaluation.
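
Each F1 value in the table is the harmonic mean of that row's precision and recall. The short check below recomputes it from the reported figures. The helper name `f1_score` is ours for illustration; this card does not describe the evaluation code that was actually used.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the evaluation table above.
reported = {
    "PLACES": (0.963, 0.969, 0.966),
    "TRIBES": (0.927, 0.927, 0.927),
    "TOURIST": (0.167, 0.125, 0.143),
    "FLORA": (1.000, 0.800, 0.889),
    "FAUNA": (0.000, 0.000, 0.000),
}

for label, (p, r, f1) in reported.items():
    recomputed = f1_score(p, r)
    # Table values are rounded to three decimals, so allow a small tolerance.
    assert abs(recomputed - f1) < 0.002, label
    print(f"{label:8s} F1 = {recomputed:.3f}")

# Unweighted (macro) mean over the five rows that report scores.
macro_f1 = sum(f1_score(p, r) for p, r, _ in reported.values()) / len(reported)
print(f"Unweighted mean F1 = {macro_f1:.3f}")
```

The unweighted mean of these rows comes out near 0.585, far below the reported overall F1 of 0.964. That suggests the overall figure is weighted by class frequency (for example, a micro-average dominated by **PLACES**), although the card does not state the averaging method.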
---
## ⚙️ Training Setup
* **Base model**: `xlm-roberta-base`
* **Max sequence length**: 256
* **Batch size**: 16
* **Learning rate**: 3e-5
* **Epochs**: 3
* **Weight decay**: 0.01
* **Optimizer**: AdamW
* **Framework**: HuggingFace Transformers Trainer API
### 📦 Dataset Size
- Train set: ~20,000 sentences
- Dev set: ~5,000 sentences
- Sources: Gazetteers (districts, tribes, flora/fauna, festivals, tourist sites, names), news articles, tourism/cultural descriptions
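
As a rough sketch, the hyperparameters listed above can be collected into keyword arguments for `transformers.TrainingArguments`. The key names are real `TrainingArguments` parameters, but the dict itself, `MAX_SEQ_LENGTH`, `TRAIN_SENTENCES`, and the step arithmetic are illustrative assumptions, since the actual training script is not part of this card. The step count assumes a single GPU, no gradient accumulation, and exactly 20,000 training sentences.

```python
import math

# Hyperparameters as listed in this card, keyed by TrainingArguments parameter names.
training_kwargs = {
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 16,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "optim": "adamw_torch",  # assumption: the card only says "AdamW"
}

MAX_SEQ_LENGTH = 256      # applied at tokenization time, not via TrainingArguments
TRAIN_SENTENCES = 20_000  # approximate train-set size from the card

# Optimizer steps per epoch and over the whole run (under the assumptions above).
steps_per_epoch = math.ceil(TRAIN_SENTENCES / training_kwargs["per_device_train_batch_size"])
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]
print(steps_per_epoch, total_steps)  # 1250 3750
```

These could be passed as `TrainingArguments(output_dir="out", **training_kwargs)`, where `output_dir="out"` is a placeholder path.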
### 🔧 Environment
* **Transformers**: 4.44.2
* **Datasets**: 2.20.0
* **Evaluate**: 0.4.2
* **PyTorch**: 2.3.0+cu121
* **Python**: 3.11
* **Hardware**: Single NVIDIA A4500 GPU (20 GB VRAM), 62 GB RAM, 12 vCPU
---
## 🚀 Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "MWirelabs/NortheastNER"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" merges subword pieces into whole entity spans.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Wangala festival is celebrated in Garo Hills near Tura."
print(ner(text))
```
Example output (abbreviated; each entry also carries `start` and `end` character offsets):
```python
[{'entity_group': 'FESTIVALS', 'word': 'Wangala', 'score': 0.99},
{'entity_group': 'PLACES', 'word': 'Garo Hills', 'score': 0.98},
{'entity_group': 'PLACES', 'word': 'Tura', 'score': 0.97}]
```
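
The aggregated output is a plain list of dicts, so it can be post-processed without extra dependencies. As a minimal sketch, the hypothetical helper `group_entities` below (not part of Transformers or of this model) buckets predictions by `entity_group` and drops those under a score threshold. The `min_score` default of 0.5 is an arbitrary assumption, not a tuned value.

```python
from collections import defaultdict


def group_entities(predictions, min_score=0.5):
    """Bucket pipeline predictions by entity type, keeping only confident ones."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Sample predictions copied from the example output above.
sample = [
    {"entity_group": "FESTIVALS", "word": "Wangala", "score": 0.99},
    {"entity_group": "PLACES", "word": "Garo Hills", "score": 0.98},
    {"entity_group": "PLACES", "word": "Tura", "score": 0.97},
]
print(group_entities(sample))
# {'FESTIVALS': ['Wangala'], 'PLACES': ['Garo Hills', 'Tura']}
```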
---
## 📜 License
This model is licensed under the **Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)** license.
You are free to use, share, and adapt the model for non-commercial purposes with attribution.
---
### 🗂 Data Licenses
- Gazetteers of villages and tribes: compiled by MWirelabs (open reference use).
- Festivals, tourist sites, and names: curated by MWirelabs team.
Please ensure attribution when reusing any derived dataset.
---
## 📖 Citation
If you use this model in your research, please cite:
```bibtex
@misc{mwirelabs2025northeastner,
title = {NortheastNER: A Domain-Specific Named Entity Recognition Model for Northeast India},
author = {MWirelabs},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/MWirelabs/NortheastNER}},
}
```
---
## ⚠️ Limitations
- Low support for **TOURIST** and **FAUNA** classes (few examples).
- **NAMES** entity class trained but not evaluated due to lack of dev set coverage.
- Possible confusion between **TRIBES** and **PLACES** where names overlap (e.g., Garo).
- Model optimized for Northeast India texts; performance outside this domain may degrade.
## 🔮 Future Work
- Add more gold-labeled examples for underrepresented classes (**NAMES**, **FAUNA**, **TOURIST**).
- Explore active learning to identify low-confidence predictions for manual annotation.
- Expand coverage of festivals and indigenous knowledge domains.
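
One simple way to approach the active-learning item above is least-confidence sampling: score each sentence by its least confident predicted entity and send the lowest-scoring sentences to annotators first. The sketch below assumes predictions in the pipeline's aggregated format. The function `rank_for_annotation`, the `budget` parameter, the sample sentences, and their scores are all made up for illustration and do not come from the model.

```python
def rank_for_annotation(predictions_by_sentence, budget=2):
    """Return up to `budget` sentences, least confident first.

    A sentence's confidence is the minimum score among its predicted entities;
    sentences with no predicted entities are skipped in this sketch.
    """
    scored = [
        (min(p["score"] for p in preds), sentence)
        for sentence, preds in predictions_by_sentence.items()
        if preds
    ]
    scored.sort(key=lambda item: item[0])
    return [sentence for _, sentence in scored[:budget]]


# Made-up predictions for illustration only (not real model output).
batch = {
    "sentence A": [{"entity_group": "PLACES", "word": "Tura", "score": 0.97}],
    "sentence B": [{"entity_group": "FAUNA", "word": "hoolock gibbon", "score": 0.41}],
    "sentence C": [{"entity_group": "TOURIST", "word": "Umiam Lake", "score": 0.55}],
    "sentence D": [],
}
print(rank_for_annotation(batch))  # ['sentence B', 'sentence C']
```

Skipping sentences with no predictions keeps the sketch short, but it also hides missed entities; a fuller approach would need token-level scores rather than aggregated spans.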
---
## 🏢 About
This model is developed by **MWirelabs**, pioneering AI solutions for the rich cultural and linguistic diversity of **Northeast India**.
Contact: [MWirelabs](https://huggingface.co/MWirelabs)