---
language: en
license: cc-by-nc-4.0
tags:
- token-classification
- ner
- northeast-india
- low-resource
- xlm-roberta
metrics:
- f1
- precision
- recall
model-index:
- name: MWirelabs/NortheastNER
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Custom Northeast India Gazetteers + News Corpus
      type: custom
      split: dev
    metrics:
    - name: Overall F1
      type: f1
      value: 0.964
    - name: Precision
      type: precision
      value: 0.962
    - name: Recall
      type: recall
      value: 0.967
---

# MWirelabs/NortheastNER

**NortheastNER** is a Named Entity Recognition (NER) model fine-tuned by [MWirelabs](https://huggingface.co/MWirelabs) to recognize entities specific to **Northeast India**.
It is based on [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) and trained on a mix of gazetteers, curated news, and domain-specific data (tribes, villages, flora, fauna, festivals, tourist places).

---

## What it can recognize

* **PLACES**: States, districts, villages, regions (e.g., *Shillong*, *Tura*, *Ri-Bhoi*)
* **TRIBES**: Indigenous tribes and sub-tribes (e.g., *Khasi*, *Nyishi*, *Wancho*)
* **FESTIVALS**: Local festivals (e.g., *Wangala*, *Losar*, *Nyokum Yullo*)
* **TOURIST**: Landmarks and tourist spots (e.g., *Tawang Monastery*, *Umiam Lake*)
* **FLORA**: Plants and crops of the Himalayan / NE region
* **FAUNA**: Animals, birds, and wildlife of the NE region

---

## Evaluation

Evaluated on a 5k-sentence dev set:

| Entity      | Precision | Recall    | F1        |
| ----------- | --------- | --------- | --------- |
| PLACES      | 0.963     | 0.969     | 0.966     |
| TRIBES      | 0.927     | 0.927     | 0.927     |
| FESTIVALS   | not yet reported (too few examples) |  |  |
| TOURIST     | 0.167     | 0.125     | 0.143     |
| FLORA       | 1.000     | 0.800     | 0.889     |
| FAUNA       | 0.000     | 0.000     | 0.000     |
| **Overall** | **0.962** | **0.967** | **0.964** |

Low scores for **TOURIST** and **FAUNA** reflect very few training examples; performance is expected to improve with more labeled data.

Note: The current evaluation set does not include enough examples of **NAMES**, so that category is not reported in the table. Training data did include a small gazetteer of Khasi and regional names (~81 entries), but more labeled examples are needed for meaningful evaluation.
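
As a quick sanity check on the table, each F1 value can be recomputed as the harmonic mean of the reported precision and recall. The sketch below is only an arithmetic check on the published numbers (the helper name `f1_score` is introduced here for illustration); it does not reproduce the evaluation itself.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the evaluation table.
reported = {
    "PLACES": (0.963, 0.969),
    "TRIBES": (0.927, 0.927),
    "TOURIST": (0.167, 0.125),
    "FLORA": (1.000, 0.800),
    "FAUNA": (0.000, 0.000),
    "Overall": (0.962, 0.967),
}

for label, (p, r) in reported.items():
    print(f"{label:<8} F1 = {f1_score(p, r):.3f}")
```

The printed values match the F1 column of the table to three decimal places.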

---

## Training Setup

* **Base model**: `xlm-roberta-base`
* **Max sequence length**: 256
* **Batch size**: 16
* **Learning rate**: 3e-5
* **Epochs**: 3
* **Weight decay**: 0.01
* **Optimizer**: AdamW
* **Framework**: Hugging Face Transformers `Trainer` API
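
The hyperparameters above could map onto the `Trainer` API roughly as follows. This is a minimal sketch, not the actual training script: the output directory is a hypothetical path, the batch size is assumed to be per device (training ran on a single GPU), and the `tokenize` helper assumes sentences were pre-split into words, which the card does not state.

```python
# Hyperparameters as listed above.
HPARAMS = {
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 16,  # assumption: "batch size" is per device
    "num_train_epochs": 3,
    "weight_decay": 0.01,
}
MAX_SEQ_LENGTH = 256  # applied at tokenization time, not by the Trainer


def build_training_args(output_dir: str = "northeast-ner-out"):
    """Build TrainingArguments; AdamW is the Trainer's default optimizer.

    `output_dir` is a hypothetical path used only for illustration.
    """
    from transformers import TrainingArguments  # imported lazily

    return TrainingArguments(output_dir=output_dir, **HPARAMS)


def tokenize(tokenizer, words):
    """Tokenize one pre-split sentence, truncating to MAX_SEQ_LENGTH tokens."""
    return tokenizer(
        words,
        is_split_into_words=True,
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
    )
```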

### Dataset Size
- Train set: ~20,000 sentences
- Dev set: ~5,000 sentences
- Sources: Gazetteers (districts, tribes, flora/fauna, festivals, tourist sites, names), news articles, tourism/cultural descriptions

### Environment

* **Transformers**: 4.44.2
* **Datasets**: 2.20.0
* **Evaluate**: 0.4.2
* **PyTorch**: 2.3.0+cu121
* **Python**: 3.11
* **Hardware**: Single NVIDIA A4500 GPU (20 GB VRAM), 62 GB RAM, 12 vCPU

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "MWirelabs/NortheastNER"

# Load the fine-tuned tokenizer and token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# "simple" aggregation merges sub-word tokens into whole entity spans.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Wangala festival is celebrated in Garo Hills near Tura."
print(ner(text))
```

Example output (abbreviated: the pipeline also returns `start` and `end` character offsets, and scores are shown rounded):

```python
[{'entity_group': 'FESTIVALS', 'word': 'Wangala', 'score': 0.99},
 {'entity_group': 'PLACES', 'word': 'Garo Hills', 'score': 0.98},
 {'entity_group': 'PLACES', 'word': 'Tura', 'score': 0.97}]
```
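
For downstream use it is often handier to group the predicted spans by entity type. The sketch below works on the list of dictionaries shown above; the helper name `group_entities` and the `min_score` cutoff of 0.5 are illustrative assumptions, not part of the model or the pipeline.

```python
from collections import defaultdict


def group_entities(predictions, min_score=0.5):
    """Map each entity_group to the words predicted with score >= min_score."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Predictions in the shape shown in the example output above.
sample = [
    {"entity_group": "FESTIVALS", "word": "Wangala", "score": 0.99},
    {"entity_group": "PLACES", "word": "Garo Hills", "score": 0.98},
    {"entity_group": "PLACES", "word": "Tura", "score": 0.97},
]

print(group_entities(sample))
# {'FESTIVALS': ['Wangala'], 'PLACES': ['Garo Hills', 'Tura']}
```

Raising `min_score` trades recall for precision, which may be worth considering for the low-support classes.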

---

## License

This model is licensed under the **Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)** license.

You are free to use, share, and adapt the model for non-commercial purposes with attribution.

---

### Data Licenses
- Gazetteers of villages and tribes: compiled by MWirelabs (open reference use).
- Festivals, tourist sites, and names: curated by MWirelabs team.
Please ensure attribution when reusing any derived dataset.

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{mwirelabs2025northeastner,
  title        = {NortheastNER: A Domain-Specific Named Entity Recognition Model for Northeast India},
  author       = {MWirelabs},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MWirelabs/NortheastNER}},
}
```

---

## Limitations
- Low support for **TOURIST** and **FAUNA** classes (few examples).
- **NAMES** entity class trained but not evaluated due to lack of dev set coverage.
- Possible confusion between **TRIBES** and **PLACES** where names overlap (e.g., Garo).
- Model optimized for Northeast India texts; performance outside this domain may degrade.

## Future Work
- Add more gold-labeled examples for underrepresented classes (NAMES, FAUNA, TOURIST).
- Explore active learning to identify low-confidence predictions for manual annotation.
- Expand coverage of festivals and indigenous knowledge domains.
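
One simple way to approach the active-learning item is uncertainty sampling: rank sentences by their least-confident predicted entity and send the lowest-ranked ones to annotators. The sketch below is a hypothetical illustration of that idea, not a planned implementation; the function name, the `budget` parameter, and the toy scores are all made up for the example.

```python
def select_for_annotation(sentence_predictions, budget=2):
    """Return the `budget` sentences whose least-confident entity scores lowest."""
    scored = []
    for sentence, preds in sentence_predictions.items():
        if not preds:
            continue  # no entities predicted, so this sketch has nothing to rank by
        scored.append((min(p["score"] for p in preds), sentence))
    scored.sort()  # ascending: least confident first
    return [sentence for _, sentence in scored[:budget]]


# Toy input: sentence -> pipeline-style predictions (illustrative scores only).
toy = {
    "sentence A": [{"entity_group": "PLACES", "score": 0.98}],
    "sentence B": [{"entity_group": "FAUNA", "score": 0.41}],
    "sentence C": [
        {"entity_group": "TOURIST", "score": 0.55},
        {"entity_group": "PLACES", "score": 0.97},
    ],
    "sentence D": [],
}

print(select_for_annotation(toy))
# ['sentence B', 'sentence C']
```

Sentences with no predicted entities are skipped here; in practice they may hide missed entities and could need a separate sampling strategy.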

---

## About

This model is developed by **MWirelabs**, pioneering AI solutions for the rich cultural and linguistic diversity of **Northeast India**.
Contact: [MWirelabs](https://huggingface.co/MWirelabs)