---
language: en
license: cc-by-nc-4.0
tags:
- token-classification
- ner
- northeast-india
- low-resource
- xlm-roberta
metrics:
- f1
- precision
- recall
model-index:
- name: MWirelabs/NortheastNER
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Custom Northeast India Gazetteers + News Corpus
      type: custom
      split: dev
    metrics:
    - name: Overall F1
      type: f1
      value: 0.964
    - name: Precision
      type: precision
      value: 0.962
    - name: Recall
      type: recall
      value: 0.967
---

# MWirelabs/NortheastNER

**NortheastNER** is a Named Entity Recognition (NER) model fine-tuned by [MWirelabs](https://huggingface.co/MWirelabs) to recognize entities specific to **Northeast India**. It is based on [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) and trained on a mix of gazetteers, curated news, and domain-specific data (tribes, villages, flora, fauna, festivals, tourist places).

---

## 🔎 What it can recognize

* **PLACES** → States, districts, villages, regions (e.g., *Shillong*, *Tura*, *Ri-Bhoi*)
* **TRIBES** → Indigenous tribes & sub-tribes (e.g., *Khasi*, *Nyishi*, *Wancho*)
* **FESTIVALS** → Local festivals (e.g., *Wangala*, *Losar*, *Nyokum Yullo*)
* **TOURIST** → Landmarks & tourist spots (e.g., *Tawang Monastery*, *Umiam Lake*)
* **FLORA** → Plants & crops of the Himalayan / NE region
* **FAUNA** → Animals, birds, wildlife from NE region

---

## 📊 Evaluation

Evaluated on a 5k-sentence dev set:

| Entity      | Precision                     | Recall    | F1        |
| ----------- | ----------------------------- | --------- | --------- |
| PLACES      | 0.963                         | 0.969     | 0.966     |
| TRIBES      | 0.927                         | 0.927     | 0.927     |
| FESTIVALS   | (coming soon, fewer examples) |           |           |
| TOURIST     | 0.167                         | 0.125     | 0.143     |
| FLORA       | 1.000                         | 0.800     | 0.889     |
| FAUNA       | 0.000                         | 0.000     | 0.000     |
| **Overall** | **0.962**                     | **0.967** | **0.964** |

⚠️ Low scores for **TOURIST / FAUNA** due to very few training examples; performance will improve with more labeled data.
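
As a sanity check on the table above, each F1 value can be recomputed as the harmonic mean of the reported precision and recall. The sketch below does that and also computes an unweighted (macro) average over the five classes that have scores. The function names are illustrative and not part of the model's code, and the card does not state how the Overall row was aggregated; the gap between it and the macro average suggests it is dominated by the frequent classes, but that is an assumption.

```python
# Precision, recall, and F1 as reported in the evaluation table.
reported = {
    "PLACES": (0.963, 0.969, 0.966),
    "TRIBES": (0.927, 0.927, 0.927),
    "TOURIST": (0.167, 0.125, 0.143),
    "FLORA": (1.000, 0.800, 0.889),
    "FAUNA": (0.000, 0.000, 0.000),
    "Overall": (0.962, 0.967, 0.964),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(rows: dict) -> float:
    """Unweighted mean of per-class F1, skipping the aggregate 'Overall' row."""
    scores = [f1_score(p, r) for name, (p, r, _) in rows.items() if name != "Overall"]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for name, (p, r, f1) in reported.items():
        print(f"{name:8s} reported={f1:.3f} recomputed={f1_score(p, r):.3f}")
    # The macro average weights rare classes (TOURIST, FAUNA) equally with PLACES.
    print(f"macro F1 over reported classes: {macro_f1(reported):.3f}")
```

Every recomputed value matches the table to three decimals, while the macro average comes out near 0.585, far below the Overall F1 of 0.964. When comparing this model with others, check which averaging each reported score uses.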

Note: The current evaluation set does not include enough examples of **NAMES**, so that category is not reported in the table. Training data did include a small gazetteer of Khasi and regional names (~81 entries), but more labeled examples are needed for meaningful evaluation.

---

## ⚙️ Training Setup

* **Base model**: `xlm-roberta-base`
* **Max sequence length**: 256
* **Batch size**: 16
* **Learning rate**: 3e-5
* **Epochs**: 3
* **Weight decay**: 0.01
* **Optimizer**: AdamW
* **Framework**: HuggingFace Transformers Trainer API

### 📦 Dataset Size

- Train set: ~20,000 sentences
- Dev set: ~5,000 sentences
- Sources: Gazetteers (districts, tribes, flora/fauna, festivals, tourist sites, names), news articles, tourism/cultural descriptions

### 🔧 Environment

* **Transformers**: 4.44.2
* **Datasets**: 2.20.0
* **Evaluate**: 0.4.2
* **PyTorch**: 2.3.0+cu121
* **Python**: 3.11
* **Hardware**: Single NVIDIA A4500 GPU (20 GB VRAM), 62 GB RAM, 12 vCPU

---

## 🚀 Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "MWirelabs/NortheastNER"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Wangala festival is celebrated in Garo Hills near Tura."
print(ner(text))
```

Output:

```python
[{'entity_group': 'FESTIVALS', 'word': 'Wangala', 'score': 0.99},
 {'entity_group': 'PLACES', 'word': 'Garo Hills', 'score': 0.98},
 {'entity_group': 'PLACES', 'word': 'Tura', 'score': 0.97}]
```

---

## 📜 License

This model is licensed under the **Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)** license.
You are free to use, share, and adapt the model for non-commercial purposes with attribution.

---

### 🗂 Data Licenses

- Gazetteers of villages and tribes: compiled by MWirelabs (open reference use).
- Festivals, tourist sites, and names: curated by MWirelabs team.

Please ensure attribution when reusing any derived dataset.

---

## 📖 Citation

If you use this model in your research, please cite:

```bibtex
@misc{mwirelabs2025northeastner,
  title        = {NortheastNER: A Domain-Specific Named Entity Recognition Model for Northeast India},
  author       = {MWirelabs},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MWirelabs/NortheastNER}},
}
```

---

## ⚠️ Limitations

- Low support for **TOURIST** and **FAUNA** classes (few examples).
- **NAMES** entity class trained but not evaluated due to lack of dev set coverage.
- Possible confusion between **TRIBES** and **PLACES** where names overlap (e.g., Garo).
- Model optimized for Northeast India texts; performance outside this domain may degrade.

## 🔮 Future Work

- Add more gold-labeled examples for underrepresented classes (Names, Fauna, Tourist).
- Explore active learning to identify low-confidence predictions for manual annotation.
- Expand coverage of festivals and indigenous knowledge domains.

---

## 🏢 About

This model is developed by **MWirelabs**, pioneering AI solutions for the rich cultural and linguistic diversity of **Northeast India**.

Contact: [MWirelabs](https://huggingface.co/MWirelabs)