# WG-IndicBERT

A token classification model fine-tuned from IndicBERT v2 for Indic Word Grouping (Local Word Group identification). Developed as part of the BHASHA 2025 Shared Task 2: IndicWG.
- Developed by: Manav Dhamecha, Gaurav Damor, Sunil Choudhary, Pruthwik Mishra
- License: MIT
- Base model: ai4bharat/IndicBERTv2-MLM-only
- Paper: Team Horizon at BHASHA Task 2
- Repository: manavdhamecha77/IndicWG2025
- GitHub.io: Indic Word Grouping
## What it does
Given an input sentence in Hindi, the model identifies Local Word Groups (LWGs) — semantically cohesive sequences of words that convey a single complete meaning (e.g., noun compounds, postpositional groups, verb groups with auxiliaries, light verb constructions).
The task is modeled as BIO token classification with three labels: B (beginning of a group), I (inside a group), O (outside / delimiter). The output is reconstructed into grouped sentences using __ as the group boundary separator.
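A minimal sketch of the reconstruction step, under one reading of the convention above (here `__` is inserted between consecutive units; `bio_to_grouped` is an illustrative name, not from the released code):

```python
def bio_to_grouped(words, tags):
    """Rebuild a grouped sentence from word-level BIO tags.

    B starts a new group, I extends the current one, and O words
    stand alone; "__" marks the boundary between consecutive units.
    """
    units = []
    for word, tag in zip(words, tags):
        if tag == "I" and units:
            units[-1].append(word)  # continue the current group
        else:
            units.append([word])    # B or O opens a new unit
    return " __ ".join(" ".join(unit) for unit in units)

words = ["राम", "ने", "बाजार", "से", "सब्जियां", "खरीदीं", "।"]
tags  = ["B",   "I",  "B",     "I",  "O",        "O",      "O"]
print(bio_to_grouped(words, tags))
# राम ने __ बाजार से __ सब्जियां __ खरीदीं __ ।
```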
**Exact Match Accuracy:** 52.73% on the official test set.
## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "manavdhamecha77/WG-IndicBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

sentence = "राम ने बाजार से सब्जियां खरीदीं।"
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128).to(device)

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Label map: {0: "B", 1: "I", 2: "O"}
id2label = {0: "B", 1: "I", 2: "O"}
for token, pred in zip(tokens, predictions):
    if token not in ("[CLS]", "[SEP]", "[PAD]"):
        print(f"{token:20s} {id2label[pred]}")
```
## Training Details
The model is fine-tuned using AutoModelForTokenClassification with a class-weighted cross-entropy loss to address the dominant O-label imbalance. Labels are aligned to subword tokens using the tokenizer's word_ids() helper; only the first subword of each word is labeled, with subsequent subwords set to -100.
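The alignment step described above can be sketched as follows (a minimal version that takes the `word_ids()` output of a fast tokenizer directly; `align_labels` is an illustrative name, not from the released code):

```python
LABEL2ID = {"B": 0, "I": 1, "O": 2}

def align_labels(word_ids, word_labels):
    """Align word-level BIO labels to subword tokens.

    word_ids: tokenizer(...).word_ids() for one sentence, e.g.
              [None, 0, 0, 1, None] (None marks special tokens).
    Only the first subword of each word gets a real label; special
    tokens and continuation subwords are masked with -100 so the
    loss ignores them.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)                       # [CLS], [SEP], [PAD]
        elif wid != previous:
            aligned.append(LABEL2ID[word_labels[wid]])  # first subword
        else:
            aligned.append(-100)                       # continuation subword
        previous = wid
    return aligned
```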
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 3×10⁻⁵ |
| Batch Size | 8 (train/eval) |
| Epochs | 20 |
| Weight Decay | 0.01 |
| Label Map | B:0, I:1, O:2 |
| Hardware | H100 GPU (94GB) |
Training data: Official BHASHA/IndicWG Hindi dataset (550 train / 100 dev / 226 test sentences), supplemented with 5K augmented sentences from a rule-based LWG finder applied to IndicCorp.
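One way to realize the class-weighted cross-entropy is shown below. The weights are illustrative (the exact values used in training are not stated here); the idea is to down-weight the dominant O label relative to B and I:

```python
import torch
import torch.nn as nn

# Illustrative per-class weights for labels B, I, O — not the paper's values.
class_weights = torch.tensor([2.0, 2.0, 0.5])
loss_fct = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

# Dummy batch: (batch, seq_len, num_labels) logits and aligned labels,
# with -100 masking special tokens as in the alignment step above.
logits = torch.randn(1, 5, 3)
labels = torch.tensor([[-100, 0, 1, 2, -100]])
loss = loss_fct(logits.view(-1, 3), labels.view(-1))
```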
## Evaluation
| Model | Dev EM (%) | Test EM (%) |
|---|---|---|
| MuRIL | 46.58 | 58.18 |
| XLM-RoBERTa | 39.00 | 53.36 |
| IndicBERT v2 (this model) | 35.40 | 52.73 |
Evaluation metric: Exact Match Accuracy — a prediction is correct only if the entire reconstructed grouped sentence matches the gold output exactly.
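The metric itself is a few lines (a sketch; the official scorer may additionally normalize whitespace):

```python
def exact_match_accuracy(predictions, references):
    """Percentage of sentences whose reconstructed grouped output is
    string-identical to the gold grouping."""
    matches = sum(pred == gold for pred, gold in zip(predictions, references))
    return 100.0 * matches / len(references)
```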
## Limitations
- Trained and evaluated on Hindi only; generalization to other Indic languages is not guaranteed.
- Performance degrades on longer sentences (>40 words).
- IndicBERT v2 was pretrained with MLM only, without task-specific fine-tuning on sequence labeling, which may explain its slightly lower performance compared to MuRIL on this task.
- Gold annotations contain inconsistencies in multiword expressions and light-verb constructions, which caps achievable exact-match accuracy.
## Citation

```bibtex
@inproceedings{dhamecha2025horizonwg,
  title     = {Team Horizon at {BHASHA} Task 2: Fine-tuning Multilingual Transformers for Indic Word Grouping},
  author    = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
  booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
  year      = {2025},
  url       = {https://aclanthology.org/2025.bhasha-1.18/}
}
```