WG-IndicBERT

A token classification model fine-tuned from IndicBERT v2 for Indic Word Grouping (Local Word Group identification). Developed as part of the BHASHA 2025 Shared Task 2: IndicWG.


What it does

Given an input sentence in Hindi, the model identifies Local Word Groups (LWGs) — semantically cohesive sequences of words that convey a single complete meaning (e.g., noun compounds, postpositional groups, verb groups with auxiliaries, light verb constructions).

The task is modeled as BIO token classification with three labels: B (beginning of a group), I (inside a group), O (outside / delimiter). The output is reconstructed into grouped sentences using __ as the group boundary separator.
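A minimal sketch of the reconstruction step, assuming words inside a group are joined with spaces and groups are delimited by `__` (the exact whitespace convention around the separator is an assumption, not the official output spec):

```python
def reconstruct(words, tags):
    """Rebuild a grouped sentence from word-level BIO tags."""
    groups = []
    for word, tag in zip(words, tags):
        if tag in ("B", "O") or not groups:
            groups.append([word])        # new group; O-words stand alone
        else:                            # "I": extend the current group
            groups[-1].append(word)
    return " __ ".join(" ".join(g) for g in groups)

# Illustrative tags only (not actual model output):
words = ["राम", "ने", "बाजार", "से", "सब्जियां", "खरीदीं", "।"]
tags  = ["B",   "I",  "B",     "I",  "O",        "O",      "O"]
print(reconstruct(words, tags))
# → राम ने __ बाजार से __ सब्जियां __ खरीदीं __ ।
```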

Exact Match Accuracy: 52.73% on the official test set


Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "manavdhamecha77/WG-IndicBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # disable dropout for inference

sentence = "राम ने बाजार से सब्जियां खरीदीं।"

inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128).to(device)

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

id2label = {0: "B", 1: "I", 2: "O"}  # label map used during training

for token, pred in zip(tokens, predictions):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        print(f"{token:20s} {id2label[pred]}")
```
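Because the model labels subwords, you typically want word-level tags before reconstructing groups. A sketch of that step, assuming a fast tokenizer so `word_ids()` is available on the encoding (this helper is illustrative, not part of the released code):

```python
def collapse_to_words(word_ids, preds, id2label):
    """Keep the first subword's prediction for each word; return word-level tags."""
    tags = {}
    for wid, pred in zip(word_ids, preds):
        if wid is not None and wid not in tags:   # skip specials; keep first subword
            tags[wid] = id2label[pred]
    return [tags[i] for i in sorted(tags)]

# With the Quick Start variables above, this would be called as:
#   word_tags = collapse_to_words(inputs.word_ids(0), predictions, id2label)
```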

Training Details

The model is fine-tuned using AutoModelForTokenClassification with a class-weighted cross-entropy loss to address the dominant O-label imbalance. Labels are aligned to subword tokens using the tokenizer's word_ids() helper; only the first subword of each word is labeled, with subsequent subwords set to -100.
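The alignment step described above can be sketched as follows (an assumed helper for illustration, not the authors' exact code):

```python
def align_labels(word_ids, word_labels, label2id):
    """Label only the first subword of each word; mask the rest with -100
    so they are ignored by the cross-entropy loss."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # special tokens ([CLS], [SEP], padding)
            aligned.append(-100)
        elif wid != prev:          # first subword of a new word
            aligned.append(label2id[word_labels[wid]])
        else:                      # continuation subword
            aligned.append(-100)
        prev = wid
    return aligned
```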

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 3×10⁻⁵ |
| Batch Size | 8 (train/eval) |
| Epochs | 20 |
| Weight Decay | 0.01 |
| Label Map | B:0, I:1, O:2 |
| Hardware | H100 GPU (94 GB) |

Training data: Official BHASHA/IndicWG Hindi dataset (550 train / 100 dev / 226 test sentences), supplemented with 5K augmented sentences from a rule-based LWG finder applied to IndicCorp.


Evaluation

| Model | Dev EM (%) | Test EM (%) |
|---|---|---|
| MuRIL | 46.58 | 58.18 |
| XLM-RoBERTa | 39.00 | 53.36 |
| IndicBERT v2 (this model) | 35.40 | 52.73 |

Evaluation metric: Exact Match Accuracy — a prediction is correct only if the entire reconstructed grouped sentence matches the gold output exactly.
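The metric can be sketched in a few lines (a straightforward reading of the definition above, not the official scorer):

```python
def exact_match_accuracy(predictions, references):
    """Percentage of grouped sentences that match the gold output exactly."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```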


Limitations

  • Trained and evaluated on Hindi only; generalization to other Indic languages is not guaranteed.
  • Performance degrades on longer sentences (>40 words).
  • IndicBERT v2 is pretrained with MLM only, with no sequence-labeling objective, which may explain its lower exact-match scores compared to MuRIL on this task.
  • Gold annotations contain inconsistencies in multiword expressions and light-verb constructions, which caps achievable exact-match accuracy.

Citation

@inproceedings{dhamecha2025horizonwg,
  title     = {Team Horizon at {BHASHA} Task 2: Fine-tuning Multilingual Transformers for Indic Word Grouping},
  author    = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
  booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
  year      = {2025},
  url       = {https://aclanthology.org/2025.bhasha-1.18/}
}