# WG-IndicBERT

A token classification model fine-tuned from IndicBERT v2 for Indic Word Grouping (Local Word Group identification). Developed as part of the BHASHA 2025 Shared Task 2: IndicWG.
- Developed by: Manav Dhamecha, Gaurav Damor, Sunil Choudhary, Pruthwik Mishra
- License: MIT
- Base model: ai4bharat/IndicBERTv2-MLM-only
- Paper: Team Horizon at BHASHA Task 2
- Repository: manavdhamecha77/IndicWG2025
- GitHub.io: Indic Word Grouping
## What it does
Given an input sentence in Hindi, the model identifies Local Word Groups (LWGs) — semantically cohesive sequences of words that convey a single complete meaning (e.g., noun compounds, postpositional groups, verb groups with auxiliaries, light verb constructions).
The task is modeled as BIO token classification with three labels: B (beginning of a group), I (inside a group), O (outside / delimiter). The output is reconstructed into grouped sentences using __ as the group boundary separator.
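A minimal sketch of the reconstruction step, under one reading of the convention above (here `__` is inserted between consecutive units; `bio_to_grouped` is an illustrative name, not from the released code):

```python
def bio_to_grouped(words, tags):
    """Rebuild a grouped sentence from word-level BIO tags.

    B starts a new group, I extends the current one, and O words
    stand alone; "__" marks the boundary between consecutive units.
    """
    units = []
    for word, tag in zip(words, tags):
        if tag == "I" and units:
            units[-1].append(word)  # continue the current group
        else:
            units.append([word])    # B or O opens a new unit
    return " __ ".join(" ".join(unit) for unit in units)

words = ["राम", "ने", "बाजार", "से", "सब्जियां", "खरीदीं", "।"]
tags  = ["B",   "I",  "B",     "I",  "O",        "O",      "O"]
print(bio_to_grouped(words, tags))
# राम ने __ बाजार से __ सब्जियां __ खरीदीं __ ।
```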
**Exact Match Accuracy:** 52.73% on the official test set.
## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "manavdhamecha77/WG-IndicBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

sentence = "राम ने बाजार से सब्जियां खरीदीं।"
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128).to(device)

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Label map: {0: "B", 1: "I", 2: "O"}
id2label = {0: "B", 1: "I", 2: "O"}
for token, pred in zip(tokens, predictions):
    if token not in ("[CLS]", "[SEP]", "[PAD]"):
        print(f"{token:20s} {id2label[pred]}")
```
## Training Details
The model is fine-tuned using AutoModelForTokenClassification with a class-weighted cross-entropy loss to address the dominant O-label imbalance. Labels are aligned to subword tokens using the tokenizer's word_ids() helper; only the first subword of each word is labeled, with subsequent subwords set to -100.
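The alignment step described above can be sketched as follows (a minimal version that takes the `word_ids()` output of a fast tokenizer directly; `align_labels` is an illustrative name, not from the released code):

```python
LABEL2ID = {"B": 0, "I": 1, "O": 2}

def align_labels(word_ids, word_labels):
    """Align word-level BIO labels to subword tokens.

    word_ids: tokenizer(...).word_ids() for one sentence, e.g.
              [None, 0, 0, 1, None] (None marks special tokens).
    Only the first subword of each word gets a real label; special
    tokens and continuation subwords are masked with -100 so the
    loss ignores them.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)                       # [CLS], [SEP], [PAD]
        elif wid != previous:
            aligned.append(LABEL2ID[word_labels[wid]])  # first subword
        else:
            aligned.append(-100)                       # continuation subword
        previous = wid
    return aligned
```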
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 3×10⁻⁵ |
| Batch Size | 8 (train/eval) |
| Epochs | 20 |
| Weight Decay | 0.01 |
| Label Map | B:0, I:1, O:2 |
| Hardware | H100 GPU (94GB) |
Training data: Official BHASHA/IndicWG Hindi dataset (550 train / 100 dev / 226 test sentences), supplemented with 5K augmented sentences from a rule-based LWG finder applied to IndicCorp.
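One way to realize the class-weighted cross-entropy is shown below. The weights are illustrative (the exact values used in training are not stated here); the idea is to down-weight the dominant O label relative to B and I:

```python
import torch
import torch.nn as nn

# Illustrative per-class weights for labels B, I, O — not the paper's values.
class_weights = torch.tensor([2.0, 2.0, 0.5])
loss_fct = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

# Dummy batch: (batch, seq_len, num_labels) logits and aligned labels,
# with -100 masking special tokens as in the alignment step above.
logits = torch.randn(1, 5, 3)
labels = torch.tensor([[-100, 0, 1, 2, -100]])
loss = loss_fct(logits.view(-1, 3), labels.view(-1))
```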
## Evaluation
| Model | Dev EM (%) | Test EM (%) |
|---|---|---|
| MuRIL | 46.58 | 58.18 |
| XLM-RoBERTa | 39.00 | 53.36 |
| IndicBERT v2 (this model) | 35.40 | 52.73 |
Evaluation metric: Exact Match Accuracy — a prediction is correct only if the entire reconstructed grouped sentence matches the gold output exactly.
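The metric itself is a few lines (a sketch; the official scorer may additionally normalize whitespace):

```python
def exact_match_accuracy(predictions, references):
    """Percentage of sentences whose reconstructed grouped output is
    string-identical to the gold grouping."""
    matches = sum(pred == gold for pred, gold in zip(predictions, references))
    return 100.0 * matches / len(references)
```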
## Limitations
- Trained and evaluated on Hindi only; generalization to other Indic languages is not guaranteed.
- Performance degrades on longer sentences (>40 words).
- IndicBERT v2 was pretrained with MLM only, without task-specific fine-tuning on sequence labeling, which may explain its slightly lower performance compared to MuRIL on this task.
- Gold annotations contain inconsistencies in multiword expressions and light-verb constructions, which caps achievable exact-match accuracy.
## Citation

```bibtex
@inproceedings{dhamecha2025horizonwg,
  title     = {Team Horizon at {BHASHA} Task 2: Fine-tuning Multilingual Transformers for Indic Word Grouping},
  author    = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
  booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
  year      = {2025},
  url       = {https://aclanthology.org/2025.bhasha-1.18/}
}
```