Model Card for COMI-LINGUA-POS

Model Description

This is a fine-tuned version of aya-expanse-8b for Part-of-Speech (POS) Tagging on Hinglish (Hindi-English code-mixed) text. It assigns a grammatical category to each token using a language-agnostic Universal POS tagset suitable for code-mixed content in Roman and Devanagari scripts.

Supported tags: NOUN, PROPN, VERB, ADJ, ADV, ADP, PRON, DET, CONJ, PART, PRON_WH, PART_NEG, NUM, X (for typos, punctuation, abbreviations, foreign elements).

It achieves 88.61 macro F1 on the COMI-LINGUA POS test set (5K instances), slightly outperforming the specialized codeswitch toolkit (88.2 F1 zero-shot) and surpassing strong zero- and one-shot LLMs.

  • Model type: LoRA-adapted Transformer LLM (8B params, ~32M trainable)
  • License: apache-2.0
  • Finetuned from model: CohereForAI/aya-expanse-8b

Uses

  • POS tagging in Hinglish pipelines (e.g., syntactic analysis, downstream tasks like dependency parsing, sentiment analysis, machine translation on code-mixed social media/news text).

  • Preprocessing for structure-sensitive NLP in mixed-language content.

  • Example inference prompt:

Assign Part-of-Speech (POS) tags to each token in the sentence given as: "मीराबाई चानू ने 21 st Commonwealth Games में India के लिए first Gold medal जीता था।"
Output: [{'मीराबाई': 'PROPN'}, {'चानू': 'PROPN'}, {'ने': 'PART'}, {'21': 'NUM'}, {'st': 'X'}, {'Commonwealth': 'PROPN'}, {'Games': 'PROPN'}, {'में': 'ADP'}, {'India': 'PROPN'}, {'के': 'ADP'}, {'लिए': 'ADP'}, {'first': 'ADJ'}, {'Gold': 'NOUN'}, {'medal': 'NOUN'}, {'जीता': 'VERB'}, {'था': 'VERB'}, {'।': 'X'}]
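The model emits its tags as a Python-literal list of single-key token→tag mappings, as in the output above. A minimal parsing sketch for turning that string into (token, tag) pairs for downstream use (the helper name `parse_pos_output` is our own, not part of the model's API):

```python
import ast

def parse_pos_output(raw: str) -> list[tuple[str, str]]:
    """Parse the model's output — a list of single-key {token: tag}
    dicts in Python-literal syntax — into (token, tag) pairs."""
    entries = ast.literal_eval(raw)
    pairs = []
    for entry in entries:
        (token, tag), = entry.items()  # each dict holds exactly one token
        pairs.append((token, tag))
    return pairs

raw = "[{'मीराबाई': 'PROPN'}, {'21': 'NUM'}, {'st': 'X'}]"
print(parse_pos_output(raw))
# → [('मीराबाई', 'PROPN'), ('21', 'NUM'), ('st', 'X')]
```

`ast.literal_eval` is used rather than `json.loads` because the model's output uses single-quoted keys, which are valid Python literals but not valid JSON.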

Training Details

Training Data

COMI-LINGUA Dataset Card.

Training Procedure

Preprocessing

Tokenized with the base model's tokenizer; instruction templates with few-shot examples. Instances were filtered to ≥5 tokens, excluding hate speech and non-Hinglish text, keeping the data focused on code-mixed content.

Training Hyperparameters

  • Regime: PEFT LoRA (rank=32, alpha=64, dropout=0.1)
  • Epochs: 3
  • Batch: 4 (accum=8, effective=32)
  • LR: 2e-4 (cosine + warmup=0.1)
  • Weight decay: 0.01

Evaluation

Testing Data

COMI-LINGUA POS test set (5K instances).

Metrics

Macro Precision / Recall / F1 (token-level).
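Macro averaging computes precision, recall, and F1 per tag and then averages them unweighted, so rare tags (e.g. PART_NEG) count as much as frequent ones (e.g. NOUN). A minimal hand-rolled token-level sketch, not tied to any evaluation library:

```python
from collections import defaultdict

def macro_prf(gold: list[str], pred: list[str]) -> tuple[float, float, float]:
    """Token-level macro precision/recall/F1 over all tags in gold or pred."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[g] += 1  # g was missed
    tags = set(tp) | set(fp) | set(fn)
    ps, rs, fs = [], [], []
    for t in tags:
        prec = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        rec = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(tags)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

gold = ["NOUN", "VERB", "NOUN", "ADP"]
pred = ["NOUN", "VERB", "ADJ", "ADP"]
print(macro_prf(gold, pred))  # → (0.75, 0.625, 0.666...)
```

Note that macro F1 here averages per-tag F1 scores rather than computing F1 from the macro-averaged precision and recall; conventions differ between libraries, so results should state which is used.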

Results

Setting      P      R      F1
Zero-shot    76.92  29.50  40.55
One-shot     55.29  48.70  48.20
Fine-tuned   88.97  88.55  88.61

Summary: The fine-tuned model is competitive with the state of the art among open-weight models and edges out the specialized codeswitch toolkit on this high-quality benchmark; fine-tuning closes the gap with closed LLMs and handles script variability and code-mixing effectively.

Bias, Risks, and Limitations

This model is a research preview and is subject to ongoing iterative updates. As such, it provides only limited safety measures.

Model Card Contact

Lingo Research Group at IIT Gandhinagar, India
Mail at: lingo@iitgn.ac.in

Citation

If you use this model, please cite the following work:

@inproceedings{sheth-etal-2025-comi,
    title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
    author = "Sheth, Rajvee  and
      Beniwal, Himanshu  and
      Singh, Mayank",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.422/",
    pages = "7973--7992",
    ISBN = "979-8-89176-335-7",
}
