๐Ÿท๏ธ MahaBERT v2 โ€” DAFT Fine-Tuned Model

Authors: Tanvi Somani, Aryan Babare, Samyak Bora
Model Type: Masked Language Model (MLM)
Base Model: MahaBERT v2
Training Method: DAFT (Domain-Adaptive Fine-Tuning)
Language: Marathi 🇮🇳
Framework: HuggingFace Transformers


📌 Overview

mahabert-v2-daft is a Domain-Adaptive Fine-Tuned (DAFT) version of the original MahaBERT v2 model.
The model has been further pre-trained on a large collection of unlabeled Marathi text to adapt it to the domains encountered in downstream Marathi NLP tasks such as:

  • Sentiment analysis
  • Text classification
  • Emotion detection
  • Topic analysis
  • General Marathi language understanding tasks

DAFT helps the model learn domain-specific vocabulary, patterns, and semantics, improving downstream task performance without requiring labeled data.
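As a concrete illustration of downstream use, the DAFT checkpoint can serve as the backbone for one of the tasks listed above, e.g. sentiment analysis. The three-way label scheme below is an assumption for the sketch, not something shipped with this repository:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "aryanx16/mahabert-v2-daft"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A fresh classification head is attached on top of the DAFT-adapted encoder;
# num_labels=3 assumes a hypothetical negative / neutral / positive scheme.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)
```

Loading an MLM checkpoint this way prints a warning that the classification head is newly initialized; that is expected, since the head still has to be trained on labeled data.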


🧠 What is DAFT?

DAFT (Domain-Adaptive Fine-Tuning) is a continued pre-training method where you:

  • Start from a pretrained language model
  • Feed it large amounts of unlabeled, in-domain text
  • Continue training with the Masked Language Modeling (MLM) objective

This process improves the model's understanding of domain-specific words, idioms, and sentence structures.


🚀 Training Details

  • Base Model: MahaBERT v2
  • Method: DAFT (continued pre-training)
  • Objective: Masked Language Modeling (MLM)
  • Dataset: Unlabeled Marathi text (domain-specific)
  • Batch Size: As per Colab training setup
  • Training Steps: Several thousand (as seen in training logs)
  • Hardware: Google Colab (T4 GPU)
  • Optimizer: AdamW
  • Precision: FP32
  • Parameters: ~0.2B

๐Ÿ“ Files Included

The repository contains:

  • model.safetensors – model weights
  • config.json – model architecture
  • tokenizer.json, tokenizer_config.json – tokenizer settings
  • vocab.txt – BERT vocabulary
  • special_tokens_map.json – CLS, SEP, PAD, MASK tokens
  • training_args.bin – training configuration

🧩 Usage

🔹 Load the Model

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "aryanx16/mahabert-v2-daft"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
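Once loaded, a quick fill-mask check exercises the MLM head; the Marathi example sentence below is just an illustration:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="aryanx16/mahabert-v2-daft")

# "Pune is a beautiful [MASK]." – the model should suggest Marathi nouns.
masked = f"पुणे हे एक सुंदर {fill.tokenizer.mask_token} आहे."
preds = fill(masked)
for p in preds:
    print(p["token_str"], round(p["score"], 3))
```

Each prediction carries the filled-in token (`token_str`) and its probability (`score`); by default the pipeline returns the top 5 candidates.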