
# ABIRMARv1 — Marathi-First Transformer Language Model

ABIRMARv1 is a custom transformer-based causal language model designed specifically for Marathi language understanding and generation. It is trained on curated Indic datasets with a custom tokenizer, optimized for efficient and accurate Marathi NLP.

The model aims to provide strong contextual understanding, efficient inference, and a foundation for further Marathi fine-tuning.


# Model Details

## Model Description

ABIRMARv1 is a decoder-only transformer language model designed for Marathi text generation and understanding. It builds upon Marathi-focused datasets and architecture optimizations to provide reliable Marathi NLP performance.

- Developed by: Abir Maheshwari
- Funded by: Independent research
- Shared by: Abir Maheshwari
- Model type: Causal language model (decoder-only Transformer)
- Language(s): Marathi
- License: MIT
- Base model: None (trained from scratch)

## Model Sources

- Repository: https://huggingface.co/abirmaheshwari/abirmarv1

# Uses

## Direct Use

ABIRMARv1 is suitable for:

- Marathi text generation
- Marathi conversational AI and chatbots
- Text completion
- NLP research and educational use

Example applications:

- Marathi AI assistants
- Marathi content generation
- Marathi NLP research

## Downstream Use

This model can be fine-tuned for:

- Marathi instruction-following models
- Question answering
- Domain-specific Marathi NLP tasks
- Conversational AI systems

## Out-of-Scope Use

Not recommended for:

- Medical or legal advice
- Safety-critical or high-risk decision systems

This is an early-stage research model.


# Bias, Risks, and Limitations

ABIRMARv1 may:

- Produce incorrect or incomplete outputs
- Reflect biases present in its training data
- Generate nonsensical responses in complex scenarios

These limitations are expected for models trained on limited or domain-specific datasets.


## Recommendations

Use this model for:

- Research and experimentation
- Marathi AI development
- Fine-tuning and further improvement

It is not recommended for production use without additional fine-tuning and evaluation.


# How to Get Started

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("abirmaheshwari/abirmarv1")
model = AutoModelForCausalLM.from_pretrained("abirmaheshwari/abirmarv1")

input_text = "महाराष्ट्र हा भारतातील एक महत्त्वाचा राज्य आहे कारण"

inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,  # prefer max_new_tokens over the deprecated max_length
    do_sample=True,      # temperature has no effect under greedy decoding
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

# Training Details

## Training Data

ABIRMARv1 was trained using curated Marathi language datasets designed to provide strong linguistic coverage and contextual understanding.

The training datasets include:

- ai4bharat/IndicCorpV2  
- ai4bharat/Bhasha-Abhijnaanam  

These datasets contain high-quality Marathi text covering multiple domains, enabling robust Marathi language modeling.

---

## Training Procedure

### Preprocessing

The dataset was processed using a custom-trained Byte Pair Encoding (BPE) tokenizer optimized for Marathi language modeling.

Tokenizer specifications:

- Vocabulary size: 32,000 tokens  
- Maximum sequence length: 512 tokens  
- Tokenizer trained from scratch on Marathi-focused datasets  
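
The card does not include the tokenizer-training code; the toy BPE merge loop below (pure Python, illustrative only) shows the core idea of learning frequent symbol merges from a corpus:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Toy BPE: start from characters, repeatedly merge the most frequent pair."""
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word) + ("</w>",)] += 1  # end-of-word marker
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Frequent character pairs in the corpus become single tokens.
print(learn_bpe("मराठी मराठी मराठी भाषा", 3))
```

A production tokenizer (e.g. the Hugging Face `tokenizers` library) adds byte-level fallback, normalization, and a vocabulary cap such as the 32,000 tokens listed above.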

---

### Training Hyperparameters

The model was trained using the following configuration:

- Optimizer: AdamW  
- Learning rate: 5e-5  
- Precision: FP16 mixed precision  
- Training objective: Causal Language Modeling  
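
A minimal sketch of one optimization step under the stated objective and optimizer (a toy stand-in model on CPU; the real run used fp16 mixed precision via AMP, omitted here, and all shapes below are illustrative assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in: tied embedding + output head, so the causal-LM objective
# and AdamW settings are visible without the full network.
vocab_size, d_model = 100, 32
emb = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size, bias=False)
head.weight = emb.weight  # weight tying, as in the model card

opt = torch.optim.AdamW(emb.parameters(), lr=5e-5)  # learning rate from the card

tokens = torch.randint(0, vocab_size, (2, 16))  # (batch, seq_len)
logits = head(emb(tokens))                      # (batch, seq_len, vocab)

# Causal LM objective: predict token t+1 from positions <= t (shift by one).
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
loss.backward()
opt.step()
print(f"loss: {loss.item():.2f}")
```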

---

### Training Hardware

Training was performed using GPU acceleration.

- GPU: NVIDIA GPU (CUDA-enabled)  
- Framework: PyTorch  
- Library: HuggingFace Transformers  

---

# Evaluation

## Testing Data

Evaluation was conducted using Marathi text samples representative of real-world Marathi language usage.

---

## Metrics

Evaluation metrics included:

- BLEU score  
- Training loss monitoring  
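
The card does not state which BLEU implementation was used; the simplified pure-Python sketch below (clipped n-gram precision with a brevity penalty) illustrates what the metric measures:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: clipped n-gram precision + brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # real BLEU applies smoothing here; we just short-circuit
        precisions.append(overlap / total)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

cand = "महाराष्ट्र हे एक राज्य आहे".split()
ref = "महाराष्ट्र हे एक राज्य आहे".split()
print(sentence_bleu(cand, ref))  # identical sentences score 1.0
```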

---

## Results

ABIRMARv1 demonstrates successful learning of:

- Marathi sentence structure  
- Context-aware text generation  
- Marathi token relationships  
- Language continuity and coherence  

The model provides functional Marathi generation capability suitable for research and fine-tuning applications.

---

# Technical Specifications

## Architecture

ABIRMARv1 uses a decoder-only Transformer architecture consisting of:

- Token embedding layer  
- Learned positional embeddings  
- Multi-head self-attention layers  
- Feedforward neural network layers  
- GELU activation function  
- Weight tying between embedding and output layers  
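
The components above can be sketched in PyTorch; the hidden size, head count, and layer count below are small illustrative assumptions, not ABIRMARv1's actual dimensions:

```python
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    # Illustrative shapes only; the real ABIRMARv1 configuration is not published here.
    def __init__(self, vocab_size=32_000, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # token embedding layer
        self.pos_emb = nn.Embedding(max_len, d_model)     # learned positional embeddings
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True,          # GELU, as in the card
        )
        # Decoder-only = self-attention blocks restricted by a causal mask.
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight            # weight tying

    def forward(self, tokens):
        _, t = tokens.shape
        x = self.tok_emb(tokens) + self.pos_emb(torch.arange(t, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t)  # causal mask
        return self.head(self.blocks(x, mask=mask))

model = TinyDecoderLM()
logits = model(torch.randint(0, 32_000, (1, 16)))
print(tuple(logits.shape))  # (1, 16, 32000)
```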

---

## Model Size

- Total parameters: ~96 Million  
- Context length: 512 tokens  
- Vocabulary size: 32,000 tokens  
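
The ~96M total is consistent with, for example, a tied-weight configuration of hidden size 768 and 10 layers; the sketch below is a back-of-envelope check under those assumed (not published) dimensions:

```python
# Assumed dimensions (illustrative; the actual config is not published here).
vocab, d_model, n_layers, max_len = 32_000, 768, 10, 512

emb = vocab * d_model    # token embeddings, tied with the output head
pos = max_len * d_model  # learned positional embeddings
# Per block: attention Q, K, V, O projections (4*d^2) plus a 4x-wide FFN
# (8*d^2), ignoring biases and LayerNorm parameters.
per_layer = 12 * d_model ** 2
total = emb + pos + n_layers * per_layer
print(f"{total / 1e6:.1f}M parameters")  # → 95.7M parameters
```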

---

# Compute Infrastructure

## Hardware

- NVIDIA GPU  

---

## Software

- Python  
- PyTorch  
- HuggingFace Transformers  
- SafeTensors  

---

# Environmental Impact

Training specifications:

- Hardware type: NVIDIA GPU  
- Training duration: ~34 hours  
- Framework: PyTorch  

---

# Author

Abir Maheshwari  
Independent AI Researcher  

HuggingFace Profile:  
https://huggingface.co/abirmaheshwari  

---

# Version

ABIRMARv1  

Initial release version.

---

# Contact

For research inquiries, collaboration, or technical questions:

HuggingFace:  
https://huggingface.co/abirmaheshwari  

---