# ABIRHINv1 — Hindi-First Transformer Language Model

ABIRHINv1 is a custom transformer-based causal language model designed specifically for Hindi and Hinglish text generation. It is trained from scratch using a custom tokenizer and architecture optimized for efficient Hindi language understanding and generation.

This model focuses on providing lightweight, efficient Hindi NLP capabilities while maintaining strong contextual understanding.


## Model Details

### Model Description

ABIRHINv1 is a decoder-only transformer model trained entirely from scratch without using pretrained weights. It uses a custom tokenizer trained on Hindi-focused data and is optimized for efficient inference and fine-tuning.
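The causal language modeling objective it is trained with can be illustrated with a toy example (pure Python; the probabilities below are hypothetical, not the model's actual outputs): the loss is the average negative log-likelihood the model assigns to each true next token.

```python
import math

def causal_lm_loss(probs, target_ids):
    """Average negative log-likelihood of the true next token at each position.
    probs[t] is the predicted distribution over the vocabulary at step t."""
    nll = [-math.log(p[t]) for p, t in zip(probs, target_ids)]
    return sum(nll) / len(nll)

# Toy vocabulary of 3 tokens; two prediction steps.
probs = [
    [0.7, 0.2, 0.1],  # step 0: model favours token 0
    [0.1, 0.8, 0.1],  # step 1: model favours token 1
]
targets = [0, 1]  # the actual next tokens
print(causal_lm_loss(probs, targets))  # low loss: predictions match targets
```

Training minimizes this quantity over every position of every sequence in the corpus.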

- Developed by: Abir Maheshwari
- Funded by: Independent Research
- Shared by: Abir Maheshwari
- Model type: Causal Language Model (Decoder-only Transformer)
- Language(s): Hindi, Hinglish, English
- License: MIT
- Finetuned from model: None (trained from scratch)

## Uses

### Direct Use

ABIRHINv1 is suitable for:

- Hindi text generation
- Hinglish conversational AI
- Hindi chatbots
- Text completion
- NLP research
- Educational purposes

Example applications:

- Hindi AI assistants
- Content generation
- Research on small language models

### Downstream Use

This model can be fine-tuned for:

- Hindi instruction models
- Question answering
- Domain-specific NLP tasks
- Hindi conversational agents

### Out-of-Scope Use

Not recommended for:

- Critical decision systems
- Medical diagnosis
- Legal advice
- Safety-critical applications

This is an early-version model.


## Bias, Risks, and Limitations

ABIRHINv1 may:

- Produce incorrect information
- Reflect biases present in training data
- Generate incomplete or nonsensical outputs

This is expected for models trained on limited datasets.


### Recommendations

Use this model:

- For research
- For experimentation
- For fine-tuning

Not recommended for production without further training.


## How to Get Started

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("abirmaheshwari/abirhinv1")
model = AutoModelForCausalLM.from_pretrained("abirmaheshwari/abirhinv1")

input_text = "भारत एक महान देश है क्योंकि"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,  # required for temperature to have any effect
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
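The `temperature=0.7` argument rescales the model's logits before sampling. A small standalone sketch of the effect (toy logits, not taken from the model):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.7, 1.0, 1.5):
    print(t, softmax_with_temperature(logits, t))
```

At temperature 0.7 the top token receives more probability mass than at 1.5, so generations are more focused but less diverse.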

---

# Training Details

## Training Data

ABIRHINv1 was trained on a custom Hindi-focused dataset designed to capture linguistic patterns, conversational structures, and general language usage.

The dataset includes:

- Hindi natural text corpus  
- Hinglish conversational text  
- General-purpose text data  

Dataset specifications:

- Total samples: ~161,000  
- Tokenizer: Custom-trained BPE tokenizer  
- Tokenizer trained entirely from scratch  

---

## Training Procedure

### Preprocessing

The dataset was processed using a custom Byte Pair Encoding (BPE) tokenizer with the following specifications:

- Vocabulary size: 32,000 tokens  
- Maximum sequence length: 512 tokens  
- Tokenizer trained on the same dataset as the model  
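A BPE tokenizer is built by repeatedly merging the most frequent adjacent symbol pair until the target vocabulary size (here 32,000) is reached. A minimal pure-Python sketch of the merge loop on a toy Hindi corpus (illustrative only; this is not the actual tokenizer-training code):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, words):
    """Rewrite each word with the chosen pair fused into a single symbol."""
    new = {}
    for word, freq in words.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new[" ".join(out)] = freq
    return new

# Toy corpus: words pre-split into characters, mapped to frequencies.
corpus = {"न म स ते": 5, "न म क": 3, "स ते ज": 2}
merges = []
for _ in range(3):  # a real tokenizer repeats this until the vocab reaches ~32,000
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(pair, corpus)
print(merges)  # learned merge rules, most frequent first
```

The learned merge list, applied in order, is what the tokenizer later uses to segment new text.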

---

### Training Hyperparameters

The model was trained using the following configuration:

- Optimizer: AdamW  
- Learning rate: 5 × 10⁻⁵  
- Training precision: FP16 mixed precision  
- Epochs: 3  
- Batch size: 4  
- Training objective: Causal Language Modeling  
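AdamW applies the Adam update with weight decay decoupled from the gradient. A single-scalar sketch of one step at the learning rate above (the weight-decay value is an assumption, not taken from the training config):

```python
import math

def adamw_step(p, grad, m, v, t, lr=5e-5,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update on a scalar parameter p at step t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) EMA
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment EMA
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decay is applied directly to p, not folded into the gradient.
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=1)
print(p)  # slightly below 1.0: one gradient step plus decoupled decay
```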

---

### Training Hardware

Training was performed using GPU acceleration.

- GPU: NVIDIA GPU (CUDA-enabled)  
- Framework: PyTorch  
- Library: HuggingFace Transformers  

---

# Evaluation

## Testing Data

Evaluation was performed using custom Hindi text samples representative of real-world usage.

---

## Metrics

Primary evaluation metric:

- Training loss monitoring  
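The monitored training loss is a per-token cross-entropy, which maps directly to perplexity. A quick conversion (the loss values below are illustrative, not measured results):

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(cross_entropy_loss)

for loss in (4.0, 3.0, 2.0):  # a falling loss means falling perplexity
    print(loss, round(perplexity(loss), 1))
```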

---

## Results

The model successfully learns:

- Hindi sentence structure  
- Token relationships  
- Language continuity  
- Contextual text generation  

The model demonstrates functional Hindi generation capability suitable for research and fine-tuning.

---

# Technical Specifications

## Architecture

ABIRHINv1 uses a decoder-only Transformer architecture consisting of:

- Token embedding layer  
- Learned positional embeddings  
- Multi-head self-attention layers  
- Feedforward neural network layers  
- GELU activation function  
- Weight tying between embedding and output layers  

---

## Model Size

- Total parameters: ~96 Million  
- Context length: 512 tokens  
- Vocabulary size: 32,000 tokens  
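The ~96M figure is consistent with, for example, a tied-embedding decoder at hidden size 768 with 10 layers; the exact configuration is not stated here, so the numbers below are a hypothetical consistency check, not the published architecture:

```python
def decoder_param_estimate(vocab, d_model, n_layers, max_len):
    """Rough decoder-only parameter count, ignoring biases and LayerNorm weights."""
    embedding = vocab * d_model    # shared with the output head via weight tying
    positions = max_len * d_model  # learned positional embeddings
    per_layer = 12 * d_model ** 2  # 4*d^2 attention + 8*d^2 FFN (4x expansion)
    return embedding + positions + n_layers * per_layer

total = decoder_param_estimate(vocab=32_000, d_model=768, n_layers=10, max_len=512)
print(f"{total / 1e6:.1f}M")  # close to the reported ~96M
```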

---

# Compute Infrastructure

## Hardware

- NVIDIA GPU  

---

## Software

- Python  
- PyTorch  
- HuggingFace Transformers  
- SafeTensors  

---

# Environmental Impact

Training specifications:

- Hardware type: NVIDIA GPU  
- Training duration: ~34 hours  
- Framework: PyTorch  

---

# Author

Abir Maheshwari  
Independent AI Researcher  

HuggingFace Profile:  
https://huggingface.co/abirmaheshwari  

---

# Version

ABIRHINv1  

Initial release version.

---

# Contact

For questions, collaboration, or research inquiries:

HuggingFace:  
https://huggingface.co/abirmaheshwari

---