# ABIRHINv1 — Hindi-First Transformer Language Model

ABIRHINv1 is a custom transformer-based causal language model designed specifically for Hindi and Hinglish text generation. It is trained from scratch using a custom tokenizer and architecture optimized for efficient Hindi language understanding and generation.

This model focuses on providing lightweight, efficient Hindi NLP capabilities while maintaining strong contextual understanding.


## Model Details

### Model Description

ABIRHINv1 is a decoder-only transformer model trained entirely from scratch without using pretrained weights. It uses a custom tokenizer trained on Hindi-focused data and is optimized for efficient inference and fine-tuning.
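The causal language modeling objective it is trained with can be illustrated with a toy example (pure Python; the probabilities below are hypothetical, not the model's actual outputs): the loss is the average negative log-likelihood the model assigns to each true next token.

```python
import math

def causal_lm_loss(probs, target_ids):
    """Average negative log-likelihood of the true next token at each position.
    probs[t] is the predicted distribution over the vocabulary at step t."""
    nll = [-math.log(p[t]) for p, t in zip(probs, target_ids)]
    return sum(nll) / len(nll)

# Toy vocabulary of 3 tokens; two prediction steps.
probs = [
    [0.7, 0.2, 0.1],  # step 0: model favours token 0
    [0.1, 0.8, 0.1],  # step 1: model favours token 1
]
targets = [0, 1]  # the actual next tokens
print(causal_lm_loss(probs, targets))  # low loss: predictions match targets
```

Training minimizes this quantity over every position of every sequence in the corpus.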

- Developed by: Abir Maheshwari
- Funded by: Independent Research
- Shared by: Abir Maheshwari
- Model type: Causal Language Model (Decoder-only Transformer)
- Language(s): Hindi, Hinglish, English
- License: MIT
- Finetuned from model: None (trained from scratch)

## Uses

### Direct Use

ABIRHINv1 is suitable for:

- Hindi text generation
- Hinglish conversational AI
- Hindi chatbots
- Text completion
- NLP research
- Educational purposes

Example applications:

- Hindi AI assistants
- Content generation
- Research on small language models

### Downstream Use

This model can be fine-tuned for:

- Hindi instruction models
- Question answering
- Domain-specific NLP tasks
- Hindi conversational agents

### Out-of-Scope Use

Not recommended for:

- Critical decision systems
- Medical diagnosis
- Legal advice
- Safety-critical applications

This is an early-version model.


## Bias, Risks, and Limitations

ABIRHINv1 may:

- Produce incorrect information
- Reflect biases present in training data
- Generate incomplete or nonsensical outputs

This is expected for models trained on limited datasets.


### Recommendations

Use this model:

- For research
- For experimentation
- For fine-tuning

Not recommended for production without further training.


## How to Get Started

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("abirmaheshwari/abirhinv1")
model = AutoModelForCausalLM.from_pretrained("abirmaheshwari/abirhinv1")

input_text = "भारत एक महान देश है क्योंकि"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,  # required for temperature to have any effect
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
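The `temperature=0.7` argument rescales the model's logits before sampling. A small standalone sketch of the effect (toy logits, not taken from the model):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.7, 1.0, 1.5):
    print(t, softmax_with_temperature(logits, t))
```

At temperature 0.7 the top token receives more probability mass than at 1.5, so generations are more focused but less diverse.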

---

# Training Details

## Training Data

ABIRHINv1 was trained on a custom Hindi-focused dataset designed to capture linguistic patterns, conversational structures, and general language usage.

The dataset includes:

- Hindi natural text corpus  
- Hinglish conversational text  
- General-purpose text data  

Dataset specifications:

- Total samples: ~161,000  
- Tokenizer: Custom-trained BPE tokenizer  
- Tokenizer trained entirely from scratch  

---

## Training Procedure

### Preprocessing

The dataset was processed using a custom Byte Pair Encoding (BPE) tokenizer with the following specifications:

- Vocabulary size: 32,000 tokens  
- Maximum sequence length: 512 tokens  
- Tokenizer trained on the same dataset as the model  
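A BPE tokenizer is built by repeatedly merging the most frequent adjacent symbol pair until the target vocabulary size (here 32,000) is reached. A minimal pure-Python sketch of the merge loop on a toy Hindi corpus (illustrative only; this is not the actual tokenizer-training code):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, words):
    """Rewrite each word with the chosen pair fused into a single symbol."""
    new = {}
    for word, freq in words.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new[" ".join(out)] = freq
    return new

# Toy corpus: words pre-split into characters, mapped to frequencies.
corpus = {"न म स ते": 5, "न म क": 3, "स ते ज": 2}
merges = []
for _ in range(3):  # a real tokenizer repeats this until the vocab reaches ~32,000
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(pair, corpus)
print(merges)  # learned merge rules, most frequent first
```

The learned merge list, applied in order, is what the tokenizer later uses to segment new text.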

---

### Training Hyperparameters

The model was trained using the following configuration:

- Optimizer: AdamW  
- Learning rate: 5 × 10⁻⁵  
- Training precision: FP16 mixed precision  
- Epochs: 3  
- Batch size: 4  
- Training objective: Causal Language Modeling  
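AdamW applies the Adam update with weight decay decoupled from the gradient. A single-scalar sketch of one step at the learning rate above (the weight-decay value is an assumption, not taken from the training config):

```python
import math

def adamw_step(p, grad, m, v, t, lr=5e-5,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update on a scalar parameter p at step t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) EMA
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment EMA
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decay is applied directly to p, not folded into the gradient.
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=1)
print(p)  # slightly below 1.0: one gradient step plus decoupled decay
```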

---

### Training Hardware

Training was performed using GPU acceleration.

- GPU: NVIDIA GPU (CUDA-enabled)  
- Framework: PyTorch  
- Library: HuggingFace Transformers  

---

# Evaluation

## Testing Data

Evaluation was performed using custom Hindi text samples representative of real-world usage.

---

## Metrics

Primary evaluation metric:

- Training loss monitoring  
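The monitored training loss is a per-token cross-entropy, which maps directly to perplexity. A quick conversion (the loss values below are illustrative, not measured results):

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(cross_entropy_loss)

for loss in (4.0, 3.0, 2.0):  # a falling loss means falling perplexity
    print(loss, round(perplexity(loss), 1))
```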

---

## Results

The model successfully learns:

- Hindi sentence structure  
- Token relationships  
- Language continuity  
- Contextual text generation  

The model demonstrates functional Hindi generation capability suitable for research and fine-tuning.

---

# Technical Specifications

## Architecture

ABIRHINv1 uses a decoder-only Transformer architecture consisting of:

- Token embedding layer  
- Learned positional embeddings  
- Multi-head self-attention layers  
- Feedforward neural network layers  
- GELU activation function  
- Weight tying between embedding and output layers  

---

## Model Size

- Total parameters: ~96 Million  
- Context length: 512 tokens  
- Vocabulary size: 32,000 tokens  
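The ~96M figure is consistent with, for example, a tied-embedding decoder at hidden size 768 with 10 layers; the exact configuration is not stated here, so the numbers below are a hypothetical consistency check, not the published architecture:

```python
def decoder_param_estimate(vocab, d_model, n_layers, max_len):
    """Rough decoder-only parameter count, ignoring biases and LayerNorm weights."""
    embedding = vocab * d_model    # shared with the output head via weight tying
    positions = max_len * d_model  # learned positional embeddings
    per_layer = 12 * d_model ** 2  # 4*d^2 attention + 8*d^2 FFN (4x expansion)
    return embedding + positions + n_layers * per_layer

total = decoder_param_estimate(vocab=32_000, d_model=768, n_layers=10, max_len=512)
print(f"{total / 1e6:.1f}M")  # close to the reported ~96M
```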

---

# Compute Infrastructure

## Hardware

- NVIDIA GPU  

---

## Software

- Python  
- PyTorch  
- HuggingFace Transformers  
- SafeTensors  

---

# Environmental Impact

Training specifications:

- Hardware type: NVIDIA GPU  
- Training duration: ~34 hours  
- Framework: PyTorch  

---

# Author

Abir Maheshwari  
Independent AI Researcher  

HuggingFace Profile:  
https://huggingface.co/abirmaheshwari  

---

# Version

ABIRHINv1  

Initial release version.

---

# Contact

For questions, collaboration, or research inquiries:

HuggingFace:  
https://huggingface.co/abirmaheshwari

---