Bengali Empathetic LLaMA 🇧🇩
Fine-tuned LLaMA 3.1-8B-Instruct for empathetic Bengali conversations using LoRA (Low-Rank Adaptation).
📖 Table of Contents
- Model Description
- Training Details
- Evaluation Results
- Sample Responses
- Design Decisions & Trade-offs
- Architecture & OOP Design
- Usage
- Challenges Faced
- Future Improvements
- Files in This Repository
Model Description
This model is a LoRA fine-tuned version of Meta's LLaMA 3.1-8B-Instruct, specifically trained to generate compassionate and empathetic responses in Bengali.
What This Model Does
- Input: Bengali text expressing emotions (sadness, happiness, frustration, etc.)
- Output: Empathetic Bengali response with emotional understanding
Example
Input: আমি খুব একা অনুভব করছি। (I feel very lonely)
Output: হ্যাঁ, এটা খুব কঠিন। কিন্তু আমি আশা করি আপনি শীঘ্রই একজন বন্ধু পাবেন।
(Yes, this is very hard. But I hope you will find a friend soon.)
Training Details
Training History
❌ First Attempt (Interrupted - Progress Lost)
Our initial training with optimal settings was interrupted at 66% completion due to Kaggle session timeout:
| Setting | Value |
|---|---|
| Data | 100% (10,749 samples) |
| Epochs | 3 |
| Max Length | 384 tokens |
| Progress | 5,329 / 8,061 steps (66%) |
Loss Progression (Before Interruption):
| Step | Training Loss | Validation Loss |
|---|---|---|
| 500 | 0.4459 | - |
| 1000 | 0.3869 | - |
| 2000 | 0.3292 | 0.3281 |
| 3000 | 0.2450 | - |
| 4000 | 0.2351 | 0.2642 |
| 5000 | 0.2093 | - |
| 5329 | Session Timeout | - |
⚠️ Had it completed, this run would likely have reached a final loss of ~0.18-0.20 with noticeably better quality. The checkpoint was lost because saves were configured every 2,000 steps and the session crashed before the next save.
✅ Second Attempt (Completed Successfully)
With remaining GPU quota (~3 hours), we completed a condensed training:
| Setting | Value |
|---|---|
| Data | 40% sample (4,299 samples) |
| Epochs | 2 |
| Max Length | 256 tokens |
| Training Time | 3.26 hours |
| Platform | Kaggle Tesla T4 (16GB VRAM) |
Final Results:
| Metric | Value |
|---|---|
| Training Loss | 0.4190 |
| Validation Loss | 0.3651 |
LoRA Configuration
| Parameter | Value | Explanation |
|---|---|---|
| Rank (r) | 16 | Rank of the low-rank update matrices. Higher = more capacity but more memory |
| Alpha | 32 | Scaling factor (alpha/r = 2x multiplier) |
| Dropout | 0.05 | Light regularization to prevent overfitting |
| Target Modules | 7 projections | All attention (q, k, v, o) + MLP (gate, up, down) projections |
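For reference, here is how these values map onto PEFT's LoraConfig. This is a minimal sketch, not code copied from the training script; the module names assume the standard LLaMA projection layers.

```python
from peft import LoraConfig

# Sketch of the LoRA setup described in the table above
lora_config = LoraConfig(
    r=16,                      # rank of the low-rank update matrices
    lora_alpha=32,             # scaling factor (alpha / r = 2x)
    lora_dropout=0.05,         # light regularization
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```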
Training Hyperparameters
| Parameter | Value |
|---|---|
| Optimizer | paged_adamw_8bit |
| Learning Rate | 3e-4 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.05 |
| Batch Size | 4 |
| Gradient Accumulation | 1 |
| Precision | FP16 (Mixed Precision) |
| Gradient Checkpointing | Enabled |
| Quantization | 4-bit NF4 |
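A sketch of how these hyperparameters map onto transformers.TrainingArguments; the output path and epoch count are illustrative rather than taken verbatim from the run.

```python
from transformers import TrainingArguments

# Illustrative mapping of the hyperparameter table above
training_args = TrainingArguments(
    output_dir="./bengali-empathy-llama",   # hypothetical path
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    optim="paged_adamw_8bit",
    fp16=True,                               # mixed precision
    gradient_checkpointing=True,
)
```

The 4-bit NF4 quantization is not a TrainingArguments setting; it is applied when the base model is loaded (see the BitsAndBytesConfig in the Usage section below).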
Evaluation Results
| Metric | Score | Interpretation |
|---|---|---|
| BLEU-1 | 0.0613 | Unigram word overlap |
| BLEU-2 | 0.0251 | Bigram overlap |
| BLEU-4 | 0.0098 | 4-gram overlap |
| ROUGE-1 | 0.0117 | Unigram recall |
| ROUGE-2 | 0.0000 | Bigram recall |
| ROUGE-L | 0.0117 | Longest common subsequence |
| Perplexity | 1.9493 | Model confidence (lower = better) |
Why Are BLEU/ROUGE Scores Low?
This is expected and normal for empathetic response generation. Here's why:
1. Multiple Valid Responses: there are many ways to express empathy.
   - Reference: "আমি দুঃখিত" (I'm sorry)
   - Generated: "এটা কঠিন হবে" (This must be hard)
   - Both are empathetic but share no words → BLEU = 0
2. Creative Generation: the model produces contextually appropriate responses rather than verbatim copies of the references.
3. Bengali Morphology: Bengali is morphologically rich, so the same word appears in many surface forms and exact n-gram matches are rare.
4. Perplexity Is Good: 1.95 indicates the model is confident in its predictions.
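The word-overlap effect is easy to reproduce: two equally empathetic responses that share no tokens score (near) zero under BLEU. A minimal illustration using nltk, not the project's actual evaluation code:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["আমি", "দুঃখিত"]           # "I'm sorry"
candidate = ["এটা", "কঠিন", "হবে"]      # "This must be hard"

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, weights=(1.0,), smoothing_function=smooth)
print(f"BLEU-1: {score:.4f}")  # close to 0: no shared unigrams, although both replies are empathetic
```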
Human Evaluation Framework
We created a human evaluation template with these criteria (1-5 scale):
| Criterion | What It Measures |
|---|---|
| Empathy | Does the response show understanding of emotions? |
| Relevance | Is the response relevant to the input? |
| Fluency | Is the Bengali grammatically correct and natural? |
| Helpfulness | Does the response provide comfort or useful advice? |
| Overall | Overall quality of the response |
📄 File: human_evaluation_sheet.csv (20 samples for manual evaluation)
Sample Responses
| # | Input (Bengali) | Model Response | Quality |
|---|---|---|---|
| 1 | আমি খুব একা অনুভব করছি। | হ্যাঁ, এটা খুব কঠিন। কিন্তু আমি আশা করি আপনি শীঘ্রই একজন বন্ধু পাবেন। | ✅ Empathetic |
| 2 | আজ পরীক্ষায় ফেল করেছি, খুব কষ্ট লাগছে। | আমি খুব দুঃখিত! আপনি কি আপনার উপস্থাপনা মনে করেন? | ✅ Supportive |
| 3 | নতুন চাকরি পেয়েছি! অনেক খুশি! | অভিনন্দন! আমি জানি তুমি কেমন অনুভব করছো। আমি সত্যিই এই সম্পর্কে উত্তেজিত! | ✅ Celebratory |
| 4 | আমার বন্ধু আমাকে ছেড়ে চলে গেছে। | আমি এটা শুনে দুঃখিত। আপনি কি তার সাথে কথা বলেছেন? | ✅ Caring |
Design Decisions & Trade-offs
1️⃣ Sequence Length: 256 vs Full Length
| Aspect | Requirement | What We Did | Why |
|---|---|---|---|
| Sequence Length | Not reduced | Reduced to 256 | GPU memory constraint |
What "Sequence Length" Means:
- Maximum number of tokens (words/subwords) the model processes at once
- Original conversations may have 500-1000+ tokens
- We truncated to 256 tokens
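Truncation itself is just a tokenizer setting. A minimal sketch using this repository's tokenizer (not the actual training code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EdwardConstantine/bengali-empathy-llama")

text = "আমি খুব একা অনুভব করছি।"   # example input from above
encoded = tokenizer(
    text,
    max_length=256,      # hard cap: tokens beyond 256 are dropped
    truncation=True,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)   # at most (1, 256)
```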
Why We Reduced It:
Problem: Kaggle T4 GPU has only 16GB VRAM
Full Length (512+ tokens):
- Memory needed: ~18-20GB ❌ Doesn't fit
- Batch size: 1 (very slow)
- Training time: 20+ hours
Reduced Length (256 tokens):
- Memory needed: ~12GB ✅ Fits
- Batch size: 4 (faster)
- Training time: 3 hours
Impact:
- ~15% of conversations get truncated
- Model may miss context in very long conversations
- Core empathetic learning still happens (most empathy is expressed in first 256 tokens)
What Could Be Done:
- Use A100 GPU (40GB VRAM) → Can use 512-1024 tokens
- Use Unsloth library → 2x memory efficiency
- Use gradient accumulation with batch_size=1 → slower but full length (see the sketch after this list)
- Use QLoRA with more aggressive quantization
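For example, the gradient-accumulation route could look like the following. This is a sketch of the trade-off under assumed settings, not a tested configuration: a per-device batch of 1 with 4 accumulation steps keeps the effective batch size at 4 while freeing memory for longer sequences.

```python
from transformers import TrainingArguments

# Hypothetical settings: trade micro-batch size for sequence length on a 16GB T4
training_args = TrainingArguments(
    output_dir="./full-length-run",        # illustrative path
    per_device_train_batch_size=1,         # smallest possible micro-batch
    gradient_accumulation_steps=4,         # effective batch size stays 4
    gradient_checkpointing=True,           # recompute activations to save memory
    fp16=True,
)
# max_length=512 would then be set in the tokenization step instead of 256
```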
2️⃣ Strategy Pattern: LoRA vs Unsloth
| Aspect | Requirement | What We Did | Why |
|---|---|---|---|
| Design Pattern | Strategy pattern for LoRA/Unsloth | Only LoRA implemented | Time constraint + LoRA sufficient |
What "Strategy Pattern" Means:
```python
# Strategy Pattern = swappable algorithms
class FineTuningStrategy:                     # Abstract strategy
    def apply(self, model): pass

class LoRAStrategy(FineTuningStrategy):       # Strategy 1 ✅ Implemented
    def apply(self, model):
        return get_peft_model(model, lora_config)

class UnslothStrategy(FineTuningStrategy):    # Strategy 2 ❌ Not implemented
    def apply(self, model):
        return FastLanguageModel.get_peft_model(model)

# Usage: can swap strategies easily
strategy = LoRAStrategy()  # or UnslothStrategy()
model = strategy.apply(base_model)
```
Why We Only Used LoRA:
| Factor | LoRA | Unsloth |
|---|---|---|
| Ease of setup | ✅ Simple | ⚠️ Requires specific installation |
| Kaggle compatibility | ✅ Works perfectly | ⚠️ May have conflicts |
| Memory efficiency | Good (4-bit) | Better (2x faster) |
| Our GPU time | 3 hours left | Not enough to debug issues |
| Result quality | ✅ Achieved goal | Would be similar |
What Could Be Done:
```python
# Full Strategy Pattern implementation
from abc import ABC, abstractmethod

class FineTuningStrategy(ABC):
    @abstractmethod
    def apply(self, model, config):
        pass

    @abstractmethod
    def get_name(self):
        pass

class LoRAStrategy(FineTuningStrategy):
    def apply(self, model, config):
        from peft import get_peft_model, LoraConfig
        lora_config = LoraConfig(
            r=config.lora_r,
            lora_alpha=config.lora_alpha,
            target_modules=config.target_modules,
        )
        return get_peft_model(model, lora_config)

    def get_name(self):
        return "LoRA"

class UnslothStrategy(FineTuningStrategy):
    def apply(self, model, config):
        from unsloth import FastLanguageModel
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=config.model_name,
            max_seq_length=config.max_length,
            load_in_4bit=True,
        )
        return FastLanguageModel.get_peft_model(model)

    def get_name(self):
        return "Unsloth"

# Usage
class LLAMAFineTuner:
    def __init__(self, config, strategy: FineTuningStrategy):
        self.config = config
        self.strategy = strategy

    def prepare_model(self, base_model):
        print(f"Using {self.strategy.get_name()} strategy")
        return self.strategy.apply(base_model, self.config)
```
3️⃣ Data Sampling: 40% vs 100%
| Aspect | Ideal | What We Did | Why |
|---|---|---|---|
| Training Data | 100% (10,749 samples) | 40% (4,299 samples) | GPU time constraint |
Impact:
- Model sees less variety of conversations
- May not generalize as well to rare emotions
- Still learns core empathetic patterns
What Could Be Done:
- Train for longer with full dataset
- Use data augmentation to increase variety
- Prioritize diverse samples over random sampling (see the sketch below)
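One way to prioritize diversity would be to sample evenly across topic labels instead of uniformly at random. A hedged sketch with pandas; the filename and column name are assumptions based on the topic column mentioned in Challenges, not the dataset's actual schema:

```python
import pandas as pd

# Hypothetical: stratified 40% sample across a "topic" column instead of random sampling
df = pd.read_json("bengali_empathy.jsonl", lines=True)   # illustrative filename
df["topic"] = df["topic"].astype(str)                     # normalize mixed types (see Challenges)

sampled = (
    df.groupby("topic", group_keys=False)
      .apply(lambda g: g.sample(frac=0.4, random_state=42))
)
print(len(sampled), "of", len(df), "samples kept")
```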
Architecture & OOP Design
Class Diagram
┌─────────────────────────────────────────────────────────────────┐
│ MAIN PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ DatasetProcessor │ │ LLAMAFineTuner │ │
│ ├─────────────────────┤ ├─────────────────────┤ │
│ │ + TEMPLATE │ │ + model │ │
│ │ + train_dataset │ │ + tokenizer │ │
│ │ + val_dataset │ │ + trainer │ │
│ ├─────────────────────┤ ├─────────────────────┤ │
│ │ + load() │ │ + load_model() │ │
│ │ + process() │ │ + setup_trainer() │ │
│ │ + _format() │ │ + train() │ │
│ │ + _tokenize() │ │ + save_final() │ │
│ └─────────────────────┘ │ + generate() │ │
│ └─────────────────────┘ │
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Evaluator │ │ ExperimentLogger │ │
│ ├─────────────────────┤ ├─────────────────────┤ │
│ │ + model │ │ + db_path │ │
│ ├─────────────────────┤ ├─────────────────────┤ │
│ │ + calculate_bleu() │ │ + log() │ │
│ │ + calculate_rouge() │ │ + log_response() │ │
│ │ + calculate_ppl() │ │ + _init_db() │ │
│ │ + test_samples() │ └─────────────────────┘ │
│ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Database Schema
```sql
-- Stores training experiment metadata
CREATE TABLE LLAMAExperiments (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    model_name     TEXT,     -- e.g., "meta-llama/Llama-3.1-8B-Instruct"
    lora_config    TEXT,     -- JSON: {"r": 16, "alpha": 32}
    train_loss     REAL,     -- e.g., 0.4190
    val_loss       REAL,     -- e.g., 0.3651
    duration_hours REAL,     -- e.g., 3.26
    timestamp      DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Stores generated responses for analysis
CREATE TABLE GeneratedResponses (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    experiment_id INTEGER,   -- links to LLAMAExperiments
    input_text    TEXT,      -- user input
    response_text TEXT,      -- model response
    timestamp     DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (experiment_id) REFERENCES LLAMAExperiments(id)
);
```
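A minimal sketch of how an ExperimentLogger could write to this schema with Python's built-in sqlite3. The method names follow the class diagram above, but this is an illustration and may differ from the actual implementation (it also assumes the tables have already been created by _init_db):

```python
import json
import sqlite3

class ExperimentLogger:
    def __init__(self, db_path="experiments.db"):
        self.db_path = db_path

    def log(self, model_name, lora_config, train_loss, val_loss, duration_hours):
        # Insert one row of experiment metadata into LLAMAExperiments
        with sqlite3.connect(self.db_path) as conn:
            cur = conn.execute(
                "INSERT INTO LLAMAExperiments "
                "(model_name, lora_config, train_loss, val_loss, duration_hours) "
                "VALUES (?, ?, ?, ?, ?)",
                (model_name, json.dumps(lora_config), train_loss, val_loss, duration_hours),
            )
            return cur.lastrowid  # experiment_id for linking generated responses

    def log_response(self, experiment_id, input_text, response_text):
        # Store a generated response linked to its experiment
        with sqlite3.connect(self.db_path) as conn:
            conn.execute(
                "INSERT INTO GeneratedResponses (experiment_id, input_text, response_text) "
                "VALUES (?, ?, ?)",
                (experiment_id, input_text, response_text),
            )

# Example (values taken from the tables above):
# logger = ExperimentLogger()
# exp_id = logger.log("meta-llama/Llama-3.1-8B-Instruct", {"r": 16, "alpha": 32}, 0.4190, 0.3651, 3.26)
```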
Usage
Installation
```bash
pip install transformers peft bitsandbytes accelerate torch
```
Load and Use the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Quantization config (required for 4-bit loading)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "EdwardConstantine/bengali-empathy-llama",
)
tokenizer = AutoTokenizer.from_pretrained(
    "EdwardConstantine/bengali-empathy-llama"
)

# Generate empathetic response
def generate_response(prompt, max_tokens=200):
    full_prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a compassionate Bengali conversational AI. Respond with empathy. Reply in Bengali.<|eot_id|><|start_header_id|>user<|end_header_id|>
{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response.strip()

# Example usage
response = generate_response("আমি খুব একা অনুভব করছি।")
print(response)
```
Challenges Faced
| Challenge | Problem | Solution |
|---|---|---|
| bitsandbytes + triton | triton.ops module not found | Created an in-memory patch to mock the module |
| JSONL Data Types | Mixed string/number types in the topic column | Robust loader that converts all values to strings |
| GPU Memory (16GB) | Model too large for full training | 4-bit quantization + gradient checkpointing |
| Kaggle Timeout | Lost 10+ hours of training | Implemented frequent checkpointing, every 300 steps (see the sketch below) |
| Download Issues | Couldn't download from Kaggle | Uploaded to Hugging Face Hub |
| Disk Space | Kaggle ran out of space | Cleaned old checkpoints, uploaded to Hugging Face Hub |
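The checkpointing fix from the Kaggle Timeout row can be expressed through standard Trainer arguments. A hedged sketch rather than the exact configuration used; the output path is an assumption:

```python
from transformers import TrainingArguments

# Illustrative: save often enough that a Kaggle timeout costs at most ~300 steps of progress
training_args = TrainingArguments(
    output_dir="/kaggle/working/checkpoints",   # assumed Kaggle path
    save_strategy="steps",
    save_steps=300,          # checkpoint every 300 steps (vs. every 2,000 in the first attempt)
    save_total_limit=2,      # keep only the latest checkpoints to avoid filling the disk
)
```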
Future Improvements
| Improvement | What It Would Do | Difficulty |
|---|---|---|
| Full sequence length (512+) | Better context understanding | Needs better GPU |
| 100% training data | Better generalization | Needs more time |
| 3 full epochs | Lower loss, better quality | Needs more time |
| Unsloth integration | 2x faster training | Medium |
| Gradio demo | Interactive web interface | Easy |
| More evaluation metrics | BERTScore, semantic similarity | Easy |
| Human evaluation study | Real quality assessment | Medium |
Files in This Repository
| File | Description | Size |
|---|---|---|
| adapter_model.safetensors | Trained LoRA weights | 168 MB |
| adapter_config.json | LoRA configuration | 915 B |
| tokenizer.json | Tokenizer | 17.2 MB |
| tokenizer_config.json | Tokenizer config | 50.6 KB |
| special_tokens_map.json | Special tokens | 325 B |
| chat_template.jinja | Chat format template | 4.61 KB |
| evaluation_results.csv | Detailed evaluation results | 215 KB |
| metrics_summary.json | BLEU, ROUGE, Perplexity | 0.2 KB |
| human_evaluation_sheet.csv | Human evaluation template | 15.5 KB |
| experiments.db | SQLite experiment logs | 12 KB |
Citation
```bibtex
@misc{bengali-empathy-llama-2024,
  author    = {EdwardConstantine},
  title     = {Bengali Empathetic LLaMA: Fine-tuned LLaMA 3.1-8B for Empathetic Bengali Conversations},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/EdwardConstantine/bengali-empathy-llama}
}
```
License
This model is released under the Apache 2.0 License, subject to Meta's LLaMA license terms.
Acknowledgments
- Meta AI for LLaMA 3.1-8B-Instruct base model
- Hugging Face for transformers and PEFT libraries
- Kaggle for free GPU access
- Bengali Empathetic Conversations Dataset creators