Bengali Empathetic LLaMA 🇧🇩

Fine-tuned LLaMA 3.1-8B-Instruct for empathetic Bengali conversations using LoRA (Low-Rank Adaptation).


📖 Table of Contents

  1. Model Description
  2. Training Details
  3. Evaluation Results
  4. Sample Responses
  5. Design Decisions & Trade-offs
  6. Architecture & OOP Design
  7. Usage
  8. Challenges Faced
  9. Future Improvements
  10. Files in This Repository

Model Description

This model is a LoRA fine-tuned version of Meta's LLaMA 3.1-8B-Instruct, specifically trained to generate compassionate and empathetic responses in Bengali.

What This Model Does

  • Input: Bengali text expressing emotions (sadness, happiness, frustration, etc.)
  • Output: Empathetic Bengali response with emotional understanding

Example

Input:  আমি খুব একা অনুভব করছি। (I feel very lonely)
Output: হ্যাঁ, এটা খুব কঠিন। কিন্তু আমি আশা করি আপনি শীঘ্রই একজন বন্ধু পাবেন।
        (Yes, this is very hard. But I hope you will find a friend soon.)

Training Details

Training History

❌ First Attempt (Interrupted - Progress Lost)

Our initial training with optimal settings was interrupted at 66% completion due to a Kaggle session timeout:

| Setting    | Value                     |
|------------|---------------------------|
| Data       | 100% (10,749 samples)     |
| Epochs     | 3                         |
| Max Length | 384 tokens                |
| Progress   | 5,329 / 8,061 steps (66%) |

Loss Progression (Before Interruption):

| Step | Training Loss   | Validation Loss |
|------|-----------------|-----------------|
| 500  | 0.4459          | -               |
| 1000 | 0.3869          | -               |
| 2000 | 0.3292          | 0.3281          |
| 3000 | 0.2450          | -               |
| 4000 | 0.2351          | 0.2642          |
| 5000 | 0.2093          | -               |
| 5329 | Session Timeout | -               |

⚠️ If completed, this training would have achieved ~0.18-0.20 final loss with significantly better quality. The checkpoint was lost because saves were configured every 2000 steps, and the session crashed before the next save.

✅ Second Attempt (Completed Successfully)

With the remaining GPU quota (~3 hours), we completed a condensed training run:

| Setting       | Value                       |
|---------------|-----------------------------|
| Data          | 40% sample (4,299 samples)  |
| Epochs        | 2                           |
| Max Length    | 256 tokens                  |
| Training Time | 3.26 hours                  |
| Platform      | Kaggle Tesla T4 (16GB VRAM) |

Final Results:

| Metric          | Value  |
|-----------------|--------|
| Training Loss   | 0.4190 |
| Validation Loss | 0.3651 |

LoRA Configuration

| Parameter      | Value    | Explanation                                                                  |
|----------------|----------|------------------------------------------------------------------------------|
| Rank (r)       | 16       | Rank of the low-rank update matrices. Higher = more capacity but more memory |
| Alpha          | 32       | Scaling factor (alpha/r = 2x multiplier)                                     |
| Dropout        | 0.05     | Light regularization to prevent overfitting                                  |
| Target Modules | 7 layers | All attention (q,k,v,o) + MLP (gate, up, down) projections                   |
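The table above maps onto a PEFT `LoraConfig` roughly as follows. This is a sketch: the seven target-module names are assumed from the standard Hugging Face LLaMA implementation, and "7 layers" in the table refers to these seven projection matrices per transformer block.

```python
# Sketch of the LoRA settings above as they would be passed to
# peft.LoraConfig(**lora_cfg). Module names are assumptions based on
# the standard Hugging Face LLaMA layer naming.
lora_cfg = {
    "r": 16,                # rank of the low-rank update matrices
    "lora_alpha": 32,       # scaling factor; effective multiplier = alpha / r = 2
    "lora_dropout": 0.05,   # light regularization on the adapter inputs
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    "task_type": "CAUSAL_LM",
}
```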

Training Hyperparameters

| Parameter              | Value                  |
|------------------------|------------------------|
| Optimizer              | paged_adamw_8bit       |
| Learning Rate          | 3e-4                   |
| LR Scheduler           | Cosine                 |
| Warmup Ratio           | 0.05                   |
| Batch Size             | 4                      |
| Gradient Accumulation  | 1                      |
| Precision              | FP16 (Mixed Precision) |
| Gradient Checkpointing | Enabled                |
| Quantization           | 4-bit NF4              |
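For orientation, these hyperparameters correspond to `transformers.TrainingArguments` keywords roughly as below. The step-count arithmetic assumes all 4,299 sampled conversations land in the training split, which the tables above don't state exactly.

```python
import math

# The hyperparameters above as transformers.TrainingArguments keywords
# (a sketch; output settings such as output_dir and save_steps are omitted).
training_kwargs = {
    "optim": "paged_adamw_8bit",
    "learning_rate": 3e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "fp16": True,
    "gradient_checkpointing": True,
    "num_train_epochs": 2,
}

# Rough optimizer-step count for the completed run, assuming all 4,299
# sampled conversations are in the training split:
steps_per_epoch = math.ceil(4299 / training_kwargs["per_device_train_batch_size"])
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]   # about 2,150
```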

Evaluation Results

| Metric     | Score  | Interpretation                    |
|------------|--------|-----------------------------------|
| BLEU-1     | 0.0613 | Unigram word overlap              |
| BLEU-2     | 0.0251 | Bigram overlap                    |
| BLEU-4     | 0.0098 | 4-gram overlap                    |
| ROUGE-1    | 0.0117 | Unigram recall                    |
| ROUGE-2    | 0.0000 | Bigram recall                     |
| ROUGE-L    | 0.0117 | Longest common subsequence        |
| Perplexity | 1.9493 | Model confidence (lower = better) |

Why Are BLEU/ROUGE Scores Low?

This is expected and normal for empathetic response generation. Here's why:

  1. Multiple Valid Responses: There are many ways to express empathy

    • Reference: "আমি দুঃখিত" (I'm sorry)
    • Generated: "এটা কঠিন হবে" (This must be hard)
    • Both are empathetic but share no words → BLEU = 0
  2. Creative Generation: The model generates contextually appropriate but not verbatim responses

  3. Bengali Language: Morphologically rich language with many word forms

  4. Perplexity is Good: 1.95 indicates the model is confident in its predictions
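Points 1 and 4 can be made concrete with a few lines of arithmetic (pure illustration; the split on which perplexity was measured isn't stated above):

```python
import math

# Point 1: two valid empathetic replies that share no tokens score
# zero on n-gram overlap metrics.
reference = "আমি দুঃখিত".split()        # "I'm sorry"
generated = "এটা কঠিন হবে".split()      # "This must be hard"
overlap = set(reference) & set(generated)
unigram_precision = len(overlap) / len(generated)   # 0.0, so BLEU-1 = 0

# Point 4: perplexity is exp(mean cross-entropy), so the reported 1.9493
# corresponds to roughly 0.67 nats of loss per token.
loss_per_token = math.log(1.9493)
```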

Human Evaluation Framework

We created a human evaluation template with these criteria (1-5 scale):

| Criterion   | What It Measures                                    |
|-------------|-----------------------------------------------------|
| Empathy     | Does the response show understanding of emotions?   |
| Relevance   | Is the response relevant to the input?              |
| Fluency     | Is the Bengali grammatically correct and natural?   |
| Helpfulness | Does the response provide comfort or useful advice? |
| Overall     | Overall quality of the response                     |

📄 File: human_evaluation_sheet.csv (20 samples for manual evaluation)
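A sheet with this layout can be generated in a few lines. The column names below follow the criteria table above; the actual `human_evaluation_sheet.csv` may differ in detail.

```python
import csv
import io

# Sketch of the human-evaluation sheet layout; raters fill the 1-5
# score columns by hand. Column names are assumed from the criteria table.
fieldnames = ["input", "response",
              "empathy", "relevance", "fluency", "helpfulness", "overall"]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerow({"input": "আমি খুব একা অনুভব করছি।",
                 "response": "হ্যাঁ, এটা খুব কঠিন।"})   # score columns left blank
sheet = buf.getvalue()
```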


Sample Responses

| # | Input (Bengali) | Model Response | Quality |
|---|-----------------|----------------|---------|
| 1 | আমি খুব একা অনুভব করছি। | হ্যাঁ, এটা খুব কঠিন। কিন্তু আমি আশা করি আপনি শীঘ্রই একজন বন্ধু পাবেন। | ✅ Empathetic |
| 2 | আজ পরীক্ষায় ফেল করেছি, খুব কষ্ট লাগছে। | আমি খুব দুঃখিত! আপনি কি আপনার উপস্থাপনা মনে করেন? | ✅ Supportive |
| 3 | নতুন চাকরি পেয়েছি! অনেক খুশি! | অভিনন্দন! আমি জানি তুমি কেমন অনুভব করছো। আমি সত্যিই এই সম্পর্কে উত্তেজিত! | ✅ Celebratory |
| 4 | আমার বন্ধু আমাকে ছেড়ে চলে গেছে। | আমি এটা শুনে দুঃখিত। আপনি কি তার সাথে কথা বলেছেন? | ✅ Caring |

Design Decisions & Trade-offs

1️⃣ Sequence Length: 256 vs Full Length

| Aspect          | Requirement | What We Did    | Why                   |
|-----------------|-------------|----------------|-----------------------|
| Sequence Length | Not reduced | Reduced to 256 | GPU memory constraint |

What "Sequence Length" Means:

  • Maximum number of tokens (words/subwords) the model processes at once
  • Original conversations may have 500-1000+ tokens
  • We truncated to 256 tokens

Why We Reduced It:

Problem: Kaggle T4 GPU has only 16GB VRAM

Full Length (512+ tokens):
- Memory needed: ~18-20GB ❌ Doesn't fit
- Batch size: 1 (very slow)
- Training time: 20+ hours

Reduced Length (256 tokens):
- Memory needed: ~12GB ✅ Fits
- Batch size: 4 (faster)
- Training time: 3 hours

Impact:

  • ~15% of conversations get truncated
  • Model may miss context in very long conversations
  • Core empathetic learning still happens (most empathy is expressed in first 256 tokens)
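The effect of the 256-token cap can be sketched without the real tokenizer. Plain integer ids stand in for LLaMA tokens here; the actual pipeline gets the same effect by passing `truncation=True, max_length=256` to the tokenizer.

```python
MAX_LENGTH = 256

def truncate(token_ids, max_length=MAX_LENGTH):
    # Everything past the cap is simply dropped, so empathy expressed
    # early in a conversation survives while late context is lost.
    return token_ids[:max_length]

long_conversation = list(range(700))          # a 700-token conversation
kept = truncate(long_conversation)
dropped = len(long_conversation) - len(kept)  # 444 tokens lost
```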

What Could Be Done:

  1. Use A100 GPU (40GB VRAM) → Can use 512-1024 tokens
  2. Use Unsloth library → 2x memory efficiency
  3. Use gradient accumulation with batch_size=1 → Slower but full length
  4. Use QLoRA with more aggressive quantization

2️⃣ Strategy Pattern: LoRA vs Unsloth

| Aspect         | Requirement                       | What We Did           | Why                               |
|----------------|-----------------------------------|-----------------------|-----------------------------------|
| Design Pattern | Strategy pattern for LoRA/Unsloth | Only LoRA implemented | Time constraint + LoRA sufficient |

What "Strategy Pattern" Means:

# Strategy Pattern = Swappable algorithms

class FineTuningStrategy:           # Abstract strategy
    def apply(self, model): pass

class LoRAStrategy(FineTuningStrategy):     # Strategy 1 ✅ Implemented
    def apply(self, model):
        return get_peft_model(model, lora_config)

class UnslothStrategy(FineTuningStrategy):  # Strategy 2 ❌ Not implemented
    def apply(self, model):
        return FastLanguageModel.get_peft_model(model)

# Usage: Can swap strategies easily
strategy = LoRAStrategy()  # or UnslothStrategy()
model = strategy.apply(base_model)

Why We Only Used LoRA:

| Factor               | LoRA               | Unsloth                           |
|----------------------|--------------------|-----------------------------------|
| Ease of setup        | ✅ Simple          | ⚠️ Requires specific installation |
| Kaggle compatibility | ✅ Works perfectly | ⚠️ May have conflicts             |
| Memory efficiency    | Good (4-bit)       | Better (2x faster)                |
| Our GPU time         | 3 hours left       | Not enough to debug issues        |
| Result quality       | ✅ Achieved goal   | Would be similar                  |

What Could Be Done:

# Full Strategy Pattern Implementation

from abc import ABC, abstractmethod

class FineTuningStrategy(ABC):
    @abstractmethod
    def apply(self, model, config):
        pass
    
    @abstractmethod
    def get_name(self):
        pass

class LoRAStrategy(FineTuningStrategy):
    def apply(self, model, config):
        from peft import get_peft_model, LoraConfig
        lora_config = LoraConfig(
            r=config.lora_r,
            lora_alpha=config.lora_alpha,
            target_modules=config.target_modules,
        )
        return get_peft_model(model, lora_config)
    
    def get_name(self):
        return "LoRA"

class UnslothStrategy(FineTuningStrategy):
    def apply(self, model, config):
        # Note: Unsloth loads its own quantized model, so the passed-in
        # base model is discarded here rather than wrapped.
        from unsloth import FastLanguageModel
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=config.model_name,
            max_seq_length=config.max_length,
            load_in_4bit=True,
        )
        return FastLanguageModel.get_peft_model(model)
    
    def get_name(self):
        return "Unsloth"

# Usage
class LLAMAFineTuner:
    def __init__(self, config, strategy: FineTuningStrategy):
        self.config = config
        self.strategy = strategy
    
    def prepare_model(self, base_model):
        print(f"Using {self.strategy.get_name()} strategy")
        return self.strategy.apply(base_model, self.config)

3️⃣ Data Sampling: 40% vs 100%

| Aspect        | Ideal                 | What We Did         | Why                 |
|---------------|-----------------------|---------------------|---------------------|
| Training Data | 100% (10,749 samples) | 40% (4,299 samples) | GPU time constraint |

Impact:

  • Model sees less variety of conversations
  • May not generalize as well to rare emotions
  • Still learns core empathetic patterns

What Could Be Done:

  • Train for longer with full dataset
  • Use data augmentation to increase variety
  • Prioritize diverse samples over random sampling
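Random subsampling of the kind used for the second run can be sketched in a few lines. The seed and the use of `random.sample` are assumptions; only the counts come from the tables above (10,749 conversations → 4,299 sampled).

```python
import random

# Reproducible 40% subsample; seed 42 is an illustrative choice, not
# necessarily the one used in training.
rng = random.Random(42)
all_indices = list(range(10_749))
sample_size = int(0.4 * len(all_indices))    # 4,299
subset = rng.sample(all_indices, sample_size)  # distinct indices, no repeats
```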

Architecture & OOP Design

Class Diagram

┌───────────────────────────────────────────────────────┐
│                     MAIN PIPELINE                     │
├───────────────────────────────────────────────────────┤
│                                                       │
│  ┌─────────────────────┐    ┌─────────────────────┐   │
│  │   DatasetProcessor  │    │   LLAMAFineTuner    │   │
│  ├─────────────────────┤    ├─────────────────────┤   │
│  │ + TEMPLATE          │    │ + model             │   │
│  │ + train_dataset     │    │ + tokenizer         │   │
│  │ + val_dataset       │    │ + trainer           │   │
│  ├─────────────────────┤    ├─────────────────────┤   │
│  │ + load()            │    │ + load_model()      │   │
│  │ + process()         │    │ + setup_trainer()   │   │
│  │ + _format()         │    │ + train()           │   │
│  │ + _tokenize()       │    │ + save_final()      │   │
│  └─────────────────────┘    │ + generate()        │   │
│                             └─────────────────────┘   │
│                                                       │
│  ┌─────────────────────┐    ┌─────────────────────┐   │
│  │     Evaluator       │    │  ExperimentLogger   │   │
│  ├─────────────────────┤    ├─────────────────────┤   │
│  │ + model             │    │ + db_path           │   │
│  ├─────────────────────┤    ├─────────────────────┤   │
│  │ + calculate_bleu()  │    │ + log()             │   │
│  │ + calculate_rouge() │    │ + log_response()    │   │
│  │ + calculate_ppl()   │    │ + _init_db()        │   │
│  │ + test_samples()    │    └─────────────────────┘   │
│  └─────────────────────┘                              │
│                                                       │
└───────────────────────────────────────────────────────┘

Database Schema

-- Stores training experiment metadata
CREATE TABLE LLAMAExperiments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    model_name TEXT,                    -- e.g., "meta-llama/Llama-3.1-8B-Instruct"
    lora_config TEXT,                   -- JSON: {"r": 16, "alpha": 32}
    train_loss REAL,                    -- e.g., 0.4190
    val_loss REAL,                      -- e.g., 0.3651
    duration_hours REAL,                -- e.g., 3.26
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Stores generated responses for analysis
CREATE TABLE GeneratedResponses (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    experiment_id INTEGER,              -- Links to LLAMAExperiments
    input_text TEXT,                    -- User input
    response_text TEXT,                 -- Model response
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (experiment_id) REFERENCES LLAMAExperiments(id)
);
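The `ExperimentLogger` from the class diagram can be sketched over this schema in a few lines (in-memory database here for illustration; the real logger writes `experiments.db`, and its method signatures may differ):

```python
import json
import sqlite3

# Minimal sketch of ExperimentLogger over the schema above.
class ExperimentLogger:
    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self._init_db()

    def _init_db(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS LLAMAExperiments (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                model_name TEXT, lora_config TEXT,
                train_loss REAL, val_loss REAL, duration_hours REAL,
                timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
            )""")

    def log(self, model_name, lora_config, train_loss, val_loss, hours):
        cur = self.conn.execute(
            "INSERT INTO LLAMAExperiments "
            "(model_name, lora_config, train_loss, val_loss, duration_hours) "
            "VALUES (?, ?, ?, ?, ?)",
            (model_name, json.dumps(lora_config), train_loss, val_loss, hours),
        )
        self.conn.commit()
        return cur.lastrowid

logger = ExperimentLogger()
exp_id = logger.log("meta-llama/Llama-3.1-8B-Instruct",
                    {"r": 16, "alpha": 32}, 0.4190, 0.3651, 3.26)
```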

Usage

Installation

pip install transformers peft bitsandbytes accelerate torch

Load and Use the Model

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Quantization config (required for 4-bit loading)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model, 
    "EdwardConstantine/bengali-empathy-llama"
)
tokenizer = AutoTokenizer.from_pretrained(
    "EdwardConstantine/bengali-empathy-llama"
)

# Generate empathetic response
def generate_response(prompt, max_tokens=200):
    full_prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a compassionate Bengali conversational AI. Respond with empathy. Reply in Bengali.<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()

# Example usage
response = generate_response("আমি খুব একা অনুভব করছি।")
print(response)

Challenges Faced

| Challenge             | Problem                                   | Solution                                          |
|-----------------------|-------------------------------------------|---------------------------------------------------|
| bitsandbytes + triton | triton.ops module not found               | Created an in-memory patch to mock the module     |
| JSONL Data Types      | Mixed string/number types in topic column | Robust loader that converts all values to strings |
| GPU Memory (16GB)     | Model too large for full training         | 4-bit quantization + gradient checkpointing       |
| Kaggle Timeout        | Lost 10+ hours of training                | Frequent checkpointing (every 300 steps)          |
| Download Issues       | Couldn't download from Kaggle             | Uploaded to HuggingFace Hub                       |
| Disk Space            | Kaggle ran out of space                   | Cleaned old checkpoints, uploaded to HuggingFace  |

Future Improvements

| Improvement                 | What It Would Do               | Difficulty       |
|-----------------------------|--------------------------------|------------------|
| Full sequence length (512+) | Better context understanding   | Needs better GPU |
| 100% training data          | Better generalization          | Needs more time  |
| 3 full epochs               | Lower loss, better quality     | Needs more time  |
| Unsloth integration         | 2x faster training             | Medium           |
| Gradio demo                 | Interactive web interface      | Easy             |
| More evaluation metrics     | BERTScore, semantic similarity | Easy             |
| Human evaluation study      | Real quality assessment        | Medium           |

Files in This Repository

| File                       | Description                 | Size    |
|----------------------------|-----------------------------|---------|
| adapter_model.safetensors  | Trained LoRA weights        | 168 MB  |
| adapter_config.json        | LoRA configuration          | 915 B   |
| tokenizer.json             | Tokenizer                   | 17.2 MB |
| tokenizer_config.json      | Tokenizer config            | 50.6 KB |
| special_tokens_map.json    | Special tokens              | 325 B   |
| chat_template.jinja        | Chat format template        | 4.61 KB |
| evaluation_results.csv     | Detailed evaluation results | 215 KB  |
| metrics_summary.json       | BLEU, ROUGE, Perplexity     | 0.2 KB  |
| human_evaluation_sheet.csv | Human evaluation template   | 15.5 KB |
| experiments.db             | SQLite experiment logs      | 12 KB   |

Citation

@misc{bengali-empathy-llama-2024,
  author = {EdwardConstantine},
  title = {Bengali Empathetic LLaMA: Fine-tuned LLaMA 3.1-8B for Empathetic Bengali Conversations},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/EdwardConstantine/bengali-empathy-llama}
}

License

This model is released under the Apache 2.0 License, subject to Meta's LLaMA license terms.


Acknowledgments

  • Meta AI for LLaMA 3.1-8B-Instruct base model
  • Hugging Face for transformers and PEFT libraries
  • Kaggle for free GPU access
  • Bengali Empathetic Conversations Dataset creators