Bengali Empathetic LLaMA 🇧🇩

Fine-tuned LLaMA 3.1-8B-Instruct for empathetic Bengali conversations using LoRA (Low-Rank Adaptation).


📖 Table of Contents

  1. Model Description
  2. Training Details
  3. Evaluation Results
  4. Sample Responses
  5. Design Decisions & Trade-offs
  6. Architecture & OOP Design
  7. Usage
  8. Challenges Faced
  9. Future Improvements
  10. Files in This Repository

Model Description

This model is a LoRA fine-tuned version of Meta's LLaMA 3.1-8B-Instruct, specifically trained to generate compassionate and empathetic responses in Bengali.

What This Model Does

  • Input: Bengali text expressing emotions (sadness, happiness, frustration, etc.)
  • Output: Empathetic Bengali response with emotional understanding

Example

Input:  আমি খুব একা অনুভব করছি। (I feel very lonely)
Output: হ্যাঁ, এটা খুব কঠিন। কিন্তু আমি আশা করি আপনি শীঘ্রই একজন বন্ধু পাবেন।
        (Yes, this is very hard. But I hope you will find a friend soon.)

Training Details

Training History

❌ First Attempt (Interrupted - Progress Lost)

Our initial training with optimal settings was interrupted at 66% completion due to a Kaggle session timeout:

| Setting    | Value                     |
|------------|---------------------------|
| Data       | 100% (10,749 samples)     |
| Epochs     | 3                         |
| Max Length | 384 tokens                |
| Progress   | 5,329 / 8,061 steps (66%) |

Loss Progression (Before Interruption):

| Step | Training Loss   | Validation Loss |
|------|-----------------|-----------------|
| 500  | 0.4459          | -               |
| 1000 | 0.3869          | -               |
| 2000 | 0.3292          | 0.3281          |
| 3000 | 0.2450          | -               |
| 4000 | 0.2351          | 0.2642          |
| 5000 | 0.2093          | -               |
| 5329 | Session Timeout | -               |

⚠️ If completed, this training would have achieved ~0.18-0.20 final loss with significantly better quality. The checkpoint was lost because saves were configured every 2000 steps, and the session crashed before the next save.

✅ Second Attempt (Completed Successfully)

With the remaining GPU quota (~3 hours), we completed a condensed training run:

| Setting       | Value                       |
|---------------|-----------------------------|
| Data          | 40% sample (4,299 samples)  |
| Epochs        | 2                           |
| Max Length    | 256 tokens                  |
| Training Time | 3.26 hours                  |
| Platform      | Kaggle Tesla T4 (16GB VRAM) |

Final Results:

| Metric          | Value  |
|-----------------|--------|
| Training Loss   | 0.4190 |
| Validation Loss | 0.3651 |

LoRA Configuration

| Parameter      | Value    | Explanation                                                                  |
|----------------|----------|------------------------------------------------------------------------------|
| Rank (r)       | 16       | Rank of the low-rank update matrices. Higher = more capacity but more memory |
| Alpha          | 32       | Scaling factor (alpha/r = 2x multiplier)                                     |
| Dropout        | 0.05     | Light regularization to prevent overfitting                                  |
| Target Modules | 7 layers | All attention (q,k,v,o) + MLP (gate, up, down) projections                   |
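The table above maps onto a PEFT `LoraConfig` roughly as follows. This is a sketch: the seven target-module names are assumed from the standard Hugging Face LLaMA implementation, and "7 layers" in the table refers to these seven projection matrices per transformer block.

```python
# Sketch of the LoRA settings above as they would be passed to
# peft.LoraConfig(**lora_cfg). Module names are assumptions based on
# the standard Hugging Face LLaMA layer naming.
lora_cfg = {
    "r": 16,                # rank of the low-rank update matrices
    "lora_alpha": 32,       # scaling factor; effective multiplier = alpha / r = 2
    "lora_dropout": 0.05,   # light regularization on the adapter inputs
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    "task_type": "CAUSAL_LM",
}
```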

Training Hyperparameters

| Parameter              | Value                  |
|------------------------|------------------------|
| Optimizer              | paged_adamw_8bit       |
| Learning Rate          | 3e-4                   |
| LR Scheduler           | Cosine                 |
| Warmup Ratio           | 0.05                   |
| Batch Size             | 4                      |
| Gradient Accumulation  | 1                      |
| Precision              | FP16 (Mixed Precision) |
| Gradient Checkpointing | Enabled                |
| Quantization           | 4-bit NF4              |
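For orientation, these hyperparameters correspond to `transformers.TrainingArguments` keywords roughly as below. The step-count arithmetic assumes all 4,299 sampled conversations land in the training split, which the tables above don't state exactly.

```python
import math

# The hyperparameters above as transformers.TrainingArguments keywords
# (a sketch; output settings such as output_dir and save_steps are omitted).
training_kwargs = {
    "optim": "paged_adamw_8bit",
    "learning_rate": 3e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "fp16": True,
    "gradient_checkpointing": True,
    "num_train_epochs": 2,
}

# Rough optimizer-step count for the completed run, assuming all 4,299
# sampled conversations are in the training split:
steps_per_epoch = math.ceil(4299 / training_kwargs["per_device_train_batch_size"])
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]   # about 2,150
```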

Evaluation Results

| Metric     | Score  | Interpretation                    |
|------------|--------|-----------------------------------|
| BLEU-1     | 0.0613 | Unigram word overlap              |
| BLEU-2     | 0.0251 | Bigram overlap                    |
| BLEU-4     | 0.0098 | 4-gram overlap                    |
| ROUGE-1    | 0.0117 | Unigram recall                    |
| ROUGE-2    | 0.0000 | Bigram recall                     |
| ROUGE-L    | 0.0117 | Longest common subsequence        |
| Perplexity | 1.9493 | Model confidence (lower = better) |

Why Are BLEU/ROUGE Scores Low?

This is expected and normal for empathetic response generation. Here's why:

  1. Multiple Valid Responses: There are many ways to express empathy

    • Reference: "আমি দুঃখিত" (I'm sorry)
    • Generated: "এটা কঠিন হবে" (This must be hard)
    • Both are empathetic but share no words → BLEU = 0
  2. Creative Generation: The model generates contextually appropriate but not verbatim responses

  3. Bengali Language: Morphologically rich language with many word forms

  4. Perplexity is Good: 1.95 indicates the model is confident in its predictions
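Points 1 and 4 can be made concrete with a few lines of arithmetic (pure illustration; the split on which perplexity was measured isn't stated above):

```python
import math

# Point 1: two valid empathetic replies that share no tokens score
# zero on n-gram overlap metrics.
reference = "আমি দুঃখিত".split()        # "I'm sorry"
generated = "এটা কঠিন হবে".split()      # "This must be hard"
overlap = set(reference) & set(generated)
unigram_precision = len(overlap) / len(generated)   # 0.0, so BLEU-1 = 0

# Point 4: perplexity is exp(mean cross-entropy), so the reported 1.9493
# corresponds to roughly 0.67 nats of loss per token.
loss_per_token = math.log(1.9493)
```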

Human Evaluation Framework

We created a human evaluation template with these criteria (1-5 scale):

| Criterion   | What It Measures                                    |
|-------------|-----------------------------------------------------|
| Empathy     | Does the response show understanding of emotions?   |
| Relevance   | Is the response relevant to the input?              |
| Fluency     | Is the Bengali grammatically correct and natural?   |
| Helpfulness | Does the response provide comfort or useful advice? |
| Overall     | Overall quality of the response                     |

📄 File: human_evaluation_sheet.csv (20 samples for manual evaluation)
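A sheet with this layout can be generated in a few lines. The column names below follow the criteria table above; the actual `human_evaluation_sheet.csv` may differ in detail.

```python
import csv
import io

# Sketch of the human-evaluation sheet layout; raters fill the 1-5
# score columns by hand. Column names are assumed from the criteria table.
fieldnames = ["input", "response",
              "empathy", "relevance", "fluency", "helpfulness", "overall"]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerow({"input": "আমি খুব একা অনুভব করছি।",
                 "response": "হ্যাঁ, এটা খুব কঠিন।"})   # score columns left blank
sheet = buf.getvalue()
```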


Sample Responses

| # | Input (Bengali) | Model Response | Quality |
|---|-----------------|----------------|---------|
| 1 | আমি খুব একা অনুভব করছি। | হ্যাঁ, এটা খুব কঠিন। কিন্তু আমি আশা করি আপনি শীঘ্রই একজন বন্ধু পাবেন। | ✅ Empathetic |
| 2 | আজ পরীক্ষায় ফেল করেছি, খুব কষ্ট লাগছে। | আমি খুব দুঃখিত! আপনি কি আপনার উপস্থাপনা মনে করেন? | ✅ Supportive |
| 3 | নতুন চাকরি পেয়েছি! অনেক খুশি! | অভিনন্দন! আমি জানি তুমি কেমন অনুভব করছো। আমি সত্যিই এই সম্পর্কে উত্তেজিত! | ✅ Celebratory |
| 4 | আমার বন্ধু আমাকে ছেড়ে চলে গেছে। | আমি এটা শুনে দুঃখিত। আপনি কি তার সাথে কথা বলেছেন? | ✅ Caring |

Design Decisions & Trade-offs

1️⃣ Sequence Length: 256 vs Full Length

| Aspect          | Requirement | What We Did    | Why                   |
|-----------------|-------------|----------------|-----------------------|
| Sequence Length | Not reduced | Reduced to 256 | GPU memory constraint |

What "Sequence Length" Means:

  • Maximum number of tokens (words/subwords) the model processes at once
  • Original conversations may have 500-1000+ tokens
  • We truncated to 256 tokens

Why We Reduced It:

Problem: Kaggle T4 GPU has only 16GB VRAM

Full Length (512+ tokens):
- Memory needed: ~18-20GB ❌ Doesn't fit
- Batch size: 1 (very slow)
- Training time: 20+ hours

Reduced Length (256 tokens):
- Memory needed: ~12GB ✅ Fits
- Batch size: 4 (faster)
- Training time: 3 hours

Impact:

  • ~15% of conversations get truncated
  • Model may miss context in very long conversations
  • Core empathetic learning still happens (most empathy is expressed in first 256 tokens)
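The effect of the 256-token cap can be sketched without the real tokenizer. Plain integer ids stand in for LLaMA tokens here; the actual pipeline gets the same effect by passing `truncation=True, max_length=256` to the tokenizer.

```python
MAX_LENGTH = 256

def truncate(token_ids, max_length=MAX_LENGTH):
    # Everything past the cap is simply dropped, so empathy expressed
    # early in a conversation survives while late context is lost.
    return token_ids[:max_length]

long_conversation = list(range(700))          # a 700-token conversation
kept = truncate(long_conversation)
dropped = len(long_conversation) - len(kept)  # 444 tokens lost
```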

What Could Be Done:

  1. Use A100 GPU (40GB VRAM) → Can use 512-1024 tokens
  2. Use Unsloth library → 2x memory efficiency
  3. Use gradient accumulation with batch_size=1 → Slower but full length
  4. Use QLoRA with more aggressive quantization

2️⃣ Strategy Pattern: LoRA vs Unsloth

| Aspect         | Requirement                       | What We Did           | Why                               |
|----------------|-----------------------------------|-----------------------|-----------------------------------|
| Design Pattern | Strategy pattern for LoRA/Unsloth | Only LoRA implemented | Time constraint + LoRA sufficient |

What "Strategy Pattern" Means:

# Strategy Pattern = Swappable algorithms

class FineTuningStrategy:           # Abstract strategy
    def apply(self, model): pass

class LoRAStrategy(FineTuningStrategy):     # Strategy 1 ✅ Implemented
    def apply(self, model):
        return get_peft_model(model, lora_config)

class UnslothStrategy(FineTuningStrategy):  # Strategy 2 ❌ Not implemented
    def apply(self, model):
        return FastLanguageModel.get_peft_model(model)

# Usage: Can swap strategies easily
strategy = LoRAStrategy()  # or UnslothStrategy()
model = strategy.apply(base_model)

Why We Only Used LoRA:

| Factor               | LoRA               | Unsloth                           |
|----------------------|--------------------|-----------------------------------|
| Ease of setup        | ✅ Simple          | ⚠️ Requires specific installation |
| Kaggle compatibility | ✅ Works perfectly | ⚠️ May have conflicts             |
| Memory efficiency    | Good (4-bit)       | Better (2x faster)                |
| Our GPU time         | 3 hours left       | Not enough to debug issues        |
| Result quality       | ✅ Achieved goal   | Would be similar                  |

What Could Be Done:

# Full Strategy Pattern Implementation

from abc import ABC, abstractmethod

class FineTuningStrategy(ABC):
    @abstractmethod
    def apply(self, model, config):
        pass
    
    @abstractmethod
    def get_name(self):
        pass

class LoRAStrategy(FineTuningStrategy):
    def apply(self, model, config):
        from peft import get_peft_model, LoraConfig
        lora_config = LoraConfig(
            r=config.lora_r,
            lora_alpha=config.lora_alpha,
            target_modules=config.target_modules,
        )
        return get_peft_model(model, lora_config)
    
    def get_name(self):
        return "LoRA"

class UnslothStrategy(FineTuningStrategy):
    def apply(self, model, config):
        # Note: Unsloth loads its own quantized model, so the passed-in
        # base model is discarded here rather than wrapped.
        from unsloth import FastLanguageModel
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=config.model_name,
            max_seq_length=config.max_length,
            load_in_4bit=True,
        )
        return FastLanguageModel.get_peft_model(model)
    
    def get_name(self):
        return "Unsloth"

# Usage
class LLAMAFineTuner:
    def __init__(self, config, strategy: FineTuningStrategy):
        self.config = config
        self.strategy = strategy
    
    def prepare_model(self, base_model):
        print(f"Using {self.strategy.get_name()} strategy")
        return self.strategy.apply(base_model, self.config)

3️⃣ Data Sampling: 40% vs 100%

| Aspect        | Ideal                 | What We Did         | Why                 |
|---------------|-----------------------|---------------------|---------------------|
| Training Data | 100% (10,749 samples) | 40% (4,299 samples) | GPU time constraint |

Impact:

  • Model sees less variety of conversations
  • May not generalize as well to rare emotions
  • Still learns core empathetic patterns

What Could Be Done:

  • Train for longer with full dataset
  • Use data augmentation to increase variety
  • Prioritize diverse samples over random sampling
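Random subsampling of the kind used for the second run can be sketched in a few lines. The seed and the use of `random.sample` are assumptions; only the counts come from the tables above (10,749 conversations → 4,299 sampled).

```python
import random

# Reproducible 40% subsample; seed 42 is an illustrative choice, not
# necessarily the one used in training.
rng = random.Random(42)
all_indices = list(range(10_749))
sample_size = int(0.4 * len(all_indices))    # 4,299
subset = rng.sample(all_indices, sample_size)  # distinct indices, no repeats
```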

Architecture & OOP Design

Class Diagram

┌───────────────────────────────────────────────────────┐
│                     MAIN PIPELINE                     │
├───────────────────────────────────────────────────────┤
│                                                       │
│  ┌─────────────────────┐    ┌─────────────────────┐   │
│  │   DatasetProcessor  │    │   LLAMAFineTuner    │   │
│  ├─────────────────────┤    ├─────────────────────┤   │
│  │ + TEMPLATE          │    │ + model             │   │
│  │ + train_dataset     │    │ + tokenizer         │   │
│  │ + val_dataset       │    │ + trainer           │   │
│  ├─────────────────────┤    ├─────────────────────┤   │
│  │ + load()            │    │ + load_model()      │   │
│  │ + process()         │    │ + setup_trainer()   │   │
│  │ + _format()         │    │ + train()           │   │
│  │ + _tokenize()       │    │ + save_final()      │   │
│  └─────────────────────┘    │ + generate()        │   │
│                             └─────────────────────┘   │
│                                                       │
│  ┌─────────────────────┐    ┌─────────────────────┐   │
│  │     Evaluator       │    │  ExperimentLogger   │   │
│  ├─────────────────────┤    ├─────────────────────┤   │
│  │ + model             │    │ + db_path           │   │
│  ├─────────────────────┤    ├─────────────────────┤   │
│  │ + calculate_bleu()  │    │ + log()             │   │
│  │ + calculate_rouge() │    │ + log_response()    │   │
│  │ + calculate_ppl()   │    │ + _init_db()        │   │
│  │ + test_samples()    │    └─────────────────────┘   │
│  └─────────────────────┘                              │
│                                                       │
└───────────────────────────────────────────────────────┘

Database Schema

-- Stores training experiment metadata
CREATE TABLE LLAMAExperiments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    model_name TEXT,                    -- e.g., "meta-llama/Llama-3.1-8B-Instruct"
    lora_config TEXT,                   -- JSON: {"r": 16, "alpha": 32}
    train_loss REAL,                    -- e.g., 0.4190
    val_loss REAL,                      -- e.g., 0.3651
    duration_hours REAL,                -- e.g., 3.26
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Stores generated responses for analysis
CREATE TABLE GeneratedResponses (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    experiment_id INTEGER,              -- Links to LLAMAExperiments
    input_text TEXT,                    -- User input
    response_text TEXT,                 -- Model response
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (experiment_id) REFERENCES LLAMAExperiments(id)
);
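The `ExperimentLogger` from the class diagram can be sketched over this schema in a few lines (in-memory database here for illustration; the real logger writes `experiments.db`, and its method signatures may differ):

```python
import json
import sqlite3

# Minimal sketch of ExperimentLogger over the schema above.
class ExperimentLogger:
    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self._init_db()

    def _init_db(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS LLAMAExperiments (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                model_name TEXT, lora_config TEXT,
                train_loss REAL, val_loss REAL, duration_hours REAL,
                timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
            )""")

    def log(self, model_name, lora_config, train_loss, val_loss, hours):
        cur = self.conn.execute(
            "INSERT INTO LLAMAExperiments "
            "(model_name, lora_config, train_loss, val_loss, duration_hours) "
            "VALUES (?, ?, ?, ?, ?)",
            (model_name, json.dumps(lora_config), train_loss, val_loss, hours),
        )
        self.conn.commit()
        return cur.lastrowid

logger = ExperimentLogger()
exp_id = logger.log("meta-llama/Llama-3.1-8B-Instruct",
                    {"r": 16, "alpha": 32}, 0.4190, 0.3651, 3.26)
```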

Usage

Installation

pip install transformers peft bitsandbytes accelerate torch

Load and Use the Model

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Quantization config (required for 4-bit loading)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model, 
    "EdwardConstantine/bengali-empathy-llama"
)
tokenizer = AutoTokenizer.from_pretrained(
    "EdwardConstantine/bengali-empathy-llama"
)

# Generate empathetic response
def generate_response(prompt, max_tokens=200):
    full_prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a compassionate Bengali conversational AI. Respond with empathy. Reply in Bengali.<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()

# Example usage
response = generate_response("আমি খুব একা অনুভব করছি।")
print(response)

Challenges Faced

| Challenge             | Problem                                   | Solution                                          |
|-----------------------|-------------------------------------------|---------------------------------------------------|
| bitsandbytes + triton | triton.ops module not found               | Created an in-memory patch to mock the module     |
| JSONL Data Types      | Mixed string/number types in topic column | Robust loader that converts all values to strings |
| GPU Memory (16GB)     | Model too large for full training         | 4-bit quantization + gradient checkpointing       |
| Kaggle Timeout        | Lost 10+ hours of training                | Frequent checkpointing (every 300 steps)          |
| Download Issues       | Couldn't download from Kaggle             | Uploaded to HuggingFace Hub                       |
| Disk Space            | Kaggle ran out of space                   | Cleaned old checkpoints, uploaded to HuggingFace  |

Future Improvements

| Improvement                 | What It Would Do               | Difficulty       |
|-----------------------------|--------------------------------|------------------|
| Full sequence length (512+) | Better context understanding   | Needs better GPU |
| 100% training data          | Better generalization          | Needs more time  |
| 3 full epochs               | Lower loss, better quality     | Needs more time  |
| Unsloth integration         | 2x faster training             | Medium           |
| Gradio demo                 | Interactive web interface      | Easy             |
| More evaluation metrics     | BERTScore, semantic similarity | Easy             |
| Human evaluation study      | Real quality assessment        | Medium           |

Files in This Repository

| File                       | Description                 | Size    |
|----------------------------|-----------------------------|---------|
| adapter_model.safetensors  | Trained LoRA weights        | 168 MB  |
| adapter_config.json        | LoRA configuration          | 915 B   |
| tokenizer.json             | Tokenizer                   | 17.2 MB |
| tokenizer_config.json      | Tokenizer config            | 50.6 KB |
| special_tokens_map.json    | Special tokens              | 325 B   |
| chat_template.jinja        | Chat format template        | 4.61 KB |
| evaluation_results.csv     | Detailed evaluation results | 215 KB  |
| metrics_summary.json       | BLEU, ROUGE, Perplexity     | 0.2 KB  |
| human_evaluation_sheet.csv | Human evaluation template   | 15.5 KB |
| experiments.db             | SQLite experiment logs      | 12 KB   |

Citation

@misc{bengali-empathy-llama-2024,
  author = {EdwardConstantine},
  title = {Bengali Empathetic LLaMA: Fine-tuned LLaMA 3.1-8B for Empathetic Bengali Conversations},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/EdwardConstantine/bengali-empathy-llama}
}

License

This model is released under the Apache 2.0 License, subject to Meta's LLaMA license terms.


Acknowledgments

  • Meta AI for LLaMA 3.1-8B-Instruct base model
  • Hugging Face for transformers and PEFT libraries
  • Kaggle for free GPU access
  • Bengali Empathetic Conversations Dataset creators