---
language: en
license: mit
tags:
- spiritual-ai
- brahma-kumaris
- murli
- distilgpt2
- maximum-accuracy
- experimental
- research-only
- peft
- lora
library_name: peft
base_model: distilgpt2
---
# Murli Assistant - DistilGPT-2 MAXIMUM (Experimental)

⚠️ **WARNING: EXPERIMENTAL MODEL - NOT FOR PRODUCTION USE** ⚠️

This model represents the **absolute maximum training** we could apply to DistilGPT-2 on murli content, but **quality remains insufficient** for spiritual guidance. It is deployed for research, comparison, and educational purposes only.
## ⚠️ Critical Limitations

### Known Quality Issues:

- ❌ **Hallucinations persist** despite maximum training
- ❌ **Social media contamination** (Twitter URLs, @mentions in responses)
- ❌ **Factual inaccuracies** in spiritual concepts
- ❌ **Mixed content** from base model pre-training
- ❌ **Not suitable for spiritual guidance**
### Why This Model Exists:

- ✅ **Research benchmark** for small-model limitations
- ✅ **Comparison baseline** against larger models (Phi-2, Flan-T5)
- ✅ **Educational example** of training optimization
- ✅ **Proof that model size matters** for specialized domains

### Production Recommendation:

**Use Phi-2 (2.7B params) instead** - it delivers proven quality for a murli chatbot.
## Maximum Training Configuration

### This is the best DistilGPT-2 can do:

**LoRA Configuration (MAXIMUM)** (see the configuration sketch below):
- **Rank (r):** 32 (8× the standard r=4)
- **Alpha:** 64 (8× the standard alpha=8)
- **Target Modules:** c_attn, c_proj, c_fc (all attention and feed-forward projections)
- **Trainable Parameters:** 2.36M (2.80% of the model)
- **Dropout:** 0.05 (lowered to maximize learning)
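For reference, the hyperparameters above map onto a `peft` `LoraConfig` roughly as follows. This is a minimal reconstruction matching the listed values, not the original training script:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Sketch of a LoRA setup matching the values above (reconstruction, not the original script)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=32,                                         # maximum rank used for this run
    lora_alpha=64,                                # scaling factor
    target_modules=["c_attn", "c_proj", "c_fc"],  # attention + feed-forward projections
    lora_dropout=0.05,
    bias="none",
)

base = AutoModelForCausalLM.from_pretrained("distilgpt2")
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()  # should report roughly 2.36M trainable params (~2.8%)
```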
**Training Data (MAXIMUM):**
- **Murlis Used:** 500
- **Training Examples:** 344
- **Context Length:** 512 tokens (maximum)
- **Spiritual Concepts:** 15 detailed examples with full explanations

**Training Configuration (MAXIMUM)** (see the sketch below):
- **Epochs:** 15 (5× the standard 3)
- **Effective Batch Size:** 16
- **Learning Rate:** 5e-05 (deliberately conservative)
- **Warmup Steps:** 200 (4× the standard)
- **Scheduler:** cosine
- **Weight Decay:** 0.02 (regularization)
- **Training Time:** ~2h 50m on CPU
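In Hugging Face `transformers`, these settings correspond roughly to the `TrainingArguments` below. The per-device batch size and gradient-accumulation split is an assumption; only the effective batch size of 16 is documented above:

```python
from transformers import TrainingArguments, Trainer

# Sketch of the training settings listed above; the 4 x 4 batch split is an assumption
training_args = TrainingArguments(
    output_dir="murli-distilgpt2-maximum",
    num_train_epochs=15,
    per_device_train_batch_size=4,   # assumed split
    gradient_accumulation_steps=4,   # 4 x 4 = effective batch size 16
    learning_rate=5e-5,
    warmup_steps=200,
    lr_scheduler_type="cosine",
    weight_decay=0.02,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = Trainer(
    model=peft_model,             # LoRA-wrapped model from the sketch above
    args=training_args,
    train_dataset=train_dataset,  # the 344 tokenized examples, prepared separately
)
trainer.train()
```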
**Final Training Loss:** 1.609 (66% improvement over the standard 4.77)

## Progressive Training Comparison

| Version | LoRA Rank | Epochs | Murlis | Loss | Quality |
|---------|-----------|--------|--------|------|---------|
| Standard | 4 | 3 | 150 | 4.77 | ❌ Poor |
| Enhanced | 16 | 10 | 300 | 2.07 | ❌ Poor |
| **MAXIMUM** | **32** | **15** | **500** | **1.61** | ❌ **Still Poor** |

**Key Finding:** Loss improvement does NOT guarantee quality improvement for small models in specialized domains.
## What We Learned

### Why 82M Parameters Are Insufficient:

1. **Base Model Dominance:** Pre-trained on internet text (Twitter, social media)
2. **Fine-tuning Limitations:** Only 2.8% of the model is trainable with LoRA
3. **Knowledge Capacity:** Cannot store specialized domain knowledge on top of general language ability
4. **Pattern vs Knowledge:** Learns the format but not deep spiritual understanding

### Improvements in MAXIMUM vs Standard:

✅ LoRA rank: 32 (8× standard, 2× enhanced)
✅ LoRA alpha: 64 (8× standard, 2× enhanced)
✅ Target modules: c_attn + c_proj + c_fc (all layers)
✅ Epochs: 15 (5× standard, 1.5× enhanced)
✅ Murlis: 500 (3.3× standard, 1.67× enhanced)
✅ Context: 512 tokens (2× standard, 1.33× enhanced)
✅ 15 detailed spiritual concepts with full explanations
✅ 7 different formats per murli for comprehensive learning
✅ Deliberately conservative learning rate (5e-5)
✅ Maximum warmup (200 steps)
✅ Larger effective batch size (16)
✅ Stronger regularization (0.02 weight decay)

### What STILL Doesn't Work:

- Accurate explanations of core BK concepts
- Freedom from social media text patterns
- Consistent factual responses
- Reliable spiritual guidance
## Usage (For Research/Demo Only)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

base_model = AutoModelForCausalLM.from_pretrained(
    "distilgpt2",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load MAXIMUM adapter
model = PeftModel.from_pretrained(
    base_model,
    "eswarankrishnamurthy/murli-assistant-distilgpt2-maximum"
)

# Chat function
def chat(message):
    prompt = f"Question: {message}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,          # required for temperature/top_p/top_k to take effect
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.2,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.eos_token_id
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.split("Answer:", 1)[1].strip() if "Answer:" in response else response

# Test (expect mixed quality)
print(chat("What is soul consciousness?"))
```
## Performance Metrics

**Inference Speed (CPU)** (measured as sketched below):
- Fastest: 1.13s
- Average: 2.69s
- Slowest: 3.55s
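Numbers like these can be reproduced with a simple wall-clock loop around the `chat` helper from the usage example; the prompts below are illustrative:

```python
import time

# Rough CPU latency measurement using the chat() helper defined earlier
prompts = [
    "What is soul consciousness?",
    "What is the purpose of meditation?",
    "What does the murli say about peace?",
]

timings = []
for p in prompts:
    start = time.perf_counter()
    chat(p)
    timings.append(time.perf_counter() - start)

print(f"Fastest: {min(timings):.2f}s  Average: {sum(timings)/len(timings):.2f}s  Slowest: {max(timings):.2f}s")
```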
**Resource Usage:**
- RAM: ~1.5-2 GB
- Model Size: 3.1 MB (adapter only)
- Base Model: 353 MB (DistilGPT-2)

**Compared to Production Models:**
- **Phi-2 (2.7B):** 33× larger, ⭐⭐⭐⭐⭐ quality, 5-10s inference
- **Flan-T5:** 3× larger, ⭐⭐⭐⭐ quality, 3-5s inference
- **DistilGPT-2 MAX:** smallest, ⭐ quality, 1-3s inference
## Use Cases

### ✅ Appropriate Uses:
- Research on model size limitations
- Benchmarking against larger models
- Speed comparisons
- Educational demonstrations
- Training optimization studies

### ❌ Inappropriate Uses:
- **Spiritual guidance** (use Phi-2 instead)
- **Production chatbot** (unreliable responses)
- **Educational content** (may teach incorrect concepts)
- **Public deployment** (without strong disclaimers)
## Technical Details

**Architecture:**
- Base: DistilGPT-2 (82M parameters)
- Fine-tuning: LoRA (Low-Rank Adaptation)
- Modified layers: all attention and feed-forward projections (c_attn, c_proj, c_fc); see the inspection snippet below
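One way to confirm which modules carry LoRA weights is to list the adapter parameters on the loaded model; a small sketch assuming the `model` object from the usage example above:

```python
# List modules that received LoRA adapters (assumes `model` from the usage example)
lora_modules = sorted({
    name.split(".lora_")[0]
    for name, _ in model.named_parameters()
    if "lora_" in name
})
for module_name in lora_modules:
    print(module_name)  # expect c_attn, c_proj and c_fc entries in every transformer block
```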
**Training Process** (a data-preparation sketch follows the list):
1. Connected to MongoDB Atlas (1,072 murlis available)
2. Selected 500 murlis for training
3. Created 344 enhanced training examples
4. Trained for 15 epochs with a cosine LR schedule
5. Reached a final loss of 1.61, the lowest of the three runs
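The data-preparation step (points 1-3) might look something like the sketch below. The connection string, database/collection names, document fields, and prompt template are all illustrative assumptions, not the actual pipeline:

```python
from pymongo import MongoClient

# Hypothetical data-preparation sketch: URI, names, fields and prompt format are assumptions
client = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")
murlis = client["murli_db"]["murlis"].find().limit(500)  # 500 of the 1,072 available

examples = []
for murli in murlis:
    # One of the several formats generated per murli: a question/answer pair
    examples.append(
        f"Question: What is the essence of the murli dated {murli.get('date')}?\n"
        f"Answer: {murli.get('summary', '')}"
    )

print(f"Prepared {len(examples)} raw examples before filtering and tokenization")
```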
**What Went Right:**
- Perfect training convergence
- Stable gradients throughout
- Learned BK terminology and format
- Fast inference speed maintained

**What Went Wrong:**
- Quality didn't match the loss improvement
- Social media patterns contaminate responses
- Hallucinations persist despite maximum training
- Cannot reliably explain spiritual concepts
## Research Value

This model demonstrates several important points for AI/ML research:

1. **Model capacity is non-negotiable** for specialized domains
2. **Loss metrics can be misleading** without quality evaluation
3. **Fine-tuning has fundamental limits** set by the base model's size
4. **More training ≠ better quality** when capacity is insufficient
5. **Pre-training patterns dominate** small-model behavior
## Educational Message

**Before deploying any AI model:**
- ✅ Test quality thoroughly, not just training metrics
- ✅ Use appropriate model size for domain complexity
- ✅ Understand fine-tuning limitations
- ✅ Consider the base model's pre-training influence
- ✅ Validate against production requirements
## Complete Training History

**Completed:** 2025-10-03T12:25:52.051354

**Loss Progression:**
- Epoch 1: 4.68 → 4.48
- Epoch 5: 3.44 (breakthrough)
- Epoch 10: 1.81 (excellent convergence)
- Epoch 15: **1.61 (best possible for DistilGPT-2)**

**Gradient Norms:** Stable (0.72 - 1.72)
## Final Verdict

**Technical Success:** ✅ Perfect training, lowest loss achieved
**Functional Success:** ❌ Quality insufficient for spiritual guidance
**Research Value:** ✅ Invaluable insights for model selection

### Recommendation:

**For a production murli chatbot, use [Phi-2](https://huggingface.co/microsoft/phi-2)** fine-tuned on murli data.

This MAXIMUM model demonstrates that **small models cannot reliably handle specialized spiritual domains**, regardless of training optimization.
## Related Models

- **Standard Version:** [murli-assistant-distilgpt2-lite](https://huggingface.co/eswarankrishnamurthy/murli-assistant-distilgpt2-lite) (LoRA r=4)
- **Enhanced Version:** to be released (LoRA r=16)
- **Recommended Production:** Phi-2 based murli assistant (coming soon)
## Citation

```bibtex
@misc{murli-distilgpt2-maximum,
  author = {eswarankrishnamurthy},
  title = {Murli Assistant - DistilGPT-2 MAXIMUM (Experimental)},
  year = {2025},
  publisher = {HuggingFace},
  note = {Experimental model demonstrating small model limitations},
  url = {https://huggingface.co/eswarankrishnamurthy/murli-assistant-distilgpt2-maximum}
}
```
## Contact

For questions about this research or the production Phi-2 model, please open an issue.
---

## ⚠️ DISCLAIMER

**This model is provided for research and educational purposes only.**

- Not suitable for spiritual guidance
- May produce incorrect or misleading information
- Responses should be verified against authentic murli sources
- Use at your own discretion

**For reliable murli assistance, consult:**
- Official Brahma Kumaris publications
- Experienced BK teachers
- The production Phi-2 based murli assistant (when available)

---

**Om Shanti!**

*Maximum training doesn't overcome fundamental capacity limits.*
*Sometimes you just need a bigger model.*
---

**Model Type:** Experimental Research Model
**Quality Rating:** ⭐ (Insufficient for production)
**Speed Rating:** ⭐⭐⭐⭐⭐ (Excellent)
**Recommended Alternative:** Phi-2 (⭐⭐⭐⭐⭐ quality)