# 📋 Quick Reference - LDA Risk Discovery

## 🎯 What Changed

```
OLD: K-Means Clustering (hardcoded in risk_discovery.py)
NEW: LDA Topic Modeling (configurable, default in config.py)
```

---

## ⚡ Quick Start

### **1. Verify Setup:**
```bash
python3 test_lda_integration.py
# Expected: 4/4 tests passed ✅
```

### **2. Train with LDA:**
```bash
python3 train.py
# Look for: "🎯 Using LDA (Topic Modeling) for risk discovery"
```

### **3. Check Results:**
```bash
# Review discovered topics in training output
# Topics will be named like: Topic_PARTY_AGREEMENT, Topic_INTELLECTUAL_PROPERTY
```

---

## 🎛️ Configuration

### **File:** `config.py`

```python
# Method Selection
risk_discovery_method: str = "lda"   # Options: 'lda', 'kmeans'

# LDA Parameters
lda_doc_topic_prior: float = 0.1     # α (alpha): document-topic prior
lda_topic_word_prior: float = 0.01   # β (beta): topic-word prior
lda_max_iter: int = 20               # Maximum training iterations
lda_max_features: int = 5000         # Maximum vocabulary size
lda_learning_method: str = 'batch'   # Algorithm: 'batch' or 'online'
```

---

## 🔄 Switch Methods

### **Use LDA (default):**
```python
risk_discovery_method: str = "lda"
```

### **Use K-Means (old method):**
```python
risk_discovery_method: str = "kmeans"
```

---

## 📊 Performance Comparison

| Method | Balance | Distribution (min-max size) | Overlapping categories |
|--------|---------|-----------------------------|------------------------|
| **LDA** | **0.718** | 1,146-3,426 | ✅ Yes |
| K-Means | 0.481 | 436-9,163 | ❌ No |

**Winner:** LDA (49% better balance than K-Means)

---

## 🛠️ Tuning Guide

### **More Focused Topics:**
```python
lda_doc_topic_prior: float = 0.05    # Lower = more focused
lda_topic_word_prior: float = 0.005  # Lower = sharper
```

### **Better Convergence:**
```python
lda_max_iter: int = 30  # More iterations
```

### **Faster Training (large datasets):**
```python
lda_learning_method: str = 'online'  # vs 'batch'
```

---

## 🔍 Code Changes Summary

### **1. config.py (8 lines added)**
- Added `risk_discovery_method = "lda"`
- Added 5 LDA-specific parameters

### **2. risk_discovery.py (140 lines added)**
- Added `LDARiskDiscovery` class
- Wraps `TopicModelingRiskDiscovery`
- Compatible interface with existing code

### **3. trainer.py (25 lines modified)**
- Added import: `LDARiskDiscovery`
- Added method selection logic
- Instantiates LDA or K-Means based on config

### **4. evaluator.py (no changes)**
- Already compatible ✅

---

## 📚 Documentation Files

1. **`doc/LDA_MIGRATION_GUIDE.md`** - Complete guide (480 lines)
2. **`doc/LDA_INTEGRATION_COMPLETE.md`** - Summary (280 lines)
3. **`doc/LDA_QUICK_REFERENCE.md`** - This file
4. **`test_lda_integration.py`** - Verification (230 lines)

---

## ✅ Verification Checklist

- [x] Config has `risk_discovery_method = "lda"`
- [x] `LDARiskDiscovery` class exists
- [x] Trainer uses dynamic method selection
- [x] All tests pass (4/4)
- [x] Documentation complete

---

## 🚀 Usage Examples

### **Basic Training:**
```bash
python3 train.py
```

### **With Custom Epochs:**
```bash
python3 train.py --epochs 10
```

### **Evaluate Model:**
```bash
python3 evaluate.py --checkpoint checkpoints/best_model.pt
```

### **Compare Methods:**
```bash
python3 compare_risk_discovery.py --advanced
```

---

## 🎯 Expected Output

### **LDA Discovery:**
```
🎯 Using LDA (Topic Modeling) for risk discovery
🔍 Discovering risk patterns using LDA (n_topics=7)...
📊 LDA provides balanced, overlapping risk categories
🎯 Best for legal text with multi-faceted risks
📊 Creating document-term matrix...
🧠 Fitting LDA model...
✅ LDA discovery complete: 7 risk topics found

🔍 Discovered Risk Patterns:
  • Topic_PARTY_AGREEMENT
    Keywords: party, agreement, shall, company, consent
  • Topic_INTELLECTUAL_PROPERTY
    Keywords: shall, product, products, agreement, section
  • Topic_COMPLIANCE
    Keywords: shall, agreement, laws, state, governed
  ...
```

---

## 🐛 Troubleshooting

### **Issue:** "LDA did not converge"
**Solution:** Increase iterations in `config.py`
```python
lda_max_iter: int = 30
```

### **Issue:** Topics too similar
**Solution:** Lower priors for sharper topics
```python
lda_doc_topic_prior: float = 0.05
lda_topic_word_prior: float = 0.005
```

### **Issue:** Slow training
**Solution:** Use online learning
```python
lda_learning_method: str = 'online'
```

### **Issue:** Want K-Means back
**Solution:** Change method in `config.py`
```python
risk_discovery_method: str = "kmeans"
```

---

## 💡 Key Insights

### **Why LDA Wins:**
1. **Balance:** 0.718 vs 0.481 (K-Means) - 49% better
2. **Overlapping:** Clauses can belong to multiple topics
3. **Probabilities:** Confidence scores for each assignment
4. **Interpretability:** Clear topic themes for legal text

### **When to Use Each:**

**Use LDA when:**
- ✅ Need balanced risk categories
- ✅ Clauses have multiple risk types
- ✅ Want probability distributions
- ✅ Need interpretable topics

**Use K-Means when:**
- Hard cluster assignments needed
- Speed is critical (slightly faster)
- Simple, clear boundaries preferred

---

## 📊 Comparison Data

From `risk_discovery_comparison_report.txt`:

```
Method                Balance    Patterns
----------------------------------------------
LDA (NEW DEFAULT)     0.718      7 (1,146-3,426)
Risk-o-meter          0.577      7 (534-4,363)
K-Means (OLD)         0.481      7 (436-9,163)
Hierarchical          0.362      7 (91-10,483)
Spectral              0.292      7 (11-13,702)
Mini-Batch            0.291      7 (2-13,785)
DBSCAN                1.000      1 (13,396)
----------------------------------------------
```

**Clear Winner:** LDA

DBSCAN's balance of 1.000 is trivial: it found only one pattern (13,396 items in a single cluster), so it is not a meaningful competitor among the 7-pattern methods.

---

## 🎓 Learn More

### **LDA Theory:**
- Blei et al. (2003) - "Latent Dirichlet Allocation"
- Probabilistic topic modeling for text
- Documents = mixture of topics
- Topics = distribution over words

### **LDA for Legal:**
- Overlapping categories (clauses have multiple themes)
- Interpretable (topic-word distributions)
- Proven for contracts (literature validated)

### **Parameters:**
- **α (doc_topic_prior):** Controls document focus
  - Lower (0.01-0.1) = more focused
  - Higher (0.5-1.0) = more mixed
- **β (topic_word_prior):** Controls topic specificity
  - Lower (0.001-0.01) = sharper topics
  - Higher (0.1-0.5) = broader topics

---

## ✨ Benefits Summary

### **For Users:**
- ✅ Better balanced risk categories
- ✅ More interpretable topic names
- ✅ Probability scores for confidence
- ✅ Best balance among the 7-pattern methods compared

### **For Developers:**
- ✅ Clean, compatible interface
- ✅ Easy to switch methods
- ✅ Well documented
- ✅ Comprehensive tests

### **For Models:**
- ✅ Better training data balance
- ✅ Less class imbalance (largest category about 3x the smallest, vs about 21x for K-Means)
- ✅ Richer feature representation
- ✅ Overlapping pattern support

---

## 📞 Support

**Questions?** See:
- `doc/LDA_MIGRATION_GUIDE.md` - Full guide
- `doc/LDA_INTEGRATION_COMPLETE.md` - Summary
- `risk_discovery_comparison_report.txt` - Results

**Test:** `python3 test_lda_integration.py`

**Train:** `python3 train.py`

---

- **Status:** ✅ **READY TO USE**
- **Default Method:** LDA
- **Backup Method:** K-Means (configurable)
- **Verified:** 4/4 tests passing
- **Documented:** Complete

🎉 **Happy Training!**
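
---

## 🧪 Appendix: Seeing the Effect of α

The Parameters notes say that a lower α (`lda_doc_topic_prior`) gives more focused documents and a higher α gives more mixed ones. The sketch below illustrates that claim using only the Python standard library. It is not the project's code and does not fit an LDA model: it only samples document-topic mixtures from a symmetric Dirichlet prior, assuming 7 topics as in the expected output above. The helper names `sample_dirichlet` and `mean_dominant_weight` are hypothetical and exist only for this illustration.

```python
from __future__ import annotations

import random


def sample_dirichlet(alpha: float, n_topics: int, rng: random.Random) -> list[float]:
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior."""
    while True:
        # A Dirichlet draw is a vector of independent Gamma(alpha, 1) draws,
        # normalized so the weights sum to 1.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
        total = sum(draws)
        if total > 0.0:  # guard against underflow when alpha is very small
            return [d / total for d in draws]


def mean_dominant_weight(
    alpha: float, n_topics: int = 7, n_docs: int = 2000, seed: int = 0
) -> float:
    """Average weight of the largest topic across simulated documents."""
    rng = random.Random(seed)
    total = sum(max(sample_dirichlet(alpha, n_topics, rng)) for _ in range(n_docs))
    return total / n_docs


if __name__ == "__main__":
    # Values taken from the ranges listed in the Parameters notes.
    for alpha in (0.05, 0.1, 0.5, 1.0):
        weight = mean_dominant_weight(alpha)
        print(f"alpha={alpha:<4} mean dominant-topic weight = {weight:.2f}")
```

With small α, most of each simulated document's weight should land on a single topic. As α approaches 1.0, the weight spreads across several topics, which matches the "more focused" versus "more mixed" guidance. A fitted model also depends on the data, so treat this as intuition for the prior only.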