# 📋 Quick Reference - LDA Risk Discovery

## 🎯 What Changed

```
OLD: K-Means Clustering (hardcoded in risk_discovery.py)
NEW: LDA Topic Modeling (configurable, default in config.py)
```

---

## ⚡ Quick Start

### **1. Verify Setup:**
```bash
python3 test_lda_integration.py
# Expected: 4/4 tests passed ✅
```

### **2. Train with LDA:**
```bash
python3 train.py
# Look for: "🎯 Using LDA (Topic Modeling) for risk discovery"
```

### **3. Check Results:**
```bash
# Review discovered topics in training output
# Topics will be named like: Topic_PARTY_AGREEMENT, Topic_INTELLECTUAL_PROPERTY
```

---

## 🎛️ Configuration

### **File:** `config.py`

```python
# Method Selection
risk_discovery_method: str = "lda"  # Options: 'lda', 'kmeans'

# LDA Parameters
lda_doc_topic_prior: float = 0.1      # α (alpha): document-topic prior
lda_topic_word_prior: float = 0.01    # β (beta): topic-word prior
lda_max_iter: int = 20                # Maximum training iterations
lda_max_features: int = 5000          # Maximum vocabulary size
lda_learning_method: str = 'batch'    # Options: 'batch', 'online'
```
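
The sketch below shows one way these fields could be grouped and handed to a topic model. It assumes the LDA backend is scikit-learn, since the field names match the keyword arguments of `LatentDirichletAllocation` and `CountVectorizer`; the class name `RiskDiscoveryConfig` and both helper functions are hypothetical and do not come from `config.py`.

```python
from dataclasses import dataclass


@dataclass
class RiskDiscoveryConfig:
    """Hypothetical container mirroring the fields listed above."""
    risk_discovery_method: str = "lda"
    lda_doc_topic_prior: float = 0.1
    lda_topic_word_prior: float = 0.01
    lda_max_iter: int = 20
    lda_max_features: int = 5000
    lda_learning_method: str = "batch"


def lda_kwargs(cfg: RiskDiscoveryConfig) -> dict:
    """Strip the 'lda_' prefix to get model keyword arguments.

    Assumption: the key names follow scikit-learn's LatentDirichletAllocation.
    """
    return {
        "doc_topic_prior": cfg.lda_doc_topic_prior,
        "topic_word_prior": cfg.lda_topic_word_prior,
        "max_iter": cfg.lda_max_iter,
        "learning_method": cfg.lda_learning_method,
    }


def vectorizer_kwargs(cfg: RiskDiscoveryConfig) -> dict:
    """Vocabulary size belongs to the document-term matrix step, not the model."""
    return {"max_features": cfg.lda_max_features}


cfg = RiskDiscoveryConfig()
print(lda_kwargs(cfg))
print(vectorizer_kwargs(cfg))
```

With scikit-learn installed, these dictionaries could be passed as `LatentDirichletAllocation(n_components=7, **lda_kwargs(cfg))` and `CountVectorizer(**vectorizer_kwargs(cfg))`. Whether `LDARiskDiscovery` does exactly this is not confirmed by this document.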

---

## 🔄 Switch Methods

### **Use LDA (default):**
```python
risk_discovery_method: str = "lda"
```

### **Use K-Means (old method):**
```python
risk_discovery_method: str = "kmeans"
```

---

## 📊 Performance Comparison

| Method | Balance Score | Pattern Size (min-max) | Overlapping Categories |
|--------|---------|--------------|-------------|
| **LDA** | **0.718** | 1,146-3,426 | ✅ Yes |
| K-Means | 0.481 | 436-9,163 | ❌ No |

**Winner:** LDA (balance 0.718 vs 0.481, about 49% higher)

---

## 🛠️ Tuning Guide

### **More Focused Topics:**
```python
lda_doc_topic_prior: float = 0.05    # Lower = more focused
lda_topic_word_prior: float = 0.005   # Lower = sharper
```

### **Better Convergence:**
```python
lda_max_iter: int = 30  # More iterations
```

### **Faster Training (large datasets):**
```python
lda_learning_method: str = 'online'  # vs 'batch'
```

---

## 🔍 Code Changes Summary

### **1. config.py (8 lines added)**
- Added `risk_discovery_method = "lda"`
- Added 5 LDA-specific parameters

### **2. risk_discovery.py (140 lines added)**
- Added `LDARiskDiscovery` class
- Wraps `TopicModelingRiskDiscovery`
- Compatible interface with existing code

### **3. trainer.py (25 lines modified)**
- Added import: `LDARiskDiscovery`
- Added method selection logic
- Instantiates LDA or K-Means based on config
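
A minimal sketch of what the method selection could look like. The two classes here are stand-ins with assumed constructor signatures, and `build_risk_discovery` is a hypothetical helper; the real logic lives in `trainer.py` and may differ.

```python
class KMeansRiskDiscovery:
    """Stand-in for the existing K-Means discovery class (name assumed)."""

    def __init__(self, n_clusters: int = 7):
        self.n_clusters = n_clusters


class LDARiskDiscovery:
    """Stand-in for the class added to risk_discovery.py (signature assumed)."""

    def __init__(self, n_topics: int = 7, **lda_params):
        self.n_topics = n_topics
        self.lda_params = lda_params


def build_risk_discovery(method: str, n_patterns: int = 7, **lda_params):
    """Return a discovery object for the configured method name."""
    method = method.lower()
    if method == "lda":
        print("Using LDA (Topic Modeling) for risk discovery")
        return LDARiskDiscovery(n_topics=n_patterns, **lda_params)
    if method == "kmeans":
        print("Using K-Means for risk discovery")
        return KMeansRiskDiscovery(n_clusters=n_patterns)
    # Fail loudly on typos instead of silently falling back to a default.
    raise ValueError(f"Unknown risk_discovery_method: {method!r}")


discovery = build_risk_discovery("lda", doc_topic_prior=0.1, topic_word_prior=0.01)
print(type(discovery).__name__)
```

Raising on an unknown method name is a design choice made for this sketch; it keeps a misspelled value in `config.py` from quietly selecting the wrong algorithm.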

### **4. evaluator.py (no changes)**
- Already compatible ✅

---

## 📚 Documentation Files

1. **`doc/LDA_MIGRATION_GUIDE.md`** - Complete guide (480 lines)
2. **`doc/LDA_INTEGRATION_COMPLETE.md`** - Summary (280 lines)
3. **`doc/LDA_QUICK_REFERENCE.md`** - This file
4. **`test_lda_integration.py`** - Verification (230 lines)

---

## ✅ Verification Checklist

- [x] Config has `risk_discovery_method = "lda"`
- [x] LDARiskDiscovery class exists
- [x] Trainer uses dynamic method selection
- [x] All tests pass (4/4)
- [x] Documentation complete

---

## 🚀 Usage Examples

### **Basic Training:**
```bash
python3 train.py
```

### **With Custom Epochs:**
```bash
python3 train.py --epochs 10
```

### **Evaluate Model:**
```bash
python3 evaluate.py --checkpoint checkpoints/best_model.pt
```

### **Compare Methods:**
```bash
python3 compare_risk_discovery.py --advanced
```

---

## 🎯 Expected Output

### **LDA Discovery:**
```
🎯 Using LDA (Topic Modeling) for risk discovery
🔍 Discovering risk patterns using LDA (n_topics=7)...
   📊 LDA provides balanced, overlapping risk categories
   🎯 Best for legal text with multi-faceted risks
  📊 Creating document-term matrix...
  🧠 Fitting LDA model...
✅ LDA discovery complete: 7 risk topics found

🔍 Discovered Risk Patterns:
  • Topic_PARTY_AGREEMENT
    Keywords: party, agreement, shall, company, consent
  • Topic_INTELLECTUAL_PROPERTY
    Keywords: shall, product, products, agreement, section
  • Topic_COMPLIANCE
    Keywords: shall, agreement, laws, state, governed
  ...
```

---

## 🐛 Troubleshooting

### **Issue:** "LDA did not converge"
**Solution:** Increase iterations in `config.py`
```python
lda_max_iter: int = 30
```

### **Issue:** Topics too similar
**Solution:** Lower priors for sharper topics
```python
lda_doc_topic_prior: float = 0.05
lda_topic_word_prior: float = 0.005
```

### **Issue:** Slow training
**Solution:** Use online learning
```python
lda_learning_method: str = 'online'
```

### **Issue:** Want K-Means back
**Solution:** Change method in `config.py`
```python
risk_discovery_method: str = "kmeans"
```

---

## 💡 Key Insights

### **Why LDA Wins:**
1. **Balance:** 0.718 vs 0.481 for K-Means, about 49% higher
2. **Overlapping:** Clauses can belong to multiple topics
3. **Probabilities:** Confidence scores for each assignment
4. **Interpretability:** Clear topic themes for legal text
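
To illustrate points 2 and 3, the sketch below contrasts a soft (overlapping) assignment with a hard one for a single clause. The probability vector is made up for illustration, and the function names and the 0.25 threshold are assumptions, not values from the project.

```python
def topics_above(doc_topic_probs, threshold=0.25):
    """Soft assignment: every topic whose probability meets the threshold."""
    return [i for i, p in enumerate(doc_topic_probs) if p >= threshold]


def hard_assignment(doc_topic_probs):
    """Hard assignment: only the single most probable topic, as K-Means would give."""
    return max(range(len(doc_topic_probs)), key=lambda i: doc_topic_probs[i])


# Illustrative topic distribution for one clause over 7 topics (sums to 1).
clause_probs = [0.05, 0.45, 0.02, 0.38, 0.04, 0.03, 0.03]

overlapping = topics_above(clause_probs)
single = hard_assignment(clause_probs)

print("Topics above threshold:", overlapping)
print("Most probable topic:", single, "with confidence", clause_probs[single])
```

The soft view keeps the second-strongest topic that a hard assignment would discard, and the probability itself serves as the confidence score.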

### **When to Use Each:**

**Use LDA when:**
- ✅ Need balanced risk categories
- ✅ Clauses have multiple risk types
- ✅ Want probability distributions
- ✅ Need interpretable topics

**Use K-Means when:**
- Hard cluster assignments needed
- Speed is critical (K-Means is slightly faster)
- Simple, clear boundaries preferred

---

## 📊 Comparison Data

From `risk_discovery_comparison_report.txt`:

```
Method                Balance    Patterns
----------------------------------------------
LDA (NEW DEFAULT)     0.718      7 (1,146-3,426)
Risk-o-meter          0.577      7 (534-4,363)
K-Means (OLD)         0.481      7 (436-9,163)
Hierarchical          0.362      7 (91-10,483)
Spectral              0.292      7 (11-13,702)
Mini-Batch            0.291      7 (2-13,785)
DBSCAN                1.000      1 (13,396)
----------------------------------------------
```

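**Clear Winner:** LDA, with the highest balance among the methods that produced all 7 patterns. DBSCAN's 1.000 is not comparable because it grouped all 13,396 items into a single pattern.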

---

## 🎓 Learn More

### **LDA Theory:**
- Blei et al. (2003) - "Latent Dirichlet Allocation"
- Probabilistic topic modeling for text
- Documents = mixture of topics
- Topics = distribution over words

### **LDA for Legal:**
- Overlapping categories (clauses have multiple themes)
- Interpretable (topic-word distributions)
- Proven for contracts (literature validated)

### **Parameters:**
- **α (doc_topic_prior):** Controls document focus
  - Lower (0.01-0.1) = more focused
  - Higher (0.5-1.0) = more mixed
  
- **β (topic_word_prior):** Controls topic specificity
  - Lower (0.001-0.01) = sharper topics
  - Higher (0.1-0.5) = broader topics
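
The effect of α can be seen by sampling topic mixtures from a symmetric Dirichlet prior directly, without fitting a model. This is a standard-library sketch for intuition only; the helper names are hypothetical and nothing here reproduces the project's training code.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha).

    Uses the standard construction: normalize k independent Gamma(alpha, 1) draws.
    """
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0:  # guard against numeric underflow for very small alpha
            return [d / total for d in draws]


def mean_dominant_share(alpha, k=7, n_samples=2000, seed=0):
    """Average weight of the largest topic across sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples)) / n_samples


for alpha in (0.1, 1.0):
    share = mean_dominant_share(alpha)
    print(f"alpha={alpha}: dominant topic holds {share:.2f} of the mixture on average")
```

A lower α concentrates each sampled mixture on one or two topics, which is the "more focused" behavior described above; a higher α spreads weight across topics. The same reasoning applies to β, with words in place of topics.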

---

## ✨ Benefits Summary

### **For Users:**
✅ Better balanced risk categories  
✅ More interpretable topic names  
✅ Probability scores for confidence  
✅ Best balance among the 7-pattern methods compared  

### **For Developers:**
✅ Clean, compatible interface  
✅ Easy to switch methods  
✅ Well documented  
✅ Comprehensive tests  

### **For Models:**
✅ Better training data balance  
✅ Reduced class imbalance  
✅ Richer feature representation  
✅ Overlapping pattern support  

---

## 📞 Support

**Questions?** See:
- `doc/LDA_MIGRATION_GUIDE.md` - Full guide
- `doc/LDA_INTEGRATION_COMPLETE.md` - Summary
- `risk_discovery_comparison_report.txt` - Results

**Test:** `python3 test_lda_integration.py`

**Train:** `python3 train.py`

---

**Status:** ✅ **READY TO USE**  
**Default Method:** LDA  
**Backup Method:** K-Means (configurable)  
**Verified:** 4/4 tests passing  
**Documented:** Complete  

🎉 **Happy Training!**