# 📋 Quick Reference - LDA Risk Discovery
## 🎯 What Changed
```
OLD: K-Means Clustering (hardcoded in risk_discovery.py)
NEW: LDA Topic Modeling (configurable, default in config.py)
```
---
## ⚡ Quick Start
### **1. Verify Setup:**
```bash
python3 test_lda_integration.py
# Expected: 4/4 tests passed ✅
```
### **2. Train with LDA:**
```bash
python3 train.py
# Look for: "🎯 Using LDA (Topic Modeling) for risk discovery"
```
### **3. Check Results:**
```bash
# Review discovered topics in training output
# Topics will be named like: Topic_PARTY_AGREEMENT, Topic_INTELLECTUAL_PROPERTY
```
---
## 🎛️ Configuration
### **File:** `config.py`
```python
# Method Selection
risk_discovery_method: str = "lda" # Options: 'lda', 'kmeans'
# LDA Parameters
lda_doc_topic_prior: float = 0.1 # α (alpha)
lda_topic_word_prior: float = 0.01 # β (beta)
lda_max_iter: int = 20 # Iterations
lda_max_features: int = 5000 # Vocabulary
lda_learning_method: str = 'batch' # Algorithm
```
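The parameter names mirror arguments of scikit-learn's `CountVectorizer` and `LatentDirichletAllocation`. The sketch below assumes, without confirmation from this document, that the discovery step is built on those classes. The sample clauses and `n_components=2` are illustrative only.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy clauses for illustration only; real training uses the project's dataset.
clauses = [
    "The party shall obtain written consent before assigning this agreement.",
    "The company grants a license to the product and related intellectual property.",
    "This agreement shall be governed by the laws of the state.",
    "Each party shall comply with applicable laws and regulations.",
]

# Assumed mapping: lda_max_features caps the vocabulary of the document-term matrix.
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
doc_term_matrix = vectorizer.fit_transform(clauses)

lda = LatentDirichletAllocation(
    n_components=2,           # toy value; the training log reports n_topics=7
    doc_topic_prior=0.1,      # lda_doc_topic_prior (alpha)
    topic_word_prior=0.01,    # lda_topic_word_prior (beta)
    max_iter=20,              # lda_max_iter
    learning_method="batch",  # lda_learning_method
    random_state=0,           # fixed seed so the sketch is repeatable
)

# One row per clause; each row is that clause's distribution over topics.
doc_topics = lda.fit_transform(doc_term_matrix)
print(doc_topics.shape)
```

Each row of `doc_topics` sums to 1, which is what allows a clause to carry weight in more than one topic.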
---
## 🔄 Switch Methods
### **Use LDA (default):**
```python
risk_discovery_method: str = "lda"
```
### **Use K-Means (old method):**
```python
risk_discovery_method: str = "kmeans"
```
---
## 📊 Performance Comparison
| Method | Balance | Pattern size range | Overlapping |
|--------|---------|--------------------|-------------|
| **LDA** | **0.718** | 1,146-3,426 | ✅ Yes |
| K-Means | 0.481 | 436-9,163 | ❌ No |

**Winner:** LDA (balance score 49% higher than K-Means)
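The 49% figure is the relative improvement in balance score, as the arithmetic below shows. This document does not define how the balance score itself is computed, so only the reported values are used.

```python
lda_balance = 0.718
kmeans_balance = 0.481

# Relative improvement of LDA over K-Means, in percent.
improvement = (lda_balance - kmeans_balance) / kmeans_balance * 100
print(f"{improvement:.1f}%")  # 49.3%
```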
---
## 🛠️ Tuning Guide
### **More Focused Topics:**
```python
lda_doc_topic_prior: float = 0.05 # Lower = more focused
lda_topic_word_prior: float = 0.005 # Lower = sharper
```
### **Better Convergence:**
```python
lda_max_iter: int = 30 # More iterations
```
### **Faster Training (large datasets):**
```python
lda_learning_method: str = 'online' # vs 'batch'
```
---
## 🔍 Code Changes Summary
### **1. config.py (8 lines added)**
- Added `risk_discovery_method = "lda"`
- Added 5 LDA-specific parameters
### **2. risk_discovery.py (140 lines added)**
- Added `LDARiskDiscovery` class
- Wraps `TopicModelingRiskDiscovery`
- Compatible interface with existing code
### **3. trainer.py (25 lines modified)**
- Added import: `LDARiskDiscovery`
- Added method selection logic
- Instantiates LDA or K-Means based on config
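A minimal sketch of what such selection logic might look like. Only the `LDARiskDiscovery` name and the `risk_discovery_method` values come from this document; `KMeansRiskDiscovery`, `build_risk_discovery`, and the empty class bodies are hypothetical stand-ins, not the project's code.

```python
class LDARiskDiscovery:
    """Stub standing in for the class added to risk_discovery.py."""


class KMeansRiskDiscovery:
    """Hypothetical name for the original K-Means implementation."""


# Maps each supported config value to the class that implements it.
DISCOVERY_METHODS = {
    "lda": LDARiskDiscovery,
    "kmeans": KMeansRiskDiscovery,
}


def build_risk_discovery(method: str = "lda"):
    """Instantiate the discovery class named by risk_discovery_method."""
    if method not in DISCOVERY_METHODS:
        options = ", ".join(sorted(DISCOVERY_METHODS))
        raise ValueError(
            f"Unknown risk_discovery_method {method!r}; expected one of: {options}"
        )
    return DISCOVERY_METHODS[method]()


discovery = build_risk_discovery("lda")
print(type(discovery).__name__)  # LDARiskDiscovery
```

Rejecting unknown values up front means a typo in `config.py` fails immediately rather than silently falling back to one method.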
### **4. evaluator.py (no changes)**
- Already compatible ✅
---
## 📚 Documentation Files
1. **`doc/LDA_MIGRATION_GUIDE.md`** - Complete guide (480 lines)
2. **`doc/LDA_INTEGRATION_COMPLETE.md`** - Summary (280 lines)
3. **`doc/LDA_QUICK_REFERENCE.md`** - This file
4. **`test_lda_integration.py`** - Verification (230 lines)
---
## ✅ Verification Checklist
- [x] Config has `risk_discovery_method = "lda"`
- [x] LDARiskDiscovery class exists
- [x] Trainer uses dynamic method selection
- [x] All tests pass (4/4)
- [x] Documentation complete
---
## 🚀 Usage Examples
### **Basic Training:**
```bash
python3 train.py
```
### **With Custom Epochs:**
```bash
python3 train.py --epochs 10
```
### **Evaluate Model:**
```bash
python3 evaluate.py --checkpoint checkpoints/best_model.pt
```
### **Compare Methods:**
```bash
python3 compare_risk_discovery.py --advanced
```
---
## 🎯 Expected Output
### **LDA Discovery:**
```
🎯 Using LDA (Topic Modeling) for risk discovery
🔍 Discovering risk patterns using LDA (n_topics=7)...
📊 LDA provides balanced, overlapping risk categories
🎯 Best for legal text with multi-faceted risks
📊 Creating document-term matrix...
🧠 Fitting LDA model...
✅ LDA discovery complete: 7 risk topics found
🔍 Discovered Risk Patterns:
• Topic_PARTY_AGREEMENT
Keywords: party, agreement, shall, company, consent
• Topic_INTELLECTUAL_PROPERTY
Keywords: shall, product, products, agreement, section
• Topic_COMPLIANCE
Keywords: shall, agreement, laws, state, governed
...
```
---
## 🐛 Troubleshooting
### **Issue:** "LDA did not converge"
**Solution:** Increase iterations in `config.py`
```python
lda_max_iter: int = 30
```
### **Issue:** Topics too similar
**Solution:** Lower priors for sharper topics
```python
lda_doc_topic_prior: float = 0.05
lda_topic_word_prior: float = 0.005
```
### **Issue:** Slow training
**Solution:** Use online learning
```python
lda_learning_method: str = 'online'
```
### **Issue:** Want K-Means back
**Solution:** Change method in `config.py`
```python
risk_discovery_method: str = "kmeans"
```
---
## 💡 Key Insights
### **Why LDA Wins:**
1. **Balance:** 0.718 vs. 0.481 for K-Means, a 49% relative improvement
2. **Overlapping:** Clauses can belong to multiple topics
3. **Probabilities:** Confidence scores for each assignment
4. **Interpretability:** Clear topic themes for legal text
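To illustrate points 2 and 3: a topic model gives each clause a probability distribution over topics, which can be read either as one hard label or as several overlapping ones. The probabilities, the 0.25 threshold, and the helper name below are made up for illustration; only the topic names come from this document's sample output.

```python
# Hypothetical topic distribution for a single clause (sums to 1).
clause_topics = {
    "Topic_PARTY_AGREEMENT": 0.55,
    "Topic_COMPLIANCE": 0.35,
    "Topic_INTELLECTUAL_PROPERTY": 0.10,
}


def overlapping_labels(distribution, threshold=0.25):
    """Return every topic whose probability meets the threshold, highest first."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return [topic for topic, prob in ranked if prob >= threshold]


# Hard assignment: the single most probable topic, as a clustering label would give.
hard_label = max(clause_topics, key=clause_topics.get)

# Soft assignment: every topic above the (illustrative) threshold.
soft_labels = overlapping_labels(clause_topics)

print(hard_label)   # Topic_PARTY_AGREEMENT
print(soft_labels)  # ['Topic_PARTY_AGREEMENT', 'Topic_COMPLIANCE']
```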
### **When to Use Each:**
**Use LDA when:**
- ✅ Need balanced risk categories
- ✅ Clauses have multiple risk types
- ✅ Want probability distributions
- ✅ Need interpretable topics
**Use K-Means when:**
- Hard cluster assignments needed
- Speed is critical (slightly faster)
- Simple, clear boundaries preferred
---
## 📊 Comparison Data
From `risk_discovery_comparison_report.txt`:
```
Method               Balance   Patterns (size range)
----------------------------------------------------
LDA (NEW DEFAULT)    0.718     7 (1,146-3,426)
Risk-o-meter         0.577     7 (534-4,363)
K-Means (OLD)        0.481     7 (436-9,163)
Hierarchical         0.362     7 (91-10,483)
Spectral             0.292     7 (11-13,702)
Mini-Batch           0.291     7 (2-13,785)
DBSCAN               1.000     1 (13,396)
----------------------------------------------------
```
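**Clear Winner:** LDA, which has the highest balance score among the methods that found all 7 patterns. DBSCAN's 1.000 is trivial because it grouped all 13,396 items into a single pattern.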
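The balance scores can be cross-checked against the pattern-size ranges in the report. The sketch below computes the largest-to-smallest pattern ratio for each 7-pattern method; this ratio is an illustrative measure, not the report's balance score, whose formula is not given here.

```python
# (smallest, largest) pattern sizes copied from the comparison report.
size_ranges = {
    "LDA": (1146, 3426),
    "Risk-o-meter": (534, 4363),
    "K-Means": (436, 9163),
    "Hierarchical": (91, 10483),
    "Spectral": (11, 13702),
    "Mini-Batch": (2, 13785),
}

# Ratio of largest to smallest pattern; closer to 1 means more even sizes.
ratios = {
    name: largest / smallest
    for name, (smallest, largest) in size_ranges.items()
}

# Print methods from most even to least even.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:<14}{ratio:>10.1f}")
```

For these six methods the ratio ordering matches the balance-score ordering: LDA's largest pattern is about 3 times its smallest, while K-Means' is about 21 times.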
---
## 🎓 Learn More
### **LDA Theory:**
- Blei et al. (2003) - "Latent Dirichlet Allocation"
- Probabilistic topic modeling for text
- Documents = mixture of topics
- Topics = distribution over words
### **LDA for Legal:**
- Overlapping categories (clauses have multiple themes)
- Interpretable (topic-word distributions)
- Used for contract and legal text analysis in prior literature
### **Parameters:**
- **α (doc_topic_prior):** Controls document focus
  - Lower (0.01-0.1) = more focused
  - Higher (0.5-1.0) = more mixed
- **β (topic_word_prior):** Controls topic specificity
  - Lower (0.001-0.01) = sharper topics
  - Higher (0.1-0.5) = broader topics
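The effect of α can be seen by sampling topic mixtures directly from a symmetric Dirichlet prior. This standard-library sketch is independent of the project code; the helper names and sample count are illustrative. β acts the same way on each topic's distribution over words.

```python
import random


def sample_dirichlet(alpha, n_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws) or 1.0  # guard against every draw underflowing to zero
    return [d / total for d in draws]


def mean_top_weight(alpha, n_topics=7, n_samples=2000, seed=0):
    """Average share held by the dominant topic across sampled mixtures."""
    rng = random.Random(seed)
    top_weights = (
        max(sample_dirichlet(alpha, n_topics, rng)) for _ in range(n_samples)
    )
    return sum(top_weights) / n_samples


for alpha in (0.05, 0.1, 1.0):
    share = mean_top_weight(alpha)
    print(f"alpha={alpha}: dominant topic holds {share:.2f} of the mixture on average")
```

Lower α concentrates each sampled mixture on fewer topics, which is the "more focused" behavior described above.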
---
## ✨ Benefits Summary
### **For Users:**
- ✅ Better balanced risk categories
- ✅ More interpretable topic names
- ✅ Probability scores for confidence
- ✅ Highest balance score among the 7-pattern methods compared
### **For Developers:**
- ✅ Clean, compatible interface
- ✅ Easy to switch methods
- ✅ Well documented
- ✅ Comprehensive tests
### **For Models:**
- ✅ Better training data balance
- ✅ Reduced class imbalance (pattern sizes 1,146-3,426 vs. 436-9,163 for K-Means)
- ✅ Richer feature representation
- ✅ Overlapping pattern support
---
## 📞 Support
**Questions?** See:
- `doc/LDA_MIGRATION_GUIDE.md` - Full guide
- `doc/LDA_INTEGRATION_COMPLETE.md` - Summary
- `risk_discovery_comparison_report.txt` - Results
**Test:** `python3 test_lda_integration.py`
**Train:** `python3 train.py`
---
**Status:** ✅ **READY TO USE**
**Default Method:** LDA
**Backup Method:** K-Means (configurable)
**Verified:** 4/4 tests passing
**Documented:** Complete
🎉 **Happy Training!**