	# 📋 Quick Reference - LDA Risk Discovery

	## 🎯 What Changed

	```
	OLD: K-Means Clustering (hardcoded in risk_discovery.py)
	NEW: LDA Topic Modeling (configurable, default in config.py)
	```

	---

	## ⚡ Quick Start

	### 1. Verify Setup:
	```bash
	python3 test_lda_integration.py
	# Expected: 4/4 tests passed ✅
	```

	### 2. Train with LDA:
	```bash
	python3 train.py
	# Look for: "🎯 Using LDA (Topic Modeling) for risk discovery"
	```

	### 3. Check Results:
	```bash
	# Review discovered topics in training output
	# Topics will be named like: Topic_PARTY_AGREEMENT, Topic_INTELLECTUAL_PROPERTY
	```

	---

	## 🎛️ Configuration

	### File: `config.py`

	```python
	# Method Selection
	risk_discovery_method: str = "lda" # Options: 'lda', 'kmeans'

	# LDA Parameters
	lda_doc_topic_prior: float = 0.1 # α (alpha)
	lda_topic_word_prior: float = 0.01 # β (beta)
	lda_max_iter: int = 20 # Iterations
	lda_max_features: int = 5000 # Vocabulary
	lda_learning_method: str = 'batch' # Algorithm
	```
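
To show how these fields might sit together, here is a minimal sketch of a validated config object. The class name `RiskDiscoveryConfig` and the validation rules are assumptions for illustration only; the actual `config.py` may be organized differently. The field names resemble the arguments of scikit-learn's LDA and vectorizer classes, but this guide does not confirm which library is used.

```python
from dataclasses import dataclass


# Hypothetical container: field names and defaults mirror the snippet above,
# but the class name and the checks below are illustrative assumptions.
@dataclass
class RiskDiscoveryConfig:
    risk_discovery_method: str = "lda"   # 'lda' or 'kmeans'
    lda_doc_topic_prior: float = 0.1     # alpha
    lda_topic_word_prior: float = 0.01   # beta
    lda_max_iter: int = 20               # iterations
    lda_max_features: int = 5000         # vocabulary size
    lda_learning_method: str = "batch"   # 'batch' or 'online'

    def __post_init__(self) -> None:
        # Fail early on typos instead of silently falling back to a default.
        if self.risk_discovery_method not in ("lda", "kmeans"):
            raise ValueError(
                f"unknown risk_discovery_method: {self.risk_discovery_method!r}"
            )
        if self.lda_learning_method not in ("batch", "online"):
            raise ValueError(
                f"unknown lda_learning_method: {self.lda_learning_method!r}"
            )
        if self.lda_doc_topic_prior <= 0 or self.lda_topic_word_prior <= 0:
            raise ValueError("Dirichlet priors must be positive")
        if self.lda_max_iter < 1 or self.lda_max_features < 1:
            raise ValueError("lda_max_iter and lda_max_features must be >= 1")


config = RiskDiscoveryConfig()
print(config.risk_discovery_method)  # lda
```

Validating in `__post_init__` means a misspelled method name fails at startup rather than partway through training.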

	---

	## 🔄 Switch Methods

	### Use LDA (default):
	```python
	risk_discovery_method: str = "lda"
	```

	### Use K-Means (old method):
	```python
	risk_discovery_method: str = "kmeans"
	```

	---

	## 📊 Performance Comparison

	| Method | Balance | Pattern size range | Overlapping |
	|--------|---------|--------------------|-------------|
	| LDA | 0.718 | 1,146-3,426 | ✅ Yes |
	| K-Means | 0.481 | 436-9,163 | ❌ No |

	Winner: LDA (balance score 0.718 vs 0.481, about 49% higher)
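
The percentage can be checked from the table values. The short calculation below also derives the largest-to-smallest pattern size ratio for each method, which is not stated in the table but follows directly from the ranges. The variable names are illustrative.

```python
# Figures copied from the comparison table.
lda_balance, kmeans_balance = 0.718, 0.481
lda_sizes = (1146, 3426)      # smallest and largest LDA topic
kmeans_sizes = (436, 9163)    # smallest and largest K-Means cluster

# Relative improvement of the balance score.
relative_gain = (lda_balance - kmeans_balance) / kmeans_balance

# How many times larger the biggest pattern is than the smallest.
lda_ratio = lda_sizes[1] / lda_sizes[0]
kmeans_ratio = kmeans_sizes[1] / kmeans_sizes[0]

print(f"Relative balance gain: {relative_gain:.1%}")       # 49.3%
print(f"LDA largest/smallest: {lda_ratio:.1f}x")           # 3.0x
print(f"K-Means largest/smallest: {kmeans_ratio:.1f}x")    # 21.0x
```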

	---

	## 🛠️ Tuning Guide

	### More Focused Topics:
	```python
	lda_doc_topic_prior: float = 0.05 # Lower = more focused
	lda_topic_word_prior: float = 0.005 # Lower = sharper
	```

	### Better Convergence:
	```python
	lda_max_iter: int = 30 # More iterations
	```

	### Faster Training (large datasets):
	```python
	lda_learning_method: str = 'online' # vs 'batch'
	```

	---

	## 🔍 Code Changes Summary

	### 1. config.py (8 lines added)
	- Added `risk_discovery_method = "lda"`
	- Added 5 LDA-specific parameters

	### 2. risk_discovery.py (140 lines added)
	- Added `LDARiskDiscovery` class
	- Wraps `TopicModelingRiskDiscovery`
	- Compatible interface with existing code

	### 3. trainer.py (25 lines modified)
	- Added import: `LDARiskDiscovery`
	- Added method selection logic
	- Instantiates LDA or K-Means based on config
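
The selection logic described above could look roughly like the sketch below. Both classes here are stand-in stubs: `LDARiskDiscovery` is the name used in this guide, while `KMeansRiskDiscoveryStub`, the function `build_risk_discovery`, and the constructor arguments are hypothetical and not taken from `trainer.py`.

```python
class LDARiskDiscovery:
    """Stand-in stub; the real class lives in risk_discovery.py."""

    def __init__(self, n_topics: int = 7) -> None:
        self.n_topics = n_topics


class KMeansRiskDiscoveryStub:
    """Hypothetical placeholder for the existing K-Means implementation."""

    def __init__(self, n_clusters: int = 7) -> None:
        self.n_clusters = n_clusters


def build_risk_discovery(method: str, n_patterns: int = 7):
    """Return a discovery object for the configured method name."""
    if method == "lda":
        return LDARiskDiscovery(n_topics=n_patterns)
    if method == "kmeans":
        return KMeansRiskDiscoveryStub(n_clusters=n_patterns)
    # Reject unknown names explicitly rather than guessing a fallback.
    raise ValueError(f"unknown risk_discovery_method: {method!r}")


discovery = build_risk_discovery("lda")
print(type(discovery).__name__)  # LDARiskDiscovery
```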

	### 4. evaluator.py (no changes)
	- Already compatible ✅

	---

	## 📚 Documentation Files

	1. `doc/LDA_MIGRATION_GUIDE.md` - Complete guide (480 lines)
	2. `doc/LDA_INTEGRATION_COMPLETE.md` - Summary (280 lines)
	3. `doc/LDA_QUICK_REFERENCE.md` - This file
	4. `test_lda_integration.py` - Verification (230 lines)

	---

	## ✅ Verification Checklist

	- [x] Config has `risk_discovery_method = "lda"`
	- [x] LDARiskDiscovery class exists
	- [x] Trainer uses dynamic method selection
	- [x] All tests pass (4/4)
	- [x] Documentation complete

	---

	## 🚀 Usage Examples

	### Basic Training:
	```bash
	python3 train.py
	```

	### With Custom Epochs:
	```bash
	python3 train.py --epochs 10
	```

	### Evaluate Model:
	```bash
	python3 evaluate.py --checkpoint checkpoints/best_model.pt
	```

	### Compare Methods:
	```bash
	python3 compare_risk_discovery.py --advanced
	```

	---

	## 🎯 Expected Output

	### LDA Discovery:
	```
	🎯 Using LDA (Topic Modeling) for risk discovery
	🔍 Discovering risk patterns using LDA (n_topics=7)...
	📊 LDA provides balanced, overlapping risk categories
	🎯 Best for legal text with multi-faceted risks
	📊 Creating document-term matrix...
	🧠 Fitting LDA model...
	✅ LDA discovery complete: 7 risk topics found

	🔍 Discovered Risk Patterns:
	• Topic_PARTY_AGREEMENT
	Keywords: party, agreement, shall, company, consent
	• Topic_INTELLECTUAL_PROPERTY
	Keywords: shall, product, products, agreement, section
	• Topic_COMPLIANCE
	Keywords: shall, agreement, laws, state, governed
	...
	```

	---

	## 🐛 Troubleshooting

	### Issue: "LDA did not converge"
	Solution: Increase iterations in `config.py`
	```python
	lda_max_iter: int = 30
	```

	### Issue: Topics too similar
	Solution: Lower priors for sharper topics
	```python
	lda_doc_topic_prior: float = 0.05
	lda_topic_word_prior: float = 0.005
	```

	### Issue: Slow training
	Solution: Use online learning
	```python
	lda_learning_method: str = 'online'
	```

	### Issue: Want K-Means back
	Solution: Change method in `config.py`
	```python
	risk_discovery_method: str = "kmeans"
	```

	---

	## 💡 Key Insights

	### Why LDA Wins:
	1. Balance: 0.718 vs 0.481 for K-Means (about 49% higher)
	2. Overlapping: Clauses can belong to multiple topics
	3. Probabilities: Confidence scores for each assignment
	4. Interpretability: Clear topic themes for legal text
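
To make the difference between overlapping and hard assignments concrete, the sketch below takes one clause's topic probabilities and derives both a single label and a multi-label set. The probabilities, the 0.25 threshold, and the function names are invented for illustration; only the topic names come from this guide.

```python
from typing import Dict, List

# Hypothetical topic mixture for a single clause (values are made up).
topic_probs: Dict[str, float] = {
    "Topic_PARTY_AGREEMENT": 0.55,
    "Topic_INTELLECTUAL_PROPERTY": 0.30,
    "Topic_COMPLIANCE": 0.15,
}


def hard_label(probs: Dict[str, float]) -> str:
    """K-Means-style view: keep only the single most likely topic."""
    return max(probs, key=probs.get)


def soft_labels(probs: Dict[str, float], threshold: float = 0.25) -> List[str]:
    """Overlapping view: keep every topic at or above the threshold,
    ordered from most to least likely."""
    kept = [name for name, p in probs.items() if p >= threshold]
    return sorted(kept, key=probs.get, reverse=True)


print(hard_label(topic_probs))
# Topic_PARTY_AGREEMENT
print(soft_labels(topic_probs))
# ['Topic_PARTY_AGREEMENT', 'Topic_INTELLECTUAL_PROPERTY']
```

The probability attached to each kept topic doubles as the confidence score mentioned in point 3.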

	### When to Use Each:

	Use LDA when:
	- ✅ Need balanced risk categories
	- ✅ Clauses have multiple risk types
	- ✅ Want probability distributions
	- ✅ Need interpretable topics

	Use K-Means when:
	- Hard cluster assignments needed
	- Speed is critical (slightly faster)
	- Simple, clear boundaries preferred

	---

	## 📊 Comparison Data

	From `risk_discovery_comparison_report.txt`:

	```
	Method              Balance   Patterns
	----------------------------------------------
	LDA (NEW DEFAULT)   0.718     7 (1,146-3,426)
	Risk-o-meter        0.577     7 (534-4,363)
	K-Means (OLD)       0.481     7 (436-9,163)
	Hierarchical        0.362     7 (91-10,483)
	Spectral            0.292     7 (11-13,702)
	Mini-Batch          0.291     7 (2-13,785)
	DBSCAN              1.000     1 (13,396)
	----------------------------------------------
	```

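	Clear Winner: LDA. DBSCAN's 1.000 is a degenerate case: it produced a single pattern of 13,396 items, so its score does not reflect a useful split.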
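
The ranking can be reproduced by parsing the rows and ignoring any method that found only one pattern. This is a sketch that assumes the report rows follow the layout shown above; the regular expression and helper names are illustrative, not part of the repository.

```python
import re

# Rows copied from the report excerpt above.
REPORT = """\
LDA (NEW DEFAULT) 0.718 7 (1,146-3,426)
Risk-o-meter 0.577 7 (534-4,363)
K-Means (OLD) 0.481 7 (436-9,163)
Hierarchical 0.362 7 (91-10,483)
Spectral 0.292 7 (11-13,702)
Mini-Batch 0.291 7 (2-13,785)
DBSCAN 1.000 1 (13,396)
"""

# method name, balance score, pattern count, size or size range in parentheses
ROW = re.compile(
    r"^(?P<method>.+?)\s+(?P<balance>\d\.\d{3})\s+(?P<patterns>\d+)\s+\((?P<sizes>[\d,\-]+)\)$"
)


def parse_rows(text: str) -> list:
    """Turn report lines into dicts, skipping lines that do not match."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue
        rows.append({
            "method": match["method"],
            "balance": float(match["balance"]),
            "patterns": int(match["patterns"]),
        })
    return rows


rows = parse_rows(REPORT)
# A single pattern is trivially "balanced", so exclude it from the ranking.
multi_pattern = [r for r in rows if r["patterns"] > 1]
best = max(multi_pattern, key=lambda r: r["balance"])
print(best["method"])  # LDA (NEW DEFAULT)
```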

	---

	## 🎓 Learn More

	### LDA Theory:
	- Blei et al. (2003) - "Latent Dirichlet Allocation"
	- Probabilistic topic modeling for text
	- Documents = mixture of topics
	- Topics = distribution over words

	### LDA for Legal:
	- Overlapping categories (clauses have multiple themes)
	- Interpretable (topic-word distributions)
	- Previously applied to contract text in the topic-modeling literature

	### Parameters:
	- α (doc_topic_prior): Controls document focus
	  - Lower (0.01-0.1) = more focused
	  - Higher (0.5-1.0) = more mixed
	- β (topic_word_prior): Controls topic specificity
	  - Lower (0.001-0.01) = sharper topics
	  - Higher (0.1-0.5) = broader topics
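
The effect of the prior can be seen without fitting a model. The sketch below samples topic mixtures from a symmetric Dirichlet distribution (built from gamma draws, using only the standard library) and reports how much weight the dominant topic receives on average. The helper names, the trial count, and the choice of 7 topics are assumptions for illustration; this is not the training code.

```python
import random


def sample_dirichlet(alpha: float, k: int, rng: random.Random) -> list:
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics
    by normalizing k independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        # Guard against numerical underflow for very small alpha.
        return [1.0 / k] * k
    return [d / total for d in draws]


def mean_largest_share(alpha: float, k: int = 7, trials: int = 2000,
                       seed: int = 0) -> float:
    """Average weight of the dominant topic across many sampled mixtures."""
    rng = random.Random(seed)
    total = sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(trials))
    return total / trials


focused = mean_largest_share(0.05)  # low alpha: one topic dominates
mixed = mean_largest_share(1.0)     # high alpha: weight spreads out
print(f"alpha=0.05 -> dominant topic share {focused:.2f}")
print(f"alpha=1.0  -> dominant topic share {mixed:.2f}")
```

A lower α concentrates each document on fewer topics. The same reasoning applies to β, where the distribution is over vocabulary words instead of topics.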

	---

	## ✨ Benefits Summary

	### For Users:
	- ✅ Better balanced risk categories
	- ✅ More interpretable topic names
	- ✅ Probability scores for confidence
	- ✅ Highest balance score among the multi-pattern methods compared

	### For Developers:
	- ✅ Clean, compatible interface
	- ✅ Easy to switch methods
	- ✅ Well documented
	- ✅ Comprehensive tests

	### For Models:
	- ✅ Better training data balance
	- ✅ Reduced class imbalance (largest topic about 3x the smallest, vs about 21x for K-Means)
	- ✅ Richer feature representation
	- ✅ Overlapping pattern support

	---

	## 📞 Support

	Questions? See:
	- `doc/LDA_MIGRATION_GUIDE.md` - Full guide
	- `doc/LDA_INTEGRATION_COMPLETE.md` - Summary
	- `risk_discovery_comparison_report.txt` - Results

	Test: `python3 test_lda_integration.py`

	Train: `python3 train.py`

	---

	- Status: ✅ READY TO USE
	- Default Method: LDA
	- Backup Method: K-Means (configurable)
	- Verified: 4/4 tests passing
	- Documented: Complete

	🎉 Happy Training!