Quick Reference - LDA Risk Discovery
What Changed
OLD: K-Means Clustering (hardcoded in risk_discovery.py)
NEW: LDA Topic Modeling (configurable, default in config.py)
Quick Start
1. Verify Setup:
python3 test_lda_integration.py
# Expected: 4/4 tests passed
2. Train with LDA:
python3 train.py
# Look for: "Using LDA (Topic Modeling) for risk discovery"
3. Check Results:
# Review discovered topics in training output
# Topics will be named like: Topic_PARTY_AGREEMENT, Topic_INTELLECTUAL_PROPERTY
Configuration
File: config.py
# Method Selection
risk_discovery_method: str = "lda" # Options: 'lda', 'kmeans'
# LDA Parameters
lda_doc_topic_prior: float = 0.1 # α (alpha): document-topic prior
lda_topic_word_prior: float = 0.01 # β (beta): topic-word prior
lda_max_iter: int = 20 # Maximum training iterations
lda_max_features: int = 5000 # Maximum vocabulary size
lda_learning_method: str = 'batch' # 'batch' or 'online'
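The LDA parameter names above match the constructor arguments of scikit-learn's LatentDirichletAllocation, which suggests (the source does not confirm it) that the values are passed straight through. A minimal sketch under that assumption; the sample clauses, the use of CountVectorizer, and the fixed random_state are illustrative and not taken from the project:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative clauses only; the real pipeline loads contract text elsewhere.
clauses = [
    "The parties agree that this agreement is governed by state law.",
    "The company shall retain all intellectual property in the product.",
    "Either party may terminate this agreement with written consent.",
    "The licensee shall comply with all applicable laws and regulations.",
]

# Vocabulary cap mirrors lda_max_features; stop-word removal is an assumption.
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
doc_term = vectorizer.fit_transform(clauses)

lda = LatentDirichletAllocation(
    n_components=7,           # n_topics=7 in the training log
    doc_topic_prior=0.1,      # lda_doc_topic_prior (alpha)
    topic_word_prior=0.01,    # lda_topic_word_prior (beta)
    max_iter=20,              # lda_max_iter
    learning_method="batch",  # lda_learning_method
    random_state=0,           # assumption: fixed seed for repeatable output
)

# Each row is one clause's probability distribution over the 7 topics.
doc_topic = lda.fit_transform(doc_term)
print(doc_topic.shape)
```

Each row of doc_topic sums to 1, which is what allows a clause to carry weight in several topics at once.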
Switch Methods
Use LDA (default):
risk_discovery_method: str = "lda"
Use K-Means (old method):
risk_discovery_method: str = "kmeans"
Performance Comparison
| Method | Balance | Pattern size range | Overlapping |
|---|---|---|---|
| LDA | 0.718 | 1,146-3,426 | Yes |
| K-Means | 0.481 | 436-9,163 | No |
Winner: LDA (balance 0.718 vs. 0.481, about 49% higher)
Tuning Guide
More Focused Topics:
lda_doc_topic_prior: float = 0.05 # Lower = more focused
lda_topic_word_prior: float = 0.005 # Lower = sharper
Better Convergence:
lda_max_iter: int = 30 # More iterations
Faster Training (large datasets):
lda_learning_method: str = 'online' # vs 'batch'
Code Changes Summary
1. config.py (8 lines added)
- Added risk_discovery_method = "lda"
- Added 5 LDA-specific parameters
2. risk_discovery.py (140 lines added)
- Added LDARiskDiscovery class
- Wraps TopicModelingRiskDiscovery
- Compatible interface with existing code
3. trainer.py (25 lines modified)
- Added import: LDARiskDiscovery
- Added method selection logic
- Instantiates LDA or K-Means based on config
4. evaluator.py (no changes)
- Already compatible
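The method selection logic in trainer.py is not reproduced in this document. The sketch below shows one plausible shape for it: the Config dataclass, the build_risk_discovery helper, and the KMeansRiskDiscovery class name are hypothetical stand-ins, and only LDARiskDiscovery and risk_discovery_method come from the summary above.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Stand-in for config.py; only the relevant field is shown."""
    risk_discovery_method: str = "lda"  # Options: 'lda', 'kmeans'


class LDARiskDiscovery:
    """Placeholder for the class added to risk_discovery.py."""


class KMeansRiskDiscovery:
    """Hypothetical name for the existing K-Means implementation."""


def build_risk_discovery(config: Config):
    """Return the discovery object selected by config.risk_discovery_method."""
    method = config.risk_discovery_method.lower()
    if method == "lda":
        return LDARiskDiscovery()
    if method == "kmeans":
        return KMeansRiskDiscovery()
    # Fail loudly on typos instead of silently falling back to a default.
    raise ValueError(f"Unknown risk_discovery_method: {method!r}")


discovery = build_risk_discovery(Config())
print(type(discovery).__name__)  # LDARiskDiscovery
```

Raising on an unknown value is a design choice in this sketch; the real trainer may fall back to a default instead.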
Documentation Files
- doc/LDA_MIGRATION_GUIDE.md - Complete guide (480 lines)
- doc/LDA_INTEGRATION_COMPLETE.md - Summary (280 lines)
- doc/LDA_QUICK_REFERENCE.md - This file
- test_lda_integration.py - Verification (230 lines)
Verification Checklist
- Config has risk_discovery_method = "lda"
- LDARiskDiscovery class exists
- Trainer uses dynamic method selection
- All tests pass (4/4)
- Documentation complete
Usage Examples
Basic Training:
python3 train.py
With Custom Epochs:
python3 train.py --epochs 10
Evaluate Model:
python3 evaluate.py --checkpoint checkpoints/best_model.pt
Compare Methods:
python3 compare_risk_discovery.py --advanced
Expected Output
LDA Discovery:
Using LDA (Topic Modeling) for risk discovery
Discovering risk patterns using LDA (n_topics=7)...
LDA provides balanced, overlapping risk categories
Best for legal text with multi-faceted risks
Creating document-term matrix...
Fitting LDA model...
LDA discovery complete: 7 risk topics found
Discovered Risk Patterns:
• Topic_PARTY_AGREEMENT
Keywords: party, agreement, shall, company, consent
• Topic_INTELLECTUAL_PROPERTY
Keywords: shall, product, products, agreement, section
• Topic_COMPLIANCE
Keywords: shall, agreement, laws, state, governed
...
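Keyword lists like the ones above are typically obtained by ranking each topic's word weights. How the project extracts keywords and derives topic names is not shown here, so the following is only a sketch of the usual ranking step, with a hypothetical helper (top_keywords) and made-up weights:

```python
def top_keywords(topic_word_weights, vocabulary, k=5):
    """Return the k highest-weighted words for each topic, best first."""
    keywords = []
    for weights in topic_word_weights:
        # Sort vocabulary indices by this topic's weight, largest first.
        ranked = sorted(range(len(vocabulary)), key=lambda i: weights[i], reverse=True)
        keywords.append([vocabulary[i] for i in ranked[:k]])
    return keywords


# Made-up weights for illustration; real values come from the fitted model.
vocabulary = ["party", "agreement", "shall", "laws", "product", "state"]
topic_word_weights = [
    [9.0, 7.5, 3.0, 0.2, 0.1, 0.4],  # topic 0
    [0.3, 2.0, 6.0, 8.5, 0.2, 5.0],  # topic 1
]

for topic_id, words in enumerate(top_keywords(topic_word_weights, vocabulary, k=3)):
    print(f"Topic {topic_id}: {', '.join(words)}")
```

Generic words such as "shall" and "agreement" appear in several of the topics above, so ranking by raw weight alone does not guarantee distinctive keywords.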
Troubleshooting
Issue: "LDA did not converge"
Solution: Increase iterations in config.py
lda_max_iter: int = 30
Issue: Topics too similar
Solution: Lower priors for sharper topics
lda_doc_topic_prior: float = 0.05
lda_topic_word_prior: float = 0.005
Issue: Slow training
Solution: Use online learning
lda_learning_method: str = 'online'
Issue: Want K-Means back
Solution: Change method in config.py
risk_discovery_method: str = "kmeans"
Key Insights
Why LDA Wins:
- Balance: 0.718 vs 0.481 (K-Means) - 49% better
- Overlapping: Clauses can belong to multiple topics
- Probabilities: Confidence scores for each assignment
- Interpretability: Clear topic themes for legal text
When to Use Each:
Use LDA when:
- Need balanced risk categories
- Clauses have multiple risk types
- Want probability distributions
- Need interpretable topics
Use K-Means when:
- Hard cluster assignments needed
- Speed is critical (slightly faster)
- Simple, clear boundaries preferred
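The difference between overlapping and hard assignments can be sketched as follows. The topic distribution, the 0.25 threshold, and both helper names are made up for illustration; the project's actual assignment rule is not described in this document.

```python
def hard_assignment(topic_probs):
    """Single-label result: the index of the most likely topic."""
    return max(range(len(topic_probs)), key=lambda i: topic_probs[i])


def overlapping_assignment(topic_probs, threshold=0.25):
    """Multi-label result: every topic whose probability meets the threshold."""
    return [i for i, p in enumerate(topic_probs) if p >= threshold]


# Made-up topic distribution for one clause over 7 topics (sums to 1.0).
clause_topics = [0.05, 0.45, 0.02, 0.38, 0.04, 0.03, 0.03]

print(hard_assignment(clause_topics))         # 1
print(overlapping_assignment(clause_topics))  # [1, 3]
```

A hard assignment discards the second theme (topic 3) even though it carries 38% of the probability mass, which is the information the overlapping view keeps.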
Comparison Data
From risk_discovery_comparison_report.txt:
| Method | Balance | Patterns (size range) |
|---|---|---|
| LDA (NEW DEFAULT) | 0.718 | 7 (1,146-3,426) |
| Risk-o-meter | 0.577 | 7 (534-4,363) |
| K-Means (OLD) | 0.481 | 7 (436-9,163) |
| Hierarchical | 0.362 | 7 (91-10,483) |
| Spectral | 0.292 | 7 (11-13,702) |
| Mini-Batch | 0.291 | 7 (2-13,785) |
| DBSCAN | 1.000 | 1 (13,396) |
Clear Winner: LDA. DBSCAN's balance of 1.000 is trivial because it produced only a single pattern, so it does not separate risk categories at all.
Learn More
LDA Theory:
- Blei et al. (2003) - "Latent Dirichlet Allocation"
- Probabilistic topic modeling for text
- Documents = mixture of topics
- Topics = distribution over words
LDA for Legal:
- Overlapping categories (clauses have multiple themes)
- Interpretable (topic-word distributions)
- Proven for contracts (literature validated)
Parameters:
α (doc_topic_prior): Controls document focus
- Lower (0.01-0.1) = more focused
- Higher (0.5-1.0) = more mixed
β (topic_word_prior): Controls topic specificity
- Lower (0.001-0.01) = sharper topics
- Higher (0.1-0.5) = broader topics
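The effect of α can be seen by sampling topic mixtures directly from a symmetric Dirichlet prior, independent of any LDA library. This is a standalone sketch using only the standard library; the helper names and sample counts are arbitrary choices, not project code.

```python
import random


def sample_dirichlet(alpha, n_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_dominant_share(alpha, n_topics=7, n_samples=2000, seed=0):
    """Average probability mass held by the largest topic in each sample."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(alpha, n_topics, rng)) for _ in range(n_samples)]
    return sum(shares) / n_samples


focused = mean_dominant_share(alpha=0.05)  # low prior: one topic dominates
mixed = mean_dominant_share(alpha=1.0)     # high prior: mass is spread out
print(f"alpha=0.05 -> {focused:.2f}, alpha=1.0 -> {mixed:.2f}")
```

The same reasoning applies to β, with words in a topic taking the place of topics in a document: a lower value concentrates each topic on fewer words.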
Benefits Summary
For Users:
- Better balanced risk categories
- More interpretable topic names
- Probability scores for confidence
- Highest balance score among the multi-pattern methods compared
For Developers:
- Clean, compatible interface
- Easy to switch methods
- Well documented
- Comprehensive tests
For Models:
- Better training data balance
- Reduced class imbalance (pattern sizes 1,146-3,426 vs. 436-9,163 for K-Means)
- Richer feature representation
- Overlapping pattern support
Support
Questions? See:
- doc/LDA_MIGRATION_GUIDE.md - Full guide
- doc/LDA_INTEGRATION_COMPLETE.md - Summary
- risk_discovery_comparison_report.txt - Results
Test: python3 test_lda_integration.py
Train: python3 train.py
Status: READY TO USE
Default Method: LDA
Backup Method: K-Means (configurable)
Verified: 4/4 tests passing
Documented: Complete
Happy Training!