# Quick Reference - LDA Risk Discovery

## What Changed

```
OLD: K-Means Clustering (hardcoded in risk_discovery.py)
NEW: LDA Topic Modeling (configurable, default in config.py)
```

---

## Quick Start

### **1. Verify Setup:**

```bash
python3 test_lda_integration.py
# Expected: 4/4 tests passed
```

### **2. Train with LDA:**

```bash
python3 train.py
# Look for: "Using LDA (Topic Modeling) for risk discovery"
```

### **3. Check Results:**

```bash
# Review discovered topics in training output
# Topics will be named like: Topic_PARTY_AGREEMENT, Topic_INTELLECTUAL_PROPERTY
```

---

## Configuration

### **File:** `config.py`

```python
# Method Selection
risk_discovery_method: str = "lda"   # Options: 'lda', 'kmeans'

# LDA Parameters
lda_doc_topic_prior: float = 0.1     # α (alpha)
lda_topic_word_prior: float = 0.01   # β (beta)
lda_max_iter: int = 20               # Iterations
lda_max_features: int = 5000         # Vocabulary
lda_learning_method: str = 'batch'   # Algorithm
```
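
The parameter names above match those of scikit-learn's `CountVectorizer` and `LatentDirichletAllocation`. This reference does not say which library `TopicModelingRiskDiscovery` uses, so the sketch below assumes scikit-learn and shows roughly where each setting would end up. The toy clauses and the choice of 3 topics are illustrative only.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy clauses for illustration only; real training uses the project's dataset.
clauses = [
    "Each party shall keep the agreement confidential.",
    "The company retains all intellectual property in the products.",
    "This agreement is governed by the laws of the state.",
    "Neither party may assign this agreement without consent.",
]

# lda_max_features caps the vocabulary size of the document-term matrix.
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
doc_term = vectorizer.fit_transform(clauses)

lda = LatentDirichletAllocation(
    n_components=3,            # number of topics (the project output shows 7)
    doc_topic_prior=0.1,       # lda_doc_topic_prior (alpha)
    topic_word_prior=0.01,     # lda_topic_word_prior (beta)
    max_iter=20,               # lda_max_iter
    learning_method="batch",   # lda_learning_method
    random_state=0,            # fixed seed so runs are repeatable
)

# Each row is one clause's probability distribution over the topics.
doc_topic = lda.fit_transform(doc_term)
print(doc_topic.shape)  # (4, 3)
```

Because each row of `doc_topic` sums to 1, a clause can carry weight on several topics at once, which is what the overlapping categories described in this guide rely on.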

---

## Switch Methods

### **Use LDA (default):**

```python
risk_discovery_method: str = "lda"
```

### **Use K-Means (old method):**

```python
risk_discovery_method: str = "kmeans"
```

---

## Performance Comparison

| Method | Balance | Pattern size (min-max) | Overlapping |
|--------|---------|------------------------|-------------|
| **LDA** | **0.718** | 1,146-3,426 | Yes |
| K-Means | 0.481 | 436-9,163 | No |

**Winner:** LDA (balance score about 49% higher: 0.718 vs 0.481)
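
The 49% figure is the relative gain in balance score, which can be checked directly from the two values in the table:

```python
# Balance scores copied from the comparison table.
lda_balance = 0.718
kmeans_balance = 0.481

# Relative improvement of LDA over K-Means.
relative_gain = (lda_balance - kmeans_balance) / kmeans_balance
print(f"Relative gain: {relative_gain:.1%}")  # prints "Relative gain: 49.3%"
```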

---

## Tuning Guide

### **More Focused Topics:**

```python
lda_doc_topic_prior: float = 0.05    # Lower = more focused
lda_topic_word_prior: float = 0.005  # Lower = sharper
```

### **Better Convergence:**

```python
lda_max_iter: int = 30  # More iterations
```

### **Faster Training (large datasets):**

```python
lda_learning_method: str = 'online'  # vs 'batch'
```

---

## Code Changes Summary

### **1. config.py (8 lines added)**

- Added `risk_discovery_method = "lda"`
- Added 5 LDA-specific parameters

### **2. risk_discovery.py (140 lines added)**

- Added `LDARiskDiscovery` class
- Wraps `TopicModelingRiskDiscovery`
- Compatible interface with existing code

### **3. trainer.py (25 lines modified)**

- Added import: `LDARiskDiscovery`
- Added method selection logic
- Instantiates LDA or K-Means based on config
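
A minimal sketch of what config-driven method selection can look like. The actual code in `trainer.py` is not shown in this reference, so `Config`, `KMeansRiskDiscovery`, and `build_risk_discovery` are hypothetical names, and both discovery classes are empty placeholders.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical stand-in for the project's config object."""
    risk_discovery_method: str = "lda"  # Options: 'lda', 'kmeans'


class LDARiskDiscovery:
    """Placeholder: the real class lives in risk_discovery.py."""


class KMeansRiskDiscovery:
    """Placeholder: the K-Means class name is an assumption."""


# Map each config value to the class that implements it.
DISCOVERY_METHODS = {
    "lda": LDARiskDiscovery,
    "kmeans": KMeansRiskDiscovery,
}


def build_risk_discovery(config: Config):
    """Instantiate the discovery method named in the config."""
    method = config.risk_discovery_method.lower()
    try:
        return DISCOVERY_METHODS[method]()
    except KeyError:
        raise ValueError(
            f"Unknown risk_discovery_method {method!r}; "
            f"expected one of {sorted(DISCOVERY_METHODS)}"
        ) from None


discovery = build_risk_discovery(Config())
print(type(discovery).__name__)  # LDARiskDiscovery
```

A lookup table keeps the selection in one place and turns a typo in `config.py` into an immediate error instead of a silent fallback.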

### **4. evaluator.py (no changes)**

- Already compatible

---

## Documentation Files

1. **`doc/LDA_MIGRATION_GUIDE.md`** - Complete guide (480 lines)
2. **`doc/LDA_INTEGRATION_COMPLETE.md`** - Summary (280 lines)
3. **`doc/LDA_QUICK_REFERENCE.md`** - This file
4. **`test_lda_integration.py`** - Verification (230 lines)

---

## Verification Checklist

- [x] Config has `risk_discovery_method = "lda"`
- [x] `LDARiskDiscovery` class exists
- [x] Trainer uses dynamic method selection
- [x] All tests pass (4/4)
- [x] Documentation complete

---

## Usage Examples

### **Basic Training:**

```bash
python3 train.py
```

### **With Custom Epochs:**

```bash
python3 train.py --epochs 10
```

### **Evaluate Model:**

```bash
python3 evaluate.py --checkpoint checkpoints/best_model.pt
```

### **Compare Methods:**

```bash
python3 compare_risk_discovery.py --advanced
```

---

## Expected Output

### **LDA Discovery:**

```
Using LDA (Topic Modeling) for risk discovery
Discovering risk patterns using LDA (n_topics=7)...
LDA provides balanced, overlapping risk categories
Best for legal text with multi-faceted risks
Creating document-term matrix...
Fitting LDA model...
LDA discovery complete: 7 risk topics found
Discovered Risk Patterns:
  • Topic_PARTY_AGREEMENT
    Keywords: party, agreement, shall, company, consent
  • Topic_INTELLECTUAL_PROPERTY
    Keywords: shall, product, products, agreement, section
  • Topic_COMPLIANCE
    Keywords: shall, agreement, laws, state, governed
  ...
```

---

## Troubleshooting

### **Issue:** "LDA did not converge"

**Solution:** Increase iterations in `config.py`

```python
lda_max_iter: int = 30
```

### **Issue:** Topics too similar

**Solution:** Lower priors for sharper topics

```python
lda_doc_topic_prior: float = 0.05
lda_topic_word_prior: float = 0.005
```

### **Issue:** Slow training

**Solution:** Use online learning

```python
lda_learning_method: str = 'online'
```

### **Issue:** Want K-Means back

**Solution:** Change method in `config.py`

```python
risk_discovery_method: str = "kmeans"
```

---

## Key Insights

### **Why LDA Wins:**

1. **Balance:** 0.718 vs 0.481 for K-Means, about 49% higher
2. **Overlapping:** Clauses can belong to multiple topics
3. **Probabilities:** Each assignment carries a topic probability that can serve as a confidence score
4. **Interpretability:** Clear topic themes for legal text
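
The difference between a hard assignment and an overlapping one can be shown with a single clause. The probabilities and the threshold below are made up for illustration; only the topic names come from this document.

```python
# Hypothetical topic probabilities for one clause (they sum to 1.0).
clause_topics = {
    "Topic_PARTY_AGREEMENT": 0.55,
    "Topic_INTELLECTUAL_PROPERTY": 0.30,
    "Topic_COMPLIANCE": 0.15,
}

# Hard assignment (K-Means style): keep only the single best topic.
hard_label = max(clause_topics, key=clause_topics.get)

# Overlapping assignment: keep every topic at or above a chosen threshold.
THRESHOLD = 0.25  # illustrative cut-off, not a project setting
soft_labels = [topic for topic, p in clause_topics.items() if p >= THRESHOLD]

print(hard_label)   # Topic_PARTY_AGREEMENT
print(soft_labels)  # ['Topic_PARTY_AGREEMENT', 'Topic_INTELLECTUAL_PROPERTY']
```

The probability attached to each kept topic (0.55 and 0.30 here) is what the list above calls a confidence score.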

### **When to Use Each:**

**Use LDA when:**

- Need balanced risk categories
- Clauses have multiple risk types
- Want probability distributions
- Need interpretable topics

**Use K-Means when:**

- Hard cluster assignments needed
- Speed is critical (slightly faster)
- Simple, clear boundaries preferred

---

## Comparison Data

From `risk_discovery_comparison_report.txt`:

```
Method               Balance   Patterns
----------------------------------------------
LDA (NEW DEFAULT)    0.718     7 (1,146-3,426)
Risk-o-meter         0.577     7 (534-4,363)
K-Means (OLD)        0.481     7 (436-9,163)
Hierarchical         0.362     7 (91-10,483)
Spectral             0.292     7 (11-13,702)
Mini-Batch           0.291     7 (2-13,785)
DBSCAN               1.000     1 (13,396)
----------------------------------------------
```

**Clear Winner:** LDA. DBSCAN's 1.000 reflects a single pattern rather than seven, so it is not a meaningful comparison.
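
DBSCAN shows the highest balance score in the report, but only because it produced one pattern. The sketch below ranks the methods using the figures from the report; the rule of requiring at least two patterns is an assumption made here to exclude that degenerate case, not something the report itself applies.

```python
# Rows copied from the comparison report: (method, balance, number of patterns).
results = [
    ("LDA", 0.718, 7),
    ("Risk-o-meter", 0.577, 7),
    ("K-Means", 0.481, 7),
    ("Hierarchical", 0.362, 7),
    ("Spectral", 0.292, 7),
    ("Mini-Batch", 0.291, 7),
    ("DBSCAN", 1.000, 1),
]

# A single pattern says nothing about balance, so require at least two.
comparable = [row for row in results if row[2] >= 2]

# Rank the remaining methods by balance score, highest first.
ranked = sorted(comparable, key=lambda row: row[1], reverse=True)

for method, balance, n_patterns in ranked:
    print(f"{method:<14} {balance:.3f}  ({n_patterns} patterns)")

best_method = ranked[0][0]
```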

---

## Learn More

### **LDA Theory:**

- Blei et al. (2003) - "Latent Dirichlet Allocation"
- Probabilistic topic modeling for text
- Documents = mixture of topics
- Topics = distribution over words
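
The two bullets on mixtures and distributions correspond to the standard generative process of smoothed LDA, written here with this guide's naming (α for the document-topic prior, β for the topic-word prior):

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the $n$-th word in document } d \\
w_{d,n} &\sim \mathrm{Categorical}(\varphi_{z_{d,n}}) && \text{the observed word}
\end{align*}
```

Fitting the model means inferring the mixtures θ and the topics φ from the observed words alone.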

### **LDA for Legal:**

- Overlapping categories (clauses have multiple themes)
- Interpretable (topic-word distributions)
- Used for contract analysis in the literature

### **Parameters:**

- **α (`doc_topic_prior`):** Controls document focus
  - Lower (0.01-0.1) = more focused (fewer topics per clause)
  - Higher (0.5-1.0) = more mixed (weight spread across topics)
- **β (`topic_word_prior`):** Controls topic specificity
  - Lower (0.001-0.01) = sharper topics (fewer dominant words)
  - Higher (0.1-0.5) = broader topics
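
The effect of a lower α can be seen without training a model by sampling topic mixtures from a symmetric Dirichlet prior. This is a standard-library sketch of the prior only, not the project's code; the helper names, the sample count, and the use of 7 topics (taken from the expected output) are choices made for illustration.

```python
import random


def sample_dirichlet(alpha: float, k: int, rng: random.Random) -> list:
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_top_share(alpha: float, k: int = 7, n: int = 2000, seed: int = 0) -> float:
    """Average weight of the largest topic across n sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n


# Smaller alpha concentrates each sampled document on fewer topics.
for alpha in (0.05, 0.1, 1.0):
    print(f"alpha={alpha:<4} mean top-topic share={mean_top_share(alpha):.2f}")
```

As α shrinks, the largest topic takes a bigger share of each sampled mixture, which is the "more focused" behavior described above. The same reasoning applies to β and the word distribution of each topic.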

---

## Benefits Summary

### **For Users:**

- Better balanced risk categories
- More interpretable topic names
- Probability scores for confidence
- Highest balance score among the seven-pattern methods in the comparison report

### **For Developers:**

- Clean, compatible interface
- Easy to switch methods
- Well documented
- Comprehensive tests

### **For Models:**

- Better training data balance
- Reduced class imbalance (pattern sizes range from 1,146 to 3,426)
- Richer feature representation
- Overlapping pattern support

---

## Support

**Questions?** See:

- `doc/LDA_MIGRATION_GUIDE.md` - Full guide
- `doc/LDA_INTEGRATION_COMPLETE.md` - Summary
- `risk_discovery_comparison_report.txt` - Results

**Test:** `python3 test_lda_integration.py`

**Train:** `python3 train.py`

---

- **Status:** **READY TO USE**
- **Default Method:** LDA
- **Backup Method:** K-Means (configurable)
- **Verified:** 4/4 tests passing
- **Documented:** Complete

**Happy Training!**