# LDA Risk Discovery Migration Guide
## Overview
The codebase has been migrated to use **LDA (Latent Dirichlet Allocation)** as the primary risk discovery method, replacing K-Means clustering. The change is based on comparison results showing LDA's superior performance for legal contract risk analysis.
---
## Why LDA?
Based on comparison results from `risk_discovery_comparison_report.txt`:
### **LDA Performance:**
- **Best Balance Score: 0.718** (highest among all methods)
- **Quality Metrics:** Perplexity: 1186.4, Topic Diversity: 6.3
- **Even Distribution:** 1,146-3,426 clauses per pattern
- **Interpretable Topics:** Clear themes (Party/Agreement, IP, Compliance)
### **LDA Advantages for Legal Text:**
1. **Overlapping Categories** - Clauses can belong to multiple risk types
2. **Probability Distributions** - Know confidence of risk assignments
3. **Better Balance** - More even distribution across discovered patterns
4. **Interpretability** - Clear topic-word distributions
5. **Proven for Legal Text** - Widely used in contract analysis
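The difference between hard and soft assignment (points 1-2 above) can be sketched with a toy doc-topic distribution; the numbers below are made up for illustration:

```python
# Toy doc-topic distribution for one clause over three risk topics
probs = [0.55, 0.35, 0.10]

# K-Means-style hard assignment: exactly one cluster label
hard_label = probs.index(max(probs))

# LDA-style soft assignment: every topic above a confidence threshold
soft_labels = [i for i, p in enumerate(probs) if p > 0.25]

print(hard_label)   # 0
print(soft_labels)  # [0, 1] - the clause belongs to two risk types
```

The probabilities also double as confidence scores, which a hard clustering cannot provide.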
---
## Changes Made
### 1. **config.py** - Added LDA Configuration
**New Parameters:**
```python
# Risk discovery method selection
risk_discovery_method: str = "lda" # Options: 'lda', 'kmeans', 'hierarchical', etc.
# LDA-specific parameters
lda_doc_topic_prior: float = 0.1 # Alpha - document-topic density
lda_topic_word_prior: float = 0.01 # Beta - topic-word density
lda_max_iter: int = 20 # Maximum LDA training iterations
lda_max_features: int = 5000 # Vocabulary size for LDA
lda_learning_method: str = 'batch' # 'batch' or 'online'
```
**Key Settings:**
- `doc_topic_prior (α)`: Lower values (0.1) = documents focus on fewer topics
- `topic_word_prior (β)`: Lower values (0.01) = topics have fewer dominant words
- `learning_method`: 'batch' for better quality, 'online' for speed
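These settings map one-to-one onto scikit-learn's `LatentDirichletAllocation`. The sketch below assumes sklearn is the backend estimator; `build_lda` and the `SimpleNamespace` stand-in for `LegalBertConfig` are illustrative, not part of the codebase:

```python
from types import SimpleNamespace
from sklearn.decomposition import LatentDirichletAllocation

def build_lda(config, n_topics=7):
    """Map the LDA config fields onto scikit-learn's estimator
    (build_lda is a hypothetical helper for illustration)."""
    return LatentDirichletAllocation(
        n_components=n_topics,
        doc_topic_prior=config.lda_doc_topic_prior,    # alpha
        topic_word_prior=config.lda_topic_word_prior,  # beta
        max_iter=config.lda_max_iter,
        learning_method=config.lda_learning_method,
        random_state=42,
    )

# Stand-in for LegalBertConfig, with just the fields used here
config = SimpleNamespace(
    lda_doc_topic_prior=0.1,
    lda_topic_word_prior=0.01,
    lda_max_iter=20,
    lda_learning_method="batch",
)
lda = build_lda(config)
```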
### 2. **risk_discovery.py** - Added LDARiskDiscovery Class
**New Class:**
```python
class LDARiskDiscovery:
    """
    LDA-based risk discovery with compatible interface.
    Wraps TopicModelingRiskDiscovery from alternatives.
    """
```
**Key Features:**
- Compatible interface with `UnsupervisedRiskDiscovery`
- Wraps `TopicModelingRiskDiscovery` from `risk_discovery_alternatives.py`
- Provides same methods: `discover_risk_patterns()`, `get_risk_labels()`, `get_discovered_risk_names()`
- **Extra method:** `get_topic_distribution()` - returns probability distribution over all topics
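The delegation pattern can be sketched as follows. The stub backend and method bodies here are assumptions for illustration; the real implementations live in `risk_discovery.py` and `risk_discovery_alternatives.py`:

```python
class TopicModelingRiskDiscovery:
    """Stub standing in for the real backend in
    risk_discovery_alternatives.py (interface is an assumption)."""
    def discover_risk_patterns(self, clauses):
        return {"Topic_PARTY_AGREEMENT": ["party", "agreement", "shall"]}
    def get_topic_distribution(self, clauses):
        return [[1 / 7] * 7 for _ in clauses]  # uniform dummy distribution

class LDARiskDiscovery:
    """Wrapper sketch: same surface as UnsupervisedRiskDiscovery,
    plus the LDA-only get_topic_distribution()."""
    def __init__(self):
        self._backend = TopicModelingRiskDiscovery()
        self.discovered_patterns = {}

    def discover_risk_patterns(self, clauses):
        self.discovered_patterns = self._backend.discover_risk_patterns(clauses)
        return self.discovered_patterns

    def get_discovered_risk_names(self):
        return list(self.discovered_patterns)

    def get_risk_labels(self, clauses):
        # Hard labels fall out of the soft distribution via argmax
        dists = self._backend.get_topic_distribution(clauses)
        return [row.index(max(row)) for row in dists]

    def get_topic_distribution(self, clauses):
        return self._backend.get_topic_distribution(clauses)
```

Because hard labels are derived from the soft distribution, downstream code written for K-Means keeps working unchanged.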
### 3. **trainer.py** - Dynamic Method Selection
**Updated Initialization:**
```python
def __init__(self, config: LegalBertConfig):
    # Dynamically select risk discovery method
    risk_method = config.risk_discovery_method.lower()
    if risk_method == 'lda':
        self.risk_discovery = LDARiskDiscovery(...)
    elif risk_method == 'kmeans':
        self.risk_discovery = UnsupervisedRiskDiscovery(...)
    else:
        # Default to LDA
        self.risk_discovery = LDARiskDiscovery(...)
```
### 4. **evaluator.py** - Already Compatible
No changes needed! The evaluator uses `self.risk_discovery.discovered_patterns` which both LDA and K-Means provide.
---
## Usage
### **Option 1: Use Default LDA Settings (Recommended)**
```bash
# Train with LDA (default)
python3 train.py
# Evaluate with LDA
python3 evaluate.py --checkpoint checkpoints/best_model.pt
```
### **Option 2: Customize LDA Parameters**
Edit `config.py`:
```python
# Fine-tune for your dataset
lda_doc_topic_prior: float = 0.05 # More focused topics
lda_topic_word_prior: float = 0.005 # Sharper topic definitions
lda_max_iter: int = 30 # Better convergence
```
### **Option 3: Switch Back to K-Means**
Edit `config.py`:
```python
risk_discovery_method: str = "kmeans" # Change from "lda"
```
---
## Expected Output
### **During Training:**
```
Using LDA (Topic Modeling) for risk discovery
Discovering risk patterns using LDA (n_topics=7)...
   LDA provides balanced, overlapping risk categories
   Best for legal text with multi-faceted risks
Creating document-term matrix...
Fitting LDA model...
Analyzing topics and naming patterns...
LDA discovery complete: 7 risk topics found

Discovered Risk Patterns:
  • Topic_PARTY_AGREEMENT
    Keywords: party, agreement, shall, company, consent
  • Topic_INTELLECTUAL_PROPERTY
    Keywords: shall, product, products, agreement, section
  • Topic_COMPLIANCE
    Keywords: shall, agreement, laws, state, governed
  ...
```
### **Key Differences from K-Means:**
| Aspect | K-Means (Old) | LDA (New) |
|--------|--------------|-----------|
| Pattern Names | `low_risk_obligation_pattern` | `Topic_PARTY_AGREEMENT` |
| Assignment | Hard (one cluster) | Soft (probability distribution) |
| Balance | 0.481 | **0.718** |
| Overlapping | No | **Yes** |
| Interpretability | Good | **Better** |
---
## Verification
### **1. Check Risk Discovery Method:**
```bash
python3 -c "from config import LegalBertConfig; c = LegalBertConfig(); print(f'Method: {c.risk_discovery_method}')"
# Expected: Method: lda
```
### **2. Test LDA Discovery:**
```python
from config import LegalBertConfig
from trainer import LegalBertTrainer
config = LegalBertConfig()
trainer = LegalBertTrainer(config)
# Should print: "π― Using LDA (Topic Modeling) for risk discovery"
```
### **3. Verify Topic Distribution (LDA-specific feature):**
```python
# Get probability distribution over all topics
clauses = ["Sample clause text..."]
topic_probs = trainer.risk_discovery.get_topic_distribution(clauses)
print(f"Topic distribution shape: {topic_probs.shape}")
# Expected: (1, 7) - probabilities for each of 7 topics
```
---
## LDA Parameter Tuning Guide
### **Document-Topic Prior (α / doc_topic_prior)**
Controls how many topics each document covers:
- **Lower (0.01-0.1)**: Documents focus on 1-2 topics β More decisive assignments
- **Higher (0.5-1.0)**: Documents spread across many topics β More mixed assignments
**Recommended:** `0.1` (current setting) - Good for legal clauses with focused risks
### **Topic-Word Prior (β / topic_word_prior)**
Controls how many words define each topic:
- **Lower (0.001-0.01)**: Topics defined by fewer words β Sharper topics
- **Higher (0.1-0.5)**: Topics use more words β Broader topics
**Recommended:** `0.01` (current setting) - Clear topic definitions
### **Max Iterations**
- **10-20**: Fast, may not fully converge
- **20-30**: **Recommended** - Good balance
- **50+**: Better quality, slower training
### **Learning Method**
- **'batch'** (current): Better quality, uses full dataset per iteration
- **'online'**: Faster, good for very large datasets (>100K clauses)
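The parameters above can be exercised end to end with scikit-learn, which is a reasonable guess at the backend given the parameter names. The toy corpus, threshold, and `n_components=3` below are illustrative assumptions (the guide itself uses 7 topics on real contract clauses):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny toy corpus standing in for real contract clauses
clauses = [
    "The party shall indemnify the company against all claims.",
    "This agreement shall be governed by the laws of the state.",
    "All intellectual property in the products remains with the company.",
] * 5

# Document-term matrix, mirroring lda_max_features
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
doc_term = vectorizer.fit_transform(clauses)

lda = LatentDirichletAllocation(
    n_components=3,        # n_topics; the guide uses 7
    doc_topic_prior=0.1,   # alpha: lower = fewer topics per clause
    topic_word_prior=0.01, # beta: lower = sharper topics
    max_iter=20,
    learning_method="batch",
    random_state=42,
)
topic_probs = lda.fit_transform(doc_term)  # rows are per-clause distributions
print(topic_probs.shape)  # (15, 3)
```

Raising `doc_topic_prior` toward 1.0 and refitting visibly flattens each row of `topic_probs`, which is a quick way to sanity-check a tuning change.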
---
## Troubleshooting
### **Error: "Import 'TopicModelingRiskDiscovery' not found"**
**Solution:** Ensure `risk_discovery_alternatives.py` is in the same directory.
### **Warning: "LDA did not converge"**
**Solution:** Increase `lda_max_iter` in config.py:
```python
lda_max_iter: int = 30 # or 40
```
### **Topics are too similar/overlapping**
**Solution:** Lower the priors for sharper topics:
```python
lda_doc_topic_prior: float = 0.05 # More focused
lda_topic_word_prior: float = 0.005 # Sharper
```
### **Need faster training**
**Solution:** Switch to online learning:
```python
lda_learning_method: str = 'online'
```
---
## References
### **LDA Theory:**
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. JMLR, 3, 993-1022.
### **LDA for Legal Text:**
- Katz, D. M., et al. (2011). Quantitative analysis of the law using text analytics.
- Ashley, K. D. (2017). Artificial Intelligence and Legal Analytics.
### **Comparison Results:**
- See `risk_discovery_comparison_report.txt` for full analysis
- See `risk_discovery_comparison_results.json` for raw data
---
## Migration Complete
The codebase now uses **LDA as the default risk discovery method**, providing:
1. **Better Balance** - 0.718 vs 0.481 (K-Means)
2. **Overlapping Categories** - Clauses can belong to multiple risk types
3. **Probability Distributions** - Confidence scores for assignments
4. **Proven Quality** - Best performer in the comparison study
5. **Backward Compatible** - Can switch back to K-Means anytime
**Next Steps:**
1. Run `python3 train.py` to train with LDA
2. Monitor discovered topics in output
3. Adjust LDA parameters if needed (see tuning guide above)
4. Compare results with previous K-Means baseline
---
**Questions?** Check the comparison report or review the code comments in `risk_discovery.py` for detailed explanations.