# 🎯 LDA Risk Discovery Migration Guide
## Overview
The codebase has been migrated to use **LDA (Latent Dirichlet Allocation)** as the primary risk discovery method, replacing K-Means clustering. The change is based on comparison results showing LDA's stronger overall performance for legal contract risk analysis.
---
## 📊 Why LDA?
Based on comparison results from `risk_discovery_comparison_report.txt`:
### **LDA Performance:**
- ✅ **Best Balance Score: 0.718** (highest among all methods)
- ✅ **Quality Metrics:** Perplexity: 1186.4, Topic Diversity: 6.3
- ✅ **Even Distribution:** 1,146-3,426 clauses per pattern
- ✅ **Interpretable Topics:** Clear themes (Party/Agreement, IP, Compliance)
### **LDA Advantages for Legal Text:**
1. **Overlapping Categories** - Clauses can belong to multiple risk types
2. **Probability Distributions** - Know confidence of risk assignments
3. **Better Balance** - More even distribution across discovered patterns
4. **Interpretability** - Clear topic-word distributions
5. **Proven for Legal Text** - Widely used in contract analysis
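Points 1 and 2 can be seen concretely: each clause gets a probability vector over topics rather than a single cluster id. A minimal sketch with hypothetical numbers (the vector below is illustrative, not real model output):

```python
import numpy as np

# Hypothetical topic distribution for one clause (7 topics).
# Under LDA a clause is a mixture: here it is mostly topic 0
# but also partly topic 2 -- something hard clustering cannot express.
topic_probs = np.array([0.55, 0.05, 0.25, 0.05, 0.04, 0.03, 0.03])

assert np.isclose(topic_probs.sum(), 1.0)    # a valid probability distribution
primary = int(np.argmax(topic_probs))        # hard label, if one is needed
confidence = topic_probs[primary]            # confidence of that assignment
print(primary, round(float(confidence), 2))  # -> 0 0.55
```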
---
## 🔧 Changes Made
### 1. **config.py** - Added LDA Configuration
**New Parameters:**
```python
# Risk discovery method selection
risk_discovery_method: str = "lda" # Options: 'lda', 'kmeans', 'hierarchical', etc.
# LDA-specific parameters
lda_doc_topic_prior: float = 0.1 # Alpha - document-topic density
lda_topic_word_prior: float = 0.01 # Beta - topic-word density
lda_max_iter: int = 20 # Maximum LDA training iterations
lda_max_features: int = 5000 # Vocabulary size for LDA
lda_learning_method: str = 'batch' # 'batch' or 'online'
```
**Key Settings:**
- `doc_topic_prior (α)`: Lower values (0.1) = documents focus on fewer topics
- `topic_word_prior (β)`: Lower values (0.01) = topics have fewer dominant words
- `learning_method`: 'batch' for better quality, 'online' for speed
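Assuming the implementation is backed by scikit-learn's `LatentDirichletAllocation` (a common choice, though the guide does not state it explicitly), the config parameters above would map onto the estimator roughly like this:

```python
# Sketch of how the config parameters map onto scikit-learn's LDA estimator.
# The mapping is an assumption; parameter names on the right are sklearn's own.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=5000)  # lda_max_features
lda = LatentDirichletAllocation(
    n_components=7,            # number of risk topics to discover
    doc_topic_prior=0.1,       # alpha: lda_doc_topic_prior
    topic_word_prior=0.01,     # beta: lda_topic_word_prior
    max_iter=20,               # lda_max_iter
    learning_method="batch",   # lda_learning_method
    random_state=42,
)
```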
### 2. **risk_discovery.py** - Added LDARiskDiscovery Class
**New Class:**
```python
class LDARiskDiscovery:
"""
LDA-based risk discovery with compatible interface.
Wraps TopicModelingRiskDiscovery from alternatives.
"""
```
**Key Features:**
- Compatible interface with `UnsupervisedRiskDiscovery`
- Wraps `TopicModelingRiskDiscovery` from `risk_discovery_alternatives.py`
- Provides same methods: `discover_risk_patterns()`, `get_risk_labels()`, `get_discovered_risk_names()`
- **Extra method:** `get_topic_distribution()` - returns probability distribution over all topics
### 3. **trainer.py** - Dynamic Method Selection
**Updated Initialization:**
```python
def __init__(self, config: LegalBertConfig):
# Dynamically select risk discovery method
risk_method = config.risk_discovery_method.lower()
if risk_method == 'lda':
self.risk_discovery = LDARiskDiscovery(...)
elif risk_method == 'kmeans':
self.risk_discovery = UnsupervisedRiskDiscovery(...)
else:
# Default to LDA
self.risk_discovery = LDARiskDiscovery(...)
```
### 4. **evaluator.py** - Already Compatible
No changes needed! The evaluator uses `self.risk_discovery.discovered_patterns` which both LDA and K-Means provide.
---
## 🚀 Usage
### **Option 1: Use Default LDA Settings (Recommended)**
```bash
# Train with LDA (default)
python3 train.py
# Evaluate with LDA
python3 evaluate.py --checkpoint checkpoints/best_model.pt
```
### **Option 2: Customize LDA Parameters**
Edit `config.py`:
```python
# Fine-tune for your dataset
lda_doc_topic_prior: float = 0.05 # More focused topics
lda_topic_word_prior: float = 0.005 # Sharper topic definitions
lda_max_iter: int = 30 # Better convergence
```
### **Option 3: Switch Back to K-Means**
Edit `config.py`:
```python
risk_discovery_method: str = "kmeans" # Change from "lda"
```
---
## 📈 Expected Output
### **During Training:**
```
🎯 Using LDA (Topic Modeling) for risk discovery
🔍 Discovering risk patterns using LDA (n_topics=7)...
📊 LDA provides balanced, overlapping risk categories
🎯 Best for legal text with multi-faceted risks
📊 Creating document-term matrix...
🧠 Fitting LDA model...
📋 Analyzing topics and naming patterns...
✅ LDA discovery complete: 7 risk topics found
🔍 Discovered Risk Patterns:
• Topic_PARTY_AGREEMENT
  Keywords: party, agreement, shall, company, consent
• Topic_INTELLECTUAL_PROPERTY
  Keywords: shall, product, products, agreement, section
• Topic_COMPLIANCE
  Keywords: shall, agreement, laws, state, governed
...
```
### **Key Differences from K-Means:**
| Aspect | K-Means (Old) | LDA (New) |
|--------|--------------|-----------|
| Pattern Names | `low_risk_obligation_pattern` | `Topic_PARTY_AGREEMENT` |
| Assignment | Hard (one cluster) | Soft (probability distribution) |
| Balance | 0.481 | **0.718** ✅ |
| Overlapping | No | **Yes** ✅ |
| Interpretability | Good | **Better** ✅ |
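The "soft assignment" row is the practical difference: with a probability threshold, one clause can carry several risk labels, which a hard K-Means assignment cannot express. A sketch with hypothetical probabilities (topic names taken from the example output above):

```python
# Hard vs. soft assignment for one clause. The probabilities are
# hypothetical, illustrating a clause that mixes IP and compliance language.
import numpy as np

topics = ["Topic_PARTY_AGREEMENT", "Topic_INTELLECTUAL_PROPERTY", "Topic_COMPLIANCE"]
probs = np.array([0.10, 0.48, 0.42])

hard_label = topics[int(np.argmax(probs))]                     # K-Means-style: one label
soft_labels = [t for t, p in zip(topics, probs) if p >= 0.30]  # LDA-style: all likely risks

print(hard_label)   # Topic_INTELLECTUAL_PROPERTY
print(soft_labels)  # ['Topic_INTELLECTUAL_PROPERTY', 'Topic_COMPLIANCE']
```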
---
## 🔍 Verification
### **1. Check Risk Discovery Method:**
```bash
python3 -c "from config import LegalBertConfig; c = LegalBertConfig(); print(f'Method: {c.risk_discovery_method}')"
# Expected: Method: lda
```
### **2. Test LDA Discovery:**
```python
from config import LegalBertConfig
from trainer import LegalBertTrainer
config = LegalBertConfig()
trainer = LegalBertTrainer(config)
# Should print: "🎯 Using LDA (Topic Modeling) for risk discovery"
```
### **3. Verify Topic Distribution (LDA-specific feature):**
```python
# Get probability distribution over all topics
clauses = ["Sample clause text..."]
topic_probs = trainer.risk_discovery.get_topic_distribution(clauses)
print(f"Topic distribution shape: {topic_probs.shape}")
# Expected: (1, 7) - probabilities for each of 7 topics
```
---
## 🎛️ LDA Parameter Tuning Guide
### **Document-Topic Prior (α / doc_topic_prior)**
Controls how many topics each document covers:
- **Lower (0.01-0.1)**: Documents focus on 1-2 topics → More decisive assignments
- **Higher (0.5-1.0)**: Documents spread across many topics → More mixed assignments
**Recommended:** `0.1` (current setting) - Good for legal clauses with focused risks
### **Topic-Word Prior (β / topic_word_prior)**
Controls how many words define each topic:
- **Lower (0.001-0.01)**: Topics defined by fewer words → Sharper topics
- **Higher (0.1-0.5)**: Topics use more words → Broader topics
**Recommended:** `0.01` (current setting) - Clear topic definitions
### **Max Iterations**
- **10-20**: Fast, may not fully converge
- **20-30**: **Recommended** - Good balance
- **50+**: Better quality, slower training
### **Learning Method**
- **'batch'** (current): Better quality, uses full dataset per iteration
- **'online'**: Faster, good for very large datasets (>100K clauses)
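A quick way to see the α effect on your own data is to fit LDA twice and compare how concentrated the document-topic distributions come out. The toy corpus below is a stand-in, assuming a scikit-learn backend; swap in real clauses:

```python
# Compare document focus at two alpha settings: a lower doc_topic_prior
# should push each document's probability mass toward fewer topics.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: three clearly distinct clause "themes", repeated
clauses = [
    "party shall obtain prior written consent under this agreement",
    "licensee retains all intellectual property rights in the products",
    "this agreement shall be governed by the laws of the state",
] * 10

X = CountVectorizer().fit_transform(clauses)
results = {}
for alpha in (0.05, 1.0):
    lda = LatentDirichletAllocation(
        n_components=3, doc_topic_prior=alpha,
        learning_method="batch", max_iter=20, random_state=42,
    )
    probs = lda.fit_transform(X)
    # Mean top-topic probability: closer to 1.0 = more focused documents
    results[alpha] = probs.max(axis=1).mean()
    print(f"alpha={alpha}: mean max topic prob = {results[alpha]:.2f}")
```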
---
## πŸ› Troubleshooting
### **Error: "Import 'TopicModelingRiskDiscovery' not found"**
**Solution:** Ensure `risk_discovery_alternatives.py` is in the same directory.
### **Warning: "LDA did not converge"**
**Solution:** Increase `lda_max_iter` in config.py:
```python
lda_max_iter: int = 30 # or 40
```
### **Topics are too similar/overlapping**
**Solution:** Lower the priors for sharper topics:
```python
lda_doc_topic_prior: float = 0.05 # More focused
lda_topic_word_prior: float = 0.005 # Sharper
```
### **Need faster training**
**Solution:** Switch to online learning:
```python
lda_learning_method: str = 'online'
```
---
## 📚 References
### **LDA Theory:**
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. JMLR.
### **LDA for Legal Text:**
- Katz, D. M., et al. (2011). Quantitative analysis of the law using text analytics.
- Ashley, K. D. (2017). Artificial Intelligence and Legal Analytics.
### **Comparison Results:**
- See `risk_discovery_comparison_report.txt` for full analysis
- See `risk_discovery_comparison_results.json` for raw data
---
## ✅ Migration Complete
The codebase now uses **LDA as the default risk discovery method**, providing:
1. ✅ **Better Balance** - 0.718 vs 0.481 (K-Means)
2. ✅ **Overlapping Categories** - Clauses can belong to multiple risk types
3. ✅ **Probability Distributions** - Confidence scores for assignments
4. ✅ **Proven Quality** - Best performer in comparison study
5. ✅ **Backward Compatible** - Can switch back to K-Means anytime
**Next Steps:**
1. Run `python3 train.py` to train with LDA
2. Monitor discovered topics in output
3. Adjust LDA parameters if needed (see tuning guide above)
4. Compare results with previous K-Means baseline
---
**Questions?** Check the comparison report or review the code comments in `risk_discovery.py` for detailed explanations.