# LDA Risk Discovery Migration Guide
## Overview
The codebase has been migrated to use **LDA (Latent Dirichlet Allocation)** as the primary risk discovery method, replacing K-Means clustering. The change is based on comparison results showing LDA's superior performance for legal contract risk analysis.
---
## Why LDA?
Based on comparison results from `risk_discovery_comparison_report.txt`:
### **LDA Performance:**
- **Best Balance Score: 0.718** (highest among all methods)
- **Quality Metrics:** Perplexity: 1186.4, Topic Diversity: 6.3
- **Even Distribution:** 1,146-3,426 clauses per pattern
- **Interpretable Topics:** Clear themes (Party/Agreement, IP, Compliance)
### **LDA Advantages for Legal Text:**
1. **Overlapping Categories** - Clauses can belong to multiple risk types
2. **Probability Distributions** - Know confidence of risk assignments
3. **Better Balance** - More even distribution across discovered patterns
4. **Interpretability** - Clear topic-word distributions
5. **Proven for Legal Text** - Widely used in contract analysis
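The difference between hard and soft assignment (points 1-2 above) can be sketched with a toy doc-topic distribution; the numbers below are made up for illustration:

```python
# Toy doc-topic distribution for one clause over three risk topics
probs = [0.55, 0.35, 0.10]

# K-Means-style hard assignment: exactly one cluster label
hard_label = probs.index(max(probs))

# LDA-style soft assignment: every topic above a confidence threshold
soft_labels = [i for i, p in enumerate(probs) if p > 0.25]

print(hard_label)   # 0
print(soft_labels)  # [0, 1] - the clause belongs to two risk types
```

The probabilities also double as confidence scores, which a hard clustering cannot provide.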
---
## Changes Made
### 1. **config.py** - Added LDA Configuration
**New Parameters:**
```python
# Risk discovery method selection
risk_discovery_method: str = "lda" # Options: 'lda', 'kmeans', 'hierarchical', etc.
# LDA-specific parameters
lda_doc_topic_prior: float = 0.1 # Alpha - document-topic density
lda_topic_word_prior: float = 0.01 # Beta - topic-word density
lda_max_iter: int = 20 # Maximum LDA training iterations
lda_max_features: int = 5000 # Vocabulary size for LDA
lda_learning_method: str = 'batch' # 'batch' or 'online'
```
**Key Settings:**
- `doc_topic_prior (α)`: Lower values (0.1) = documents focus on fewer topics
- `topic_word_prior (β)`: Lower values (0.01) = topics have fewer dominant words
- `learning_method`: 'batch' for better quality, 'online' for speed
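These settings map one-to-one onto scikit-learn's `LatentDirichletAllocation`. The sketch below assumes sklearn is the backend estimator; `build_lda` and the `SimpleNamespace` stand-in for `LegalBertConfig` are illustrative, not part of the codebase:

```python
from types import SimpleNamespace
from sklearn.decomposition import LatentDirichletAllocation

def build_lda(config, n_topics=7):
    """Map the LDA config fields onto scikit-learn's estimator
    (build_lda is a hypothetical helper for illustration)."""
    return LatentDirichletAllocation(
        n_components=n_topics,
        doc_topic_prior=config.lda_doc_topic_prior,    # alpha
        topic_word_prior=config.lda_topic_word_prior,  # beta
        max_iter=config.lda_max_iter,
        learning_method=config.lda_learning_method,
        random_state=42,
    )

# Stand-in for LegalBertConfig, with just the fields used here
config = SimpleNamespace(
    lda_doc_topic_prior=0.1,
    lda_topic_word_prior=0.01,
    lda_max_iter=20,
    lda_learning_method="batch",
)
lda = build_lda(config)
```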
### 2. **risk_discovery.py** - Added LDARiskDiscovery Class
**New Class:**
```python
class LDARiskDiscovery:
    """
    LDA-based risk discovery with compatible interface.
    Wraps TopicModelingRiskDiscovery from alternatives.
    """
```
**Key Features:**
- Compatible interface with `UnsupervisedRiskDiscovery`
- Wraps `TopicModelingRiskDiscovery` from `risk_discovery_alternatives.py`
- Provides same methods: `discover_risk_patterns()`, `get_risk_labels()`, `get_discovered_risk_names()`
- **Extra method:** `get_topic_distribution()` - returns probability distribution over all topics
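The delegation pattern can be sketched as follows. The stub backend and method bodies here are assumptions for illustration; the real implementations live in `risk_discovery.py` and `risk_discovery_alternatives.py`:

```python
class TopicModelingRiskDiscovery:
    """Stub standing in for the real backend in
    risk_discovery_alternatives.py (interface is an assumption)."""
    def discover_risk_patterns(self, clauses):
        return {"Topic_PARTY_AGREEMENT": ["party", "agreement", "shall"]}
    def get_topic_distribution(self, clauses):
        return [[1 / 7] * 7 for _ in clauses]  # uniform dummy distribution

class LDARiskDiscovery:
    """Wrapper sketch: same surface as UnsupervisedRiskDiscovery,
    plus the LDA-only get_topic_distribution()."""
    def __init__(self):
        self._backend = TopicModelingRiskDiscovery()
        self.discovered_patterns = {}

    def discover_risk_patterns(self, clauses):
        self.discovered_patterns = self._backend.discover_risk_patterns(clauses)
        return self.discovered_patterns

    def get_discovered_risk_names(self):
        return list(self.discovered_patterns)

    def get_risk_labels(self, clauses):
        # Hard labels fall out of the soft distribution via argmax
        dists = self._backend.get_topic_distribution(clauses)
        return [row.index(max(row)) for row in dists]

    def get_topic_distribution(self, clauses):
        return self._backend.get_topic_distribution(clauses)
```

Because hard labels are derived from the soft distribution, downstream code written for K-Means keeps working unchanged.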
### 3. **trainer.py** - Dynamic Method Selection
**Updated Initialization:**
```python
def __init__(self, config: LegalBertConfig):
    # Dynamically select risk discovery method
    risk_method = config.risk_discovery_method.lower()
    if risk_method == 'lda':
        self.risk_discovery = LDARiskDiscovery(...)
    elif risk_method == 'kmeans':
        self.risk_discovery = UnsupervisedRiskDiscovery(...)
    else:
        # Default to LDA
        self.risk_discovery = LDARiskDiscovery(...)
```
### 4. **evaluator.py** - Already Compatible
No changes needed! The evaluator uses `self.risk_discovery.discovered_patterns` which both LDA and K-Means provide.
---
## Usage
### **Option 1: Use Default LDA Settings (Recommended)**
```bash
# Train with LDA (default)
python3 train.py
# Evaluate with LDA
python3 evaluate.py --checkpoint checkpoints/best_model.pt
```
### **Option 2: Customize LDA Parameters**
Edit `config.py`:
```python
# Fine-tune for your dataset
lda_doc_topic_prior: float = 0.05 # More focused topics
lda_topic_word_prior: float = 0.005 # Sharper topic definitions
lda_max_iter: int = 30 # Better convergence
```
### **Option 3: Switch Back to K-Means**
Edit `config.py`:
```python
risk_discovery_method: str = "kmeans" # Change from "lda"
```
---
## Expected Output
### **During Training:**
```
Using LDA (Topic Modeling) for risk discovery
Discovering risk patterns using LDA (n_topics=7)...
   LDA provides balanced, overlapping risk categories
   Best for legal text with multi-faceted risks
Creating document-term matrix...
Fitting LDA model...
Analyzing topics and naming patterns...
LDA discovery complete: 7 risk topics found

Discovered Risk Patterns:
  • Topic_PARTY_AGREEMENT
    Keywords: party, agreement, shall, company, consent
  • Topic_INTELLECTUAL_PROPERTY
    Keywords: shall, product, products, agreement, section
  • Topic_COMPLIANCE
    Keywords: shall, agreement, laws, state, governed
  ...
```
### **Key Differences from K-Means:**
| Aspect | K-Means (Old) | LDA (New) |
|--------|--------------|-----------|
| Pattern Names | `low_risk_obligation_pattern` | `Topic_PARTY_AGREEMENT` |
| Assignment | Hard (one cluster) | Soft (probability distribution) |
| Balance | 0.481 | **0.718** |
| Overlapping | No | **Yes** |
| Interpretability | Good | **Better** |
---
## Verification
### **1. Check Risk Discovery Method:**
```bash
python3 -c "from config import LegalBertConfig; c = LegalBertConfig(); print(f'Method: {c.risk_discovery_method}')"
# Expected: Method: lda
```
### **2. Test LDA Discovery:**
```python
from config import LegalBertConfig
from trainer import LegalBertTrainer
config = LegalBertConfig()
trainer = LegalBertTrainer(config)
# Should print: "π― Using LDA (Topic Modeling) for risk discovery"
```
### **3. Verify Topic Distribution (LDA-specific feature):**
```python
# Get probability distribution over all topics
clauses = ["Sample clause text..."]
topic_probs = trainer.risk_discovery.get_topic_distribution(clauses)
print(f"Topic distribution shape: {topic_probs.shape}")
# Expected: (1, 7) - probabilities for each of 7 topics
```
---
## LDA Parameter Tuning Guide
### **Document-Topic Prior (α / doc_topic_prior)**
Controls how many topics each document covers:
- **Lower (0.01-0.1)**: Documents focus on 1-2 topics β More decisive assignments
- **Higher (0.5-1.0)**: Documents spread across many topics β More mixed assignments
**Recommended:** `0.1` (current setting) - Good for legal clauses with focused risks
### **Topic-Word Prior (β / topic_word_prior)**
Controls how many words define each topic:
- **Lower (0.001-0.01)**: Topics defined by fewer words β Sharper topics
- **Higher (0.1-0.5)**: Topics use more words β Broader topics
**Recommended:** `0.01` (current setting) - Clear topic definitions
### **Max Iterations**
- **10-20**: Fast, may not fully converge
- **20-30**: **Recommended** - Good balance
- **50+**: Better quality, slower training
### **Learning Method**
- **'batch'** (current): Better quality, uses full dataset per iteration
- **'online'**: Faster, good for very large datasets (>100K clauses)
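The parameters above can be exercised end to end with scikit-learn, which is a reasonable guess at the backend given the parameter names. The toy corpus, threshold, and `n_components=3` below are illustrative assumptions (the guide itself uses 7 topics on real contract clauses):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny toy corpus standing in for real contract clauses
clauses = [
    "The party shall indemnify the company against all claims.",
    "This agreement shall be governed by the laws of the state.",
    "All intellectual property in the products remains with the company.",
] * 5

# Document-term matrix, mirroring lda_max_features
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
doc_term = vectorizer.fit_transform(clauses)

lda = LatentDirichletAllocation(
    n_components=3,        # n_topics; the guide uses 7
    doc_topic_prior=0.1,   # alpha: lower = fewer topics per clause
    topic_word_prior=0.01, # beta: lower = sharper topics
    max_iter=20,
    learning_method="batch",
    random_state=42,
)
topic_probs = lda.fit_transform(doc_term)  # rows are per-clause distributions
print(topic_probs.shape)  # (15, 3)
```

Raising `doc_topic_prior` toward 1.0 and refitting visibly flattens each row of `topic_probs`, which is a quick way to sanity-check a tuning change.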
---
## Troubleshooting
### **Error: "Import 'TopicModelingRiskDiscovery' not found"**
**Solution:** Ensure `risk_discovery_alternatives.py` is in the same directory.
### **Warning: "LDA did not converge"**
**Solution:** Increase `lda_max_iter` in config.py:
```python
lda_max_iter: int = 30 # or 40
```
### **Topics are too similar/overlapping**
**Solution:** Lower the priors for sharper topics:
```python
lda_doc_topic_prior: float = 0.05 # More focused
lda_topic_word_prior: float = 0.005 # Sharper
```
### **Need faster training**
**Solution:** Switch to online learning:
```python
lda_learning_method: str = 'online'
```
---
## References
### **LDA Theory:**
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. JMLR, 3, 993-1022.
### **LDA for Legal Text:**
- Katz, D. M., et al. (2011). Quantitative analysis of the law using text analytics.
- Ashley, K. D. (2017). Artificial Intelligence and Legal Analytics.
### **Comparison Results:**
- See `risk_discovery_comparison_report.txt` for full analysis
- See `risk_discovery_comparison_results.json` for raw data
---
## Migration Complete
The codebase now uses **LDA as the default risk discovery method**, providing:
1. **Better Balance** - 0.718 vs 0.481 (K-Means)
2. **Overlapping Categories** - Clauses can belong to multiple risk types
3. **Probability Distributions** - Confidence scores for assignments
4. **Proven Quality** - Best performer in the comparison study
5. **Backward Compatible** - Can switch back to K-Means anytime
**Next Steps:**
1. Run `python3 train.py` to train with LDA
2. Monitor discovered topics in output
3. Adjust LDA parameters if needed (see tuning guide above)
4. Compare results with previous K-Means baseline
---
**Questions?** Check the comparison report or review the code comments in `risk_discovery.py` for detailed explanations.