# 🎯 LDA Risk Discovery Migration Guide
## Overview
The codebase has been migrated to use **LDA (Latent Dirichlet Allocation)** as the primary risk discovery method, replacing K-Means clustering. The change is based on comparison results showing LDA's stronger overall performance for legal contract risk analysis.
---
## 📊 Why LDA?
Based on comparison results from `risk_discovery_comparison_report.txt`:
### **LDA Performance:**
- ✅ **Best Balance Score: 0.718** (highest among all methods)
- ✅ **Quality Metrics:** Perplexity: 1186.4, Topic Diversity: 6.3
- ✅ **Even Distribution:** 1,146-3,426 clauses per pattern
- ✅ **Interpretable Topics:** Clear themes (Party/Agreement, IP, Compliance)
### **LDA Advantages for Legal Text:**
1. **Overlapping Categories** - Clauses can belong to multiple risk types
2. **Probability Distributions** - Know confidence of risk assignments
3. **Better Balance** - More even distribution across discovered patterns
4. **Interpretability** - Clear topic-word distributions
5. **Proven for Legal Text** - Widely used in contract analysis
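Points 1 and 2 can be seen concretely: each clause gets a probability vector over topics rather than a single cluster id. A minimal sketch with hypothetical numbers (the vector below is illustrative, not real model output):

```python
import numpy as np

# Hypothetical topic distribution for one clause (7 topics).
# Under LDA a clause is a mixture: here it is mostly topic 0
# but also partly topic 2 -- something hard clustering cannot express.
topic_probs = np.array([0.55, 0.05, 0.25, 0.05, 0.04, 0.03, 0.03])

assert np.isclose(topic_probs.sum(), 1.0)    # a valid probability distribution
primary = int(np.argmax(topic_probs))        # hard label, if one is needed
confidence = topic_probs[primary]            # confidence of that assignment
print(primary, round(float(confidence), 2))  # -> 0 0.55
```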
---
## 🔧 Changes Made
### 1. **config.py** - Added LDA Configuration
**New Parameters:**
```python
# Risk discovery method selection
risk_discovery_method: str = "lda" # Options: 'lda', 'kmeans', 'hierarchical', etc.
# LDA-specific parameters
lda_doc_topic_prior: float = 0.1 # Alpha - document-topic density
lda_topic_word_prior: float = 0.01 # Beta - topic-word density
lda_max_iter: int = 20 # Maximum LDA training iterations
lda_max_features: int = 5000 # Vocabulary size for LDA
lda_learning_method: str = 'batch' # 'batch' or 'online'
```
**Key Settings:**
- `doc_topic_prior (α)`: Lower values (0.1) = documents focus on fewer topics
- `topic_word_prior (β)`: Lower values (0.01) = topics have fewer dominant words
- `learning_method`: 'batch' for better quality, 'online' for speed
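Assuming the implementation is backed by scikit-learn's `LatentDirichletAllocation` (a common choice, though the guide does not state it explicitly), the config parameters above would map onto the estimator roughly like this:

```python
# Sketch of how the config parameters map onto scikit-learn's LDA estimator.
# The mapping is an assumption; parameter names on the right are sklearn's own.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=5000)  # lda_max_features
lda = LatentDirichletAllocation(
    n_components=7,            # number of risk topics to discover
    doc_topic_prior=0.1,       # alpha: lda_doc_topic_prior
    topic_word_prior=0.01,     # beta: lda_topic_word_prior
    max_iter=20,               # lda_max_iter
    learning_method="batch",   # lda_learning_method
    random_state=42,
)
```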
### 2. **risk_discovery.py** - Added LDARiskDiscovery Class
**New Class:**
```python
class LDARiskDiscovery:
"""
LDA-based risk discovery with compatible interface.
Wraps TopicModelingRiskDiscovery from alternatives.
"""
```
**Key Features:**
- Compatible interface with `UnsupervisedRiskDiscovery`
- Wraps `TopicModelingRiskDiscovery` from `risk_discovery_alternatives.py`
- Provides same methods: `discover_risk_patterns()`, `get_risk_labels()`, `get_discovered_risk_names()`
- **Extra method:** `get_topic_distribution()` - returns probability distribution over all topics
### 3. **trainer.py** - Dynamic Method Selection
**Updated Initialization:**
```python
def __init__(self, config: LegalBertConfig):
# Dynamically select risk discovery method
risk_method = config.risk_discovery_method.lower()
if risk_method == 'lda':
self.risk_discovery = LDARiskDiscovery(...)
elif risk_method == 'kmeans':
self.risk_discovery = UnsupervisedRiskDiscovery(...)
else:
# Default to LDA
self.risk_discovery = LDARiskDiscovery(...)
```
### 4. **evaluator.py** - Already Compatible
No changes needed! The evaluator uses `self.risk_discovery.discovered_patterns` which both LDA and K-Means provide.
---
## 🚀 Usage
### **Option 1: Use Default LDA Settings (Recommended)**
```bash
# Train with LDA (default)
python3 train.py
# Evaluate with LDA
python3 evaluate.py --checkpoint checkpoints/best_model.pt
```
### **Option 2: Customize LDA Parameters**
Edit `config.py`:
```python
# Fine-tune for your dataset
lda_doc_topic_prior: float = 0.05 # More focused topics
lda_topic_word_prior: float = 0.005 # Sharper topic definitions
lda_max_iter: int = 30 # Better convergence
```
### **Option 3: Switch Back to K-Means**
Edit `config.py`:
```python
risk_discovery_method: str = "kmeans" # Change from "lda"
```
---
## 📈 Expected Output
### **During Training:**
```
🎯 Using LDA (Topic Modeling) for risk discovery
🔍 Discovering risk patterns using LDA (n_topics=7)...
📊 LDA provides balanced, overlapping risk categories
🎯 Best for legal text with multi-faceted risks
📊 Creating document-term matrix...
🧠 Fitting LDA model...
📋 Analyzing topics and naming patterns...
✅ LDA discovery complete: 7 risk topics found
🔍 Discovered Risk Patterns:
• Topic_PARTY_AGREEMENT
  Keywords: party, agreement, shall, company, consent
• Topic_INTELLECTUAL_PROPERTY
  Keywords: shall, product, products, agreement, section
• Topic_COMPLIANCE
  Keywords: shall, agreement, laws, state, governed
...
```
### **Key Differences from K-Means:**
| Aspect | K-Means (Old) | LDA (New) |
|--------|--------------|-----------|
| Pattern Names | `low_risk_obligation_pattern` | `Topic_PARTY_AGREEMENT` |
| Assignment | Hard (one cluster) | Soft (probability distribution) |
| Balance | 0.481 | **0.718** ✅ |
| Overlapping | No | **Yes** ✅ |
| Interpretability | Good | **Better** ✅ |
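The "soft assignment" row is the practical difference: with a probability threshold, one clause can carry several risk labels, which a hard K-Means assignment cannot express. A sketch with hypothetical probabilities (topic names taken from the example output above):

```python
# Hard vs. soft assignment for one clause. The probabilities are
# hypothetical, illustrating a clause that mixes IP and compliance language.
import numpy as np

topics = ["Topic_PARTY_AGREEMENT", "Topic_INTELLECTUAL_PROPERTY", "Topic_COMPLIANCE"]
probs = np.array([0.10, 0.48, 0.42])

hard_label = topics[int(np.argmax(probs))]                     # K-Means-style: one label
soft_labels = [t for t, p in zip(topics, probs) if p >= 0.30]  # LDA-style: all likely risks

print(hard_label)   # Topic_INTELLECTUAL_PROPERTY
print(soft_labels)  # ['Topic_INTELLECTUAL_PROPERTY', 'Topic_COMPLIANCE']
```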
---
## 🔍 Verification
### **1. Check Risk Discovery Method:**
```bash
python3 -c "from config import LegalBertConfig; c = LegalBertConfig(); print(f'Method: {c.risk_discovery_method}')"
# Expected: Method: lda
```
### **2. Test LDA Discovery:**
```python
from config import LegalBertConfig
from trainer import LegalBertTrainer
config = LegalBertConfig()
trainer = LegalBertTrainer(config)
# Should print: "🎯 Using LDA (Topic Modeling) for risk discovery"
```
### **3. Verify Topic Distribution (LDA-specific feature):**
```python
# Get probability distribution over all topics
clauses = ["Sample clause text..."]
topic_probs = trainer.risk_discovery.get_topic_distribution(clauses)
print(f"Topic distribution shape: {topic_probs.shape}")
# Expected: (1, 7) - probabilities for each of 7 topics
```
---
## 🎛️ LDA Parameter Tuning Guide
### **Document-Topic Prior (α / doc_topic_prior)**
Controls how many topics each document covers:
- **Lower (0.01-0.1)**: Documents focus on 1-2 topics → More decisive assignments
- **Higher (0.5-1.0)**: Documents spread across many topics → More mixed assignments
**Recommended:** `0.1` (current setting) - Good for legal clauses with focused risks
### **Topic-Word Prior (β / topic_word_prior)**
Controls how many words define each topic:
- **Lower (0.001-0.01)**: Topics defined by fewer words → Sharper topics
- **Higher (0.1-0.5)**: Topics use more words → Broader topics
**Recommended:** `0.01` (current setting) - Clear topic definitions
### **Max Iterations**
- **10-20**: Fast, may not fully converge
- **20-30**: **Recommended** - Good balance
- **50+**: Better quality, slower training
### **Learning Method**
- **'batch'** (current): Better quality, uses full dataset per iteration
- **'online'**: Faster, good for very large datasets (>100K clauses)
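A quick way to see the α effect on your own data is to fit LDA twice and compare how concentrated the document-topic distributions come out. The toy corpus below is a stand-in, assuming a scikit-learn backend; swap in real clauses:

```python
# Compare document focus at two alpha settings: a lower doc_topic_prior
# should push each document's probability mass toward fewer topics.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: three clearly distinct clause "themes", repeated
clauses = [
    "party shall obtain prior written consent under this agreement",
    "licensee retains all intellectual property rights in the products",
    "this agreement shall be governed by the laws of the state",
] * 10

X = CountVectorizer().fit_transform(clauses)
results = {}
for alpha in (0.05, 1.0):
    lda = LatentDirichletAllocation(
        n_components=3, doc_topic_prior=alpha,
        learning_method="batch", max_iter=20, random_state=42,
    )
    probs = lda.fit_transform(X)
    # Mean top-topic probability: closer to 1.0 = more focused documents
    results[alpha] = probs.max(axis=1).mean()
    print(f"alpha={alpha}: mean max topic prob = {results[alpha]:.2f}")
```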
---
## πŸ› Troubleshooting
### **Error: "Import 'TopicModelingRiskDiscovery' not found"**
**Solution:** Ensure `risk_discovery_alternatives.py` is in the same directory.
### **Warning: "LDA did not converge"**
**Solution:** Increase `lda_max_iter` in config.py:
```python
lda_max_iter: int = 30 # or 40
```
### **Topics are too similar/overlapping**
**Solution:** Lower the priors for sharper topics:
```python
lda_doc_topic_prior: float = 0.05 # More focused
lda_topic_word_prior: float = 0.005 # Sharper
```
### **Need faster training**
**Solution:** Switch to online learning:
```python
lda_learning_method: str = 'online'
```
---
## 📚 References
### **LDA Theory:**
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. JMLR.
### **LDA for Legal Text:**
- Katz, D. M., et al. (2011). Quantitative analysis of the law using text analytics.
- Ashley, K. D. (2017). Artificial Intelligence and Legal Analytics.
### **Comparison Results:**
- See `risk_discovery_comparison_report.txt` for full analysis
- See `risk_discovery_comparison_results.json` for raw data
---
## ✅ Migration Complete
The codebase now uses **LDA as the default risk discovery method**, providing:
1. ✅ **Better Balance** - 0.718 vs 0.481 (K-Means)
2. ✅ **Overlapping Categories** - Clauses can belong to multiple risk types
3. ✅ **Probability Distributions** - Confidence scores for assignments
4. ✅ **Proven Quality** - Best performer in comparison study
5. ✅ **Backward Compatible** - Can switch back to K-Means anytime
**Next Steps:**
1. Run `python3 train.py` to train with LDA
2. Monitor discovered topics in output
3. Adjust LDA parameters if needed (see tuning guide above)
4. Compare results with previous K-Means baseline
---
**Questions?** Check the comparison report or review the code comments in `risk_discovery.py` for detailed explanations.