# 🎯 LDA Risk Discovery Migration Guide
## Overview
The codebase has been successfully migrated to use LDA (Latent Dirichlet Allocation) as the primary risk discovery method, replacing K-Means clustering. This change was made based on comprehensive comparison results showing LDA's superior performance for legal contract risk analysis.
## 📊 Why LDA?

Based on comparison results from `risk_discovery_comparison_report.txt`:

**LDA Performance:**
- ✅ **Best Balance Score**: 0.718 (highest among all methods)
- ✅ **Quality Metrics**: Perplexity 1186.4, Topic Diversity 6.3
- ✅ **Even Distribution**: 1,146-3,426 clauses per pattern
- ✅ **Interpretable Topics**: Clear themes (Party/Agreement, IP, Compliance)

**LDA Advantages for Legal Text:**
- **Overlapping Categories** - Clauses can belong to multiple risk types
- **Probability Distributions** - Know the confidence of risk assignments
- **Better Balance** - More even distribution across discovered patterns
- **Interpretability** - Clear topic-word distributions
- **Proven for Legal Text** - Widely used in contract analysis
## 🔧 Changes Made
### 1. `config.py` - Added LDA Configuration

**New Parameters:**

```python
# Risk discovery method selection
risk_discovery_method: str = "lda"   # Options: 'lda', 'kmeans', 'hierarchical', etc.

# LDA-specific parameters
lda_doc_topic_prior: float = 0.1     # Alpha - document-topic density
lda_topic_word_prior: float = 0.01   # Beta - topic-word density
lda_max_iter: int = 20               # Maximum LDA training iterations
lda_max_features: int = 5000         # Vocabulary size for LDA
lda_learning_method: str = 'batch'   # 'batch' or 'online'
```

**Key Settings:**
- `doc_topic_prior` (α): lower values (0.1) = documents focus on fewer topics
- `topic_word_prior` (β): lower values (0.01) = topics have fewer dominant words
- `learning_method`: `'batch'` for better quality, `'online'` for speed
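These fields look like they map onto scikit-learn's `LatentDirichletAllocation`; assuming that backend (the repo's actual wrapper may differ), a minimal end-to-end sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

clauses = [
    "The party shall indemnify the company against all claims.",
    "This agreement is governed by the laws of the state.",
    "All intellectual property rights remain with the licensor.",
]

# Document-term matrix; lda_max_features caps the vocabulary size
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
dtm = vectorizer.fit_transform(clauses)

# The config fields map onto sklearn's constructor arguments
lda = LatentDirichletAllocation(
    n_components=7,           # number of risk topics
    doc_topic_prior=0.1,      # alpha: lda_doc_topic_prior
    topic_word_prior=0.01,    # beta: lda_topic_word_prior
    max_iter=20,              # lda_max_iter
    learning_method="batch",  # lda_learning_method
    random_state=42,
)
topic_probs = lda.fit_transform(dtm)
print(topic_probs.shape)  # (3, 7) - one probability row per clause
```

Each row of `topic_probs` is a normalized distribution over the 7 topics, which is exactly what makes the soft assignments below possible.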
### 2. `risk_discovery.py` - Added `LDARiskDiscovery` Class

**New Class:**

```python
class LDARiskDiscovery:
    """
    LDA-based risk discovery with compatible interface.
    Wraps TopicModelingRiskDiscovery from alternatives.
    """
```

**Key Features:**
- Compatible interface with `UnsupervisedRiskDiscovery`
- Wraps `TopicModelingRiskDiscovery` from `risk_discovery_alternatives.py`
- Provides the same methods: `discover_risk_patterns()`, `get_risk_labels()`, `get_discovered_risk_names()`
- Extra method: `get_topic_distribution()` - returns a probability distribution over all topics
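A stripped-down sketch of what such a wrapper can look like (the class name and body here are illustrative, not the repo's actual code; only the method names mirror the interface described above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

class LDARiskDiscoverySketch:
    """Illustrative wrapper mirroring the interface described above."""

    def __init__(self, n_topics: int = 7):
        self.vectorizer = CountVectorizer(stop_words="english")
        self.lda = LatentDirichletAllocation(
            n_components=n_topics,
            doc_topic_prior=0.1,
            topic_word_prior=0.01,
            random_state=0,
        )

    def discover_risk_patterns(self, clauses):
        dtm = self.vectorizer.fit_transform(clauses)
        self.doc_topics = self.lda.fit_transform(dtm)

    def get_risk_labels(self):
        # Hard labels: the most probable topic per clause
        return self.doc_topics.argmax(axis=1)

    def get_topic_distribution(self, clauses):
        # Soft assignment: full probability distribution per clause
        return self.lda.transform(self.vectorizer.transform(clauses))
```

The extra `get_topic_distribution()` is what enables the soft, overlapping assignments that hard K-Means labels cannot express.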
### 3. `trainer.py` - Dynamic Method Selection

**Updated Initialization:**

```python
def __init__(self, config: LegalBertConfig):
    # Dynamically select the risk discovery method
    risk_method = config.risk_discovery_method.lower()
    if risk_method == 'lda':
        self.risk_discovery = LDARiskDiscovery(...)
    elif risk_method == 'kmeans':
        self.risk_discovery = UnsupervisedRiskDiscovery(...)
    else:
        # Default to LDA
        self.risk_discovery = LDARiskDiscovery(...)
```
### 4. `evaluator.py` - Already Compatible

No changes needed: the evaluator uses `self.risk_discovery.discovered_patterns`, which both the LDA and K-Means implementations provide.
## 🚀 Usage

### Option 1: Use Default LDA Settings (Recommended)

```bash
# Train with LDA (default)
python3 train.py

# Evaluate with LDA
python3 evaluate.py --checkpoint checkpoints/best_model.pt
```

### Option 2: Customize LDA Parameters

Edit `config.py`:

```python
# Fine-tune for your dataset
lda_doc_topic_prior: float = 0.05    # More focused topics
lda_topic_word_prior: float = 0.005  # Sharper topic definitions
lda_max_iter: int = 30               # Better convergence
```

### Option 3: Switch Back to K-Means

Edit `config.py`:

```python
risk_discovery_method: str = "kmeans"  # Change from "lda"
```
## 📊 Expected Output

**During Training:**

```text
🎯 Using LDA (Topic Modeling) for risk discovery
📊 Discovering risk patterns using LDA (n_topics=7)...
   LDA provides balanced, overlapping risk categories
   Best for legal text with multi-faceted risks
📊 Creating document-term matrix...
🔧 Fitting LDA model...
📊 Analyzing topics and naming patterns...
✅ LDA discovery complete: 7 risk topics found

📊 Discovered Risk Patterns:
   • Topic_PARTY_AGREEMENT
     Keywords: party, agreement, shall, company, consent
   • Topic_INTELLECTUAL_PROPERTY
     Keywords: shall, product, products, agreement, section
   • Topic_COMPLIANCE
     Keywords: shall, agreement, laws, state, governed
   ...
```
**Key Differences from K-Means:**

| Aspect | K-Means (Old) | LDA (New) |
|---|---|---|
| Pattern Names | `low_risk_obligation_pattern` | `Topic_PARTY_AGREEMENT` |
| Assignment | Hard (one cluster) | Soft (probability distribution) |
| Balance Score | 0.481 | 0.718 ✅ |
| Overlapping | No | Yes ✅ |
| Interpretability | Good | Better ✅ |
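The hard-versus-soft distinction in the table can be made concrete with a few lines of NumPy (the probabilities below are made up for illustration):

```python
import numpy as np

# Hypothetical topic distribution for one clause over 7 discovered topics
topic_probs = np.array([0.05, 0.55, 0.05, 0.20, 0.05, 0.05, 0.05])

# K-Means-style hard assignment: a single cluster id, no confidence
hard_label = int(topic_probs.argmax())
print(hard_label)  # 1

# LDA-style soft assignment: every topic keeps a probability, so this
# clause can meaningfully belong to both topic 1 and topic 3
soft = {i: float(p) for i, p in enumerate(topic_probs) if p >= 0.15}
print(soft)  # {1: 0.55, 3: 0.2}
```

The 0.15 cutoff is arbitrary here; any downstream consumer can pick its own threshold, or use the full distribution directly.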
## 🔍 Verification

**1. Check the risk discovery method:**

```bash
python3 -c "from config import LegalBertConfig; c = LegalBertConfig(); print(f'Method: {c.risk_discovery_method}')"
# Expected: Method: lda
```

**2. Test LDA discovery:**

```python
from config import LegalBertConfig
from trainer import LegalBertTrainer

config = LegalBertConfig()
trainer = LegalBertTrainer(config)
# Should print: "🎯 Using LDA (Topic Modeling) for risk discovery"
```

**3. Verify topic distribution (LDA-specific feature):**

```python
# Get the probability distribution over all topics
clauses = ["Sample clause text..."]
topic_probs = trainer.risk_discovery.get_topic_distribution(clauses)
print(f"Topic distribution shape: {topic_probs.shape}")
# Expected: (1, 7) - probabilities for each of the 7 topics
```
## 🎛️ LDA Parameter Tuning Guide

### Document-Topic Prior (α / `doc_topic_prior`)

Controls how many topics each document covers:

- **Lower (0.01-0.1)**: Documents focus on 1-2 topics → more decisive assignments
- **Higher (0.5-1.0)**: Documents spread across many topics → more mixed assignments

**Recommended:** 0.1 (current setting) - good for legal clauses with focused risks
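The effect of α can be seen directly by sampling from the Dirichlet prior itself, with no LDA fitting involved (a NumPy illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 7

# Draw 1,000 document-topic distributions under each prior
focused = rng.dirichlet([0.1] * n_topics, size=1000)  # low alpha
mixed = rng.dirichlet([1.0] * n_topics, size=1000)    # high alpha

# With alpha=0.1, most of each document's mass lands on a single topic;
# with alpha=1.0, the mass spreads across many topics
print(f"mean top-topic weight, alpha=0.1: {focused.max(axis=1).mean():.2f}")
print(f"mean top-topic weight, alpha=1.0: {mixed.max(axis=1).mean():.2f}")
```

The low-α draws put most of their weight on one topic, which is why α = 0.1 yields the "decisive assignments" described above.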
### Topic-Word Prior (β / `topic_word_prior`)

Controls how many words define each topic:

- **Lower (0.001-0.01)**: Topics defined by fewer words → sharper topics
- **Higher (0.1-0.5)**: Topics use more words → broader topics

**Recommended:** 0.01 (current setting) - clear topic definitions

### Max Iterations

- **10-20**: Fast, may not fully converge
- **20-30**: **Recommended** - good balance
- **50+**: Better quality, slower training

### Learning Method

- **`'batch'` (current)**: Better quality, uses the full dataset per iteration
- **`'online'`**: Faster, good for very large datasets (>100K clauses)
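With scikit-learn as the assumed backend, `'online'` learning corresponds to mini-batch variational updates, which can also be driven explicitly via `partial_fit` - a sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for a large clause collection
corpus = [f"clause {i} shall govern this agreement between the parties"
          for i in range(100)]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(
    n_components=5,
    learning_method="online",
    batch_size=25,
    random_state=0,
)

# Stream the corpus in mini-batches instead of one full-batch pass
for start in range(0, dtm.shape[0], 25):
    lda.partial_fit(dtm[start:start + 25])

print(lda.transform(dtm[:1]).shape)  # (1, 5)
```

Because only one mini-batch is held per update, this scales to corpora that do not fit comfortably in a single batch pass.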
## 🐛 Troubleshooting

### Error: "Import 'TopicModelingRiskDiscovery' not found"

**Solution:** Ensure `risk_discovery_alternatives.py` is in the same directory.

### Warning: "LDA did not converge"

**Solution:** Increase `lda_max_iter` in `config.py`:

```python
lda_max_iter: int = 30  # or 40
```

### Topics are too similar/overlapping

**Solution:** Lower the priors for sharper topics:

```python
lda_doc_topic_prior: float = 0.05   # More focused
lda_topic_word_prior: float = 0.005 # Sharper
```

### Need faster training

**Solution:** Switch to online learning:

```python
lda_learning_method: str = 'online'
```
## 📚 References

**LDA Theory:**
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. *JMLR*.

**LDA for Legal Text:**
- Katz, D. M., et al. (2011). Quantitative analysis of the law using text analytics.
- Ashley, K. D. (2017). *Artificial Intelligence and Legal Analytics*.

**Comparison Results:**
- See `risk_discovery_comparison_report.txt` for the full analysis
- See `risk_discovery_comparison_results.json` for the raw data
## ✅ Migration Complete

The codebase now uses LDA as the default risk discovery method, providing:

- ✅ **Better Balance** - 0.718 vs 0.481 (K-Means)
- ✅ **Overlapping Categories** - Clauses can belong to multiple risk types
- ✅ **Probability Distributions** - Confidence scores for assignments
- ✅ **Proven Quality** - Best performer in the comparison study
- ✅ **Backward Compatible** - Can switch back to K-Means anytime

**Next Steps:**
1. Run `python3 train.py` to train with LDA
2. Monitor the discovered topics in the output
3. Adjust LDA parameters if needed (see the tuning guide above)
4. Compare results with the previous K-Means baseline

**Questions?** Check the comparison report or review the code comments in `risk_discovery.py` for detailed explanations.