
🎯 LDA Risk Discovery Migration Guide

Overview

The codebase now uses LDA (Latent Dirichlet Allocation) as the primary risk discovery method, replacing K-Means clustering. The change follows the method comparison summarized below, in which LDA achieved the best balance score for legal contract risk analysis.


📊 Why LDA?

Based on comparison results from risk_discovery_comparison_report.txt:

LDA Performance:

  • ✅ Best Balance Score: 0.718 (highest among all methods)
  • ✅ Quality Metrics: Perplexity: 1186.4, Topic Diversity: 6.3
  • ✅ Even Distribution: 1,146-3,426 clauses per pattern
  • ✅ Interpretable Topics: Clear themes (Party/Agreement, IP, Compliance)

LDA Advantages for Legal Text:

  1. Overlapping Categories - Clauses can belong to multiple risk types
  2. Probability Distributions - Know confidence of risk assignments
  3. Better Balance - More even distribution across discovered patterns
  4. Interpretability - Clear topic-word distributions
  5. Proven for Legal Text - Widely used in contract analysis
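
The difference between a hard assignment and an overlapping, probability-based one can be illustrated with a small sketch. The topic names are taken from the sample output in this guide, but the probabilities and the 0.25 threshold are invented for illustration and are not values used by the codebase.

```python
# Illustrative only: the probabilities and the threshold are made up.
topic_names = ["Topic_PARTY_AGREEMENT", "Topic_INTELLECTUAL_PROPERTY", "Topic_COMPLIANCE"]
clause_topic_probs = [0.55, 0.35, 0.10]  # one clause's distribution over topics


def hard_label(probs, names):
    """Return the single most probable topic (what a hard clustering would report)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return names[best]


def soft_labels(probs, names, threshold=0.25):
    """Return every (topic, probability) pair that meets the assumed threshold."""
    return [(name, p) for name, p in zip(names, probs) if p >= threshold]


print(hard_label(clause_topic_probs, topic_names))
print(soft_labels(clause_topic_probs, topic_names))
```

With a soft assignment the same clause can be reported under both the party/agreement and the intellectual property topics, and the probabilities serve as a rough confidence signal.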

🔧 Changes Made

1. config.py - Added LDA Configuration

New Parameters:

# Risk discovery method selection
risk_discovery_method: str = "lda"  # Options: 'lda', 'kmeans', 'hierarchical', etc.

# LDA-specific parameters
lda_doc_topic_prior: float = 0.1      # Alpha - document-topic density
lda_topic_word_prior: float = 0.01    # Beta - topic-word density  
lda_max_iter: int = 20                # Maximum LDA training iterations
lda_max_features: int = 5000          # Vocabulary size for LDA
lda_learning_method: str = 'batch'    # 'batch' or 'online'

Key Settings:

  • doc_topic_prior (α): Lower values (0.1) = documents focus on fewer topics
  • topic_word_prior (β): Lower values (0.01) = topics have fewer dominant words
  • learning_method: 'batch' for better quality, 'online' for speed
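
Because these values are edited by hand, it may be worth validating them early. The following is a minimal sketch, assuming a dataclass-style config; `LdaSettings` is a hypothetical stand-in for the LDA fields of `LegalBertConfig`, and the validation rules are assumptions rather than checks the codebase is known to perform.

```python
from dataclasses import dataclass


@dataclass
class LdaSettings:  # hypothetical stand-in for the LDA fields of LegalBertConfig
    lda_doc_topic_prior: float = 0.1
    lda_topic_word_prior: float = 0.01
    lda_max_iter: int = 20
    lda_max_features: int = 5000
    lda_learning_method: str = "batch"

    def __post_init__(self):
        # Dirichlet priors must be strictly positive.
        if self.lda_doc_topic_prior <= 0 or self.lda_topic_word_prior <= 0:
            raise ValueError("LDA priors must be > 0")
        if self.lda_max_iter < 1 or self.lda_max_features < 1:
            raise ValueError("lda_max_iter and lda_max_features must be >= 1")
        if self.lda_learning_method not in ("batch", "online"):
            raise ValueError("lda_learning_method must be 'batch' or 'online'")


settings = LdaSettings()                         # defaults listed in this guide
focused = LdaSettings(lda_doc_topic_prior=0.05)  # a more focused variant
print(settings)
```

A typo such as `lda_learning_method="onlin"` would then fail at construction time instead of partway through training.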

2. risk_discovery.py - Added LDARiskDiscovery Class

New Class:

class LDARiskDiscovery:
    """
    LDA-based risk discovery with compatible interface.
    Wraps TopicModelingRiskDiscovery from alternatives.
    """

Key Features:

  • Compatible interface with UnsupervisedRiskDiscovery
  • Wraps TopicModelingRiskDiscovery from risk_discovery_alternatives.py
  • Provides same methods: discover_risk_patterns(), get_risk_labels(), get_discovered_risk_names()
  • Extra method: get_topic_distribution() - returns probability distribution over all topics
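
The shared interface can be written down as a structural type. This is only a sketch: the method names and the `discovered_patterns` attribute come from this guide, while the `RiskDiscovery` protocol, the argument and return types, and the `DummyDiscovery` class are hypothetical.

```python
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class RiskDiscovery(Protocol):
    """Hypothetical protocol; argument and return types are assumptions."""

    discovered_patterns: Any

    def discover_risk_patterns(self, clauses: List[str]) -> Any: ...

    def get_risk_labels(self, clauses: List[str]) -> Any: ...

    def get_discovered_risk_names(self) -> List[str]: ...


class DummyDiscovery:
    """Toy implementation used only to demonstrate the structural check."""

    def __init__(self):
        self.discovered_patterns = {}

    def discover_risk_patterns(self, clauses):
        self.discovered_patterns = {"Topic_0": {"clause_count": len(clauses)}}
        return self.discovered_patterns

    def get_risk_labels(self, clauses):
        return [0 for _ in clauses]

    def get_discovered_risk_names(self):
        return list(self.discovered_patterns)


# isinstance() only checks that the members exist, not their signatures.
print(isinstance(DummyDiscovery(), RiskDiscovery))
```

Any class that exposes these members passes the check, which is the sense in which the LDA wrapper can stand in for the K-Means implementation.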

3. trainer.py - Dynamic Method Selection

Updated Initialization:

def __init__(self, config: LegalBertConfig):
    # Dynamically select risk discovery method
    risk_method = config.risk_discovery_method.lower()
    
    if risk_method == 'lda':
        self.risk_discovery = LDARiskDiscovery(...)
    elif risk_method == 'kmeans':
        self.risk_discovery = UnsupervisedRiskDiscovery(...)
    else:
        # Default to LDA
        self.risk_discovery = LDARiskDiscovery(...)
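
The same selection logic can be expressed as a lookup table, which also makes it possible to report a misspelled method name instead of silently falling back to LDA. This is a sketch, not the trainer's code: the two classes are empty placeholders (the real constructors take arguments that are elided above), and `DISCOVERY_METHODS`, `select_risk_discovery`, and the `strict` flag are hypothetical names.

```python
class LDARiskDiscovery:            # placeholder for the real class
    pass


class UnsupervisedRiskDiscovery:   # placeholder for the real K-Means class
    pass


DISCOVERY_METHODS = {
    "lda": LDARiskDiscovery,
    "kmeans": UnsupervisedRiskDiscovery,
}


def select_risk_discovery(method_name, strict=False):
    """Return the discovery class registered for method_name.

    With strict=False an unknown name falls back to LDA, like the else
    branch above; with strict=True it raises ValueError instead.
    """
    key = method_name.lower()
    if key in DISCOVERY_METHODS:
        return DISCOVERY_METHODS[key]
    if strict:
        raise ValueError(f"Unknown risk discovery method: {method_name!r}")
    return LDARiskDiscovery


print(select_risk_discovery("KMeans").__name__)
print(select_risk_discovery("hierarchical").__name__)
```

Whether a strict mode is desirable depends on how the other methods mentioned in config.py are meant to be handled, which this guide does not specify.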

4. evaluator.py - Already Compatible

No changes needed! The evaluator reads self.risk_discovery.discovered_patterns, which both LDARiskDiscovery and UnsupervisedRiskDiscovery expose.


🚀 Usage

Option 1: Use Default LDA Settings (Recommended)

# Train with LDA (default)
python3 train.py

# Evaluate with LDA
python3 evaluate.py --checkpoint checkpoints/best_model.pt

Option 2: Customize LDA Parameters

Edit config.py:

# Fine-tune for your dataset
lda_doc_topic_prior: float = 0.05      # More focused topics
lda_topic_word_prior: float = 0.005    # Sharper topic definitions
lda_max_iter: int = 30                 # Better convergence

Option 3: Switch Back to K-Means

Edit config.py:

risk_discovery_method: str = "kmeans"  # Change from "lda"

📈 Expected Output

During Training:

🎯 Using LDA (Topic Modeling) for risk discovery
🔍 Discovering risk patterns using LDA (n_topics=7)...
   📊 LDA provides balanced, overlapping risk categories
   🎯 Best for legal text with multi-faceted risks
  📊 Creating document-term matrix...
  🧠 Fitting LDA model...
  📋 Analyzing topics and naming patterns...
✅ LDA discovery complete: 7 risk topics found

🔍 Discovered Risk Patterns:
  • Topic_PARTY_AGREEMENT
    Keywords: party, agreement, shall, company, consent
  • Topic_INTELLECTUAL_PROPERTY
    Keywords: shall, product, products, agreement, section
  • Topic_COMPLIANCE
    Keywords: shall, agreement, laws, state, governed
  ...

Key Differences from K-Means:

| Aspect | K-Means (Old) | LDA (New) |
|---|---|---|
| Pattern Names | low_risk_obligation_pattern | Topic_PARTY_AGREEMENT |
| Assignment | Hard (one cluster) | Soft (probability distribution) |
| Balance | 0.481 | 0.718 ✅ |
| Overlapping | No | Yes ✅ |
| Interpretability | Good | Better ✅ |

🔍 Verification

1. Check Risk Discovery Method:

python3 -c "from config import LegalBertConfig; c = LegalBertConfig(); print(f'Method: {c.risk_discovery_method}')"
# Expected: Method: lda

2. Test LDA Discovery:

from config import LegalBertConfig
from trainer import LegalBertTrainer

config = LegalBertConfig()
trainer = LegalBertTrainer(config)

# Should print: "🎯 Using LDA (Topic Modeling) for risk discovery"

3. Verify Topic Distribution (LDA-specific feature):

# Get probability distribution over all topics
clauses = ["Sample clause text..."]
topic_probs = trainer.risk_discovery.get_topic_distribution(clauses)
print(f"Topic distribution shape: {topic_probs.shape}")
# Expected: (1, 7) - probabilities for each of 7 topics
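
Beyond the shape, it can be useful to confirm that each row is a valid probability distribution. The helper below is a sketch: `check_topic_distribution` is a hypothetical function, the example row is hand-written rather than real model output, and it assumes that get_topic_distribution() returns rows that sum to 1.

```python
import math


def check_topic_distribution(topic_probs, n_topics, tol=1e-6):
    """Return True if every row has n_topics non-negative entries summing to 1."""
    for row in topic_probs:
        if len(row) != n_topics:
            return False
        if any(p < 0 for p in row):
            return False
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            return False
    return True


# Hand-written stand-in for the output of get_topic_distribution()
example = [[0.70, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]]
print(check_topic_distribution(example, n_topics=7))
```

The helper iterates row by row, so it accepts nested lists and should also work on a 2-D array-like of shape (n_clauses, n_topics).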

🎛️ LDA Parameter Tuning Guide

Document-Topic Prior (α / doc_topic_prior)

Controls how many topics each document covers:

  • Lower (0.01-0.1): Documents focus on 1-2 topics → More decisive assignments
  • Higher (0.5-1.0): Documents spread across many topics → More mixed assignments

Recommended: 0.1 (current setting) - Good for legal clauses with focused risks
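
The effect of α can be seen by sampling topic mixtures directly from a symmetric Dirichlet prior. This is a sketch using only the standard library; the helper names, the sample count, and the seed are arbitrary choices, and it samples from the prior alone, so a fitted model's mixtures will also depend on the data.

```python
import random


def sample_dirichlet(alpha, n_topics, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_top_share(alpha, n_topics=7, n_samples=2000, seed=0):
    """Average probability mass of the dominant topic across sampled documents."""
    rng = random.Random(seed)
    top_shares = (
        max(sample_dirichlet(alpha, n_topics, rng)) for _ in range(n_samples)
    )
    return sum(top_shares) / n_samples


for alpha in (0.1, 1.0):
    share = mean_top_share(alpha)
    print(f"alpha={alpha}: mean share of dominant topic = {share:.2f}")
```

With the lower α, the dominant topic takes a much larger share of each sampled mixture, which is the behavior described as "more decisive assignments" above.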

Topic-Word Prior (β / topic_word_prior)

Controls how many words define each topic:

  • Lower (0.001-0.01): Topics defined by fewer words → Sharper topics
  • Higher (0.1-0.5): Topics use more words → Broader topics

Recommended: 0.01 (current setting) - Clear topic definitions

Max Iterations

  • 10-20: Fast, may not fully converge
  • 20-30: Recommended - Good balance
  • 50+: Better quality, slower training

Learning Method

  • 'batch' (current): Better quality, uses full dataset per iteration
  • 'online': Faster, good for very large datasets (>100K clauses)

🐛 Troubleshooting

Error: "Import 'TopicModelingRiskDiscovery' not found"

Solution: Ensure risk_discovery_alternatives.py is in the same directory.

Warning: "LDA did not converge"

Solution: Increase lda_max_iter in config.py:

lda_max_iter: int = 30  # or 40

Topics are too similar/overlapping

Solution: Lower the priors for sharper topics:

lda_doc_topic_prior: float = 0.05   # More focused
lda_topic_word_prior: float = 0.005  # Sharper

Need faster training

Solution: Switch to online learning:

lda_learning_method: str = 'online'

📚 References

LDA Theory:

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. JMLR.

LDA for Legal Text:

  • Katz, D. M., et al. (2011). Quantitative analysis of the law using text analytics.
  • Ashley, K. D. (2017). Artificial Intelligence and Legal Analytics.

Comparison Results:

  • See risk_discovery_comparison_report.txt for full analysis
  • See risk_discovery_comparison_results.json for raw data

✅ Migration Complete

The codebase now uses LDA as the default risk discovery method, providing:

  1. ✅ Better Balance - 0.718 vs 0.481 (K-Means)
  2. ✅ Overlapping Categories - Clauses can belong to multiple risk types
  3. ✅ Probability Distributions - Confidence scores for assignments
  4. ✅ Proven Quality - Best performer in comparison study
  5. ✅ Backward Compatible - Can switch back to K-Means anytime

Next Steps:

  1. Run python3 train.py to train with LDA
  2. Monitor discovered topics in output
  3. Adjust LDA parameters if needed (see tuning guide above)
  4. Compare results with previous K-Means baseline

Questions? Check the comparison report or review the code comments in risk_discovery.py for detailed explanations.