
📋 Quick Reference - LDA Risk Discovery

🎯 What Changed

OLD: K-Means Clustering (hardcoded in risk_discovery.py)
NEW: LDA Topic Modeling (configurable, default in config.py)

⚑ Quick Start

1. Verify Setup:

python3 test_lda_integration.py
# Expected: 4/4 tests passed βœ…

2. Train with LDA:

python3 train.py
# Look for: "🎯 Using LDA (Topic Modeling) for risk discovery"

3. Check Results:

# Review discovered topics in training output
# Topics will be named like: Topic_PARTY_AGREEMENT, Topic_INTELLECTUAL_PROPERTY

🎛️ Configuration

File: config.py

# Method Selection
risk_discovery_method: str = "lda"  # Options: 'lda', 'kmeans'

# LDA Parameters
lda_doc_topic_prior: float = 0.1      # α (alpha)
lda_topic_word_prior: float = 0.01    # β (beta)
lda_max_iter: int = 20                # Iterations
lda_max_features: int = 5000          # Vocabulary
lda_learning_method: str = 'batch'    # Algorithm
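
The parameter names above line up with the arguments of scikit-learn's `LatentDirichletAllocation` and `CountVectorizer`. The sketch below assumes that scikit-learn is the backend (the source does not confirm this) and that the settings live on a dataclass. `RiskDiscoveryConfig`, `lda_kwargs`, and `vectorizer_kwargs` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RiskDiscoveryConfig:
    """Hypothetical container mirroring the fields listed for config.py."""
    risk_discovery_method: str = "lda"   # 'lda' or 'kmeans'
    lda_doc_topic_prior: float = 0.1     # alpha
    lda_topic_word_prior: float = 0.01   # beta
    lda_max_iter: int = 20
    lda_max_features: int = 5000
    lda_learning_method: str = "batch"   # 'batch' or 'online'


def lda_kwargs(cfg: RiskDiscoveryConfig, n_topics: int = 7) -> dict:
    """Keyword arguments that could be passed to
    sklearn.decomposition.LatentDirichletAllocation (assumed backend)."""
    if cfg.lda_learning_method not in ("batch", "online"):
        raise ValueError(f"unknown learning method: {cfg.lda_learning_method!r}")
    return {
        "n_components": n_topics,
        "doc_topic_prior": cfg.lda_doc_topic_prior,
        "topic_word_prior": cfg.lda_topic_word_prior,
        "max_iter": cfg.lda_max_iter,
        "learning_method": cfg.lda_learning_method,
    }


def vectorizer_kwargs(cfg: RiskDiscoveryConfig) -> dict:
    """Keyword arguments that could be passed to
    sklearn.feature_extraction.text.CountVectorizer (assumed backend)."""
    return {"max_features": cfg.lda_max_features}


cfg = RiskDiscoveryConfig()
print(lda_kwargs(cfg))
print(vectorizer_kwargs(cfg))
```

With scikit-learn installed, the dictionaries could be unpacked as `LatentDirichletAllocation(**lda_kwargs(cfg))` and `CountVectorizer(**vectorizer_kwargs(cfg))`.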

🔄 Switch Methods

Use LDA (default):

risk_discovery_method: str = "lda"

Use K-Means (old method):

risk_discovery_method: str = "kmeans"

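📊 Performance Comparison

Method     Balance    Distribution     Overlapping
---------------------------------------------------
LDA        0.718      1,146-3,426      ✅ Yes
K-Means    0.481      436-9,163        ❌ No
---------------------------------------------------

Winner: LDA (balance score 49% higher than K-Means)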
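
The 49% figure is the relative difference between the two balance scores in the table, which can be checked directly:

```python
# Balance scores taken from the comparison table.
lda_balance = 0.718
kmeans_balance = 0.481

# Relative gain of LDA over K-Means.
relative_gain = (lda_balance - kmeans_balance) / kmeans_balance
print(f"{relative_gain:.1%}")  # 49.3%
```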


🛠️ Tuning Guide

More Focused Topics:

lda_doc_topic_prior: float = 0.05    # Lower = more focused
lda_topic_word_prior: float = 0.005   # Lower = sharper

Better Convergence:

lda_max_iter: int = 30  # More iterations

Faster Training (large datasets):

lda_learning_method: str = 'online'  # vs 'batch'

🔍 Code Changes Summary

1. config.py (8 lines added)

  • Added risk_discovery_method = "lda"
  • Added 5 LDA-specific parameters

2. risk_discovery.py (140 lines added)

  • Added LDARiskDiscovery class
  • Wraps TopicModelingRiskDiscovery
  • Compatible interface with existing code

3. trainer.py (25 lines modified)

  • Added import: LDARiskDiscovery
  • Added method selection logic
  • Instantiates LDA or K-Means based on config
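
The method selection described for trainer.py can be sketched as a lookup keyed on the config value. This is a minimal illustration, not the project's actual code: `Config`, `KMeansRiskDiscovery`, `build_risk_discovery`, and the empty class bodies are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical stand-in for the settings object in config.py."""
    risk_discovery_method: str = "lda"


class LDARiskDiscovery:
    """Stand-in for the class added to risk_discovery.py."""


class KMeansRiskDiscovery:
    """Hypothetical name for the original K-Means implementation."""


# Map each supported config value to the class that implements it.
_METHODS = {"lda": LDARiskDiscovery, "kmeans": KMeansRiskDiscovery}


def build_risk_discovery(cfg: Config):
    """Instantiate the discovery method named in the config."""
    method = cfg.risk_discovery_method
    if method not in _METHODS:
        raise ValueError(
            f"Unknown risk_discovery_method {method!r}; "
            f"expected one of {sorted(_METHODS)}"
        )
    return _METHODS[method]()


print(type(build_risk_discovery(Config())).__name__)          # LDARiskDiscovery
print(type(build_risk_discovery(Config("kmeans"))).__name__)  # KMeansRiskDiscovery
```

Failing loudly on an unknown value keeps a typo in config.py from silently falling back to one of the methods.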

4. evaluator.py (no changes)

  • Already compatible ✅

📚 Documentation Files

  1. doc/LDA_MIGRATION_GUIDE.md - Complete guide (480 lines)
  2. doc/LDA_INTEGRATION_COMPLETE.md - Summary (280 lines)
  3. doc/LDA_QUICK_REFERENCE.md - This file
  4. test_lda_integration.py - Verification (230 lines)

✅ Verification Checklist

  • Config has risk_discovery_method = "lda"
  • LDARiskDiscovery class exists
  • Trainer uses dynamic method selection
  • All tests pass (4/4)
  • Documentation complete

🚀 Usage Examples

Basic Training:

python3 train.py

With Custom Epochs:

python3 train.py --epochs 10

Evaluate Model:

python3 evaluate.py --checkpoint checkpoints/best_model.pt

Compare Methods:

python3 compare_risk_discovery.py --advanced

🎯 Expected Output

LDA Discovery:

🎯 Using LDA (Topic Modeling) for risk discovery
πŸ” Discovering risk patterns using LDA (n_topics=7)...
   πŸ“Š LDA provides balanced, overlapping risk categories
   🎯 Best for legal text with multi-faceted risks
  πŸ“Š Creating document-term matrix...
  🧠 Fitting LDA model...
βœ… LDA discovery complete: 7 risk topics found

πŸ” Discovered Risk Patterns:
  β€’ Topic_PARTY_AGREEMENT
    Keywords: party, agreement, shall, company, consent
  β€’ Topic_INTELLECTUAL_PROPERTY
    Keywords: shall, product, products, agreement, section
  β€’ Topic_COMPLIANCE
    Keywords: shall, agreement, laws, state, governed
  ...

🐛 Troubleshooting

Issue: "LDA did not converge"

Solution: Increase iterations in config.py

lda_max_iter: int = 30

Issue: Topics too similar

Solution: Lower priors for sharper topics

lda_doc_topic_prior: float = 0.05
lda_topic_word_prior: float = 0.005

Issue: Slow training

Solution: Use online learning

lda_learning_method: str = 'online'

Issue: Want K-Means back

Solution: Change method in config.py

risk_discovery_method: str = "kmeans"

💡 Key Insights

Why LDA Wins:

  1. Balance: 0.718 vs 0.481 for K-Means (49% higher)
  2. Overlapping: Clauses can belong to multiple topics
  3. Probabilities: Confidence scores for each assignment
  4. Interpretability: Clear topic themes for legal text
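
The difference between overlapping, probabilistic assignments and a single hard label can be illustrated with one clause's topic distribution. The probabilities and the 0.25 threshold below are made-up values for illustration, and `dominant_topic` and `topics_above` are hypothetical helpers.

```python
# Hypothetical topic distribution for a single clause (values sum to 1).
clause_topics = {
    "Topic_PARTY_AGREEMENT": 0.55,
    "Topic_INTELLECTUAL_PROPERTY": 0.30,
    "Topic_COMPLIANCE": 0.15,
}


def dominant_topic(dist):
    """Hard assignment: the single most probable topic."""
    return max(dist, key=dist.get)


def topics_above(dist, threshold=0.25):
    """Soft assignment: every topic at or above the threshold, most probable first."""
    selected = (topic for topic, prob in dist.items() if prob >= threshold)
    return sorted(selected, key=dist.get, reverse=True)


print(dominant_topic(clause_topics))
# Topic_PARTY_AGREEMENT
print(topics_above(clause_topics))
# ['Topic_PARTY_AGREEMENT', 'Topic_INTELLECTUAL_PROPERTY']
```

The probability attached to each selected topic doubles as a confidence score for that assignment.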

When to Use Each:

Use LDA when:

  • ✅ Need balanced risk categories
  • ✅ Clauses have multiple risk types
  • ✅ Want probability distributions
  • ✅ Need interpretable topics

Use K-Means when:

  • Hard cluster assignments needed
  • Speed is critical (K-Means is slightly faster)
  • Simple, clear boundaries preferred

📊 Comparison Data

From risk_discovery_comparison_report.txt:

Method                Balance    Patterns
----------------------------------------------
LDA (NEW DEFAULT)     0.718      7 (1,146-3,426)
Risk-o-meter          0.577      7 (534-4,363)
K-Means (OLD)         0.481      7 (436-9,163)
Hierarchical          0.362      7 (91-10,483)
Spectral              0.292      7 (11-13,702)
Mini-Batch            0.291      7 (2-13,785)
DBSCAN                1.000      1 (13,396)
----------------------------------------------

Clear Winner: LDA (DBSCAN's 1.000 is trivial, since it produced only a single pattern)


🎓 Learn More

LDA Theory:

  • Blei et al. (2003) - "Latent Dirichlet Allocation"
  • Probabilistic topic modeling for text
  • Documents = mixture of topics
  • Topics = distribution over words
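
Written out, the two statements above correspond to the standard smoothed LDA generative model, where α and β are the Dirichlet priors that the config exposes as doc_topic_prior and topic_word_prior:

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th word in document } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{observed word}
\end{align*}
```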

LDA for Legal:

  • Overlapping categories (clauses have multiple themes)
  • Interpretable (topic-word distributions)
  • Proven for contracts (literature validated)

Parameters:

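  • α (doc_topic_prior): Controls document focus

    • Lower (0.01-0.1) = more focused (each document concentrates on fewer topics)
    • Higher (0.5-1.0) = more mixed (each document spreads across many topics)
  • β (topic_word_prior): Controls topic specificity

    • Lower (0.001-0.01) = sharper topics (each topic concentrates on fewer words)
    • Higher (0.1-0.5) = broader topics (each topic spreads across many words)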
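
The effect of the prior can be seen by sampling from a symmetric Dirichlet directly. The sketch below uses only the standard library and is independent of the project code. `sample_dirichlet` and `mean_top_share` are hypothetical helpers, and 7 topics matches the n_topics value shown in the expected output.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics
    by normalizing independent Gamma(alpha, 1) draws."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0:  # retry in the rare case every draw underflows to 0
            return [d / total for d in draws]


def mean_top_share(alpha, k=7, n_docs=2000, seed=0):
    """Average weight of the largest topic across simulated documents."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs))
    return sum(shares) / n_docs


# A lower alpha puts more of each document's weight on a single topic.
for alpha in (0.05, 0.1, 1.0):
    print(f"alpha={alpha}: mean top-topic share = {mean_top_share(alpha):.2f}")
```

The same reasoning applies to β, with words in place of topics and topics in place of documents.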

✨ Benefits Summary

For Users:

βœ… Better balanced risk categories
βœ… More interpretable topic names
βœ… Probability scores for confidence
βœ… Proven superior in comparisons

For Developers:

βœ… Clean, compatible interface
βœ… Easy to switch methods
βœ… Well documented
βœ… Comprehensive tests

For Models:

βœ… Better training data balance
βœ… No class imbalance issues
βœ… Richer feature representation
βœ… Overlapping pattern support


📞 Support

Questions? See:

  • doc/LDA_MIGRATION_GUIDE.md - Full guide
  • doc/LDA_INTEGRATION_COMPLETE.md - Summary
  • risk_discovery_comparison_report.txt - Results

Test: python3 test_lda_integration.py

Train: python3 train.py


Status: ✅ READY TO USE
Default Method: LDA
Backup Method: K-Means (configurable)
Verified: 4/4 tests passing
Documented: Complete

🎉 Happy Training!