# 🎯 LDA Risk Discovery Migration Guide

## Overview

The codebase has been migrated to use **LDA (Latent Dirichlet Allocation)** as the primary risk discovery method, replacing K-Means clustering. This change is based on comparison results showing LDA's superior performance for legal contract risk analysis.

---

## Why LDA?

Based on the comparison results in `risk_discovery_comparison_report.txt`:

### **LDA Performance:**

- ✅ **Best Balance Score: 0.718** (highest among all methods)
- ✅ **Quality Metrics:** perplexity 1186.4, topic diversity 6.3
- ✅ **Even Distribution:** 1,146-3,426 clauses per pattern
- ✅ **Interpretable Topics:** Clear themes (Party/Agreement, IP, Compliance)

### **LDA Advantages for Legal Text:**

1. **Overlapping Categories** - Clauses can belong to multiple risk types
2. **Probability Distributions** - Each assignment comes with a confidence score
3. **Better Balance** - More even distribution across discovered patterns
4. **Interpretability** - Clear topic-word distributions
5. **Proven for Legal Text** - Widely used in contract analysis
---

## Changes Made

### 1. **config.py** - Added LDA Configuration

**New Parameters:**

```python
# Risk discovery method selection
risk_discovery_method: str = "lda"  # Options: 'lda', 'kmeans', 'hierarchical', etc.

# LDA-specific parameters
lda_doc_topic_prior: float = 0.1    # Alpha - document-topic density
lda_topic_word_prior: float = 0.01  # Beta - topic-word density
lda_max_iter: int = 20              # Maximum LDA training iterations
lda_max_features: int = 5000        # Vocabulary size for LDA
lda_learning_method: str = 'batch'  # 'batch' or 'online'
```

**Key Settings:**

- `doc_topic_prior (α)`: Lower values (0.1) = documents focus on fewer topics
- `topic_word_prior (β)`: Lower values (0.01) = topics have fewer dominant words
- `learning_method`: 'batch' for better quality, 'online' for speed
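Assuming the backend is scikit-learn's `LatentDirichletAllocation` (the actual wrapper lives in `risk_discovery_alternatives.py`, so treat this mapping as illustrative), the config fields translate onto the estimator roughly like this:

```python
# Hypothetical mapping from the config fields above to scikit-learn's LDA.
# The real project code may wire these up differently.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def build_lda(config: dict):
    # Vocabulary size for the document-term matrix
    vectorizer = CountVectorizer(max_features=config["lda_max_features"])
    lda = LatentDirichletAllocation(
        n_components=7,                                   # number of risk topics
        doc_topic_prior=config["lda_doc_topic_prior"],    # alpha
        topic_word_prior=config["lda_topic_word_prior"],  # beta
        max_iter=config["lda_max_iter"],
        learning_method=config["lda_learning_method"],
        random_state=0,
    )
    return vectorizer, lda
```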
### 2. **risk_discovery.py** - Added LDARiskDiscovery Class

**New Class:**

```python
class LDARiskDiscovery:
    """
    LDA-based risk discovery with compatible interface.
    Wraps TopicModelingRiskDiscovery from alternatives.
    """
```

**Key Features:**

- Compatible interface with `UnsupervisedRiskDiscovery`
- Wraps `TopicModelingRiskDiscovery` from `risk_discovery_alternatives.py`
- Provides the same methods: `discover_risk_patterns()`, `get_risk_labels()`, `get_discovered_risk_names()`
- **Extra method:** `get_topic_distribution()` - returns a probability distribution over all topics
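The shape of that interface can be approximated with a self-contained sketch built directly on scikit-learn (the real class delegates to `TopicModelingRiskDiscovery`; the class name and internals here are illustrative only):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

class LDARiskDiscoverySketch:
    """Illustrative stand-in mirroring LDARiskDiscovery's method names."""

    def __init__(self, n_topics: int = 7, max_features: int = 5000):
        self.vectorizer = CountVectorizer(max_features=max_features)
        self.lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)

    def discover_risk_patterns(self, clauses):
        # Fit LDA on the document-term matrix; rows are topic distributions.
        dtm = self.vectorizer.fit_transform(clauses)
        return self.lda.fit_transform(dtm)

    def get_risk_labels(self, clauses):
        # Hard labels: the most probable topic per clause.
        return self.get_topic_distribution(clauses).argmax(axis=1)

    def get_topic_distribution(self, clauses):
        # Soft assignments: probability over all topics, each row sums to 1.
        return self.lda.transform(self.vectorizer.transform(clauses))
```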
### 3. **trainer.py** - Dynamic Method Selection

**Updated Initialization:**

```python
def __init__(self, config: LegalBertConfig):
    # Dynamically select risk discovery method
    risk_method = config.risk_discovery_method.lower()
    if risk_method == 'lda':
        self.risk_discovery = LDARiskDiscovery(...)
    elif risk_method == 'kmeans':
        self.risk_discovery = UnsupervisedRiskDiscovery(...)
    else:
        # Default to LDA
        self.risk_discovery = LDARiskDiscovery(...)
```

### 4. **evaluator.py** - Already Compatible

No changes needed: the evaluator uses `self.risk_discovery.discovered_patterns`, which both the LDA and K-Means implementations provide.
---

## Usage

### **Option 1: Use Default LDA Settings (Recommended)**

```bash
# Train with LDA (default)
python3 train.py

# Evaluate with LDA
python3 evaluate.py --checkpoint checkpoints/best_model.pt
```

### **Option 2: Customize LDA Parameters**

Edit `config.py`:

```python
# Fine-tune for your dataset
lda_doc_topic_prior: float = 0.05   # More focused topics
lda_topic_word_prior: float = 0.005 # Sharper topic definitions
lda_max_iter: int = 30              # Better convergence
```

### **Option 3: Switch Back to K-Means**

Edit `config.py`:

```python
risk_discovery_method: str = "kmeans"  # Change from "lda"
```
---

## Expected Output

### **During Training:**

```
🎯 Using LDA (Topic Modeling) for risk discovery
Discovering risk patterns using LDA (n_topics=7)...
LDA provides balanced, overlapping risk categories
🎯 Best for legal text with multi-faceted risks
Creating document-term matrix...
Fitting LDA model...
Analyzing topics and naming patterns...
✅ LDA discovery complete: 7 risk topics found

Discovered Risk Patterns:
  • Topic_PARTY_AGREEMENT
    Keywords: party, agreement, shall, company, consent
  • Topic_INTELLECTUAL_PROPERTY
    Keywords: shall, product, products, agreement, section
  • Topic_COMPLIANCE
    Keywords: shall, agreement, laws, state, governed
  ...
```
### **Key Differences from K-Means:**

| Aspect           | K-Means (Old)                 | LDA (New)                       |
|------------------|-------------------------------|---------------------------------|
| Pattern Names    | `low_risk_obligation_pattern` | `Topic_PARTY_AGREEMENT`         |
| Assignment       | Hard (one cluster)            | Soft (probability distribution) |
| Balance          | 0.481                         | **0.718** ✅                    |
| Overlapping      | No                            | **Yes** ✅                      |
| Interpretability | Good                          | **Better** ✅                   |
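The hard-vs-soft distinction in the table can be seen directly from a topic distribution (the numbers below are illustrative, not model output):

```python
import numpy as np

# Illustrative topic distribution for one clause over 7 topics.
probs = np.array([0.05, 0.40, 0.05, 0.35, 0.05, 0.05, 0.05])

# K-Means-style hard assignment: one cluster id, the runner-up is discarded.
hard_label = int(probs.argmax())

# LDA-style soft assignment: the clause meaningfully spans two topics.
significant = np.where(probs > 0.2)[0].tolist()

print(hard_label, significant)  # 1 [1, 3]
```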
---

## Verification

### **1. Check Risk Discovery Method:**

```bash
python3 -c "from config import LegalBertConfig; c = LegalBertConfig(); print(f'Method: {c.risk_discovery_method}')"
# Expected: Method: lda
```

### **2. Test LDA Discovery:**

```python
from config import LegalBertConfig
from trainer import LegalBertTrainer

config = LegalBertConfig()
trainer = LegalBertTrainer(config)
# Should print: "🎯 Using LDA (Topic Modeling) for risk discovery"
```

### **3. Verify Topic Distribution (LDA-specific feature):**

```python
# Get probability distribution over all topics
clauses = ["Sample clause text..."]
topic_probs = trainer.risk_discovery.get_topic_distribution(clauses)
print(f"Topic distribution shape: {topic_probs.shape}")
# Expected: (1, 7) - probabilities for each of 7 topics
```
---

## LDA Parameter Tuning Guide

### **Document-Topic Prior (α / doc_topic_prior)**

Controls how many topics each document covers:

- **Lower (0.01-0.1)**: Documents focus on 1-2 topics → more decisive assignments
- **Higher (0.5-1.0)**: Documents spread across many topics → more mixed assignments

**Recommended:** `0.1` (current setting) - good for legal clauses with focused risks
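The directional effect of α can be sketched on a toy corpus, assuming a scikit-learn `LatentDirichletAllocation` backend (the corpus and exact numbers are illustrative; the tendency, not the precise values, is the point):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "party shall indemnify the company against all claims",
    "licensee retains intellectual property rights in the product",
    "this agreement is governed by the laws of the state",
    "either party may terminate this agreement upon written notice",
] * 5  # repeat to give the toy model a little more signal

dtm = CountVectorizer().fit_transform(docs)

def doc_topic_matrix(alpha):
    lda = LatentDirichletAllocation(
        n_components=3,
        doc_topic_prior=alpha,  # the α being tuned
        max_iter=20,
        random_state=0,
    )
    return lda.fit_transform(dtm)  # rows are per-document topic distributions

focused = doc_topic_matrix(0.05)  # low α: mass tends to concentrate on few topics
mixed = doc_topic_matrix(1.0)     # high α: mass tends to spread across topics
print(focused[0].round(2), mixed[0].round(2))
```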
### **Topic-Word Prior (β / topic_word_prior)**

Controls how many words define each topic:

- **Lower (0.001-0.01)**: Topics defined by fewer words → sharper topics
- **Higher (0.1-0.5)**: Topics use more words → broader topics

**Recommended:** `0.01` (current setting) - clear topic definitions

### **Max Iterations**

- **10-20**: Fast, may not fully converge
- **20-30**: **Recommended** - good balance
- **50+**: Better quality, slower training

### **Learning Method**

- **'batch'** (current): Better quality, uses the full dataset per iteration
- **'online'**: Faster; good for very large datasets (>100K clauses)
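For corpora large enough that 'online' matters, scikit-learn's estimator (again assuming it is the backend) can be fed mini-batches via `partial_fit` instead of one full-corpus `fit`:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for a large clause corpus.
corpus = [f"clause {i} shall survive termination of this agreement"
          for i in range(1000)]

# Fix the vocabulary up front so every mini-batch shares the same columns.
vectorizer = CountVectorizer(max_features=5000)
vectorizer.fit(corpus)

lda = LatentDirichletAllocation(
    n_components=7,
    learning_method='online',
    batch_size=128,
    random_state=0,
)
for start in range(0, len(corpus), 128):
    batch = vectorizer.transform(corpus[start:start + 128])
    lda.partial_fit(batch)  # incremental update, no full matrix in memory

topic_probs = lda.transform(vectorizer.transform(corpus[:2]))
print(topic_probs.shape)  # (2, 7)
```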
---

## Troubleshooting

### **Error: "Import 'TopicModelingRiskDiscovery' not found"**

**Solution:** Ensure `risk_discovery_alternatives.py` is in the same directory.

### **Warning: "LDA did not converge"**

**Solution:** Increase `lda_max_iter` in `config.py`:

```python
lda_max_iter: int = 30  # or 40
```

### **Topics are too similar/overlapping**

**Solution:** Lower the priors for sharper topics:

```python
lda_doc_topic_prior: float = 0.05   # More focused
lda_topic_word_prior: float = 0.005 # Sharper
```

### **Need faster training**

**Solution:** Switch to online learning:

```python
lda_learning_method: str = 'online'
```
| ## π References | |
| ### **LDA Theory:** | |
| - Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. JMLR. | |
| ### **LDA for Legal Text:** | |
| - Katz, D. M., et al. (2011). Quantitative analysis of the law using text analytics. | |
| - Ashley, K. D. (2017). Artificial Intelligence and Legal Analytics. | |
| ### **Comparison Results:** | |
| - See `risk_discovery_comparison_report.txt` for full analysis | |
| - See `risk_discovery_comparison_results.json` for raw data | |
| --- | |
| ## β Migration Complete | |
| The codebase now uses **LDA as the default risk discovery method**, providing: | |
| 1. β **Better Balance** - 0.718 vs 0.481 (K-Means) | |
| 2. β **Overlapping Categories** - Clauses can belong to multiple risk types | |
| 3. β **Probability Distributions** - Confidence scores for assignments | |
| 4. β **Proven Quality** - Best performer in comparison study | |
| 5. β **Backward Compatible** - Can switch back to K-Means anytime | |
| **Next Steps:** | |
| 1. Run `python3 train.py` to train with LDA | |
| 2. Monitor discovered topics in output | |
| 3. Adjust LDA parameters if needed (see tuning guide above) | |
| 4. Compare results with previous K-Means baseline | |
| --- | |
| **Questions?** Check the comparison report or review the code comments in `risk_discovery.py` for detailed explanations. | |