Risk Discovery Methods - Comparison Implementation
Overview
Implemented 3 additional risk discovery methods beyond the original K-Means clustering to enable comprehensive comparison of different approaches.
Implemented Methods (4 Total)
1. K-Means Clustering (Original)
File: risk_discovery.py
Status: Already existed
Characteristics:
- Fast and scalable
- Creates spherical clusters
- Clear, non-overlapping boundaries
- Requires predefined number of clusters
Best For:
- Large datasets (10K+ clauses)
- When you need consistent, fast results
- Clear risk categorization needed
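The K-Means flow described above can be sketched with scikit-learn. The clause texts and cluster count here are illustrative toy data, not the project's actual dataset or API:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy clause set standing in for the real CUAD clauses.
clauses = [
    "Supplier shall indemnify and hold harmless the Buyer from all claims.",
    "Contractor agrees to indemnify the Company against any losses.",
    "Payment of fees is due within thirty days of the invoice date.",
    "All payment of invoices shall be made in US dollars.",
]

# TF-IDF features, then spherical clusters with a predefined cluster count.
X = TfidfVectorizer().fit_transform(clauses)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# labels holds one cluster id (0 or 1) per clause
```

With a fixed `random_state`, results are reproducible run to run, which is part of why K-Means suits "consistent, fast results".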
2. LDA Topic Modeling (NEW)
File: risk_discovery_alternatives.py
Class: TopicModelingRiskDiscovery
Characteristics:
- Probabilistic approach
- Clauses can belong to multiple topics
- Discovers interpretable topic-word distributions
- Slower than K-Means
Advantages:
- Natural handling of overlapping risks
- Provides a probability distribution per clause
- Topic words are highly interpretable
- Works well with legal text (multi-thematic)
Disadvantages:
- Requires tuning (alpha, beta parameters)
- Slower training (20 iterations)
- Less clear boundaries than K-Means
Parameters:
- n_topics: Number of topics to discover (default: 7)
- doc_topic_prior: Alpha - document-topic density (default: 0.1)
- topic_word_prior: Beta - topic-word density (default: 0.01)
Quality Metrics:
- Perplexity (lower is better)
- Log-likelihood (higher is better)
- Topic diversity (entropy)
Example Output:
{
'topic_id': 0,
'topic_name': 'Topic_LIABILITY',
'top_words': ['liability', 'damages', 'indemnify', 'loss', 'liable'],
'word_weights': [0.045, 0.038, 0.031, ...],
'clause_count': 2847,
'proportion': 0.284
}
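The LDA setup described above can be sketched with scikit-learn's `LatentDirichletAllocation`, using the alpha/beta priors and iteration count from the parameter list. The clauses and the reduced topic count are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy clauses; the real pipeline would use the CUAD corpus.
clauses = [
    "Supplier shall indemnify and hold harmless the Buyer from claims.",
    "Liability for damages is limited and Supplier shall not be liable.",
    "Payment of fees is due within thirty days of the invoice date.",
    "Termination of this agreement requires ninety days written notice.",
    "Confidential information shall not be disclosed to third parties.",
]

counts = CountVectorizer().fit_transform(clauses)
lda = LatentDirichletAllocation(
    n_components=3,          # n_topics (7 in the description; 3 for this toy set)
    doc_topic_prior=0.1,     # alpha - document-topic density
    topic_word_prior=0.01,   # beta - topic-word density
    max_iter=20,
    random_state=0,
)
doc_topics = lda.fit_transform(counts)  # shape: (n_clauses, n_topics)
# Each row is a probability distribution over topics for one clause,
# which is what enables soft, overlapping risk assignments.
print(round(lda.perplexity(counts), 2))  # quality metric: lower is better
```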
3. Hierarchical Clustering (NEW)
File: risk_discovery_alternatives.py
Class: HierarchicalRiskDiscovery
Characteristics:
- Builds hierarchy of clusters (dendrogram)
- Can be cut at different levels
- Agglomerative (bottom-up) approach
- Deterministic results
Advantages:
- Discovers nested risk structure
- No need to specify clusters upfront
- Deterministic (same results every run)
- Can explore different granularities
Disadvantages:
- Slower for large datasets (O(n²) complexity)
- Memory intensive
- Cannot scale to 100K+ clauses easily
Parameters:
- n_clusters: Number of clusters to form (default: 7)
- linkage: Linkage criterion - 'ward', 'average', 'complete', or 'single'
Quality Metrics:
- Silhouette score (-1 to 1, higher is better)
- Average cluster size
- Cluster balance
Example Output:
{
'cluster_id': 2,
'cluster_name': 'RISK_INDEMNITY',
'top_terms': ['indemnify', 'hold', 'harmless', 'defend', 'claims'],
'term_scores': [0.234, 0.198, 0.176, ...],
'clause_count': 1543,
'proportion': 0.154
}
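The agglomerative approach and its silhouette quality metric can be sketched as follows; clause texts and the cluster count are illustrative, not the project's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

clauses = [
    "Supplier shall indemnify and hold harmless the Buyer from claims.",
    "Contractor agrees to indemnify the Company against any losses.",
    "Payment of fees is due within thirty days of the invoice date.",
    "All payment of invoices shall be made in US dollars.",
]

# Ward linkage expects dense Euclidean features, hence toarray().
X = TfidfVectorizer().fit_transform(clauses).toarray()
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# Silhouette score: -1 to 1, higher means tighter, better-separated clusters.
score = silhouette_score(X, labels)
```

Because agglomerative clustering has no random initialization, repeated runs on the same data return identical labels, which is the determinism claimed above.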
4. DBSCAN (Density-Based) (NEW)
File: risk_discovery_alternatives.py
Class: DensityBasedRiskDiscovery
Characteristics:
- Density-based clustering
- Discovers clusters of arbitrary shapes
- Identifies outliers/noise automatically
- No need to specify number of clusters
Advantages:
- Finds clusters of any shape
- Identifies outliers (rare risks)
- Robust to noise
- No cluster number required
Disadvantages:
- Sensitive to eps parameter
- Struggles with varying density
- May produce many small clusters
- Requires careful tuning
Parameters:
- eps: Maximum distance between samples (default: 0.5)
- min_samples: Minimum samples in neighborhood (default: 5)
- auto_tune: Automatically tune eps (default: True)
Quality Metrics:
- Number of clusters found
- Outlier ratio (% of noise points)
- Average cluster size
Special Feature: Outlier Detection
- Identifies clauses that don't fit any pattern
- Useful for finding rare/unique risks
- Can flag unusual contract clauses
Example Output:
{
'cluster_id': 3,
'cluster_name': 'RISK_PAYMENT_C3',
'top_terms': ['payment', 'fee', 'invoice', 'dollar', 'cost'],
'clause_count': 892,
'is_core_cluster': True
}
# Plus outliers
{
'n_outliers': 234,
'outlier_ratio': 0.023,
'outlier_clauses': ['Unusual clause text...']
}
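The DBSCAN behavior above, including the automatic noise labeling behind the outlier counts, can be sketched like this. The clause texts, `eps`, and `min_samples` values are illustrative only (the auto-tuning mentioned above is not shown):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

clauses = [
    "Supplier shall indemnify the Buyer from claims.",
    "Contractor shall indemnify the Company from claims.",
    "Vendor shall indemnify the Client from claims.",
    "Payment of the invoice is due within thirty days.",
    "Payment of each invoice is due within sixty days.",
    "Payment of any invoice is due within ninety days.",
    "The moon is made of green cheese entirely.",  # deliberate outlier
]

X = TfidfVectorizer().fit_transform(clauses)
# Cosine distance suits TF-IDF vectors; eps/min_samples picked for this toy set.
labels = DBSCAN(eps=0.6, min_samples=2, metric="cosine").fit_predict(X)

n_outliers = int(np.sum(labels == -1))            # DBSCAN marks noise as -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

The `-1` label is what makes the outlier-ratio metric and rare-risk flagging straightforward: noise points fall out of the clustering for free.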
Comparison Features
Automated Comparison Script
File: compare_risk_discovery.py
What It Does:
- Loads CUAD dataset (or generates sample data)
- Runs all 4 methods on the same clauses
- Measures execution time for each
- Compares quality metrics
- Analyzes pattern diversity
- Generates comprehensive report
Usage:
python compare_risk_discovery.py
Output Files:
- risk_discovery_comparison_report.txt - Human-readable report
- risk_discovery_comparison_results.json - Detailed results with metrics
Report Includes:
- Summary table with timing and quality
- Detailed analysis per method
- Pattern diversity metrics
- Top discovered patterns
- Recommendations for each method
Comparison Metrics
Performance Metrics
- Execution Time: Time to discover patterns
- Clauses/Second: Processing throughput
- Scalability: How method handles large datasets
Quality Metrics
Silhouette Score (Hierarchical):
- Range: -1 to 1
- Higher is better
- Measures cluster cohesion
Perplexity (LDA):
- Lower is better
- Measures model fit to data
Outlier Ratio (DBSCAN):
- % of clauses marked as outliers
- Indicates rare pattern detection
Diversity Metrics
- Pattern Balance: How evenly clauses distributed
- Pattern Size Variance: Consistency of cluster sizes
- Topic Diversity (LDA): Entropy of word distributions
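Pattern balance, for instance, can be computed as the normalized entropy of cluster sizes. A minimal sketch; `pattern_balance` is an illustrative helper, not a function in the project code:

```python
import numpy as np
from collections import Counter

def pattern_balance(labels):
    """Normalized entropy of cluster sizes.

    Returns 1.0 when clauses are spread perfectly evenly across clusters
    and 0.0 when everything collapses into a single cluster.
    """
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    entropy = -(p * np.log(p)).sum()
    return float(entropy / np.log(len(counts))) if len(counts) > 1 else 0.0

print(pattern_balance([0, 0, 1, 1]))  # 1.0 (perfectly balanced)
print(pattern_balance([0, 0, 0, 0]))  # 0.0 (single cluster)
```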
Method Selection Guide
Choose K-Means when:
- You have large datasets (10K+ clauses)
- You need fast, consistent results
- You want clear, non-overlapping categories
- You can specify the number of risks upfront
Choose LDA when:
- Clauses may have multiple risk types
- You need probability distributions
- Interpretability of topics is crucial
- You can afford slower processing
Choose Hierarchical when:
- You want to explore risk hierarchies
- Dataset is moderate size (<10K clauses)
- You need deterministic results
- You want to analyze at multiple granularities
Choose DBSCAN when:
- You need to identify outliers/rare risks
- Cluster shapes are not spherical
- You don't know the number of clusters
- You want automatic noise handling
Performance Comparison (Estimated)
Based on 5,000 clauses:
| Method | Time | Scalability | Quality | Interpretability |
|---|---|---|---|---|
| K-Means | 5-10s | Excellent | Good | Good |
| LDA | 30-60s | Good | Good | Excellent |
| Hierarchical | 15-30s | Moderate | Good | Good |
| DBSCAN | 10-20s | Good | Variable | Good |
Integration with Training Pipeline
Option 1: Use Single Method
# In trainer.py
from risk_discovery import UnsupervisedRiskDiscovery # Original
# Or choose alternative
from risk_discovery_alternatives import TopicModelingRiskDiscovery
risk_discovery = TopicModelingRiskDiscovery(n_topics=7)
Option 2: Compare Before Training
# Run comparison first
python compare_risk_discovery.py
# Review results
cat risk_discovery_comparison_report.txt
# Choose best method for your data
python train.py # Will use chosen method
Option 3: Ensemble Approach (Advanced)
# Use multiple methods and combine
from risk_discovery import UnsupervisedRiskDiscovery
from risk_discovery_alternatives import TopicModelingRiskDiscovery
kmeans = UnsupervisedRiskDiscovery(n_clusters=7)
lda = TopicModelingRiskDiscovery(n_topics=7)
kmeans_labels = kmeans.discover_risk_patterns(clauses)
lda_labels = lda.discover_risk_patterns(clauses)
# Combine using voting or averaging
ensemble_labels = combine_predictions(kmeans_labels, lda_labels)
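The `combine_predictions` helper above is left undefined. Since different methods assign arbitrary cluster ids, labels cannot be combined by direct voting; one simple option is a co-association (evidence-accumulation) consensus: count how often each pair of clauses is co-clustered across methods, then greedily group pairs that agree in a majority of methods. A hedged sketch, not the project's actual implementation:

```python
import numpy as np

def combine_predictions(*labelings, threshold=0.5):
    """Greedy co-association consensus (illustrative sketch).

    Two clauses land in the same combined cluster when at least `threshold`
    of the methods placed them in the same cluster. Robust to each method
    using its own arbitrary label ids.
    """
    labelings = [np.asarray(l) for l in labelings]
    n = len(labelings[0])

    # co[i, j] = fraction of methods that co-clustered clauses i and j
    co = np.zeros((n, n))
    for labels in labelings:
        co += (labels[:, None] == labels[None, :])
    co /= len(labelings)

    # Greedily assign each still-unlabeled clause and its agreeing peers.
    combined = np.full(n, -1)
    next_id = 0
    for i in range(n):
        if combined[i] == -1:
            members = np.where((co[i] >= threshold) & (combined == -1))[0]
            combined[members] = next_id
            next_id += 1
    return combined

# Two methods that agree up to a label permutation:
print(combine_predictions([0, 0, 1, 1], [1, 1, 0, 0]).tolist())  # [0, 0, 1, 1]
```

The greedy grouping is an approximation; a fuller treatment would run a clustering pass over the co-association matrix itself.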
Code Statistics
New Implementation
- New File: risk_discovery_alternatives.py (570 lines)
- New File: compare_risk_discovery.py (450 lines)
- Total New Code: ~1,020 lines
- New Classes: 3 (TopicModelingRiskDiscovery, HierarchicalRiskDiscovery, DensityBasedRiskDiscovery)
- New Methods: 20+
Dependencies Required
All methods use existing dependencies:
- sklearn.decomposition.LatentDirichletAllocation (LDA)
- sklearn.cluster.AgglomerativeClustering (Hierarchical)
- sklearn.cluster.DBSCAN (Density-based)
- Already in requirements.txt
Testing the Comparison
Quick Test (Sample Data)
# Uses synthetic sample clauses
python compare_risk_discovery.py
Full Test (CUAD Dataset)
# Requires CUAD dataset at dataset/CUAD_v1/CUAD_v1.json
python compare_risk_discovery.py
Expected Output
RISK DISCOVERY METHOD COMPARISON
Loading CUAD dataset...
Loaded 5000 clauses for comparison
METHOD 1: K-Means Clustering
Execution time: 8.34 seconds
7 clusters found
METHOD 2: LDA Topic Modeling
Execution time: 45.67 seconds
Perplexity: 1234.56
7 topics found
METHOD 3: Hierarchical Clustering
Execution time: 23.12 seconds
Silhouette Score: 0.234
7 clusters found
METHOD 4: DBSCAN
Execution time: 12.45 seconds
9 clusters found, 123 outliers
COMPARISON SUMMARY
[Detailed comparison table...]
Report saved to: risk_discovery_comparison_report.txt
Detailed results saved to: risk_discovery_comparison_results.json
Summary
What Was Added
- 3 new risk discovery methods (LDA, Hierarchical, DBSCAN)
- Automated comparison framework
- Comprehensive evaluation metrics
- Method selection guidelines
- Integration ready for the training pipeline
What You Can Do Now
- Compare Methods: Run compare_risk_discovery.py
- Choose Best Method: Based on your data characteristics
- Train with Alternative: Swap in any method for training
- Ensemble Learning: Combine multiple methods (advanced)
Next Steps
- Run comparison on your CUAD dataset
- Review the generated report
- Choose the method that best fits your needs:
  - Fast & Scalable → K-Means
  - Overlapping Risks → LDA
  - Risk Hierarchy → Hierarchical
  - Outlier Detection → DBSCAN
- Update trainer.py to use the chosen method
- Train Legal-BERT with optimized risk discovery
Status: COMPLETE AND READY FOR TESTING
Files: risk_discovery_alternatives.py, compare_risk_discovery.py
Lines Added: ~1,020 lines of production code