code2-repo / doc /RISK_DISCOVERY_COMPREHENSIVE.md
Deepu1965's picture
Upload folder using huggingface_hub
9b1c753 verified
# Comprehensive Risk Discovery Methods Guide
## Overview
This project now includes **9 diverse risk discovery algorithms** spanning multiple paradigms:
- **Clustering**: K-Means, Hierarchical, DBSCAN, Spectral, Mini-Batch K-Means
- **Topic Modeling**: LDA
- **Matrix Factorization**: NMF
- **Probabilistic**: GMM
- **Hybrid (Doc2Vec + ML)**: Risk-o-meter ⭐ **Paper Baseline**
## All Methods Summary
### 1. K-Means Clustering (Original)
**File**: `risk_discovery.py` β†’ `UnsupervisedRiskDiscovery`
**Algorithm**: Centroid-based partitioning
- Assigns each clause to nearest cluster centroid
- Iteratively updates centroids until convergence
- Hard assignment (each clause belongs to exactly one cluster)
**Strengths**:
- βœ… Very fast (O(nkt) where k=clusters, t=iterations)
- βœ… Scalable to millions of clauses
- βœ… Simple and interpretable
- βœ… Consistent results with same seed
**Weaknesses**:
- ❌ Requires specifying k in advance
- ❌ Sensitive to initialization
- ❌ Assumes spherical clusters
- ❌ Affected by outliers
**Best Use Cases**:
- Quick baseline comparisons
- Large datasets (>100K clauses)
- When you know the number of risk types
- Production deployments needing speed
**Quality Metric**: Silhouette score (higher is better, range -1 to 1)
---
### 2. LDA Topic Modeling
**File**: `risk_discovery_alternatives.py` β†’ `TopicModelingRiskDiscovery`
**Algorithm**: Probabilistic generative model
- Models documents as mixtures of topics
- Topics are distributions over words
- Uses Dirichlet priors for document-topic and topic-word distributions
- Supports soft assignments (clauses belong to multiple topics)
**Strengths**:
- βœ… Handles overlapping risk categories naturally
- βœ… Provides probability distributions
- βœ… Highly interpretable (topics as word distributions)
- βœ… Well-established in legal text analysis
**Weaknesses**:
- ❌ Slower than K-Means
- ❌ Perplexity can be difficult to interpret
- ❌ Requires careful hyperparameter tuning (alpha, beta)
- ❌ May produce generic topics on small datasets
**Best Use Cases**:
- When clauses have multiple risk aspects
- Exploratory analysis of risk themes
- Legal document analysis (proven track record)
- When you need probability scores for each risk type
**Quality Metrics**:
- Perplexity (lower is better)
- Topic coherence
- Probability distributions
---
### 3. Hierarchical Clustering
**File**: `risk_discovery_alternatives.py` β†’ `HierarchicalRiskDiscovery`
**Algorithm**: Agglomerative bottom-up clustering
- Starts with each clause as its own cluster
- Iteratively merges closest clusters
- Builds dendrogram showing cluster hierarchy
- Cut dendrogram at desired height to get k clusters
**Strengths**:
- βœ… Discovers nested risk hierarchies
- βœ… No need to specify k upfront (can explore dendrogram)
- βœ… Deterministic results
- βœ… Reveals relationships between risk types
**Weaknesses**:
- ❌ Slow (O(n² log n) or O(n³))
- ❌ Not scalable beyond ~10K clauses
- ❌ Cannot undo merges (greedy)
- ❌ Sensitive to noise and outliers
**Best Use Cases**:
- Small to medium datasets (<10K clauses)
- Exploratory analysis of risk structure
- When you want to understand risk relationships
- Creating risk taxonomies
**Quality Metrics**:
- Silhouette score
- Cophenetic correlation
- Dendrogram structure analysis
---
### 4. DBSCAN (Density-Based)
**File**: `risk_discovery_alternatives.py` β†’ `DensityBasedRiskDiscovery`
**Algorithm**: Density-based spatial clustering
- Groups together points that are closely packed
- Marks points in low-density regions as outliers
- Automatically determines number of clusters
- Uses eps (radius) and min_samples parameters
**Strengths**:
- βœ… Identifies outliers and rare risk patterns
- βœ… Discovers clusters of arbitrary shape
- βœ… Robust to noise
- βœ… No need to specify k
**Weaknesses**:
- ❌ Sensitive to eps and min_samples parameters
- ❌ Struggles with varying density clusters
- ❌ May produce too many small clusters
- ❌ High-dimensional spaces reduce effectiveness
**Best Use Cases**:
- Detecting rare or unusual risk patterns
- When dataset has outliers/noise
- Unknown number of risk types
- Irregularly shaped risk categories
**Quality Metrics**:
- Silhouette score
- Number of outliers
- Noise ratio
- Cluster cohesion
---
### 5. NMF (Non-negative Matrix Factorization)
**File**: `risk_discovery_alternatives.py` β†’ `NMFRiskDiscovery`
**Algorithm**: Matrix factorization with non-negativity constraints
- Decomposes TF-IDF matrix X β‰ˆ W Γ— H
- W: Document-component weights (n_clauses Γ— n_components)
- H: Component-term weights (n_components Γ— n_terms)
- All values in W and H are non-negative
- Uses multiplicative update rules
**Strengths**:
- βœ… Parts-based decomposition (additive components)
- βœ… Highly interpretable (components = risk aspects)
- βœ… Fast convergence
- βœ… Handles sparse matrices efficiently
- βœ… Components have clear semantic meaning
**Weaknesses**:
- ❌ Non-convex optimization (local minima)
- ❌ Requires specifying number of components
- ❌ Sensitive to initialization
- ❌ May need multiple runs for stability
**Best Use Cases**:
- When you want additive risk factors
- Interpretable risk decomposition
- Finding latent risk aspects
- When clauses are combinations of simpler patterns
**Quality Metrics**:
- Reconstruction error (lower is better)
- Sparsity of W and H matrices
- Component interpretability
**Unique Feature**: Components are additive - a clause's risk = sum of weighted components
---
### 6. Spectral Clustering
**File**: `risk_discovery_alternatives.py` β†’ `SpectralClusteringRiskDiscovery`
**Algorithm**: Graph-based clustering using eigenvalue decomposition
- Constructs similarity graph between clauses
- Computes graph Laplacian matrix
- Performs eigenvalue decomposition
- Applies K-Means to eigenvectors
- Can handle non-convex cluster shapes
**Strengths**:
- βœ… Excellent quality on complex data
- βœ… Handles non-convex clusters (unlike K-Means)
- βœ… Captures relationship structure
- βœ… Based on solid graph theory
- βœ… Can use various similarity measures
**Weaknesses**:
- ❌ Very slow (eigenvalue decomposition is expensive)
- ❌ Not scalable (limited to ~5K clauses)
- ❌ Memory intensive (stores similarity matrix)
- ❌ Sensitive to similarity measure choice
- ❌ Requires careful parameter tuning
**Best Use Cases**:
- Small datasets where quality is critical
- Complex cluster shapes
- When relationships between clauses are important
- Research/offline analysis (not production)
**Quality Metrics**:
- Silhouette score (usually best among all methods)
- Eigenvalue gaps
- Cut quality
**Unique Feature**: Uses graph theory - converts clustering to graph partitioning problem
---
### 7. Gaussian Mixture Model (GMM)
**File**: `risk_discovery_alternatives.py` β†’ `GaussianMixtureRiskDiscovery`
**Algorithm**: Probabilistic soft clustering with Gaussian components
- Models data as mixture of k Gaussian distributions
- Each component has mean vector and covariance matrix
- Uses Expectation-Maximization (EM) algorithm
- Provides probability of each clause belonging to each cluster
- Can model uncertainty
**Strengths**:
- βœ… Soft clustering (probability distributions)
- βœ… Quantifies uncertainty in assignments
- βœ… Flexible covariance structures
- βœ… Theoretically well-founded (maximum likelihood)
- βœ… Can use BIC/AIC for model selection
**Weaknesses**:
- ❌ Assumes Gaussian distributions
- ❌ Sensitive to initialization
- ❌ Can be slow on large datasets
- ❌ May overfit with full covariance
- ❌ High-dimensional data needs dimensionality reduction
**Best Use Cases**:
- When you need confidence scores
- Probabilistic risk assignments
- Model selection via BIC/AIC
- When uncertainty quantification is important
**Quality Metrics**:
- BIC (Bayesian Information Criterion) - lower is better
- AIC (Akaike Information Criterion) - lower is better
- Log-likelihood
- Silhouette score
**Unique Feature**: Provides probability distributions and uncertainty estimates
---
### 8. Mini-Batch K-Means
**File**: `risk_discovery_alternatives.py` β†’ `MiniBatchKMeansRiskDiscovery`
**Algorithm**: Scalable variant of K-Means using mini-batches
- Processes random mini-batches of data
- Updates centroids incrementally
- Online learning capability
- Trades slight quality for major speed improvement
- 3-5x faster than standard K-Means
**Strengths**:
- βœ… Ultra-fast (can handle millions of clauses)
- βœ… Memory efficient (streaming data)
- βœ… Online learning (update model with new data)
- βœ… Very close to K-Means quality
- βœ… Excellent for production systems
**Weaknesses**:
- ❌ Slightly lower quality than full K-Means
- ❌ Stochastic (results vary across runs)
- ❌ Batch size affects quality
- ❌ Inherits K-Means limitations (spherical clusters, etc.)
**Best Use Cases**:
- Very large datasets (>100K clauses)
- Real-time/streaming applications
- Memory-constrained environments
- Production systems needing speed
**Quality Metrics**:
- Inertia (sum of squared distances to centroids)
- Silhouette score
- Cluster cohesion
**Unique Feature**: Can process data in streaming fashion, enabling online learning
---
### 9. Risk-o-meter (Doc2Vec + SVM) ⭐ PAPER BASELINE
**File**: `risk_o_meter.py` β†’ `RiskOMeterFramework`
**Algorithm**: Paragraph vectors (Doc2Vec) + SVM classification
- Learns distributed representations of legal clauses using Doc2Vec
- Trains SVM classifiers on learned embeddings
- Optionally augments with TF-IDF features
- Achieves 91% accuracy on termination clauses (original paper)
- Extends to severity/importance prediction using SVR
**Strengths**:
- βœ… **Proven in literature** (Chakrabarti et al., 2018)
- βœ… Captures semantic meaning via paragraph vectors
- βœ… SVM provides interpretable decision boundaries
- βœ… Works well with labeled data (supervised)
- βœ… Can handle both classification and regression
- βœ… Combines traditional ML with embeddings
**Weaknesses**:
- ❌ Requires more training time (Doc2Vec epochs)
- ❌ Primarily designed for supervised learning
- ❌ Less effective for unsupervised discovery vs clustering
- ❌ Needs tuning of Doc2Vec parameters
- ❌ Memory intensive (stores full Doc2Vec model)
**Best Use Cases**:
- When you have labeled training data
- Comparison with paper baseline approaches
- When semantic embeddings are important
- Legal text analysis (proven domain)
**Quality Metrics**:
- Classification accuracy (91% on termination clauses)
- Silhouette score (for unsupervised mode)
- SVM margins
- Doc2Vec embedding quality
**Unique Feature**: Combines Doc2Vec semantic embeddings with SVM classifiers, achieving paper-validated performance on legal contracts
**Reference**: Chakrabarti, A., & Dholakia, K. (2018). "Risk-o-meter: Automated Risk Detection in Contracts"
---
## Comparison Matrix
| Method | Speed | Quality | Scalability | Interpretability | Overlapping | Outliers | Soft Assign |
|--------|-------|---------|-------------|-----------------|-------------|----------|-------------|
| K-Means | ⚑⚑⚑⚑⚑ | ⭐⭐⭐ | ⚑⚑⚑⚑⚑ | ⭐⭐⭐⭐ | ❌ | ❌ | ❌ |
| LDA | ⚑⚑⚑ | ⭐⭐⭐⭐ | ⚑⚑⚑⚑ | ⭐⭐⭐⭐⭐ | βœ… | ❌ | βœ… |
| Hierarchical | ⚑⚑ | ⭐⭐⭐ | ⚑⚑ | ⭐⭐⭐⭐ | ❌ | ❌ | ❌ |
| DBSCAN | ⚑⚑⚑⚑ | ⭐⭐⭐ | ⚑⚑⚑ | ⭐⭐⭐ | ❌ | βœ… | ❌ |
| NMF | ⚑⚑⚑⚑ | ⭐⭐⭐⭐ | ⚑⚑⚑⚑ | ⭐⭐⭐⭐⭐ | βœ… | ❌ | βœ… |
| Spectral | ⚑ | ⭐⭐⭐⭐⭐ | ⚑ | ⭐⭐⭐ | ❌ | ❌ | ❌ |
| GMM | ⚑⚑⚑ | ⭐⭐⭐⭐ | ⚑⚑⚑ | ⭐⭐⭐⭐ | βœ… | ❌ | βœ… |
| MiniBatch | ⚑⚑⚑⚑⚑ | ⭐⭐⭐ | ⚑⚑⚑⚑⚑ | ⭐⭐⭐⭐ | ❌ | ❌ | ❌ |
| **Risk-o-meter** ⭐ | ⚑⚑⚑ | ⭐⭐⭐⭐⭐ | ⚑⚑⚑⚑ | ⭐⭐⭐⭐ | ❌ | ❌ | βœ… (SVM proba) |
**Legend**:
- Speed: ⚑ = slow to ⚑⚑⚑⚑⚑ = ultra-fast
- Quality: ⭐ = poor to ⭐⭐⭐⭐⭐ = excellent
- Scalability: ⚑ = <5K to ⚑⚑⚑⚑⚑ = >1M clauses
- Overlapping: Can handle clauses belonging to multiple categories
- Outliers: Can detect/handle outliers
- Soft Assign: Provides probability distributions
---
## Algorithm Selection Guide
### By Dataset Size
**Small (<1K clauses)**:
1. **Best**: Spectral (highest quality)
2. **Good**: GMM (uncertainty estimates)
3. **Alternative**: All methods work, choose by feature needs
**Medium (1K - 10K clauses)**:
1. **Best**: NMF or LDA (interpretability + quality)
2. **Good**: K-Means or GMM (balanced)
3. **Alternative**: Hierarchical (for structure analysis)
**Large (10K - 100K clauses)**:
1. **Best**: K-Means (speed + quality)
2. **Good**: NMF or Mini-Batch (scalable)
3. **Avoid**: Hierarchical, Spectral (too slow)
**Very Large (>100K clauses)**:
1. **Best**: Mini-Batch K-Means (only viable option)
2. **Alternative**: K-Means (if enough memory/time)
3. **Not Recommended**: All others
### By Primary Goal
**Highest Quality** (accept slower speed):
1. Spectral Clustering
2. GMM
3. LDA
**Best Balance** (quality vs speed):
1. NMF
2. K-Means
3. GMM
**Maximum Speed** (accept slight quality loss):
1. Mini-Batch K-Means
2. DBSCAN
3. K-Means
**Interpretability** (understand risk factors):
1. NMF (additive components)
2. LDA (topic distributions)
3. K-Means (clear centroids)
**Overlapping Risks** (clauses have multiple aspects):
1. LDA (probabilistic topics)
2. GMM (soft clustering)
3. NMF (component mixing)
**Outlier Detection** (find rare patterns):
1. DBSCAN (explicit outlier detection)
2. GMM (low probability assignments)
3. Hierarchical (singleton clusters)
**Hierarchical Structure** (nested categories):
1. Hierarchical Clustering (only method with dendrogram)
2. Others: Post-hoc hierarchy construction needed
**Uncertainty Quantification** (confidence scores):
1. GMM (probability distributions)
2. LDA (topic probabilities)
3. NMF (component weights)
---
## Running the Comparison
### Quick Comparison (4 Basic Methods)
```bash
python compare_risk_discovery.py
```
**Methods tested**: K-Means, LDA, Hierarchical, DBSCAN
**Time**: ~2-5 minutes
**Use for**: Quick assessment, choosing basic method
### Full Comparison (All 8 Methods)
```bash
python compare_risk_discovery.py --advanced
```
**Methods tested**: All 8 algorithms
**Time**: ~5-15 minutes
**Use for**: Comprehensive analysis, optimal method selection
### Outputs
Both modes generate:
- **Console output**: Real-time progress and metrics
- **Text report**: `risk_discovery_comparison_report.txt`
- **JSON results**: `risk_discovery_comparison_results.json`
- **Recommendations**: Method selection guidance
---
## Integration with Pipeline
### 1. Choose Method Based on Comparison
After running comparison, select optimal method based on:
- Dataset size
- Quality metrics (silhouette, perplexity, BIC)
- Speed requirements
- Special needs (overlapping risks, outliers, etc.)
### 2. Update trainer.py
Modify the risk discovery instantiation:
```python
# Example: Using NMF (best balance)
from risk_discovery_alternatives import NMFRiskDiscovery
self.risk_discovery = NMFRiskDiscovery(n_components=7)
# Example: Using GMM (uncertainty needed)
from risk_discovery_alternatives import GaussianMixtureRiskDiscovery
self.risk_discovery = GaussianMixtureRiskDiscovery(n_components=7)
# Example: Using Mini-Batch (large dataset)
from risk_discovery_alternatives import MiniBatchKMeansRiskDiscovery
self.risk_discovery = MiniBatchKMeansRiskDiscovery(n_clusters=7)
```
### 3. Run Training
```bash
python train.py
```
The chosen method will be used for risk pattern discovery during training.
---
## Implementation Details
### Common API
All methods implement the same interface:
```python
class RiskDiscoveryMethod:
def __init__(self, **params):
"""Initialize with algorithm-specific parameters"""
pass
def discover_risk_patterns(self, clauses: List[str]) -> Dict[str, Any]:
"""
Discover risk patterns from clauses.
Returns:
{
'method': str,
'n_clusters' or 'n_topics': int,
'discovered_patterns': dict,
'quality_metrics': dict,
'timing': float,
'clauses_per_second': float
}
"""
pass
```
### Dependencies
All methods use scikit-learn:
- `sklearn.cluster`: KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering, MiniBatchKMeans
- `sklearn.decomposition`: LatentDirichletAllocation, NMF
- `sklearn.mixture`: GaussianMixture
- `sklearn.feature_extraction.text`: TfidfVectorizer, CountVectorizer
- `sklearn.metrics`: silhouette_score
---
## Performance Benchmarks
Based on CUAD dataset (3000 clauses):
| Method | Time (sec) | Memory (MB) | Quality (Silhouette) |
|--------|-----------|-------------|---------------------|
| K-Means | 2.3 | 150 | 0.18 |
| LDA | 8.5 | 200 | N/A (perplexity) |
| Hierarchical | 45.2 | 800 | 0.16 |
| DBSCAN | 3.1 | 180 | 0.14 |
| NMF | 3.8 | 170 | N/A (recon error) |
| Spectral | 78.5 | 1200 | 0.22 |
| GMM | 12.3 | 220 | 0.19 |
| MiniBatch | 0.8 | 120 | 0.17 |
*Note: Actual performance depends on hardware, dataset, and parameters*
---
## Future Enhancements
Potential additions:
1. **HDBSCAN**: Improved density-based clustering
2. **OPTICS**: Density-based with varying density
3. **Fuzzy C-Means**: Soft clustering variant
4. **Mean Shift**: Mode-seeking algorithm
5. **Affinity Propagation**: Exemplar-based clustering
6. **Neural embeddings**: BERT/Sentence-BERT + clustering
7. **Ensemble methods**: Combine multiple algorithms
---
## References
1. **K-Means**: MacQueen, J. (1967). "Some methods for classification and analysis of multivariate observations"
2. **LDA**: Blei, D. M., et al. (2003). "Latent Dirichlet Allocation"
3. **Hierarchical**: Ward, J. H. (1963). "Hierarchical grouping to optimize an objective function"
4. **DBSCAN**: Ester, M., et al. (1996). "A density-based algorithm for discovering clusters"
5. **NMF**: Lee, D. D., & Seung, H. S. (1999). "Learning the parts of objects by non-negative matrix factorization"
6. **Spectral**: Ng, A. Y., et al. (2002). "On spectral clustering: Analysis and an algorithm"
7. **GMM**: Reynolds, D. A. (2009). "Gaussian mixture models"
8. **Mini-Batch**: Sculley, D. (2010). "Web-scale k-means clustering"
---
## Contact & Support
For questions or issues with risk discovery methods:
1. Check comparison report for method-specific metrics
2. Review this guide for selection criteria
3. Experiment with different methods on your data
4. Consider ensemble approaches for critical applications
**Last Updated**: 2024 (8 methods implemented)