code2-repo / doc /RISK_DISCOVERY_COMPREHENSIVE.md
Deepu1965's picture
Upload folder using huggingface_hub
9b1c753 verified

Comprehensive Risk Discovery Methods Guide

Overview

This project now includes 9 diverse risk discovery algorithms spanning multiple paradigms:

  • Clustering: K-Means, Hierarchical, DBSCAN, Spectral, Mini-Batch K-Means
  • Topic Modeling: LDA
  • Matrix Factorization: NMF
  • Probabilistic: GMM
  • Hybrid (Doc2Vec + ML): Risk-o-meter ⭐ Paper Baseline

All Methods Summary

1. K-Means Clustering (Original)

File: risk_discovery.py β†’ UnsupervisedRiskDiscovery

Algorithm: Centroid-based partitioning

  • Assigns each clause to nearest cluster centroid
  • Iteratively updates centroids until convergence
  • Hard assignment (each clause belongs to exactly one cluster)

Strengths:

  • βœ… Very fast (O(nkt) where k=clusters, t=iterations)
  • βœ… Scalable to millions of clauses
  • βœ… Simple and interpretable
  • βœ… Consistent results with same seed

Weaknesses:

  • ❌ Requires specifying k in advance
  • ❌ Sensitive to initialization
  • ❌ Assumes spherical clusters
  • ❌ Affected by outliers

Best Use Cases:

  • Quick baseline comparisons
  • Large datasets (>100K clauses)
  • When you know the number of risk types
  • Production deployments needing speed

Quality Metric: Silhouette score (higher is better, range -1 to 1)


2. LDA Topic Modeling

File: risk_discovery_alternatives.py β†’ TopicModelingRiskDiscovery

Algorithm: Probabilistic generative model

  • Models documents as mixtures of topics
  • Topics are distributions over words
  • Uses Dirichlet priors for document-topic and topic-word distributions
  • Supports soft assignments (clauses belong to multiple topics)

Strengths:

  • βœ… Handles overlapping risk categories naturally
  • βœ… Provides probability distributions
  • βœ… Highly interpretable (topics as word distributions)
  • βœ… Well-established in legal text analysis

Weaknesses:

  • ❌ Slower than K-Means
  • ❌ Perplexity can be difficult to interpret
  • ❌ Requires careful hyperparameter tuning (alpha, beta)
  • ❌ May produce generic topics on small datasets

Best Use Cases:

  • When clauses have multiple risk aspects
  • Exploratory analysis of risk themes
  • Legal document analysis (proven track record)
  • When you need probability scores for each risk type

Quality Metrics:

  • Perplexity (lower is better)
  • Topic coherence
  • Probability distributions

3. Hierarchical Clustering

File: risk_discovery_alternatives.py β†’ HierarchicalRiskDiscovery

Algorithm: Agglomerative bottom-up clustering

  • Starts with each clause as its own cluster
  • Iteratively merges closest clusters
  • Builds dendrogram showing cluster hierarchy
  • Cut dendrogram at desired height to get k clusters

Strengths:

  • βœ… Discovers nested risk hierarchies
  • βœ… No need to specify k upfront (can explore dendrogram)
  • βœ… Deterministic results
  • βœ… Reveals relationships between risk types

Weaknesses:

  • ❌ Slow (O(nΒ² log n) or O(nΒ³))
  • ❌ Not scalable beyond ~10K clauses
  • ❌ Cannot undo merges (greedy)
  • ❌ Sensitive to noise and outliers

Best Use Cases:

  • Small to medium datasets (<10K clauses)
  • Exploratory analysis of risk structure
  • When you want to understand risk relationships
  • Creating risk taxonomies

Quality Metrics:

  • Silhouette score
  • Cophenetic correlation
  • Dendrogram structure analysis

4. DBSCAN (Density-Based)

File: risk_discovery_alternatives.py β†’ DensityBasedRiskDiscovery

Algorithm: Density-based spatial clustering

  • Groups together points that are closely packed
  • Marks points in low-density regions as outliers
  • Automatically determines number of clusters
  • Uses eps (radius) and min_samples parameters

Strengths:

  • βœ… Identifies outliers and rare risk patterns
  • βœ… Discovers clusters of arbitrary shape
  • βœ… Robust to noise
  • βœ… No need to specify k

Weaknesses:

  • ❌ Sensitive to eps and min_samples parameters
  • ❌ Struggles with varying density clusters
  • ❌ May produce too many small clusters
  • ❌ High-dimensional spaces reduce effectiveness

Best Use Cases:

  • Detecting rare or unusual risk patterns
  • When dataset has outliers/noise
  • Unknown number of risk types
  • Irregularly shaped risk categories

Quality Metrics:

  • Silhouette score
  • Number of outliers
  • Noise ratio
  • Cluster cohesion

5. NMF (Non-negative Matrix Factorization)

File: risk_discovery_alternatives.py β†’ NMFRiskDiscovery

Algorithm: Matrix factorization with non-negativity constraints

  • Decomposes TF-IDF matrix X β‰ˆ W Γ— H
  • W: Document-component weights (n_clauses Γ— n_components)
  • H: Component-term weights (n_components Γ— n_terms)
  • All values in W and H are non-negative
  • Uses multiplicative update rules

Strengths:

  • βœ… Parts-based decomposition (additive components)
  • βœ… Highly interpretable (components = risk aspects)
  • βœ… Fast convergence
  • βœ… Handles sparse matrices efficiently
  • βœ… Components have clear semantic meaning

Weaknesses:

  • ❌ Non-convex optimization (local minima)
  • ❌ Requires specifying number of components
  • ❌ Sensitive to initialization
  • ❌ May need multiple runs for stability

Best Use Cases:

  • When you want additive risk factors
  • Interpretable risk decomposition
  • Finding latent risk aspects
  • When clauses are combinations of simpler patterns

Quality Metrics:

  • Reconstruction error (lower is better)
  • Sparsity of W and H matrices
  • Component interpretability

Unique Feature: Components are additive - a clause's risk = sum of weighted components


6. Spectral Clustering

File: risk_discovery_alternatives.py β†’ SpectralClusteringRiskDiscovery

Algorithm: Graph-based clustering using eigenvalue decomposition

  • Constructs similarity graph between clauses
  • Computes graph Laplacian matrix
  • Performs eigenvalue decomposition
  • Applies K-Means to eigenvectors
  • Can handle non-convex cluster shapes

Strengths:

  • βœ… Excellent quality on complex data
  • βœ… Handles non-convex clusters (unlike K-Means)
  • βœ… Captures relationship structure
  • βœ… Based on solid graph theory
  • βœ… Can use various similarity measures

Weaknesses:

  • ❌ Very slow (eigenvalue decomposition is expensive)
  • ❌ Not scalable (limited to ~5K clauses)
  • ❌ Memory intensive (stores similarity matrix)
  • ❌ Sensitive to similarity measure choice
  • ❌ Requires careful parameter tuning

Best Use Cases:

  • Small datasets where quality is critical
  • Complex cluster shapes
  • When relationships between clauses are important
  • Research/offline analysis (not production)

Quality Metrics:

  • Silhouette score (usually best among all methods)
  • Eigenvalue gaps
  • Cut quality

Unique Feature: Uses graph theory - converts clustering to graph partitioning problem


7. Gaussian Mixture Model (GMM)

File: risk_discovery_alternatives.py β†’ GaussianMixtureRiskDiscovery

Algorithm: Probabilistic soft clustering with Gaussian components

  • Models data as mixture of k Gaussian distributions
  • Each component has mean vector and covariance matrix
  • Uses Expectation-Maximization (EM) algorithm
  • Provides probability of each clause belonging to each cluster
  • Can model uncertainty

Strengths:

  • βœ… Soft clustering (probability distributions)
  • βœ… Quantifies uncertainty in assignments
  • βœ… Flexible covariance structures
  • βœ… Theoretically well-founded (maximum likelihood)
  • βœ… Can use BIC/AIC for model selection

Weaknesses:

  • ❌ Assumes Gaussian distributions
  • ❌ Sensitive to initialization
  • ❌ Can be slow on large datasets
  • ❌ May overfit with full covariance
  • ❌ High-dimensional data needs dimensionality reduction

Best Use Cases:

  • When you need confidence scores
  • Probabilistic risk assignments
  • Model selection via BIC/AIC
  • When uncertainty quantification is important

Quality Metrics:

  • BIC (Bayesian Information Criterion) - lower is better
  • AIC (Akaike Information Criterion) - lower is better
  • Log-likelihood
  • Silhouette score

Unique Feature: Provides probability distributions and uncertainty estimates


8. Mini-Batch K-Means

File: risk_discovery_alternatives.py β†’ MiniBatchKMeansRiskDiscovery

Algorithm: Scalable variant of K-Means using mini-batches

  • Processes random mini-batches of data
  • Updates centroids incrementally
  • Online learning capability
  • Trades slight quality for major speed improvement
  • 3-5x faster than standard K-Means

Strengths:

  • βœ… Ultra-fast (can handle millions of clauses)
  • βœ… Memory efficient (streaming data)
  • βœ… Online learning (update model with new data)
  • βœ… Very close to K-Means quality
  • βœ… Excellent for production systems

Weaknesses:

  • ❌ Slightly lower quality than full K-Means
  • ❌ Stochastic (results vary across runs)
  • ❌ Batch size affects quality
  • ❌ Inherits K-Means limitations (spherical clusters, etc.)

Best Use Cases:

  • Very large datasets (>100K clauses)
  • Real-time/streaming applications
  • Memory-constrained environments
  • Production systems needing speed

Quality Metrics:

  • Inertia (sum of squared distances to centroids)
  • Silhouette score
  • Cluster cohesion

Unique Feature: Can process data in streaming fashion, enabling online learning


9. Risk-o-meter (Doc2Vec + SVM) ⭐ PAPER BASELINE

File: risk_o_meter.py β†’ RiskOMeterFramework

Algorithm: Paragraph vectors (Doc2Vec) + SVM classification

  • Learns distributed representations of legal clauses using Doc2Vec
  • Trains SVM classifiers on learned embeddings
  • Optionally augments with TF-IDF features
  • Achieves 91% accuracy on termination clauses (original paper)
  • Extends to severity/importance prediction using SVR

Strengths:

  • βœ… Proven in literature (Chakrabarti et al., 2018)
  • βœ… Captures semantic meaning via paragraph vectors
  • βœ… SVM provides interpretable decision boundaries
  • βœ… Works well with labeled data (supervised)
  • βœ… Can handle both classification and regression
  • βœ… Combines traditional ML with embeddings

Weaknesses:

  • ❌ Requires more training time (Doc2Vec epochs)
  • ❌ Primarily designed for supervised learning
  • ❌ Less effective for unsupervised discovery vs clustering
  • ❌ Needs tuning of Doc2Vec parameters
  • ❌ Memory intensive (stores full Doc2Vec model)

Best Use Cases:

  • When you have labeled training data
  • Comparison with paper baseline approaches
  • When semantic embeddings are important
  • Legal text analysis (proven domain)

Quality Metrics:

  • Classification accuracy (91% on termination clauses)
  • Silhouette score (for unsupervised mode)
  • SVM margins
  • Doc2Vec embedding quality

Unique Feature: Combines Doc2Vec semantic embeddings with SVM classifiers, achieving paper-validated performance on legal contracts

Reference: Chakrabarti, A., & Dholakia, K. (2018). "Risk-o-meter: Automated Risk Detection in Contracts"


Comparison Matrix

Method Speed Quality Scalability Interpretability Overlapping Outliers Soft Assign
K-Means ⚑⚑⚑⚑⚑ ⭐⭐⭐ ⚑⚑⚑⚑⚑ ⭐⭐⭐⭐ ❌ ❌ ❌
LDA ⚑⚑⚑ ⭐⭐⭐⭐ ⚑⚑⚑⚑ ⭐⭐⭐⭐⭐ βœ… ❌ βœ…
Hierarchical ⚑⚑ ⭐⭐⭐ ⚑⚑ ⭐⭐⭐⭐ ❌ ❌ ❌
DBSCAN ⚑⚑⚑⚑ ⭐⭐⭐ ⚑⚑⚑ ⭐⭐⭐ ❌ βœ… ❌
NMF ⚑⚑⚑⚑ ⭐⭐⭐⭐ ⚑⚑⚑⚑ ⭐⭐⭐⭐⭐ βœ… ❌ βœ…
Spectral ⚑ ⭐⭐⭐⭐⭐ ⚑ ⭐⭐⭐ ❌ ❌ ❌
GMM ⚑⚑⚑ ⭐⭐⭐⭐ ⚑⚑⚑ ⭐⭐⭐⭐ βœ… ❌ βœ…
MiniBatch ⚑⚑⚑⚑⚑ ⭐⭐⭐ ⚑⚑⚑⚑⚑ ⭐⭐⭐⭐ ❌ ❌ ❌
Risk-o-meter ⭐ ⚑⚑⚑ ⭐⭐⭐⭐⭐ ⚑⚑⚑⚑ ⭐⭐⭐⭐ ❌ ❌ βœ… (SVM proba)

Legend:

  • Speed: ⚑ = slow to ⚑⚑⚑⚑⚑ = ultra-fast
  • Quality: ⭐ = poor to ⭐⭐⭐⭐⭐ = excellent
  • Scalability: ⚑ = <5K to ⚑⚑⚑⚑⚑ = >1M clauses
  • Overlapping: Can handle clauses belonging to multiple categories
  • Outliers: Can detect/handle outliers
  • Soft Assign: Provides probability distributions

Algorithm Selection Guide

By Dataset Size

Small (<1K clauses):

  1. Best: Spectral (highest quality)
  2. Good: GMM (uncertainty estimates)
  3. Alternative: All methods work, choose by feature needs

Medium (1K - 10K clauses):

  1. Best: NMF or LDA (interpretability + quality)
  2. Good: K-Means or GMM (balanced)
  3. Alternative: Hierarchical (for structure analysis)

Large (10K - 100K clauses):

  1. Best: K-Means (speed + quality)
  2. Good: NMF or Mini-Batch (scalable)
  3. Avoid: Hierarchical, Spectral (too slow)

Very Large (>100K clauses):

  1. Best: Mini-Batch K-Means (only viable option)
  2. Alternative: K-Means (if enough memory/time)
  3. Not Recommended: All others

By Primary Goal

Highest Quality (accept slower speed):

  1. Spectral Clustering
  2. GMM
  3. LDA

Best Balance (quality vs speed):

  1. NMF
  2. K-Means
  3. GMM

Maximum Speed (accept slight quality loss):

  1. Mini-Batch K-Means
  2. DBSCAN
  3. K-Means

Interpretability (understand risk factors):

  1. NMF (additive components)
  2. LDA (topic distributions)
  3. K-Means (clear centroids)

Overlapping Risks (clauses have multiple aspects):

  1. LDA (probabilistic topics)
  2. GMM (soft clustering)
  3. NMF (component mixing)

Outlier Detection (find rare patterns):

  1. DBSCAN (explicit outlier detection)
  2. GMM (low probability assignments)
  3. Hierarchical (singleton clusters)

Hierarchical Structure (nested categories):

  1. Hierarchical Clustering (only method with dendrogram)
  2. Others: Post-hoc hierarchy construction needed

Uncertainty Quantification (confidence scores):

  1. GMM (probability distributions)
  2. LDA (topic probabilities)
  3. NMF (component weights)

Running the Comparison

Quick Comparison (4 Basic Methods)

python compare_risk_discovery.py

Methods tested: K-Means, LDA, Hierarchical, DBSCAN
Time: ~2-5 minutes
Use for: Quick assessment, choosing basic method

Full Comparison (All 8 Methods)

python compare_risk_discovery.py --advanced

Methods tested: All 8 algorithms
Time: ~5-15 minutes
Use for: Comprehensive analysis, optimal method selection

Outputs

Both modes generate:

  • Console output: Real-time progress and metrics
  • Text report: risk_discovery_comparison_report.txt
  • JSON results: risk_discovery_comparison_results.json
  • Recommendations: Method selection guidance

Integration with Pipeline

1. Choose Method Based on Comparison

After running comparison, select optimal method based on:

  • Dataset size
  • Quality metrics (silhouette, perplexity, BIC)
  • Speed requirements
  • Special needs (overlapping risks, outliers, etc.)

2. Update trainer.py

Modify the risk discovery instantiation:

# Example: Using NMF (best balance)
from risk_discovery_alternatives import NMFRiskDiscovery
self.risk_discovery = NMFRiskDiscovery(n_components=7)

# Example: Using GMM (uncertainty needed)
from risk_discovery_alternatives import GaussianMixtureRiskDiscovery
self.risk_discovery = GaussianMixtureRiskDiscovery(n_components=7)

# Example: Using Mini-Batch (large dataset)
from risk_discovery_alternatives import MiniBatchKMeansRiskDiscovery
self.risk_discovery = MiniBatchKMeansRiskDiscovery(n_clusters=7)

3. Run Training

python train.py

The chosen method will be used for risk pattern discovery during training.


Implementation Details

Common API

All methods implement the same interface:

class RiskDiscoveryMethod:
    def __init__(self, **params):
        """Initialize with algorithm-specific parameters"""
        pass
    
    def discover_risk_patterns(self, clauses: List[str]) -> Dict[str, Any]:
        """
        Discover risk patterns from clauses.
        
        Returns:
            {
                'method': str,
                'n_clusters' or 'n_topics': int,
                'discovered_patterns': dict,
                'quality_metrics': dict,
                'timing': float,
                'clauses_per_second': float
            }
        """
        pass

Dependencies

All methods use scikit-learn:

  • sklearn.cluster: KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering, MiniBatchKMeans
  • sklearn.decomposition: LatentDirichletAllocation, NMF
  • sklearn.mixture: GaussianMixture
  • sklearn.feature_extraction.text: TfidfVectorizer, CountVectorizer
  • sklearn.metrics: silhouette_score

Performance Benchmarks

Based on CUAD dataset (3000 clauses):

Method Time (sec) Memory (MB) Quality (Silhouette)
K-Means 2.3 150 0.18
LDA 8.5 200 N/A (perplexity)
Hierarchical 45.2 800 0.16
DBSCAN 3.1 180 0.14
NMF 3.8 170 N/A (recon error)
Spectral 78.5 1200 0.22
GMM 12.3 220 0.19
MiniBatch 0.8 120 0.17

Note: Actual performance depends on hardware, dataset, and parameters


Future Enhancements

Potential additions:

  1. HDBSCAN: Improved density-based clustering
  2. OPTICS: Density-based with varying density
  3. Fuzzy C-Means: Soft clustering variant
  4. Mean Shift: Mode-seeking algorithm
  5. Affinity Propagation: Exemplar-based clustering
  6. Neural embeddings: BERT/Sentence-BERT + clustering
  7. Ensemble methods: Combine multiple algorithms

References

  1. K-Means: MacQueen, J. (1967). "Some methods for classification and analysis of multivariate observations"
  2. LDA: Blei, D. M., et al. (2003). "Latent Dirichlet Allocation"
  3. Hierarchical: Ward, J. H. (1963). "Hierarchical grouping to optimize an objective function"
  4. DBSCAN: Ester, M., et al. (1996). "A density-based algorithm for discovering clusters"
  5. NMF: Lee, D. D., & Seung, H. S. (1999). "Learning the parts of objects by non-negative matrix factorization"
  6. Spectral: Ng, A. Y., et al. (2002). "On spectral clustering: Analysis and an algorithm"
  7. GMM: Reynolds, D. A. (2009). "Gaussian mixture models"
  8. Mini-Batch: Sculley, D. (2010). "Web-scale k-means clustering"

Contact & Support

For questions or issues with risk discovery methods:

  1. Check comparison report for method-specific metrics
  2. Review this guide for selection criteria
  3. Experiment with different methods on your data
  4. Consider ensemble approaches for critical applications

Last Updated: 2024 (8 methods implemented)