| ================================================================================ |
| RISK DISCOVERY METHOD COMPARISON REPORT |
| ================================================================================ |
|
|
| SUMMARY TABLE |
| -------------------------------------------------------------------------------- |
| Method             Patterns   Quality             |
| -------------------------------------------------------------------------------- |
| kmeans             7          Silhouette: 0.017   |
| lda                7          Perplexity: 1186.4  |
| hierarchical       7          Silhouette: N/A     |
| dbscan             1          See details         |
| nmf                7          See details         |
| spectral           7          Silhouette: N/A     |
| gmm                7          See details         |
| minibatch_kmeans   7          See details         |
| risk_o_meter       N/A        Silhouette: 0.024   |
| -------------------------------------------------------------------------------- |
|
|
| DETAILED ANALYSIS |
| ================================================================================ |
|
|
| KMEANS |
| -------------------------------------------------------------------------------- |
| Method: K-Means_Clustering |
| Patterns Discovered: 7 |
| Quality Metrics: |
| - silhouette_score: 0.017 |
| - n_patterns: 3 |
| Pattern Diversity: |
| - avg_pattern_size: 3637.333 |
| - std_pattern_size: 3923.606 |
| - min_pattern_size: 436 |
| - max_pattern_size: 9163 |
| - balance_score: 0.481 |
|
|
| Top 3 Patterns: |
| low_risk_obligation_pattern |
| Keywords: shall, agreement, company, product, insurance |
| Clauses: 9163 |
| low_risk_liability_pattern |
| Keywords: party, consent, damages, agreement, written consent |
| Clauses: 1313 |
| low_risk_compliance_pattern |
| Keywords: laws, state, governed, laws state, shall governed |
| Clauses: 436 |
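The Pattern Diversity figures for K-Means can be reproduced from the three clause counts listed above. The sketch below is an inference from the reported numbers, not the report generator's confirmed code: it assumes a population standard deviation and a balance score of 1 / (1 + std / mean), which happens to match the reported 0.481. The function name `pattern_diversity` is hypothetical.

```python
from statistics import mean, pstdev


def pattern_diversity(sizes):
    """Summarise cluster sizes. The balance_score formula is inferred, not confirmed."""
    avg = mean(sizes)
    std = pstdev(sizes)  # population standard deviation (assumption)
    return {
        "avg_pattern_size": avg,
        "std_pattern_size": std,
        "min_pattern_size": min(sizes),
        "max_pattern_size": max(sizes),
        # Equivalent to 1 / (1 + std / avg); 1.0 means all clusters are the same size.
        "balance_score": avg / (avg + std),
    }


kmeans_sizes = [9163, 1313, 436]  # clause counts listed for K-Means above
stats = pattern_diversity(kmeans_sizes)
print({name: round(value, 3) for name, value in stats.items()})
```

Under this reading, a balance score near 1 indicates evenly sized clusters, while values well below 0.5 indicate that one cluster holds most of the clauses.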
|
|
| LDA |
| -------------------------------------------------------------------------------- |
| Method: LDA_Topic_Modeling |
| Patterns Discovered: 7 |
| Quality Metrics: |
| - perplexity: 1186.381 |
| - avg_topic_diversity: 6.312 |
| Pattern Diversity: |
| - avg_pattern_size: 1974.714 |
| - std_pattern_size: 777.392 |
| - min_pattern_size: 1146 |
| - max_pattern_size: 3426 |
| - balance_score: 0.718 |
|
|
| Top 3 Topics: |
| Topic 0: Topic_PARTY_AGREEMENT |
| Keywords: party, agreement, shall, company, consent |
| Clauses: 2517 (18.2%) |
| Topic 1: Topic_INTELLECTUAL_PROPERTY |
| Keywords: shall, product, products, agreement, section |
| Clauses: 3426 (24.8%) |
| Topic 2: Topic_COMPLIANCE |
| Keywords: shall, agreement, laws, state, governed |
| Clauses: 1314 (9.5%) |
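For reading the LDA perplexity: perplexity is conventionally the exponential of the negative average log-likelihood per token, so lower is better. The sketch below is illustrative only; `token_log_probs` is a hypothetical input, and how the report's pipeline computes the value is not stated. It shows that a model spreading probability uniformly over V words has perplexity V.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical model: every token drawn uniformly from a 1186-word vocabulary.
vocab_size = 1186
uniform = [math.log(1 / vocab_size)] * 500
print(round(perplexity(uniform), 1))
```

By this yardstick, a perplexity of about 1186 is comparable to choosing uniformly among roughly 1186 words for each token.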
|
|
| HIERARCHICAL |
| -------------------------------------------------------------------------------- |
| Method: Hierarchical_Agglomerative_Clustering |
| Patterns Discovered: 7 |
| Quality Metrics: |
| - silhouette_score: N/A |
| - avg_cluster_size: 1974.714 |
| Pattern Diversity: |
| - avg_pattern_size: 1974.714 |
| - std_pattern_size: 3483.902 |
| - min_pattern_size: 91 |
| - max_pattern_size: 10483 |
| - balance_score: 0.362 |
|
|
| Top 3 Clusters: |
| Cluster 0: RISK_AGREEMENT_SHALL |
| Keywords: agreement, shall, party, company, license |
| Clauses: 10483 (75.8%) |
| Cluster 1: RISK_TERM_DATE |
| Keywords: term, date, agreement, effective, effective date |
| Clauses: 1018 (7.4%) |
| Cluster 2: RISK_DAY_2019 |
| Keywords: day, 2019, 2018, 2020, march |
| Clauses: 796 (5.8%) |
|
|
| DBSCAN |
| -------------------------------------------------------------------------------- |
| Method: DBSCAN_Density_Based_Clustering |
| Patterns Discovered: 1 |
| Quality Metrics: |
| - n_clusters: 1 |
| - outlier_ratio: 0.031 |
| - avg_cluster_size: 13396.000 |
| Pattern Diversity: |
| - avg_pattern_size: 13396.000 |
| - std_pattern_size: 0.000 |
| - min_pattern_size: 13396 |
| - max_pattern_size: 13396 |
| - balance_score: 1.000 |
|
|
| Top 3 Clusters: |
| Cluster 0: RISK_CLUSTER_0_AGREEMENT |
| Keywords: agreement, shall, party, company, term |
| Clauses: 13396 (96.9%) |
|
|
| Outliers Detected: 427 (3.1%) |
|   These represent rare or unique risk patterns |
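The outlier ratio above follows directly from the label counts. A minimal sketch, assuming the clustering labels use -1 for noise (the convention used by scikit-learn's DBSCAN); the function name `dbscan_summary` is hypothetical and the label list is rebuilt from the counts in this section rather than from the real run.

```python
from collections import Counter

NOISE = -1  # assumed noise label, as in scikit-learn's DBSCAN


def dbscan_summary(labels):
    """Count clusters and noise points in a list of DBSCAN labels."""
    counts = Counter(labels)
    n_noise = counts.pop(NOISE, 0)  # remove noise so it is not counted as a cluster
    return {
        "n_clusters": len(counts),
        "n_outliers": n_noise,
        "outlier_ratio": n_noise / len(labels),
    }


# Labels rebuilt from the reported counts: 13396 clustered clauses, 427 outliers.
labels = [0] * 13396 + [NOISE] * 427
summary = dbscan_summary(labels)
print(summary["n_clusters"], summary["n_outliers"], round(summary["outlier_ratio"], 3))
```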
|
|
| NMF |
| -------------------------------------------------------------------------------- |
| Method: NMF_Matrix_Factorization |
| Patterns Discovered: 7 |
| Quality Metrics: |
| - reconstruction_error: 116.125 |
| - sparsity: 1.000 |
| - avg_component_strength: 0.000 |
|
|
| SPECTRAL |
| -------------------------------------------------------------------------------- |
| Method: Spectral_Clustering |
| Patterns Discovered: 7 |
| Quality Metrics: |
| - silhouette_score: N/A |
| - n_clusters_found: 7 |
| Pattern Diversity: |
| - avg_pattern_size: 1974.714 |
| - std_pattern_size: 4787.658 |
| - min_pattern_size: 11 |
| - max_pattern_size: 13702 |
| - balance_score: 0.292 |
|
|
| Top 3 Clusters: |
| Cluster 0: SPECTRAL_AGREEMENT_SHALL |
| Keywords: agreement, shall, party, company, term |
| Clauses: 13702 (99.1%) |
| Cluster 1: SPECTRAL_SELLER PERPETUAL_GRANTS SELLER |
| Keywords: seller perpetual, grants seller, arizona field, use arizona, company licensed |
| Clauses: 14 (0.1%) |
| Cluster 2: SPECTRAL_CONSULTING AGREEMENT_CONSULTING |
| Keywords: consulting agreement, consulting, agreement, zynga, events |
| Clauses: 11 (0.1%) |
|
|
| GMM |
| -------------------------------------------------------------------------------- |
| Method: Gaussian_Mixture_Model |
| Patterns Discovered: 7 |
| Quality Metrics: |
| - bic: -5743043.237 |
| - aic: -5753636.167 |
| - avg_confidence: 0.988 |
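As a consistency check on the GMM metrics, the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L give BIC - AIC = k (ln n - 2), so the number of free parameters k can be recovered from the two scores. The sketch below assumes n = 13823 clauses (13396 clustered plus 427 outliers from the DBSCAN section); the report does not state n for the GMM run, and the helper names are hypothetical.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for k free parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion for k free parameters and n samples."""
    return k * math.log(n) - 2 * log_likelihood


def implied_free_parameters(bic_value, aic_value, n):
    """Solve BIC - AIC = k * (ln n - 2) for k."""
    return (bic_value - aic_value) / (math.log(n) - 2)


n_clauses = 13396 + 427  # assumption: same corpus size as the DBSCAN run
k = implied_free_parameters(-5743043.237, -5753636.167, n_clauses)
print(round(k, 1))
```

Under that assumption k comes out at roughly 1406. This would be consistent with, for example, seven diagonal-covariance components over 100 features (700 means, 700 variances, 6 free weights), although the report states neither the covariance type nor the feature count.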
|
|
| MINIBATCH_KMEANS |
| -------------------------------------------------------------------------------- |
| Method: MiniBatch_KMeans |
| Patterns Discovered: 7 |
| Quality Metrics: |
| - inertia: 13303.751 |
| - avg_cluster_cohesion: 0.498 |
| Pattern Diversity: |
| - avg_pattern_size: 1974.714 |
| - std_pattern_size: 4821.530 |
| - min_pattern_size: 2 |
| - max_pattern_size: 13785 |
| - balance_score: 0.291 |
|
|
| Top 3 Clusters: |
| Cluster 0: MB_HARPOON_NOTICE CHANGE CONTROL |
| Keywords: harpoon, notice change control, notice change, abbvie, closing date |
| Clauses: 3 (0.0%) |
| Cluster 1: MB_BUYER_BUYER BUYER |
| Keywords: buyer, buyer buyer, entities, company, request |
| Clauses: 12 (0.1%) |
| Cluster 2: MB_BANK AMERICA_AMERICA |
| Keywords: bank america, america, america affiliates permitted, affiliates permitted assigns, bank |
| Clauses: 6 (0.0%) |
|
|
| RISK_O_METER |
| -------------------------------------------------------------------------------- |
| Method: Risk-o-meter (Doc2Vec + SVM) |
| Patterns Discovered: 0 |
| Quality Metrics: |
| - silhouette_score: 0.024 |
| - embedding_dimension: 100 |
| - doc2vec_epochs: 30 |
| Pattern Diversity: |
| - avg_pattern_size: 1974.714 |
| - std_pattern_size: 1449.941 |
| - min_pattern_size: 534 |
| - max_pattern_size: 4363 |
| - balance_score: 0.577 |
|
|
| Top 3 Patterns: |
| pattern_0 |
| Clauses: 1492 |
| pattern_1 |
| Clauses: 2430 |
| pattern_2 |
| Clauses: 4363 |
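For interpreting the silhouette scores in this report (0.017 for K-Means, 0.024 here): the silhouette coefficient ranges from -1 to 1, and values near 0 indicate heavily overlapping clusters. The sketch below is a small pure-Python illustration with Euclidean distance on toy 2-D points; it is not the report's implementation, and it is quadratic in the number of points, so it suits only small inputs.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance (O(n^2))."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight pairs of points far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])   # labels follow the geometry
mixed = silhouette_score(points, [0, 1, 0, 1])  # labels cut across it
print(round(good, 3), round(mixed, 3))
```

The well-separated labelling scores close to 1 and the mixed labelling scores below 0.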
|
|
| ================================================================================ |
| RECOMMENDATIONS BY METHOD |
| ================================================================================ |
|
|
| ═══ BASIC METHODS (Fast & Reliable) ═══ |
|
|
| 1. K-MEANS (Original): |
|    - Best for: Fast, scalable clustering with clear boundaries |
|    - Use when: You need consistent performance and interpretability |
|    - Speed: Very Fast | Accuracy: Good | Scalability: Excellent |
| |
| 2. LDA TOPIC MODELING: |
|    - Best for: Discovering overlapping risk categories |
|    - Use when: Clauses may belong to multiple risk types |
|    - Speed: Moderate | Accuracy: Very Good | Scalability: Good |
| |
| 3. HIERARCHICAL CLUSTERING: |
|    - Best for: Understanding risk relationships and hierarchies |
|    - Use when: You want to explore risk structure at different levels |
|    - Speed: Moderate | Accuracy: Good | Scalability: Limited (<10K clauses) |
| |
| 4. DBSCAN: |
|    - Best for: Finding rare/unusual risks and handling outliers |
|    - Use when: You need to identify unique risk patterns |
|    - Speed: Fast | Accuracy: Good | Scalability: Good |
|
|
| ═══ ADVANCED METHODS (Comprehensive Analysis) ═══ |
|
|
| 5. NMF (Non-negative Matrix Factorization): |
|    - Best for: Parts-based decomposition with interpretable components |
|    - Use when: You want additive risk factors (clause = sum of components) |
|    - Speed: Fast | Accuracy: Very Good | Scalability: Excellent |
|    - Unique: Components are non-negative, highly interpretable |
| |
| 6. SPECTRAL CLUSTERING: |
|    - Best for: Complex relationships and non-convex cluster shapes |
|    - Use when: Risk patterns have intricate graph-like relationships |
|    - Speed: Slow | Accuracy: Excellent | Scalability: Limited (<5K clauses) |
|    - Unique: Uses eigendecomposition of a similarity graph, best quality for small datasets |
| |
| 7. GAUSSIAN MIXTURE MODEL: |
|    - Best for: Soft probabilistic clustering with uncertainty estimates |
|    - Use when: You need confidence scores for risk assignments |
|    - Speed: Moderate | Accuracy: Very Good | Scalability: Good |
|    - Unique: Provides probability distributions, quantifies uncertainty |
| |
| 8. MINI-BATCH K-MEANS: |
|    - Best for: Ultra-large datasets (100K+ clauses) |
|    - Use when: You need near K-Means quality at 3-5x the speed |
|    - Speed: Ultra Fast | Accuracy: Good | Scalability: Extreme (>1M clauses) |
|    - Unique: Online learning, extremely memory efficient |
|
|
| 9. RISK-O-METER (Doc2Vec + SVM), PAPER BASELINE: |
|    - Best for: Supervised learning with labeled data |
|    - Use when: You have risk labels and want a paper-validated approach |
|    - Speed: Moderate | Accuracy: Excellent (91% reported) | Scalability: Good |
|    - Unique: Paragraph vectors capture semantic meaning, proven in literature |
|    - Reference: Chakrabarti et al., 2018 - "Risk-o-meter framework" |
|
|
| ═══ SELECTION GUIDE ═══ |
|
|
| Dataset Size: |
|   • <1K clauses: Use Spectral or GMM for best quality |
|   • 1K-10K clauses: All methods work well |
|   • 10K-100K clauses: Avoid Hierarchical and Spectral |
|   • >100K clauses: Use Mini-Batch K-Means |
|
|
| Quality Priority: |
|   • Highest: Spectral, GMM, LDA |
|   • Balanced: NMF, K-Means |
|   • Speed-focused: Mini-Batch, DBSCAN |
|
|
| Special Requirements: |
|   • Overlapping risks: LDA, GMM |
|   • Outlier detection: DBSCAN |
|   • Hierarchical structure: Hierarchical |
|   • Interpretability: NMF, LDA |
|   • Uncertainty estimates: GMM, LDA |
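The selection guide above can be encoded as a small lookup. This is a sketch under stated assumptions, not part of the report's tooling: the function names, the requirement keys, and the handling of the exact boundaries (1K, 10K, 100K) are all hypothetical choices. Risk-o-meter is left out because it requires labeled data.

```python
# Unsupervised methods from the summary table.
ALL_METHODS = [
    "kmeans", "lda", "hierarchical", "dbscan",
    "nmf", "spectral", "gmm", "minibatch_kmeans",
]

# "Special Requirements" entries from the guide (keys are hypothetical names).
REQUIREMENT_METHODS = {
    "overlapping_risks": ["lda", "gmm"],
    "outlier_detection": ["dbscan"],
    "hierarchical_structure": ["hierarchical"],
    "interpretability": ["nmf", "lda"],
    "uncertainty_estimates": ["gmm", "lda"],
}


def methods_for_size(n_clauses):
    """Apply the "Dataset Size" rules; boundary handling is an assumption."""
    if n_clauses < 1_000:
        return ["spectral", "gmm"]
    if n_clauses <= 10_000:
        return list(ALL_METHODS)
    if n_clauses <= 100_000:
        return [m for m in ALL_METHODS if m not in ("hierarchical", "spectral")]
    return ["minibatch_kmeans"]


def recommend(n_clauses, requirement=None):
    """Methods suitable for the dataset size, optionally filtered by requirement."""
    candidates = methods_for_size(n_clauses)
    if requirement is None:
        return candidates
    wanted = REQUIREMENT_METHODS[requirement]
    return [m for m in candidates if m in wanted]


print(recommend(13_823))
print(recommend(13_823, "uncertainty_estimates"))
```

An empty result, such as asking for outlier detection on fewer than 1K clauses, means the two rules conflict and the choice needs manual judgment.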
|
|
| ================================================================================ |