| # Fix for KeyError: 'method' in Risk Discovery Comparison | |
| ## Problem | |
| When running `compare_risk_discovery.py`, the script failed with: | |
| ``` | |
| KeyError: 'method' | |
| ``` | |
| This occurred because the K-Means implementation (`UnsupervisedRiskDiscovery`) returned its results in a different format from the other discovery methods. | |
| ## Root Cause | |
| Different discovery methods were returning different data structures: | |
| ### Other Methods (LDA, NMF, etc.) returned: | |
| ```python | |
| { | |
| 'method': 'LDA_Topic_Modeling', | |
| 'n_topics': 7, | |
| 'discovered_topics': {...}, | |
| 'quality_metrics': {...} | |
| } | |
| ``` | |
| ### K-Means returned: | |
| ```python | |
| { | |
| # Just the patterns dictionary, no metadata | |
| 'pattern_1': {...}, | |
| 'pattern_2': {...} | |
| } | |
| ``` | |
| The comparison function expected all methods to return a consistent structure with metadata. | |
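The failure can be reproduced in a few lines. The sketch below is not the actual comparison code: `list_methods` is a hypothetical stand-in for whatever part of `compare_risk_discovery.py` reads the `method` key, and the two result dictionaries only mirror the shapes shown above.

```python
from typing import Any, Dict, List


def list_methods(results: List[Dict[str, Any]]) -> List[str]:
    """Hypothetical stand-in for the comparison loop: read each result's method name."""
    return [res['method'] for res in results]


# Shape returned by LDA, NMF, etc. (values elided in the original, empty here).
lda_result = {
    'method': 'LDA_Topic_Modeling',
    'n_topics': 7,
    'discovered_topics': {},
    'quality_metrics': {},
}

# Shape K-Means returned before the fix: patterns only, no metadata.
legacy_kmeans_result = {'pattern_1': {}, 'pattern_2': {}}

try:
    list_methods([lda_result, legacy_kmeans_result])
    failure = None
except KeyError as exc:
    failure = exc.args[0]  # name of the key that was missing
```

Here `failure` ends up holding the missing key name, which matches the error message the script printed.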
| ## Solution | |
| ### 1. Fixed K-Means Return Format (`risk_discovery.py`) | |
| **Before:** | |
| ```python | |
| def discover_risk_patterns(self, clause_texts: List[str]) -> Dict[str, Any]: | |
|     # ... clustering logic ... | |
|     return self.discovered_patterns  # Just patterns dict | |
| ``` | |
| **After:** | |
| ```python | |
| def discover_risk_patterns(self, clause_texts: List[str]) -> Dict[str, Any]: | |
|     # ... clustering logic ... | |
|     # Calculate quality metrics | |
|     from sklearn.metrics import silhouette_score | |
|     try: | |
|         silhouette = silhouette_score(self.feature_matrix, self.cluster_labels) | |
|     except Exception: | |
|         # e.g. ValueError when fewer than two distinct clusters were found | |
|         silhouette = 0.0 | |
|     # Return structured results for comparison | |
|     return { | |
|         'method': 'K-Means_Clustering', | |
|         'n_clusters': self.n_clusters, | |
|         'discovered_patterns': self.discovered_patterns, | |
|         'cluster_labels': self.cluster_labels, | |
|         'quality_metrics': { | |
|             'silhouette_score': silhouette, | |
|             'n_patterns': len(self.discovered_patterns) | |
|         } | |
|     } | |
| ``` | |
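The fallback to `0.0` matters because scikit-learn's `silhouette_score` is only defined when the number of distinct labels is at least 2 and at most `n_samples - 1`; outside that range it raises `ValueError`. As an alternative to catching the exception, the precondition can be checked explicitly. The helper below is a sketch, not part of the project: `safe_silhouette` is a hypothetical name, and the scoring function is passed in as an argument so the example runs without scikit-learn installed.

```python
from typing import Any, Callable, Sequence


def safe_silhouette(
    features: Any,
    labels: Sequence[int],
    score_fn: Callable[[Any, Sequence[int]], float],
    default: float = 0.0,
) -> float:
    """Return score_fn(features, labels), or `default` when the score is undefined."""
    n_samples = len(labels)
    n_labels = len(set(labels))
    # The silhouette score needs 2 <= n_labels <= n_samples - 1.
    if n_labels < 2 or n_labels >= n_samples:
        return default
    return float(score_fn(features, labels))


def fake_score(features: Any, labels: Sequence[int]) -> float:
    """Placeholder scorer used only for this demonstration."""
    return 0.5


single_cluster = safe_silhouette([[0.0], [1.0], [2.0]], [0, 0, 0], fake_score)
two_clusters = safe_silhouette([[0.0], [1.0], [2.0]], [0, 0, 1], fake_score)
```

In the real method the call would presumably look like `safe_silhouette(self.feature_matrix, self.cluster_labels, silhouette_score)`.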
| ### 2. Fixed Report Pattern Display (`compare_risk_discovery.py`) | |
| Updated the pattern display code to accept both naming schemes (`key_terms` or `top_keywords`, `clause_count` or `size`): | |
| **Before:** | |
| ```python | |
| elif 'discovered_patterns' in res: | |
|     report.append("\nTop 3 Patterns:") | |
|     for i, (pattern_id, pattern) in enumerate(list(res['discovered_patterns'].items())[:3]): | |
|         report.append(f" Pattern {pattern_id}: {pattern.get('name', 'Unnamed')}") | |
|         report.append(f" Keywords: {', '.join(pattern.get('top_keywords', [])[:5])}") | |
|         report.append(f" Clauses: {pattern.get('size', 0)}") | |
| ``` | |
| **After:** | |
| ```python | |
| elif 'discovered_patterns' in res: | |
|     report.append("\nTop 3 Patterns:") | |
|     for i, (pattern_id, pattern) in enumerate(list(res['discovered_patterns'].items())[:3]): | |
|         # Handle different pattern formats | |
|         pattern_name = pattern_id if isinstance(pattern_id, str) else pattern.get('name', f'Pattern {pattern_id}') | |
|         keywords = pattern.get('key_terms', pattern.get('top_keywords', [])) | |
|         clause_count = pattern.get('clause_count', pattern.get('size', 0)) | |
|         report.append(f" {pattern_name}") | |
|         if keywords: | |
|             report.append(f" Keywords: {', '.join(keywords[:5])}") | |
|         report.append(f" Clauses: {clause_count}") | |
| ``` | |
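The fallback lookups can be tried in isolation. The sketch below pulls the same logic into a hypothetical helper, `describe_pattern`, and runs it on two made-up pattern dictionaries, one per naming scheme. The sample keywords and counts are invented for illustration and do not come from real output.

```python
from typing import Any, Dict, List


def describe_pattern(pattern_id: Any, pattern: Dict[str, Any]) -> List[str]:
    """Build report lines for one pattern, accepting either naming scheme."""
    # String IDs are used as the display name; otherwise fall back to 'name'.
    if isinstance(pattern_id, str):
        name = pattern_id
    else:
        name = pattern.get('name', f'Pattern {pattern_id}')
    # Prefer the K-Means keys, then fall back to the keys used by other methods.
    keywords = pattern.get('key_terms', pattern.get('top_keywords', []))
    clause_count = pattern.get('clause_count', pattern.get('size', 0))

    lines = [f"  {name}"]
    if keywords:
        lines.append(f"    Keywords: {', '.join(keywords[:5])}")
    lines.append(f"    Clauses: {clause_count}")
    return lines


# Invented sample data, one dictionary per naming scheme.
kmeans_style = {'key_terms': ['liability', 'damages'], 'clause_count': 12}
other_style = {'name': 'Termination', 'top_keywords': ['terminate', 'notice'], 'size': 8}

kmeans_lines = describe_pattern('pattern_1', kmeans_style)
other_lines = describe_pattern(0, other_style)
```

Because the inner `.get()` is evaluated eagerly, both keys are looked up every time; that is harmless for plain dictionaries like these.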
| ## Result | |
| All discovery methods now return consistent data structures: | |
| ```python | |
| { | |
| 'method': '<method_name>', # Method identifier | |
| 'n_clusters' or 'n_topics': int, # Number of patterns | |
| 'discovered_*': {...}, # Pattern details | |
| 'quality_metrics': {...} # Performance metrics | |
| } | |
| ``` | |
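One way to keep the formats from drifting apart again is to check each result against this shape before comparing. The function below is only a sketch under that assumption; `missing_fields` is a hypothetical name and is not part of either modified file.

```python
from typing import Any, Dict, List


def missing_fields(result: Dict[str, Any]) -> List[str]:
    """Return the parts of the common result shape that `result` lacks."""
    problems: List[str] = []
    if 'method' not in result:
        problems.append('method')
    if 'n_clusters' not in result and 'n_topics' not in result:
        problems.append('n_clusters|n_topics')
    # Any key such as 'discovered_patterns' or 'discovered_topics' qualifies.
    if not any(isinstance(key, str) and key.startswith('discovered_') for key in result):
        problems.append('discovered_*')
    if 'quality_metrics' not in result:
        problems.append('quality_metrics')
    return problems


fixed_kmeans = {
    'method': 'K-Means_Clustering',
    'n_clusters': 7,
    'discovered_patterns': {},
    'cluster_labels': [],
    'quality_metrics': {},
}
legacy_kmeans = {'pattern_1': {}, 'pattern_2': {}}

fixed_problems = missing_fields(fixed_kmeans)
legacy_problems = missing_fields(legacy_kmeans)
```

An empty list means the result can be passed to the comparison safely; the legacy K-Means output fails all four checks.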
| ## Files Modified | |
| 1. `risk_discovery.py` - Updated `discover_risk_patterns()` return value | |
| 2. `compare_risk_discovery.py` - Updated pattern display to handle different formats | |
| ## Testing | |
| Install the dependencies, then run the comparison: | |
| ```bash | |
| cd /home/deepu/Downloads/code2  # adjust to your local checkout | |
| pip install -r requirements.txt | |
| python3 compare_risk_discovery.py # Basic comparison (4 methods) | |
| python3 compare_risk_discovery.py --advanced # Full comparison (9 methods) | |
| ``` | |
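The commands above imply a small command-line interface. The script's real argument handling is not shown in this document, so the following is an assumed sketch using `argparse`: the flag names `--advanced` and `--max-clauses` come from this note, while the help texts and the `None` default are guesses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Assumed CLI for compare_risk_discovery.py (flag names taken from this note)."""
    parser = argparse.ArgumentParser(description="Compare risk discovery methods")
    parser.add_argument(
        '--advanced',
        action='store_true',
        help="run the full comparison instead of the basic one",
    )
    parser.add_argument(
        '--max-clauses',
        type=int,
        default=None,  # assumption: None means "use the full dataset"
        help="optional cap on the number of clauses to process",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(['--advanced', '--max-clauses', '500'])
defaults = build_parser().parse_args([])
```

`argparse` converts the dash in `--max-clauses` to an underscore, so the value is read as `args.max_clauses`.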
| ## Additional Fixes in This Session | |
| 1. **NMF Parameter Compatibility** - Added version detection for scikit-learn API differences | |
| 2. **Full Dataset Support** - Removed clause limits, added `--max-clauses` CLI option | |
| 3. **Consistent Return Formats** - Standardized all discovery methods | |
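For the first item, this note does not say how the version detection was implemented or which parameter differed. One possible approach, sketched below, avoids comparing version strings and instead inspects the callable's signature, passing only the keyword arguments it accepts. Everything here is hypothetical: `accepted_kwargs`, `legacy_factorizer`, and `current_factorizer` are stand-ins, and the parameter names are placeholders for whatever differs between releases.

```python
import inspect
from typing import Any, Callable, Dict


def accepted_kwargs(func: Callable[..., Any], **candidates: Any) -> Dict[str, Any]:
    """Keep only the keyword arguments that `func` names in its signature."""
    params = inspect.signature(func).parameters
    return {name: value for name, value in candidates.items() if name in params}


# Stand-ins for the same constructor in an older and a newer library release.
def legacy_factorizer(n_components: int, alpha: float = 0.0) -> Dict[str, Any]:
    return {'n_components': n_components, 'alpha': alpha}


def current_factorizer(n_components: int, alpha_W: float = 0.0, alpha_H: float = 0.0) -> Dict[str, Any]:
    return {'n_components': n_components, 'alpha_W': alpha_W, 'alpha_H': alpha_H}


# Offer both spellings; each callable receives only the ones it understands.
options = {'alpha': 0.1, 'alpha_W': 0.1, 'alpha_H': 0.1}
legacy_model = legacy_factorizer(7, **accepted_kwargs(legacy_factorizer, **options))
current_model = current_factorizer(7, **accepted_kwargs(current_factorizer, **options))
```

This filter only recognizes explicitly named parameters, so a callable that takes `**kwargs` would need separate handling.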
| All 9 risk discovery methods should now work correctly! | |