# Fix for KeyError: 'method' in Risk Discovery Comparison

## Problem

Running `compare_risk_discovery.py` failed with:

```
KeyError: 'method'
```

The K-Means implementation (`UnsupervisedRiskDiscovery`) returned its results in a different format from the other discovery methods, so the comparison code could not find the `'method'` key.

## Root Cause

The discovery methods returned different data structures.

### Other Methods (LDA, NMF, etc.) returned:

```python
{
    'method': 'LDA_Topic_Modeling',
    'n_topics': 7,
    'discovered_topics': {...},
    'quality_metrics': {...}
}
```

### K-Means returned:

```python
{
    # Just the patterns dictionary, no metadata
    'pattern_1': {...},
    'pattern_2': {...}
}
```

The comparison function expects every method to return the same structure, including the metadata keys.

## Solution

### 1. Fixed K-Means Return Format (`risk_discovery.py`)

**Before:**

```python
def discover_risk_patterns(self, clause_texts: List[str]) -> Dict[str, Any]:
    # ... clustering logic ...
    return self.discovered_patterns  # Just patterns dict
```

**After:**

```python
def discover_risk_patterns(self, clause_texts: List[str]) -> Dict[str, Any]:
    # ... clustering logic ...

    # Calculate quality metrics
    from sklearn.metrics import silhouette_score
    try:
        silhouette = silhouette_score(self.feature_matrix, self.cluster_labels)
    except Exception:
        # Fall back to 0.0 if the score cannot be computed
        # (e.g. fewer than two distinct clusters)
        silhouette = 0.0

    # Return structured results for comparison
    return {
        'method': 'K-Means_Clustering',
        'n_clusters': self.n_clusters,
        'discovered_patterns': self.discovered_patterns,
        'cluster_labels': self.cluster_labels,
        'quality_metrics': {
            'silhouette_score': silhouette,
            'n_patterns': len(self.discovered_patterns)
        }
    }
```

### 2. Fixed Report Pattern Display (`compare_risk_discovery.py`)

Updated the pattern display code to handle the different attribute names used by each method (`key_terms` vs. `top_keywords`, `clause_count` vs. `size`):

**Before:**

```python
elif 'discovered_patterns' in res:
    report.append("\nTop 3 Patterns:")
    for i, (pattern_id, pattern) in enumerate(list(res['discovered_patterns'].items())[:3]):
        report.append(f" Pattern {pattern_id}: {pattern.get('name', 'Unnamed')}")
        report.append(f" Keywords: {', '.join(pattern.get('top_keywords', [])[:5])}")
        report.append(f" Clauses: {pattern.get('size', 0)}")
```

**After:**

```python
elif 'discovered_patterns' in res:
    report.append("\nTop 3 Patterns:")
    for i, (pattern_id, pattern) in enumerate(list(res['discovered_patterns'].items())[:3]):
        # Handle different pattern formats
        pattern_name = pattern_id if isinstance(pattern_id, str) else pattern.get('name', f'Pattern {pattern_id}')
        keywords = pattern.get('key_terms', pattern.get('top_keywords', []))
        clause_count = pattern.get('clause_count', pattern.get('size', 0))

        report.append(f" {pattern_name}")
        if keywords:
            report.append(f" Keywords: {', '.join(keywords[:5])}")
        report.append(f" Clauses: {clause_count}")
```

## Result

All discovery methods now return a consistent data structure:

```python
{
    'method': '',                      # Method identifier
    'n_clusters' or 'n_topics': int,   # Number of patterns
    'discovered_*': {...},             # Pattern details
    'quality_metrics': {...}           # Performance metrics
}
```

## Files Modified

1. `risk_discovery.py` - Updated the `discover_risk_patterns()` return value
2. `compare_risk_discovery.py` - Updated the pattern display to handle different formats

## Testing

Once dependencies are installed:

```bash
cd /home/deepu/Downloads/code2
pip install -r requirements.txt
python3 compare_risk_discovery.py             # Basic comparison (4 methods)
python3 compare_risk_discovery.py --advanced  # Full comparison (9 methods)
```

## Additional Fixes in This Session

1. **NMF Parameter Compatibility** - Added version detection for scikit-learn API differences
2. **Full Dataset Support** - Removed clause limits, added `--max-clauses` CLI option
3. **Consistent Return Formats** - Standardized all discovery methods

All 9 risk discovery methods should now work correctly.
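
A small structural check could guard the return format described under Result, so that a method with missing metadata is reported before it raises a `KeyError` in the comparison. The following is only a minimal sketch: `validate_result` and `REQUIRED_KEYS` are hypothetical names that do not exist in `risk_discovery.py` or `compare_risk_discovery.py`, and the sample dictionaries are placeholders modelled on the formats shown in this document.

```python
from typing import Any, Dict, List

# Keys every discovery method is expected to return (per the Result section).
REQUIRED_KEYS = ("method", "quality_metrics")


def validate_result(result: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one method's result (empty if OK).

    Hypothetical helper for illustration; not part of the original scripts.
    """
    problems = []
    for key in REQUIRED_KEYS:
        if key not in result:
            problems.append(f"missing key: {key!r}")
    # The pattern count is stored under one of two names.
    if not any(k in result for k in ("n_clusters", "n_topics")):
        problems.append("missing key: 'n_clusters' or 'n_topics'")
    # Pattern details live under a key such as 'discovered_patterns'
    # or 'discovered_topics'.
    if not any(isinstance(k, str) and k.startswith("discovered_") for k in result):
        problems.append("missing key: 'discovered_*'")
    return problems


# Old K-Means format: just the patterns dictionary, no metadata.
old_kmeans = {"pattern_1": {}, "pattern_2": {}}

# New K-Means format with metadata (placeholder values).
new_kmeans = {
    "method": "K-Means_Clustering",
    "n_clusters": 2,
    "discovered_patterns": {"pattern_1": {}, "pattern_2": {}},
    "quality_metrics": {"silhouette_score": 0.0, "n_patterns": 2},
}

print(validate_result(old_kmeans))  # lists the missing metadata keys
print(validate_result(new_kmeans))  # empty list: structure is consistent
```

Returning a list of problems rather than raising on the first one makes it possible to report every gap for a method in a single pass.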