Fix for KeyError: 'method' in Risk Discovery Comparison
Problem
When running compare_risk_discovery.py, the script failed with:
KeyError: 'method'
This occurred because the K-Means implementation (UnsupervisedRiskDiscovery) returned its results in a different format from the other methods.
Root Cause
Different discovery methods were returning different data structures:
Other Methods (LDA, NMF, etc.) returned:
{
    'method': 'LDA_Topic_Modeling',
    'n_topics': 7,
    'discovered_topics': {...},
    'quality_metrics': {...}
}
K-Means returned:
{
    # Just the patterns dictionary, no metadata
    'pattern_1': {...},
    'pattern_2': {...}
}
The comparison function expected all methods to return a consistent structure with metadata.
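The mismatch can be reproduced in isolation. The sketch below is not the actual code in compare_risk_discovery.py; `summarize` and `reproduce` are hypothetical names, and the two dictionaries are stand-ins for the return shapes listed above.

```python
from typing import Any, Dict, List

# Stand-ins for the two return shapes described above (contents are placeholders).
lda_result: Dict[str, Any] = {
    "method": "LDA_Topic_Modeling",
    "n_topics": 7,
    "discovered_topics": {},
    "quality_metrics": {},
}
kmeans_result: Dict[str, Any] = {
    "pattern_1": {},
    "pattern_2": {},
}


def summarize(results: List[Dict[str, Any]]) -> List[str]:
    """Hypothetical comparison step that reads the 'method' key directly."""
    return [res["method"] for res in results]


def reproduce() -> str:
    """Run the comparison step and report whether it fails."""
    try:
        summarize([lda_result, kmeans_result])
    except KeyError as exc:
        return f"KeyError: {exc}"
    return "no error"


print(reproduce())  # KeyError: 'method'
```

Any code path that indexes `res['method']` without a fallback fails the same way as soon as one result lacks the metadata wrapper.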
Solution
1. Fixed K-Means Return Format (risk_discovery.py)
Before:
def discover_risk_patterns(self, clause_texts: List[str]) -> Dict[str, Any]:
    # ... clustering logic ...
    return self.discovered_patterns  # Just patterns dict
After:
def discover_risk_patterns(self, clause_texts: List[str]) -> Dict[str, Any]:
    # ... clustering logic ...

    # Calculate quality metrics
    from sklearn.metrics import silhouette_score
    try:
        silhouette = silhouette_score(self.feature_matrix, self.cluster_labels)
    except Exception:
        # e.g. ValueError when fewer than 2 clusters were formed
        silhouette = 0.0

    # Return structured results for comparison
    return {
        'method': 'K-Means_Clustering',
        'n_clusters': self.n_clusters,
        'discovered_patterns': self.discovered_patterns,
        'cluster_labels': self.cluster_labels,
        'quality_metrics': {
            'silhouette_score': silhouette,
            'n_patterns': len(self.discovered_patterns)
        }
    }
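The fallback to 0.0 matters because scikit-learn's `silhouette_score` raises a `ValueError` unless the number of distinct labels is between 2 and n_samples - 1. As a sketch only, the precondition can be checked up front; `silhouette_is_defined` is a hypothetical helper, not part of risk_discovery.py.

```python
from typing import Sequence


def silhouette_is_defined(labels: Sequence[int]) -> bool:
    """Return True when the labels contain 2..n_samples-1 distinct clusters."""
    n_samples = len(labels)
    n_labels = len(set(labels))
    return 2 <= n_labels <= n_samples - 1
```

A caller could use this to skip the metric explicitly instead of relying on the exception handler, which makes the 0.0 placeholder easier to tell apart from a real score.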
2. Fixed Report Pattern Display (compare_risk_discovery.py)
Updated pattern display code to handle different attribute names:
Before:
elif 'discovered_patterns' in res:
    report.append("\nTop 3 Patterns:")
    for i, (pattern_id, pattern) in enumerate(list(res['discovered_patterns'].items())[:3]):
        report.append(f" Pattern {pattern_id}: {pattern.get('name', 'Unnamed')}")
        report.append(f" Keywords: {', '.join(pattern.get('top_keywords', [])[:5])}")
        report.append(f" Clauses: {pattern.get('size', 0)}")
After:
elif 'discovered_patterns' in res:
    report.append("\nTop 3 Patterns:")
    for i, (pattern_id, pattern) in enumerate(list(res['discovered_patterns'].items())[:3]):
        # Handle different pattern formats
        pattern_name = pattern_id if isinstance(pattern_id, str) else pattern.get('name', f'Pattern {pattern_id}')
        keywords = pattern.get('key_terms', pattern.get('top_keywords', []))
        clause_count = pattern.get('clause_count', pattern.get('size', 0))
        report.append(f" {pattern_name}")
        if keywords:
            report.append(f" Keywords: {', '.join(keywords[:5])}")
        report.append(f" Clauses: {clause_count}")
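The fallback lookups can be exercised on their own. The sketch below pulls the same logic into a standalone function; `format_pattern` and the two sample dictionaries are made up for illustration and are not taken from the project.

```python
from typing import Any, Dict, List


def format_pattern(pattern_id: Any, pattern: Dict[str, Any]) -> List[str]:
    """Build report lines for one pattern, accepting either key spelling."""
    if isinstance(pattern_id, str):
        name = pattern_id
    else:
        name = pattern.get("name", f"Pattern {pattern_id}")
    keywords = pattern.get("key_terms", pattern.get("top_keywords", []))
    clause_count = pattern.get("clause_count", pattern.get("size", 0))

    lines = [f" {name}"]
    if keywords:  # omit the line entirely when no keywords are available
        lines.append(f" Keywords: {', '.join(keywords[:5])}")
    lines.append(f" Clauses: {clause_count}")
    return lines


# Illustrative inputs only: one dict per key spelling.
sample_a = {"key_terms": ["liability", "indemnify"], "clause_count": 12}
sample_b = {"name": "Termination", "top_keywords": ["terminate", "notice"], "size": 8}

print(format_pattern("pattern_1", sample_a))
print(format_pattern(3, sample_b))
```

Nesting `.get()` calls keeps the preferred key first and the older key as a fallback, so neither format needs to be converted before reporting.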
Result
All discovery methods now return consistent data structures:
{
    'method': '<method_name>',          # Method identifier
    'n_clusters' or 'n_topics': int,    # Number of patterns
    'discovered_*': {...},              # Pattern details
    'quality_metrics': {...}            # Performance metrics
}
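One way to keep this contract from drifting again is a small check run on each result before comparison. This is a sketch under the structure shown above; `missing_fields` is a hypothetical helper, not an existing function in the project.

```python
from typing import Any, Dict, List


def missing_fields(result: Dict[str, Any]) -> List[str]:
    """List the parts of the shared result structure that `result` lacks."""
    problems: List[str] = []
    if "method" not in result:
        problems.append("method")
    if not any(key in result for key in ("n_clusters", "n_topics")):
        problems.append("n_clusters or n_topics")
    if not any(isinstance(key, str) and key.startswith("discovered_") for key in result):
        problems.append("discovered_*")
    if "quality_metrics" not in result:
        problems.append("quality_metrics")
    return problems


# The old K-Means shape fails every check; a wrapped result passes.
old_kmeans = {"pattern_1": {}, "pattern_2": {}}
wrapped = {
    "method": "K-Means_Clustering",
    "n_clusters": 2,
    "discovered_patterns": old_kmeans,
    "quality_metrics": {},
}
print(missing_fields(old_kmeans))
print(missing_fields(wrapped))  # []
```

Reporting which fields are missing, and for which method, gives a more useful failure than a bare `KeyError` raised deep inside the report code.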
Files Modified
- risk_discovery.py - Updated discover_risk_patterns() return value
- compare_risk_discovery.py - Updated pattern display to handle different formats
Testing
Install the dependencies, then run the comparison:
cd /home/deepu/Downloads/code2
pip install -r requirements.txt
python3 compare_risk_discovery.py # Basic comparison (4 methods)
python3 compare_risk_discovery.py --advanced # Full comparison (9 methods)
Additional Fixes in This Session
- NMF Parameter Compatibility - Added version detection for scikit-learn API differences
- Full Dataset Support - Removed clause limits, added --max-clauses CLI option
- Consistent Return Formats - Standardized all discovery methods
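For reference, a minimal sketch of how the two command-line flags could be wired with `argparse`. Only the flag names `--advanced` and `--max-clauses` come from these notes; the integer type, the `None` default meaning "no limit", and the helper names `build_parser` and `limit_clauses` are assumptions.

```python
import argparse
from typing import List, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Parser with the two flags mentioned in these notes (types assumed)."""
    parser = argparse.ArgumentParser(description="Compare risk discovery methods")
    parser.add_argument("--advanced", action="store_true",
                        help="run the full comparison instead of the basic one")
    parser.add_argument("--max-clauses", type=int, default=None,
                        help="optional cap on clauses; default uses the full dataset")
    return parser


def limit_clauses(clauses: Sequence[str], max_clauses: Optional[int]) -> List[str]:
    """Return all clauses when no cap is given, otherwise the first max_clauses."""
    if max_clauses is None:
        return list(clauses)
    return list(clauses[:max_clauses])


args = build_parser().parse_args(["--max-clauses", "2"])
print(limit_clauses(["a", "b", "c"], args.max_clauses))  # ['a', 'b']
```

Defaulting the cap to `None` keeps full-dataset runs as the normal case while still allowing quick runs on a subset.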
All 9 risk discovery methods should now work correctly!