# Fix for KeyError: 'method' in Risk Discovery Comparison
## Problem
When running `compare_risk_discovery.py`, the script failed with:
```
KeyError: 'method'
```
This occurred because the K-Means implementation (`UnsupervisedRiskDiscovery`) returned its results in a different format from the other methods, without the `'method'` key that the comparison code reads.
## Root Cause
Different discovery methods were returning different data structures:
### Other Methods (LDA, NMF, etc.) returned:
```python
{
    'method': 'LDA_Topic_Modeling',
    'n_topics': 7,
    'discovered_topics': {...},
    'quality_metrics': {...}
}
```
### K-Means returned:
```python
{
    # Just the patterns dictionary, no metadata
    'pattern_1': {...},
    'pattern_2': {...}
}
```
The comparison function expected all methods to return a consistent structure with metadata.
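The failure can be reproduced in isolation. The following is a minimal sketch, not the actual code in `compare_risk_discovery.py`: `method_label` and `try_label` are hypothetical helpers, and the two dictionaries are placeholder data shaped like the structures shown above.

```python
from typing import Any, Dict

# Placeholder results shaped like the two return formats described above.
lda_result: Dict[str, Any] = {
    'method': 'LDA_Topic_Modeling',
    'n_topics': 7,
    'discovered_topics': {},
    'quality_metrics': {},
}
old_kmeans_result: Dict[str, Any] = {
    'pattern_1': {},
    'pattern_2': {},
}


def method_label(res: Dict[str, Any]) -> str:
    """Hypothetical stand-in for a direct res['method'] lookup."""
    return res['method']


def try_label(res: Dict[str, Any]) -> str:
    """Return the method label, or a description of the KeyError raised."""
    try:
        return method_label(res)
    except KeyError as exc:
        return f"KeyError: {exc}"


print(try_label(lda_result))
print(try_label(old_kmeans_result))
```

The second call fails because the old K-Means result is a bare patterns dictionary with no metadata keys.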
## Solution
### 1. Fixed K-Means Return Format (`risk_discovery.py`)
**Before:**
```python
def discover_risk_patterns(self, clause_texts: List[str]) -> Dict[str, Any]:
    # ... clustering logic ...
    return self.discovered_patterns  # Just patterns dict
```
**After:**
```python
def discover_risk_patterns(self, clause_texts: List[str]) -> Dict[str, Any]:
    # ... clustering logic ...

    # Calculate quality metrics
    from sklearn.metrics import silhouette_score
    try:
        silhouette = silhouette_score(self.feature_matrix, self.cluster_labels)
    except Exception:
        # Fall back to 0.0 if the score cannot be computed
        # (for example, when fewer than two clusters were found)
        silhouette = 0.0

    # Return structured results for comparison
    return {
        'method': 'K-Means_Clustering',
        'n_clusters': self.n_clusters,
        'discovered_patterns': self.discovered_patterns,
        'cluster_labels': self.cluster_labels,
        'quality_metrics': {
            'silhouette_score': silhouette,
            'n_patterns': len(self.discovered_patterns)
        }
    }
```
### 2. Fixed Report Pattern Display (`compare_risk_discovery.py`)
Updated pattern display code to handle different attribute names:
**Before:**
```python
elif 'discovered_patterns' in res:
    report.append("\nTop 3 Patterns:")
    for i, (pattern_id, pattern) in enumerate(list(res['discovered_patterns'].items())[:3]):
        report.append(f" Pattern {pattern_id}: {pattern.get('name', 'Unnamed')}")
        report.append(f" Keywords: {', '.join(pattern.get('top_keywords', [])[:5])}")
        report.append(f" Clauses: {pattern.get('size', 0)}")
```
**After:**
```python
elif 'discovered_patterns' in res:
    report.append("\nTop 3 Patterns:")
    for i, (pattern_id, pattern) in enumerate(list(res['discovered_patterns'].items())[:3]):
        # Handle different pattern formats
        pattern_name = pattern_id if isinstance(pattern_id, str) else pattern.get('name', f'Pattern {pattern_id}')
        keywords = pattern.get('key_terms', pattern.get('top_keywords', []))
        clause_count = pattern.get('clause_count', pattern.get('size', 0))
        report.append(f" {pattern_name}")
        if keywords:
            report.append(f" Keywords: {', '.join(keywords[:5])}")
        report.append(f" Clauses: {clause_count}")
```
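The fallback lookups can be exercised on their own. The sketch below wraps the same logic in a hypothetical helper, `describe_pattern`, and runs it against two made-up pattern dictionaries, one using `key_terms`/`clause_count` and one using `top_keywords`/`size`. The sample data and the helper name are assumptions for illustration only.

```python
from typing import Any, Dict, Hashable, List


def describe_pattern(pattern_id: Hashable, pattern: Dict[str, Any]) -> List[str]:
    """Build report lines for one pattern, tolerating either naming scheme."""
    # String IDs are used as the display name; other IDs fall back to 'name'.
    if isinstance(pattern_id, str):
        name = pattern_id
    else:
        name = pattern.get('name', f'Pattern {pattern_id}')
    # Prefer the newer attribute names, then fall back to the older ones.
    keywords = pattern.get('key_terms', pattern.get('top_keywords', []))
    clause_count = pattern.get('clause_count', pattern.get('size', 0))

    lines = [f" {name}"]
    if keywords:
        lines.append(f" Keywords: {', '.join(keywords[:5])}")
    lines.append(f" Clauses: {clause_count}")
    return lines


# Made-up examples of the two formats.
new_style = {'key_terms': ['liability', 'damages'], 'clause_count': 12}
old_style = {'name': 'Termination', 'top_keywords': ['notice'], 'size': 4}

print("\n".join(describe_pattern('pattern_1', new_style)))
print("\n".join(describe_pattern(0, old_style)))
```

A pattern with neither keyword attribute simply omits the keywords line rather than printing an empty one.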
## Result
All discovery methods now return consistent data structures:
```python
{
    'method': '<method_name>',         # Method identifier
    'n_clusters' or 'n_topics': int,   # Number of patterns
    'discovered_*': {...},             # Pattern details
    'quality_metrics': {...}           # Performance metrics
}
```
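One way to guard against a future regression is to check each result against this shape before comparing. The following is a sketch only; `validate_result` is a hypothetical helper that is not part of the modified files, and the exact rules it applies are assumptions drawn from the structure above.

```python
from typing import Any, Dict, List


def validate_result(res: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: List[str] = []
    if not isinstance(res.get('method'), str):
        problems.append("missing 'method'")
    if not any(key in res for key in ('n_clusters', 'n_topics')):
        problems.append("missing 'n_clusters' or 'n_topics'")
    if not any(isinstance(key, str) and key.startswith('discovered_') for key in res):
        problems.append("missing 'discovered_*'")
    if not isinstance(res.get('quality_metrics'), dict):
        problems.append("missing 'quality_metrics'")
    return problems


# Placeholder data: a result in the new format and one in the old K-Means format.
good = {
    'method': 'K-Means_Clustering',
    'n_clusters': 7,
    'discovered_patterns': {},
    'quality_metrics': {'silhouette_score': 0.0, 'n_patterns': 0},
}
bad = {'pattern_1': {}, 'pattern_2': {}}

print(validate_result(good))
print(validate_result(bad))
```

Reporting all problems at once, rather than raising on the first, makes it easier to see how far a method's output is from the expected structure.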
## Files Modified
1. `risk_discovery.py` - Updated `discover_risk_patterns()` return value
2. `compare_risk_discovery.py` - Updated pattern display to handle different formats
## Testing
To verify the fix, install the dependencies and rerun the comparison:
```bash
cd /home/deepu/Downloads/code2
pip install -r requirements.txt
python3 compare_risk_discovery.py # Basic comparison (4 methods)
python3 compare_risk_discovery.py --advanced # Full comparison (9 methods)
```
## Additional Fixes in This Session
1. **NMF Parameter Compatibility** - Added version detection for scikit-learn API differences
2. **Full Dataset Support** - Removed clause limits, added `--max-clauses` CLI option
3. **Consistent Return Formats** - Standardized all discovery methods
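
For the `--max-clauses` option, a minimal `argparse` sketch might look like the following. The flag names `--advanced` and `--max-clauses` come from this document, but the default of `None` (meaning "use the full dataset"), the help text, and the helpers `build_parser` and `limit_clauses` are assumptions, not the script's actual implementation.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI definition for the comparison script."""
    parser = argparse.ArgumentParser(description="Compare risk discovery methods")
    parser.add_argument('--advanced', action='store_true',
                        help="Run the full comparison instead of the basic one")
    parser.add_argument('--max-clauses', type=int, default=None,
                        help="Optional cap on the number of clauses (assumed default: no cap)")
    return parser


def limit_clauses(clauses: List[str], max_clauses: Optional[int]) -> List[str]:
    """Return all clauses when no cap is given, otherwise the first max_clauses."""
    return clauses if max_clauses is None else clauses[:max_clauses]


args = build_parser().parse_args(['--max-clauses', '2'])
print(limit_clauses(['clause a', 'clause b', 'clause c'], args.max_clauses))
```
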
All 9 risk discovery methods should now work correctly!