# Verification Checklist

## Before Running
- [ ] Install dependencies: `pip install -r requirements.txt`
- [ ] Ensure CUAD dataset is at: `dataset/CUAD_v1/CUAD_v1.json`
- [ ] Python 3.8+ installed
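
The last two items above can be checked with a short script. This is a minimal sketch, not part of the project: the `preflight` helper is a hypothetical name, and it assumes the script is run from the repository root so the default dataset path resolves.

```python
import sys
from pathlib import Path

# Default dataset location, taken from the checklist above.
DEFAULT_DATASET = Path("dataset/CUAD_v1/CUAD_v1.json")
MIN_PYTHON = (3, 8)


def preflight(dataset_path=DEFAULT_DATASET, version=None):
    """Return a list of problems; an empty list means the checks passed.

    `preflight` is a hypothetical helper written for this checklist.
    """
    version = tuple(version or sys.version_info[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, 3.8+ required"
        )
    if not Path(dataset_path).is_file():
        problems.append(f"Dataset not found at {dataset_path}")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("FAIL:", issue)
    print("Ready to run" if not issues else "Fix the items above first")
```

It does not verify installed packages; `pip install -r requirements.txt` remains the step for that.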

## Tests to Run

### 1. Basic Comparison (4 methods)
```bash
python3 compare_risk_discovery.py
```

**Expected:**
- K-Means ✅
- LDA ✅
- Hierarchical ✅
- DBSCAN ✅
- Output files created
- No KeyError
- No TypeError
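
One way to check the "No KeyError / No TypeError" items is to capture the script's output and scan it for those names. The sketch below assumes the script reports failures as ordinary text on stdout or stderr, for example in a traceback; `find_exceptions` and `run_and_scan` are hypothetical helpers, not part of the project.

```python
import subprocess
import sys

# Exception names the checklist asks us to watch for.
WATCHED = ("KeyError", "TypeError")


def find_exceptions(output, names=WATCHED):
    """Return the output lines that mention any watched exception name."""
    return [
        line for line in output.splitlines()
        if any(name in line for name in names)
    ]


def run_and_scan(extra_args=()):
    """Run the comparison script; return (exit_code, suspicious_lines).

    Assumes compare_risk_discovery.py is in the current directory.
    """
    cmd = [sys.executable, "compare_risk_discovery.py", *extra_args]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, find_exceptions(proc.stdout + proc.stderr)
```

Calling `run_and_scan(["--advanced"])` would cover the advanced run in the same way. A zero exit code with an empty list is the expected result.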

### 2. Advanced Comparison (9 methods)
```bash
python3 compare_risk_discovery.py --advanced
```

**Expected:**
- All 4 basic methods ✅
- NMF ✅ (no alpha parameter error)
- Spectral ✅
- GMM ✅
- Mini-Batch K-Means ✅
- Risk-o-meter ✅
- Output files created

### 3. Limited Dataset
```bash
python3 compare_risk_discovery.py --max-clauses 1000
```

**Expected:**
- Runs faster
- Uses 1000 clauses max
- All methods complete

### 4. Custom Data Path
```bash
python3 compare_risk_discovery.py --data-path dataset/CUAD_v1/CUAD_v1.json
```

**Expected:**
- Loads from specified path
- All methods complete

## Output Files to Check
After successful run:
- [ ] `risk_discovery_comparison_report.txt` exists
- [ ] `risk_discovery_comparison_results.json` exists
- [ ] Report contains all methods
- [ ] JSON is valid and parseable
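
The file checks above can be automated along these lines. This sketch assumes both files are written to the current working directory, which the checklist does not state explicitly; `check_outputs` is a hypothetical helper.

```python
import json
from pathlib import Path

# File names taken from the checklist above.
REPORT_NAME = "risk_discovery_comparison_report.txt"
RESULTS_NAME = "risk_discovery_comparison_results.json"


def check_outputs(output_dir="."):
    """Return a list of problems found with the two output files."""
    out = Path(output_dir)
    report = out / REPORT_NAME
    results = out / RESULTS_NAME
    problems = []

    if not report.is_file():
        problems.append(f"missing {REPORT_NAME}")
    elif report.stat().st_size == 0:
        problems.append(f"{REPORT_NAME} is empty")

    if not results.is_file():
        problems.append(f"missing {RESULTS_NAME}")
    else:
        try:
            # json.loads raises JSONDecodeError on malformed input.
            json.loads(results.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{RESULTS_NAME} is not valid JSON: {exc}")

    return problems
```

An empty return value means both files exist and the JSON parses. It says nothing about whether the report covers every method.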

## Key Metrics to Verify
In the report, check for:
- [ ] Each method has `Patterns Discovered` count
- [ ] Execution times are reasonable
- [ ] Quality metrics are present (silhouette/perplexity)
- [ ] Top patterns are displayed
- [ ] Recommendations section is complete
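
A rough text scan can flag methods that never appear in the report. The report's exact layout is not documented here, so this sketch assumes only that each method name and the `Patterns Discovered` label appear verbatim somewhere in the text; `missing_from_report` and `count_pattern_lines` are hypothetical helpers.

```python
# Method names as listed in the test sections of this checklist.
BASIC_METHODS = ["K-Means", "LDA", "Hierarchical", "DBSCAN"]
ADVANCED_METHODS = BASIC_METHODS + [
    "NMF", "Spectral", "GMM", "Mini-Batch K-Means", "Risk-o-meter",
]


def missing_from_report(report_text, methods):
    """Return the method names that do not occur in the report text.

    Matching is case-insensitive substring search. Note that "K-Means"
    also matches inside "Mini-Batch K-Means", so this is a coarse check.
    """
    lowered = report_text.lower()
    return [m for m in methods if m.lower() not in lowered]


def count_pattern_lines(report_text, label="Patterns Discovered"):
    """Count the lines that contain the pattern-count label."""
    return sum(1 for line in report_text.splitlines() if label in line)
```

For an advanced run, an empty list from `missing_from_report(text, ADVANCED_METHODS)` together with a `count_pattern_lines(text)` of at least 9 would be consistent with the items above. Execution times and the recommendations section still need a manual read.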

## Common Issues and Solutions

### Issue: No module named 'sklearn'
**Solution:** `pip install "scikit-learn>=1.3.0"` (the quotes stop the shell from treating `>` as a redirect)

### Issue: No module named 'gensim' (Risk-o-meter only)
**Solution:** `pip install "gensim>=4.3.0"`, or skip Risk-o-meter by running in basic mode (without `--advanced`)

### Issue: Dataset not found
**Solution:** Check the path passed to `--data-path`, or place the dataset at the default location (`dataset/CUAD_v1/CUAD_v1.json`)

### Issue: Out of memory
**Solution:** Use `--max-clauses 5000` to limit dataset size

### Issue: Slow execution
**Solution:** 
- Use basic mode (without `--advanced`)
- Reduce `--max-clauses`
- Skip Spectral/Hierarchical for large datasets

## Performance Expectations

For ~13K clauses (full CUAD):
- K-Means: ~10-30 seconds ⚡
- LDA: ~30-60 seconds 🟡
- Hierarchical: ~60-120 seconds 🟡 (memory intensive)
- DBSCAN: ~20-40 seconds ⚡
- NMF: ~15-45 seconds ⚡
- Spectral: ~90-180 seconds 🔴 (slow for large datasets)
- GMM: ~40-80 seconds 🟡
- Mini-Batch K-Means: ~5-15 seconds ⚡⚡
- Risk-o-meter: ~60-120 seconds 🟡

**Total time (advanced mode):** ~6-12 minutes
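
The total follows from summing the per-method ranges. The arithmetic below uses only the figures listed above and ignores data loading and report writing, which the estimates do not break out.

```python
# Per-method estimates in seconds (low, high), copied from the list above.
ESTIMATES_SECONDS = {
    "K-Means": (10, 30),
    "LDA": (30, 60),
    "Hierarchical": (60, 120),
    "DBSCAN": (20, 40),
    "NMF": (15, 45),
    "Spectral": (90, 180),
    "GMM": (40, 80),
    "Mini-Batch K-Means": (5, 15),
    "Risk-o-meter": (60, 120),
}


def total_minutes(estimates):
    """Sum the low and high bounds and convert both to minutes."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low / 60, high / 60


low, high = total_minutes(ESTIMATES_SECONDS)
print(f"{low:.1f} to {high:.1f} minutes")  # prints "5.5 to 11.5 minutes"
```

The sums are 330 and 690 seconds, which round to the quoted 6-12 minutes.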

## Success Criteria
- ✅ All methods complete without errors
- ✅ Output files generated
- ✅ Report contains meaningful patterns
- ✅ Quality metrics are calculated
- ✅ No KeyError or TypeError exceptions