# Verification Checklist
## Before Running
- [ ] Install dependencies: `pip install -r requirements.txt`
- [ ] Ensure CUAD dataset is at: `dataset/CUAD_v1/CUAD_v1.json`
- [ ] Python 3.8+ installed
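
The three checks above can also be scripted. The following is a minimal sketch, not part of the project. It assumes the default dataset path from this checklist and that `sklearn` (required) and `gensim` (optional, Risk-o-meter only) are the relevant import names. The actual contents of `requirements.txt` are not shown here, so adjust the module lists to match it.

```python
import importlib.util
import sys
from pathlib import Path

# Assumed defaults taken from this checklist; adjust to your checkout.
DATASET_PATH = Path("dataset/CUAD_v1/CUAD_v1.json")
MIN_PYTHON = (3, 8)
# Import names, not pip names: scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["sklearn"]
OPTIONAL_MODULES = ["gensim"]  # assumed to be needed only for Risk-o-meter


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def preflight(dataset_path=DATASET_PATH, min_python=MIN_PYTHON, version=None):
    """Return a list of problems; an empty list means ready to run."""
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []
    if version < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {version[0]}.{version[1]}"
        )
    if not Path(dataset_path).is_file():
        problems.append(f"Dataset not found at {dataset_path}")
    for name in missing_modules(REQUIRED_MODULES):
        problems.append(f"Missing required module: {name}")
    return problems
```

Calling `preflight()` from the repository root returns an empty list when everything is in place. `missing_modules(OPTIONAL_MODULES)` reports whether `gensim` is absent before an `--advanced` run.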
## Tests to Run
### 1. Basic Comparison (4 methods)
```bash
python3 compare_risk_discovery.py
```
**Expected:**
- K-Means ✅
- LDA ✅
- Hierarchical ✅
- DBSCAN ✅
- Output files created
- No KeyError
- No TypeError
### 2. Advanced Comparison (9 methods)
```bash
python3 compare_risk_discovery.py --advanced
```
**Expected:**
- All 4 basic methods ✅
- NMF ✅ (no alpha parameter error)
- Spectral ✅
- GMM ✅
- Mini-Batch K-Means ✅
- Risk-o-meter ✅
- Output files created
### 3. Limited Dataset
```bash
python3 compare_risk_discovery.py --max-clauses 1000
```
**Expected:**
- Runs faster than the full-dataset run
- Uses at most 1000 clauses
- All methods complete
### 4. Custom Data Path
```bash
python3 compare_risk_discovery.py --data-path dataset/CUAD_v1/CUAD_v1.json
```
**Expected:**
- Loads from specified path
- All methods complete
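
The four runs above can be driven from one script. This is a rough sketch under stated assumptions: the helper names `evaluate` and `run_case` are hypothetical, the script is assumed to exit with a non-zero code on failure, and any traceback is assumed to appear in its stdout or stderr. It only checks the exit code and scans for `KeyError` and `TypeError`. It does not confirm that each method completed.

```python
import subprocess
import sys

SCRIPT = "compare_risk_discovery.py"
# Extra arguments for each test in this checklist.
CASES = {
    "basic": [],
    "advanced": ["--advanced"],
    "limited": ["--max-clauses", "1000"],
    "custom path": ["--data-path", "dataset/CUAD_v1/CUAD_v1.json"],
}
FORBIDDEN = ("KeyError", "TypeError")


def evaluate(returncode, output):
    """Return failure reasons for one run; an empty list means pass."""
    reasons = []
    if returncode != 0:
        reasons.append(f"exit code {returncode}")
    for name in FORBIDDEN:
        if name in output:
            reasons.append(f"{name} in output")
    return reasons


def run_case(extra_args):
    """Run the script with extra_args and evaluate its combined output."""
    proc = subprocess.run(
        [sys.executable, SCRIPT, *extra_args],
        capture_output=True,
        text=True,
    )
    return evaluate(proc.returncode, proc.stdout + proc.stderr)
```

For example, `{name: run_case(args) for name, args in CASES.items()}` gives one list of failure reasons per test. All lists should be empty.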
## Output Files to Check
After successful run:
- [ ] `risk_discovery_comparison_report.txt` exists
- [ ] `risk_discovery_comparison_results.json` exists
- [ ] Report contains all methods
- [ ] JSON is valid and parseable
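
The file checks above can be automated with the standard library. A minimal sketch, assuming both files are written to the directory the script was run from. The function name `check_outputs` is made up for illustration.

```python
import json
from pathlib import Path

REPORT = "risk_discovery_comparison_report.txt"
RESULTS = "risk_discovery_comparison_results.json"


def check_outputs(directory="."):
    """Return a list of problems with the two output files in directory."""
    base = Path(directory)
    problems = []

    report = base / REPORT
    if not report.is_file():
        problems.append(f"missing {REPORT}")
    elif report.stat().st_size == 0:
        problems.append(f"{REPORT} is empty")

    results = base / RESULTS
    if not results.is_file():
        problems.append(f"missing {RESULTS}")
    else:
        try:
            # Parsing the whole file is the simplest validity check.
            json.loads(results.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{RESULTS} is not valid JSON: {exc}")
    return problems
```

An empty list from `check_outputs()` covers the first, second, and fourth items. Whether the report contains all methods still needs a look at its content.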
## Key Metrics to Verify
In the report, check for:
- [ ] Each method has `Patterns Discovered` count
- [ ] Execution times are in line with the Performance Expectations section
- [ ] Quality metrics are present (silhouette/perplexity)
- [ ] Top patterns are displayed
- [ ] Recommendations section is complete
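
Part of this list can be checked with a plain text search. The sketch below assumes the report spells the method names and the `Patterns Discovered` label as they appear in this checklist. The actual report layout is not documented here, so treat a miss as a prompt to read the report, not as proof of failure.

```python
BASIC_METHODS = ["K-Means", "LDA", "Hierarchical", "DBSCAN"]
ADVANCED_METHODS = BASIC_METHODS + [
    "NMF", "Spectral", "GMM", "Mini-Batch K-Means", "Risk-o-meter",
]
REQUIRED_LABELS = ["Patterns Discovered"]


def missing_from_report(report_text, expected):
    """Return the expected strings that do not occur in report_text.

    The match is a case-insensitive substring search, so it is crude:
    "K-Means" is also satisfied by a line mentioning "Mini-Batch K-Means".
    """
    haystack = report_text.lower()
    return [item for item in expected if item.lower() not in haystack]
```

For an advanced run, `missing_from_report(text, ADVANCED_METHODS + REQUIRED_LABELS)` should return an empty list, where `text` is the content of `risk_discovery_comparison_report.txt`.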
## Common Issues and Solutions
### Issue: No module named 'sklearn'
**Solution:** `pip install "scikit-learn>=1.3.0"` (quote the requirement so the shell does not treat `>` as a redirect)
### Issue: No module named 'gensim' (Risk-o-meter only)
**Solution:** `pip install "gensim>=4.3.0"`, or skip Risk-o-meter by running in basic mode
### Issue: Dataset not found
**Solution:** Check path in `--data-path` argument or use default location
### Issue: Out of memory
**Solution:** Use `--max-clauses 5000` to limit dataset size
### Issue: Slow execution
**Solution:**
- Use basic mode (without `--advanced`)
- Reduce `--max-clauses`
- Skip Spectral/Hierarchical for large datasets
## Performance Expectations
For ~13K clauses (full CUAD):
- K-Means: ~10-30 seconds ⚡
- LDA: ~30-60 seconds 🟡
- Hierarchical: ~60-120 seconds 🟡 (memory intensive)
- DBSCAN: ~20-40 seconds ⚡
- NMF: ~15-45 seconds ⚡
- Spectral: ~90-180 seconds 🔴 (slow for large datasets)
- GMM: ~40-80 seconds 🟡
- Mini-Batch K-Means: ~5-15 seconds ⚡⚡
- Risk-o-meter: ~60-120 seconds 🟡
**Total time (advanced mode):** ~6-12 minutes
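
The total follows from summing the per-method ranges. The short sketch below reproduces that arithmetic from the estimates listed in this section. It covers only the nine methods and ignores data loading and report writing.

```python
# (low, high) estimates in seconds, copied from the list in this section.
ESTIMATES = {
    "K-Means": (10, 30),
    "LDA": (30, 60),
    "Hierarchical": (60, 120),
    "DBSCAN": (20, 40),
    "NMF": (15, 45),
    "Spectral": (90, 180),
    "GMM": (40, 80),
    "Mini-Batch K-Means": (5, 15),
    "Risk-o-meter": (60, 120),
}


def total_range_minutes(estimates):
    """Sum the low and high bounds and convert both totals to minutes."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low / 60, high / 60


print(total_range_minutes(ESTIMATES))  # (5.5, 11.5)
```

The sums are 330 and 690 seconds, which rounds to the ~6-12 minutes quoted above.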
## Success Criteria
- ✅ All methods complete without errors
- ✅ Output files generated
- ✅ Report contains meaningful patterns
- ✅ Quality metrics are calculated
- ✅ No KeyError or TypeError exceptions