Verification Checklist
Before Running
- Install dependencies: pip install -r requirements.txt
- Ensure CUAD dataset is at: dataset/CUAD_v1/CUAD_v1.json
- Python 3.8+ installed
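The interpreter and dataset checks above can be automated before a long run. The following is a minimal sketch, not part of the project: the function name `preflight` and its parameters are hypothetical, and it only checks the Python version and the presence of the dataset file at the default path from this checklist.

```python
import sys
from pathlib import Path

# Default dataset location named in the checklist.
DEFAULT_DATASET = Path("dataset/CUAD_v1/CUAD_v1.json")


def preflight(dataset_path=DEFAULT_DATASET, min_python=(3, 8)):
    """Return a list of problems; an empty list means the basics are in place."""
    problems = []
    # Tuple comparison: (3, 11, 4, ...) < (3, 8) is False, so 3.11 passes.
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    if not Path(dataset_path).is_file():
        problems.append(f"Dataset not found at {dataset_path}")
    return problems
```

Calling `preflight()` from the repository root returns an empty list when both conditions hold. It does not verify that the packages in requirements.txt are installed.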
Tests to Run
1. Basic Comparison (4 methods)
python3 compare_risk_discovery.py
Expected:
- K-Means ✓
- LDA ✓
- Hierarchical ✓
- DBSCAN ✓
- Output files created
- No KeyError
- No TypeError
2. Advanced Comparison (9 methods)
python3 compare_risk_discovery.py --advanced
Expected:
- All 4 basic methods ✓
- NMF ✓ (no alpha parameter error)
- Spectral ✓
- GMM ✓
- Mini-Batch K-Means ✓
- Risk-o-meter ✓
- Output files created
3. Limited Dataset
python3 compare_risk_discovery.py --max-clauses 1000
Expected:
- Runs faster
- Uses 1000 clauses max
- All methods complete
4. Custom Data Path
python3 compare_risk_discovery.py --data-path dataset/CUAD_v1/CUAD_v1.json
Expected:
- Loads from specified path
- All methods complete
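The four runs above can be scripted so that each one is checked for the KeyError and TypeError exceptions this checklist asks about. This is a sketch under assumptions: the helper names are hypothetical, the script is assumed to exit with status 0 on success, and exceptions are detected by plain substring search in the captured output.

```python
import subprocess
import sys

# The four invocations listed in the checklist.
RUNS = [
    [],
    ["--advanced"],
    ["--max-clauses", "1000"],
    ["--data-path", "dataset/CUAD_v1/CUAD_v1.json"],
]

# Exception names the checklist says must not appear.
FORBIDDEN = ("KeyError", "TypeError")


def find_forbidden(output, names=FORBIDDEN):
    """Return the forbidden exception names that occur in captured output."""
    return [name for name in names if name in output]


def run_check(args, script="compare_risk_discovery.py"):
    """Run one invocation; return (exit_code, forbidden names seen)."""
    result = subprocess.run(
        [sys.executable, script, *args],
        capture_output=True,
        text=True,
    )
    return result.returncode, find_forbidden(result.stdout + result.stderr)


def run_all():
    """Run every invocation in RUNS and print a one-line verdict for each."""
    for args in RUNS:
        code, bad = run_check(args)
        # Assumption: a zero exit status means the run completed.
        verdict = "ok" if code == 0 and not bad else "FAILED"
        print(" ".join(["compare_risk_discovery.py", *args]), "->", verdict, bad)
```

Call `run_all()` from the repository root. Because the runs execute one after another, expect the total to be roughly the sum of the individual run times.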
Output Files to Check
After successful run:
- risk_discovery_comparison_report.txt exists
- risk_discovery_comparison_results.json exists
- Report contains all methods
- JSON is valid and parseable
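The file checks above can be done in a few lines. This sketch uses the two output file names from this checklist; the function name `check_outputs` is hypothetical, and nothing is assumed about the structure of the JSON beyond it being parseable.

```python
import json
from pathlib import Path

REPORT = "risk_discovery_comparison_report.txt"
RESULTS = "risk_discovery_comparison_results.json"


def check_outputs(directory="."):
    """Return a list of problems with the two output files; empty means OK."""
    base = Path(directory)
    problems = []

    report = base / REPORT
    if not report.is_file():
        problems.append(f"{REPORT} is missing")
    elif report.stat().st_size == 0:
        problems.append(f"{REPORT} is empty")

    results = base / RESULTS
    if not results.is_file():
        problems.append(f"{RESULTS} is missing")
    else:
        try:
            # Parsing is the only validation; the schema is not checked.
            json.loads(results.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{RESULTS} is not valid JSON: {exc}")
    return problems
```

Run `check_outputs()` in the directory where the script wrote its output. An empty list covers the first, second, and fourth items above.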
Key Metrics to Verify
In the report, check for:
- Each method has a Patterns Discovered count
- Execution times are reasonable
- Quality metrics are present (silhouette/perplexity)
- Top patterns are displayed
- Recommendations section is complete
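A quick text search can catch a report that is missing a method or a section. The sketch below is only a first-pass check: the helper `missing_labels` is hypothetical, and the label spellings are taken from this checklist, so they may not match the report's exact wording.

```python
# Method labels taken from the checklist; the report's exact spelling
# is an assumption.
BASIC_METHODS = ["K-Means", "LDA", "Hierarchical", "DBSCAN"]


def missing_labels(report_text, labels):
    """Return the labels that never appear in the report (case-insensitive)."""
    lowered = report_text.lower()
    return [label for label in labels if label.lower() not in lowered]
```

For example, `missing_labels(text, BASIC_METHODS + ["Patterns Discovered"])` returns the labels that still need a manual look. Substring matching has a known blind spot: "K-Means" also matches inside "Mini-Batch K-Means", so an advanced-mode report needs a closer read.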
Common Issues and Solutions
Issue: No module named 'sklearn'
Solution: pip install "scikit-learn>=1.3.0" (quote the requirement so the shell does not treat > as a redirect)
Issue: No module named 'gensim' (Risk-o-meter only)
Solution: pip install "gensim>=4.3.0", or skip Risk-o-meter by running in basic mode
Issue: Dataset not found
Solution: Check the path passed to --data-path, or place the dataset at the default location (dataset/CUAD_v1/CUAD_v1.json)
Issue: Out of memory
Solution: Use --max-clauses 5000 to limit dataset size
Issue: Slow execution
Solution:
- Use basic mode (without --advanced)
- Reduce --max-clauses
- Skip Spectral/Hierarchical for large datasets
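The two "No module named" issues above can be detected before a run instead of partway through it. This is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name and the two lists are hypothetical and cover only the modules named in this section.

```python
import importlib.util

# Import names from the issues above: scikit-learn installs as "sklearn".
REQUIRED = ["sklearn"]
ADVANCED_ONLY = ["gensim"]  # needed for Risk-o-meter only


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

`missing_modules(REQUIRED + ADVANCED_ONLY)` lists what still needs installing. It confirms that a module can be located, not that its version meets the minimum given above.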
Performance Expectations
For ~13K clauses (full CUAD):
- K-Means: ~10-30 seconds ⚡
- LDA: ~30-60 seconds 🟡
- Hierarchical: ~60-120 seconds 🟡 (memory intensive)
- DBSCAN: ~20-40 seconds ⚡
- NMF: ~15-45 seconds ⚡
- Spectral: ~90-180 seconds 🔴 (slow for large datasets)
- GMM: ~40-80 seconds 🟡
- Mini-Batch K-Means: ~5-15 seconds ⚡⚡
- Risk-o-meter: ~60-120 seconds 🟡
Total time (advanced mode): ~6-12 minutes
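The total follows from summing the per-method ranges. The short calculation below reproduces it and derives a basic-mode figure the same way. It covers only the per-method estimates listed above, not data loading or report writing.

```python
# (low, high) estimates in seconds, copied from the list above.
ESTIMATES = {
    "K-Means": (10, 30),
    "LDA": (30, 60),
    "Hierarchical": (60, 120),
    "DBSCAN": (20, 40),
    "NMF": (15, 45),
    "Spectral": (90, 180),
    "GMM": (40, 80),
    "Mini-Batch K-Means": (5, 15),
    "Risk-o-meter": (60, 120),
}
BASIC = ["K-Means", "LDA", "Hierarchical", "DBSCAN"]


def total_seconds(methods):
    """Sum the low and high estimates for the given method names."""
    low = sum(ESTIMATES[name][0] for name in methods)
    high = sum(ESTIMATES[name][1] for name in methods)
    return low, high


advanced_low, advanced_high = total_seconds(ESTIMATES)
print(f"advanced: {advanced_low / 60:.1f}-{advanced_high / 60:.1f} min")
# prints "advanced: 5.5-11.5 min"

basic_low, basic_high = total_seconds(BASIC)
print(f"basic: {basic_low / 60:.1f}-{basic_high / 60:.1f} min")
# prints "basic: 2.0-4.2 min"
```

The advanced-mode sum of 330 to 690 seconds rounds to the 6-12 minutes quoted above.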
Success Criteria
✓ All methods complete without errors
✓ Output files generated
✓ Report contains meaningful patterns
✓ Quality metrics are calculated
✓ No KeyError or TypeError exceptions