Kraft102's picture
fix: sql.js Docker/Alpine compatibility layer for PatternMemory and FailureMemory
5a81b95
# Hugging Face Error Detection
**Purpose**: ML-powered vulnerability scanning and error pattern recognition
**Status**: 🟑 Working (performance tuning recommended)
**Version**: 1.0.0+
**Maintainer**: Security Team (Block 6)
## What It Does
Hugging Face error detection uses ML models to:
- Identify security vulnerabilities (SQL injection, XSS, etc.)
- Find API misuse patterns
- Detect performance anti-patterns
- Recognize code smell and potential bugs
- Analyze error propagation paths
## Installation
```bash
# Already configured in project
# To verify:
python -c "from transformers import pipeline; print('HF installed')"
# To update:
pip install --upgrade transformers torch
# Verify models are cached
ls ~/.cache/huggingface/hub/
```
## Performance Workaround
### Problem
Scanning large codebases (>100K lines) takes too long or times out
### Solution
**Scan by module, not entire project**:
```bash
# DON'T do this (will timeout)
python tools/error-libraries/huggingface-error-detection/detector.py src/
# DO this instead - scan module by module
cd C:/Users/claus/Projects/WidgetTDC
python tools/error-libraries/huggingface-error-detection/detector.py src/agents/
python tools/error-libraries/huggingface-error-detection/detector.py src/services/
python tools/error-libraries/huggingface-error-detection/detector.py src/routers/
# Or use parallel processing
parallel -j 3 "timeout 30 python detector.py {}" ::: src/*/
```
## Quick Start
### 1. Scan Single Module
```bash
# Scan one service
cd C:/Users/claus/Projects/WidgetTDC
python tools/error-libraries/huggingface-error-detection/detector.py src/services/email_service.py
# Expected output:
# [CRITICAL] SQL injection vulnerability in query: "SELECT * FROM users WHERE id=" + user_input
# [HIGH] Missing error handler for network timeout
# [MEDIUM] Unvalidated user input in API response
```
### 2. Scan Entire Directory (with timeout)
```bash
# With enforced 30-second timeout
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py src/agents/
# Multiple directories in sequence
for dir in src/agents src/services src/routers; do
echo "Scanning $dir..."
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py "$dir"
done
```
### 3. Generate Report
```bash
# Scan and save results
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py src/ > hf-scan-report.txt 2>&1
# Review report
cat hf-scan-report.txt
# Count vulnerabilities by severity
grep "\[CRITICAL\]" hf-scan-report.txt | wc -l
grep "\[HIGH\]" hf-scan-report.txt | wc -l
grep "\[MEDIUM\]" hf-scan-report.txt | wc -l
```
## Severity Levels
### πŸ”΄ CRITICAL
- SQL injection vulnerabilities
- Authentication bypass
- Credential exposure
- Remote code execution risks
- Data exfiltration paths
**Action**: Fix immediately before any deployment
### 🟠 HIGH
- Missing input validation
- Improper error handling
- Weak cryptography usage
- Privilege escalation paths
- Performance DoS patterns
**Action**: Fix within current sprint
### 🟑 MEDIUM
- Code smell and anti-patterns
- Potential logic errors
- Non-optimal algorithms
- Deprecated API usage
- Type safety issues
**Action**: Schedule for next sprint
### 🟒 LOW
- Style improvements
- Documentation suggestions
- Refactoring recommendations
- Performance micro-optimizations
- Best practice violations
**Action**: Consider for future improvement
## Usage Patterns
### Pattern 1: Security-Focused Scan
```bash
# Scan for security issues only
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py \
--security-only \
src/
# Output will only show CRITICAL and HIGH severity
```
### Pattern 2: Performance Analysis
```bash
# Scan for performance anti-patterns
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py \
--performance-check \
src/services/
# Identifies O(nΒ²) algorithms, memory leaks, etc.
```
### Pattern 3: API Misuse Detection
```bash
# Scan for common API misuse patterns
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py \
--api-validation \
src/integrations/
# Finds incorrect library usage, wrong parameters, etc.
```
### Pattern 4: Compare Before/After
```bash
# Baseline scan
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py src/ > baseline.txt
# Make changes...
# ... edit code ...
# Rescan
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py src/ > after.txt
# Compare
diff baseline.txt after.txt
```
## Configuration
**File**: `config/error-detection/huggingface-params.json`
```json
{
"timeout_seconds": 30,
"severity_threshold": "MEDIUM",
"models": [
"microsoft/codebert-base",
"huggingface/CodeBERTa-small-v1"
],
"parallel_workers": 3,
"cache_results": true,
"exclude_patterns": [
"node_modules/",
"dist/",
"*.test.py"
]
}
```
### Custom Configuration
```bash
# Use custom params file
python tools/error-libraries/huggingface-error-detection/detector.py \
--config custom-hf-params.json \
src/
```
## Common Issues and Solutions
### Issue: "CUDA out of memory"
**Cause**: GPU memory exhausted by ML model
**Solution**:
```bash
# Run on CPU instead
CUDA_VISIBLE_DEVICES="" python detector.py src/
# Or limit batch size
python detector.py --batch-size 16 src/
```
### Issue: "Timeout after 30 seconds"
**Cause**: Large directory or slow model
**Solution**:
```bash
# Scan smaller subsets
timeout 30 python detector.py src/agents/ > agents.txt
timeout 30 python detector.py src/services/ > services.txt
# Or increase timeout
timeout 60 python detector.py src/
```
### Issue: "Model download failed"
**Cause**: First run needs to download model (slow)
**Solution**:
```bash
# Pre-download model
python -c "from transformers import AutoModel; AutoModel.from_pretrained('microsoft/codebert-base')"
# Then run detector (cached)
python detector.py src/
```
### Issue: "High false-positive rate"
**Cause**: ML model marking legitimate code as vulnerable
**Solution**:
```bash
# Review reported vulnerabilities manually
# Update config to set higher threshold
python detector.py --severity-threshold HIGH src/
# Or exclude specific patterns
python detector.py --exclude-pattern "test_*" src/
```
## Example Scenarios
### Scenario 1: Widget Discovery Security Review
**Block**: 5 (QASpecialist)
**Task**: Verify widget discovery from Git/Hugging Face is secure
```bash
# Scan widget discovery service
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py \
--security-only \
src/services/widget_discovery.py
# Expected findings:
# - Input validation for URLs
# - Git URL injection protection
# - Model download safety
# - Sandboxing of untrusted code
```
### Scenario 2: MCP Communication Validation
**Block**: 2 (CloudArch)
**Task**: Find vulnerabilities in MCP widget trigger mechanism
```bash
# Scan MCP integration code
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py \
--security-only \
src/integrations/mcp_framework.py
# Expected findings:
# - Message injection protection
# - Parameter validation
# - Cross-widget access control
# - State synchronization safety
```
### Scenario 3: Widget Registry Data Safety
**Block**: 4 (DatabaseMaster)
**Task**: Verify widget registry prevents data corruption
```bash
# Scan database operations
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py \
--api-validation \
src/models/widget_registry.py
# Expected findings:
# - SQL injection prevention
# - Transaction integrity
# - Concurrent access safety
# - Backup mechanism validation
```
## Integration with Cascade
Hugging Face detector runs:
- On code changes for security-critical paths
- Results logged to: `.claude/logs/security-scan.log`
- CRITICAL findings block deployment
- HIGH findings trigger review
- Results included in daily standup
## Performance Optimization Tips
### Tip 1: Use Caching
```bash
# Results are cached by default
# Second run on same code is instant
# Clear cache if needed
rm -rf .cache/huggingface-detector/
```
### Tip 2: Parallel Scanning
```bash
# Install GNU Parallel
pip install parallel
# Scan 3 directories in parallel
parallel -j 3 "timeout 30 python detector.py {}" ::: src/agents src/services src/routers
```
### Tip 3: Incremental Scanning
```bash
# Only scan files changed since last commit
git diff --name-only | xargs -I {} timeout 30 python detector.py {}
# Or use git hooks for automatic scanning
# (setup in pre-commit hooks)
```
## Success Metrics
**Good HF detection results**:
- βœ… All CRITICAL issues identified
- βœ… Security vulnerabilities clearly reported
- βœ… Actionable fix recommendations
- βœ… Reasonable false-positive rate (<20%)
- βœ… Scans complete within 30 seconds
**Poor results** (need investigation):
- ❌ Scan timeout every run
- ❌ Memory errors or crashes
- ❌ High false-positive rate (>50%)
- ❌ Missing obvious vulnerabilities
- ❌ Cannot download models
## Troubleshooting Checklist
- [ ] Have 30+ second timeout available?
- [ ] Scanning module, not entire project?
- [ ] Models cached? (first run is slow)
- [ ] Have GPU memory available? (or use CPU)
- [ ] Internet connection for first model download?
- [ ] Sufficient disk space for model cache?
## Next Steps
1. **Start small**: Scan one Python file first
2. **Review findings**: Read security warnings carefully
3. **Fix issues**: Address CRITICAL findings first
4. **Rescan**: Confirm fixes worked
5. **Automate**: Add to CI/CD pipeline
## Questions?
Escalation path:
1. Check this README
2. Try lower severity threshold to exclude false positives
3. File issue in daily standup with:
- Command you ran
- Error message (if any)
- Your block number
- Timeout issues or false positives encountered
---
**Ready to use?**
```bash
cd C:/Users/claus/Projects/WidgetTDC
# Start with one file
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py src/services/email_service.py
# Then expand to module
timeout 30 python tools/error-libraries/huggingface-error-detection/detector.py src/services/
# Review findings and fix
```
Security scanning enabled. πŸ”’