ViT-Auditing-Toolkit / TESTING.md
Dyuti Dasmahapatra


πŸ§ͺ Testing Guide for ViT Auditing Toolkit

Complete guide for testing all features using the provided sample images.

πŸ“‹ Quick Test Checklist

  • Basic Explainability - Attention Visualization
  • Basic Explainability - GradCAM
  • Basic Explainability - GradientSHAP
  • Counterfactual Analysis - All perturbation types
  • Confidence Calibration - Different bin sizes
  • Bias Detection - Multiple subgroups
  • Model Switching (ViT-Base ↔ ViT-Large)

πŸ” Tab 1: Basic Explainability Testing

Test 1: Attention Visualization

Image: examples/basic_explainability/cat_portrait.jpg

Steps:

  1. Load ViT-Base model
  2. Upload cat_portrait.jpg
  3. Select "Attention Visualization"
  4. Try these layer/head combinations:
    • Layer 0, Head 0 (low-level features)
    • Layer 6, Head 0 (mid-level patterns)
    • Layer 11, Head 0 (high-level semantics)

Expected Results:

  • βœ… Early layers: Focus on edges, textures
  • βœ… Middle layers: Focus on cat features (ears, eyes)
  • βœ… Late layers: Focus on discriminative regions (face)
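
The layer/head inspection above boils down to reshaping one head's CLS-token attention into the patch grid. A minimal sketch, assuming ViT-Base at 224Γ—224 with 16Γ—16 patches (197 tokens: 1 CLS + 196 patches forming a 14Γ—14 grid); random weights stand in for real model output:

```python
import numpy as np

def cls_attention_grid(attn, head):
    """Turn one head's token-to-token attention into a 14x14 patch grid.

    attn: array of shape (num_heads, 197, 197) for a single layer.
    Returns a (14, 14) map of how strongly the CLS token attends to each
    patch, normalized to [0, 1] for display as a heatmap.
    """
    cls_row = attn[head, 0, 1:]        # CLS token's attention to the 196 patches
    grid = cls_row.reshape(14, 14)     # patches form a 14x14 spatial grid
    grid = grid - grid.min()
    return grid / (grid.max() + 1e-8)  # normalize for overlaying on the image

# Random weights standing in for a real layer's attention output:
rng = np.random.default_rng(0)
fake_attn = rng.random((12, 197, 197))  # ViT-Base has 12 heads per layer
heatmap = cls_attention_grid(fake_attn, head=0)
```

The resulting 14Γ—14 grid is typically upsampled to 224Γ—224 and blended over the input image.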

Test 2: GradCAM Visualization

Image: examples/basic_explainability/sports_car.jpg

Steps:

  1. Upload sports_car.jpg
  2. Select "GradCAM" method
  3. Click "Analyze Image"

Expected Results:

  • βœ… Heatmap highlights car body, wheels
  • βœ… Prediction confidence > 70%
  • βœ… Top class includes "sports car" or "convertible"
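
GradCAM weights a layer's activations by the spatial average of the class-score gradients. The sketch below shows the principle on a toy CNN using forward/backward hooks; it is not the toolkit's code (for a ViT, the token activations would additionally be reshaped into a 14Γ—14 grid first):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy CNN standing in for the backbone.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 32 * 32, 10),
)
target_layer = model[2]  # the last convolution

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 32, 32)
logits = model(x)
logits[0, logits.argmax()].backward()  # gradient of the top-class score

weights = grads["v"].mean(dim=(2, 3), keepdim=True)  # global-average-pool grads
cam = F.relu((weights * acts["v"]).sum(dim=1))       # weighted channel sum
cam = cam / (cam.max() + 1e-8)                       # (1, 32, 32) map in [0, 1]
```

The heatmap you see in the UI is this map resized to the input resolution and overlaid on the photo.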

Test 3: GradientSHAP

Image: examples/basic_explainability/bird_flying.jpg

Steps:

  1. Upload bird_flying.jpg
  2. Select "GradientSHAP" method
  3. Wait for analysis (takes ~10-15 seconds)

Expected Results:

  • βœ… Attribution map shows bird outline
  • βœ… Wings and body highlighted
  • βœ… Background has low attribution
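
GradientSHAP averages gradients taken at random points between the input and samples from a baseline distribution, scaled by (input βˆ’ baseline) β€” which is why it is slower than a single backward pass. A hand-rolled sketch of that idea on a stand-in model (a library such as Captum provides the production implementation):

```python
import torch
import torch.nn as nn

# Tiny stand-in classifier; in the toolkit the real ViT plays this role.
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5))

def gradient_shap(model, x, baselines, target, n_samples=20):
    """Average input gradients at random interpolation points between
    baseline samples and the input, scaled by (input - baseline)."""
    total = torch.zeros_like(x)
    for _ in range(n_samples):
        base = baselines[torch.randint(len(baselines), (1,))]
        alpha = torch.rand(1)
        point = (base + alpha * (x - base)).requires_grad_(True)
        model(point)[0, target].backward()
        total += point.grad * (x - base)
    return total / n_samples

x = torch.randn(1, 3, 8, 8)
baselines = torch.zeros(5, 3, 8, 8)  # black-image baseline distribution
attr = gradient_shap(model, x, baselines, target=2)
```

More samples give smoother attribution maps at proportionally higher cost, which accounts for the ~10-15 second wait.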

Test 4: Multiple Objects

Image: examples/basic_explainability/coffee_cup.jpg

Steps:

  1. Upload coffee_cup.jpg
  2. Try all three methods
  3. Compare explanations

Expected Results:

  • βœ… All methods highlight the cup
  • βœ… Consistent predictions across methods
  • βœ… Some variation in exact highlighted regions

πŸ”„ Tab 2: Counterfactual Analysis Testing

Test 5: Face Feature Importance

Image: examples/counterfactual/face_portrait.jpg

Steps:

  1. Upload face_portrait.jpg
  2. Settings:
    • Patch size: 32
    • Perturbation: blur
  3. Click "Run Counterfactual Analysis"

Expected Results:

  • βœ… Face region shows high sensitivity
  • βœ… Background regions have low impact
  • βœ… Prediction flip rate < 50%

Test 6: Vehicle Components

Image: examples/counterfactual/car_side.jpg

Steps:

  1. Upload car_side.jpg
  2. Test each perturbation type:
    • Blur
    • Blackout
    • Gray
    • Noise
  3. Compare results

Expected Results:

  • βœ… Wheels are critical regions
  • βœ… Windows/doors moderately important
  • βœ… Blackout causes most disruption
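
The four perturbation types can be sketched as a single patch-editing helper. This is an illustrative stand-in for the toolkit's implementation; the blur radius and gray value are assumptions:

```python
import numpy as np
from PIL import Image, ImageFilter

def perturb_patch(img, x, y, size, mode):
    """Return a copy of `img` with one (size x size) patch perturbed.
    Modes mirror the four UI options: blur, blackout, gray, noise."""
    out = img.copy()
    box = (x, y, x + size, y + size)
    patch = out.crop(box)
    if mode == "blur":
        patch = patch.filter(ImageFilter.GaussianBlur(radius=8))
    elif mode == "blackout":
        patch = Image.new("RGB", patch.size, (0, 0, 0))
    elif mode == "gray":
        patch = Image.new("RGB", patch.size, (128, 128, 128))
    elif mode == "noise":
        noise = np.random.randint(0, 256, (size, size, 3), dtype=np.uint8)
        patch = Image.fromarray(noise)
    out.paste(patch, box)
    return out

# The analysis sweeps this over a grid, re-runs the model on each copy, and
# records the drop in top-class confidence as that patch's importance.
img = Image.new("RGB", (224, 224), (200, 180, 160))  # placeholder image
perturbed = perturb_patch(img, 64, 64, 32, "blackout")
```

Blackout removes all information in the patch, which is why it tends to cause the largest confidence drops.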

Test 7: Architectural Elements

Image: examples/counterfactual/building.jpg

Steps:

  1. Upload building.jpg
  2. Patch size: 48
  3. Perturbation: gray

Expected Results:

  • βœ… Structural elements highlighted
  • βœ… Lower flip rate (building classes tend to be robust)
  • βœ… Consistent confidence across patches

Test 8: Simple Object Baseline

Image: examples/counterfactual/flower.jpg

Steps:

  1. Upload flower.jpg
  2. Try smallest patch size (16)
  3. Use blackout perturbation

Expected Results:

  • βœ… Flower center most critical
  • βœ… Petals moderately important
  • βœ… Background has minimal impact

πŸ“Š Tab 3: Confidence Calibration Testing

Test 9: High-Quality Image

Image: examples/calibration/clear_panda.jpg

Steps:

  1. Upload clear_panda.jpg
  2. Number of bins: 10
  3. Run analysis

Expected Results:

  • βœ… High mean confidence (> 0.8)
  • βœ… Low overconfident rate
  • βœ… Calibration curve near diagonal

Test 10: Complex Scene

Image: examples/calibration/workspace.jpg

Steps:

  1. Upload workspace.jpg
  2. Number of bins: 15
  3. Compare with panda results

Expected Results:

  • βœ… Lower mean confidence (multiple objects)
  • βœ… Higher variance in predictions
  • βœ… More distributed across bins

Test 11: Bin Size Comparison

Image: examples/calibration/outdoor_scene.jpg

Steps:

  1. Upload outdoor_scene.jpg
  2. Test with bins: 5, 10, 20
  3. Compare calibration curves

Expected Results:

  • βœ… More bins give finer granularity
  • βœ… The overall trend stays consistent across bin counts
  • βœ… 10 bins is usually a good default
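
The calibration curve and bin statistics come from grouping predictions by confidence and comparing each bin's average confidence with its accuracy. A minimal sketch of expected calibration error (ECE) over such bins, on toy data:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: the sample-weighted gap between each confidence bin's
    average confidence and its empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# Perfectly calibrated toy data: 80% confidence, 80% accuracy -> ECE near 0
conf = [0.8] * 10
hits = [1] * 8 + [0] * 2
ece = expected_calibration_error(conf, hits, n_bins=10)
```

A curve "near the diagonal" corresponds to a small ECE; overconfidence shows up as bins whose confidence exceeds their accuracy.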

βš–οΈ Tab 4: Bias Detection Testing

Test 12: Lighting Conditions

Image: examples/bias_detection/dog_daylight.jpg

Steps:

  1. Upload dog_daylight.jpg
  2. Run bias detection
  3. Note confidence for daylight subgroup

Expected Results:

  • βœ… 4 subgroups generated (original, bright+, bright-, contrast+)
  • βœ… Confidence varies across subgroups
  • βœ… The original image typically has the highest confidence
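
The subgroup generation above (original, bright+, bright-, contrast+) can be sketched with PIL's `ImageEnhance`; the 1.5Γ—/0.6Γ— factors here are assumptions, not necessarily the toolkit's exact values:

```python
from PIL import Image, ImageEnhance

def make_subgroups(img):
    """Generate the four subgroups the bias tab compares: the original
    plus brightened, darkened, and contrast-boosted copies."""
    return {
        "original": img,
        "bright+": ImageEnhance.Brightness(img).enhance(1.5),
        "bright-": ImageEnhance.Brightness(img).enhance(0.6),
        "contrast+": ImageEnhance.Contrast(img).enhance(1.5),
    }

# Each subgroup is classified separately; a large spread in per-subgroup
# confidence suggests sensitivity to capture conditions such as lighting.
img = Image.new("RGB", (224, 224), (100, 100, 100))  # placeholder image
groups = make_subgroups(img)
```

Comparing per-subgroup confidence is what surfaces the "brightness variations affect predictions" findings in the tests below.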

Test 13: Indoor vs Outdoor

Images:

  • examples/bias_detection/cat_indoor.jpg
  • examples/bias_detection/bird_outdoor.jpg

Steps:

  1. Test both images separately
  2. Compare confidence distributions
  3. Note any systematic differences

Expected Results:

  • βœ… Both should predict correctly
  • βœ… Confidence may vary
  • βœ… Subgroup metrics show variations

Test 14: Urban Environment

Image: examples/bias_detection/urban_scene.jpg

Steps:

  1. Upload urban_scene.jpg
  2. Run bias detection
  3. Check for environmental bias

Expected Results:

  • βœ… Multiple objects detected
  • βœ… Varied confidence across subgroups
  • βœ… Brightness variations affect predictions

🎯 Cross-Tab Testing

Test 15: Same Image, All Tabs

Image: examples/general/pizza.jpg

Steps:

  1. Tab 1: Check predictions and explanations
  2. Tab 2: Test robustness with perturbations
  3. Tab 3: Check confidence calibration
  4. Tab 4: Analyze across subgroups

Expected Results:

  • βœ… Consistent predictions across tabs
  • βœ… High confidence (pizza is clear class)
  • βœ… Robust to perturbations
  • βœ… Well-calibrated

Test 16: Model Comparison

Image: examples/general/laptop.jpg

Steps:

  1. Load ViT-Base, analyze laptop.jpg in Tab 1
  2. Note top predictions and confidence
  3. Load ViT-Large, analyze same image
  4. Compare results

Expected Results:

  • βœ… ViT-Large typically shows slightly higher confidence
  • βœ… Similar top predictions
  • βœ… Sharper, more focused attention patterns (ViT-Large)
  • βœ… Longer inference time (ViT-Large)

Test 17: Edge Case Testing

Image: examples/general/mountain.jpg

Steps:

  1. Test in all tabs
  2. Note predictions (landscape/nature)
  3. Check explanation quality

Expected Results:

  • βœ… May predict multiple classes (mountain, valley, landscape)
  • βœ… Lower confidence (ambiguous category)
  • βœ… Attention spread across scene

Test 18: Furniture Classification

Image: examples/general/chair.jpg

Steps:

  1. Basic explainability test
  2. Counterfactual with blur
  3. Check which parts are critical

Expected Results:

  • βœ… Predicts chair/furniture
  • βœ… Legs and seat are critical
  • βœ… Background less important

πŸ”§ Performance Testing

Test 19: Load Time

Steps:

  1. Clear browser cache
  2. Time model loading
  3. Note first analysis time vs subsequent

Expected:

  • First load: 5-15 seconds
  • Subsequent: < 1 second
  • Analysis: 2-5 seconds per image

Test 20: Memory Usage

Steps:

  1. Open browser dev tools
  2. Monitor memory during analysis
  3. Test with both models

Expected:

  • ViT-Base: ~2GB RAM
  • ViT-Large: ~4GB RAM
  • No memory leaks over multiple analyses

πŸ› Error Handling Testing

Test 21: Invalid Inputs

Steps:

  1. Try uploading non-image file
  2. Try very large image (> 50MB)
  3. Try corrupted image

Expected:

  • βœ… Graceful error messages
  • βœ… No crashes
  • βœ… User-friendly feedback
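
Graceful handling of the inputs above can be sketched as a small validation wrapper around the upload. The 50 MB limit matches the test case; the error strings are illustrative, not the app's actual messages:

```python
import io
from PIL import Image, UnidentifiedImageError

MAX_BYTES = 50 * 1024 * 1024  # 50 MB cap, matching the test above

def load_upload(data: bytes):
    """Validate an uploaded file; return (image, error_message)."""
    if len(data) > MAX_BYTES:
        return None, "Image is larger than 50MB β€” please upload a smaller file."
    try:
        img = Image.open(io.BytesIO(data))
        img.load()  # force a full decode to catch truncated/corrupted files
        return img.convert("RGB"), None
    except UnidentifiedImageError:
        return None, "That file doesn't look like an image."
    except OSError:
        return None, "The image appears to be corrupted."

img, err = load_upload(b"not an image at all")  # non-image input -> error
```

Returning a message instead of raising is what keeps the UI from crashing on bad input.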

Test 22: Edge Cases

Steps:

  1. Try extremely dark/bright images
  2. Try pure noise images
  3. Try text-only images

Expected:

  • βœ… Model makes predictions
  • βœ… Lower confidence expected
  • βœ… Explanations still generated

πŸ“ Test Results Template

```markdown
## Test Session: [Date]

**Tester**: [Name]
**Model**: ViT-Base / ViT-Large
**Browser**: [Chrome/Firefox/Safari]
**Environment**: [Local/Docker/Cloud]

### Results Summary:
- Tests Passed: __/22
- Tests Failed: __/22
- Critical Issues: __
- Minor Issues: __

### Detailed Results:

#### Test 1: Attention Visualization
- Status: βœ… Pass / ❌ Fail
- Notes: [observations]

[Continue for all tests...]

### Issues Found:
1. [Issue description]
   - Severity: Critical/Major/Minor
   - Steps to reproduce:
   - Expected:
   - Actual:

### Recommendations:
- [Improvement suggestions]
```

πŸš€ Quick Smoke Test (5 minutes)

Fastest way to verify everything works:

```shell
# 1. Start the app
python app.py

# 2. Load the ViT-Base model in the UI

# 3. Quick tests:
#    Tab 1: Upload examples/basic_explainability/cat_portrait.jpg β†’ Analyze
#    Tab 2: Upload examples/counterfactual/flower.jpg β†’ Analyze
#    Tab 3: Upload examples/calibration/clear_panda.jpg β†’ Analyze
#    Tab 4: Upload examples/bias_detection/dog_daylight.jpg β†’ Analyze

# 4. All four analyses should complete without errors
```

πŸ“Š Automated Testing

Run automated tests:

```shell
# Unit tests
pytest tests/test_phase1_complete.py -v

# Advanced features tests
pytest tests/test_advanced_features.py -v

# All tests with coverage
pytest tests/ --cov=src --cov-report=html
```
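
An additional unit test in this style might look like the following; the `preprocess` helper here is a hypothetical stand-in, not the toolkit's actual API:

```python
import numpy as np
from PIL import Image

def preprocess(img, size=224):  # hypothetical helper, for illustration only
    """Resize to the model's input size and scale pixels to [0, 1]."""
    return np.asarray(img.resize((size, size)), dtype=np.float32) / 255.0

def test_preprocess_shape_and_range():
    img = Image.new("RGB", (640, 480), (255, 0, 0))
    arr = preprocess(img)
    assert arr.shape == (224, 224, 3)
    assert arr.min() >= 0.0 and arr.max() <= 1.0
```

Small, deterministic tests like this keep the suite fast enough to run on every commit.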

πŸŽ“ User Acceptance Testing

Scenario 1: First-time User

  • Can they understand the interface?
  • Can they complete basic analysis?
  • Is documentation helpful?

Scenario 2: Researcher

  • Can they compare multiple methods?
  • Can they export results?
  • Is explanation quality sufficient?

Scenario 3: ML Practitioner

  • Can they validate their model?
  • Are metrics meaningful?
  • Can they identify issues?

βœ… Sign-off Criteria

Before considering testing complete:

  • All 22 tests pass
  • No critical bugs
  • Performance acceptable
  • Documentation accurate
  • User feedback positive
  • All tabs functional
  • Both models work
  • Error handling robust

Happy Testing! πŸŽ‰

For issues or questions, see CONTRIBUTING.md