# 🧪 Testing Guide for ViT Auditing Toolkit

Complete guide for testing all features using the provided sample images.

## 📋 Quick Test Checklist

- [ ] Basic Explainability - Attention Visualization
- [ ] Basic Explainability - GradCAM
- [ ] Basic Explainability - GradientSHAP
- [ ] Counterfactual Analysis - All perturbation types
- [ ] Confidence Calibration - Different bin sizes
- [ ] Bias Detection - Multiple subgroups
- [ ] Model Switching (ViT-Base ↔ ViT-Large)

---

## 🔍 Tab 1: Basic Explainability Testing

### Test 1: Attention Visualization

**Image**: `examples/basic_explainability/cat_portrait.jpg`

**Steps**:
1. Load the ViT-Base model
2. Upload cat_portrait.jpg
3. Select "Attention Visualization"
4. Try these layer/head combinations:
   - Layer 0, Head 0 (low-level features)
   - Layer 6, Head 0 (mid-level patterns)
   - Layer 11, Head 0 (high-level semantics)

**Expected Results**:
- ✅ Early layers: focus on edges and textures
- ✅ Middle layers: focus on cat features (ears, eyes)
- ✅ Late layers: focus on discriminative regions (face)

---

### Test 2: GradCAM Visualization

**Image**: `examples/basic_explainability/sports_car.jpg`

**Steps**:
1. Upload sports_car.jpg
2. Select the "GradCAM" method
3. Click "Analyze Image"

**Expected Results**:
- ✅ Heatmap highlights the car body and wheels
- ✅ Prediction confidence > 70%
- ✅ Top class includes "sports car" or "convertible"

---

### Test 3: GradientSHAP

**Image**: `examples/basic_explainability/bird_flying.jpg`

**Steps**:
1. Upload bird_flying.jpg
2. Select the "GradientSHAP" method
3. Wait for the analysis (takes ~10-15 seconds)

**Expected Results**:
- ✅ Attribution map shows the bird outline
- ✅ Wings and body highlighted
- ✅ Background has low attribution

---

### Test 4: Multiple Objects

**Image**: `examples/basic_explainability/coffee_cup.jpg`

**Steps**:
1. Upload coffee_cup.jpg
2. Try all three methods
3. Compare explanations

**Expected Results**:
- ✅ All methods highlight the cup
- ✅ Consistent predictions across methods
- ✅ Some variation in exact highlighted regions

---

## 🔄 Tab 2: Counterfactual Analysis Testing

### Test 5: Face Feature Importance

**Image**: `examples/counterfactual/face_portrait.jpg`

**Steps**:
1. Upload face_portrait.jpg
2. Settings:
   - Patch size: 32
   - Perturbation: blur
3. Click "Run Counterfactual Analysis"

**Expected Results**:
- ✅ Face region shows high sensitivity
- ✅ Background regions have low impact
- ✅ Prediction flip rate < 50%

---

### Test 6: Vehicle Components

**Image**: `examples/counterfactual/car_side.jpg`

**Steps**:
1. Upload car_side.jpg
2. Test each perturbation type:
   - Blur
   - Blackout
   - Gray
   - Noise
3. Compare results

**Expected Results**:
- ✅ Wheels are critical regions
- ✅ Windows/doors moderately important
- ✅ Blackout causes the most disruption

---

### Test 7: Architectural Elements

**Image**: `examples/counterfactual/building.jpg`

**Steps**:
1. Upload building.jpg
2. Patch size: 48
3. Perturbation: gray

**Expected Results**:
- ✅ Structural elements highlighted
- ✅ Lower flip rate (buildings are robust)
- ✅ Consistent confidence across patches

---

### Test 8: Simple Object Baseline

**Image**: `examples/counterfactual/flower.jpg`

**Steps**:
1. Upload flower.jpg
2. Try the smallest patch size (16)
3. Use the blackout perturbation

**Expected Results**:
- ✅ Flower center most critical
- ✅ Petals moderately important
- ✅ Background has minimal impact

---

## 📊 Tab 3: Confidence Calibration Testing

### Test 9: High-Quality Image

**Image**: `examples/calibration/clear_panda.jpg`

**Steps**:
1. Upload clear_panda.jpg
2. Number of bins: 10
3. Run the analysis

**Expected Results**:
- ✅ High mean confidence (> 0.8)
- ✅ Low overconfident rate
- ✅ Calibration curve near the diagonal

---

### Test 10: Complex Scene

**Image**: `examples/calibration/workspace.jpg`

**Steps**:
1. Upload workspace.jpg
2. Number of bins: 15
3. Compare with the panda results

**Expected Results**:
- ✅ Lower mean confidence (multiple objects)
- ✅ Higher variance in predictions
- ✅ Predictions more distributed across bins

---

### Test 11: Bin Size Comparison

**Image**: `examples/calibration/outdoor_scene.jpg`

**Steps**:
1. Upload outdoor_scene.jpg
2. Test with bins: 5, 10, 20
3. Compare the calibration curves

**Expected Results**:
- ✅ More bins = finer granularity
- ✅ General trend consistent
- ✅ 10 bins usually optimal

---

## ⚖️ Tab 4: Bias Detection Testing

### Test 12: Lighting Conditions

**Image**: `examples/bias_detection/dog_daylight.jpg`

**Steps**:
1. Upload dog_daylight.jpg
2. Run bias detection
3. Note the confidence for the daylight subgroup

**Expected Results**:
- ✅ 4 subgroups generated (original, bright+, bright-, contrast+)
- ✅ Confidence varies across subgroups
- ✅ Original typically has the highest confidence

---

### Test 13: Indoor vs Outdoor

**Images**:
- `examples/bias_detection/cat_indoor.jpg`
- `examples/bias_detection/bird_outdoor.jpg`

**Steps**:
1. Test both images separately
2. Compare the confidence distributions
3. Note any systematic differences

**Expected Results**:
- ✅ Both should predict correctly
- ✅ Confidence may vary
- ✅ Subgroup metrics show variations

---

### Test 14: Urban Environment

**Image**: `examples/bias_detection/urban_scene.jpg`

**Steps**:
1. Upload urban_scene.jpg
2. Run bias detection
3. Check for environmental bias

**Expected Results**:
- ✅ Multiple objects detected
- ✅ Varied confidence across subgroups
- ✅ Brightness variations affect predictions

---

## 🎯 Cross-Tab Testing

### Test 15: Same Image, All Tabs

**Image**: `examples/general/pizza.jpg`

**Steps**:
1. Tab 1: Check predictions and explanations
2. Tab 2: Test robustness with perturbations
3. Tab 3: Check confidence calibration
4. Tab 4: Analyze across subgroups

**Expected Results**:
- ✅ Consistent predictions across tabs
- ✅ High confidence (pizza is a clear class)
- ✅ Robust to perturbations
- ✅ Well-calibrated

---

### Test 16: Model Comparison

**Image**: `examples/general/laptop.jpg`

**Steps**:
1. Load ViT-Base and analyze laptop.jpg in Tab 1
2. Note the top predictions and confidence
3. Load ViT-Large and analyze the same image
4. Compare the results

**Expected Results**:
- ✅ ViT-Large slightly higher confidence
- ✅ Similar top predictions
- ✅ Better attention patterns (Large)
- ✅ Longer inference time (Large)

---

### Test 17: Edge Case Testing

**Image**: `examples/general/mountain.jpg`

**Steps**:
1. Test in all tabs
2. Note the predictions (landscape/nature)
3. Check explanation quality

**Expected Results**:
- ✅ May predict multiple classes (mountain, valley, landscape)
- ✅ Lower confidence (ambiguous category)
- ✅ Attention spread across the scene

---

### Test 18: Furniture Classification

**Image**: `examples/general/chair.jpg`

**Steps**:
1. Basic explainability test
2. Counterfactual with blur
3. Check which parts are critical

**Expected Results**:
- ✅ Predicts chair/furniture
- ✅ Legs and seat are critical
- ✅ Background less important

---

## 🔧 Performance Testing

### Test 19: Load Time

**Steps**:
1. Clear the browser cache
2. Time the model loading
3. Note the first analysis time vs subsequent runs

**Expected**:
- First load: 5-15 seconds
- Subsequent loads: < 1 second
- Analysis: 2-5 seconds per image

---

### Test 20: Memory Usage

**Steps**:
1. Open the browser dev tools
2. Monitor memory during analysis
3. Test with both models

**Expected**:
- ViT-Base: ~2 GB RAM
- ViT-Large: ~4 GB RAM
- No memory leaks over multiple analyses

---

## 🐛 Error Handling Testing

### Test 21: Invalid Inputs

**Steps**:
1. Try uploading a non-image file
2. Try a very large image (> 50 MB)
3. Try a corrupted image

**Expected**:
- ✅ Graceful error messages
- ✅ No crashes
- ✅ User-friendly feedback

---

### Test 22: Edge Cases

**Steps**:
1. Try extremely dark/bright images
2. Try pure-noise images
3. Try text-only images

**Expected**:
- ✅ Model makes predictions
- ✅ Lower confidence expected
- ✅ Explanations still generated

---

## 📝 Test Results Template

```markdown
## Test Session: [Date]

**Tester**: [Name]
**Model**: ViT-Base / ViT-Large
**Browser**: [Chrome/Firefox/Safari]
**Environment**: [Local/Docker/Cloud]

### Results Summary:
- Tests Passed: __/22
- Tests Failed: __/22
- Critical Issues: __
- Minor Issues: __

### Detailed Results:

#### Test 1: Attention Visualization
- Status: ✅ Pass / ❌ Fail
- Notes: [observations]

[Continue for all tests...]

### Issues Found:
1. [Issue description]
   - Severity: Critical/Major/Minor
   - Steps to reproduce:
   - Expected:
   - Actual:

### Recommendations:
- [Improvement suggestions]
```

---

## 🚀 Quick Smoke Test (5 minutes)

Fastest way to verify everything works:

```bash
# 1. Start the app
python app.py

# 2. Load the ViT-Base model

# 3. Quick tests:
#    Tab 1: Upload examples/basic_explainability/cat_portrait.jpg → Analyze
#    Tab 2: Upload examples/counterfactual/flower.jpg → Analyze
#    Tab 3: Upload examples/calibration/clear_panda.jpg → Analyze
#    Tab 4: Upload examples/bias_detection/dog_daylight.jpg → Analyze

# 4. All four should complete without errors
```

---

## 📊 Automated Testing

Run the automated tests:

```bash
# Unit tests
pytest tests/test_phase1_complete.py -v

# Advanced features tests
pytest tests/test_advanced_features.py -v

# All tests with coverage
pytest tests/ --cov=src --cov-report=html
```

---

## 🎓 User Acceptance Testing

**Scenario 1: First-time User**
- Can they understand the interface?
- Can they complete a basic analysis?
- Is the documentation helpful?

**Scenario 2: Researcher**
- Can they compare multiple methods?
- Can they export results?
- Is the explanation quality sufficient?
**Scenario 3: ML Practitioner**
- Can they validate their model?
- Are the metrics meaningful?
- Can they identify issues?

---

## ✅ Sign-off Criteria

Before considering testing complete:

- [ ] All 22 tests pass
- [ ] No critical bugs
- [ ] Performance acceptable
- [ ] Documentation accurate
- [ ] User feedback positive
- [ ] All tabs functional
- [ ] Both models work
- [ ] Error handling robust

---

**Happy Testing! 🎉**

For issues or questions, see [CONTRIBUTING.md](CONTRIBUTING.md)
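
---

## 📐 Appendix: How Calibration Binning Works

When judging the Tab 3 results (mean confidence, calibration curve, bin counts), it helps to know the generic binning logic behind a calibration curve and the Expected Calibration Error (ECE). The sketch below is a stand-alone illustration with synthetic inputs, not the toolkit's own code; the function name `expected_calibration_error` is hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE sketch: bucket predictions by confidence, then compare
    mean confidence to empirical accuracy in each bucket.

    confidences: list of floats in (0, 1]; correct: list of 0/1 outcomes.
    Bins are half-open (lo, hi], so a confidence of exactly 0.0 would need
    the first bin closed at 0 -- omitted here for brevity.
    """
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [(c, a) for c, a in zip(confidences, correct) if lo < c <= hi]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        accuracy = sum(a for _, a in in_bin) / len(in_bin)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(in_bin) / total) * abs(avg_conf - accuracy)
    return ece

# Overconfident toy example: 90% confidence but only 50% accuracy.
print(f"ECE = {expected_calibration_error([0.9] * 10, [1, 0] * 5):.2f}")  # ECE = 0.40
```

This is the kind of gap Tests 9-11 probe: a well-calibrated model keeps the per-bin gap small (curve near the diagonal), and more bins only refine where the gaps occur, which is why the general trend should stay consistent across 5, 10, and 20 bins.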
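
---

## 📐 Appendix: The Four Perturbation Types

The blur/blackout/gray/noise perturbations exercised in Tests 5-8 can be illustrated with a tiny stand-alone sketch. `perturb_patch` is a hypothetical helper operating on a grayscale image stored as a list of rows, not the toolkit's implementation; in the real tool each perturbed copy is re-classified and the confidence drop (or prediction flip) is recorded per patch.

```python
import random

def perturb_patch(image, top, left, size, mode="blackout"):
    """Return a copy of `image` (2-D list of grayscale floats in [0, 1])
    with one size x size patch perturbed, mirroring Tab 2's four modes."""
    out = [row[:] for row in image]  # copy so the original stays intact
    rows = range(top, min(top + size, len(image)))
    cols = range(left, min(left + size, len(image[0])))
    if mode == "blur":  # crude blur: replace the patch with its own mean
        vals = [image[r][c] for r in rows for c in cols]
        fill = sum(vals) / len(vals)
    for r in rows:
        for c in cols:
            if mode == "blackout":
                out[r][c] = 0.0
            elif mode == "gray":
                out[r][c] = 0.5
            elif mode == "noise":
                out[r][c] = random.random()
            elif mode == "blur":
                out[r][c] = fill
    return out

img = [[1.0] * 4 for _ in range(4)]  # all-white 4x4 toy image
print(perturb_patch(img, 0, 0, 2, "gray")[0][:2])  # [0.5, 0.5]
```

Blackout removes all information in the patch (hence the largest disruption in Test 6), gray and blur remove detail while keeping average intensity plausible, and noise replaces it with random content.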