ViT-Auditing-Toolkit / TESTING.md
Dyuti Dasmahapatra


πŸ§ͺ Testing Guide for ViT Auditing Toolkit

Complete guide for testing all features using the provided sample images.

πŸ“‹ Quick Test Checklist

  • Basic Explainability - Attention Visualization
  • Basic Explainability - GradCAM
  • Basic Explainability - GradientSHAP
  • Counterfactual Analysis - All perturbation types
  • Confidence Calibration - Different bin sizes
  • Bias Detection - Multiple subgroups
  • Model Switching (ViT-Base ↔ ViT-Large)

πŸ” Tab 1: Basic Explainability Testing

Test 1: Attention Visualization

Image: examples/basic_explainability/cat_portrait.jpg

Steps:

  1. Load ViT-Base model
  2. Upload cat_portrait.jpg
  3. Select "Attention Visualization"
  4. Try these layer/head combinations:
    • Layer 0, Head 0 (low-level features)
    • Layer 6, Head 0 (mid-level patterns)
    • Layer 11, Head 0 (high-level semantics)

Expected Results:

  • βœ… Early layers: Focus on edges, textures
  • βœ… Middle layers: Focus on cat features (ears, eyes)
  • βœ… Late layers: Focus on discriminative regions (face)
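
The layer/head inspection above boils down to reshaping one head's CLS-token attention into the patch grid. A minimal sketch, assuming ViT-Base at 224Γ—224 with 16Γ—16 patches (197 tokens: 1 CLS + 196 patches forming a 14Γ—14 grid); random weights stand in for real model output:

```python
import numpy as np

def cls_attention_grid(attn, head):
    """Turn one head's token-to-token attention into a 14x14 patch grid.

    attn: array of shape (num_heads, 197, 197) for a single layer.
    Returns a (14, 14) map of how strongly the CLS token attends to each
    patch, normalized to [0, 1] for display as a heatmap.
    """
    cls_row = attn[head, 0, 1:]        # CLS token's attention to the 196 patches
    grid = cls_row.reshape(14, 14)     # patches form a 14x14 spatial grid
    grid = grid - grid.min()
    return grid / (grid.max() + 1e-8)  # normalize for overlaying on the image

# Random weights standing in for a real layer's attention output:
rng = np.random.default_rng(0)
fake_attn = rng.random((12, 197, 197))  # ViT-Base has 12 heads per layer
heatmap = cls_attention_grid(fake_attn, head=0)
```

The resulting 14Γ—14 grid is typically upsampled to 224Γ—224 and blended over the input image.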

Test 2: GradCAM Visualization

Image: examples/basic_explainability/sports_car.jpg

Steps:

  1. Upload sports_car.jpg
  2. Select "GradCAM" method
  3. Click "Analyze Image"

Expected Results:

  • βœ… Heatmap highlights car body, wheels
  • βœ… Prediction confidence > 70%
  • βœ… Top class includes "sports car" or "convertible"
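
GradCAM weights a layer's activations by the spatial average of the class-score gradients. The sketch below shows the principle on a toy CNN using forward/backward hooks; it is not the toolkit's code (for a ViT, the token activations would additionally be reshaped into a 14Γ—14 grid first):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy CNN standing in for the backbone.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 32 * 32, 10),
)
target_layer = model[2]  # the last convolution

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 32, 32)
logits = model(x)
logits[0, logits.argmax()].backward()  # gradient of the top-class score

weights = grads["v"].mean(dim=(2, 3), keepdim=True)  # global-average-pool grads
cam = F.relu((weights * acts["v"]).sum(dim=1))       # weighted channel sum
cam = cam / (cam.max() + 1e-8)                       # (1, 32, 32) map in [0, 1]
```

The heatmap you see in the UI is this map resized to the input resolution and overlaid on the photo.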

Test 3: GradientSHAP

Image: examples/basic_explainability/bird_flying.jpg

Steps:

  1. Upload bird_flying.jpg
  2. Select "GradientSHAP" method
  3. Wait for analysis (takes ~10-15 seconds)

Expected Results:

  • βœ… Attribution map shows bird outline
  • βœ… Wings and body highlighted
  • βœ… Background has low attribution
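
GradientSHAP averages gradients taken at random points between the input and samples from a baseline distribution, scaled by (input βˆ’ baseline) β€” which is why it is slower than a single backward pass. A hand-rolled sketch of that idea on a stand-in model (a library such as Captum provides the production implementation):

```python
import torch
import torch.nn as nn

# Tiny stand-in classifier; in the toolkit the real ViT plays this role.
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5))

def gradient_shap(model, x, baselines, target, n_samples=20):
    """Average input gradients at random interpolation points between
    baseline samples and the input, scaled by (input - baseline)."""
    total = torch.zeros_like(x)
    for _ in range(n_samples):
        base = baselines[torch.randint(len(baselines), (1,))]
        alpha = torch.rand(1)
        point = (base + alpha * (x - base)).requires_grad_(True)
        model(point)[0, target].backward()
        total += point.grad * (x - base)
    return total / n_samples

x = torch.randn(1, 3, 8, 8)
baselines = torch.zeros(5, 3, 8, 8)  # black-image baseline distribution
attr = gradient_shap(model, x, baselines, target=2)
```

More samples give smoother attribution maps at proportionally higher cost, which accounts for the ~10-15 second wait.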

Test 4: Multiple Objects

Image: examples/basic_explainability/coffee_cup.jpg

Steps:

  1. Upload coffee_cup.jpg
  2. Try all three methods
  3. Compare explanations

Expected Results:

  • βœ… All methods highlight the cup
  • βœ… Consistent predictions across methods
  • βœ… Some variation in exact highlighted regions

πŸ”„ Tab 2: Counterfactual Analysis Testing

Test 5: Face Feature Importance

Image: examples/counterfactual/face_portrait.jpg

Steps:

  1. Upload face_portrait.jpg
  2. Settings:
    • Patch size: 32
    • Perturbation: blur
  3. Click "Run Counterfactual Analysis"

Expected Results:

  • βœ… Face region shows high sensitivity
  • βœ… Background regions have low impact
  • βœ… Prediction flip rate < 50%

Test 6: Vehicle Components

Image: examples/counterfactual/car_side.jpg

Steps:

  1. Upload car_side.jpg
  2. Test each perturbation type:
    • Blur
    • Blackout
    • Gray
    • Noise
  3. Compare results

Expected Results:

  • βœ… Wheels are critical regions
  • βœ… Windows/doors moderately important
  • βœ… Blackout causes most disruption
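
The four perturbation types can be sketched as a single patch-editing helper. This is an illustrative stand-in for the toolkit's implementation; the blur radius and gray value are assumptions:

```python
import numpy as np
from PIL import Image, ImageFilter

def perturb_patch(img, x, y, size, mode):
    """Return a copy of `img` with one (size x size) patch perturbed.
    Modes mirror the four UI options: blur, blackout, gray, noise."""
    out = img.copy()
    box = (x, y, x + size, y + size)
    patch = out.crop(box)
    if mode == "blur":
        patch = patch.filter(ImageFilter.GaussianBlur(radius=8))
    elif mode == "blackout":
        patch = Image.new("RGB", patch.size, (0, 0, 0))
    elif mode == "gray":
        patch = Image.new("RGB", patch.size, (128, 128, 128))
    elif mode == "noise":
        noise = np.random.randint(0, 256, (size, size, 3), dtype=np.uint8)
        patch = Image.fromarray(noise)
    out.paste(patch, box)
    return out

# The analysis sweeps this over a grid, re-runs the model on each copy, and
# records the drop in top-class confidence as that patch's importance.
img = Image.new("RGB", (224, 224), (200, 180, 160))  # placeholder image
perturbed = perturb_patch(img, 64, 64, 32, "blackout")
```

Blackout removes all information in the patch, which is why it tends to cause the largest confidence drops.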

Test 7: Architectural Elements

Image: examples/counterfactual/building.jpg

Steps:

  1. Upload building.jpg
  2. Patch size: 48
  3. Perturbation: gray

Expected Results:

  • βœ… Structural elements highlighted
  • βœ… Lower flip rate (building classes tend to be robust)
  • βœ… Consistent confidence across patches

Test 8: Simple Object Baseline

Image: examples/counterfactual/flower.jpg

Steps:

  1. Upload flower.jpg
  2. Try smallest patch size (16)
  3. Use blackout perturbation

Expected Results:

  • βœ… Flower center most critical
  • βœ… Petals moderately important
  • βœ… Background has minimal impact

πŸ“Š Tab 3: Confidence Calibration Testing

Test 9: High-Quality Image

Image: examples/calibration/clear_panda.jpg

Steps:

  1. Upload clear_panda.jpg
  2. Number of bins: 10
  3. Run analysis

Expected Results:

  • βœ… High mean confidence (> 0.8)
  • βœ… Low overconfident rate
  • βœ… Calibration curve near diagonal

Test 10: Complex Scene

Image: examples/calibration/workspace.jpg

Steps:

  1. Upload workspace.jpg
  2. Number of bins: 15
  3. Compare with panda results

Expected Results:

  • βœ… Lower mean confidence (multiple objects)
  • βœ… Higher variance in predictions
  • βœ… More distributed across bins

Test 11: Bin Size Comparison

Image: examples/calibration/outdoor_scene.jpg

Steps:

  1. Upload outdoor_scene.jpg
  2. Test with bins: 5, 10, 20
  3. Compare calibration curves

Expected Results:

  • βœ… More bins give finer granularity
  • βœ… The overall trend stays consistent across bin counts
  • βœ… 10 bins is usually a good default
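
The calibration curve and bin statistics come from grouping predictions by confidence and comparing each bin's average confidence with its accuracy. A minimal sketch of expected calibration error (ECE) over such bins, on toy data:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: the sample-weighted gap between each confidence bin's
    average confidence and its empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# Perfectly calibrated toy data: 80% confidence, 80% accuracy -> ECE near 0
conf = [0.8] * 10
hits = [1] * 8 + [0] * 2
ece = expected_calibration_error(conf, hits, n_bins=10)
```

A curve "near the diagonal" corresponds to a small ECE; overconfidence shows up as bins whose confidence exceeds their accuracy.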

βš–οΈ Tab 4: Bias Detection Testing

Test 12: Lighting Conditions

Image: examples/bias_detection/dog_daylight.jpg

Steps:

  1. Upload dog_daylight.jpg
  2. Run bias detection
  3. Note confidence for daylight subgroup

Expected Results:

  • βœ… 4 subgroups generated (original, bright+, bright-, contrast+)
  • βœ… Confidence varies across subgroups
  • βœ… The original image typically has the highest confidence
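
The subgroup generation above (original, bright+, bright-, contrast+) can be sketched with PIL's `ImageEnhance`; the 1.5Γ—/0.6Γ— factors here are assumptions, not necessarily the toolkit's exact values:

```python
from PIL import Image, ImageEnhance

def make_subgroups(img):
    """Generate the four subgroups the bias tab compares: the original
    plus brightened, darkened, and contrast-boosted copies."""
    return {
        "original": img,
        "bright+": ImageEnhance.Brightness(img).enhance(1.5),
        "bright-": ImageEnhance.Brightness(img).enhance(0.6),
        "contrast+": ImageEnhance.Contrast(img).enhance(1.5),
    }

# Each subgroup is classified separately; a large spread in per-subgroup
# confidence suggests sensitivity to capture conditions such as lighting.
img = Image.new("RGB", (224, 224), (100, 100, 100))  # placeholder image
groups = make_subgroups(img)
```

Comparing per-subgroup confidence is what surfaces the "brightness variations affect predictions" findings in the tests below.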

Test 13: Indoor vs Outdoor

Images:

  • examples/bias_detection/cat_indoor.jpg
  • examples/bias_detection/bird_outdoor.jpg

Steps:

  1. Test both images separately
  2. Compare confidence distributions
  3. Note any systematic differences

Expected Results:

  • βœ… Both should predict correctly
  • βœ… Confidence may vary
  • βœ… Subgroup metrics show variations

Test 14: Urban Environment

Image: examples/bias_detection/urban_scene.jpg

Steps:

  1. Upload urban_scene.jpg
  2. Run bias detection
  3. Check for environmental bias

Expected Results:

  • βœ… Multiple objects detected
  • βœ… Varied confidence across subgroups
  • βœ… Brightness variations affect predictions

🎯 Cross-Tab Testing

Test 15: Same Image, All Tabs

Image: examples/general/pizza.jpg

Steps:

  1. Tab 1: Check predictions and explanations
  2. Tab 2: Test robustness with perturbations
  3. Tab 3: Check confidence calibration
  4. Tab 4: Analyze across subgroups

Expected Results:

  • βœ… Consistent predictions across tabs
  • βœ… High confidence (pizza is clear class)
  • βœ… Robust to perturbations
  • βœ… Well-calibrated

Test 16: Model Comparison

Image: examples/general/laptop.jpg

Steps:

  1. Load ViT-Base, analyze laptop.jpg in Tab 1
  2. Note top predictions and confidence
  3. Load ViT-Large, analyze same image
  4. Compare results

Expected Results:

  • βœ… ViT-Large typically shows slightly higher confidence
  • βœ… Similar top predictions
  • βœ… Sharper, more focused attention patterns (ViT-Large)
  • βœ… Longer inference time (ViT-Large)

Test 17: Edge Case Testing

Image: examples/general/mountain.jpg

Steps:

  1. Test in all tabs
  2. Note predictions (landscape/nature)
  3. Check explanation quality

Expected Results:

  • βœ… May predict multiple classes (mountain, valley, landscape)
  • βœ… Lower confidence (ambiguous category)
  • βœ… Attention spread across scene

Test 18: Furniture Classification

Image: examples/general/chair.jpg

Steps:

  1. Basic explainability test
  2. Counterfactual with blur
  3. Check which parts are critical

Expected Results:

  • βœ… Predicts chair/furniture
  • βœ… Legs and seat are critical
  • βœ… Background less important

πŸ”§ Performance Testing

Test 19: Load Time

Steps:

  1. Clear browser cache
  2. Time model loading
  3. Note first analysis time vs subsequent

Expected:

  • First load: 5-15 seconds
  • Subsequent: < 1 second
  • Analysis: 2-5 seconds per image

Test 20: Memory Usage

Steps:

  1. Open browser dev tools
  2. Monitor memory during analysis
  3. Test with both models

Expected:

  • ViT-Base: ~2GB RAM
  • ViT-Large: ~4GB RAM
  • No memory leaks over multiple analyses

πŸ› Error Handling Testing

Test 21: Invalid Inputs

Steps:

  1. Try uploading non-image file
  2. Try very large image (> 50MB)
  3. Try corrupted image

Expected:

  • βœ… Graceful error messages
  • βœ… No crashes
  • βœ… User-friendly feedback
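
Graceful handling of the inputs above can be sketched as a small validation wrapper around the upload. The 50 MB limit matches the test case; the error strings are illustrative, not the app's actual messages:

```python
import io
from PIL import Image, UnidentifiedImageError

MAX_BYTES = 50 * 1024 * 1024  # 50 MB cap, matching the test above

def load_upload(data: bytes):
    """Validate an uploaded file; return (image, error_message)."""
    if len(data) > MAX_BYTES:
        return None, "Image is larger than 50MB β€” please upload a smaller file."
    try:
        img = Image.open(io.BytesIO(data))
        img.load()  # force a full decode to catch truncated/corrupted files
        return img.convert("RGB"), None
    except UnidentifiedImageError:
        return None, "That file doesn't look like an image."
    except OSError:
        return None, "The image appears to be corrupted."

img, err = load_upload(b"not an image at all")  # non-image input -> error
```

Returning a message instead of raising is what keeps the UI from crashing on bad input.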

Test 22: Edge Cases

Steps:

  1. Try extremely dark/bright images
  2. Try pure noise images
  3. Try text-only images

Expected:

  • βœ… Model makes predictions
  • βœ… Lower confidence expected
  • βœ… Explanations still generated

πŸ“ Test Results Template

```markdown
## Test Session: [Date]

**Tester**: [Name]
**Model**: ViT-Base / ViT-Large
**Browser**: [Chrome/Firefox/Safari]
**Environment**: [Local/Docker/Cloud]

### Results Summary:
- Tests Passed: __/22
- Tests Failed: __/22
- Critical Issues: __
- Minor Issues: __

### Detailed Results:

#### Test 1: Attention Visualization
- Status: βœ… Pass / ❌ Fail
- Notes: [observations]

[Continue for all tests...]

### Issues Found:
1. [Issue description]
   - Severity: Critical/Major/Minor
   - Steps to reproduce:
   - Expected:
   - Actual:

### Recommendations:
- [Improvement suggestions]
```

πŸš€ Quick Smoke Test (5 minutes)

Fastest way to verify everything works:

```shell
# 1. Start the app
python app.py

# 2. Load the ViT-Base model in the UI

# 3. Quick tests:
#    Tab 1: Upload examples/basic_explainability/cat_portrait.jpg β†’ Analyze
#    Tab 2: Upload examples/counterfactual/flower.jpg β†’ Analyze
#    Tab 3: Upload examples/calibration/clear_panda.jpg β†’ Analyze
#    Tab 4: Upload examples/bias_detection/dog_daylight.jpg β†’ Analyze

# 4. All four analyses should complete without errors
```

πŸ“Š Automated Testing

Run automated tests:

```shell
# Unit tests
pytest tests/test_phase1_complete.py -v

# Advanced features tests
pytest tests/test_advanced_features.py -v

# All tests with coverage
pytest tests/ --cov=src --cov-report=html
```
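
An additional unit test in this style might look like the following; the `preprocess` helper here is a hypothetical stand-in, not the toolkit's actual API:

```python
import numpy as np
from PIL import Image

def preprocess(img, size=224):  # hypothetical helper, for illustration only
    """Resize to the model's input size and scale pixels to [0, 1]."""
    return np.asarray(img.resize((size, size)), dtype=np.float32) / 255.0

def test_preprocess_shape_and_range():
    img = Image.new("RGB", (640, 480), (255, 0, 0))
    arr = preprocess(img)
    assert arr.shape == (224, 224, 3)
    assert arr.min() >= 0.0 and arr.max() <= 1.0
```

Small, deterministic tests like this keep the suite fast enough to run on every commit.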

πŸŽ“ User Acceptance Testing

Scenario 1: First-time User

  • Can they understand the interface?
  • Can they complete basic analysis?
  • Is documentation helpful?

Scenario 2: Researcher

  • Can they compare multiple methods?
  • Can they export results?
  • Is explanation quality sufficient?

Scenario 3: ML Practitioner

  • Can they validate their model?
  • Are metrics meaningful?
  • Can they identify issues?

βœ… Sign-off Criteria

Before considering testing complete:

  • All 22 tests pass
  • No critical bugs
  • Performance acceptable
  • Documentation accurate
  • User feedback positive
  • All tabs functional
  • Both models work
  • Error handling robust

Happy Testing! πŸŽ‰

For issues or questions, see CONTRIBUTING.md