SAM3 Testing Guide
Overview
This guide covers two testing approaches for SAM3:
- Basic Inference Testing - Quick API validation with sample images
- Metrics Evaluation - Comprehensive performance analysis against CVAT ground truth
1. Basic Inference Testing
Purpose
Quickly validate that the SAM3 endpoint is working and producing reasonable segmentation results.
Test Infrastructure
The basic testing framework:
- Tests multiple images automatically
- Saves detailed JSON logs of requests and responses
- Generates visualizations with semi-transparent colored masks
- Stores all results in .cache/test/inference/{image_name}/
Running Basic Tests
python3 scripts/test/test_inference_comprehensive.py
Test Output Structure
For each test image, files are generated in .cache/test/inference/{image_name}/:
- request.json - Request metadata (timestamp, endpoint, classes)
- response.json - Response metadata (timestamp, status, results summary)
- full_results.json - Complete API response including base64 masks
- original.jpg - Original test image
- visualization.png - Original image with colored mask overlay
- legend.png - Legend showing class colors and coverage percentages
- mask_{ClassName}.png - Individual binary masks for each class
Tested Classes
The endpoint is tested with these semantic classes:
- Pothole (Red overlay)
- Road crack (Yellow overlay)
- Road (Blue overlay)
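The request the test script sends for these classes can be sketched as follows. This is a minimal illustration only: the endpoint URL and the payload field names ("image", "classes") are assumptions, not the confirmed API contract, so check scripts/test/test_inference_comprehensive.py for the real one.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint URL; substitute the real one from the test script.
ENDPOINT = "http://localhost:8000/segment"

def build_payload(image_path: str) -> bytes:
    """Encode an image as base64 and wrap it in a JSON request body.
    Field names here are assumptions for illustration."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "image": image_b64,
        "classes": ["Pothole", "Road crack", "Road"],
    }).encode("utf-8")

# Sending the request (commented out so the sketch runs without a server):
# req = urllib.request.Request(
#     ENDPOINT,
#     data=build_payload("assets/test_images/test.jpg"),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     results = json.load(resp)
```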
Recent Test Results
Last run: November 23, 2025
- Total images: 8
- Successful: 8/8 (100%)
- Failed: 0
- Average response time: ~1.5 seconds per image
- Status: All API calls returning HTTP 200 with valid masks
Test images include:
- pothole_pexels_01.jpg, pothole_pexels_02.jpg
- road_damage_01.jpg
- road_pexels_01.jpg, road_pexels_02.jpg, road_pexels_03.jpg
- road_unsplash_01.jpg
- test.jpg
Results stored in .cache/test/inference/summary.json
Adding More Test Images
Test images should be placed in assets/test_images/. To expand the test suite:
Download from Public Datasets:
- Pothole Detection Dataset (1,243 images)
- RDD2022 Dataset (47,420 images from 6 countries)
- Roboflow Pothole Dataset
Extract Sample Images: Select diverse examples showing potholes, cracks, and clean roads
Place in Test Directory: Copy to assets/test_images/
2. Metrics Evaluation System
Purpose
Comprehensive quantitative evaluation of SAM3 performance against ground truth annotations from CVAT.
What It Measures
- mAP (mean Average Precision): Detection accuracy across all confidence thresholds
- mAR (mean Average Recall): Coverage of ground truth instances
- IoU metrics: Intersection over Union at multiple thresholds (0%, 25%, 50%, 75%)
- Confusion matrices: Class prediction accuracy patterns
- Per-class statistics: Precision, recall, F1-score for each damage type
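The IoU metric above is computed per mask pair; a minimal sketch, representing each binary mask as a set of (row, col) foreground pixel coordinates:

```python
def iou(pred, gt):
    """Intersection over Union for two binary masks given as sets of
    (row, col) foreground pixel coordinates. An empty union yields 0.0."""
    inter = len(pred & gt)
    union = len(pred | gt)
    return inter / union if union else 0.0

def matches_at(pred, gt, threshold):
    """True if a prediction counts as a hit at the given IoU threshold
    (e.g. 0.25, 0.50, 0.75 as used in this evaluation)."""
    return iou(pred, gt) >= threshold
```

The same predicted mask can therefore be a hit at IoU 0.25 but a miss at IoU 0.75, which is why the evaluation reports several thresholds.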
Running Metrics Evaluation
cd metrics_evaluation
python run_evaluation.py
Options:
# Force re-download from CVAT (ignore cache)
python run_evaluation.py --force-download
# Force re-run inference (ignore cached predictions)
python run_evaluation.py --force-inference
# Skip inference step (use existing predictions)
python run_evaluation.py --skip-inference
# Generate visual comparisons
python run_evaluation.py --visualize
Dataset
Evaluates on 150 annotated images from CVAT:
- 50 images with "Fissure" (road cracks)
- 50 images with "Nid de poule" (potholes)
- 50 images with road surface
Source: Logiroad CVAT organization, AI training project
Output Structure
.cache/test/metrics/
├── Fissure/
│   └── {image_name}/
│       ├── image.jpg
│       ├── ground_truth/
│       │   ├── mask_Fissure_0.png
│       │   └── metadata.json
│       └── inference/
│           ├── mask_Fissure_0.png
│           └── metadata.json
├── Nid de poule/
├── Road/
├── metrics_summary.txt   # Human-readable results
├── metrics_detailed.json # Complete metrics data
└── evaluation_log.txt    # Execution trace
Execution Time
- Image download: ~5-10 minutes (150 images)
- SAM3 inference: ~5-10 minutes (~2s per image)
- Metrics computation: ~1 minute
- Total: ~15-20 minutes for full evaluation
Configuration
Edit metrics_evaluation/config/config.json to:
- Change CVAT project or organization
- Adjust number of images per class
- Modify IoU thresholds
- Update SAM3 endpoint URL
CVAT credentials must be in .env at project root.
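As a rough illustration of the options listed above, a config of this shape might look like the following; every key name and value here is a hypothetical placeholder, not the actual schema of metrics_evaluation/config/config.json:

```
{
  "cvat": {
    "organization": "Logiroad",
    "project": "AI training"
  },
  "images_per_class": 50,
  "iou_thresholds": [0.0, 0.25, 0.5, 0.75],
  "sam3_endpoint": "http://localhost:8000/segment"
}
```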
Cache Directory
All test results are stored in .cache/ (git-ignored), so you can:
- Review results without cluttering the repository
- Compare results across different test runs
- Debug segmentation quality issues
- Resume interrupted evaluations
Quality Validation Checklist
Before accepting test results:
Basic Tests:
- All test images processed successfully
- Masks generated for all requested classes
- Response times reasonable (< 3s per image)
- Visualizations show plausible segmentations
Metrics Evaluation:
- 150 images downloaded from CVAT
- Ground truth masks not empty
- SAM3 inference completed for all images
- Metrics within reasonable ranges (0-100%)
- Confusion matrices show sensible patterns
- Per-class F1 scores above baseline
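The per-class precision, recall, and F1 scores referenced in this checklist follow from true positive, false positive, and false negative counts (a true positive being a prediction whose IoU with a ground-truth instance clears the matching threshold). A minimal sketch of that arithmetic, not the evaluation script's actual implementation:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Per-class scores from match counts. All three scores fall back
    to 0.0 when their denominator is zero (no predictions / no ground
    truth / both scores zero)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, 8 matched predictions with 2 spurious ones and 2 missed ground-truth instances give precision, recall, and F1 of 0.8 each.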
Troubleshooting
Basic Inference Issues
Endpoint not responding:
- Check endpoint URL in test script
- Verify endpoint is running (use curl or a browser)
- Check network connectivity
Empty or invalid masks:
- Review class names match model expectations
- Check image format (should be JPEG/PNG)
- Verify base64 encoding/decoding
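To rule out encoding problems, it helps to round-trip one mask from full_results.json through base64 and sanity-check the decoded bytes. A minimal sketch (how the base64 string is located inside the response is up to you; the check itself only needs the string):

```python
import base64

def decode_mask(b64_mask: str) -> bytes:
    """Decode a base64-encoded PNG mask back into raw PNG bytes.
    Raises binascii.Error on malformed base64 and ValueError when the
    decoded bytes do not start with the PNG signature."""
    png = base64.b64decode(b64_mask, validate=True)
    if png[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("decoded bytes are not a PNG")
    return png
```

If this raises on data pulled straight from the API response, the problem is on the encoding side rather than in your visualization code.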
Metrics Evaluation Issues
CVAT connection fails:
- Check .env credentials
- Verify CVAT organization name
- Test CVAT web access
No images found:
- Check project filter in config.json
- Verify labels exist in CVAT
- Ensure images have annotations
Metrics seem incorrect:
- Inspect confusion matrices
- Review sample visualizations
- Check ground truth quality in CVAT
- Verify mask format (PNG-L, 8-bit grayscale)
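The mask format can be checked without an image library by inspecting the PNG header directly: in the IHDR chunk, color type 0 with bit depth 8 is exactly 8-bit grayscale (PIL mode "L"). A small self-contained check:

```python
def is_8bit_grayscale_png(data: bytes) -> bool:
    """Check the PNG signature, then the IHDR bit-depth byte (offset 24)
    and color-type byte (offset 25); color type 0 means grayscale."""
    if len(data) < 26 or data[:8] != b"\x89PNG\r\n\x1a\n":
        return False
    return data[24] == 8 and data[25] == 0

# Usage: with open("mask_Fissure_0.png", "rb") as f:
#            assert is_8bit_grayscale_png(f.read())
```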
Next Steps
- Run basic tests to validate API connectivity
- Review visualizations to assess segmentation quality
- Run metrics evaluation for quantitative performance
- Analyze confusion matrices to identify systematic errors
- Iterate on model/prompts based on metrics feedback
For detailed metrics evaluation documentation, see metrics_evaluation/README.md.