# SAM3 Testing Guide

## Overview

This guide covers two testing approaches for SAM3:

1. **Basic Inference Testing** - Quick API validation with sample images
2. **Metrics Evaluation** - Comprehensive performance analysis against CVAT ground truth

---

## 1. Basic Inference Testing

### Purpose

Quickly validate that the SAM3 endpoint is working and producing reasonable segmentation results.

### Test Infrastructure

The basic testing framework:

- Tests multiple images automatically
- Saves detailed JSON logs of requests and responses
- Generates visualizations with semi-transparent colored masks
- Stores all results in `.cache/test/inference/{image_name}/`

### Running Basic Tests

```bash
python3 scripts/test/test_inference_comprehensive.py
```

### Test Output Structure

For each test image, files are generated in `.cache/test/inference/{image_name}/`:

- `request.json` - Request metadata (timestamp, endpoint, classes)
- `response.json` - Response metadata (timestamp, status, results summary)
- `full_results.json` - Complete API response including base64 masks
- `original.jpg` - Original test image
- `visualization.png` - Original image with colored mask overlay
- `legend.png` - Legend showing class colors and coverage percentages
- `mask_{ClassName}.png` - Individual binary masks for each class

### Tested Classes

The endpoint is tested with these semantic classes:

- **Pothole** (red overlay)
- **Road crack** (yellow overlay)
- **Road** (blue overlay)

### Recent Test Results

**Last run**: November 23, 2025

- **Total images**: 8
- **Successful**: 8/8 (100%)
- **Failed**: 0
- **Average response time**: ~1.5 seconds per image
- **Status**: All API calls returning HTTP 200 with valid masks

Test images include:

- `pothole_pexels_01.jpg`, `pothole_pexels_02.jpg`
- `road_damage_01.jpg`
- `road_pexels_01.jpg`, `road_pexels_02.jpg`, `road_pexels_03.jpg`
- `road_unsplash_01.jpg`
- `test.jpg`

Results are stored in `.cache/test/inference/summary.json`.

### Adding More Test Images

Test
images should be placed in `assets/test_images/`. To expand the test suite:

1. **Download from Public Datasets**:
   - [Pothole Detection Dataset](https://github.com/jaygala24/pothole-detection/releases/download/v1.0.0/Pothole.Dataset.IVCNZ.zip) (1,243 images)
   - [RDD2022 Dataset](https://github.com/sekilab/RoadDamageDetector) (47,420 images from 6 countries)
   - [Roboflow Pothole Dataset](https://public.roboflow.com/object-detection/pothole/)
2. **Extract Sample Images**: Select diverse examples showing potholes, cracks, and clean roads
3. **Place in Test Directory**: Copy to `assets/test_images/`

---

## 2. Metrics Evaluation System

### Purpose

Comprehensive quantitative evaluation of SAM3 performance against ground truth annotations from CVAT.

### What It Measures

- **mAP (mean Average Precision)**: Detection accuracy across all confidence thresholds
- **mAR (mean Average Recall)**: Coverage of ground truth instances
- **IoU metrics**: Intersection over Union at multiple thresholds (0%, 25%, 50%, 75%)
- **Confusion matrices**: Class prediction accuracy patterns
- **Per-class statistics**: Precision, recall, F1-score for each damage type

### Running Metrics Evaluation

```bash
cd metrics_evaluation
python run_evaluation.py
```

**Options**:

```bash
# Force re-download from CVAT (ignore cache)
python run_evaluation.py --force-download

# Force re-run inference (ignore cached predictions)
python run_evaluation.py --force-inference

# Skip inference step (use existing predictions)
python run_evaluation.py --skip-inference

# Generate visual comparisons
python run_evaluation.py --visualize
```

### Dataset

Evaluates on **150 annotated images** from CVAT:

- **50 images** with "Fissure" (road cracks)
- **50 images** with "Nid de poule" (potholes)
- **50 images** with road surface

Source: Logiroad CVAT organization, AI training project

### Output Structure

```
.cache/test/metrics/
├── Fissure/
│   └── {image_name}/
│       ├── image.jpg
│       ├── ground_truth/
│       │   ├── mask_Fissure_0.png
│       │   └── metadata.json
│       └── inference/
│           ├── mask_Fissure_0.png
│           └── metadata.json
├── Nid de poule/
├── Road/
├── metrics_summary.txt    # Human-readable results
├── metrics_detailed.json  # Complete metrics data
└── evaluation_log.txt     # Execution trace
```

### Execution Time

- Image download: ~5-10 minutes (150 images)
- SAM3 inference: ~5-10 minutes (~2s per image)
- Metrics computation: ~1 minute
- **Total**: ~15-20 minutes for a full evaluation

### Configuration

Edit `metrics_evaluation/config/config.json` to:

- Change the CVAT project or organization
- Adjust the number of images per class
- Modify IoU thresholds
- Update the SAM3 endpoint URL

CVAT credentials must be in `.env` at the project root.

---

## Cache Directory

All test results are stored in `.cache/` (git-ignored), so you can:

- Review results without cluttering the repository
- Compare results across different test runs
- Debug segmentation quality issues
- Resume interrupted evaluations

---

## Quality Validation Checklist

Before accepting test results:

**Basic Tests**:

- [ ] All test images processed successfully
- [ ] Masks generated for all requested classes
- [ ] Response times reasonable (< 3s per image)
- [ ] Visualizations show plausible segmentations

**Metrics Evaluation**:

- [ ] 150 images downloaded from CVAT
- [ ] Ground truth masks not empty
- [ ] SAM3 inference completed for all images
- [ ] Metrics within reasonable ranges (0-100%)
- [ ] Confusion matrices show sensible patterns
- [ ] Per-class F1 scores above baseline

---

## Troubleshooting

### Basic Inference Issues

**Endpoint not responding**:

- Check the endpoint URL in the test script
- Verify the endpoint is running (use `curl` or a browser)
- Check network connectivity

**Empty or invalid masks**:

- Review that class names match model expectations
- Check the image format (should be JPEG/PNG)
- Verify base64 encoding/decoding

### Metrics Evaluation Issues

**CVAT connection fails**:

- Check `.env` credentials
- Verify the CVAT organization name
- Test CVAT web access

**No images found**:

- Check
project filter in `config.json`
- Verify labels exist in CVAT
- Ensure images have annotations

**Metrics seem incorrect**:

- Inspect confusion matrices
- Review sample visualizations
- Check ground truth quality in CVAT
- Verify mask format (PNG-L, 8-bit grayscale)

---

## Next Steps

1. **Run basic tests** to validate API connectivity
2. **Review visualizations** to assess segmentation quality
3. **Run metrics evaluation** for quantitative performance
4. **Analyze confusion matrices** to identify systematic errors
5. **Iterate on model/prompts** based on metrics feedback

For detailed metrics evaluation documentation, see `metrics_evaluation/README.md`.
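As a reference for interpreting the thresholded IoU numbers the evaluation reports, the metric can be sketched in a few lines of plain Python. This is an illustration only, not the evaluation pipeline's actual implementation: `mask_iou` and the toy masks below are hypothetical, standing in for binary masks decoded from the 8-bit grayscale mask PNGs.

```python
def mask_iou(pred, gt):
    """Intersection-over-Union between two same-shaped binary masks.

    `pred` and `gt` are 2D lists of 0/1 values, as you would get from
    an 8-bit grayscale mask PNG thresholded at any nonzero value.
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention: two empty masks are a perfect match.
    return inter / union if union else 1.0


# Toy 4x4 masks: the prediction hits 2 of the 3 ground-truth pixels
# and adds 1 false-positive pixel -> IoU = 2 / 4 = 0.5.
gt = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
iou = mask_iou(pred, gt)
print(iou)  # 0.5

# A prediction is counted at each evaluation threshold it clears:
for t in (0.0, 0.25, 0.5, 0.75):
    print(f"IoU >= {t:.2f}: {iou >= t}")
```

At the 0%, 25%, and 50% thresholds this toy prediction counts as a match; at 75% it does not, which is the mechanism behind the per-threshold breakdown in `metrics_detailed.json`.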