# SAM3 Testing Guide

## Overview

This guide covers two testing approaches for SAM3:

1. **Basic Inference Testing** - Quick API validation with sample images
2. **Metrics Evaluation** - Comprehensive performance analysis against CVAT ground truth

---

## 1. Basic Inference Testing

### Purpose

Quickly validate that the SAM3 endpoint is working and producing reasonable segmentation results.

### Test Infrastructure

The basic testing framework:

- Tests multiple images automatically
- Saves detailed JSON logs of requests and responses
- Generates visualizations with semi-transparent colored masks
- Stores all results in `.cache/test/inference/{image_name}/`

### Running Basic Tests

```bash
python3 scripts/test/test_inference_comprehensive.py
```

### Test Output Structure

For each test image, the following files are generated in `.cache/test/inference/{image_name}/`:

- `request.json` - Request metadata (timestamp, endpoint, classes)
- `response.json` - Response metadata (timestamp, status, results summary)
- `full_results.json` - Complete API response, including base64-encoded masks
- `original.jpg` - Original test image
- `visualization.png` - Original image with colored mask overlay
- `legend.png` - Legend showing class colors and coverage percentages
- `mask_{ClassName}.png` - Individual binary mask for each class
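As a sanity check, the base64 masks stored in `full_results.json` can be decoded back into images. A minimal sketch using Pillow (`decode_mask` is an illustrative helper; the exact field layout of the response is deployment-specific):

```python
import base64
import io

from PIL import Image

def decode_mask(mask_b64: str) -> Image.Image:
    """Decode a base64-encoded PNG mask string into a PIL image."""
    return Image.open(io.BytesIO(base64.b64decode(mask_b64)))

# Round-trip demo with a tiny synthetic 4x4 mask:
buf = io.BytesIO()
Image.new("L", (4, 4), 255).save(buf, format="PNG")
mask = decode_mask(base64.b64encode(buf.getvalue()).decode("ascii"))
print(mask.mode, mask.size)  # L (4, 4)
```

The same round-trip is useful when debugging empty or invalid masks, since it separates transport problems (bad base64) from model problems (empty predictions).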
### Tested Classes

The endpoint is tested with these semantic classes:

- **Pothole** (red overlay)
- **Road crack** (yellow overlay)
- **Road** (blue overlay)
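The colored overlays can be reproduced with a simple alpha blend; a sketch using Pillow (the actual test script may implement this differently):

```python
from PIL import Image

# Overlay colors matching the classes above
CLASS_COLORS = {"Pothole": (255, 0, 0), "Road crack": (255, 255, 0), "Road": (0, 0, 255)}

def overlay_mask(image: Image.Image, mask: Image.Image,
                 color: tuple, alpha: float = 0.5) -> Image.Image:
    """Blend a solid color onto `image` wherever `mask` is set."""
    base = image.convert("RGB")
    tinted = Image.blend(base, Image.new("RGB", base.size, color), alpha)
    # Where the mask is 255, take the tinted pixel; elsewhere keep the original.
    return Image.composite(tinted, base, mask.convert("L"))

# Demo on a black image with a single masked pixel:
img = Image.new("RGB", (4, 4), (0, 0, 0))
msk = Image.new("L", (4, 4), 0)
msk.putpixel((0, 0), 255)
out = overlay_mask(img, msk, CLASS_COLORS["Pothole"])
```

One overlay per class, drawn in sequence, yields the composite `visualization.png` described above.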
### Recent Test Results

**Last run**: November 23, 2025

- **Total images**: 8
- **Successful**: 8/8 (100%)
- **Failed**: 0
- **Average response time**: ~1.5 seconds per image
- **Status**: all API calls returned HTTP 200 with valid masks

Test images include:

- `pothole_pexels_01.jpg`, `pothole_pexels_02.jpg`
- `road_damage_01.jpg`
- `road_pexels_01.jpg`, `road_pexels_02.jpg`, `road_pexels_03.jpg`
- `road_unsplash_01.jpg`
- `test.jpg`

A run summary is stored in `.cache/test/inference/summary.json`.
### Adding More Test Images

Test images should be placed in `assets/test_images/`. To expand the test suite:

1. **Download from Public Datasets**:
   - [Pothole Detection Dataset](https://github.com/jaygala24/pothole-detection/releases/download/v1.0.0/Pothole.Dataset.IVCNZ.zip) (1,243 images)
   - [RDD2022 Dataset](https://github.com/sekilab/RoadDamageDetector) (47,420 images from 6 countries)
   - [Roboflow Pothole Dataset](https://public.roboflow.com/object-detection/pothole/)
2. **Extract Sample Images**: select diverse examples showing potholes, cracks, and clean roads
3. **Place in Test Directory**: copy to `assets/test_images/`

---
## 2. Metrics Evaluation System

### Purpose

Quantitatively evaluate SAM3 performance against ground truth annotations from CVAT.

### What It Measures

- **mAP (mean Average Precision)**: detection accuracy across all confidence thresholds
- **mAR (mean Average Recall)**: coverage of ground truth instances
- **IoU metrics**: Intersection over Union at multiple thresholds (0%, 25%, 50%, 75%)
- **Confusion matrices**: class prediction accuracy patterns
- **Per-class statistics**: precision, recall, and F1-score for each damage type
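For reference, the per-mask IoU underlying these metrics can be sketched as follows (a minimal NumPy version, not the evaluation code itself):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two binary masks."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty: define IoU as 0
    return float(np.logical_and(pred, gt).sum() / union)

pred = np.zeros((4, 4), dtype=bool)
pred[:2, :] = True   # top two rows (8 px)
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, :] = True    # middle two rows (8 px)
print(round(mask_iou(pred, gt), 3))  # 4 / 12 -> 0.333
```

A prediction counts as a true positive at a given threshold when its IoU with a ground truth mask meets or exceeds that threshold; precision, recall, and the averaged mAP/mAR follow from those counts.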
### Running Metrics Evaluation

```bash
cd metrics_evaluation
python run_evaluation.py
```

**Options**:

```bash
# Force re-download from CVAT (ignore cache)
python run_evaluation.py --force-download

# Force re-run inference (ignore cached predictions)
python run_evaluation.py --force-inference

# Skip inference step (use existing predictions)
python run_evaluation.py --skip-inference

# Generate visual comparisons
python run_evaluation.py --visualize
```
### Dataset

Evaluates on **150 annotated images** from CVAT:

- **50 images** with "Fissure" (road cracks)
- **50 images** with "Nid de poule" (potholes)
- **50 images** with road surface

Source: Logiroad CVAT organization, AI training project
### Output Structure

```
.cache/test/metrics/
├── Fissure/
│   └── {image_name}/
│       ├── image.jpg
│       ├── ground_truth/
│       │   ├── mask_Fissure_0.png
│       │   └── metadata.json
│       └── inference/
│           ├── mask_Fissure_0.png
│           └── metadata.json
├── Nid de poule/
├── Road/
├── metrics_summary.txt    # Human-readable results
├── metrics_detailed.json  # Complete metrics data
└── evaluation_log.txt     # Execution trace
```
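Downstream scripts can pair ground-truth and inference masks by walking this tree; a sketch with `pathlib` (`collect_mask_pairs` is a hypothetical helper, not part of the evaluation code):

```python
import tempfile
from pathlib import Path

def collect_mask_pairs(root: Path):
    """Yield (ground_truth, inference) mask path pairs from the metrics tree."""
    for gt in sorted(root.glob("*/*/ground_truth/mask_*.png")):
        pred = gt.parents[1] / "inference" / gt.name
        if pred.exists():  # skip images the model produced no mask for
            yield gt, pred

# Demo on a throwaway copy of the directory layout:
root = Path(tempfile.mkdtemp())
for sub in ("ground_truth", "inference"):
    d = root / "Fissure" / "img_001" / sub
    d.mkdir(parents=True)
    (d / "mask_Fissure_0.png").touch()
pairs = list(collect_mask_pairs(root))
print(len(pairs))  # 1
```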
### Execution Time

- Image download: ~5-10 minutes (150 images)
- SAM3 inference: ~5-10 minutes (~2 s per image)
- Metrics computation: ~1 minute
- **Total**: ~15-20 minutes for a full evaluation

### Configuration

Edit `metrics_evaluation/config/config.json` to:

- Change the CVAT project or organization
- Adjust the number of images per class
- Modify IoU thresholds
- Update the SAM3 endpoint URL

CVAT credentials must be in `.env` at the project root.
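A hypothetical shape for `config.json`, illustrating the tunables listed above (key names and the endpoint URL are placeholders; check the actual file for the real schema):

```json
{
  "cvat": {
    "organization": "Logiroad",
    "project": "AI training"
  },
  "images_per_class": 50,
  "iou_thresholds": [0.0, 0.25, 0.5, 0.75],
  "sam3_endpoint_url": "http://localhost:8000/predict"
}
```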
---

## Cache Directory

All test results are stored in `.cache/` (git-ignored), which lets you:

- Review results without cluttering the repository
- Compare results across different test runs
- Debug segmentation quality issues
- Resume interrupted evaluations

---
## Quality Validation Checklist

Before accepting test results:

**Basic Tests**:

- [ ] All test images processed successfully
- [ ] Masks generated for all requested classes
- [ ] Response times reasonable (< 3 s per image)
- [ ] Visualizations show plausible segmentations

**Metrics Evaluation**:

- [ ] 150 images downloaded from CVAT
- [ ] Ground truth masks are not empty
- [ ] SAM3 inference completed for all images
- [ ] Metrics within reasonable ranges (0-100%)
- [ ] Confusion matrices show sensible patterns
- [ ] Per-class F1 scores above baseline
---

## Troubleshooting

### Basic Inference Issues

**Endpoint not responding**:

- Check the endpoint URL in the test script
- Verify the endpoint is running (use `curl` or a browser)
- Check network connectivity

**Empty or invalid masks**:

- Verify class names match model expectations
- Check the image format (should be JPEG/PNG)
- Verify base64 encoding/decoding

### Metrics Evaluation Issues

**CVAT connection fails**:

- Check `.env` credentials
- Verify the CVAT organization name
- Test CVAT web access

**No images found**:

- Check the project filter in `config.json`
- Verify the labels exist in CVAT
- Ensure the images have annotations

**Metrics seem incorrect**:

- Inspect the confusion matrices
- Review sample visualizations
- Check ground truth quality in CVAT
- Verify the mask format (PNG mode "L", 8-bit grayscale)
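The expected mask format can be verified programmatically; a small Pillow sketch (`check_mask` is an illustrative helper):

```python
import io

from PIL import Image

def check_mask(data: bytes) -> None:
    """Assert a mask payload is 8-bit grayscale (mode 'L') and binary."""
    img = Image.open(io.BytesIO(data))
    assert img.mode == "L", f"expected mode 'L', got {img.mode}"
    assert set(img.getdata()) <= {0, 255}, "mask contains non-binary values"

# Demo: an all-background grayscale mask passes the check.
buf = io.BytesIO()
Image.new("L", (4, 4), 0).save(buf, format="PNG")
check_mask(buf.getvalue())
print("ok")
```

Running this over a handful of `mask_*.png` files quickly rules out format mismatches as the source of bad metrics.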
---

## Next Steps

1. **Run basic tests** to validate API connectivity
2. **Review visualizations** to assess segmentation quality
3. **Run metrics evaluation** for quantitative performance numbers
4. **Analyze confusion matrices** to identify systematic errors
5. **Iterate on the model/prompts** based on metrics feedback

For detailed metrics evaluation documentation, see `metrics_evaluation/README.md`.