# SAM3 Testing Guide
## Overview
This guide covers two testing approaches for SAM3:
1. **Basic Inference Testing** - Quick API validation with sample images
2. **Metrics Evaluation** - Comprehensive performance analysis against CVAT ground truth
---
## 1. Basic Inference Testing
### Purpose
Quickly validate that the SAM3 endpoint is working and producing reasonable segmentation results.
### Test Infrastructure
The basic testing framework:
- Tests multiple images automatically
- Saves detailed JSON logs of requests and responses
- Generates visualizations with semi-transparent colored masks
- Stores all results in `.cache/test/inference/{image_name}/`
### Running Basic Tests
```bash
python3 scripts/test/test_inference_comprehensive.py
```
### Test Output Structure
For each test image, files are generated in `.cache/test/inference/{image_name}/`:
- `request.json` - Request metadata (timestamp, endpoint, classes)
- `response.json` - Response metadata (timestamp, status, results summary)
- `full_results.json` - Complete API response including base64 masks
- `original.jpg` - Original test image
- `visualization.png` - Original image with colored mask overlay
- `legend.png` - Legend showing class colors and coverage percentages
- `mask_{ClassName}.png` - Individual binary masks for each class
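The base64 masks in `full_results.json` can be decoded back into PNG bytes before saving or inspecting them. A minimal sketch, assuming the response holds a `masks` mapping from class name to base64 PNG (the real field names should be checked against the test script):

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def decode_mask(b64_png: str) -> bytes:
    """Decode a base64 mask string and sanity-check that it is a PNG."""
    raw = base64.b64decode(b64_png)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded mask is not a PNG")
    return raw

# Stand-in payload; the real full_results.json layout may differ.
fake_png = PNG_SIGNATURE + b"\x00" * 8
payload = {"masks": {"Pothole": base64.b64encode(fake_png).decode("ascii")}}

raw = decode_mask(payload["masks"]["Pothole"])
print(raw.startswith(PNG_SIGNATURE))  # True
```

The signature check catches the most common failure mode (double-encoded or truncated base64) before the bytes are written to `mask_{ClassName}.png`.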
### Tested Classes
The endpoint is tested with these semantic classes:
- **Pothole** (Red overlay)
- **Road crack** (Yellow overlay)
- **Road** (Blue overlay)
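These class names are sent to the endpoint as text prompts. A hypothetical request builder, purely for illustration (the field names `image` and `classes` are assumptions, not the endpoint's documented schema):

```python
import base64
import json

CLASSES = ["Pothole", "Road crack", "Road"]

def build_request(image_bytes: bytes, classes=CLASSES) -> str:
    """Build a JSON request body with a base64 image and class prompts."""
    return json.dumps({
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "classes": classes,
    })

# Truncated JPEG magic bytes as a stand-in for a real test image.
body = build_request(b"\xff\xd8\xff\xe0")
print(json.loads(body)["classes"])
```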
### Recent Test Results
**Last run**: November 23, 2025
- **Total images**: 8
- **Successful**: 8/8 (100%)
- **Failed**: 0
- **Average response time**: ~1.5 seconds per image
- **Status**: All API calls returning HTTP 200 with valid masks
Test images include:
- `pothole_pexels_01.jpg`, `pothole_pexels_02.jpg`
- `road_damage_01.jpg`
- `road_pexels_01.jpg`, `road_pexels_02.jpg`, `road_pexels_03.jpg`
- `road_unsplash_01.jpg`
- `test.jpg`
An aggregate summary of the run is stored in `.cache/test/inference/summary.json`
### Adding More Test Images
Test images should be placed in `assets/test_images/`. To expand the test suite:
1. **Download from Public Datasets**:
- [Pothole Detection Dataset](https://github.com/jaygala24/pothole-detection/releases/download/v1.0.0/Pothole.Dataset.IVCNZ.zip) (1,243 images)
- [RDD2022 Dataset](https://github.com/sekilab/RoadDamageDetector) (47,420 images from 6 countries)
- [Roboflow Pothole Dataset](https://public.roboflow.com/object-detection/pothole/)
2. **Extract Sample Images**: Select diverse examples showing potholes, cracks, and clean roads
3. **Place in Test Directory**: Copy to `assets/test_images/`
---
## 2. Metrics Evaluation System
### Purpose
Comprehensive quantitative evaluation of SAM3 performance against ground truth annotations from CVAT.
### What It Measures
- **mAP (mean Average Precision)**: Detection accuracy across all confidence thresholds
- **mAR (mean Average Recall)**: Coverage of ground truth instances
- **IoU metrics**: Intersection over Union at multiple thresholds (0%, 25%, 50%, 75%)
- **Confusion matrices**: Class prediction accuracy patterns
- **Per-class statistics**: Precision, recall, F1-score for each damage type
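As a reference for how the IoU thresholds are applied, here is a minimal IoU computation over flat binary masks:

```python
def iou(mask_a, mask_b):
    """Intersection over Union of two equal-length flat 0/1 masks."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 0.0

pred = [1, 1, 0, 0]
gt   = [1, 0, 1, 0]
score = iou(pred, gt)                # 1 pixel overlap / 3 pixels union
print(score >= 0.25, score >= 0.50)  # passes the 25% threshold, fails 50%
```

A prediction counts as a match at a given threshold when its IoU with a ground-truth instance meets that threshold; precision, recall, and hence mAP/mAR are derived from those matches.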
### Running Metrics Evaluation
```bash
cd metrics_evaluation
python run_evaluation.py
```
**Options**:
```bash
# Force re-download from CVAT (ignore cache)
python run_evaluation.py --force-download
# Force re-run inference (ignore cached predictions)
python run_evaluation.py --force-inference
# Skip inference step (use existing predictions)
python run_evaluation.py --skip-inference
# Generate visual comparisons
python run_evaluation.py --visualize
```
### Dataset
Evaluates on **150 annotated images** from CVAT:
- **50 images** with "Fissure" (road cracks)
- **50 images** with "Nid de poule" (potholes)
- **50 images** with road surface
Source: Logiroad CVAT organization, AI training project
### Output Structure
```
.cache/test/metrics/
├── Fissure/
│   └── {image_name}/
│       ├── image.jpg
│       ├── ground_truth/
│       │   ├── mask_Fissure_0.png
│       │   └── metadata.json
│       └── inference/
│           ├── mask_Fissure_0.png
│           └── metadata.json
├── Nid de poule/
├── Road/
├── metrics_summary.txt     # Human-readable results
├── metrics_detailed.json   # Complete metrics data
└── evaluation_log.txt      # Execution trace
```
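This layout makes interrupted runs easy to audit: every image directory should contain an `inference/` subfolder once inference has completed. A sketch of such a check (directory names follow the tree above; the pairing logic is an assumption, not part of the evaluation scripts):

```python
import os
import tempfile

def missing_inference(root):
    """List image dirs under each class dir that lack an inference/ subfolder."""
    missing = []
    for cls in sorted(os.listdir(root)):
        cls_dir = os.path.join(root, cls)
        if not os.path.isdir(cls_dir):
            continue  # skip summary files at the top level
        for img in sorted(os.listdir(cls_dir)):
            img_dir = os.path.join(cls_dir, img)
            if os.path.isdir(img_dir) and not os.path.isdir(
                os.path.join(img_dir, "inference")
            ):
                missing.append(f"{cls}/{img}")
    return missing

# Demo on a synthetic tree mirroring the layout above.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "Fissure", "img_001", "inference"))
os.makedirs(os.path.join(root, "Fissure", "img_002"))  # inference missing
print(missing_inference(root))  # ['Fissure/img_002']
```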
### Execution Time
- Image download: ~5-10 minutes (150 images)
- SAM3 inference: ~5-10 minutes (~2s per image)
- Metrics computation: ~1 minute
- **Total**: ~15-20 minutes for full evaluation
### Configuration
Edit `metrics_evaluation/config/config.json` to:
- Change CVAT project or organization
- Adjust number of images per class
- Modify IoU thresholds
- Update SAM3 endpoint URL
CVAT credentials must be in `.env` at project root.
---
## Cache Directory
All test results are stored in `.cache/` (git-ignored):
- Review results without cluttering the repository
- Compare results across different test runs
- Debug segmentation quality issues
- Resume interrupted evaluations
---
## Quality Validation Checklist
Before accepting test results:
**Basic Tests**:
- [ ] All test images processed successfully
- [ ] Masks generated for all requested classes
- [ ] Response times reasonable (< 3s per image)
- [ ] Visualizations show plausible segmentations
**Metrics Evaluation**:
- [ ] 150 images downloaded from CVAT
- [ ] Ground truth masks not empty
- [ ] SAM3 inference completed for all images
- [ ] Metrics within reasonable ranges (0-100%)
- [ ] Confusion matrices show sensible patterns
- [ ] Per-class F1 scores above baseline
---
## Troubleshooting
### Basic Inference Issues
**Endpoint not responding**:
- Check endpoint URL in test script
- Verify endpoint is running (use `curl` or browser)
- Check network connectivity
**Empty or invalid masks**:
- Review class names match model expectations
- Check image format (should be JPEG/PNG)
- Verify base64 encoding/decoding
### Metrics Evaluation Issues
**CVAT connection fails**:
- Check `.env` credentials
- Verify CVAT organization name
- Test CVAT web access
**No images found**:
- Check project filter in `config.json`
- Verify labels exist in CVAT
- Ensure images have annotations
**Metrics seem incorrect**:
- Inspect confusion matrices
- Review sample visualizations
- Check ground truth quality in CVAT
- Verify mask format (PNG-L, 8-bit grayscale)
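The mask format can be checked without extra dependencies by reading the PNG IHDR chunk: colour type 0 with bit depth 8 corresponds to 8-bit grayscale (Pillow's `L` mode). A minimal sketch:

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def is_grayscale_8bit(raw: bytes) -> bool:
    """True if the PNG bytes declare 8-bit grayscale (colour type 0)."""
    if raw[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG")
    # IHDR is always the first chunk: 4-byte length, b"IHDR", then
    # width (4), height (4), bit depth (1), colour type (1), ...
    _w, _h, depth, ctype = struct.unpack(">IIBB", raw[16:26])
    return depth == 8 and ctype == 0

# Minimal synthetic PNG header for demonstration (not a complete file).
ihdr = struct.pack(">IIBBBBB", 4, 4, 8, 0, 0, 0, 0)
raw = (PNG_SIGNATURE
       + struct.pack(">I", 13) + b"IHDR" + ihdr
       + struct.pack(">I", zlib.crc32(b"IHDR" + ihdr)))
print(is_grayscale_8bit(raw))  # True
```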
---
## Next Steps
1. **Run basic tests** to validate API connectivity
2. **Review visualizations** to assess segmentation quality
3. **Run metrics evaluation** for quantitative performance
4. **Analyze confusion matrices** to identify systematic errors
5. **Iterate on model/prompts** based on metrics feedback
For detailed metrics evaluation documentation, see `metrics_evaluation/README.md`.