SAM3 Metrics Evaluation
Comprehensive evaluation system to measure SAM3 model performance against ground truth annotations from CVAT.
Overview
This subproject evaluates the SAM3 semantic segmentation endpoint by:
- Extracting annotated images from CVAT (ground truth)
- Running SAM3 inference on the same images
- Computing standard segmentation metrics (mAP, mAR, IoU, confusion matrices)
- Generating detailed reports and visualizations
Purpose
Provide quantitative metrics to:
- Measure SAM3 detection accuracy for road damage
- Identify systematic errors and biases
- Compare performance across different damage types
- Guide model improvement efforts
- Track performance over time
Dataset
Extracts 150 annotated images from CVAT:
- 50 images with "Fissure" (road cracks)
- 50 images with "Nid de poule" (potholes)
- 50 images with road surface (any annotated image)
Source: Logiroad organization CVAT project designated for AI training.
Metrics Computed
Instance-Level Metrics
- mAP (mean Average Precision): Detection accuracy across all confidence thresholds
- mAR (mean Average Recall): Coverage of ground truth instances
- Instance counts at IoU thresholds: Number of true positives, false positives, false negatives at 0%, 25%, 50%, 75% IoU
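All of these metrics reduce to pairwise mask IoU between a ground truth instance and a predicted instance. A minimal sketch (function name is illustrative, not the pipeline's actual API):

```python
import numpy as np

def mask_iou(gt: np.ndarray, pred: np.ndarray) -> float:
    """Intersection-over-Union between two binary instance masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    union = np.logical_or(gt, pred).sum()
    if union == 0:
        return 0.0  # both masks empty: define IoU as 0
    return float(np.logical_and(gt, pred).sum() / union)
```

An IoU of 0.0 means no overlap; 1.0 means the masks are identical. The 0%, 25%, 50%, 75% thresholds above decide when an overlap counts as a true positive.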
Confusion Matrices
Generated at four IoU thresholds (0%, 25%, 50%, 75%):
- Rows: Ground truth classes
- Columns: Predicted classes
- Shows class confusion patterns
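Given instance pairs matched at one IoU threshold, the matrix can be built by counting (ground truth class, predicted class) pairs. A hedged sketch (input shape and helper name are assumptions):

```python
import numpy as np

def confusion_matrix(gt_labels, pred_labels, class_names):
    """Rows: ground-truth classes; columns: predicted classes.
    Inputs are the class names of instance pairs matched at a
    given IoU threshold."""
    idx = {name: i for i, name in enumerate(class_names)}
    cm = np.zeros((len(class_names), len(class_names)), dtype=int)
    for g, p in zip(gt_labels, pred_labels):
        cm[idx[g], idx[p]] += 1
    return cm
```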
Per-Class Statistics
- Detection rate (% of ground truth instances found)
- Precision (% of predictions that are correct)
- Recall (% of ground truth that is detected)
- F1-score (harmonic mean of precision and recall)
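These follow the standard definitions from true/false positive and false negative counts; for example (illustrative helper, not the pipeline's API):

```python
def per_class_stats(tp: int, fp: int, fn: int) -> dict:
    """Standard precision / recall / F1 from instance counts,
    guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```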
Output Structure
.cache/test/metrics/
├── Fissure/
│   └── {image_name}/
│       ├── image.jpg                 # Original image
│       ├── ground_truth/
│       │   ├── mask_Fissure_0.png    # Ground truth instance masks (PNG-L)
│       │   ├── mask_Fissure_1.png
│       │   └── metadata.json         # Mask metadata
│       └── inference/
│           ├── mask_Fissure_0.png    # SAM3 predicted masks (PNG-L)
│           ├── mask_Road_0.png
│           └── metadata.json
├── Nid de poule/
│   └── ...
├── Road/
│   └── ...
├── metrics_summary.txt               # Human-readable summary
├── metrics_detailed.json             # Complete metrics data
└── evaluation_log.txt                # Execution log
Configuration
All parameters are configured in config/config.json:
{
  "cvat": {
    "url": "https://app.cvat.ai",
    "organization": "Logiroad",
    "project_filter": "training"
  },
  "classes": {
    "Fissure": 50,
    "Nid de poule": 50,
    "Road": 50
  },
  "sam3": {
    "endpoint": "https://p6irm2x7y9mwp4l4.us-east-1.aws.endpoints.huggingface.cloud"
  },
  "metrics": {
    "iou_thresholds": [0.0, 0.25, 0.5, 0.75]
  },
  "output": {
    "cache_dir": ".cache/test/metrics"
  }
}
CVAT credentials loaded from .env file at project root.
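Loading and validating this file might look like the following minimal sketch; the actual loader may differ (for instance, it may use pydantic models, given the dependency list below):

```python
import json

# Top-level sections the pipeline expects (taken from the config above)
REQUIRED_SECTIONS = {"cvat", "classes", "sam3", "metrics", "output"}

def load_config(path: str = "config/config.json") -> dict:
    """Load config.json and fail loudly if a top-level section is missing."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    missing = REQUIRED_SECTIONS - cfg.keys()
    if missing:
        raise KeyError(f"config missing sections: {sorted(missing)}")
    return cfg
```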
Usage
Quick Start
cd metrics_evaluation
python run_evaluation.py
Advanced Options
# Use custom config file
python run_evaluation.py --config my_config.json
# Force re-download images (ignore cache)
python run_evaluation.py --force-download
# Force re-run inference (ignore cached results)
python run_evaluation.py --force-inference
# Skip inference (use only cached results)
python run_evaluation.py --skip-inference
# Generate visualization images
python run_evaluation.py --visualize
Pipeline Stages
1. CVAT Data Extraction
- Connects to CVAT API
- Finds AI training project
- Discovers images with target labels
- Downloads images (checks cache first)
- Extracts ground truth masks from CVAT RLE format
- Saves masks as PNG-L format
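The decode-and-save step could be sketched as below. Note that CVAT's actual RLE layout may differ from this simple alternating-run encoding, so treat the decoder as illustrative only:

```python
import numpy as np
from PIL import Image

def rle_to_mask(counts: list[int], width: int, height: int) -> np.ndarray:
    """Decode a flat run-length encoding (alternating background/
    foreground runs, row-major) into a binary uint8 mask."""
    flat = np.zeros(width * height, dtype=np.uint8)
    foreground, pos = False, 0
    for run in counts:
        if foreground:
            flat[pos:pos + run] = 255
        pos += run
        foreground = not foreground
    return flat.reshape(height, width)

def save_png_l(mask: np.ndarray, path: str) -> None:
    """Save a mask in PNG-L (8-bit grayscale) format."""
    Image.fromarray(mask, mode="L").save(path)
```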
2. SAM3 Inference
- Loads cached images
- Calls SAM3 endpoint for each image
- Converts base64 masks to PNG-L format
- Saves predicted masks with same structure as ground truth
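Converting a base64-encoded PNG mask from the endpoint response into a grayscale array might look like this (the response carrying base64 PNGs is stated above, but the exact field layout is an assumption):

```python
import base64
import io
import numpy as np
from PIL import Image

def decode_base64_mask(b64_png: str) -> np.ndarray:
    """Decode a base64-encoded PNG mask into an 8-bit grayscale array,
    ready to be saved alongside the ground truth masks."""
    raw = base64.b64decode(b64_png)
    return np.array(Image.open(io.BytesIO(raw)).convert("L"))
```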
3. Metrics Calculation
- Matches predicted instances to ground truth (Hungarian algorithm)
- Computes mAP, mAR at multiple IoU thresholds
- Generates confusion matrices
- Calculates per-class and overall statistics
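The Hungarian matching step can be sketched with scipy's linear_sum_assignment; function and variable names here are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(iou_matrix: np.ndarray, iou_threshold: float = 0.5):
    """Match predictions (columns) to ground-truth instances (rows) by
    maximizing total IoU, then keep only pairs above the threshold."""
    rows, cols = linear_sum_assignment(-iou_matrix)  # negate to maximize
    matches = [(int(r), int(c)) for r, c in zip(rows, cols)
               if iou_matrix[r, c] >= iou_threshold]
    tp = len(matches)
    fn = iou_matrix.shape[0] - tp  # unmatched ground truth
    fp = iou_matrix.shape[1] - tp  # unmatched predictions
    return matches, tp, fp, fn
```

Running this once per IoU threshold in the config yields the instance counts reported above.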
4. Report Generation
- Creates human-readable summary (metrics_summary.txt)
- Saves detailed JSON with all metrics (metrics_detailed.json)
- Logs complete execution trace (evaluation_log.txt)
- Optionally generates visual comparisons
Dependencies
Install required packages:
pip install opencv-python numpy requests pydantic pillow scipy python-dotenv
Resumability
Pipeline is designed to be resumable:
- Checks cache before downloading images
- Skips inference if results already exist
- Can run metrics on existing data
Use --force-download or --force-inference to override cache.
Error Handling
The pipeline follows a "fail fast, fail loud" principle for critical failures, while recovering from non-critical ones:
- Clear error messages with context
- Validation at each stage
- Graceful handling of missing data
- Continues processing on non-critical errors
- Detailed logging for debugging
Expected Execution Time
- Image download: ~5-10 minutes (150 images, network dependent)
- SAM3 inference: ~5-10 minutes (150 images, ~2s per image)
- Metrics computation: ~1 minute
- Total: ~15-20 minutes for full evaluation
Quality Checks
Before accepting results:
- Verify 150 images downloaded
- Check ground truth masks are not empty
- Verify SAM3 inference completed for all images
- Review metrics are within reasonable ranges (0-100%)
- Inspect sample visualizations
- Check for systematic errors in confusion matrices
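Some of these checks can be automated. A minimal sketch assuming the output layout above (the function name, expected counts, and exact checks are illustrative):

```python
from pathlib import Path

def sanity_check(root: Path, expected_images: int = 150) -> list[str]:
    """Return a list of problems to review before trusting the metrics."""
    problems = []
    images = list(root.rglob("image.jpg"))
    if len(images) != expected_images:
        problems.append(
            f"expected {expected_images} images, found {len(images)}")
    for img in images:
        # Each image directory should contain SAM3 inference results
        if not (img.parent / "inference" / "metadata.json").exists():
            problems.append(f"missing inference results for {img.parent.name}")
    return problems
```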
Limitations
- Only evaluates on images present in CVAT training dataset
- Performance may not generalize to unseen road conditions
- Metrics depend on ground truth annotation quality
- Instance matching uses simple IoU (no advanced matching algorithms)
Future Improvements
- Add support for more label types
- Implement advanced instance matching (deformable IoU, boundary IoU)
- Add temporal consistency metrics for video sequences
- Generate interactive HTML report with visualizations
- Support for additional segmentation backends (local models, other APIs)
Troubleshooting
CVAT connection fails:
- Check .env file has correct credentials
- Verify CVAT is accessible from your network
- Check organization name matches
No images found:
- Verify project name filter in config.json
- Check labels exist in CVAT project
- Ensure images have annotations
SAM3 inference errors:
- Check endpoint URL in config.json
- Verify endpoint is running (test with curl)
- Check network connectivity
- Review endpoint logs
Metrics seem wrong:
- Check ground truth masks are valid
- Verify predicted masks are in correct format
- Review confusion matrices for patterns
- Inspect sample images visually
Contact
For issues or questions about this evaluation system, contact the Logiroad AI team.
License
Internal Logiroad project. Not for public distribution.