# SAM3 Metrics Evaluation

Comprehensive evaluation system to measure SAM3 model performance against ground truth annotations from CVAT.

## Overview

This subproject evaluates the SAM3 semantic segmentation endpoint by:

1. Extracting annotated images from CVAT (ground truth)
2. Running SAM3 inference on the same images
3. Computing standard segmentation metrics (mAP, mAR, IoU, confusion matrices)
4. Generating detailed reports and visualizations

## Purpose

Provide quantitative metrics to:

- Measure SAM3 detection accuracy for road damage
- Identify systematic errors and biases
- Compare performance across different damage types
- Guide model improvement efforts
- Track performance over time

## Dataset

Extracts 150 annotated images from CVAT:

- **50 images** with "Fissure" (road cracks)
- **50 images** with "Nid de poule" (potholes)
- **50 images** with road surface (any annotated image)

Source: the Logiroad organization's CVAT project designated for AI training.

## Metrics Computed

### Instance-Level Metrics

- **mAP (mean Average Precision)**: detection accuracy across all confidence thresholds
- **mAR (mean Average Recall)**: coverage of ground truth instances
- **Instance counts at IoU thresholds**: number of true positives, false positives, and false negatives at 0%, 25%, 50%, and 75% IoU

### Confusion Matrices

Generated at four IoU thresholds (0%, 25%, 50%, 75%):

- Rows: ground truth classes
- Columns: predicted classes
- Shows class confusion patterns

### Per-Class Statistics

- Detection rate (% of ground truth instances found)
- Precision (% of predictions that are correct)
- Recall (% of ground truth instances that are detected)
- F1-score (harmonic mean of precision and recall)

## Output Structure

```
.cache/test/metrics/
├── Fissure/
│   └── {image_name}/
│       ├── image.jpg                # Original image
│       ├── ground_truth/
│       │   ├── mask_Fissure_0.png   # Ground truth instance masks (PNG-L)
│       │   ├── mask_Fissure_1.png
│       │   └── metadata.json        # Mask metadata
│       └── inference/
│           ├── mask_Fissure_0.png   # SAM3 predicted masks (PNG-L)
│           ├── mask_Road_0.png
│           └── metadata.json
├── Nid de poule/
│   └── ...
├── Road/
│   └── ...
├── metrics_summary.txt              # Human-readable summary
├── metrics_detailed.json            # Complete metrics data
└── evaluation_log.txt               # Execution log
```

## Configuration

All parameters are configured in `config/config.json`:

```json
{
  "cvat": {
    "url": "https://app.cvat.ai",
    "organization": "Logiroad",
    "project_filter": "training"
  },
  "classes": {
    "Fissure": 50,
    "Nid de poule": 50,
    "Road": 50
  },
  "sam3": {
    "endpoint": "https://p6irm2x7y9mwp4l4.us-east-1.aws.endpoints.huggingface.cloud"
  },
  "metrics": {
    "iou_thresholds": [0.0, 0.25, 0.5, 0.75]
  },
  "output": {
    "cache_dir": ".cache/test/metrics"
  }
}
```

CVAT credentials are loaded from the `.env` file at the project root.

## Usage

### Quick Start

```bash
cd metrics_evaluation
python run_evaluation.py
```

### Advanced Options

```bash
# Use a custom config file
python run_evaluation.py --config my_config.json

# Force re-download of images (ignore cache)
python run_evaluation.py --force-download

# Force re-run of inference (ignore cached results)
python run_evaluation.py --force-inference

# Skip inference (use only cached results)
python run_evaluation.py --skip-inference

# Generate visualization images
python run_evaluation.py --visualize
```

## Pipeline Stages

### 1. CVAT Data Extraction

- Connects to the CVAT API
- Finds the AI training project
- Discovers images with target labels
- Downloads images (checks cache first)
- Extracts ground truth masks from CVAT's RLE format
- Saves masks in PNG-L format

### 2. SAM3 Inference

- Loads cached images
- Calls the SAM3 endpoint for each image
- Converts base64 masks to PNG-L format
- Saves predicted masks in the same structure as ground truth

### 3. Metrics Calculation

- Matches predicted instances to ground truth (Hungarian algorithm)
- Computes mAP and mAR at multiple IoU thresholds
- Generates confusion matrices
- Calculates per-class and overall statistics

### 4. Report Generation

- Creates a human-readable summary (`metrics_summary.txt`)
- Saves detailed JSON with all metrics (`metrics_detailed.json`)
- Logs the complete execution trace (`evaluation_log.txt`)
- Optionally generates visual comparisons

## Dependencies

Install required packages:

```bash
pip install opencv-python numpy requests pydantic pillow scipy python-dotenv
```

## Resumability

The pipeline is designed to be resumable:

- Checks the cache before downloading images
- Skips inference if results already exist
- Can compute metrics on existing data

Use `--force-download` or `--force-inference` to override the cache.

## Error Handling

The pipeline follows the "fail fast, fail loud" principle for critical failures:

- Clear error messages with context
- Validation at each stage
- Graceful handling of missing data
- Continues processing on non-critical errors
- Detailed logging for debugging

## Expected Execution Time

- Image download: ~5-10 minutes (150 images, network dependent)
- SAM3 inference: ~5-10 minutes (150 images, ~2 s per image)
- Metrics computation: ~1 minute
- **Total**: ~15-20 minutes for a full evaluation

## Quality Checks

Before accepting results:

1. Verify that 150 images were downloaded
2. Check that ground truth masks are not empty
3. Verify that SAM3 inference completed for all images
4. Review that metrics fall within reasonable ranges (0-100%)
5. Inspect sample visualizations
6. Check for systematic errors in the confusion matrices

## Limitations

- Only evaluates images present in the CVAT training dataset
- Performance may not generalize to unseen road conditions
- Metrics depend on ground truth annotation quality
- Instance matching uses plain mask IoU (no advanced matching algorithms)

## Future Improvements

- Add support for more label types
- Implement advanced instance matching (deformable IoU, boundary IoU)
- Add temporal consistency metrics for video sequences
- Generate an interactive HTML report with visualizations
- Support additional segmentation backends (local models, other APIs)

## Troubleshooting

**CVAT connection fails:**

- Check that the `.env` file has the correct credentials
- Verify that CVAT is accessible from your network
- Check that the organization name matches

**No images found:**

- Verify the project name filter in `config.json`
- Check that the labels exist in the CVAT project
- Ensure the images have annotations

**SAM3 inference errors:**

- Check the endpoint URL in `config.json`
- Verify that the endpoint is running (test with curl)
- Check network connectivity
- Review the endpoint logs

**Metrics seem wrong:**

- Check that the ground truth masks are valid
- Verify that the predicted masks are in the correct format
- Review the confusion matrices for patterns
- Inspect sample images visually

## Contact

For issues or questions about this evaluation system, contact the Logiroad AI team.

## License

Internal Logiroad project. Not for public distribution.
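The inference stage converts base64 masks returned by the SAM3 endpoint into PNG-L (8-bit grayscale) files. A minimal sketch of that conversion, assuming the endpoint returns each mask as a base64-encoded image; `decode_mask_b64` is an illustrative helper name, not the pipeline's actual API:

```python
import base64
import io

from PIL import Image


def decode_mask_b64(mask_b64: str) -> Image.Image:
    """Decode a base64-encoded mask image and convert it to PNG-L
    (8-bit grayscale), matching the format of the ground truth masks."""
    raw = base64.b64decode(mask_b64)
    return Image.open(io.BytesIO(raw)).convert("L")
```

Converting to mode `"L"` normalizes whatever the endpoint returns (RGB, RGBA, or 1-bit) into a single-channel mask that can be compared pixel-for-pixel with the ground truth PNGs.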
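The instance matching in the metrics stage (Hungarian algorithm over mask IoU) can be sketched as follows. This is an illustrative implementation assuming binary NumPy masks, not the pipeline's actual code; `scipy.optimize.linear_sum_assignment` solves the assignment problem:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def match_instances(gt_masks, pred_masks, iou_threshold=0.5):
    """One-to-one matching of predictions to ground truth instances
    that maximizes total IoU, keeping only pairs at or above the
    threshold. Returns (matches, n_false_positives, n_false_negatives)."""
    if not gt_masks or not pred_masks:
        return [], len(pred_masks), len(gt_masks)
    iou = np.array([[mask_iou(g, p) for p in pred_masks] for g in gt_masks])
    rows, cols = linear_sum_assignment(-iou)  # negate to maximize IoU
    matches = [(r, c, iou[r, c]) for r, c in zip(rows, cols)
               if iou[r, c] >= iou_threshold]
    n_fp = len(pred_masks) - len(matches)  # unmatched predictions
    n_fn = len(gt_masks) - len(matches)    # unmatched ground truth
    return matches, n_fp, n_fn
```

Running the same matching at each configured threshold (0.0, 0.25, 0.5, 0.75) yields the per-threshold instance counts reported in the summary.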
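Once instances are matched at a given IoU threshold, the confusion matrix is a count over the matched (ground truth class, predicted class) pairs. A sketch, with `confusion_matrix` as a hypothetical helper:

```python
import numpy as np


def confusion_matrix(matched_pairs, classes):
    """Confusion matrix over matched (gt_class, pred_class) pairs:
    rows are ground truth classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    m = np.zeros((len(classes), len(classes)), dtype=int)
    for gt_cls, pred_cls in matched_pairs:
        m[index[gt_cls], index[pred_cls]] += 1
    return m
```

Off-diagonal cells reveal systematic class confusion, e.g. "Fissure" instances matched by a "Road" prediction.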
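The per-class statistics reduce to precision, recall, and F1 over instance counts (true positives, false positives, false negatives) at a chosen IoU threshold; recall here is the same quantity as the detection rate. A sketch, with `per_class_stats` as an illustrative helper:

```python
def per_class_stats(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from instance counts for one class
    at a fixed IoU threshold. Guards against division by zero when a
    class has no predictions or no ground truth instances."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```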