| # SAM3 Metrics Evaluation | |
| Comprehensive evaluation system to measure SAM3 model performance against ground truth annotations from CVAT. | |
| ## Overview | |
| This subproject evaluates the SAM3 semantic segmentation endpoint by: | |
| 1. Extracting annotated images from CVAT (ground truth) | |
| 2. Running SAM3 inference on the same images | |
| 3. Computing standard segmentation metrics (mAP, mAR, IoU, confusion matrices) | |
| 4. Generating detailed reports and visualizations | |
| ## Purpose | |
| Provide quantitative metrics to: | |
| - Measure SAM3 detection accuracy for road damage | |
| - Identify systematic errors and biases | |
| - Compare performance across different damage types | |
| - Guide model improvement efforts | |
| - Track performance over time | |
| ## Dataset | |
| Extracts 150 annotated images from CVAT: | |
| - **50 images** with "Fissure" (road cracks) | |
| - **50 images** with "Nid de poule" (potholes) | |
- **50 images** with "Road" surface (any annotated image)
| Source: Logiroad organization CVAT project designated for AI training. | |
| ## Metrics Computed | |
| ### Instance-Level Metrics | |
| - **mAP (mean Average Precision)**: Detection accuracy across all confidence thresholds | |
| - **mAR (mean Average Recall)**: Coverage of ground truth instances | |
| - **Instance counts at IoU thresholds**: Number of true positives, false positives, false negatives at 0%, 25%, 50%, 75% IoU | |
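The IoU underlying these thresholds is the standard intersection-over-union between binary instance masks. A minimal sketch (the actual implementation may differ in details such as dtype handling):

```python
import numpy as np

def mask_iou(gt: np.ndarray, pred: np.ndarray) -> float:
    """Intersection-over-Union between two boolean instance masks."""
    gt = gt.astype(bool)
    pred = pred.astype(bool)
    inter = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    return float(inter) / union if union > 0 else 0.0

# Two overlapping 4x4 masks: intersection = 2 px, union = 6 px -> IoU = 1/3
a = np.zeros((4, 4), dtype=bool); a[0, :] = True              # 4 px in row 0
b = np.zeros((4, 4), dtype=bool); b[0:2, 2:4] = True          # 4 px in a 2x2 block
```

A prediction counts as a true positive at, say, the 50% threshold only when its IoU with a matched ground-truth instance is at least 0.5.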
| ### Confusion Matrices | |
| Generated at four IoU thresholds (0%, 25%, 50%, 75%): | |
| - Rows: Ground truth classes | |
| - Columns: Predicted classes | |
| - Shows class confusion patterns | |
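Conceptually, each matrix is built from matched (ground truth, prediction) pairs, keeping only matches at or above the given IoU threshold. The helper below is a hypothetical sketch, not the actual implementation:

```python
import numpy as np

CLASSES = ["Fissure", "Nid de poule", "Road"]

def confusion_matrix(matches, iou_threshold):
    """Build a confusion matrix from instance matches.

    matches: list of (gt_class, pred_class, iou) triples produced by
    instance matching. Rows are ground-truth classes, columns are
    predicted classes.
    """
    idx = {c: i for i, c in enumerate(CLASSES)}
    cm = np.zeros((len(CLASSES), len(CLASSES)), dtype=int)
    for gt_cls, pred_cls, iou in matches:
        if iou >= iou_threshold:
            cm[idx[gt_cls], idx[pred_cls]] += 1
    return cm
```

Off-diagonal entries reveal systematic confusions, e.g. potholes predicted as cracks.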
| ### Per-Class Statistics | |
| - Detection rate (% of ground truth instances found) | |
| - Precision (% of predictions that are correct) | |
| - Recall (% of ground truth that is detected) | |
| - F1-score (harmonic mean of precision and recall) | |
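These per-class statistics follow directly from the true positive (TP), false positive (FP), and false negative (FN) counts. A zero-safe sketch:

```python
def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from instance counts; zero-safe."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. 8 TP, 2 FP, 4 FN -> precision 0.8, recall ~0.667, F1 ~0.727
```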
| ## Output Structure | |
| ``` | |
| .cache/test/metrics/ | |
| ├── Fissure/ | |
| │ └── {image_name}/ | |
| │ ├── image.jpg # Original image | |
| │ ├── ground_truth/ | |
| │ │ ├── mask_Fissure_0.png # Ground truth instance masks (PNG-L) | |
| │ │ ├── mask_Fissure_1.png | |
| │ │ └── metadata.json # Mask metadata | |
| │ └── inference/ | |
| │ ├── mask_Fissure_0.png # SAM3 predicted masks (PNG-L) | |
| │ ├── mask_Road_0.png | |
| │ └── metadata.json | |
| ├── Nid de poule/ | |
| │ └── ... | |
| ├── Road/ | |
| │ └── ... | |
| ├── metrics_summary.txt # Human-readable summary | |
| ├── metrics_detailed.json # Complete metrics data | |
| └── evaluation_log.txt # Execution log | |
| ``` | |
| ## Configuration | |
All parameters are configured in `config/config.json`:
| ```json | |
| { | |
| "cvat": { | |
| "url": "https://app.cvat.ai", | |
| "organization": "Logiroad", | |
| "project_filter": "training" | |
| }, | |
| "classes": { | |
| "Fissure": 50, | |
| "Nid de poule": 50, | |
| "Road": 50 | |
| }, | |
| "sam3": { | |
| "endpoint": "https://p6irm2x7y9mwp4l4.us-east-1.aws.endpoints.huggingface.cloud" | |
| }, | |
| "metrics": { | |
| "iou_thresholds": [0.0, 0.25, 0.5, 0.75] | |
| }, | |
| "output": { | |
| "cache_dir": ".cache/test/metrics" | |
| } | |
| } | |
| ``` | |
| CVAT credentials loaded from `.env` file at project root. | |
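Loading the two sources can be sketched as below. The minimal `.env` parser is stdlib-only for illustration; the project can equally use `python-dotenv`, which is already in the dependency list. The credential key names are not specified here and would need to match the actual `.env`:

```python
import json
import os

def load_env(path=".env"):
    """Minimal .env loader (KEY=VALUE lines, '#' comments ignored)."""
    env = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    env[key.strip()] = value.strip()
    return env

def load_settings(config_path="config/config.json", env_path=".env"):
    """Load pipeline parameters from JSON and CVAT credentials from .env."""
    with open(config_path) as f:
        config = json.load(f)
    return config, load_env(env_path)
```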
| ## Usage | |
| ### Quick Start | |
| ```bash | |
| cd metrics_evaluation | |
| python run_evaluation.py | |
| ``` | |
| ### Advanced Options | |
| ```bash | |
| # Use custom config file | |
| python run_evaluation.py --config my_config.json | |
| # Force re-download images (ignore cache) | |
| python run_evaluation.py --force-download | |
| # Force re-run inference (ignore cached results) | |
| python run_evaluation.py --force-inference | |
| # Skip inference (use only cached results) | |
| python run_evaluation.py --skip-inference | |
| # Generate visualization images | |
| python run_evaluation.py --visualize | |
| ``` | |
| ## Pipeline Stages | |
| ### 1. CVAT Data Extraction | |
| - Connects to CVAT API | |
| - Finds AI training project | |
| - Discovers images with target labels | |
| - Downloads images (checks cache first) | |
| - Extracts ground truth masks from CVAT RLE format | |
| - Saves masks as PNG-L format | |
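The core of the RLE step is expanding alternating background/foreground run lengths into a binary mask. This sketch mirrors the general idea; CVAT's actual on-the-wire layout (bounding-box offsets, trailing fields) has extra details not shown here:

```python
import numpy as np

def decode_rle(counts, width, height):
    """Decode alternating background/foreground run lengths into a binary
    mask of shape (height, width). Runs start with background (0)."""
    flat = np.zeros(width * height, dtype=np.uint8)
    pos, value = 0, 0
    for run in counts:
        flat[pos:pos + run] = value
        pos += run
        value = 1 - value  # toggle between background and foreground
    return flat.reshape(height, width)
```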
| ### 2. SAM3 Inference | |
| - Loads cached images | |
| - Calls SAM3 endpoint for each image | |
| - Converts base64 masks to PNG-L format | |
| - Saves predicted masks with same structure as ground truth | |
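The base64-to-PNG-L conversion can be sketched with Pillow, assuming the endpoint returns each mask as a base64-encoded image payload (the exact response schema is not documented here):

```python
import base64
import io
from PIL import Image

def save_mask_png_l(b64_mask: str, out_path: str) -> None:
    """Decode a base64-encoded mask image and save it as a single-channel
    (mode 'L') PNG, matching the ground-truth mask format."""
    raw = base64.b64decode(b64_mask)
    mask = Image.open(io.BytesIO(raw)).convert("L")
    mask.save(out_path, format="PNG")
```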
| ### 3. Metrics Calculation | |
| - Matches predicted instances to ground truth (Hungarian algorithm) | |
| - Computes mAP, mAR at multiple IoU thresholds | |
| - Generates confusion matrices | |
| - Calculates per-class and overall statistics | |
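The Hungarian matching step can be expressed with SciPy's `linear_sum_assignment`, which finds the one-to-one assignment maximizing total IoU (a sketch of the technique, not the project's exact code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(iou_matrix: np.ndarray):
    """One-to-one matching of predictions to ground truth maximizing total
    IoU. iou_matrix[i, j] is the IoU between GT instance i and prediction j.
    linear_sum_assignment minimizes cost, so we negate to maximize."""
    gt_idx, pred_idx = linear_sum_assignment(-iou_matrix)
    return list(zip(gt_idx.tolist(), pred_idx.tolist()))

iou = np.array([[0.9, 0.1],
                [0.2, 0.8]])
# Optimal assignment pairs GT 0 with prediction 0 and GT 1 with prediction 1.
```

Matched pairs below the active IoU threshold are then counted as a false positive plus a false negative rather than a true positive.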
| ### 4. Report Generation | |
| - Creates human-readable summary (metrics_summary.txt) | |
| - Saves detailed JSON with all metrics (metrics_detailed.json) | |
| - Logs complete execution trace (evaluation_log.txt) | |
| - Optionally generates visual comparisons | |
| ## Dependencies | |
| Install required packages: | |
| ```bash | |
| pip install opencv-python numpy requests pydantic pillow scipy python-dotenv | |
| ``` | |
| ## Resumability | |
The pipeline is designed to be resumable:
| - Checks cache before downloading images | |
| - Skips inference if results already exist | |
| - Can run metrics on existing data | |
| Use `--force-download` or `--force-inference` to override cache. | |
| ## Error Handling | |
The pipeline fails fast and loud on critical errors, while tolerating non-critical ones:
- Clear error messages with context
- Validation at each stage
- Graceful handling of missing data
- Processing continues past non-critical errors (e.g., a single failed image)
- Detailed logging for debugging
| - Detailed logging for debugging | |
| ## Expected Execution Time | |
| - Image download: ~5-10 minutes (150 images, network dependent) | |
| - SAM3 inference: ~5-10 minutes (150 images, ~2s per image) | |
| - Metrics computation: ~1 minute | |
| - **Total**: ~15-20 minutes for full evaluation | |
| ## Quality Checks | |
| Before accepting results: | |
| 1. Verify 150 images downloaded | |
| 2. Check ground truth masks are not empty | |
| 3. Verify SAM3 inference completed for all images | |
| 4. Review metrics are within reasonable ranges (0-100%) | |
| 5. Inspect sample visualizations | |
| 6. Check for systematic errors in confusion matrices | |
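Check 2 (non-empty ground truth masks) is easy to automate against the documented output layout; a sketch:

```python
import numpy as np
from pathlib import Path
from PIL import Image

def find_empty_gt_masks(root: Path):
    """Return paths of ground-truth masks that contain no foreground
    pixels, scanning .cache/test/metrics/<class>/<image>/ground_truth/."""
    empty = []
    for mask_path in root.glob("*/*/ground_truth/mask_*.png"):
        if np.array(Image.open(mask_path)).max() == 0:
            empty.append(mask_path)
    return empty
```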
| ## Limitations | |
| - Only evaluates on images present in CVAT training dataset | |
| - Performance may not generalize to unseen road conditions | |
| - Metrics depend on ground truth annotation quality | |
- Instance matching uses plain mask IoU as its cost (no boundary-aware or deformable matching)
| ## Future Improvements | |
| - Add support for more label types | |
| - Implement advanced instance matching (deformable IoU, boundary IoU) | |
| - Add temporal consistency metrics for video sequences | |
| - Generate interactive HTML report with visualizations | |
| - Support for additional segmentation backends (local models, other APIs) | |
| ## Troubleshooting | |
| **CVAT connection fails:** | |
| - Check `.env` file has correct credentials | |
| - Verify CVAT is accessible from your network | |
| - Check organization name matches | |
| **No images found:** | |
| - Verify project name filter in config.json | |
| - Check labels exist in CVAT project | |
| - Ensure images have annotations | |
| **SAM3 inference errors:** | |
| - Check endpoint URL in config.json | |
| - Verify endpoint is running (test with curl) | |
| - Check network connectivity | |
| - Review endpoint logs | |
| **Metrics seem wrong:** | |
| - Check ground truth masks are valid | |
| - Verify predicted masks are in correct format | |
| - Review confusion matrices for patterns | |
| - Inspect sample images visually | |
| ## Contact | |
| For issues or questions about this evaluation system, contact the Logiroad AI team. | |
| ## License | |
| Internal Logiroad project. Not for public distribution. | |