
SAM3 Metrics Evaluation

A comprehensive evaluation system for measuring SAM3 model performance against ground truth annotations from CVAT.

Overview

This subproject evaluates the SAM3 semantic segmentation endpoint by:

  1. Extracting annotated images from CVAT (ground truth)
  2. Running SAM3 inference on the same images
  3. Computing standard segmentation metrics (mAP, mAR, IoU, confusion matrices)
  4. Generating detailed reports and visualizations

Purpose

Provide quantitative metrics to:

  • Measure SAM3 detection accuracy for road damage
  • Identify systematic errors and biases
  • Compare performance across different damage types
  • Guide model improvement efforts
  • Track performance over time

Dataset

Extracts 150 annotated images from CVAT:

  • 50 images with "Fissure" (road cracks)
  • 50 images with "Nid de poule" (potholes)
  • 50 images with road surface (any annotated image)

Source: Logiroad organization CVAT project designated for AI training.

Metrics Computed

Instance-Level Metrics

  • mAP (mean Average Precision): Detection accuracy across all confidence thresholds
  • mAR (mean Average Recall): Coverage of ground truth instances
  • Instance counts at IoU thresholds: Number of true positives, false positives, false negatives at 0%, 25%, 50%, 75% IoU
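All of these metrics are built on mask IoU. A minimal sketch of the IoU computation between two binary masks (the helper name and the toy masks are illustrative, not the project's actual code):

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union else 0.0

# Two overlapping 4x4 toy masks
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 0:2] = True            # 4 ground-truth pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:3] = True          # 4 predicted pixels, 1 pixel overlap
iou = mask_iou(gt, pred)       # intersection 1 / union 7
```

A prediction counts as a true positive at a given threshold when its IoU with a ground truth instance meets that threshold.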

Confusion Matrices

Generated at four IoU thresholds (0%, 25%, 50%, 75%):

  • Rows: Ground truth classes
  • Columns: Predicted classes
  • Shows class confusion patterns
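Each matrix can be accumulated from the matched (ground truth, prediction) pairs that pass the given IoU threshold. A minimal sketch, with a hypothetical set of matched pairs:

```python
import numpy as np

classes = ["Fissure", "Nid de poule", "Road"]
idx = {c: i for i, c in enumerate(classes)}

# Hypothetical (ground truth class, predicted class) pairs at one IoU threshold
matches = [("Fissure", "Fissure"), ("Fissure", "Road"), ("Nid de poule", "Nid de poule")]

cm = np.zeros((len(classes), len(classes)), dtype=int)  # rows: GT, cols: predicted
for gt_cls, pred_cls in matches:
    cm[idx[gt_cls], idx[pred_cls]] += 1
```

Off-diagonal cells (like the Fissure-predicted-as-Road entry above) expose systematic class confusion.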

Per-Class Statistics

  • Detection rate (% of ground truth instances found)
  • Precision (% of predictions that are correct)
  • Recall (% of ground truth that is detected)
  • F1-score (harmonic mean of precision and recall)
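These statistics follow directly from the true positive, false positive, and false negative counts. A small illustration (the counts are made up):

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from instance counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. 8 correct detections, 2 spurious ones, 4 missed instances
p, r, f = prf1(tp=8, fp=2, fn=4)
```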

Output Structure

.cache/test/metrics/
├── Fissure/
│   └── {image_name}/
│       ├── image.jpg                    # Original image
│       ├── ground_truth/
│       │   ├── mask_Fissure_0.png       # Ground truth instance masks (PNG-L)
│       │   ├── mask_Fissure_1.png
│       │   └── metadata.json            # Mask metadata
│       └── inference/
│           ├── mask_Fissure_0.png       # SAM3 predicted masks (PNG-L)
│           ├── mask_Road_0.png
│           └── metadata.json
├── Nid de poule/
│   └── ...
├── Road/
│   └── ...
├── metrics_summary.txt                  # Human-readable summary
├── metrics_detailed.json                # Complete metrics data
└── evaluation_log.txt                   # Execution log

Configuration

All parameters configured in config/config.json:

{
  "cvat": {
    "url": "https://app.cvat.ai",
    "organization": "Logiroad",
    "project_filter": "training"
  },
  "classes": {
    "Fissure": 50,
    "Nid de poule": 50,
    "Road": 50
  },
  "sam3": {
    "endpoint": "https://p6irm2x7y9mwp4l4.us-east-1.aws.endpoints.huggingface.cloud"
  },
  "metrics": {
    "iou_thresholds": [0.0, 0.25, 0.5, 0.75]
  },
  "output": {
    "cache_dir": ".cache/test/metrics"
  }
}

CVAT credentials loaded from .env file at project root.

Usage

Quick Start

cd metrics_evaluation
python run_evaluation.py

Advanced Options

# Use custom config file
python run_evaluation.py --config my_config.json

# Force re-download images (ignore cache)
python run_evaluation.py --force-download

# Force re-run inference (ignore cached results)
python run_evaluation.py --force-inference

# Skip inference (use only cached results)
python run_evaluation.py --skip-inference

# Generate visualization images
python run_evaluation.py --visualize

Pipeline Stages

1. CVAT Data Extraction

  • Connects to CVAT API
  • Finds AI training project
  • Discovers images with target labels
  • Downloads images (checks cache first)
  • Extracts ground truth masks from CVAT RLE format
  • Saves masks as PNG-L format
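The ground truth masks arrive run-length encoded. A minimal decoding sketch, assuming alternating background/foreground run counts in row-major order; CVAT's exact on-the-wire layout may differ:

```python
import numpy as np

def decode_rle(counts: list[int], height: int, width: int) -> np.ndarray:
    """Expand alternating background/foreground run lengths into a binary mask.
    Sketch only: assumes the run list starts with a background run and
    covers the full height*width area."""
    flat = np.zeros(height * width, dtype=np.uint8)
    pos, value = 0, 0
    for run in counts:
        flat[pos:pos + run] = value
        pos += run
        value = 1 - value          # alternate background/foreground
    return flat.reshape(height, width)

# 3 background pixels, 2 foreground, 11 background on a 4x4 grid
mask = decode_rle([3, 2, 11], height=4, width=4)
```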

2. SAM3 Inference

  • Loads cached images
  • Calls SAM3 endpoint for each image
  • Converts base64 masks to PNG-L format
  • Saves predicted masks with same structure as ground truth
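The base64-to-mask step can be sketched as follows, assuming the endpoint returns raw 8-bit mask bytes; if it actually returns base64-encoded PNG bytes, an image decoder such as Pillow would be needed instead:

```python
import base64
import numpy as np

def b64_to_mask(data: str, height: int, width: int) -> np.ndarray:
    """Decode a base64 string of raw 8-bit mask bytes into a 2-D array."""
    raw = base64.b64decode(data)
    return np.frombuffer(raw, dtype=np.uint8).reshape(height, width)

# Round-trip a tiny 2x2 mask to demonstrate
payload = base64.b64encode(bytes([0, 255, 255, 0])).decode("ascii")
mask = b64_to_mask(payload, height=2, width=2)
```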

3. Metrics Calculation

  • Matches predicted instances to ground truth (Hungarian algorithm)
  • Computes mAP, mAR at multiple IoU thresholds
  • Generates confusion matrices
  • Calculates per-class and overall statistics
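The matching step pairs instances via an IoU matrix between ground truth and predictions; scipy's linear_sum_assignment solves the underlying assignment problem. A sketch with a made-up IoU matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical IoU matrix: rows = ground truth instances, cols = predictions
iou = np.array([
    [0.8, 0.1, 0.0],
    [0.2, 0.6, 0.1],
])

# Hungarian algorithm minimises cost, so negate IoU to maximise total overlap
rows, cols = linear_sum_assignment(-iou)

# Keep only pairs that clear the IoU threshold; the rest become FPs/FNs
matches = [(int(r), int(c)) for r, c in zip(rows, cols) if iou[r, c] >= 0.5]
```

Unmatched predictions count as false positives and unmatched ground truth instances as false negatives.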

4. Report Generation

  • Creates human-readable summary (metrics_summary.txt)
  • Saves detailed JSON with all metrics (metrics_detailed.json)
  • Logs complete execution trace (evaluation_log.txt)
  • Optionally generates visual comparisons

Dependencies

Install required packages:

pip install opencv-python numpy requests pydantic pillow scipy python-dotenv

Resumability

Pipeline is designed to be resumable:

  • Checks cache before downloading images
  • Skips inference if results already exist
  • Can run metrics on existing data

Use --force-download or --force-inference to override cache.
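The cache check before each stage can be sketched like this (hypothetical helper; the directory layout mirrors the output structure above):

```python
import json
import tempfile
from pathlib import Path

def needs_inference(image_dir: Path, force: bool = False) -> bool:
    """Return True when inference should run: either forced, or no
    cached metadata.json exists under this image's inference/ folder."""
    meta = image_dir / "inference" / "metadata.json"
    return force or not meta.exists()

# Demonstrate against a throwaway directory
image_dir = Path(tempfile.mkdtemp())
fresh = needs_inference(image_dir)              # no cache yet
(image_dir / "inference").mkdir()
(image_dir / "inference" / "metadata.json").write_text(json.dumps({}))
cached = needs_inference(image_dir)             # cache present
forced = needs_inference(image_dir, force=True) # --force-inference behaviour
```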

Error Handling

The pipeline follows a "fail fast, fail loud" principle for critical errors:

  • Clear error messages with context
  • Validation at each stage
  • Graceful handling of missing data
  • Continued processing on non-critical errors
  • Detailed logging for debugging

Expected Execution Time

  • Image download: ~5-10 minutes (150 images, network dependent)
  • SAM3 inference: ~5-10 minutes (150 images, ~2s per image)
  • Metrics computation: ~1 minute
  • Total: ~15-20 minutes for full evaluation

Quality Checks

Before accepting results:

  1. Verify 150 images downloaded
  2. Check ground truth masks are not empty
  3. Verify SAM3 inference completed for all images
  4. Review metrics are within reasonable ranges (0-100%)
  5. Inspect sample visualizations
  6. Check for systematic errors in confusion matrices

Limitations

  • Only evaluates on images present in CVAT training dataset
  • Performance may not generalize to unseen road conditions
  • Metrics depend on ground truth annotation quality
  • Instance matching uses plain mask IoU as the matching cost (no boundary-aware or deformable matching)

Future Improvements

  • Add support for more label types
  • Implement advanced instance matching (deformable IoU, boundary IoU)
  • Add temporal consistency metrics for video sequences
  • Generate interactive HTML report with visualizations
  • Support for additional segmentation backends (local models, other APIs)

Troubleshooting

CVAT connection fails:

  • Check .env file has correct credentials
  • Verify CVAT is accessible from your network
  • Check organization name matches

No images found:

  • Verify project name filter in config.json
  • Check labels exist in CVAT project
  • Ensure images have annotations

SAM3 inference errors:

  • Check endpoint URL in config.json
  • Verify endpoint is running (test with curl)
  • Check network connectivity
  • Review endpoint logs

Metrics seem wrong:

  • Check ground truth masks are valid
  • Verify predicted masks are in correct format
  • Review confusion matrices for patterns
  • Inspect sample images visually

Contact

For issues or questions about this evaluation system, contact the Logiroad AI team.

License

Internal Logiroad project. Not for public distribution.