SAM3 Metrics Evaluation
Comprehensive evaluation system to measure SAM3 model performance against ground truth annotations from CVAT.
Overview
This subproject evaluates the SAM3 semantic segmentation endpoint by:
- Extracting annotated images from CVAT (ground truth)
- Running SAM3 inference on the same images
- Computing standard segmentation metrics (mAP, mAR, IoU, confusion matrices)
- Generating detailed reports and visualizations
Purpose
Provide quantitative metrics to:
- Measure SAM3 detection accuracy for road damage
- Identify systematic errors and biases
- Compare performance across different damage types
- Guide model improvement efforts
- Track performance over time
Dataset
Extracts 150 annotated images from CVAT:
- 50 images with "Fissure" (road cracks)
- 50 images with "Nid de poule" (potholes)
- 50 images with road surface (any annotated image)
Source: Logiroad organization CVAT project designated for AI training.
Metrics Computed
Instance-Level Metrics
- mAP (mean Average Precision): Detection accuracy across all confidence thresholds
- mAR (mean Average Recall): Coverage of ground truth instances
- Instance counts at IoU thresholds: Number of true positives, false positives, false negatives at 0%, 25%, 50%, 75% IoU
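All of these metrics reduce to pairwise mask IoU between a ground truth instance and a predicted instance. A minimal sketch (function name is illustrative, not the pipeline's actual API):

```python
import numpy as np

def mask_iou(gt: np.ndarray, pred: np.ndarray) -> float:
    """Intersection-over-Union between two binary instance masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    union = np.logical_or(gt, pred).sum()
    if union == 0:
        return 0.0  # both masks empty: define IoU as 0
    return float(np.logical_and(gt, pred).sum() / union)
```

An IoU of 0.0 means no overlap; 1.0 means the masks are identical. The 0%, 25%, 50%, 75% thresholds above decide when an overlap counts as a true positive.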
Confusion Matrices
Generated at four IoU thresholds (0%, 25%, 50%, 75%):
- Rows: Ground truth classes
- Columns: Predicted classes
- Shows class confusion patterns
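Given instance pairs matched at one IoU threshold, the matrix can be built by counting (ground truth class, predicted class) pairs. A hedged sketch (input shape and helper name are assumptions):

```python
import numpy as np

def confusion_matrix(gt_labels, pred_labels, class_names):
    """Rows: ground-truth classes; columns: predicted classes.
    Inputs are the class names of instance pairs matched at a
    given IoU threshold."""
    idx = {name: i for i, name in enumerate(class_names)}
    cm = np.zeros((len(class_names), len(class_names)), dtype=int)
    for g, p in zip(gt_labels, pred_labels):
        cm[idx[g], idx[p]] += 1
    return cm
```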
Per-Class Statistics
- Detection rate (% of ground truth instances found)
- Precision (% of predictions that are correct)
- Recall (% of ground truth that is detected)
- F1-score (harmonic mean of precision and recall)
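These follow the standard definitions from true/false positive and false negative counts; for example (illustrative helper, not the pipeline's API):

```python
def per_class_stats(tp: int, fp: int, fn: int) -> dict:
    """Standard precision / recall / F1 from instance counts,
    guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```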
Output Structure
.cache/test/metrics/
├── Fissure/
│   └── {image_name}/
│       ├── image.jpg                 # Original image
│       ├── ground_truth/
│       │   ├── mask_Fissure_0.png    # Ground truth instance masks (PNG-L)
│       │   ├── mask_Fissure_1.png
│       │   └── metadata.json         # Mask metadata
│       └── inference/
│           ├── mask_Fissure_0.png    # SAM3 predicted masks (PNG-L)
│           ├── mask_Road_0.png
│           └── metadata.json
├── Nid de poule/
│   └── ...
├── Road/
│   └── ...
├── metrics_summary.txt               # Human-readable summary
├── metrics_detailed.json             # Complete metrics data
└── evaluation_log.txt                # Execution log
Configuration
All parameters are configured in config/config.json:
{
  "cvat": {
    "url": "https://app.cvat.ai",
    "organization": "Logiroad",
    "project_filter": "training"
  },
  "classes": {
    "Fissure": 50,
    "Nid de poule": 50,
    "Road": 50
  },
  "sam3": {
    "endpoint": "https://p6irm2x7y9mwp4l4.us-east-1.aws.endpoints.huggingface.cloud"
  },
  "metrics": {
    "iou_thresholds": [0.0, 0.25, 0.5, 0.75]
  },
  "output": {
    "cache_dir": ".cache/test/metrics"
  }
}
CVAT credentials loaded from .env file at project root.
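Loading and validating this file might look like the following minimal sketch; the actual loader may differ (for instance, it may use pydantic models, given the dependency list below):

```python
import json

# Top-level sections the pipeline expects (taken from the config above)
REQUIRED_SECTIONS = {"cvat", "classes", "sam3", "metrics", "output"}

def load_config(path: str = "config/config.json") -> dict:
    """Load config.json and fail loudly if a top-level section is missing."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    missing = REQUIRED_SECTIONS - cfg.keys()
    if missing:
        raise KeyError(f"config missing sections: {sorted(missing)}")
    return cfg
```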
Usage
Quick Start
cd metrics_evaluation
python run_evaluation.py
Advanced Options
# Use custom config file
python run_evaluation.py --config my_config.json
# Force re-download images (ignore cache)
python run_evaluation.py --force-download
# Force re-run inference (ignore cached results)
python run_evaluation.py --force-inference
# Skip inference (use only cached results)
python run_evaluation.py --skip-inference
# Generate visualization images
python run_evaluation.py --visualize
Pipeline Stages
1. CVAT Data Extraction
- Connects to CVAT API
- Finds AI training project
- Discovers images with target labels
- Downloads images (checks cache first)
- Extracts ground truth masks from CVAT RLE format
- Saves masks as PNG-L format
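The decode-and-save step could be sketched as below. Note that CVAT's actual RLE layout may differ from this simple alternating-run encoding, so treat the decoder as illustrative only:

```python
import numpy as np
from PIL import Image

def rle_to_mask(counts: list[int], width: int, height: int) -> np.ndarray:
    """Decode a flat run-length encoding (alternating background/
    foreground runs, row-major) into a binary uint8 mask."""
    flat = np.zeros(width * height, dtype=np.uint8)
    foreground, pos = False, 0
    for run in counts:
        if foreground:
            flat[pos:pos + run] = 255
        pos += run
        foreground = not foreground
    return flat.reshape(height, width)

def save_png_l(mask: np.ndarray, path: str) -> None:
    """Save a mask in PNG-L (8-bit grayscale) format."""
    Image.fromarray(mask, mode="L").save(path)
```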
2. SAM3 Inference
- Loads cached images
- Calls SAM3 endpoint for each image
- Converts base64 masks to PNG-L format
- Saves predicted masks with same structure as ground truth
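Converting a base64-encoded PNG mask from the endpoint response into a grayscale array might look like this (the response carrying base64 PNGs is stated above, but the exact field layout is an assumption):

```python
import base64
import io
import numpy as np
from PIL import Image

def decode_base64_mask(b64_png: str) -> np.ndarray:
    """Decode a base64-encoded PNG mask into an 8-bit grayscale array,
    ready to be saved alongside the ground truth masks."""
    raw = base64.b64decode(b64_png)
    return np.array(Image.open(io.BytesIO(raw)).convert("L"))
```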
3. Metrics Calculation
- Matches predicted instances to ground truth (Hungarian algorithm)
- Computes mAP, mAR at multiple IoU thresholds
- Generates confusion matrices
- Calculates per-class and overall statistics
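The Hungarian matching step can be sketched with scipy's linear_sum_assignment; function and variable names here are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(iou_matrix: np.ndarray, iou_threshold: float = 0.5):
    """Match predictions (columns) to ground-truth instances (rows) by
    maximizing total IoU, then keep only pairs above the threshold."""
    rows, cols = linear_sum_assignment(-iou_matrix)  # negate to maximize
    matches = [(int(r), int(c)) for r, c in zip(rows, cols)
               if iou_matrix[r, c] >= iou_threshold]
    tp = len(matches)
    fn = iou_matrix.shape[0] - tp  # unmatched ground truth
    fp = iou_matrix.shape[1] - tp  # unmatched predictions
    return matches, tp, fp, fn
```

Running this once per IoU threshold in the config yields the instance counts reported above.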
4. Report Generation
- Creates human-readable summary (metrics_summary.txt)
- Saves detailed JSON with all metrics (metrics_detailed.json)
- Logs complete execution trace (evaluation_log.txt)
- Optionally generates visual comparisons
Dependencies
Install required packages:
pip install opencv-python numpy requests pydantic pillow scipy python-dotenv
Resumability
Pipeline is designed to be resumable:
- Checks cache before downloading images
- Skips inference if results already exist
- Can run metrics on existing data
Use --force-download or --force-inference to override cache.
Error Handling
The pipeline follows a "fail fast, fail loud" principle for critical failures, while recovering from non-critical ones:
- Clear error messages with context
- Validation at each stage
- Graceful handling of missing data
- Continues processing on non-critical errors
- Detailed logging for debugging
Expected Execution Time
- Image download: ~5-10 minutes (150 images, network dependent)
- SAM3 inference: ~5-10 minutes (150 images, ~2s per image)
- Metrics computation: ~1 minute
- Total: ~15-20 minutes for full evaluation
Quality Checks
Before accepting results:
- Verify 150 images downloaded
- Check ground truth masks are not empty
- Verify SAM3 inference completed for all images
- Review metrics are within reasonable ranges (0-100%)
- Inspect sample visualizations
- Check for systematic errors in confusion matrices
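Some of these checks can be automated. A minimal sketch assuming the output layout above (the function name, expected counts, and exact checks are illustrative):

```python
from pathlib import Path

def sanity_check(root: Path, expected_images: int = 150) -> list[str]:
    """Return a list of problems to review before trusting the metrics."""
    problems = []
    images = list(root.rglob("image.jpg"))
    if len(images) != expected_images:
        problems.append(
            f"expected {expected_images} images, found {len(images)}")
    for img in images:
        # Each image directory should contain SAM3 inference results
        if not (img.parent / "inference" / "metadata.json").exists():
            problems.append(f"missing inference results for {img.parent.name}")
    return problems
```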
Limitations
- Only evaluates on images present in CVAT training dataset
- Performance may not generalize to unseen road conditions
- Metrics depend on ground truth annotation quality
- Instance matching uses simple IoU (no advanced matching algorithms)
Future Improvements
- Add support for more label types
- Implement advanced instance matching (deformable IoU, boundary IoU)
- Add temporal consistency metrics for video sequences
- Generate interactive HTML report with visualizations
- Support for additional segmentation backends (local models, other APIs)
Troubleshooting
CVAT connection fails:
- Check .env file has correct credentials
- Verify CVAT is accessible from your network
- Check organization name matches
No images found:
- Verify project name filter in config.json
- Check labels exist in CVAT project
- Ensure images have annotations
SAM3 inference errors:
- Check endpoint URL in config.json
- Verify endpoint is running (test with curl)
- Check network connectivity
- Review endpoint logs
Metrics seem wrong:
- Check ground truth masks are valid
- Verify predicted masks are in correct format
- Review confusion matrices for patterns
- Inspect sample images visually
Contact
For issues or questions about this evaluation system, contact the Logiroad AI team.
License
Internal Logiroad project. Not for public distribution.