
SAM3 Testing Guide

Overview

This guide covers two testing approaches for SAM3:

  1. Basic Inference Testing - Quick API validation with sample images
  2. Metrics Evaluation - Comprehensive performance analysis against CVAT ground truth

1. Basic Inference Testing

Purpose

Quickly validate that the SAM3 endpoint is working and producing reasonable segmentation results.

Test Infrastructure

The basic testing framework:

  • Tests multiple images automatically
  • Saves detailed JSON logs of requests and responses
  • Generates visualizations with semi-transparent colored masks
  • Stores all results in .cache/test/inference/{image_name}/
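
The semi-transparent colored overlay is standard alpha blending. As a minimal pure-Python sketch of the idea (the actual script presumably vectorizes this over whole images):

```python
def blend_pixel(image_rgb, overlay_rgb, alpha):
    """Alpha-blend one overlay color onto one image pixel.

    alpha=0.0 keeps the original pixel; alpha=1.0 paints the
    overlay color fully opaque.
    """
    return tuple(
        round((1.0 - alpha) * i + alpha * o)
        for i, o in zip(image_rgb, overlay_rgb)
    )

# Blend a red "Pothole" mask at 40% opacity onto a gray road pixel.
print(blend_pixel((128, 128, 128), (255, 0, 0), 0.4))  # -> (179, 77, 77)
```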

Running Basic Tests

python3 scripts/test/test_inference_comprehensive.py

Test Output Structure

For each test image, files are generated in .cache/test/inference/{image_name}/:

  • request.json - Request metadata (timestamp, endpoint, classes)
  • response.json - Response metadata (timestamp, status, results summary)
  • full_results.json - Complete API response including base64 masks
  • original.jpg - Original test image
  • visualization.png - Original image with colored mask overlay
  • legend.png - Legend showing class colors and coverage percentages
  • mask_{ClassName}.png - Individual binary masks for each class
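
The base64 masks in full_results.json can be decoded back to PNG bytes for inspection. A small helper (the exact JSON layout is an assumption; check the real schema before relying on it):

```python
import base64


def decode_mask(b64_png: str) -> bytes:
    """Decode a base64-encoded PNG mask, validating the PNG signature."""
    data = base64.b64decode(b64_png)
    if data[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("decoded payload is not a PNG")
    return data


# Hypothetical usage, assuming masks live under a "masks" key keyed by
# class name -- verify against the actual full_results.json structure:
# results = json.load(open(".cache/test/inference/test/full_results.json"))
# png_bytes = decode_mask(results["masks"]["Pothole"])
```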

Tested Classes

The endpoint is tested with these semantic classes:

  • Pothole (Red overlay)
  • Road crack (Yellow overlay)
  • Road (Blue overlay)
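
A request for these classes could be assembled along these lines. The field names ("image", "classes") are illustrative assumptions, not the confirmed API schema:

```python
import base64
import json


def build_request(image_path: str, classes: list[str]) -> str:
    """Sketch of a SAM3 request body: base64-encoded image plus the
    semantic classes to segment. Field names are hypothetical."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({"image": image_b64, "classes": classes})
```

The test script would then send this payload for `["Pothole", "Road crack", "Road"]`.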

Recent Test Results

Last run: November 23, 2025

  • Total images: 8
  • Successful: 8/8 (100%)
  • Failed: 0
  • Average response time: ~1.5 seconds per image
  • Status: All API calls returning HTTP 200 with valid masks

Test images include:

  • pothole_pexels_01.jpg, pothole_pexels_02.jpg
  • road_damage_01.jpg
  • road_pexels_01.jpg, road_pexels_02.jpg, road_pexels_03.jpg
  • road_unsplash_01.jpg
  • test.jpg

Results stored in .cache/test/inference/summary.json

Adding More Test Images

Test images should be placed in assets/test_images/. To expand the test suite:

  1. Download from Public Datasets: Obtain candidate road imagery from publicly available datasets

  2. Extract Sample Images: Select diverse examples showing potholes, cracks, and clean roads

  3. Place in Test Directory: Copy to assets/test_images/


2. Metrics Evaluation System

Purpose

Comprehensive quantitative evaluation of SAM3 performance against ground truth annotations from CVAT.

What It Measures

  • mAP (mean Average Precision): Detection accuracy across all confidence thresholds
  • mAR (mean Average Recall): Coverage of ground truth instances
  • IoU metrics: Intersection over Union at multiple thresholds (0%, 25%, 50%, 75%)
  • Confusion matrices: Class prediction accuracy patterns
  • Per-class statistics: Precision, recall, F1-score for each damage type
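
For reference, the IoU between a predicted and a ground-truth mask is the ratio of overlapping foreground pixels to combined foreground pixels. A minimal pure-Python sketch (the evaluation code presumably uses NumPy arrays instead of nested lists):

```python
def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as equally sized 2D lists of 0/1."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += a & b
            union += a | b
    return inter / union if union else 0.0


a = [[1, 1, 0],
     [1, 1, 0]]
b = [[0, 1, 1],
     [0, 1, 1]]
print(mask_iou(a, b))  # 2 px overlap / 6 px union -> 0.333...
```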

Running Metrics Evaluation

cd metrics_evaluation
python run_evaluation.py

Options:

# Force re-download from CVAT (ignore cache)
python run_evaluation.py --force-download

# Force re-run inference (ignore cached predictions)
python run_evaluation.py --force-inference

# Skip inference step (use existing predictions)
python run_evaluation.py --skip-inference

# Generate visual comparisons
python run_evaluation.py --visualize

Dataset

Evaluates on 150 annotated images from CVAT:

  • 50 images with "Fissure" (road cracks)
  • 50 images with "Nid de poule" (potholes)
  • 50 images with road surface

Source: Logiroad CVAT organization, AI training project

Output Structure

.cache/test/metrics/
β”œβ”€β”€ Fissure/
β”‚   └── {image_name}/
β”‚       β”œβ”€β”€ image.jpg
β”‚       β”œβ”€β”€ ground_truth/
β”‚       β”‚   β”œβ”€β”€ mask_Fissure_0.png
β”‚       β”‚   └── metadata.json
β”‚       └── inference/
β”‚           β”œβ”€β”€ mask_Fissure_0.png
β”‚           └── metadata.json
β”œβ”€β”€ Nid de poule/
β”œβ”€β”€ Road/
β”œβ”€β”€ metrics_summary.txt        # Human-readable results
β”œβ”€β”€ metrics_detailed.json      # Complete metrics data
└── evaluation_log.txt         # Execution trace
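
The per-class F1 score reported in the metrics output is the harmonic mean of precision and recall:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, as reported per class."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# e.g. a class detected with precision 0.80 and recall 0.60:
print(round(f1_score(0.80, 0.60), 4))  # -> 0.6857
```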

Execution Time

  • Image download: ~5-10 minutes (150 images)
  • SAM3 inference: ~5-10 minutes (~2s per image)
  • Metrics computation: ~1 minute
  • Total: ~15-20 minutes for full evaluation

Configuration

Edit metrics_evaluation/config/config.json to:

  • Change CVAT project or organization
  • Adjust number of images per class
  • Modify IoU thresholds
  • Update SAM3 endpoint URL

CVAT credentials must be in .env at project root.


Cache Directory

All test results are stored in .cache/ (git-ignored), which lets you:

  • Review results without cluttering the repository
  • Compare results across different test runs
  • Debug segmentation quality issues
  • Resume interrupted evaluations

Quality Validation Checklist

Before accepting test results:

Basic Tests:

  • All test images processed successfully
  • Masks generated for all requested classes
  • Response times reasonable (< 3s per image)
  • Visualizations show plausible segmentations

Metrics Evaluation:

  • 150 images downloaded from CVAT
  • Ground truth masks not empty
  • SAM3 inference completed for all images
  • Metrics within reasonable ranges (0-100%)
  • Confusion matrices show sensible patterns
  • Per-class F1 scores above baseline

Troubleshooting

Basic Inference Issues

Endpoint not responding:

  • Check endpoint URL in test script
  • Verify endpoint is running (use curl or browser)
  • Check network connectivity

Empty or invalid masks:

  • Review class names match model expectations
  • Check image format (should be JPEG/PNG)
  • Verify base64 encoding/decoding
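
A quick way to rule out a format mix-up is to sniff the file's magic bytes before base64-encoding and sending it:

```python
def sniff_image_format(data: bytes) -> str:
    """Identify JPEG/PNG from magic bytes; anything else is suspect."""
    if data[:3] == b"\xff\xd8\xff":
        return "jpeg"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    return "unknown"
```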

Metrics Evaluation Issues

CVAT connection fails:

  • Check .env credentials
  • Verify CVAT organization name
  • Test CVAT web access

No images found:

  • Check project filter in config.json
  • Verify labels exist in CVAT
  • Ensure images have annotations

Metrics seem incorrect:

  • Inspect confusion matrices
  • Review sample visualizations
  • Check ground truth quality in CVAT
  • Verify mask format (PNG-L, 8-bit grayscale)
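
To confirm a mask really is 8-bit grayscale (PNG mode "L") without extra dependencies, the IHDR chunk at the start of the file can be inspected directly; color type 0 means grayscale:

```python
import struct


def png_ihdr(data: bytes):
    """Read width, height, bit depth and color type from a PNG's IHDR chunk.

    For an 8-bit grayscale (PNG mode "L") mask, expect
    bit_depth == 8 and color_type == 0.
    """
    if data[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG")
    # IHDR is always the first chunk: 4-byte length, b"IHDR", 13 data bytes
    width, height = struct.unpack(">II", data[16:24])
    bit_depth, color_type = data[24], data[25]
    return width, height, bit_depth, color_type
```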

Next Steps

  1. Run basic tests to validate API connectivity
  2. Review visualizations to assess segmentation quality
  3. Run metrics evaluation for quantitative performance
  4. Analyze confusion matrices to identify systematic errors
  5. Iterate on model/prompts based on metrics feedback

For detailed metrics evaluation documentation, see metrics_evaluation/README.md.