# SAM3 Testing Guide

## Overview

This guide covers two testing approaches for SAM3:

1. **Basic Inference Testing** - Quick API validation with sample images
2. **Metrics Evaluation** - Comprehensive performance analysis against CVAT ground truth

---

## 1. Basic Inference Testing

### Purpose

Quickly validate that the SAM3 endpoint is working and producing reasonable segmentation results.

### Test Infrastructure

The basic testing framework:

- Tests multiple images automatically
- Saves detailed JSON logs of requests and responses
- Generates visualizations with semi-transparent colored masks
- Stores all results in `.cache/test/inference/{image_name}/`

### Running Basic Tests

```bash
python3 scripts/test/test_inference_comprehensive.py
```

### Test Output Structure

For each test image, files are generated in `.cache/test/inference/{image_name}/`:

- `request.json` - Request metadata (timestamp, endpoint, classes)
- `response.json` - Response metadata (timestamp, status, results summary)
- `full_results.json` - Complete API response including base64 masks
- `original.jpg` - Original test image
- `visualization.png` - Original image with colored mask overlay
- `legend.png` - Legend showing class colors and coverage percentages
- `mask_{ClassName}.png` - Individual binary masks for each class

### Tested Classes

The endpoint is tested with these semantic classes:

- **Pothole** (red overlay)
- **Road crack** (yellow overlay)
- **Road** (blue overlay)

### Recent Test Results

**Last run**: November 23, 2025

- **Total images**: 8
- **Successful**: 8/8 (100%)
- **Failed**: 0
- **Average response time**: ~1.5 seconds per image
- **Status**: All API calls returning HTTP 200 with valid masks

Test images include:

- `pothole_pexels_01.jpg`, `pothole_pexels_02.jpg`
- `road_damage_01.jpg`
- `road_pexels_01.jpg`, `road_pexels_02.jpg`, `road_pexels_03.jpg`
- `road_unsplash_01.jpg`
- `test.jpg`

Results are stored in `.cache/test/inference/summary.json`.

### Adding More Test Images

Test
images should be placed in `assets/test_images/`. To expand the test suite:

1. **Download from Public Datasets**:
   - [Pothole Detection Dataset](https://github.com/jaygala24/pothole-detection/releases/download/v1.0.0/Pothole.Dataset.IVCNZ.zip) (1,243 images)
   - [RDD2022 Dataset](https://github.com/sekilab/RoadDamageDetector) (47,420 images from 6 countries)
   - [Roboflow Pothole Dataset](https://public.roboflow.com/object-detection/pothole/)
2. **Extract Sample Images**: Select diverse examples showing potholes, cracks, and clean roads
3. **Place in Test Directory**: Copy to `assets/test_images/`

---

## 2. Metrics Evaluation System

### Purpose

Comprehensive quantitative evaluation of SAM3 performance against ground truth annotations from CVAT.

### What It Measures

- **mAP (mean Average Precision)**: Detection accuracy across all confidence thresholds
- **mAR (mean Average Recall)**: Coverage of ground truth instances
- **IoU metrics**: Intersection over Union at multiple thresholds (0%, 25%, 50%, 75%)
- **Confusion matrices**: Class prediction accuracy patterns
- **Per-class statistics**: Precision, recall, F1-score for each damage type

### Running Metrics Evaluation

```bash
cd metrics_evaluation
python run_evaluation.py
```

**Options**:

```bash
# Force re-download from CVAT (ignore cache)
python run_evaluation.py --force-download

# Force re-run inference (ignore cached predictions)
python run_evaluation.py --force-inference

# Skip inference step (use existing predictions)
python run_evaluation.py --skip-inference

# Generate visual comparisons
python run_evaluation.py --visualize
```

### Dataset

Evaluates on **150 annotated images** from CVAT:

- **50 images** with "Fissure" (road cracks)
- **50 images** with "Nid de poule" (potholes)
- **50 images** with road surface

Source: Logiroad CVAT organization, AI training project

### Output Structure

```
.cache/test/metrics/
├── Fissure/
│   └── {image_name}/
│       ├── image.jpg
│       ├── ground_truth/
│       │   ├── mask_Fissure_0.png
│       │   └── metadata.json
│       └── inference/
│           ├── mask_Fissure_0.png
│           └── metadata.json
├── Nid de poule/
├── Road/
├── metrics_summary.txt    # Human-readable results
├── metrics_detailed.json  # Complete metrics data
└── evaluation_log.txt     # Execution trace
```

### Execution Time

- Image download: ~5-10 minutes (150 images)
- SAM3 inference: ~5-10 minutes (~2s per image)
- Metrics computation: ~1 minute
- **Total**: ~15-20 minutes for a full evaluation

### Configuration

Edit `metrics_evaluation/config/config.json` to:

- Change the CVAT project or organization
- Adjust the number of images per class
- Modify IoU thresholds
- Update the SAM3 endpoint URL

CVAT credentials must be in `.env` at the project root.

---

## Cache Directory

All test results are stored in `.cache/` (git-ignored), so you can:

- Review results without cluttering the repository
- Compare results across different test runs
- Debug segmentation quality issues
- Resume interrupted evaluations

---

## Quality Validation Checklist

Before accepting test results:

**Basic Tests**:

- [ ] All test images processed successfully
- [ ] Masks generated for all requested classes
- [ ] Response times reasonable (< 3s per image)
- [ ] Visualizations show plausible segmentations

**Metrics Evaluation**:

- [ ] 150 images downloaded from CVAT
- [ ] Ground truth masks not empty
- [ ] SAM3 inference completed for all images
- [ ] Metrics within reasonable ranges (0-100%)
- [ ] Confusion matrices show sensible patterns
- [ ] Per-class F1 scores above baseline

---

## Troubleshooting

### Basic Inference Issues

**Endpoint not responding**:

- Check the endpoint URL in the test script
- Verify the endpoint is running (use `curl` or a browser)
- Check network connectivity

**Empty or invalid masks**:

- Review that class names match model expectations
- Check the image format (should be JPEG/PNG)
- Verify base64 encoding/decoding

### Metrics Evaluation Issues

**CVAT connection fails**:

- Check `.env` credentials
- Verify the CVAT organization name
- Test CVAT web access

**No images found**:

- Check
project filter in `config.json`
- Verify labels exist in CVAT
- Ensure images have annotations

**Metrics seem incorrect**:

- Inspect confusion matrices
- Review sample visualizations
- Check ground truth quality in CVAT
- Verify mask format (PNG-L, 8-bit grayscale)

---

## Next Steps

1. **Run basic tests** to validate API connectivity
2. **Review visualizations** to assess segmentation quality
3. **Run metrics evaluation** for quantitative performance
4. **Analyze confusion matrices** to identify systematic errors
5. **Iterate on model/prompts** based on metrics feedback

For detailed metrics evaluation documentation, see `metrics_evaluation/README.md`.
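As a reference for interpreting the thresholded IoU numbers the evaluation reports, the metric can be sketched in a few lines of plain Python. This is an illustration only, not the evaluation pipeline's actual implementation: `mask_iou` and the toy masks below are hypothetical, standing in for binary masks decoded from the 8-bit grayscale mask PNGs.

```python
def mask_iou(pred, gt):
    """Intersection-over-Union between two same-shaped binary masks.

    `pred` and `gt` are 2D lists of 0/1 values, as you would get from
    an 8-bit grayscale mask PNG thresholded at any nonzero value.
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention: two empty masks are a perfect match.
    return inter / union if union else 1.0


# Toy 4x4 masks: the prediction hits 2 of the 3 ground-truth pixels
# and adds 1 false-positive pixel -> IoU = 2 / 4 = 0.5.
gt = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
iou = mask_iou(pred, gt)
print(iou)  # 0.5

# A prediction is counted at each evaluation threshold it clears:
for t in (0.0, 0.25, 0.5, 0.75):
    print(f"IoU >= {t:.2f}: {iou >= t}")
```

At the 0%, 25%, and 50% thresholds this toy prediction counts as a match; at 75% it does not, which is the mechanism behind the per-threshold breakdown in `metrics_detailed.json`.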