# SAM3 Testing Guide

## Overview

This guide covers two testing approaches for SAM3:

1. **Basic Inference Testing** - Quick API validation with sample images
2. **Metrics Evaluation** - Comprehensive performance analysis against CVAT ground truth

---

## 1. Basic Inference Testing

### Purpose

Quickly validate that the SAM3 endpoint is working and producing reasonable segmentation results.

### Test Infrastructure

The basic testing framework:
- Tests multiple images automatically
- Saves detailed JSON logs of requests and responses
- Generates visualizations with semi-transparent colored masks
- Stores all results in `.cache/test/inference/{image_name}/`

### Running Basic Tests

```bash
python3 scripts/test/test_inference_comprehensive.py
```
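The exact request schema used by the test script is not shown here; as an illustrative sketch (the `"image"` and `"classes"` field names, and the base64 encoding of the image, are assumptions rather than the confirmed API contract), a payload might be assembled like this:

```python
import base64


def build_request(image_path, classes):
    """Assemble a JSON-serializable payload for the SAM3 endpoint.

    The field names ("image", "classes") are assumptions for
    illustration; check the test script for the real schema.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {"image": image_b64, "classes": classes}
```

The payload can then be POSTed to the endpoint with any HTTP client.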

### Test Output Structure

For each test image, files are generated in `.cache/test/inference/{image_name}/`:

- `request.json` - Request metadata (timestamp, endpoint, classes)
- `response.json` - Response metadata (timestamp, status, results summary)
- `full_results.json` - Complete API response including base64 masks
- `original.jpg` - Original test image
- `visualization.png` - Original image with colored mask overlay
- `legend.png` - Legend showing class colors and coverage percentages
- `mask_{ClassName}.png` - Individual binary masks for each class
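Since `full_results.json` embeds masks as base64 strings, they need decoding before inspection. A minimal sketch, assuming a top-level `"masks"` mapping of class name to base64 string (the real key layout may differ):

```python
import base64
import json


def extract_masks(results_path):
    """Decode base64-encoded masks from a full_results.json file.

    Assumes a top-level "masks" mapping of class name -> base64
    string; adjust the key to match the actual response layout.
    """
    with open(results_path) as f:
        results = json.load(f)
    # Each value decodes to the raw PNG bytes of one class mask.
    return {cls: base64.b64decode(b64)
            for cls, b64 in results.get("masks", {}).items()}
```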

### Tested Classes

The endpoint is tested with these semantic classes:
- **Pothole** (Red overlay)
- **Road crack** (Yellow overlay)
- **Road** (Blue overlay)
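The semi-transparent overlays described above reduce to a per-pixel alpha blend. This sketch assumes masks are available as boolean NumPy arrays and guesses a 0.5 blend factor (the script's actual alpha may differ):

```python
import numpy as np

# Colors mirror the legend above; the alpha default is an assumption.
CLASS_COLORS = {
    "Pothole": (255, 0, 0),      # red
    "Road crack": (255, 255, 0),  # yellow
    "Road": (0, 0, 255),          # blue
}


def overlay_mask(image, mask, color, alpha=0.5):
    """Blend a semi-transparent color onto masked pixels.

    image: HxWx3 uint8 array; mask: HxW boolean array.
    """
    out = image.astype(np.float32)
    out[mask] = (1.0 - alpha) * out[mask] + alpha * np.asarray(color, np.float32)
    return out.astype(np.uint8)
```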

### Recent Test Results

**Last run**: November 23, 2025

- **Total images**: 8
- **Successful**: 8/8 (100%)
- **Failed**: 0
- **Average response time**: ~1.5 seconds per image
- **Status**: All API calls returning HTTP 200 with valid masks

Test images include:
- `pothole_pexels_01.jpg`, `pothole_pexels_02.jpg`
- `road_damage_01.jpg`
- `road_pexels_01.jpg`, `road_pexels_02.jpg`, `road_pexels_03.jpg`
- `road_unsplash_01.jpg`
- `test.jpg`

Results are stored in `.cache/test/inference/summary.json`.

### Adding More Test Images

Test images should be placed in `assets/test_images/`. To expand the test suite:

1. **Download from Public Datasets**:
   - [Pothole Detection Dataset](https://github.com/jaygala24/pothole-detection/releases/download/v1.0.0/Pothole.Dataset.IVCNZ.zip) (1,243 images)
   - [RDD2022 Dataset](https://github.com/sekilab/RoadDamageDetector) (47,420 images from 6 countries)
   - [Roboflow Pothole Dataset](https://public.roboflow.com/object-detection/pothole/)

2. **Extract Sample Images**: Select diverse examples showing potholes, cracks, and clean roads

3. **Place in Test Directory**: Copy to `assets/test_images/`

---

## 2. Metrics Evaluation System

### Purpose

Comprehensive quantitative evaluation of SAM3 performance against ground truth annotations from CVAT.

### What It Measures

- **mAP (mean Average Precision)**: Detection accuracy across all confidence thresholds
- **mAR (mean Average Recall)**: Coverage of ground truth instances
- **IoU metrics**: Intersection over Union at multiple thresholds (0%, 25%, 50%, 75%)
- **Confusion matrices**: Class prediction accuracy patterns
- **Per-class statistics**: Precision, recall, F1-score for each damage type
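For reference, the IoU at the heart of these metrics reduces to a few lines on a pair of binary masks (a sketch, not the evaluation code itself):

```python
import numpy as np


def mask_iou(pred, gt):
    """Intersection over Union for two boolean masks of equal shape.

    Returns 0.0 when both masks are empty; some evaluators use 1.0
    for that case, so match the convention of the metrics code.
    """
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / float(union) if union else 0.0
```

Thresholding this value at 0.25, 0.5, or 0.75 is what classifies a prediction as a true positive at each IoU level.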

### Running Metrics Evaluation

```bash
cd metrics_evaluation
python run_evaluation.py
```

**Options**:
```bash
# Force re-download from CVAT (ignore cache)
python run_evaluation.py --force-download

# Force re-run inference (ignore cached predictions)
python run_evaluation.py --force-inference

# Skip inference step (use existing predictions)
python run_evaluation.py --skip-inference

# Generate visual comparisons
python run_evaluation.py --visualize
```

### Dataset

Evaluates on **150 annotated images** from CVAT:
- **50 images** with "Fissure" (road cracks)
- **50 images** with "Nid de poule" (potholes)
- **50 images** with "Road" (road surface)

Source: Logiroad CVAT organization, AI training project

### Output Structure

```
.cache/test/metrics/
β”œβ”€β”€ Fissure/
β”‚   └── {image_name}/
β”‚       β”œβ”€β”€ image.jpg
β”‚       β”œβ”€β”€ ground_truth/
β”‚       β”‚   β”œβ”€β”€ mask_Fissure_0.png
β”‚       β”‚   └── metadata.json
β”‚       └── inference/
β”‚           β”œβ”€β”€ mask_Fissure_0.png
β”‚           └── metadata.json
β”œβ”€β”€ Nid de poule/
β”œβ”€β”€ Road/
β”œβ”€β”€ metrics_summary.txt        # Human-readable results
β”œβ”€β”€ metrics_detailed.json      # Complete metrics data
└── evaluation_log.txt         # Execution trace
```

### Execution Time

- Image download: ~5-10 minutes (150 images)
- SAM3 inference: ~5-10 minutes (~2s per image)
- Metrics computation: ~1 minute
- **Total**: ~15-20 minutes for full evaluation

### Configuration

Edit `metrics_evaluation/config/config.json` to:
- Change CVAT project or organization
- Adjust number of images per class
- Modify IoU thresholds
- Update SAM3 endpoint URL

CVAT credentials must be in `.env` at project root.
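The schema of `config.json` is not reproduced here; purely as an illustrative sketch of the tunables listed above (every key name and value below is an assumption, not the file's actual layout):

```json
{
  "cvat": {
    "organization": "Logiroad",
    "project": "AI training"
  },
  "images_per_class": 50,
  "iou_thresholds": [0.0, 0.25, 0.5, 0.75],
  "sam3_endpoint": "https://example.com/sam3/infer"
}
```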

---

## Cache Directory

All test results are stored in `.cache/` (git-ignored):
- Review results without cluttering the repository
- Compare results across different test runs
- Debug segmentation quality issues
- Resume interrupted evaluations

---

## Quality Validation Checklist

Before accepting test results:

**Basic Tests**:
- [ ] All test images processed successfully
- [ ] Masks generated for all requested classes
- [ ] Response times reasonable (< 3s per image)
- [ ] Visualizations show plausible segmentations

**Metrics Evaluation**:
- [ ] 150 images downloaded from CVAT
- [ ] Ground truth masks not empty
- [ ] SAM3 inference completed for all images
- [ ] Metrics within reasonable ranges (0-100%)
- [ ] Confusion matrices show sensible patterns
- [ ] Per-class F1 scores above baseline

---

## Troubleshooting

### Basic Inference Issues

**Endpoint not responding**:
- Check endpoint URL in test script
- Verify endpoint is running (use `curl` or browser)
- Check network connectivity

**Empty or invalid masks**:
- Confirm that class names match the model's expectations
- Check image format (should be JPEG/PNG)
- Verify base64 encoding/decoding

### Metrics Evaluation Issues

**CVAT connection fails**:
- Check `.env` credentials
- Verify CVAT organization name
- Test CVAT web access

**No images found**:
- Check project filter in `config.json`
- Verify labels exist in CVAT
- Ensure images have annotations

**Metrics seem incorrect**:
- Inspect confusion matrices
- Review sample visualizations
- Check ground truth quality in CVAT
- Verify mask format (PNG-L, 8-bit grayscale)
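The pixel format of a mask file can be checked without fully loading it by reading the PNG IHDR chunk, where bit depth 8 and color type 0 correspond to 8-bit grayscale (a stdlib-only sketch; for routine use, Pillow's `Image.open(path).mode == "L"` is simpler):

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_is_8bit_grayscale(path):
    """Check the PNG IHDR for 8-bit depth and grayscale color type (0)."""
    with open(path, "rb") as f:
        if f.read(8) != PNG_SIGNATURE:
            return False
        length, chunk_type = struct.unpack(">I4s", f.read(8))
        if chunk_type != b"IHDR":
            return False
        # IHDR data: width, height, bit depth, color type (then 3 more bytes).
        width, height, bit_depth, color_type = struct.unpack(">IIBB", f.read(10))
        return bit_depth == 8 and color_type == 0
```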

---

## Next Steps

1. **Run basic tests** to validate API connectivity
2. **Review visualizations** to assess segmentation quality
3. **Run metrics evaluation** for quantitative performance
4. **Analyze confusion matrices** to identify systematic errors
5. **Iterate on model/prompts** based on metrics feedback

For detailed metrics evaluation documentation, see `metrics_evaluation/README.md`.