# ML Model Guide
Everything about the three machine-learning models powering Hasarİ: performance numbers, when each one runs, known failure modes, and how to retrain.
> Target audience: ML engineers and technical operators. End-user model intuition lives in [USER_GUIDE_TR.md](USER_GUIDE_TR.md#6-sonuçları-anlama).
---
## Pipeline overview
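For every uploaded image, three models run: damage and parts segmentation are independent of each other and can run **in parallel**, while the severity classifier runs on each damage crop once the damage masks are available. A deterministic post-processor then stitches the outputs into a part-centric JSON: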
```
                   ┌──────────────────────────────┐
                   │  Damage YOLO11m-seg          │
                   │  6 classes, segmentation     │
                   │  (where is the damage?)      │
                   └──────────┬───────────────────┘
                              │
┌──────────────────────┐      │                         ┌──────────────────────┐
│  Parts YOLO11s-seg   │      │                         │  For each damage     │
│  21 classes, seg     │      │    IoU(damage, part)    │  crop:               │
│  (which part?)       ├──────┼────────────────────────►│  Severity classifier │
└──────────────────────┘      │                         │  YOLO11n-cls 3 cls   │
                              │                         │  (hafif/orta/agir)   │
                              ▼                         └──────────┬───────────┘
                   ┌──────────────────────────────┐                │
                   │  Match every damage to its   │                │
                   │  best-overlapping part       │◄───────────────┘
                   │  → "front_bumper has a dent" │
                   └──────────┬───────────────────┘
                              │
                              ▼
                   ┌──────────────────────────────┐
                   │  Cost engine (lookup table + │
                   │  part × damage × severity)   │
                   │  → ₺ range per part          │
                   └──────────┬───────────────────┘
                              │
                              ▼
                   ┌──────────────────────────────┐
                   │  Aggregate to summary        │
                   │  (totals, recommendation)    │
                   └──────────────────────────────┘
```
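The matching step in the diagram can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `services/ml/pipeline.py`: masks are represented as sets of `(row, col)` pixels, the names `mask_iou`, `match_damage_to_part`, and `rect` are hypothetical, and the 0.15 cut-off mirrors the `iou_match_threshold` default.

```python
from typing import Dict, Optional, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)


def mask_iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_damage_to_part(
    damage: Set[Pixel],
    parts: Dict[str, Set[Pixel]],
    iou_threshold: float = 0.15,
) -> Optional[str]:
    """Return the best-overlapping part name, or None if no part reaches the threshold."""
    best_name, best_iou = None, 0.0
    for name, mask in parts.items():
        iou = mask_iou(damage, mask)
        if iou > best_iou:
            best_name, best_iou = name, iou
    return best_name if best_iou >= iou_threshold else None


def rect(r0: int, r1: int, c0: int, c1: int) -> Set[Pixel]:
    """Helper for the example: a filled rectangle as a pixel set."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


parts = {"front_bumper": rect(0, 10, 0, 10), "hood": rect(10, 20, 0, 10)}
dent = rect(2, 8, 2, 8)  # 36 pixels, fully inside the 100-pixel front_bumper mask
print(match_damage_to_part(dent, parts))  # front_bumper (IoU = 36/100)
```

Because a damage mask is usually much smaller than the part it sits on, the IoU stays low even for a perfect placement, which is consistent with a threshold as permissive as 0.15.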
End-to-end latency on a single 1920×1080 image, RTX 5050 8GB (Blackwell, sm_120, cu128):
| Stage | Time |
|---|---|
| Image decode + preprocess | ~10 ms |
| Damage YOLO11m-seg | ~45 ms |
| Parts YOLO11s-seg | ~30 ms |
| Severity (per damage crop, avg 3 damages) | ~36 ms total |
| IoU matching + post-processing | ~5 ms |
| Cost engine | ~1 ms |
| **Single-image total** | **~125 ms** |
A typical 4-photo inspection runs sequentially on the worker and finishes in **5–8 seconds** including S3 round trips.
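As a quick sanity check on the table, the stage timings can be added up. The sketch below assumes the stages run back to back, which is how the ~125 ms total appears to be computed; the variable names are illustrative only.

```python
# Stage timings in milliseconds, copied from the latency table.
STAGE_MS = {
    "decode_preprocess": 10,
    "damage_seg": 45,
    "parts_seg": 30,
    "severity_total": 36,  # total for an average of 3 damage crops
    "iou_postprocess": 5,
    "cost_engine": 1,
}
AVG_DAMAGES_PER_IMAGE = 3

single_image_ms = sum(STAGE_MS.values())
severity_per_crop_ms = STAGE_MS["severity_total"] / AVG_DAMAGES_PER_IMAGE
four_photo_compute_ms = 4 * single_image_ms

print(single_image_ms)        # 127
print(severity_per_crop_ms)   # 12.0
print(four_photo_compute_ms)  # 508
```

Four photos add up to roughly half a second of model time on this GPU, which suggests that most of the 5–8 second inspection time is spent outside the models, for example on S3 round trips.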
---
## Model 1: Damage segmentation (YOLO11m-seg)
**What it does**: pixel-level segmentation of damage regions on the car body. Outputs a mask + class label + bounding box + confidence for each damage instance.
### Specs
| Item | Value |
|---|---|
| **Architecture** | YOLO11m-seg (Ultralytics) |
| **Input size** | 640×640 |
| **Parameters** | ~22M |
| **Classes (6)** | `dent`, `scratch`, `crack`, `glass_shatter`, `lamp_broken`, `tire_flat` |
| **Dataset** | CarDD (academic, non-commercial), ~4 000 labeled images |
| **Epochs trained** | 120 |
| **Optimizer** | SGD with default Ultralytics schedule |
| **Augmentation** | Mosaic + HSV + flip (Ultralytics defaults) |
| **Weights file** | `services/ml/yolo11m-seg.pt` |
### Performance (validation set)
| Metric | Value |
|---|---|
| **mAP50-95 (box)** | 0.491 |
| **mAP50 (box)** | 0.671 |
| **mAP50-95 (mask)** | **0.509** |
| **mAP50 (mask)** | **0.683** |
| **Precision (mask, IoU ≥ 0.5)** | 0.71 |
| **Recall (mask, IoU ≥ 0.5)** | 0.67 |
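The table does not report an F1 score, but one can be derived from the precision and recall rows. The snippet below is a small worked calculation; the resulting figure is computed here, not taken from the training logs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


mask_f1 = f1_score(0.71, 0.67)
print(round(mask_f1, 3))  # 0.689
```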
### Per-class behavior
| Class | mAP50 | Notes |
|---|---|---|
| `dent` | 0.74 | Strongest. Lots of training data, distinctive shape. |
| `scratch` | 0.69 | Mostly good, occasionally confused with cosmetic dirt. |
| `crack` | 0.61 | Plastic crack vs. paint crack ambiguity; thin cracks under-recalled. |
| `glass_shatter` | 0.78 | Very strong; the shatter pattern is distinctive. |
| `lamp_broken` | 0.65 | Good when lamp lens is shattered; missed when only a fine crack. |
| `tire_flat` | 0.42 | **Weakest**: only ~80 training instances; needs more data (v0.2). |
### Known failure modes
- **Cosmetic dirt / mud** on bumpers occasionally classified as scratch. Mitigation: instruct users to clean the vehicle (USER_GUIDE rule).
- **Reflective glare** on glass produces phantom `glass_shatter` detections. A confidence threshold ≥ 0.55 reduces this; tune per deployment.
- **Tire flatness** is rarely detected. The class is included for completeness but should be considered advisory only until the v0.2 dataset boost.
- **Wet surfaces** produce reflections that resemble cracks. Same mitigation as glare.
---
## Model 2: Parts segmentation (YOLO11s-seg)
**What it does**: pixel-level segmentation of vehicle body parts. Tells us *which* part each damage sits on.
### Specs
| Item | Value |
|---|---|
| **Architecture** | YOLO11s-seg |
| **Input size** | 640×640 |
| **Parameters** | ~10M |
| **Classes (21)** | `back_bumper`, `front_bumper`, `back_door`, `front_left_door`, `front_right_door`, `back_left_door`, `back_right_door`, `back_glass`, `front_glass`, `back_light`, `front_light`, `back_left_light`, `back_right_light`, `front_left_light`, `front_right_light`, `hood`, `trunk`, `tailgate`, `left_mirror`, `right_mirror`, `wheel` |
| **Dataset** | Combined: Roboflow car-parts + supplementary CarPartsDB scrape, ~6 000 images |
| **Epochs trained** | ~50 |
| **Weights file** | `services/ml/yolo11s-seg.pt` |
### Performance (validation set)
| Metric | Value |
|---|---|
| **mAP50 (mask)** | **~0.72** |
| **mAP50-95 (mask)** | ~0.55 |
### Known failure modes
- **Left/right confusion** on doors and headlights when the vehicle is photographed from the rear quarter. The system fuses left/right calls using image orientation heuristics, but it's still a known weak spot.
- **Mirror miss** on small images: the mirror class has ~3% of bounding-box area on average and is sometimes missed on low-resolution input.
- **Trunk vs. tailgate** ambiguity on hatchbacks: both classes can fire on the same region. Post-processor picks the higher-confidence one.
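The trunk/tailgate rule can be sketched as a small post-processing pass. This is an illustration under assumptions, not the project's post-processor: `PartDetection` and `resolve_trunk_tailgate` are hypothetical names, masks are pixel sets, and "same region" is approximated here as a mask IoU of at least 0.5.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class PartDetection:
    label: str
    confidence: float
    mask: FrozenSet[Tuple[int, int]]  # (row, col) pixels


AMBIGUOUS = {"trunk", "tailgate"}


def _iou(a: FrozenSet, b: FrozenSet) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def resolve_trunk_tailgate(
    dets: List[PartDetection], overlap: float = 0.5
) -> List[PartDetection]:
    """Drop the lower-confidence detection when a trunk and a tailgate overlap."""
    dropped = set()
    for i, a in enumerate(dets):
        for j in range(i + 1, len(dets)):
            b = dets[j]
            if {a.label, b.label} != AMBIGUOUS:
                continue  # only trunk-vs-tailgate pairs are in scope
            if _iou(a.mask, b.mask) >= overlap:
                dropped.add(j if a.confidence >= b.confidence else i)
    return [d for k, d in enumerate(dets) if k not in dropped]


region = frozenset((r, c) for r in range(10) for c in range(10))
dets = [
    PartDetection("trunk", 0.62, region),
    PartDetection("tailgate", 0.81, region),
    PartDetection("back_bumper", 0.90, frozenset({(20, 20)})),
]
print([d.label for d in resolve_trunk_tailgate(dets)])  # ['tailgate', 'back_bumper']
```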
---
## Model 3: Severity classifier (YOLO11n-cls)
**What it does**: given a tight crop of a single damage region, classify its severity as `hafif` (minor), `orta` (moderate), or `agir` (severe).
### Specs
| Item | Value |
|---|---|
| **Architecture** | YOLO11n-cls |
| **Input size** | 224×224 |
| **Parameters** | ~2.6M |
| **Classes (3)** | `hafif`, `orta`, `agir` |
| **Dataset** | Roboflow Severity dataset, ~1 800 labeled crops |
| **Epochs trained** | 30 |
| **Weights file** | `services/ml/yolo11n-cls.pt` |
### Performance (validation set)
| Metric | Value |
|---|---|
| **Top-1 accuracy** | **0.742** |
| **Macro F1** | 0.71 |
| **Confusion** | Mostly `orta` ↔ `agir`; `hafif` is well-separated. |
### Known failure modes
- **Overfitting tendency**: small dataset means the model is slightly biased toward `orta`. Val accuracy plateaued at ~0.74; adding more `agir` examples is a v0.2 priority.
- **Crop quality dependency**: if the damage YOLO produces a tight, well-centered crop, classification is reliable. Loose or off-center crops degrade accuracy by ~10%.
- **Glass shatter severity** is currently always classified as `orta` or `agir` because there's no "minor glass shatter" in the training distribution. Acceptable for v0.1.
---
## Cost engine
Not an ML model, but a **lookup-table-driven** function:
```
cost(part, damage_type, severity) → (min_tl, max_tl)
```
The table lives at `services/ml/cost_table.yaml` and is calibrated to local Türkiye OEM + aftermarket prices (March 2026). Example entries:
```yaml
front_bumper:
  dent:
    hafif: [400, 1200]
    orta: [2500, 5500]
    agir: [6000, 12000]
  scratch:
    hafif: [200, 600]
    orta: [800, 2000]
    agir: [2500, 4500]
```
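A minimal sketch of the lookup, assuming the table has been loaded into nested dictionaries. The values are inlined from the excerpt above so the example has no dependencies; `COST_TABLE`, `cost`, and `total_range` are illustrative names, and summing the per-finding ranges is only one plausible way to aggregate.

```python
from typing import Dict, List, Tuple

CostRange = Tuple[int, int]  # (min_tl, max_tl)

# Same values as the YAML excerpt, inlined for the sketch.
COST_TABLE: Dict[str, Dict[str, Dict[str, CostRange]]] = {
    "front_bumper": {
        "dent": {"hafif": (400, 1200), "orta": (2500, 5500), "agir": (6000, 12000)},
        "scratch": {"hafif": (200, 600), "orta": (800, 2000), "agir": (2500, 4500)},
    },
}


def cost(part: str, damage_type: str, severity: str) -> CostRange:
    """Look up (min_tl, max_tl); raises KeyError if the combination is not in the table."""
    return COST_TABLE[part][damage_type][severity]


def total_range(findings: List[Tuple[str, str, str]]) -> CostRange:
    """Sum the per-finding ranges into one overall range (an assumed aggregation rule)."""
    ranges = [cost(*f) for f in findings]
    return (sum(lo for lo, _ in ranges), sum(hi for _, hi in ranges))


print(cost("front_bumper", "dent", "orta"))  # (2500, 5500)
print(total_range([
    ("front_bumper", "dent", "orta"),
    ("front_bumper", "scratch", "hafif"),
]))  # (2700, 6100)
```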
**Why not ML for cost?** Insufficient labeled price data (you need real repair invoices) and the lookup table is more debuggable for pilot use. v0.2 plans an ML regression head once the pilot accumulates ~500 verified inspections with actual repair costs.
---
## Inference configuration
Default thresholds in `services/ml/pipeline.py`:
| Hyperparameter | Default | When to tune |
|---|---|---|
| `damage_conf_threshold` | 0.55 | Lower → more sensitive, more false positives. Raise to 0.65 in noisy environments. |
| `parts_conf_threshold` | 0.5 | Parts model is more reliable; rarely needs tuning. |
| `iou_match_threshold` | 0.15 | How much a damage mask must overlap a part to be assigned to it. Lower = more aggressive matching. |
| `severity_min_crop_size` | 32×32 px | Smaller crops degrade severity accuracy; below this we skip severity and label `bilinmiyor`. |
| `max_damages_per_image` | 25 | Hard cap to prevent runaway false positives. |
Per-request threshold overrides passed in the API call are a planned feature and are not yet exposed in v0.1.
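How the defaults in the table interact can be sketched with a small config object. This is a hedged illustration rather than the contents of `services/ml/pipeline.py`: the field names follow the table, while `PipelineConfig`, `Damage`, `filter_damages`, and `can_classify_severity` are hypothetical, and keeping the highest-confidence detections under the cap is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PipelineConfig:
    damage_conf_threshold: float = 0.55
    parts_conf_threshold: float = 0.5
    iou_match_threshold: float = 0.15
    severity_min_crop_size: int = 32  # pixels per side
    max_damages_per_image: int = 25


@dataclass(frozen=True)
class Damage:
    label: str
    confidence: float
    box: Tuple[int, int, int, int]  # x1, y1, x2, y2 in pixels


def filter_damages(dets: List[Damage], cfg: PipelineConfig) -> List[Damage]:
    """Keep detections at or above the threshold, best first, capped at the maximum."""
    kept = [d for d in dets if d.confidence >= cfg.damage_conf_threshold]
    kept.sort(key=lambda d: d.confidence, reverse=True)
    return kept[: cfg.max_damages_per_image]


def can_classify_severity(d: Damage, cfg: PipelineConfig) -> bool:
    """False means the crop is too small and severity is reported as 'bilinmiyor'."""
    x1, y1, x2, y2 = d.box
    return min(x2 - x1, y2 - y1) >= cfg.severity_min_crop_size


cfg = PipelineConfig()
dets = [
    Damage("dent", 0.80, (100, 100, 220, 190)),
    Damage("scratch", 0.40, (10, 10, 200, 40)),    # below the confidence threshold
    Damage("crack", 0.60, (300, 300, 320, 380)),   # only 20 px wide
]
for d in filter_damages(dets, cfg):
    print(d.label, can_classify_severity(d, cfg))
# dent True
# crack False
```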
---
## Retraining
### Quick: incremental data, same architecture
For weekly fine-tuning runs on top of the existing checkpoint:
```powershell
cd services\ml
# Damage model: 30 more epochs on top of the v0.1 weights
python train.py --resume yolo11m-seg.pt --data cardd.yaml --epochs 30 --batch 8 --device 0
# Parts model
python train_parts.py --resume yolo11s-seg.pt --data parts.yaml --epochs 20 --batch 16 --device 0
# Severity classifier
python train_severity.py --resume yolo11n-cls.pt --data data/severity --epochs 15 --batch 32 --device 0
```
### Full: clean retrain from pre-trained YOLO11
For a major version bump (v0.2 → v0.3):
```powershell
cd services\ml
python train_all.py --full --device 0
```
`train_all.py --full` runs all three trainings sequentially and logs to `services/ml/runs/` and `services/ml/logs/`. It will:
1. Download pre-trained YOLO11 base weights if missing.
2. Train damage model for 120 epochs.
3. Train parts model for 50 epochs.
4. Train severity classifier for 30 epochs.
5. Run the regression test suite (`tools/regression_test.py`) and write a comparison report against the previous deployment.
**Wall-clock time** on RTX 5050 8GB: ~14 hours for the full run.
### Dataset refresh
Before retraining, refresh datasets:
```powershell
# CarDD: re-download if upstream HuggingFace mirror updated
python scripts\download_data.py --cardd-hf --force
# Roboflow severity: set API key first
$env:ROBOFLOW_API_KEY = "..."
python scripts\download_data.py --roboflow-severity --force
# Pilot in-the-wild data (if you've collected labeled images from pilot users)
python scripts\merge_pilot_data.py --in pilot_inspections.csv --out data/pilot/
python scripts\verify_data.py --datasets cardd pilot
```
### Validating a new checkpoint
Always run the regression suite before promoting:
```powershell
python tools\regression_test.py `
--baseline services\ml\runs\v0.1\weights\best.pt `
--candidate services\ml\runs\v0.2\weights\best.pt `
--fixtures tools\fixtures\regression\
```
The regression suite scores both models on 200 hand-curated images and fails the build if any of these regresses by >2%:
- mAP50 (mask) per class
- IoU matching accuracy (does each damage land on the right part?)
- Total cost variance (is the new model producing drastically different cost ranges?)
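The pass/fail rule can be sketched as a simple gate. This is not the logic of `tools/regression_test.py`: it assumes higher-is-better metrics and reads ">2%" as a relative drop, both of which are assumptions, and the candidate scores below are made up for illustration.

```python
from typing import Dict, List


def regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return metrics whose candidate value fell more than `tolerance` (relative) below baseline."""
    failed = []
    for name, base in baseline.items():
        cand = candidate[name]
        if base > 0 and (base - cand) / base > tolerance:
            failed.append(name)
    return failed


# Illustrative numbers only, not results from the fixture set.
baseline = {"mAP50/dent": 0.74, "mAP50/tire_flat": 0.42}
candidate = {"mAP50/dent": 0.73, "mAP50/tire_flat": 0.40}

failed = regressions(baseline, candidate)
print(failed)  # ['mAP50/tire_flat']
```

A metric such as cost variance, where a large change in either direction matters, would need a two-sided check instead of this one-sided comparison.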
### Promoting weights to production
1. Copy the new `.pt` files to a versioned S3 location:
   ```bash
   aws s3 cp services/ml/runs/v0.2/weights/best.pt s3://hasari-models/v0.2/yolo11m-seg.pt
   ```
2. Update `ML_MODEL_VERSION=v0.2` env var on the Render API service.
3. The backend reads `ML_MODEL_VERSION` at startup and downloads the matching weights from S3.
4. Smoke-test on staging before pointing production at the new version.
5. Keep the previous version (`v0.1`) on S3 for instant rollback.
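The version switch in steps 2 and 3 can be pictured as resolving S3 locations from the environment variable. The sketch below is an assumption about how that might look, not the backend's code: `weight_uris` is a hypothetical helper, the bucket layout follows the `aws s3 cp` example, and the file names for the parts and severity models are assumed to mirror the local weight files.

```python
import os
from typing import Dict, Mapping

BUCKET = "hasari-models"  # bucket name from the `aws s3 cp` example

# Assumed object names; only the damage model's name appears in the example above.
WEIGHT_FILES = {
    "damage": "yolo11m-seg.pt",
    "parts": "yolo11s-seg.pt",
    "severity": "yolo11n-cls.pt",
}


def weight_uris(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the S3 URI of each model's weights from ML_MODEL_VERSION."""
    version = env.get("ML_MODEL_VERSION")
    if not version:
        raise RuntimeError("ML_MODEL_VERSION is not set")
    return {
        name: f"s3://{BUCKET}/{version}/{fname}"
        for name, fname in WEIGHT_FILES.items()
    }


print(weight_uris({"ML_MODEL_VERSION": "v0.2"})["damage"])
# s3://hasari-models/v0.2/yolo11m-seg.pt
```

With a layout like this, rolling back amounts to setting `ML_MODEL_VERSION` back to `v0.1` and restarting the service.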
### Export for on-device (mobile, v0.2 backlog)
```powershell
cd tools
python export.py --model yolo11n-seg.pt --format tflite --output models/damage_yolo11n.tflite
python export.py --model yolo11n-seg.pt --format coreml --output models/damage_yolo11n.mlpackage
```
Output models are quantized to int8 by default: ~3 MB, running at ~80 ms on the iPhone 13 Neural Engine.
---
## Hardware requirements
### Training (full pipeline)
- **GPU**: NVIDIA, ≥8 GB VRAM (Blackwell architecture or newer recommended for sm_120 features)
- **CUDA**: 12.8+
- **PyTorch**: 2.7+ with `cu128` wheels (see `services/ml/setup.ps1` / `setup.sh`; Blackwell support is non-trivial)
- **RAM**: 32 GB
- **CPU**: ≥8 cores (for data loader workers)
- **Disk**: 50 GB free (datasets + checkpoints)
### Inference
- **GPU (preferred)**: 4 GB VRAM minimum
- **CPU-only (acceptable)**: any modern x86_64; ~5–10× slower than GPU. Used in the Render-hosted pilot until GPU host is provisioned.
---
## Telemetry: what we measure in production
Every inference logs:
- Per-model wall time (`damage_ms`, `parts_ms`, `severity_ms`, `total_ms`)
- Per-image counts: detected damages, detected parts, matched/unmatched damages
- Confidence-score distributions (P50, P95) per class
- Image dimensions and file size
- Failure category if the inference errors out
These flow to Prometheus and are visible in the Grafana "ML Pipeline" dashboard (config in `observability/grafana/dashboards/ml-pipeline.json`).
Use this data to:
- Set alerts when P95 latency drifts upward (often a sign that the wrong weights were loaded)
- Identify class drift (a sudden drop in `dent` confidences usually means the input distribution shifted: new car models, new camera type)
- Schedule retraining when the false-positive rate, measured by a sampled human review, creeps above 5%.
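The percentile and false-positive checks above can be sketched with the standard library. This is an offline illustration, not the Prometheus or Grafana configuration: `p50_p95` and `needs_retraining` are hypothetical helpers, and the sample data is synthetic.

```python
import statistics
from typing import List, Tuple


def p50_p95(samples: List[float]) -> Tuple[float, float]:
    """Median and 95th percentile of a sample list (needs at least 2 values)."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[49], cuts[94]  # cut points 50 and 95 of 99


def needs_retraining(reviewed: int, false_positives: int, limit: float = 0.05) -> bool:
    """True when the false-positive rate in a human-reviewed sample exceeds the limit."""
    if reviewed == 0:
        return False
    return false_positives / reviewed > limit


total_ms = [float(x) for x in range(101)]  # synthetic latencies: 0, 1, ..., 100 ms
print(p50_p95(total_ms))          # (50.0, 95.0)
print(needs_retraining(200, 12))  # True  (6% > 5%)
print(needs_retraining(200, 10))  # False (exactly 5%)
```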
---
## Limitations & honest caveats
- **Limited vehicle coverage**: training data is heavily biased toward Western and Turkish-market cars. SUVs and pickups from non-Turkish markets may underperform.
- **Night / low-light**: no IR or HDR training data. Below ~100 lux the system degrades quickly. Recommend rejecting low-light photos in v0.2.
- **Multiple vehicles per image**: the pipeline assumes one car. If two cars are in frame, parts and damages from both are merged, so the output is unreliable. A planned pre-check will reject multi-vehicle images.
- **Severity ground truth is subjective**: even human raters disagree ~15% of the time on `orta` vs. `agir`. A 74% accuracy is close to inter-rater agreement on this dataset.
- **Cost calibration drifts with inflation / FX**: re-calibrate `cost_table.yaml` quarterly.
---
## Related docs
- [DATA.md](../DATA.md): dataset sources, licenses, train/val splits
- [ARCHITECTURE.md](../ARCHITECTURE.md): pipeline internals at code level
- [services/ml/setup.ps1](../services/ml/setup.ps1) / [setup.sh](../services/ml/setup.sh): ML environment bootstrap
- [tools/regression_test.py](../tools/regression_test.py): pre-promotion validation