
ML Model Guide

Everything about the three machine-learning models powering Hasarİ: performance numbers, when each one runs, known failure modes, and how to retrain.

Target audience: ML engineers and technical operators. End-user model intuition lives in USER_GUIDE_TR.md.


Pipeline overview

For every uploaded image, the damage and parts models run on the full frame, the severity classifier then runs on each damage crop, and a deterministic post-processor stitches the outputs into a part-centric JSON:

                       ┌──────────────────────────────┐
                       │  Damage YOLO11m-seg          │
                       │  6 classes, segmentation     │
                       │  (where is the damage?)      │
                       └──────────┬───────────────────┘
                                  │
   ┌──────────────────────┐       │                       ┌────────────────────┐
   │ Parts YOLO11s-seg    │       │                       │ For each damage    │
   │ 21 classes, seg      │       │  IoU(damage, part)    │ crop:              │
   │ (which part?)        ├───────┼───────────────────►   │ Severity classifier│
   └──────────────────────┘       │                       │ YOLO11n-cls 3 cls  │
                                  │                       │ (hafif/orta/agir)  │
                                  ▼                       └─────────┬──────────┘
                       ┌──────────────────────────────┐             │
                       │ Match every damage to its    │             │
                       │ best-overlapping part        │◄────────────┘
                       │ → "front_bumper has a dent"  │
                       └──────────┬───────────────────┘
                                  │
                                  ▼
                       ┌──────────────────────────────┐
                       │ Cost engine (lookup table +  │
                       │ part × damage × severity)    │
                       │ → ₺ range per part           │
                       └──────────┬───────────────────┘
                                  │
                                  ▼
                       ┌──────────────────────────────┐
                       │ Aggregate to summary         │
                       │ (totals, recommendation)     │
                       └──────────────────────────────┘
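
The "match every damage to its best-overlapping part" step can be sketched in a few lines. This is a minimal illustration and not the code in services/ml/pipeline.py: masks are modelled as plain sets of pixel coordinates to keep the example dependency-free, and the names `mask_iou`, `match_damage_to_part` and `rect` are hypothetical. Only the 0.15 default is taken from this guide (`iou_match_threshold`).

```python
from typing import Dict, Optional, Set, Tuple

Pixel = Tuple[int, int]   # (row, col)
Mask = Set[Pixel]         # all pixels that belong to one instance


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_damage_to_part(
    damage: Mask,
    parts: Dict[str, Mask],
    iou_threshold: float = 0.15,  # iou_match_threshold default from this guide
) -> Optional[str]:
    """Return the best-overlapping part name, or None if nothing clears the threshold."""
    best_name, best_iou = None, 0.0
    for name, part_mask in parts.items():
        iou = mask_iou(damage, part_mask)
        if iou > best_iou:
            best_name, best_iou = name, iou
    return best_name if best_iou >= iou_threshold else None


def rect(r0: int, c0: int, r1: int, c1: int) -> Mask:
    """Hypothetical helper: a filled rectangle as a pixel set (end-exclusive)."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


parts = {"front_bumper": rect(0, 0, 10, 10), "hood": rect(10, 0, 20, 10)}
dent = rect(2, 2, 8, 8)  # 36 px, entirely inside the 100 px bumper mask
print(match_damage_to_part(dent, parts))  # front_bumper
```

A damage that overlaps no part above the threshold comes back as `None`, which would correspond to the "unmatched damages" count mentioned under telemetry.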

End-to-end latency on a single 1920×1080 image, RTX 5050 8GB (Blackwell, sm_120, cu128):

| Stage | Time |
| --- | --- |
| Image decode + preprocess | ~10 ms |
| Damage YOLO11m-seg | ~45 ms |
| Parts YOLO11s-seg | ~30 ms |
| Severity (per damage crop, avg 3 damages) | ~36 ms total |
| IoU matching + post-processing | ~5 ms |
| Cost engine | ~1 ms |
| Single-image total | ~125 ms |

A typical 4-photo inspection runs sequentially on the worker and finishes in 5–8 seconds including S3 round trips.
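
As a rough budget check, the stage timings above can be turned into a per-image estimate. The sketch below assumes severity time scales linearly with the number of damage crops (~36 ms for 3 crops, so ~12 ms each). That scaling is an assumption, not a measured property, and `estimate_image_ms` is a hypothetical helper.

```python
# Stage timings in milliseconds, copied from the latency table in this guide.
STAGE_MS = {"decode": 10, "damage": 45, "parts": 30, "matching": 5, "cost": 1}

# Assumption: severity cost is linear in the number of crops (~36 ms / 3 crops).
SEVERITY_MS_PER_CROP = 36 / 3


def estimate_image_ms(num_damages: int) -> float:
    """Estimated single-image latency for a given number of damage crops."""
    return sum(STAGE_MS.values()) + SEVERITY_MS_PER_CROP * num_damages


print(estimate_image_ms(3))      # 127.0, in line with the ~125 ms total
print(4 * estimate_image_ms(3))  # 508.0 ms of model time for a 4-photo inspection
```

On these GPU figures, four photos account for roughly half a second, so most of the 5–8 second inspection time would come from elsewhere, such as the S3 round trips. CPU-only hosts will be slower.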


Model 1: Damage segmentation (YOLO11m-seg)

What it does: pixel-level segmentation of damage regions on the car body. Outputs a mask + class label + bounding box + confidence for each damage instance.

Specs

| Item | Value |
| --- | --- |
| Architecture | YOLO11m-seg (Ultralytics) |
| Input size | 640×640 |
| Parameters | ~22M |
| Classes (6) | dent, scratch, crack, glass_shatter, lamp_broken, tire_flat |
| Dataset | CarDD (academic, non-commercial), ~4,000 labeled images |
| Epochs trained | 120 |
| Optimizer | SGD with default Ultralytics schedule |
| Augmentation | Mosaic + HSV + flip (Ultralytics defaults) |
| Weights file | services/ml/yolo11m-seg.pt |

Performance (validation set)

| Metric | Value |
| --- | --- |
| mAP50-95 (box) | 0.491 |
| mAP50 (box) | 0.671 |
| mAP50-95 (mask) | 0.509 |
| mAP50 (mask) | 0.683 |
| Precision (mask, IoU ≥ 0.5) | 0.71 |
| Recall (mask, IoU ≥ 0.5) | 0.67 |

Per-class behavior

| Class | mAP50 | Notes |
| --- | --- | --- |
| dent | 0.74 | Strongest. Lots of training data, distinctive shape. |
| scratch | 0.69 | Mostly good, occasionally confused with cosmetic dirt. |
| crack | 0.61 | Plastic crack vs. paint crack ambiguity; thin cracks under-recalled. |
| glass_shatter | 0.78 | Very strong; the shatter pattern is distinctive. |
| lamp_broken | 0.65 | Good when the lamp lens is shattered; missed when there is only a fine crack. |
| tire_flat | 0.42 | Weakest. Only ~80 training instances; needs more data (v0.2). |

Known failure modes

  • Cosmetic dirt / mud on bumpers is occasionally classified as scratch. Mitigation: instruct users to clean the vehicle (USER_GUIDE rule).
  • Reflective glare on glass produces phantom glass_shatter detections. A confidence threshold ≥ 0.55 reduces this; tune per deployment.
  • Tire flatness is rarely detected. The class is included for completeness but should be considered advisory only until the v0.2 dataset boost.
  • Reflections on wet surfaces can look like cracks. Same mitigation as for glare.

Model 2: Parts segmentation (YOLO11s-seg)

What it does: pixel-level segmentation of vehicle body parts. Tells us which part each damage sits on.

Specs

| Item | Value |
| --- | --- |
| Architecture | YOLO11s-seg |
| Input size | 640×640 |
| Parameters | ~10M |
| Classes (21) | back_bumper, front_bumper, back_door, front_left_door, front_right_door, back_left_door, back_right_door, back_glass, front_glass, back_light, front_light, back_left_light, back_right_light, front_left_light, front_right_light, hood, trunk, tailgate, left_mirror, right_mirror, wheel |
| Dataset | Combined: Roboflow car-parts + supplementary CarPartsDB scrape, ~6,000 images |
| Epochs trained | ~50 |
| Weights file | services/ml/yolo11s-seg.pt |

Performance (validation set)

| Metric | Value |
| --- | --- |
| mAP50 (mask) | ~0.72 |
| mAP50-95 (mask) | ~0.55 |

Known failure modes

  • Left/right confusion on doors and headlights when the vehicle is photographed from the rear quarter. The system fuses left/right calls using image orientation heuristics, but this is still a known weak spot.
  • Mirror miss on small images: the mirror class has ~3% of bounding-box area on average and is sometimes missed on low-resolution input.
  • Trunk vs. tailgate ambiguity on hatchbacks: both classes can fire on the same region. The post-processor picks the higher-confidence one.
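
The trunk/tailgate tie-break can be illustrated as follows. This is a sketch under stated assumptions: it uses bounding boxes instead of masks for brevity, the 0.5 overlap cut-off is invented for the example, and `PartDetection`, `box_iou` and `resolve_trunk_tailgate` are hypothetical names, not the post-processor's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2


@dataclass
class PartDetection:
    label: str
    conf: float
    box: Box


AMBIGUOUS_PAIR = {"trunk", "tailgate"}


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def resolve_trunk_tailgate(
    dets: List[PartDetection], overlap: float = 0.5  # assumed cut-off
) -> List[PartDetection]:
    """Where a trunk and a tailgate cover the same region, keep the more confident one."""
    dropped = set()
    for i, a in enumerate(dets):
        for j in range(i + 1, len(dets)):
            b = dets[j]
            if {a.label, b.label} == AMBIGUOUS_PAIR and box_iou(a.box, b.box) >= overlap:
                dropped.add(j if a.conf >= b.conf else i)
    return [d for k, d in enumerate(dets) if k not in dropped]


dets = [
    PartDetection("trunk", 0.60, (100, 200, 500, 400)),
    PartDetection("tailgate", 0.80, (110, 210, 510, 410)),
    PartDetection("hood", 0.90, (100, 0, 500, 150)),
]
print([d.label for d in resolve_trunk_tailgate(dets)])  # ['tailgate', 'hood']
```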

Model 3: Severity classifier (YOLO11n-cls)

What it does: given a tight crop of a single damage region, classify its severity as hafif (minor), orta (moderate), or agir (severe).

Specs

| Item | Value |
| --- | --- |
| Architecture | YOLO11n-cls |
| Input size | 224×224 |
| Parameters | ~2.6M |
| Classes (3) | hafif, orta, agir |
| Dataset | Roboflow Severity dataset, ~1,800 labeled crops |
| Epochs trained | 30 |
| Weights file | services/ml/yolo11n-cls.pt |

Performance (validation set)

| Metric | Value |
| --- | --- |
| Top-1 accuracy | 0.742 |
| Macro F1 | 0.71 |
| Confusion | Mostly orta ↔ agir; hafif is well-separated. |
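
For readers who want to reproduce these metrics from a validation run, the sketch below shows how top-1 accuracy and macro F1 are derived from a confusion matrix. The matrix is a toy example with invented counts, not the model's real confusion matrix, and the function names are hypothetical.

```python
from typing import List

LABELS = ["hafif", "orta", "agir"]

# Toy confusion matrix (rows = true class, columns = predicted class).
# These counts are invented for illustration only.
TOY_CM = [
    [9, 1, 0],
    [1, 6, 3],
    [0, 2, 8],
]


def top1_accuracy(cm: List[List[int]]) -> float:
    """Share of samples on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def macro_f1(cm: List[List[int]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true k, predicted something else
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, true something else
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


print(round(top1_accuracy(TOY_CM), 3), round(macro_f1(TOY_CM), 3))
```

Macro F1 weights the three classes equally, so it penalises weakness on a minority class more than top-1 accuracy does.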

Known failure modes

  • Overfitting tendency: small dataset means the model is slightly biased toward orta. Val accuracy plateaued at ~0.74; adding more agir examples is a v0.2 priority.
  • Crop quality dependency: if the damage YOLO produces a tight, well-centered crop, classification is reliable. Loose or off-center crops degrade accuracy by ~10%.
  • Glass shatter severity is currently always classified as orta or agir because there's no "minor glass shatter" in the training distribution. Acceptable for v0.1.

Cost engine

Not an ML model, but a lookup-table-driven function:

cost(part, damage_type, severity) → (min_tl, max_tl)

The table lives at services/ml/cost_table.yaml and is calibrated to local Türkiye OEM + aftermarket prices (March 2026). Example entries:

front_bumper:
  dent:
    hafif: [400, 1200]
    orta:  [2500, 5500]
    agir:  [6000, 12000]
  scratch:
    hafif: [200, 600]
    orta:  [800, 2000]
    agir:  [2500, 4500]

Why not ML for cost? Insufficient labeled price data (you need real repair invoices) and the lookup table is more debuggable for pilot use. v0.2 plans an ML regression head once the pilot accumulates ~500 verified inspections with actual repair costs.
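
A minimal sketch of the lookup, using the example entries above inlined as a Python dict so it runs without a YAML parser. How the real engine handles missing entries and how it aggregates totals is not documented here; returning `None` and summing the range ends are assumptions, and `total_range` is a hypothetical helper.

```python
from typing import Iterable, Optional, Tuple

Range = Tuple[int, int]  # (min_tl, max_tl)

# Example entries copied from this guide; the real table is cost_table.yaml.
COST_TABLE = {
    "front_bumper": {
        "dent": {"hafif": (400, 1200), "orta": (2500, 5500), "agir": (6000, 12000)},
        "scratch": {"hafif": (200, 600), "orta": (800, 2000), "agir": (2500, 4500)},
    },
}


def cost(part: str, damage_type: str, severity: str) -> Optional[Range]:
    """Look up the (min_tl, max_tl) range; None if the combination is not in the table."""
    try:
        return COST_TABLE[part][damage_type][severity]
    except KeyError:
        return None  # assumption: unknown combinations are left unpriced


def total_range(findings: Iterable[Tuple[str, str, str]]) -> Range:
    """Assumed aggregation: add up the lower and upper bounds of every priced finding."""
    low = high = 0
    for part, damage_type, severity in findings:
        priced = cost(part, damage_type, severity)
        if priced is not None:
            low += priced[0]
            high += priced[1]
    return low, high


findings = [("front_bumper", "dent", "orta"), ("front_bumper", "scratch", "hafif")]
print(cost("front_bumper", "dent", "orta"))  # (2500, 5500)
print(total_range(findings))                 # (2700, 6100)
```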


Inference configuration

Default thresholds in services/ml/pipeline.py:

| Hyperparameter | Default | When to tune |
| --- | --- | --- |
| damage_conf_threshold | 0.55 | Lower → more sensitive, more false positives. Raise to 0.65 in noisy environments. |
| parts_conf_threshold | 0.5 | Parts model is more reliable; rarely needs tuning. |
| iou_match_threshold | 0.15 | How much a damage mask must overlap a part to be assigned to it. Lower = more aggressive matching. |
| severity_min_crop_size | 32×32 px | Smaller crops degrade severity accuracy; below this we skip severity and label the damage bilinmiyor (unknown). |
| max_damages_per_image | 25 | Hard cap to prevent runaway false positives. |

Per-request threshold overrides passed in the API call are a planned feature and are not yet exposed in v0.1.
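
To make the interplay of these defaults concrete, here is a hedged sketch of how they might be applied to raw damage detections. The field names mirror the table, but `PipelineConfig`, `Damage`, `select_damages` and `severity_label` are hypothetical and not the contents of services/ml/pipeline.py. Keeping the highest-confidence detections when the cap is hit, and checking both crop sides against the minimum size, are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class PipelineConfig:
    damage_conf_threshold: float = 0.55
    parts_conf_threshold: float = 0.5
    iou_match_threshold: float = 0.15
    severity_min_crop_size: int = 32   # pixels, applied to width and height
    max_damages_per_image: int = 25


@dataclass
class Damage:
    label: str
    conf: float
    box: Tuple[int, int, int, int]  # x1, y1, x2, y2 in pixels


def select_damages(dets: List[Damage], cfg: PipelineConfig) -> List[Damage]:
    """Drop low-confidence detections, then cap the count (most confident first)."""
    kept = sorted(
        (d for d in dets if d.conf >= cfg.damage_conf_threshold),
        key=lambda d: d.conf,
        reverse=True,
    )
    return kept[: cfg.max_damages_per_image]


def severity_label(d: Damage, cfg: PipelineConfig, classify: Callable[[Damage], str]) -> str:
    """Run the classifier only on crops that meet the minimum size."""
    x1, y1, x2, y2 = d.box
    side = cfg.severity_min_crop_size
    if (x2 - x1) < side or (y2 - y1) < side:
        return "bilinmiyor"
    return classify(d)


cfg = PipelineConfig()
dets = [
    Damage("dent", 0.81, (100, 100, 220, 190)),
    Damage("scratch", 0.40, (300, 120, 380, 140)),  # below damage_conf_threshold
    Damage("crack", 0.62, (50, 60, 70, 75)),        # crop smaller than 32x32
]
kept = select_damages(dets, cfg)
print([(d.label, severity_label(d, cfg, lambda _: "orta")) for d in kept])
# [('dent', 'orta'), ('crack', 'bilinmiyor')]
```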


Retraining

Quick: incremental data, same architecture

For weekly fine-tuning runs on top of the existing checkpoint:

cd services\ml
# Damage model: 30 more epochs on top of the v0.1 weights
python train.py --resume yolo11m-seg.pt --data cardd.yaml --epochs 30 --batch 8 --device 0

# Parts model
python train_parts.py --resume yolo11s-seg.pt --data parts.yaml --epochs 20 --batch 16 --device 0

# Severity classifier
python train_severity.py --resume yolo11n-cls.pt --data data/severity --epochs 15 --batch 32 --device 0

Full: clean retrain from pre-trained YOLO11

For a major version bump (v0.2 → v0.3):

cd services\ml
python train_all.py --full --device 0

train_all.py --full runs all three trainings sequentially and logs to services/ml/runs/ and services/ml/logs/. It will:

  1. Download pre-trained YOLO11 base weights if missing.
  2. Train damage model for 120 epochs.
  3. Train parts model for 50 epochs.
  4. Train severity classifier for 30 epochs.
  5. Run the regression test suite (tools/regression_test.py) and write a comparison report against the previous deployment.

Wall-clock time on RTX 5050 8GB: ~14 hours for the full run.

Dataset refresh

Before retraining, refresh datasets:

# CarDD: re-download if the upstream Hugging Face mirror was updated
python scripts\download_data.py --cardd-hf --force

# Roboflow severity: set the API key first
$env:ROBOFLOW_API_KEY = "..."
python scripts\download_data.py --roboflow-severity --force

# Pilot in-the-wild data (if you've collected labeled images from pilot users)
python scripts\merge_pilot_data.py --in pilot_inspections.csv --out data/pilot/
python scripts\verify_data.py --datasets cardd pilot

Validating a new checkpoint

Always run the regression suite before promoting:

python tools\regression_test.py `
  --baseline services\ml\runs\v0.1\weights\best.pt `
  --candidate services\ml\runs\v0.2\weights\best.pt `
  --fixtures tools\fixtures\regression\

The regression suite scores both models on 200 hand-curated images and fails the build if any of these regresses by >2%:

  • mAP50 (mask) per class
  • IoU matching accuracy (does each damage land on the right part?)
  • Total cost variance (is the new model producing drastically different cost ranges?)
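
The pass/fail rule can be sketched as below. The guide does not say whether ">2%" is a relative or an absolute drop; this example assumes a relative drop and only covers higher-is-better metrics (so not the cost-variance check). The function name, the metric keys and the candidate values are all hypothetical.

```python
from typing import Dict, List


def regressed_metrics(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,  # ">2%" from this guide, read as a relative drop (assumption)
) -> List[str]:
    """Names of higher-is-better metrics that dropped by more than the tolerance."""
    failed = []
    for name, base in baseline.items():
        cand = candidate.get(name)
        if cand is None:
            failed.append(name)  # a missing metric is treated as a failure
        elif base > 0 and (base - cand) / base > tolerance:
            failed.append(name)
    return failed


# Illustrative numbers only; the candidate values are invented.
baseline = {"mAP50_mask/dent": 0.74, "mAP50_mask/tire_flat": 0.42, "match_accuracy": 0.90}
candidate = {"mAP50_mask/dent": 0.73, "mAP50_mask/tire_flat": 0.40, "match_accuracy": 0.91}

failures = regressed_metrics(baseline, candidate)
print(failures)  # ['mAP50_mask/tire_flat']
```

Under the relative reading, low-scoring classes such as tire_flat trip the gate on smaller absolute changes than strong classes do, which may or may not be what the real suite intends.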

Promoting weights to production

  1. Copy the new .pt files to a versioned S3 location:
    aws s3 cp services/ml/runs/v0.2/weights/best.pt s3://hasari-models/v0.2/yolo11m-seg.pt
    
  2. Set the ML_MODEL_VERSION=v0.2 environment variable on the Render API service.
  3. The backend reads ML_MODEL_VERSION at startup and downloads the matching weights from S3.
  4. Smoke-test on staging before pointing production at the new version.
  5. Keep the previous version (v0.1) on S3 for instant rollback.
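
The version-to-weights mapping in steps 2 and 3 might look like the sketch below. Only the bucket name and the `yolo11m-seg.pt` key layout appear in this guide; that the other two weight files follow the same layout, the `v0.1` fallback, and the `weight_uris` helper itself are assumptions. The actual download call is left out.

```python
import os
from typing import Dict, Mapping

BUCKET = "hasari-models"  # from the aws s3 cp example in this guide
# Assumption: all three weight files share the <version>/<filename> layout.
WEIGHT_FILES = ("yolo11m-seg.pt", "yolo11s-seg.pt", "yolo11n-cls.pt")


def weight_uris(
    env: Mapping[str, str] = os.environ,
    default_version: str = "v0.1",  # assumed fallback when the variable is unset
) -> Dict[str, str]:
    """Map each weight file to its versioned S3 URI based on ML_MODEL_VERSION."""
    version = env.get("ML_MODEL_VERSION", default_version)
    return {name: f"s3://{BUCKET}/{version}/{name}" for name in WEIGHT_FILES}


print(weight_uris({"ML_MODEL_VERSION": "v0.2"})["yolo11m-seg.pt"])
# s3://hasari-models/v0.2/yolo11m-seg.pt
```

With this layout, a rollback is just setting the variable back to `v0.1` and restarting the service.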

Export for on-device (mobile, v0.2 backlog)

cd tools
python export.py --model yolo11n-seg.pt --format tflite --output models/damage_yolo11n.tflite
python export.py --model yolo11n-seg.pt --format coreml --output models/damage_yolo11n.mlpackage

Output models are quantized to int8 by default: ~3 MB, running at ~80 ms on the iPhone 13 Neural Engine.


Hardware requirements

Training (full pipeline)

  • GPU: NVIDIA, ≥8 GB VRAM (Blackwell architecture or newer recommended for sm_120 features)
  • CUDA: 12.8+
  • PyTorch: 2.7+ with cu128 wheels (see services/ml/setup.ps1 / setup.sh; Blackwell support is non-trivial)
  • RAM: 32 GB
  • CPU: ≥8 cores (for data loader workers)
  • Disk: 50 GB free (datasets + checkpoints)

Inference

  • GPU (preferred): 4 GB VRAM minimum
  • CPU-only (acceptable): any modern x86_64; ~5–10× slower than GPU. Used in the Render-hosted pilot until a GPU host is provisioned.

Telemetry: what we measure in production

Every inference logs:

  • Per-model wall time (damage_ms, parts_ms, severity_ms, total_ms)
  • Per-image counts: detected damages, detected parts, matched/unmatched damages
  • Confidence-score distributions (P50, P95) per class
  • Image dimensions and file size
  • Failure category if the inference errors out

These flow to Prometheus and are visible in the Grafana "ML Pipeline" dashboard (config in observability/grafana/dashboards/ml-pipeline.json).

Use this data to:

  • Set alerts when P95 latency drifts upward (often a sign that the wrong weights were loaded)
  • Identify class drift (a sudden drop in dent confidences usually means the input distribution has shifted: new car models, a new camera type)
  • Schedule retraining when the false-positive rate, as measured by sampled human review, creeps above 5%.
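
For offline analysis of exported telemetry, the two checks above can be approximated as follows. This is a plain-Python sketch using a nearest-rank percentile, which will not match Prometheus's own quantile estimation exactly. The function names and sample values are hypothetical; only the 5% limit comes from this guide.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def needs_retraining(reviewed: int, false_positives: int, limit: float = 0.05) -> bool:
    """True when the sampled false-positive rate exceeds the 5% limit from this guide."""
    return reviewed > 0 and false_positives / reviewed > limit


total_ms = [110, 120, 125, 130, 400]  # invented samples; one slow outlier
print(percentile(total_ms, 50), percentile(total_ms, 95))  # 125 400
print(needs_retraining(reviewed=200, false_positives=12))  # True
```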

Limitations & honest caveats

  • Limited vehicle coverage: training data is heavily biased toward Western and Turkish-market cars. SUVs and pickups from non-Turkish markets may underperform.
  • Night / low-light: no IR or HDR training data. Below ~100 lux the system degrades quickly. Recommend rejecting low-light photos in v0.2.
  • Multiple vehicles per image: the pipeline assumes one car. If two cars are in frame, parts and damages from both are merged and the output is unreliable. A planned pre-check will reject multi-vehicle images.
  • Severity ground truth is subjective: even human raters disagree ~15% of the time on orta vs. agir, so the 74% top-1 accuracy is not far below the ceiling that inter-rater agreement sets on this dataset.
  • Cost calibration drifts with inflation / FX: re-calibrate cost_table.yaml quarterly.

Related docs