ML Model Guide
Everything about the three machine-learning models powering Hasarİ: performance numbers, when each one runs, known failure modes, and how to retrain.
Target audience: ML engineers and technical operators. End-user model intuition lives in USER_GUIDE_TR.md.
Pipeline overview
For every uploaded image, the damage and parts models each run on the full frame, the severity classifier runs on a crop of each detected damage, and a deterministic post-processor stitches the outputs into a part-centric JSON:
                    ┌──────────────────────────────┐
                    │ Damage YOLO11m-seg           │
                    │ 6 classes, segmentation      │
                    │ (where is the damage?)       │
                    └──────────┬───────────────────┘
                               │
┌──────────────────────┐       │                    ┌────────────────────┐
│ Parts YOLO11s-seg    │       │                    │ For each damage    │
│ 21 classes, seg      │       │ IoU(damage, part)  │ crop:              │
│ (which part?)        ├───────┼───────────────────►│ Severity classifier│
└──────────────────────┘       │                    │ YOLO11n-cls 3 cls  │
                               │                    │ (hafif/orta/agir)  │
                               ▼                    └─────────┬──────────┘
                    ┌──────────────────────────────┐          │
                    │ Match every damage to its    │          │
                    │ best-overlapping part        │◄─────────┘
                    │ → "front_bumper has a dent"  │
                    └──────────┬───────────────────┘
                               │
                               ▼
                    ┌──────────────────────────────┐
                    │ Cost engine (lookup table +  │
                    │ part × damage × severity)    │
                    │ ──► range per part           │
                    └──────────┬───────────────────┘
                               │
                               ▼
                    ┌──────────────────────────────┐
                    │ Aggregate to summary         │
                    │ (totals, recommendation)     │
                    └──────────────────────────────┘
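The matching step in the diagram can be illustrated with a minimal sketch. It assumes masks are represented as sets of pixel coordinates and that a damage goes to the part with the highest IoU, provided that IoU reaches the threshold; the names `mask_iou`, `match_damage_to_part`, and `rect` are hypothetical and not taken from services/ml/pipeline.py.

```python
from typing import Dict, Optional, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_damage_to_part(
    damage: Mask,
    parts: Dict[str, Mask],
    iou_threshold: float = 0.15,  # documented default of iou_match_threshold
) -> Optional[str]:
    """Return the best-overlapping part, or None if no part reaches the threshold."""
    best_name, best_iou = None, 0.0
    for name, part_mask in parts.items():
        iou = mask_iou(damage, part_mask)
        if iou > best_iou:
            best_name, best_iou = name, iou
    return best_name if best_iou >= iou_threshold else None


def rect(r0: int, c0: int, r1: int, c1: int) -> Mask:
    """Helper for the example: a filled rectangle as a pixel set."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


parts = {"front_bumper": rect(0, 0, 10, 20), "hood": rect(10, 0, 30, 20)}
dent = rect(2, 2, 10, 14)   # 96 px, entirely inside the bumper
speck = rect(0, 0, 2, 2)    # 4 px, also inside the bumper

print(match_damage_to_part(dent, parts))   # front_bumper (IoU 0.48)
print(match_damage_to_part(speck, parts))  # None (IoU 0.02)
```

The second call shows a property of plain IoU worth keeping in mind: a small damage on a large part scores low even when it lies fully inside the part. Whether the production post-processor normalizes the overlap differently is not stated here.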
End-to-end latency on a single 1920×1080 image, RTX 5050 8GB (Blackwell, sm_120, cu128):
| Stage | Time |
|---|---|
| Image decode + preprocess | ~10 ms |
| Damage YOLO11m-seg | ~45 ms |
| Parts YOLO11s-seg | ~30 ms |
| Severity (per damage crop, avg 3 damages) | ~36 ms total |
| IoU matching + post-processing | ~5 ms |
| Cost engine | ~1 ms |
| Single-image total | ~125 ms |
A typical 4-photo inspection runs sequentially on the worker and finishes in 5 to 8 seconds including S3 round trips.
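As a rough aid for reading the latency table, the sketch below sums the per-stage figures and scales the severity stage with the number of damage crops. It assumes severity cost grows linearly per crop (about 12 ms each, derived from 36 ms for 3 crops); `estimate_image_latency_ms` is a hypothetical helper, not part of the pipeline.

```python
# Per-stage budgets in milliseconds, copied from the latency table.
STAGE_MS = {
    "decode_preprocess": 10,
    "damage_seg": 45,
    "parts_seg": 30,
    "matching_postprocess": 5,
    "cost_engine": 1,
}
# Assumption: severity time scales linearly with the number of crops.
SEVERITY_MS_PER_CROP = 36 / 3


def estimate_image_latency_ms(n_damages: int) -> float:
    """Estimated GPU latency for one image with n_damages damage crops."""
    return sum(STAGE_MS.values()) + SEVERITY_MS_PER_CROP * n_damages


print(estimate_image_latency_ms(3))  # 127.0, in line with the ~125 ms total
print(estimate_image_latency_ms(0))  # 91.0 when nothing is detected
```

By this estimate the model stages account for well under a second of a 4-photo inspection on the listed GPU; the rest of the 5 to 8 seconds comes from everything outside these stages, including the S3 round trips.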
Model 1: Damage segmentation (YOLO11m-seg)
What it does: pixel-level segmentation of damage regions on the car body. Outputs a mask + class label + bounding box + confidence for each damage instance.
Specs
| Item | Value |
|---|---|
| Architecture | YOLO11m-seg (Ultralytics) |
| Input size | 640×640 |
| Parameters | ~22M |
| Classes (6) | dent, scratch, crack, glass_shatter, lamp_broken, tire_flat |
| Dataset | CarDD (academic, non-commercial), ~4 000 labeled images |
| Epochs trained | 120 |
| Optimizer | SGD with default Ultralytics schedule |
| Augmentation | Mosaic + HSV + flip (Ultralytics defaults) |
| Weights file | services/ml/yolo11m-seg.pt |
Performance (validation set)
| Metric | Value |
|---|---|
| mAP50-95 (box) | 0.491 |
| mAP50 (box) | 0.671 |
| mAP50-95 (mask) | 0.509 |
| mAP50 (mask) | 0.683 |
| Precision (mask, IoU≥0.5) | 0.71 |
| Recall (mask, IoU≥0.5) | 0.67 |
Per-class behavior
| Class | mAP50 | Notes |
|---|---|---|
| dent | 0.74 | Strong. Lots of training data, distinctive shape. |
| scratch | 0.69 | Mostly good, occasionally confused with cosmetic dirt. |
| crack | 0.61 | Plastic crack vs. paint crack ambiguity; thin cracks under-recalled. |
| glass_shatter | 0.78 | Strongest; the shatter pattern is distinctive. |
| lamp_broken | 0.65 | Good when the lamp lens is shattered; missed when there is only a fine crack. |
| tire_flat | 0.42 | Weakest: only ~80 training instances; needs more data (v0.2). |
Known failure modes
- Cosmetic dirt / mud on bumpers occasionally classified as scratch. Mitigation: instruct users to clean the vehicle (USER_GUIDE rule).
- Reflective glare on glass produces phantom glass_shatter detections. Confidence threshold ≥ 0.55 reduces this; tune per deployment.
- Tire flatness rarely detected: the class is included for completeness but should be considered advisory only until the v0.2 dataset boost.
- Wet surfaces reflect like cracks. Same mitigation as glare.
Model 2: Parts segmentation (YOLO11s-seg)
What it does: pixel-level segmentation of vehicle body parts. Tells us which part each damage sits on.
Specs
| Item | Value |
|---|---|
| Architecture | YOLO11s-seg |
| Input size | 640×640 |
| Parameters | ~10M |
| Classes (21) | back_bumper, front_bumper, back_door, front_left_door, front_right_door, back_left_door, back_right_door, back_glass, front_glass, back_light, front_light, back_left_light, back_right_light, front_left_light, front_right_light, hood, trunk, tailgate, left_mirror, right_mirror, wheel |
| Dataset | Combined: Roboflow car-parts + supplementary CarPartsDB scrape, ~6 000 images |
| Epochs trained | ~50 |
| Weights file | services/ml/yolo11s-seg.pt |
Performance (validation set)
| Metric | Value |
|---|---|
| mAP50 (mask) | ~0.72 |
| mAP50-95 (mask) | ~0.55 |
Known failure modes
- Left/right confusion on doors and headlights when the vehicle is photographed from the rear quarter. The system fuses left/right calls using image orientation heuristics, but this is still a known weak spot.
- Mirror miss on small images: the mirror class has ~3% of bounding-box area on average and is sometimes missed on low-resolution input.
- Trunk vs. tailgate ambiguity on hatchbacks: both classes can fire on the same region. Post-processor picks the higher-confidence one.
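The trunk/tailgate rule can be sketched as follows. This is an illustration only: it compares bounding boxes rather than masks, and the 0.5 overlap cut-off, the `PartDetection` type, and `resolve_trunk_tailgate` are all assumptions rather than the post-processor's actual code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2 in pixels


@dataclass
class PartDetection:
    label: str
    confidence: float
    box: Box


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def resolve_trunk_tailgate(
    dets: List[PartDetection], min_overlap: float = 0.5  # assumed cut-off
) -> List[PartDetection]:
    """When a trunk and a tailgate overlap, keep only the higher-confidence one."""
    dropped = set()
    for i in range(len(dets)):
        for j in range(i + 1, len(dets)):
            a, b = dets[i], dets[j]
            same_region = box_iou(a.box, b.box) >= min_overlap
            if {a.label, b.label} == {"trunk", "tailgate"} and same_region:
                dropped.add(i if a.confidence < b.confidence else j)
    return [d for k, d in enumerate(dets) if k not in dropped]


detections = [
    PartDetection("trunk", 0.60, (0, 0, 10, 10)),
    PartDetection("tailgate", 0.80, (0, 0, 10, 9)),
    PartDetection("hood", 0.90, (20, 0, 30, 10)),
]
print([d.label for d in resolve_trunk_tailgate(detections)])  # ['tailgate', 'hood']
```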
Model 3: Severity classifier (YOLO11n-cls)
What it does: given a tight crop of a single damage region, classify its severity as hafif (minor), orta (moderate), or agir (severe).
Specs
| Item | Value |
|---|---|
| Architecture | YOLO11n-cls |
| Input size | 224×224 |
| Parameters | ~2.6M |
| Classes (3) | hafif, orta, agir |
| Dataset | Roboflow Severity dataset, ~1 800 labeled crops |
| Epochs trained | 30 |
| Weights file | services/ml/yolo11n-cls.pt |
Performance (validation set)
| Metric | Value |
|---|---|
| Top-1 accuracy | 0.742 |
| Macro F1 | 0.71 |
| Confusion | Mostly between orta and agir; hafif is well-separated. |
Known failure modes
- Overfitting tendency: small dataset means the model is slightly biased toward orta. Val accuracy plateaued at ~0.74; adding more agir examples is a v0.2 priority.
- Crop quality dependency: if the damage YOLO produces a tight, well-centered crop, classification is reliable. Loose or off-center crops degrade accuracy by ~10%.
- Glass shatter severity is currently always classified as orta or agir; there's no "minor glass shatter" in the training distribution. Acceptable for v0.1.
Cost engine
Not an ML model but a lookup-table-driven function:
cost(part, damage_type, severity) → (min_tl, max_tl)
The table lives at services/ml/cost_table.yaml and is calibrated to local Türkiye OEM + aftermarket prices (March 2026). Example entries:
front_bumper:
  dent:
    hafif: [400, 1200]
    orta: [2500, 5500]
    agir: [6000, 12000]
  scratch:
    hafif: [200, 600]
    orta: [800, 2000]
    agir: [2500, 4500]
Why not ML for cost? Insufficient labeled price data (you need real repair invoices) and the lookup table is more debuggable for pilot use. v0.2 plans an ML regression head once the pilot accumulates ~500 verified inspections with actual repair costs.
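A minimal sketch of the lookup, using only the example entries shown above. The in-memory `COST_TABLE`, the choice to return `None` for a missing combination, and the `total_range` helper (summing the per-part ranges into an inspection total) are assumptions for illustration; the real engine reads cost_table.yaml and may behave differently.

```python
from typing import Dict, Iterable, Optional, Tuple

CostRange = Tuple[int, int]  # (min_tl, max_tl)

# Mirrors the example YAML entries; the real table lives in cost_table.yaml.
COST_TABLE: Dict[str, Dict[str, Dict[str, CostRange]]] = {
    "front_bumper": {
        "dent": {"hafif": (400, 1200), "orta": (2500, 5500), "agir": (6000, 12000)},
        "scratch": {"hafif": (200, 600), "orta": (800, 2000), "agir": (2500, 4500)},
    },
}


def cost(part: str, damage_type: str, severity: str) -> Optional[CostRange]:
    """Look up the (min_tl, max_tl) range; None if the combination is not in the table."""
    try:
        return COST_TABLE[part][damage_type][severity]
    except KeyError:
        return None


def total_range(findings: Iterable[Tuple[str, str, str]]) -> CostRange:
    """Sum the ranges of all findings that have a table entry."""
    low = high = 0
    for part, damage_type, severity in findings:
        entry = cost(part, damage_type, severity)
        if entry is not None:
            low += entry[0]
            high += entry[1]
    return low, high


print(cost("front_bumper", "dent", "orta"))  # (2500, 5500)
print(total_range([
    ("front_bumper", "dent", "hafif"),
    ("front_bumper", "scratch", "agir"),
]))  # (2900, 5700)
```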
Inference configuration
Default thresholds in services/ml/pipeline.py:
| Hyperparameter | Default | When to tune |
|---|---|---|
| damage_conf_threshold | 0.55 | Lower → more sensitive, more false positives. Raise to 0.65 in noisy environments. |
| parts_conf_threshold | 0.5 | Parts model is more reliable; rarely needs tuning. |
| iou_match_threshold | 0.15 | How much a damage mask must overlap a part to be assigned to it. Lower = more aggressive matching. |
| severity_min_crop_size | 32×32 px | Smaller crops degrade severity accuracy; below this we skip severity and label bilinmiyor (unknown). |
| max_damages_per_image | 25 | Hard cap to prevent runaway false positives. |
To change a threshold per request, pass the override in the API call (planned feature, not yet exposed in v0.1).
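To make the defaults in the table concrete, here is a hedged sketch of how they might be applied. The field names and values come from the table; the `PipelineThresholds` class, the helper functions, the choice to keep the highest-confidence detections when the cap is hit, and treating a crop as too small when either side is under 32 px are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PipelineThresholds:
    damage_conf_threshold: float = 0.55
    parts_conf_threshold: float = 0.5
    iou_match_threshold: float = 0.15
    severity_min_crop_size: int = 32  # pixels per side
    max_damages_per_image: int = 25


def filter_damages(
    scored: List[Tuple[str, float]], cfg: PipelineThresholds
) -> List[Tuple[str, float]]:
    """Drop low-confidence damages, then cap the list, best detections first."""
    kept = [d for d in scored if d[1] >= cfg.damage_conf_threshold]
    kept.sort(key=lambda d: d[1], reverse=True)
    return kept[: cfg.max_damages_per_image]


def severity_or_unknown(
    crop_w: int, crop_h: int, predicted: str, cfg: PipelineThresholds
) -> str:
    """Skip the classifier result for crops below the minimum size."""
    if min(crop_w, crop_h) < cfg.severity_min_crop_size:
        return "bilinmiyor"
    return predicted


cfg = PipelineThresholds()
print(filter_damages([("dent", 0.90), ("scratch", 0.50), ("crack", 0.55)], cfg))
print(severity_or_unknown(31, 100, "orta", cfg))  # bilinmiyor
```

A frozen dataclass keeps the defaults in one place and would make a future per-request override a matter of constructing a modified copy, for example with `dataclasses.replace`.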
Retraining
Quick: incremental data, same architecture
For weekly fine-tuning runs on top of the existing checkpoint:
cd services\ml
# Damage model: 30 more epochs on top of the v0.1 weights
python train.py --resume yolo11m-seg.pt --data cardd.yaml --epochs 30 --batch 8 --device 0
# Parts model
python train_parts.py --resume yolo11s-seg.pt --data parts.yaml --epochs 20 --batch 16 --device 0
# Severity classifier
python train_severity.py --resume yolo11n-cls.pt --data data/severity --epochs 15 --batch 32 --device 0
Full: clean retrain from pre-trained YOLO11
For a major version bump (v0.2 → v0.3):
cd services\ml
python train_all.py --full --device 0
train_all.py --full runs all three trainings sequentially and logs to services/ml/runs/ and services/ml/logs/. It will:
- Download pre-trained YOLO11 base weights if missing.
- Train damage model for 120 epochs.
- Train parts model for 50 epochs.
- Train severity classifier for 30 epochs.
- Run the regression test suite (tools/regression_test.py) and write a comparison report against the previous deployment.
Wall-clock time on RTX 5050 8GB: ~14 hours for the full run.
Dataset refresh
Before retraining, refresh datasets:
# CarDD: re-download if upstream HuggingFace mirror updated
python scripts\download_data.py --cardd-hf --force
# Roboflow severity: set API key first
$env:ROBOFLOW_API_KEY = "..."
python scripts\download_data.py --roboflow-severity --force
# Pilot in-the-wild data (if you've collected labeled images from pilot users)
python scripts\merge_pilot_data.py --in pilot_inspections.csv --out data/pilot/
python scripts\verify_data.py --datasets cardd pilot
Validating a new checkpoint
Always run the regression suite before promoting:
python tools\regression_test.py `
--baseline services\ml\runs\v0.1\weights\best.pt `
--candidate services\ml\runs\v0.2\weights\best.pt `
--fixtures tools\fixtures\regression\
The regression suite scores both models on 200 hand-curated images and fails the build if any of these regresses by >2%:
- mAP50 (mask) per class
- IoU matching accuracy (does each damage land on the right part?)
- Total cost variance (is the new model producing drastically different cost ranges?)
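The gate can be pictured with the sketch below. It is not the code in tools/regression_test.py: it assumes the 2% is a relative drop, covers only higher-is-better metrics (so not the cost-variance check), and uses made-up candidate scores for illustration. `find_regressions` is a hypothetical name.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,  # assumption: 2% relative drop
) -> List[str]:
    """Names of metrics where the candidate falls more than `tolerance` below baseline."""
    failed = []
    for name, base in baseline.items():
        cand = candidate.get(name, 0.0)  # a missing metric counts as a regression
        if base > 0 and (base - cand) / base > tolerance:
            failed.append(name)
    return failed


# Baseline mAP50 values for two classes from this guide; the matching accuracy
# and all candidate values are invented for the example.
baseline = {"dent": 0.74, "tire_flat": 0.42, "matching_accuracy": 0.90}
candidate = {"dent": 0.75, "tire_flat": 0.40, "matching_accuracy": 0.89}

failed = find_regressions(baseline, candidate)
print(failed)  # ['tire_flat']
if failed:
    print("Do not promote: regression in " + ", ".join(failed))
```

Note how a relative tolerance is stricter for weak classes: a 0.02 absolute drop on tire_flat is a 4.8% regression, while a 0.01 drop on a 0.90 metric passes.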
Promoting weights to production
- Copy the new .pt files to a versioned S3 location: aws s3 cp services/ml/runs/v0.2/weights/best.pt s3://hasari-models/v0.2/yolo11m-seg.pt
- Update the ML_MODEL_VERSION=v0.2 env var on the Render API service.
- The backend reads ML_MODEL_VERSION at startup and downloads the matching weights from S3.
- Smoke-test on staging before pointing production at the new version.
- Keep the previous version (v0.1) on S3 for instant rollback.
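The version-to-weights mapping implied by these steps could look like the sketch below. The bucket name and the yolo11m-seg.pt key come from the `aws s3 cp` example; that the other two weight files follow the same layout, the `v0.1` fallback, and the function names are assumptions, and no download is performed here.

```python
import os
from typing import List

BUCKET = "hasari-models"  # from the aws s3 cp example
# Assumption: all three weight files sit under the same version prefix.
WEIGHT_FILES = ("yolo11m-seg.pt", "yolo11s-seg.pt", "yolo11n-cls.pt")


def configured_version(default: str = "v0.1") -> str:
    """Read ML_MODEL_VERSION, falling back to an assumed default."""
    return os.environ.get("ML_MODEL_VERSION", default)


def weight_uris(version: str) -> List[str]:
    """Build the S3 URIs the backend would fetch for a given model version."""
    if not version.startswith("v"):
        raise ValueError(f"unexpected model version: {version!r}")
    return [f"s3://{BUCKET}/{version}/{name}" for name in WEIGHT_FILES]


for uri in weight_uris("v0.2"):
    print(uri)
```

Under this layout a rollback is only a matter of setting ML_MODEL_VERSION back to the previous value and restarting the service, which is why the old prefix should stay on S3.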
Export for on-device (mobile, v0.2 backlog)
cd tools
python export.py --model yolo11n-seg.pt --format tflite --output models/damage_yolo11n.tflite
python export.py --model yolo11n-seg.pt --format coreml --output models/damage_yolo11n.mlpackage
Output models are quantized to int8 by default: ~3 MB, running at ~80 ms on the iPhone 13 Neural Engine.
Hardware requirements
Training (full pipeline)
- GPU: NVIDIA, ≥8 GB VRAM (Blackwell architecture or newer recommended for sm_120 features)
- CUDA: 12.8+
- PyTorch: 2.7+ with cu128 wheels (see services/ml/setup.ps1 / setup.sh; Blackwell support is non-trivial)
- RAM: 32 GB
- CPU: ≥8 cores (for data loader workers)
- Disk: 50 GB free (datasets + checkpoints)
Inference
- GPU (preferred): 4 GB VRAM minimum
- CPU-only (acceptable): any modern x86_64; ~5 to 10× slower than GPU. Used in the Render-hosted pilot until a GPU host is provisioned.
Telemetry: what we measure in production
Every inference logs:
- Per-model wall time (damage_ms, parts_ms, severity_ms, total_ms)
- Per-image counts: detected damages, detected parts, matched/unmatched damages
- Confidence-score distributions (P50, P95) per class
- Image dimensions and file size
- Failure category if the inference errors out
These flow to Prometheus and are visible in the Grafana "ML Pipeline" dashboard (config in observability/grafana/dashboards/ml-pipeline.json).
Use this data to:
- Set alerts when P95 latency drifts upward (often a sign that the wrong weights were loaded)
- Identify class drift (a sudden drop in dent confidences usually means the input distribution shifted: new car models, new camera type)
- Schedule retraining when the false-positive rate, as measured by a sampled human review, creeps above 5%.
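For offline spot checks of the logged values, the two rules above can be approximated in a few lines. This sketch uses Python's `statistics.quantiles` rather than Prometheus queries, and the function names are hypothetical; the 5% limit is the one quoted above.

```python
from statistics import quantiles
from typing import Dict, List


def p50_p95(values: List[float]) -> Dict[str, float]:
    """P50 and P95 of a sample (needs at least two values)."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94]}


def needs_retraining(reviewed: int, false_positives: int, limit: float = 0.05) -> bool:
    """True when the sampled false-positive rate exceeds the limit."""
    if reviewed == 0:
        return False
    return false_positives / reviewed > limit


# Synthetic total_ms samples, 0..100, purely to show the call shape.
samples = [float(ms) for ms in range(101)]
print(p50_p95(samples))           # {'p50': 50.0, 'p95': 95.0}
print(needs_retraining(200, 11))  # True: 5.5% of reviewed detections were wrong
```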
Limitations & honest caveats
- Western and Turkish-market vehicles only: training data is heavily biased toward Western and Turkish-market cars. SUVs and pickups from non-Turkish markets may underperform.
- Night / low-light: no IR or HDR training data. Below ~100 lux the system degrades quickly. Recommend rejecting low-light photos in v0.2.
- Multiple vehicles per image: the pipeline assumes one car. If two cars are in frame, parts and damages from both are merged and the output is unreliable. Pre-check (planned) will reject multi-vehicle images.
- Severity ground truth is subjective: even human raters disagree ~15% of the time on orta vs. agir. A 74% accuracy is close to inter-rater agreement on this dataset.
- Cost calibration drifts with inflation / FX: re-calibrate cost_table.yaml quarterly.
Related docs
- DATA.md: dataset sources, licenses, train/val splits
- ARCHITECTURE.md: pipeline internals at code level
- services/ml/setup.ps1 / setup.sh: ML environment bootstrap
- tools/regression_test.py: pre-promotion validation