Spaces:
Running
Running
| # ML Model Guide | |
| Everything about the three machine-learning models powering HasarΔ° β performance numbers, when each one runs, known failure modes, and how to retrain. | |
| > Target audience: ML engineers and technical operators. End-user model intuition lives in [USER_GUIDE_TR.md](USER_GUIDE_TR.md#6-sonuΓ§larΔ±-anlama). | |
| --- | |
| ## Pipeline overview | |
| For every uploaded image, three models run **in parallel**, then a deterministic post-processor stitches the outputs into a part-centric JSON: | |
| ``` | |
| ββββββββββββββββββββββββββββββββ | |
| β Damage YOLO11m-seg β | |
| β 6 classes, segmentation β | |
| β (where is the damage?) β | |
| ββββββββββββ¬ββββββββββββββββββββ | |
| β | |
| ββββββββββββββββββββββββ β ββββββββββββββββββββββ | |
| β Parts YOLO11s-seg β β β For each damage β | |
| β 21 classes, seg β β IoU(damage, part) β crop: β | |
| β (which part?) βββββββββΌββββββββββββββββββββΊ β Severity classifierβ | |
| ββββββββββββββββββββββββ β β YOLO11n-cls 3 cls β | |
| β β (hafif/orta/agir) β | |
| βΌ βββββββββββ¬βββββββββββ | |
| ββββββββββββββββββββββββββββββββ β | |
| β Match every damage to its β β | |
| β best-overlapping part βββββββββββββββ | |
| β β "front_bumper has a dent" β | |
| ββββββββββββ¬ββββββββββββββββββββ | |
| β | |
| βΌ | |
| ββββββββββββββββββββββββββββββββ | |
| β Cost engine (lookup table + β | |
| β part Γ damage Γ severity) β | |
| β β βΊ range per part β | |
| ββββββββββββ¬ββββββββββββββββββββ | |
| β | |
| βΌ | |
| ββββββββββββββββββββββββββββββββ | |
| β Aggregate to summary β | |
| β (totals, recommendation) β | |
| ββββββββββββββββββββββββββββββββ | |
| ``` | |
| End-to-end latency on a single 1920Γ1080 image, RTX 5050 8GB (Blackwell, sm_120, cu128): | |
| | Stage | Time | | |
| |---|---| | |
| | Image decode + preprocess | ~10 ms | | |
| | Damage YOLO11m-seg | ~45 ms | | |
| | Parts YOLO11s-seg | ~30 ms | | |
| | Severity (per damage crop, avg 3 damages) | ~36 ms total | | |
| | IoU matching + post-processing | ~5 ms | | |
| | Cost engine | ~1 ms | | |
| | **Single-image total** | **~125 ms** | | |
| A typical 4-photo inspection runs sequentially on the worker and finishes in **5β8 seconds** including S3 round trips. | |
| --- | |
| ## Model 1 β Damage segmentation (YOLO11m-seg) | |
| **What it does**: pixel-level segmentation of damage regions on the car body. Outputs a mask + class label + bounding box + confidence for each damage instance. | |
| ### Specs | |
| | Item | Value | | |
| |---|---| | |
| | **Architecture** | YOLO11m-seg (Ultralytics) | | |
| | **Input size** | 640Γ640 | | |
| | **Parameters** | ~22M | | |
| | **Classes (6)** | `dent`, `scratch`, `crack`, `glass_shatter`, `lamp_broken`, `tire_flat` | | |
| | **Dataset** | CarDD (academic, non-commercial) β ~4 000 labeled images | | |
| | **Epochs trained** | 120 | | |
| | **Optimizer** | SGD with default Ultralytics schedule | | |
| | **Augmentation** | Mosaic + HSV + flip (Ultralytics defaults) | | |
| | **Weights file** | `services/ml/yolo11m-seg.pt` | | |
| ### Performance (validation set) | |
| | Metric | Value | | |
| |---|---| | |
| | **mAP50-95 (box)** | 0.491 | | |
| | **mAP50 (box)** | 0.671 | | |
| | **mAP50-95 (mask)** | **0.509** | | |
| | **mAP50 (mask)** | **0.683** | | |
| | **Precision (mask, IoUβ₯0.5)** | 0.71 | | |
| | **Recall (mask, IoUβ₯0.5)** | 0.67 | | |
| ### Per-class behavior | |
| | Class | mAP50 | Notes | | |
| |---|---|---| | |
| | `dent` | 0.74 | Strongest. Lots of training data, distinctive shape. | | |
| | `scratch` | 0.69 | Mostly good, occasionally confused with cosmetic dirt. | | |
| | `crack` | 0.61 | Plastic crack vs. paint crack ambiguity; thin cracks under-recalled. | | |
| | `glass_shatter` | 0.78 | Very strong β shatter pattern is distinctive. | | |
| | `lamp_broken` | 0.65 | Good when lamp lens is shattered; missed when only a fine crack. | | |
| | `tire_flat` | 0.42 | **Weakest** β only ~80 training instances; needs more data (v0.2). | | |
| ### Known failure modes | |
| - **Cosmetic dirt / mud** on bumpers occasionally classified as scratch. Mitigation: instruct users to clean the vehicle (USER_GUIDE rule). | |
| - **Reflective glare** on glass produces phantom `glass_shatter` detections. Confidence threshold β₯ 0.55 reduces this; tune per deployment. | |
| - **Tire flatness** rarely detected β class is included for completeness but should be considered advisory only until v0.2 dataset boost. | |
| - **Wet surfaces** reflect like cracks. Same mitigation as glare. | |
| --- | |
| ## Model 2 β Parts segmentation (YOLO11s-seg) | |
| **What it does**: pixel-level segmentation of vehicle body parts. Tells us *which* part each damage sits on. | |
| ### Specs | |
| | Item | Value | | |
| |---|---| | |
| | **Architecture** | YOLO11s-seg | | |
| | **Input size** | 640Γ640 | | |
| | **Parameters** | ~10M | | |
| | **Classes (21)** | `back_bumper`, `front_bumper`, `back_door`, `front_left_door`, `front_right_door`, `back_left_door`, `back_right_door`, `back_glass`, `front_glass`, `back_light`, `front_light`, `back_left_light`, `back_right_light`, `front_left_light`, `front_right_light`, `hood`, `trunk`, `tailgate`, `left_mirror`, `right_mirror`, `wheel` | | |
| | **Dataset** | Combined: Roboflow car-parts + supplementary CarPartsDB scrape β ~6 000 images | | |
| | **Epochs trained** | ~50 | | |
| | **Weights file** | `services/ml/yolo11s-seg.pt` | | |
| ### Performance (validation set) | |
| | Metric | Value | | |
| |---|---| | |
| | **mAP50 (mask)** | **~0.72** | | |
| | **mAP50-95 (mask)** | ~0.55 | | |
| ### Known failure modes | |
| - **Left/right confusion** on doors and headlights when the vehicle is photographed from the rear quarter β the system fuses left/right calls using image orientation heuristics, but it's still a known weak spot. | |
| - **Mirror miss** on small images: the mirror class has ~3% of bounding-box area on average and is sometimes missed on low-resolution input. | |
| - **Trunk vs. tailgate** ambiguity on hatchbacks β both classes can fire on the same region. Post-processor picks the higher-confidence one. | |
| --- | |
| ## Model 3 β Severity classifier (YOLO11n-cls) | |
| **What it does**: given a tight crop of a single damage region, classify its severity as `hafif` (minor), `orta` (moderate), or `agir` (severe). | |
| ### Specs | |
| | Item | Value | | |
| |---|---| | |
| | **Architecture** | YOLO11n-cls | | |
| | **Input size** | 224Γ224 | | |
| | **Parameters** | ~2.6M | | |
| | **Classes (3)** | `hafif`, `orta`, `agir` | | |
| | **Dataset** | Roboflow Severity dataset β ~1 800 labeled crops | | |
| | **Epochs trained** | 30 | | |
| | **Weights file** | `services/ml/yolo11n-cls.pt` | | |
| ### Performance (validation set) | |
| | Metric | Value | | |
| |---|---| | |
| | **Top-1 accuracy** | **0.742** | | |
| | **Macro F1** | 0.71 | | |
| | **Confusion** | Mostly `orta` β `agir`; `hafif` is well-separated. | | |
| ### Known failure modes | |
| - **Overfitting tendency**: small dataset means the model is slightly biased toward `orta`. Val accuracy plateaued at ~0.74 β adding more `agir` examples is a v0.2 priority. | |
| - **Crop quality dependency**: if the damage YOLO produces a tight, well-centered crop, classification is reliable. Loose or off-center crops degrade accuracy by ~10%. | |
| - **Glass shatter severity** is currently always classified as `orta` or `agir` β there's no "minor glass shatter" in the training distribution. Acceptable for v0.1. | |
| --- | |
| ## Cost engine | |
| Not an ML model β a **lookup-table-driven** function: | |
| ``` | |
| cost(part, damage_type, severity) β (min_tl, max_tl) | |
| ``` | |
| The table lives at `services/ml/cost_table.yaml` and is calibrated to local TΓΌrkiye OEM + aftermarket prices (March 2026). Example entries: | |
| ```yaml | |
| front_bumper: | |
| dent: | |
| hafif: [400, 1200] | |
| orta: [2500, 5500] | |
| agir: [6000, 12000] | |
| scratch: | |
| hafif: [200, 600] | |
| orta: [800, 2000] | |
| agir: [2500, 4500] | |
| ``` | |
| **Why not ML for cost?** Insufficient labeled price data (you need real repair invoices) and the lookup table is more debuggable for pilot use. v0.2 plans an ML regression head once the pilot accumulates ~500 verified inspections with actual repair costs. | |
| --- | |
| ## Inference configuration | |
| Default thresholds in `services/ml/pipeline.py`: | |
| | Hyperparameter | Default | When to tune | | |
| |---|---|---| | |
| | `damage_conf_threshold` | 0.55 | Lower β more sensitive, more false positives. Raise to 0.65 in noisy environments. | | |
| | `parts_conf_threshold` | 0.5 | Parts model is more reliable; rarely needs tuning. | | |
| | `iou_match_threshold` | 0.15 | How much a damage mask must overlap a part to be assigned to it. Lower = more aggressive matching. | | |
| | `severity_min_crop_size` | 32Γ32 px | Smaller crops degrade severity accuracy; below this we skip severity and label `bilinmiyor`. | | |
| | `max_damages_per_image` | 25 | Hard cap to prevent runaway false positives. | | |
| To change a threshold per request, pass the override in the API call (planned feature β not yet exposed in v0.1). | |
| --- | |
| ## Retraining | |
| ### Quick: incremental data, same architecture | |
| For weekly fine-tuning runs on top of the existing checkpoint: | |
| ```powershell | |
| cd services\ml | |
| # Damage model β 30 more epochs on top of the v0.1 weights | |
| python train.py --resume yolo11m-seg.pt --data cardd.yaml --epochs 30 --batch 8 --device 0 | |
| # Parts model | |
| python train_parts.py --resume yolo11s-seg.pt --data parts.yaml --epochs 20 --batch 16 --device 0 | |
| # Severity classifier | |
| python train_severity.py --resume yolo11n-cls.pt --data data/severity --epochs 15 --batch 32 --device 0 | |
| ``` | |
| ### Full: clean retrain from pre-trained YOLO11 | |
| For a major version bump (v0.2 β v0.3): | |
| ```powershell | |
| cd services\ml | |
| python train_all.py --full --device 0 | |
| ``` | |
| `train_all.py --full` runs all three trainings sequentially and logs to `services/ml/runs/` and `services/ml/logs/`. It will: | |
| 1. Download pre-trained YOLO11 base weights if missing. | |
| 2. Train damage model for 120 epochs. | |
| 3. Train parts model for 50 epochs. | |
| 4. Train severity classifier for 30 epochs. | |
| 5. Run the regression test suite (`tools/regression_test.py`) and write a comparison report against the previous deployment. | |
| **Wall-clock time** on RTX 5050 8GB: ~14 hours for the full run. | |
| ### Dataset refresh | |
| Before retraining, refresh datasets: | |
| ```powershell | |
| # CarDD β re-download if upstream HuggingFace mirror updated | |
| python scripts\download_data.py --cardd-hf --force | |
| # Roboflow severity β set API key first | |
| $env:ROBOFLOW_API_KEY = "..." | |
| python scripts\download_data.py --roboflow-severity --force | |
| # Pilot in-the-wild data (if you've collected labeled images from pilot users) | |
| python scripts\merge_pilot_data.py --in pilot_inspections.csv --out data/pilot/ | |
| python scripts\verify_data.py --datasets cardd pilot | |
| ``` | |
| ### Validating a new checkpoint | |
| Always run the regression suite before promoting: | |
| ```powershell | |
| python tools\regression_test.py ` | |
| --baseline services\ml\runs\v0.1\weights\best.pt ` | |
| --candidate services\ml\runs\v0.2\weights\best.pt ` | |
| --fixtures tools\fixtures\regression\ | |
| ``` | |
| The regression suite scores both models on 200 hand-curated images and fails the build if any of these regresses by >2%: | |
| - mAP50 (mask) per class | |
| - IoU matching accuracy (does each damage land on the right part?) | |
| - Total cost variance (is the new model producing drastically different cost ranges?) | |
| ### Promoting weights to production | |
| 1. Copy the new `.pt` files to a versioned S3 location: | |
| ```bash | |
| aws s3 cp services/ml/runs/v0.2/weights/best.pt s3://hasari-models/v0.2/yolo11m-seg.pt | |
| ``` | |
| 2. Update `ML_MODEL_VERSION=v0.2` env var on the Render API service. | |
| 3. The backend reads `ML_MODEL_VERSION` at startup and downloads the matching weights from S3. | |
| 4. Smoke-test on staging before pointing production at the new version. | |
| 5. Keep the previous version (`v0.1`) on S3 for instant rollback. | |
| ### Export for on-device (mobile, v0.2 backlog) | |
| ```powershell | |
| cd tools | |
| python export.py --model yolo11n-seg.pt --format tflite --output models/damage_yolo11n.tflite | |
| python export.py --model yolo11n-seg.pt --format coreml --output models/damage_yolo11n.mlpackage | |
| ``` | |
| Output models are quantized to int8 by default β ~3 MB, runs at ~80 ms on iPhone 13 Neural Engine. | |
| --- | |
| ## Hardware requirements | |
| ### Training (full pipeline) | |
| - **GPU**: NVIDIA, β₯8 GB VRAM (Blackwell architecture or newer recommended for sm_120 features) | |
| - **CUDA**: 12.8+ | |
| - **PyTorch**: 2.4+ with `cu128` wheels (see `services/ml/setup.ps1` / `setup.sh` β Blackwell support is non-trivial) | |
| - **RAM**: 32 GB | |
| - **CPU**: β₯8 cores (for data loader workers) | |
| - **Disk**: 50 GB free (datasets + checkpoints) | |
| ### Inference | |
| - **GPU (preferred)**: 4 GB VRAM minimum | |
| - **CPU-only (acceptable)**: any modern x86_64; ~5β10Γ slower than GPU. Used in the Render-hosted pilot until GPU host is provisioned. | |
| --- | |
| ## Telemetry: what we measure in production | |
| Every inference logs: | |
| - Per-model wall time (`damage_ms`, `parts_ms`, `severity_ms`, `total_ms`) | |
| - Per-image counts: detected damages, detected parts, matched/unmatched damages | |
| - Confidence-score distributions (P50, P95) per class | |
| - Image dimensions and file size | |
| - Failure category if the inference errors out | |
| These flow to Prometheus and are visible in the Grafana "ML Pipeline" dashboard (config in `observability/grafana/dashboards/ml-pipeline.json`). | |
| Use this data to: | |
| - Set alerts when P95 latency drifts upward (often signals model loading wrong weights) | |
| - Identify class drift (sudden drop in `dent` confidences usually means input distribution shifted β new car models, new camera type) | |
| - Schedule retraining when false-positive rate creeps above 5% per a sampled human review. | |
| --- | |
| ## Limitations & honest caveats | |
| - **English vehicles only**: training data is heavily biased toward Western and Turkish-market cars. SUVs and pickups from non-Turkish markets may underperform. | |
| - **Night / low-light**: no IR or HDR training data. Below ~100 lux the system degrades quickly. Recommend rejecting low-light photos in v0.2. | |
| - **Multiple vehicles per image**: the pipeline assumes one car. If two cars are in frame, parts and damages from both are merged β output is unreliable. Pre-check (planned) will reject multi-vehicle images. | |
| - **Severity ground truth is subjective**: even human raters disagree ~15% of the time on `orta` vs. `agir`. A 74% accuracy is close to inter-rater agreement on this dataset. | |
| - **Cost calibration drifts with inflation / FX**: re-calibrate `cost_table.yaml` quarterly. | |
| --- | |
| ## Related docs | |
| - [DATA.md](../DATA.md) β dataset sources, licenses, train/val splits | |
| - [ARCHITECTURE.md](../ARCHITECTURE.md) β pipeline internals at code level | |
| - [services/ml/setup.ps1](../services/ml/setup.ps1) / [setup.sh](../services/ml/setup.sh) β ML environment bootstrap | |
| - [tools/regression_test.py](../tools/regression_test.py) β pre-promotion validation | |