---
language:
- en
- fr
license: agpl-3.0
library_name: transformers
base_model: Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
tags:
- construction
- visual-analysis
- safety-inspection
- vlm
- qwen2_5_vl
- qwen2-vl
- lora
- horama
- btp
- structured-output
- json
- image-to-json
- peft
- safetensors
model-index:
- name: Horama_BTP
  results: []
---

<div align="center">

# HORAMA-BTP

### Vision-Language Model for Construction Site Analysis

**Image → Structured JSON** | Built on Qwen2.5-VL | Fine-tuned with LoRA

---

*Horama-BTP transforms construction site photographs into comprehensive, machine-readable JSON reports covering progress tracking, safety compliance, quality assessment, and logistics.*

</div>

## Overview

**Horama-BTP** is a domain-specialized Vision-Language Model (VLM) that converts construction site images into structured JSON analyses. Given a single photograph, the model produces a detailed report spanning **15 analysis dimensions** -- from construction progress estimation and safety compliance to quality defects and environmental impact.

The model enforces a strict, validated JSON schema with controlled vocabularies, confidence scores, and an evidence-linking system that traces every observation back to visual evidence in the image.

### Key Capabilities

| Dimension | What the model extracts |
|---|---|
| **Progress** | Construction stage (earthworks → commissioning), estimated % completion, detected milestones |
| **Safety** | PPE compliance per worker, hazard identification (9 types), control measures present or missing |
| **Quality** | Structural defects (cracks, misalignment, corrosion, and more), non-conformities |
| **Observations** | Objects, materials, equipment, personnel, vehicles, structural parts with attributes |
| **Logistics** | Materials on site, equipment status (idle/operating), access constraints |
| **Environment** | Dust, noise, waste, spills; waste management assessment |
| **Evidence** | Traceable evidence entries with unique IDs linking every finding to visual proof |

## Architecture

```
Input Image ───┐
               ├──► Qwen2.5-VL-3B-Instruct ──► LoRA-adapted layers ──► Structured JSON
System Prompt ─┘          (backbone)             (r=32, alpha=64)
```

| Component | Details |
|---|---|
| **Backbone** | [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) -- 3B-parameter multimodal transformer |
| **Adaptation** | LoRA (Low-Rank Adaptation) applied to all attention and MLP projections |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **LoRA Rank** | r=32, alpha=64 (2x scaling), dropout=0.1 |
| **Precision** | BF16 (GPU) / FP32 (CPU/MPS) |
| **Output** | Deterministic JSON (temperature=0, greedy decoding) |

### Design Principles

- **Schema-first**: Every output is validated against a formal JSON Schema (draft 2020-12) with 15 required top-level fields and controlled enumerations
- **Evidence-linked**: All observations reference `evidence_id` entries -- no claim without visual justification
- **Confidence-scored**: Every detection carries a `[0, 1]` confidence score for downstream filtering
- **Honest by design**: When something is uncertain or not visible, the model uses `"unknown"`, `null`, or empty arrays -- never hallucinated details

## Quick Start

```python
import json

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load model and processor
model_id = "Horama/Horama_BTP"

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image
image = Image.open("construction_site.jpg").convert("RGB")

# System prompt -- instructs the model to output Horama-BTP v1 JSON
system_prompt = """You are Horama-BTP v1. Analyze construction site images. Output ONLY valid JSON. No text before/after.
CRITICAL RULES:
1. ONLY describe what you can CLEARLY SEE in the image
2. If you cannot see something -> use empty array [] or "unknown"
3. Output must follow the Horama-BTP v1 JSON schema exactly"""

user_prompt = "Analyze this construction site image and return the Horama-BTP v1 JSON output."

# Prepare messages
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_prompt},
        ],
    },
]

# Generate deterministically (greedy decoding)
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt
generated = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)

# Extract the JSON object from the model's reply
json_start = result.find("{")
json_end = result.rfind("}") + 1
analysis = json.loads(result[json_start:json_end])

print(json.dumps(analysis, indent=2))
```

## Output Schema

The model outputs a single JSON object with **15 required top-level fields**:

```
{
  "job_type": "construction" | "renovation" | "infrastructure" | "unknown",
  "asset_type": "house" | "building" | "road" | "bridge" | "tunnel" | "site" | "unknown",
  "scene_context": { location_hint, weather_light, viewpoint },
  "summary": { one_liner, confidence },
  "progress": { overall_stage, stage_confidence, progress_percent_estimate, progress_confidence, milestones_detected },
  "work_activities": [{ activity, status, confidence, evidence_ids }],
  "observations": [{ type, label, attributes, confidence, evidence_ids }],
  "safety": { overall_risk_level, ppe[], hazards[], control_measures[] },
  "quality": { issues[], non_conformities[] },
  "logistics": { materials_on_site[], equipment_on_site[], access_constraints[] },
  "environment": { impacts[], waste_management },
  "evidence": [{ evidence_id, source, bbox_xyxy, description }],
  "unknown": [{ question, why_unknown, needed_input }],
  "domain_fields": { custom_kpis, lot_breakdown, client_specific },
  "metadata": { model, version, generated_at }
}
```
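
A lightweight complement to full schema validation is checking that all 15 required top-level fields are present before accepting a report. A sketch in plain Python, with the field list copied from the outline above:

```python
# The 15 required top-level fields of the Horama-BTP v1 schema
REQUIRED_FIELDS = (
    "job_type", "asset_type", "scene_context", "summary", "progress",
    "work_activities", "observations", "safety", "quality", "logistics",
    "environment", "evidence", "unknown", "domain_fields", "metadata",
)

def missing_fields(report: dict) -> list[str]:
    """Return the required top-level fields absent from a report."""
    return [field for field in REQUIRED_FIELDS if field not in report]

# A partial report is easy to flag before it reaches downstream systems
partial = {"job_type": "construction", "asset_type": "house"}
assert missing_fields(partial)[0] == "scene_context"
```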

### Controlled Vocabularies

The schema enforces controlled enumerations across all categorical fields:

| Field | Allowed values |
|---|---|
| `overall_stage` | `planning`, `earthworks`, `foundations`, `structure`, `envelope`, `mep`, `finishing`, `commissioning`, `unknown` |
| `ppe_item` | `helmet`, `vest`, `gloves`, `goggles`, `harness`, `boots`, `mask`, `other` |
| `hazard_type` | `fall_risk`, `open_trench`, `moving_vehicle`, `electrical`, `fire`, `unstable_load`, `poor_housekeeping`, `restricted_area`, `other` |
| `issue_type` | `crack`, `misalignment`, `water_infiltration`, `corrosion`, `spalling`, `poor_finish`, `missing_component`, `rework`, `other` |
| `observation_type` | `object`, `material`, `equipment`, `signage`, `defect`, `hazard`, `personnel`, `vehicle`, `structure_part`, `other` |
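
Mirroring these vocabularies downstream catches out-of-vocabulary values before they reach storage or dashboards. A sketch using the `hazard_type` set from the table (the report shape follows the schema outline above):

```python
# Controlled vocabulary for hazard_type, copied from the table above
HAZARD_TYPES = {
    "fall_risk", "open_trench", "moving_vehicle", "electrical", "fire",
    "unstable_load", "poor_housekeeping", "restricted_area", "other",
}

def out_of_vocabulary_hazards(report: dict) -> list[str]:
    """Return hazard_type values not in the controlled vocabulary."""
    hazards = report.get("safety", {}).get("hazards", [])
    return [h["hazard_type"] for h in hazards if h["hazard_type"] not in HAZARD_TYPES]

report = {"safety": {"hazards": [
    {"hazard_type": "fall_risk", "severity": "medium"},
    {"hazard_type": "sinkhole", "severity": "high"},  # invalid: not in the vocabulary
]}}
assert out_of_vocabulary_hazards(report) == ["sinkhole"]
```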

## Example Output

Given a drone photograph of a wood-framed house under construction:

```json
{
  "job_type": "construction",
  "asset_type": "house",
  "scene_context": {
    "location_hint": "outdoor",
    "weather_light": "day",
    "viewpoint": "drone"
  },
  "summary": {
    "one_liner": "Aerial view of a wood-framed house under construction; floor deck and wall framing visible, two workers on site.",
    "confidence": 0.88
  },
  "progress": {
    "overall_stage": "structure",
    "stage_confidence": 0.85,
    "progress_percent_estimate": 35,
    "progress_confidence": 0.35,
    "milestones_detected": []
  },
  "safety": {
    "overall_risk_level": "medium",
    "ppe": [
      { "role": "worker", "ppe_item": "helmet", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] },
      { "role": "worker", "ppe_item": "vest", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] }
    ],
    "hazards": [
      { "hazard_type": "fall_risk", "severity": "medium", "confidence": 0.6, "evidence_ids": ["ev_005"] }
    ],
    "control_measures": [
      { "measure": "barrier", "status": "present", "confidence": 0.6, "evidence_ids": ["ev_004"] }
    ]
  },
  "evidence": [
    { "evidence_id": "ev_001", "source": "image", "description": "Wood-framed structure with interior wall framing visible" },
    { "evidence_id": "ev_003", "source": "image", "description": "Two workers wearing high-visibility vests and hard hats" },
    { "evidence_id": "ev_005", "source": "image", "description": "Open edges and elevated framing suggesting fall risk" }
  ]
}
```

*(Truncated for readability -- full output includes all 15 top-level fields)*
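
The evidence-linking system makes findings like the hazard above auditable: each `evidence_ids` entry can be resolved against the `evidence` array. A sketch of that join, using a fragment of the example output:

```python
# Fragment of the example output above
analysis = {
    "safety": {"hazards": [
        {"hazard_type": "fall_risk", "severity": "medium", "evidence_ids": ["ev_005"]},
    ]},
    "evidence": [
        {"evidence_id": "ev_005", "source": "image",
         "description": "Open edges and elevated framing suggesting fall risk"},
    ],
}

# Index evidence entries by ID, then resolve each hazard's references
evidence_by_id = {e["evidence_id"]: e for e in analysis["evidence"]}
for hazard in analysis["safety"]["hazards"]:
    proofs = [evidence_by_id[eid]["description"] for eid in hazard["evidence_ids"]]
    print(f"{hazard['hazard_type']} ({hazard['severity']}): {'; '.join(proofs)}")
```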

## Training Details

| Parameter | Value |
|---|---|
| **Method** | LoRA (Parameter-Efficient Fine-Tuning) |
| **Epochs** | 15 |
| **Effective batch size** | 4 (batch=1, accumulation=4) |
| **Learning rate** | 1e-4 with cosine schedule |
| **Warmup** | 10% of training steps |
| **Weight decay** | 0.01 |
| **Gradient checkpointing** | Enabled |
| **Trainable parameters** | ~1.5% of total model parameters |
| **Framework** | Transformers + PEFT |
| **Hardware** | NVIDIA GPU with BF16 |

## Intended Uses

**Primary use cases:**

- Automated construction progress reporting from site photographs
- Safety compliance auditing (PPE detection, hazard identification)
- Quality control -- detecting visible defects and non-conformities
- Logistics monitoring -- tracking materials and equipment on site
- Environmental impact documentation

**Input requirements:**

- Single construction site image (JPEG, PNG, WebP, BMP)
- Supports ground-level, drone, and fixed-camera viewpoints
- Works best with well-lit daylight images

## Limitations

- **Single-image analysis**: The model analyzes one image at a time; it cannot compare images across time for temporal progress tracking
- **Visible elements only**: Cannot infer hidden structural issues, underground utilities, or elements behind walls
- **No sensory data**: Cannot detect noise levels, dust concentration, or odors from static images
- **Resolution-dependent**: Small or distant objects (e.g., PPE details at long range) may have lower confidence
- **Schema-bound**: Output strictly follows the Horama-BTP v1 schema -- custom fields require the `domain_fields` extension point

## Hardware Requirements

| Setup | VRAM / RAM | Precision | Notes |
|---|---|---|---|
| **NVIDIA GPU** | ~8 GB VRAM | BF16 | Recommended for production |
| **Apple Silicon** | ~8 GB RAM | FP32 | Supported via the MPS backend |
| **CPU** | ~12 GB RAM | FP32 | Functional but slower |
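
The table above translates into a small device/dtype selection helper. A sketch, assuming the model is then loaded with `from_pretrained` as in the Quick Start:

```python
import torch

def pick_device_and_dtype() -> tuple[str, torch.dtype]:
    """Select device and precision per the table: BF16 on NVIDIA GPUs, FP32 elsewhere."""
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16
    if torch.backends.mps.is_available():
        return "mps", torch.float32
    return "cpu", torch.float32

device, dtype = pick_device_and_dtype()
# e.g. model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=dtype).to(device)
```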

## License

**AGPL-3.0** -- This model can be freely used, modified, and redistributed, provided that derivative works remain open source under the same license.

For **commercial or closed-source** usage, please contact [Horama](https://horama.ai) for a commercial license.

## Citation

```bibtex
@misc{horama-btp-2025,
  title  = {Horama-BTP: Vision-Language Model for Construction Site Analysis},
  author = {Horama},
  year   = {2025},
  url    = {https://huggingface.co/Horama/Horama_BTP},
  note   = {Fine-tuned from Qwen2.5-VL-3B-Instruct with LoRA for structured JSON construction analysis}
}
```

---

<div align="center">

**Built by [Horama](https://horama.ai)** | Construction intelligence, powered by vision AI

</div>