---
language:
  - en
  - fr
license: agpl-3.0
library_name: transformers
base_model: Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
tags:
  - construction
  - visual-analysis
  - safety-inspection
  - vlm
  - qwen2_5_vl
  - qwen2-vl
  - lora
  - horama
  - btp
  - structured-output
  - json
  - image-to-json
  - peft
  - safetensors
model-index:
  - name: Horama_BTP
    results: []
---

# Horama-BTP

**Vision-Language Model for Construction Site Analysis**

Image → Structured JSON | Built on Qwen2.5-VL | Fine-tuned with LoRA



Horama-BTP transforms construction site photographs into comprehensive, machine-readable JSON reports covering progress tracking, safety compliance, quality assessment, and logistics.

## Overview

Horama-BTP is a domain-specialized Vision-Language Model (VLM) that converts construction site images into structured JSON analyses. Given a single photograph, the model produces a detailed report spanning 15 analysis dimensions -- from construction progress estimation and safety compliance to quality defects and environmental impact.

The model enforces a strict, validated JSON schema with controlled vocabularies, confidence scores, and an evidence-linking system that traces every observation back to visual evidence in the image.

## Key Capabilities

| Dimension | What the model extracts |
|---|---|
| Progress | Construction stage (earthworks → commissioning), estimated % completion, detected milestones |
| Safety | PPE compliance per worker, hazard identification (9 types), control measures present/missing |
| Quality | Structural defects (cracks, misalignment, corrosion...), non-conformities |
| Observations | Objects, materials, equipment, personnel, vehicles, structural parts with attributes |
| Logistics | Materials on site, equipment status (idle/operating), access constraints |
| Environment | Dust, noise, waste, spills; waste management assessment |
| Evidence | Traceable evidence entries with unique IDs linking every finding to visual proof |

## Architecture

```
Input Image ───┐
               ├──► Qwen2.5-VL-3B-Instruct ──► LoRA-adapted layers ──► Structured JSON
System Prompt ─┘         (backbone)              (r=32, alpha=64)
```

| Component | Details |
|---|---|
| Backbone | Qwen2.5-VL-3B-Instruct -- 3B-parameter multimodal transformer |
| Adaptation | LoRA (Low-Rank Adaptation) applied to all attention and MLP projections |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| LoRA rank | r=32, alpha=64 (2x scaling), dropout=0.1 |
| Precision | BF16 (GPU) / FP32 (CPU/MPS) |
| Output | Deterministic JSON (temperature=0, greedy decoding) |

## Design Principles

- **Schema-first:** Every output is validated against a formal JSON Schema (draft 2020-12) with 15 required top-level fields and controlled enumerations
- **Evidence-linked:** All observations reference `evidence_id` entries -- no claim without visual justification
- **Confidence-scored:** Every detection carries a [0, 1] confidence score for downstream filtering
- **Honest by design:** When something is uncertain or not visible, the model uses `"unknown"`, `null`, or empty arrays -- never hallucinated details
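The schema-first principle can also be enforced client-side before a report enters downstream systems. The following is a minimal, stdlib-only sketch that checks only for the presence of the 15 required top-level fields (full validation against the draft 2020-12 schema would typically use a library such as `jsonschema`; the helper name here is hypothetical):

```python
# The 15 required top-level fields of the Horama-BTP v1 schema
REQUIRED_FIELDS = [
    "job_type", "asset_type", "scene_context", "summary", "progress",
    "work_activities", "observations", "safety", "quality", "logistics",
    "environment", "evidence", "unknown", "domain_fields", "metadata",
]

def check_required_fields(analysis: dict) -> list:
    """Return the names of required top-level fields missing from a report."""
    return [field for field in REQUIRED_FIELDS if field not in analysis]

# A deliberately incomplete report is flagged
partial = {"job_type": "construction", "asset_type": "house"}
missing = check_required_fields(partial)
print(f"{len(missing)} required fields missing")  # → 13 required fields missing
```

A report that passes this presence check can still fail the full schema (wrong types, out-of-vocabulary enum values), so it is a fast pre-filter, not a replacement for schema validation.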

## Quick Start

```python
import json

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load model and processor
model_id = "Horama/Horama_BTP"

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image
image = Image.open("construction_site.jpg").convert("RGB")

# System prompt -- instructs the model to output Horama-BTP v1 JSON
system_prompt = """You are Horama-BTP v1. Analyze construction site images. Output ONLY valid JSON. No text before/after.
CRITICAL RULES:
1. ONLY describe what you can CLEARLY SEE in the image
2. If you cannot see something -> use empty array [] or "unknown"
3. Output must follow the Horama-BTP v1 JSON schema exactly"""

user_prompt = "Analyze this construction site image and return the Horama-BTP v1 JSON output."

# Prepare messages
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_prompt},
        ],
    },
]

# Generate
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt
generated = output[0][inputs.input_ids.shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)

# Extract the JSON object from the response: first "{" to last "}"
json_start = result.find("{")
json_end = result.rfind("}") + 1
analysis = json.loads(result[json_start:json_end])

print(json.dumps(analysis, indent=2))
```
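The slice-based extraction in the Quick Start assumes the response contains one well-formed JSON object. A slightly more defensive sketch (a hypothetical helper, not part of the model card's API) uses the stdlib's `json.JSONDecoder.raw_decode`, which parses exactly one JSON value and therefore tolerates any trailing text the model might emit:

```python
import json

def extract_json(text: str) -> dict:
    """Parse the first complete JSON object found in a model response.

    raw_decode() consumes exactly one JSON value starting at the given
    index, so text after the closing brace is ignored instead of
    breaking parsing.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object in response")
    obj, _end = decoder.raw_decode(text, start)
    return obj

# Works even if the model wraps the JSON in stray text
noisy = 'Here is the report: {"job_type": "construction", "summary": {"confidence": 0.9}} Done.'
print(extract_json(noisy)["job_type"])  # → construction
```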

## Output Schema

The model outputs a single JSON object with 15 required top-level fields:

```
{
  "job_type":        "construction" | "renovation" | "infrastructure" | "unknown",
  "asset_type":      "house" | "building" | "road" | "bridge" | "tunnel" | "site" | "unknown",
  "scene_context":   { location_hint, weather_light, viewpoint },
  "summary":         { one_liner, confidence },
  "progress":        { overall_stage, stage_confidence, progress_percent_estimate, milestones_detected },
  "work_activities": [{ activity, status, confidence, evidence_ids }],
  "observations":    [{ type, label, attributes, confidence, evidence_ids }],
  "safety":          { overall_risk_level, ppe[], hazards[], control_measures[] },
  "quality":         { issues[], non_conformities[] },
  "logistics":       { materials_on_site[], equipment_on_site[], access_constraints[] },
  "environment":     { impacts[], waste_management },
  "evidence":        [{ evidence_id, source, bbox_xyxy, description }],
  "unknown":         [{ question, why_unknown, needed_input }],
  "domain_fields":   { custom_kpis, lot_breakdown, client_specific },
  "metadata":        { model, version, generated_at }
}
```
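Because uncertain fields may legitimately be `null`, `"unknown"`, or empty arrays, downstream code should read the report defensively and can use the per-detection confidence scores as a filter. A minimal sketch (the helper and threshold are illustrative, not part of the schema):

```python
def high_confidence_observations(analysis: dict, threshold: float = 0.7) -> list:
    """Keep observations whose confidence meets the threshold.

    Fields may be missing, null, or empty ("honest by design"), so every
    access uses a default rather than assuming the key exists.
    """
    observations = analysis.get("observations") or []
    return [o for o in observations if (o.get("confidence") or 0.0) >= threshold]

report = {
    "observations": [
        {"type": "equipment", "label": "excavator", "confidence": 0.92},
        {"type": "hazard", "label": "open trench", "confidence": 0.4},
        {"type": "material", "label": "rebar"},  # no confidence reported
    ]
}
kept = high_confidence_observations(report)
print([o["label"] for o in kept])  # → ['excavator']
```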

## Controlled Vocabularies

The schema enforces controlled enumerations across all categorical fields:

| Field | Allowed values |
|---|---|
| `overall_stage` | `planning`, `earthworks`, `foundations`, `structure`, `envelope`, `mep`, `finishing`, `commissioning`, `unknown` |
| `ppe_item` | `helmet`, `vest`, `gloves`, `goggles`, `harness`, `boots`, `mask`, `other` |
| `hazard_type` | `fall_risk`, `open_trench`, `moving_vehicle`, `electrical`, `fire`, `unstable_load`, `poor_housekeeping`, `restricted_area`, `other` |
| `issue_type` | `crack`, `misalignment`, `water_infiltration`, `corrosion`, `spalling`, `poor_finish`, `missing_component`, `rework`, `other` |
| `observation_type` | `object`, `material`, `equipment`, `signage`, `defect`, `hazard`, `personnel`, `vehicle`, `structure_part`, `other` |
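Full schema validation would catch out-of-vocabulary values automatically; for quick checks, membership testing against the vocabularies above is enough. A stdlib sketch covering two of the enumerations (copied from the table; the helper name is illustrative):

```python
# Controlled vocabularies from the Horama-BTP v1 schema (subset shown)
ALLOWED = {
    "overall_stage": {"planning", "earthworks", "foundations", "structure",
                      "envelope", "mep", "finishing", "commissioning", "unknown"},
    "hazard_type": {"fall_risk", "open_trench", "moving_vehicle", "electrical",
                    "fire", "unstable_load", "poor_housekeeping",
                    "restricted_area", "other"},
}

def check_enum(field: str, value: str) -> bool:
    """True if the value belongs to the controlled vocabulary for the field."""
    return value in ALLOWED.get(field, set())

print(check_enum("overall_stage", "structure"))  # → True
print(check_enum("hazard_type", "landslide"))    # → False (not an allowed value)
```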

## Example Output

Given a drone photograph of a wood-framed house under construction:

```json
{
  "job_type": "construction",
  "asset_type": "house",
  "scene_context": {
    "location_hint": "outdoor",
    "weather_light": "day",
    "viewpoint": "drone"
  },
  "summary": {
    "one_liner": "Aerial view of a wood-framed house under construction; floor deck and wall framing visible, two workers on site.",
    "confidence": 0.88
  },
  "progress": {
    "overall_stage": "structure",
    "stage_confidence": 0.85,
    "progress_percent_estimate": 35,
    "progress_confidence": 0.35,
    "milestones_detected": []
  },
  "safety": {
    "overall_risk_level": "medium",
    "ppe": [
      { "role": "worker", "ppe_item": "helmet", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] },
      { "role": "worker", "ppe_item": "vest", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] }
    ],
    "hazards": [
      { "hazard_type": "fall_risk", "severity": "medium", "confidence": 0.6, "evidence_ids": ["ev_005"] }
    ],
    "control_measures": [
      { "measure": "barrier", "status": "present", "confidence": 0.6, "evidence_ids": ["ev_004"] }
    ]
  },
  "evidence": [
    { "evidence_id": "ev_001", "source": "image", "description": "Wood-framed structure with interior wall framing visible" },
    { "evidence_id": "ev_003", "source": "image", "description": "Two workers wearing high-visibility vests and hard hats" },
    { "evidence_id": "ev_005", "source": "image", "description": "Open edges and elevated framing suggesting fall risk" }
  ]
}
```

*(Truncated for readability -- full output includes all 15 top-level fields)*
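The evidence-linking system can be walked programmatically: each `evidence_ids` entry in a finding resolves to a record in the top-level `evidence` array. A minimal sketch using a fragment of the example above:

```python
report = {
    "safety": {
        "hazards": [
            {"hazard_type": "fall_risk", "severity": "medium",
             "confidence": 0.6, "evidence_ids": ["ev_005"]},
        ],
    },
    "evidence": [
        {"evidence_id": "ev_003", "source": "image",
         "description": "Two workers wearing high-visibility vests and hard hats"},
        {"evidence_id": "ev_005", "source": "image",
         "description": "Open edges and elevated framing suggesting fall risk"},
    ],
}

# Index the evidence once, then resolve each finding's IDs to descriptions
evidence_index = {e["evidence_id"]: e for e in report.get("evidence", [])}
for hazard in report["safety"]["hazards"]:
    proofs = [evidence_index[eid]["description"] for eid in hazard["evidence_ids"]]
    print(hazard["hazard_type"], "->", proofs)
# → fall_risk -> ['Open edges and elevated framing suggesting fall risk']
```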

## Training Details

| Parameter | Value |
|---|---|
| Method | LoRA (Parameter-Efficient Fine-Tuning) |
| Epochs | 15 |
| Effective batch size | 4 (batch=1, accumulation=4) |
| Learning rate | 1e-4 with cosine schedule |
| Warmup | 10% of training steps |
| Weight decay | 0.01 |
| Gradient checkpointing | Enabled |
| Trainable parameters | ~1.5% of total model parameters |
| Framework | Transformers + PEFT |
| Hardware | NVIDIA GPU with BF16 |
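The LoRA hyperparameters above map directly onto keyword arguments for PEFT's `LoraConfig`. A sketch of the mapping (the training script itself is not published, and `task_type` here is an assumption):

```python
# LoRA hyperparameters from the table above, arranged as the keyword
# arguments one would pass to peft.LoraConfig
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.1,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "task_type": "CAUSAL_LM",  # assumption -- not stated in this card
}

# LoRA scales each adapter update by alpha / r, hence the "2x scaling" noted above
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # → 2.0
```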

## Intended Uses

**Primary use cases:**

- Automated construction progress reporting from site photographs
- Safety compliance auditing (PPE detection, hazard identification)
- Quality control -- detecting visible defects and non-conformities
- Logistics monitoring -- tracking materials and equipment on site
- Environmental impact documentation

**Input requirements:**

- Single construction site image (JPEG, PNG, WebP, BMP)
- Supports ground-level, drone, and fixed-camera viewpoints
- Works best on well-lit daylight images

## Limitations

- **Single-image analysis:** The model analyzes one image at a time; it cannot compare images across time for temporal progress tracking
- **Visible elements only:** Cannot infer hidden structural issues, underground utilities, or elements behind walls
- **No sensory data:** Cannot detect noise levels, dust concentration, or odors from static images
- **Resolution-dependent:** Small or distant objects (e.g., PPE details at long range) may have lower confidence
- **Schema-bound:** Output strictly follows the Horama-BTP v1 schema -- custom fields require the `domain_fields` extension point

## Hardware Requirements

| Setup | VRAM / RAM | Precision | Notes |
|---|---|---|---|
| NVIDIA GPU | ~8 GB VRAM | BF16 | Recommended for production |
| Apple Silicon | ~8 GB RAM | FP32 | Supported via MPS backend |
| CPU | ~12 GB RAM | FP32 | Functional but slower |
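The precision column reduces to a simple backend-to-dtype rule: BF16 on CUDA, FP32 everywhere else. A small sketch of that decision as a pure function (in practice the result would be passed to `from_pretrained` as a `torch.dtype` such as `torch.bfloat16`):

```python
def pick_precision(backend: str) -> str:
    """Map a compute backend to the precision recommended in the table above.

    BF16 is used on NVIDIA GPUs; CPU and Apple's MPS backend fall back to FP32.
    """
    return "bf16" if backend == "cuda" else "fp32"

for backend in ("cuda", "mps", "cpu"):
    print(backend, "->", pick_precision(backend))
# prints one line per backend, e.g. "cuda -> bf16"
```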

## License

**AGPL-3.0** -- This model may be freely used, modified, and redistributed, as long as derivative works remain open-source under the same license.

For commercial or closed-source usage, please contact Horama for a commercial license.

## Citation

```bibtex
@misc{horama-btp-2025,
  title   = {Horama-BTP: Vision-Language Model for Construction Site Analysis},
  author  = {Horama},
  year    = {2025},
  url     = {https://huggingface.co/Horama/Horama_BTP},
  note    = {Fine-tuned from Qwen2.5-VL-3B-Instruct with LoRA for structured JSON construction analysis}
}
```

Built by Horama | Construction intelligence, powered by vision AI