---
language:
  - en
  - fr
license: agpl-3.0
library_name: transformers
base_model: Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
tags:
  - construction
  - visual-analysis
  - safety-inspection
  - vlm
  - qwen2_5_vl
  - qwen2-vl
  - lora
  - horama
  - btp
  - structured-output
  - json
  - image-to-json
  - peft
  - safetensors
model-index:
  - name: Horama_BTP
    results: []
---

# Horama-BTP

**Vision-Language Model for Construction Site Analysis**

Image → Structured JSON | Built on Qwen2.5-VL | Fine-tuned with LoRA



Horama-BTP transforms construction site photographs into comprehensive, machine-readable JSON reports covering progress tracking, safety compliance, quality assessment, and logistics.

## Overview

Horama-BTP is a domain-specialized Vision-Language Model (VLM) that converts construction site images into structured JSON analyses. Given a single photograph, the model produces a detailed report spanning 15 analysis dimensions -- from construction progress estimation and safety compliance to quality defects and environmental impact.

The model enforces a strict, validated JSON schema with controlled vocabularies, confidence scores, and an evidence-linking system that traces every observation back to visual evidence in the image.

## Key Capabilities

| Dimension | What the model extracts |
|---|---|
| Progress | Construction stage (earthworks → commissioning), estimated % completion, detected milestones |
| Safety | PPE compliance per worker, hazard identification (9 types), control measures present/missing |
| Quality | Structural defects (cracks, misalignment, corrosion...), non-conformities |
| Observations | Objects, materials, equipment, personnel, vehicles, structural parts with attributes |
| Logistics | Materials on site, equipment status (idle/operating), access constraints |
| Environment | Dust, noise, waste, spills; waste management assessment |
| Evidence | Traceable evidence entries with unique IDs linking every finding to visual proof |

## Architecture

```
Input Image ───┐
               ├──► Qwen2.5-VL-3B-Instruct ──► LoRA-adapted layers ──► Structured JSON
System Prompt ─┘         (backbone)              (r=32, alpha=64)
```

| Component | Details |
|---|---|
| Backbone | Qwen2.5-VL-3B-Instruct -- 3B-parameter multimodal transformer |
| Adaptation | LoRA (Low-Rank Adaptation) applied to all attention and MLP projections |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| LoRA rank | r=32, alpha=64 (2x scaling), dropout=0.1 |
| Precision | BF16 (GPU) / FP32 (CPU/MPS) |
| Output | Deterministic JSON (temperature=0, greedy decoding) |

## Design Principles

- **Schema-first:** Every output is validated against a formal JSON Schema (draft 2020-12) with 15 required top-level fields and controlled enumerations
- **Evidence-linked:** All observations reference `evidence_id` entries -- no claim without visual justification
- **Confidence-scored:** Every detection carries a [0, 1] confidence score for downstream filtering
- **Honest by design:** When something is uncertain or not visible, the model uses `"unknown"`, `null`, or empty arrays -- never hallucinated details
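The schema-first principle can also be enforced client-side before a report enters downstream systems. The following is a minimal, stdlib-only sketch that checks only for the presence of the 15 required top-level fields (full validation against the draft 2020-12 schema would typically use a library such as `jsonschema`; the helper name here is hypothetical):

```python
# The 15 required top-level fields of the Horama-BTP v1 schema
REQUIRED_FIELDS = [
    "job_type", "asset_type", "scene_context", "summary", "progress",
    "work_activities", "observations", "safety", "quality", "logistics",
    "environment", "evidence", "unknown", "domain_fields", "metadata",
]

def check_required_fields(analysis: dict) -> list:
    """Return the names of required top-level fields missing from a report."""
    return [field for field in REQUIRED_FIELDS if field not in analysis]

# A deliberately incomplete report is flagged
partial = {"job_type": "construction", "asset_type": "house"}
missing = check_required_fields(partial)
print(f"{len(missing)} required fields missing")  # → 13 required fields missing
```

A report that passes this presence check can still fail the full schema (wrong types, out-of-vocabulary enum values), so it is a fast pre-filter, not a replacement for schema validation.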

## Quick Start

```python
import json

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load model and processor
model_id = "Horama/Horama_BTP"

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image
image = Image.open("construction_site.jpg").convert("RGB")

# System prompt -- instructs the model to output Horama-BTP v1 JSON
system_prompt = """You are Horama-BTP v1. Analyze construction site images. Output ONLY valid JSON. No text before/after.
CRITICAL RULES:
1. ONLY describe what you can CLEARLY SEE in the image
2. If you cannot see something -> use empty array [] or "unknown"
3. Output must follow the Horama-BTP v1 JSON schema exactly"""

user_prompt = "Analyze this construction site image and return the Horama-BTP v1 JSON output."

# Prepare messages
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_prompt},
        ],
    },
]

# Generate
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt
generated = output[0][inputs.input_ids.shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)

# Extract the JSON object from the response: first "{" to last "}"
json_start = result.find("{")
json_end = result.rfind("}") + 1
analysis = json.loads(result[json_start:json_end])

print(json.dumps(analysis, indent=2))
```
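The slice-based extraction in the Quick Start assumes the response contains one well-formed JSON object. A slightly more defensive sketch (a hypothetical helper, not part of the model card's API) uses the stdlib's `json.JSONDecoder.raw_decode`, which parses exactly one JSON value and therefore tolerates any trailing text the model might emit:

```python
import json

def extract_json(text: str) -> dict:
    """Parse the first complete JSON object found in a model response.

    raw_decode() consumes exactly one JSON value starting at the given
    index, so text after the closing brace is ignored instead of
    breaking parsing.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object in response")
    obj, _end = decoder.raw_decode(text, start)
    return obj

# Works even if the model wraps the JSON in stray text
noisy = 'Here is the report: {"job_type": "construction", "summary": {"confidence": 0.9}} Done.'
print(extract_json(noisy)["job_type"])  # → construction
```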

## Output Schema

The model outputs a single JSON object with 15 required top-level fields:

```
{
  "job_type":        "construction" | "renovation" | "infrastructure" | "unknown",
  "asset_type":      "house" | "building" | "road" | "bridge" | "tunnel" | "site" | "unknown",
  "scene_context":   { location_hint, weather_light, viewpoint },
  "summary":         { one_liner, confidence },
  "progress":        { overall_stage, stage_confidence, progress_percent_estimate, milestones_detected },
  "work_activities": [{ activity, status, confidence, evidence_ids }],
  "observations":    [{ type, label, attributes, confidence, evidence_ids }],
  "safety":          { overall_risk_level, ppe[], hazards[], control_measures[] },
  "quality":         { issues[], non_conformities[] },
  "logistics":       { materials_on_site[], equipment_on_site[], access_constraints[] },
  "environment":     { impacts[], waste_management },
  "evidence":        [{ evidence_id, source, bbox_xyxy, description }],
  "unknown":         [{ question, why_unknown, needed_input }],
  "domain_fields":   { custom_kpis, lot_breakdown, client_specific },
  "metadata":        { model, version, generated_at }
}
```
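Because uncertain fields may legitimately be `null`, `"unknown"`, or empty arrays, downstream code should read the report defensively and can use the per-detection confidence scores as a filter. A minimal sketch (the helper and threshold are illustrative, not part of the schema):

```python
def high_confidence_observations(analysis: dict, threshold: float = 0.7) -> list:
    """Keep observations whose confidence meets the threshold.

    Fields may be missing, null, or empty ("honest by design"), so every
    access uses a default rather than assuming the key exists.
    """
    observations = analysis.get("observations") or []
    return [o for o in observations if (o.get("confidence") or 0.0) >= threshold]

report = {
    "observations": [
        {"type": "equipment", "label": "excavator", "confidence": 0.92},
        {"type": "hazard", "label": "open trench", "confidence": 0.4},
        {"type": "material", "label": "rebar"},  # no confidence reported
    ]
}
kept = high_confidence_observations(report)
print([o["label"] for o in kept])  # → ['excavator']
```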

## Controlled Vocabularies

The schema enforces controlled enumerations across all categorical fields:

| Field | Allowed values |
|---|---|
| `overall_stage` | `planning`, `earthworks`, `foundations`, `structure`, `envelope`, `mep`, `finishing`, `commissioning`, `unknown` |
| `ppe_item` | `helmet`, `vest`, `gloves`, `goggles`, `harness`, `boots`, `mask`, `other` |
| `hazard_type` | `fall_risk`, `open_trench`, `moving_vehicle`, `electrical`, `fire`, `unstable_load`, `poor_housekeeping`, `restricted_area`, `other` |
| `issue_type` | `crack`, `misalignment`, `water_infiltration`, `corrosion`, `spalling`, `poor_finish`, `missing_component`, `rework`, `other` |
| `observation_type` | `object`, `material`, `equipment`, `signage`, `defect`, `hazard`, `personnel`, `vehicle`, `structure_part`, `other` |
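Full schema validation would catch out-of-vocabulary values automatically; for quick checks, membership testing against the vocabularies above is enough. A stdlib sketch covering two of the enumerations (copied from the table; the helper name is illustrative):

```python
# Controlled vocabularies from the Horama-BTP v1 schema (subset shown)
ALLOWED = {
    "overall_stage": {"planning", "earthworks", "foundations", "structure",
                      "envelope", "mep", "finishing", "commissioning", "unknown"},
    "hazard_type": {"fall_risk", "open_trench", "moving_vehicle", "electrical",
                    "fire", "unstable_load", "poor_housekeeping",
                    "restricted_area", "other"},
}

def check_enum(field: str, value: str) -> bool:
    """True if the value belongs to the controlled vocabulary for the field."""
    return value in ALLOWED.get(field, set())

print(check_enum("overall_stage", "structure"))  # → True
print(check_enum("hazard_type", "landslide"))    # → False (not an allowed value)
```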

## Example Output

Given a drone photograph of a wood-framed house under construction:

```json
{
  "job_type": "construction",
  "asset_type": "house",
  "scene_context": {
    "location_hint": "outdoor",
    "weather_light": "day",
    "viewpoint": "drone"
  },
  "summary": {
    "one_liner": "Aerial view of a wood-framed house under construction; floor deck and wall framing visible, two workers on site.",
    "confidence": 0.88
  },
  "progress": {
    "overall_stage": "structure",
    "stage_confidence": 0.85,
    "progress_percent_estimate": 35,
    "progress_confidence": 0.35,
    "milestones_detected": []
  },
  "safety": {
    "overall_risk_level": "medium",
    "ppe": [
      { "role": "worker", "ppe_item": "helmet", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] },
      { "role": "worker", "ppe_item": "vest", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] }
    ],
    "hazards": [
      { "hazard_type": "fall_risk", "severity": "medium", "confidence": 0.6, "evidence_ids": ["ev_005"] }
    ],
    "control_measures": [
      { "measure": "barrier", "status": "present", "confidence": 0.6, "evidence_ids": ["ev_004"] }
    ]
  },
  "evidence": [
    { "evidence_id": "ev_001", "source": "image", "description": "Wood-framed structure with interior wall framing visible" },
    { "evidence_id": "ev_003", "source": "image", "description": "Two workers wearing high-visibility vests and hard hats" },
    { "evidence_id": "ev_005", "source": "image", "description": "Open edges and elevated framing suggesting fall risk" }
  ]
}
```

*(Truncated for readability -- full output includes all 15 top-level fields)*
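The evidence-linking system can be walked programmatically: each `evidence_ids` entry in a finding resolves to a record in the top-level `evidence` array. A minimal sketch using a fragment of the example above:

```python
report = {
    "safety": {
        "hazards": [
            {"hazard_type": "fall_risk", "severity": "medium",
             "confidence": 0.6, "evidence_ids": ["ev_005"]},
        ],
    },
    "evidence": [
        {"evidence_id": "ev_003", "source": "image",
         "description": "Two workers wearing high-visibility vests and hard hats"},
        {"evidence_id": "ev_005", "source": "image",
         "description": "Open edges and elevated framing suggesting fall risk"},
    ],
}

# Index the evidence once, then resolve each finding's IDs to descriptions
evidence_index = {e["evidence_id"]: e for e in report.get("evidence", [])}
for hazard in report["safety"]["hazards"]:
    proofs = [evidence_index[eid]["description"] for eid in hazard["evidence_ids"]]
    print(hazard["hazard_type"], "->", proofs)
# → fall_risk -> ['Open edges and elevated framing suggesting fall risk']
```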

## Training Details

| Parameter | Value |
|---|---|
| Method | LoRA (Parameter-Efficient Fine-Tuning) |
| Epochs | 15 |
| Effective batch size | 4 (batch=1, accumulation=4) |
| Learning rate | 1e-4 with cosine schedule |
| Warmup | 10% of training steps |
| Weight decay | 0.01 |
| Gradient checkpointing | Enabled |
| Trainable parameters | ~1.5% of total model parameters |
| Framework | Transformers + PEFT |
| Hardware | NVIDIA GPU with BF16 |
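The LoRA hyperparameters above map directly onto keyword arguments for PEFT's `LoraConfig`. A sketch of the mapping (the training script itself is not published, and `task_type` here is an assumption):

```python
# LoRA hyperparameters from the table above, arranged as the keyword
# arguments one would pass to peft.LoraConfig
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.1,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "task_type": "CAUSAL_LM",  # assumption -- not stated in this card
}

# LoRA scales each adapter update by alpha / r, hence the "2x scaling" noted above
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # → 2.0
```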

## Intended Uses

**Primary use cases:**

- Automated construction progress reporting from site photographs
- Safety compliance auditing (PPE detection, hazard identification)
- Quality control -- detecting visible defects and non-conformities
- Logistics monitoring -- tracking materials and equipment on site
- Environmental impact documentation

**Input requirements:**

- Single construction site image (JPEG, PNG, WebP, BMP)
- Supports ground-level, drone, and fixed-camera viewpoints
- Works best on well-lit daylight images

## Limitations

- **Single-image analysis:** The model analyzes one image at a time; it cannot compare images across time for temporal progress tracking
- **Visible elements only:** Cannot infer hidden structural issues, underground utilities, or elements behind walls
- **No sensory data:** Cannot detect noise levels, dust concentration, or odors from static images
- **Resolution-dependent:** Small or distant objects (e.g., PPE details at long range) may have lower confidence
- **Schema-bound:** Output strictly follows the Horama-BTP v1 schema -- custom fields require the `domain_fields` extension point

## Hardware Requirements

| Setup | VRAM / RAM | Precision | Notes |
|---|---|---|---|
| NVIDIA GPU | ~8 GB VRAM | BF16 | Recommended for production |
| Apple Silicon | ~8 GB RAM | FP32 | Supported via MPS backend |
| CPU | ~12 GB RAM | FP32 | Functional but slower |
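The precision column reduces to a simple backend-to-dtype rule: BF16 on CUDA, FP32 everywhere else. A small sketch of that decision as a pure function (in practice the result would be passed to `from_pretrained` as a `torch.dtype` such as `torch.bfloat16`):

```python
def pick_precision(backend: str) -> str:
    """Map a compute backend to the precision recommended in the table above.

    BF16 is used on NVIDIA GPUs; CPU and Apple's MPS backend fall back to FP32.
    """
    return "bf16" if backend == "cuda" else "fp32"

for backend in ("cuda", "mps", "cpu"):
    print(backend, "->", pick_precision(backend))
# prints one line per backend, e.g. "cuda -> bf16"
```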

## License

**AGPL-3.0** -- This model may be freely used, modified, and redistributed, as long as derivative works remain open-source under the same license.

For commercial or closed-source usage, please contact Horama for a commercial license.

## Citation

```bibtex
@misc{horama-btp-2025,
  title   = {Horama-BTP: Vision-Language Model for Construction Site Analysis},
  author  = {Horama},
  year    = {2025},
  url     = {https://huggingface.co/Horama/Horama_BTP},
  note    = {Fine-tuned from Qwen2.5-VL-3B-Instruct with LoRA for structured JSON construction analysis}
}
```

Built by Horama | Construction intelligence, powered by vision AI