---
language:
- en
- fr
license: agpl-3.0
library_name: transformers
base_model: Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
tags:
- construction
- visual-analysis
- safety-inspection
- vlm
- qwen2_5_vl
- qwen2-vl
- lora
- horama
- btp
- structured-output
- json
- image-to-json
- peft
- safetensors
model-index:
- name: Horama_BTP
results: []
---
<div align="center">
# HORAMA-BTP
### Vision-Language Model for Construction Site Analysis
**Image &rarr; Structured JSON** | Built on Qwen2.5-VL | Fine-tuned with LoRA
[![Model](https://img.shields.io/badge/Model-3B_params-blue)]()
[![License](https://img.shields.io/badge/License-AGPL--3.0-green)]()
[![Format](https://img.shields.io/badge/Output-Structured_JSON-orange)]()
[![Framework](https://img.shields.io/badge/Framework-Transformers-yellow)]()
---
*Horama-BTP transforms construction site photographs into comprehensive, machine-readable JSON reports covering progress tracking, safety compliance, quality assessment, and logistics.*
</div>
## Overview
**Horama-BTP** is a domain-specialized Vision-Language Model (VLM) that converts construction site images into structured JSON analyses. Given a single photograph, the model produces a detailed report spanning **15 analysis dimensions** -- from construction progress estimation and safety compliance to quality defects and environmental impact.
The model enforces a strict, validated JSON schema with controlled vocabularies, confidence scores, and an evidence-linking system that traces every observation back to visual evidence in the image.
### Key Capabilities
| Dimension | What the model extracts |
|---|---|
| **Progress** | Construction stage (earthworks &rarr; commissioning), estimated % completion, detected milestones |
| **Safety** | PPE compliance per worker, hazard identification (9 types), control measures present/missing |
| **Quality** | Structural defects (cracks, misalignment, corrosion...), non-conformities |
| **Observations** | Objects, materials, equipment, personnel, vehicles, structural parts with attributes |
| **Logistics** | Materials on site, equipment status (idle/operating), access constraints |
| **Environment** | Dust, noise, waste, spills; waste management assessment |
| **Evidence** | Traceable evidence entries with unique IDs linking every finding to visual proof |
## Architecture
```
Input Image ───┐
               ├──► Qwen2.5-VL-3B-Instruct ──► LoRA-adapted layers ──► Structured JSON
System Prompt ─┘          (backbone)              (r=32, alpha=64)
```
| Component | Details |
|---|---|
| **Backbone** | [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) -- 3B parameter multimodal transformer |
| **Adaptation** | LoRA (Low-Rank Adaptation) applied to all attention and MLP projections |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **LoRA Rank** | r=32, alpha=64 (2x scaling), dropout=0.1 |
| **Precision** | BF16 (GPU) / FP32 (CPU/MPS) |
| **Output** | Deterministic JSON (temperature=0, greedy decoding) |
### Design Principles
- **Schema-first**: Every output is validated against a formal JSON Schema (draft 2020-12) with 15 required top-level fields and controlled enumerations
- **Evidence-linked**: All observations reference `evidence_id` entries -- no claim without visual justification
- **Confidence-scored**: Every detection carries a `[0, 1]` confidence score for downstream filtering
- **Honest by design**: When something is uncertain or not visible, the model uses `"unknown"`, `null`, or empty arrays -- never hallucinated details
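
The evidence-linking contract is straightforward to enforce downstream. A minimal sketch (helper name and sample report are illustrative, not part of the model's API; only two evidence-bearing sections are checked for brevity):

```python
def check_evidence_links(analysis: dict) -> list:
    """Return evidence_ids referenced by findings but absent from the evidence list."""
    known = {e["evidence_id"] for e in analysis.get("evidence", [])}
    missing = []
    # Checks two evidence-bearing sections; extend to safety/quality as needed
    for section in ("work_activities", "observations"):
        for item in analysis.get(section, []):
            for eid in item.get("evidence_ids", []):
                if eid not in known:
                    missing.append(eid)
    return missing

# Hypothetical truncated report for illustration
report = {
    "work_activities": [],
    "observations": [
        {"type": "equipment", "label": "crane", "confidence": 0.9,
         "evidence_ids": ["ev_001", "ev_002"]},
    ],
    "evidence": [
        {"evidence_id": "ev_001", "source": "image", "description": "Tower crane"},
    ],
}
print(check_evidence_links(report))  # ['ev_002']
```

A non-empty result means a finding lacks visual justification and can be rejected or flagged before the report is stored.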
## Quick Start
```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
# Load model and processor
model_id = "Horama/Horama_BTP"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Load image
image = Image.open("construction_site.jpg").convert("RGB")
# System prompt -- instructs the model to output Horama-BTP v1 JSON
system_prompt = """You are Horama-BTP v1. Analyze construction site images. Output ONLY valid JSON. No text before/after.
CRITICAL RULES:
1. ONLY describe what you can CLEARLY SEE in the image
2. If you cannot see something -> use empty array [] or "unknown"
3. Output must follow the Horama-BTP v1 JSON schema exactly"""
user_prompt = "Analyze this construction site image and return the Horama-BTP v1 JSON output."
# Prepare messages
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_prompt},
        ],
    },
]
# Generate (greedy decoding for deterministic output)
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt
generated = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)

# Extract the JSON object from the response
import json
json_start = result.find("{")
json_end = result.rfind("}") + 1
analysis = json.loads(result[json_start:json_end])
print(json.dumps(analysis, indent=2))
```
## Output Schema
The model outputs a single JSON object with **15 required top-level fields**:
```
{
  "job_type": "construction" | "renovation" | "infrastructure" | "unknown",
  "asset_type": "house" | "building" | "road" | "bridge" | "tunnel" | "site" | "unknown",
  "scene_context": { location_hint, weather_light, viewpoint },
  "summary": { one_liner, confidence },
  "progress": { overall_stage, stage_confidence, progress_percent_estimate, milestones_detected },
  "work_activities": [{ activity, status, confidence, evidence_ids }],
  "observations": [{ type, label, attributes, confidence, evidence_ids }],
  "safety": { overall_risk_level, ppe[], hazards[], control_measures[] },
  "quality": { issues[], non_conformities[] },
  "logistics": { materials_on_site[], equipment_on_site[], access_constraints[] },
  "environment": { impacts[], waste_management },
  "evidence": [{ evidence_id, source, bbox_xyxy, description }],
  "unknown": [{ question, why_unknown, needed_input }],
  "domain_fields": { custom_kpis, lot_breakdown, client_specific },
  "metadata": { model, version, generated_at }
}
```
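
The required-field list above can be checked before a report is accepted. A minimal stdlib sketch (helper name is illustrative; for full validation use the formal JSON Schema mentioned earlier):

```python
# The 15 required top-level fields from the schema listing above
REQUIRED_FIELDS = {
    "job_type", "asset_type", "scene_context", "summary", "progress",
    "work_activities", "observations", "safety", "quality", "logistics",
    "environment", "evidence", "unknown", "domain_fields", "metadata",
}

def missing_fields(analysis: dict) -> set:
    """Return the top-level fields a report is missing."""
    return REQUIRED_FIELDS - analysis.keys()

# A report with only one field is missing the other fourteen
print(sorted(missing_fields({"job_type": "construction"})))
```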
### Controlled Vocabularies
The schema enforces controlled enumerations across all categorical fields:
| Field | Allowed values |
|---|---|
| `overall_stage` | `planning`, `earthworks`, `foundations`, `structure`, `envelope`, `mep`, `finishing`, `commissioning`, `unknown` |
| `ppe_item` | `helmet`, `vest`, `gloves`, `goggles`, `harness`, `boots`, `mask`, `other` |
| `hazard_type` | `fall_risk`, `open_trench`, `moving_vehicle`, `electrical`, `fire`, `unstable_load`, `poor_housekeeping`, `restricted_area`, `other` |
| `issue_type` | `crack`, `misalignment`, `water_infiltration`, `corrosion`, `spalling`, `poor_finish`, `missing_component`, `rework`, `other` |
| `observation_type` | `object`, `material`, `equipment`, `signage`, `defect`, `hazard`, `personnel`, `vehicle`, `structure_part`, `other` |
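
These enumerations make outputs cheap to validate before ingestion. A minimal sketch covering two of the vocabularies (helper name and error format are illustrative):

```python
# Controlled vocabularies copied from the table above
ALLOWED_STAGES = {"planning", "earthworks", "foundations", "structure",
                  "envelope", "mep", "finishing", "commissioning", "unknown"}
ALLOWED_HAZARD_TYPES = {"fall_risk", "open_trench", "moving_vehicle", "electrical",
                        "fire", "unstable_load", "poor_housekeeping",
                        "restricted_area", "other"}

def vocab_errors(analysis: dict) -> list:
    """Flag categorical values that fall outside the controlled vocabularies."""
    errors = []
    stage = analysis.get("progress", {}).get("overall_stage", "unknown")
    if stage not in ALLOWED_STAGES:
        errors.append(f"overall_stage: {stage!r}")
    for hazard in analysis.get("safety", {}).get("hazards", []):
        if hazard.get("hazard_type") not in ALLOWED_HAZARD_TYPES:
            errors.append(f"hazard_type: {hazard.get('hazard_type')!r}")
    return errors

print(vocab_errors({"progress": {"overall_stage": "demolition"}}))  # ['overall_stage: \'demolition\'']
```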
## Example Output
Given a drone photograph of a wood-framed house under construction:
```json
{
  "job_type": "construction",
  "asset_type": "house",
  "scene_context": {
    "location_hint": "outdoor",
    "weather_light": "day",
    "viewpoint": "drone"
  },
  "summary": {
    "one_liner": "Aerial view of a wood-framed house under construction; floor deck and wall framing visible, two workers on site.",
    "confidence": 0.88
  },
  "progress": {
    "overall_stage": "structure",
    "stage_confidence": 0.85,
    "progress_percent_estimate": 35,
    "progress_confidence": 0.35,
    "milestones_detected": []
  },
  "safety": {
    "overall_risk_level": "medium",
    "ppe": [
      { "role": "worker", "ppe_item": "helmet", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] },
      { "role": "worker", "ppe_item": "vest", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] }
    ],
    "hazards": [
      { "hazard_type": "fall_risk", "severity": "medium", "confidence": 0.6, "evidence_ids": ["ev_005"] }
    ],
    "control_measures": [
      { "measure": "barrier", "status": "present", "confidence": 0.6, "evidence_ids": ["ev_004"] }
    ]
  },
  "evidence": [
    { "evidence_id": "ev_001", "source": "image", "description": "Wood-framed structure with interior wall framing visible" },
    { "evidence_id": "ev_003", "source": "image", "description": "Two workers wearing high-visibility vests and hard hats" },
    { "evidence_id": "ev_005", "source": "image", "description": "Open edges and elevated framing suggesting fall risk" }
  ]
}
```
*(Truncated for readability -- full output includes all 15 top-level fields)*
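
The per-finding confidence scores support threshold-based filtering downstream, as noted under Design Principles. A minimal illustrative sketch:

```python
def filter_by_confidence(findings: list, threshold: float = 0.7) -> list:
    """Keep only findings whose confidence score meets the threshold."""
    return [f for f in findings if f.get("confidence", 0.0) >= threshold]

# PPE entries shaped like the example output above
ppe = [
    {"ppe_item": "helmet", "status": "compliant", "confidence": 0.8},
    {"ppe_item": "goggles", "status": "unknown", "confidence": 0.4},
]
print(filter_by_confidence(ppe))  # keeps only the helmet entry
```

The right threshold depends on the use case: safety audits may prefer a low threshold (recall), while automated reporting may prefer a high one (precision).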
## Training Details
| Parameter | Value |
|---|---|
| **Method** | LoRA (Parameter-Efficient Fine-Tuning) |
| **Epochs** | 15 |
| **Effective batch size** | 4 (batch=1, accumulation=4) |
| **Learning rate** | 1e-4 with cosine schedule |
| **Warmup** | 10% of training steps |
| **Weight decay** | 0.01 |
| **Gradient checkpointing** | Enabled |
| **Trainable parameters** | ~1.5% of total model parameters |
| **Framework** | Transformers + PEFT |
| **Hardware** | NVIDIA GPU with BF16 |
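
As a back-of-envelope check on the trainable-parameter figure: for a weight of shape d_out × d_in, LoRA adds two low-rank matrices A (r × d_in) and B (d_out × r), i.e. r · (d_in + d_out) extra parameters per adapted projection. A sketch with illustrative layer shapes (not the actual Qwen2.5-VL-3B dimensions):

```python
def lora_extra_params(module_shapes: list, r: int = 32) -> int:
    """Total extra trainable parameters LoRA adds across the given
    (d_out, d_in) weight shapes: r * (d_in + d_out) per module."""
    return sum(r * (d_in + d_out) for (d_out, d_in) in module_shapes)

# Illustrative shapes for one attention projection and one MLP projection
shapes = [(2048, 2048), (11008, 2048)]
print(lora_extra_params(shapes, r=32))  # 548864
```

Summing this over all adapted projections in every layer, and dividing by the backbone's ~3B parameters, yields a small single-digit percentage, consistent with the ~1.5% reported above.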
## Intended Uses
**Primary use cases:**
- Automated construction progress reporting from site photographs
- Safety compliance auditing (PPE detection, hazard identification)
- Quality control -- detecting visible defects and non-conformities
- Logistics monitoring -- tracking materials and equipment on site
- Environmental impact documentation
**Input requirements:**
- Single construction site image (JPEG, PNG, WebP, BMP)
- Supports ground-level, drone, and fixed-camera viewpoints
- Works best with daylight, well-lit images
## Limitations
- **Single-image analysis**: The model analyzes one image at a time; it cannot compare images across time for temporal progress tracking
- **Visible elements only**: Cannot infer hidden structural issues, underground utilities, or elements behind walls
- **No sensory data**: Cannot detect noise levels, dust concentration, or odors from static images
- **Resolution-dependent**: Small or distant objects (e.g., PPE details at long range) may have lower confidence
- **Schema-bound**: Output strictly follows the Horama-BTP v1 schema -- custom fields require the `domain_fields` extension point
## Hardware Requirements
| Setup | VRAM / RAM | Precision | Notes |
|---|---|---|---|
| **NVIDIA GPU** | ~8 GB VRAM | BF16 | Recommended for production |
| **Apple Silicon** | ~8 GB RAM | FP32 | Supported via MPS backend |
| **CPU** | ~12 GB RAM | FP32 | Functional but slower |
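
Device and precision selection matching this table can be automated; a minimal sketch (function name is illustrative):

```python
import torch

def pick_device_and_dtype():
    """Select device and precision per the table above:
    BF16 on NVIDIA GPUs, FP32 on Apple Silicon (MPS) and CPU."""
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16
    if torch.backends.mps.is_available():
        return "mps", torch.float32
    return "cpu", torch.float32

device, dtype = pick_device_and_dtype()
```

The returned values can be passed to `from_pretrained` via `torch_dtype=dtype` in place of the hard-coded BF16 shown in the Quick Start.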
## License
**AGPL-3.0** -- This model may be freely used, modified, and redistributed, provided derivative works remain open source under the same license.
For **commercial or closed-source** usage, please contact [Horama](https://horama.ai) for a commercial license.
## Citation
```bibtex
@misc{horama-btp-2025,
title = {Horama-BTP: Vision-Language Model for Construction Site Analysis},
author = {Horama},
year = {2025},
url = {https://huggingface.co/Horama/Horama_BTP},
note = {Fine-tuned from Qwen2.5-VL-3B-Instruct with LoRA for structured JSON construction analysis}
}
```
---
<div align="center">
**Built by [Horama](https://horama.ai)** | Construction intelligence, powered by vision AI
</div>