---
language:
- en
- fr
license: agpl-3.0
library_name: transformers
base_model: Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
tags:
- construction
- visual-analysis
- safety-inspection
- vlm
- qwen2_5_vl
- qwen2-vl
- lora
- horama
- btp
- structured-output
- json
- image-to-json
- peft
- safetensors
model-index:
- name: Horama_BTP
  results: []
---

<div align="center">

# HORAMA-BTP

### Vision-Language Model for Construction Site Analysis

**Image → Structured JSON** | Built on Qwen2.5-VL | Fine-tuned with LoRA

---

*Horama-BTP transforms construction site photographs into comprehensive, machine-readable JSON reports covering progress tracking, safety compliance, quality assessment, and logistics.*

</div>

## Overview

**Horama-BTP** is a domain-specialized Vision-Language Model (VLM) that converts construction site images into structured JSON analyses. Given a single photograph, the model produces a detailed report spanning **15 analysis dimensions** -- from construction progress estimation and safety compliance to quality defects and environmental impact.

The model enforces a strict, validated JSON schema with controlled vocabularies, confidence scores, and an evidence-linking system that traces every observation back to visual evidence in the image.

### Key Capabilities

| Dimension | What the model extracts |
|---|---|
| **Progress** | Construction stage (earthworks → commissioning), estimated % completion, detected milestones |
| **Safety** | PPE compliance per worker, hazard identification (9 types), control measures present or missing |
| **Quality** | Structural defects (cracks, misalignment, corrosion, and more), non-conformities |
| **Observations** | Objects, materials, equipment, personnel, vehicles, structural parts with attributes |
| **Logistics** | Materials on site, equipment status (idle/operating), access constraints |
| **Environment** | Dust, noise, waste, spills; waste management assessment |
| **Evidence** | Traceable evidence entries with unique IDs linking every finding to visual proof |

## Architecture

```
Input Image ───┐
               ├──► Qwen2.5-VL-3B-Instruct ──► LoRA-adapted layers ──► Structured JSON
System Prompt ─┘          (backbone)             (r=32, alpha=64)
```

| Component | Details |
|---|---|
| **Backbone** | [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) -- 3B-parameter multimodal transformer |
| **Adaptation** | LoRA (Low-Rank Adaptation) applied to all attention and MLP projections |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **LoRA Rank** | r=32, alpha=64 (2x scaling), dropout=0.1 |
| **Precision** | BF16 (GPU) / FP32 (CPU/MPS) |
| **Output** | Deterministic JSON (temperature=0, greedy decoding) |

### Design Principles

- **Schema-first**: Every output is validated against a formal JSON Schema (draft 2020-12) with 15 required top-level fields and controlled enumerations
- **Evidence-linked**: All observations reference `evidence_id` entries -- no claim without visual justification
- **Confidence-scored**: Every detection carries a `[0, 1]` confidence score for downstream filtering
- **Honest by design**: When something is uncertain or not visible, the model uses `"unknown"`, `null`, or empty arrays -- never hallucinated details

## Quick Start

```python
import json

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load model and processor
model_id = "Horama/Horama_BTP"

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image
image = Image.open("construction_site.jpg").convert("RGB")

# System prompt -- instructs the model to output Horama-BTP v1 JSON
system_prompt = """You are Horama-BTP v1. Analyze construction site images. Output ONLY valid JSON. No text before/after.
CRITICAL RULES:
1. ONLY describe what you can CLEARLY SEE in the image
2. If you cannot see something -> use empty array [] or "unknown"
3. Output must follow the Horama-BTP v1 JSON schema exactly"""

user_prompt = "Analyze this construction site image and return the Horama-BTP v1 JSON output."

# Prepare messages
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_prompt},
        ],
    },
]

# Generate deterministically (greedy decoding)
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt
generated = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)

# Extract the JSON object from the model's reply
json_start = result.find("{")
json_end = result.rfind("}") + 1
analysis = json.loads(result[json_start:json_end])

print(json.dumps(analysis, indent=2))
```

## Output Schema

The model outputs a single JSON object with **15 required top-level fields**:

```
{
  "job_type": "construction" | "renovation" | "infrastructure" | "unknown",
  "asset_type": "house" | "building" | "road" | "bridge" | "tunnel" | "site" | "unknown",
  "scene_context": { location_hint, weather_light, viewpoint },
  "summary": { one_liner, confidence },
  "progress": { overall_stage, stage_confidence, progress_percent_estimate, progress_confidence, milestones_detected },
  "work_activities": [{ activity, status, confidence, evidence_ids }],
  "observations": [{ type, label, attributes, confidence, evidence_ids }],
  "safety": { overall_risk_level, ppe[], hazards[], control_measures[] },
  "quality": { issues[], non_conformities[] },
  "logistics": { materials_on_site[], equipment_on_site[], access_constraints[] },
  "environment": { impacts[], waste_management },
  "evidence": [{ evidence_id, source, bbox_xyxy, description }],
  "unknown": [{ question, why_unknown, needed_input }],
  "domain_fields": { custom_kpis, lot_breakdown, client_specific },
  "metadata": { model, version, generated_at }
}
```
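
A lightweight complement to full schema validation is checking that all 15 required top-level fields are present before accepting a report. A sketch in plain Python, with the field list copied from the outline above:

```python
# The 15 required top-level fields of the Horama-BTP v1 schema
REQUIRED_FIELDS = (
    "job_type", "asset_type", "scene_context", "summary", "progress",
    "work_activities", "observations", "safety", "quality", "logistics",
    "environment", "evidence", "unknown", "domain_fields", "metadata",
)

def missing_fields(report: dict) -> list[str]:
    """Return the required top-level fields absent from a report."""
    return [field for field in REQUIRED_FIELDS if field not in report]

# A partial report is easy to flag before it reaches downstream systems
partial = {"job_type": "construction", "asset_type": "house"}
assert missing_fields(partial)[0] == "scene_context"
```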

### Controlled Vocabularies

The schema enforces controlled enumerations across all categorical fields:

| Field | Allowed values |
|---|---|
| `overall_stage` | `planning`, `earthworks`, `foundations`, `structure`, `envelope`, `mep`, `finishing`, `commissioning`, `unknown` |
| `ppe_item` | `helmet`, `vest`, `gloves`, `goggles`, `harness`, `boots`, `mask`, `other` |
| `hazard_type` | `fall_risk`, `open_trench`, `moving_vehicle`, `electrical`, `fire`, `unstable_load`, `poor_housekeeping`, `restricted_area`, `other` |
| `issue_type` | `crack`, `misalignment`, `water_infiltration`, `corrosion`, `spalling`, `poor_finish`, `missing_component`, `rework`, `other` |
| `observation_type` | `object`, `material`, `equipment`, `signage`, `defect`, `hazard`, `personnel`, `vehicle`, `structure_part`, `other` |
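
Mirroring these vocabularies downstream catches out-of-vocabulary values before they reach storage or dashboards. A sketch using the `hazard_type` set from the table (the report shape follows the schema outline above):

```python
# Controlled vocabulary for hazard_type, copied from the table above
HAZARD_TYPES = {
    "fall_risk", "open_trench", "moving_vehicle", "electrical", "fire",
    "unstable_load", "poor_housekeeping", "restricted_area", "other",
}

def out_of_vocabulary_hazards(report: dict) -> list[str]:
    """Return hazard_type values not in the controlled vocabulary."""
    hazards = report.get("safety", {}).get("hazards", [])
    return [h["hazard_type"] for h in hazards if h["hazard_type"] not in HAZARD_TYPES]

report = {"safety": {"hazards": [
    {"hazard_type": "fall_risk", "severity": "medium"},
    {"hazard_type": "sinkhole", "severity": "high"},  # invalid: not in the vocabulary
]}}
assert out_of_vocabulary_hazards(report) == ["sinkhole"]
```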

## Example Output

Given a drone photograph of a wood-framed house under construction:

```json
{
  "job_type": "construction",
  "asset_type": "house",
  "scene_context": {
    "location_hint": "outdoor",
    "weather_light": "day",
    "viewpoint": "drone"
  },
  "summary": {
    "one_liner": "Aerial view of a wood-framed house under construction; floor deck and wall framing visible, two workers on site.",
    "confidence": 0.88
  },
  "progress": {
    "overall_stage": "structure",
    "stage_confidence": 0.85,
    "progress_percent_estimate": 35,
    "progress_confidence": 0.35,
    "milestones_detected": []
  },
  "safety": {
    "overall_risk_level": "medium",
    "ppe": [
      { "role": "worker", "ppe_item": "helmet", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] },
      { "role": "worker", "ppe_item": "vest", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] }
    ],
    "hazards": [
      { "hazard_type": "fall_risk", "severity": "medium", "confidence": 0.6, "evidence_ids": ["ev_005"] }
    ],
    "control_measures": [
      { "measure": "barrier", "status": "present", "confidence": 0.6, "evidence_ids": ["ev_004"] }
    ]
  },
  "evidence": [
    { "evidence_id": "ev_001", "source": "image", "description": "Wood-framed structure with interior wall framing visible" },
    { "evidence_id": "ev_003", "source": "image", "description": "Two workers wearing high-visibility vests and hard hats" },
    { "evidence_id": "ev_005", "source": "image", "description": "Open edges and elevated framing suggesting fall risk" }
  ]
}
```

*(Truncated for readability -- full output includes all 15 top-level fields)*
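
The evidence-linking system makes findings like the hazard above auditable: each `evidence_ids` entry can be resolved against the `evidence` array. A sketch of that join, using a fragment of the example output:

```python
# Fragment of the example output above
analysis = {
    "safety": {"hazards": [
        {"hazard_type": "fall_risk", "severity": "medium", "evidence_ids": ["ev_005"]},
    ]},
    "evidence": [
        {"evidence_id": "ev_005", "source": "image",
         "description": "Open edges and elevated framing suggesting fall risk"},
    ],
}

# Index evidence entries by ID, then resolve each hazard's references
evidence_by_id = {e["evidence_id"]: e for e in analysis["evidence"]}
for hazard in analysis["safety"]["hazards"]:
    proofs = [evidence_by_id[eid]["description"] for eid in hazard["evidence_ids"]]
    print(f"{hazard['hazard_type']} ({hazard['severity']}): {'; '.join(proofs)}")
```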

## Training Details

| Parameter | Value |
|---|---|
| **Method** | LoRA (Parameter-Efficient Fine-Tuning) |
| **Epochs** | 15 |
| **Effective batch size** | 4 (batch=1, accumulation=4) |
| **Learning rate** | 1e-4 with cosine schedule |
| **Warmup** | 10% of training steps |
| **Weight decay** | 0.01 |
| **Gradient checkpointing** | Enabled |
| **Trainable parameters** | ~1.5% of total model parameters |
| **Framework** | Transformers + PEFT |
| **Hardware** | NVIDIA GPU with BF16 |

## Intended Uses

**Primary use cases:**

- Automated construction progress reporting from site photographs
- Safety compliance auditing (PPE detection, hazard identification)
- Quality control -- detecting visible defects and non-conformities
- Logistics monitoring -- tracking materials and equipment on site
- Environmental impact documentation

**Input requirements:**

- Single construction site image (JPEG, PNG, WebP, BMP)
- Supports ground-level, drone, and fixed-camera viewpoints
- Works best with well-lit daylight images

## Limitations

- **Single-image analysis**: The model analyzes one image at a time; it cannot compare images across time for temporal progress tracking
- **Visible elements only**: Cannot infer hidden structural issues, underground utilities, or elements behind walls
- **No sensory data**: Cannot detect noise levels, dust concentration, or odors from static images
- **Resolution-dependent**: Small or distant objects (e.g., PPE details at long range) may have lower confidence
- **Schema-bound**: Output strictly follows the Horama-BTP v1 schema -- custom fields require the `domain_fields` extension point

## Hardware Requirements

| Setup | VRAM / RAM | Precision | Notes |
|---|---|---|---|
| **NVIDIA GPU** | ~8 GB VRAM | BF16 | Recommended for production |
| **Apple Silicon** | ~8 GB RAM | FP32 | Supported via the MPS backend |
| **CPU** | ~12 GB RAM | FP32 | Functional but slower |
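
The table above translates into a small device/dtype selection helper. A sketch, assuming the model is then loaded with `from_pretrained` as in the Quick Start:

```python
import torch

def pick_device_and_dtype() -> tuple[str, torch.dtype]:
    """Select device and precision per the table: BF16 on NVIDIA GPUs, FP32 elsewhere."""
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16
    if torch.backends.mps.is_available():
        return "mps", torch.float32
    return "cpu", torch.float32

device, dtype = pick_device_and_dtype()
# e.g. model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=dtype).to(device)
```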

## License

**AGPL-3.0** -- This model can be freely used, modified, and redistributed, provided that derivative works remain open source under the same license.

For **commercial or closed-source** usage, please contact [Horama](https://horama.ai) for a commercial license.

## Citation

```bibtex
@misc{horama-btp-2025,
  title  = {Horama-BTP: Vision-Language Model for Construction Site Analysis},
  author = {Horama},
  year   = {2025},
  url    = {https://huggingface.co/Horama/Horama_BTP},
  note   = {Fine-tuned from Qwen2.5-VL-3B-Instruct with LoRA for structured JSON construction analysis}
}
```

---

<div align="center">

**Built by [Horama](https://horama.ai)** | Construction intelligence, powered by vision AI

</div>