---
title: HaramGuard
emoji: ๐
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
---
# HaramGuard: Agentic AI Safety System for Hajj Crowd Management
HaramGuard is a real-time, multi-agent decision-support system that integrates computer vision, risk modeling, reflective bias correction, and LLM-based coordination to assist human operators in preventing crowd crush during Hajj and Umrah.
Designed for deployment at the Grand Mosque (Masjid al-Haram), the system analyzes video feeds, estimates crowd risk levels, and generates structured operational recommendations, while maintaining strict human-in-the-loop governance, safety guardrails, and full auditability.
Capstone Project · Tuwaiq Academy
Developed by: Adeem Alotaibi, Reem Alamoudi, Munirah Alsubaie, Nourah Alhumaid · Supervised by: Eng. Omer Nacar
## Table of Contents
- Overview
- Problem Definition
- Solution / System Architecture
- Agentic System Design (Agents Description)
- Human-in-the-Loop Design
- Guardrails
- System Architecture Diagram
- Installation & Running
- Repository Structure
- Iterative Improvements
- Ethics & Safety
- Limitations & Future Work
## 1. Overview
HaramGuard is implemented as a single-pipeline, multi-agent system. One orchestration layer (`RealTimePipeline` in `backend/pipeline.py`) runs a fixed sequence of five agents per video frame and maintains a single shared state. The pipeline does not replace the operator; it produces a stream of recommendations and alerts that the operator may approve, reject, or ignore via a React dashboard.
- Backend: Python (FastAPI), Ultralytics YOLO, OpenCV, NumPy, SciPy; SQLite persistence via `HajjFlowDB` (`backend/core/database.py`). Optional Streamlit dashboard entry point in `dashboard.py`.
- Frontend: React 18, Vite, Tailwind CSS, Lucide icons; polls the backend REST API for real-time state and displays KPIs, a risk gauge, proposed actions, and a decisions log with approve/reject/delete controls.
- Data flow: Video frames → PerceptionAgent → RiskAgent → ReflectionAgent → OperationsAgent → CoordinatorAgent (when a decision exists). State is updated each frame and persisted (risk events every 30 frames; reflection every frame; decisions and coordinator plans when emitted). The FastAPI server exposes `/api/realtime/state`, `/api/frames/buffer`, `/api/actions/{id}/approve`, `/api/actions/{id}/reject`, `/api/reset`, and `/health`.
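As a sketch of how a client might consume this API (the endpoint and field names are taken from the state description in this README; the exact response schema may differ):

```python
# Minimal polling client sketch for the HaramGuard REST API.
# API_BASE and the state fields below are assumptions drawn from this README.
import json
import urllib.request

API_BASE = "http://localhost:8000"  # adjust to match API_PORT

def fetch_state(base=API_BASE):
    """GET /api/realtime/state and return the parsed JSON dict."""
    with urllib.request.urlopen(f"{base}/api/realtime/state", timeout=5) as resp:
        return json.load(resp)

def summarize(state):
    """One-line operator summary built from a state dict."""
    return (f"frame {state['frame_id']}: {state['person_count']} persons, "
            f"risk {state['risk_score']:.2f} ({state['risk_level']}, {state['trend']})")
```

A caller would poll `fetch_state()` on an interval and render `summarize(...)`, which is essentially what the React dashboard does against the same endpoint.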
## 2. Problem Definition
### 2.1 Real-World Problem
Mass gatherings during Hajj and Umrah create extreme crowd densities in and around the Grand Mosque. Crowd crush and stampede events have occurred in the past, with serious consequences. Effective crowd management depends on timely detection of rising density and flow bottlenecks, and on recommending proportionate interventions (e.g. opening gates, directing flow, broadcasting guidance) before conditions become critical.
Manual monitoring of many camera feeds is error-prone and does not scale. Operators need a system that (1) continuously estimates crowd density and risk from video, (2) explains why a risk level was assigned, and (3) suggests concrete actions while leaving all execution decisions to humans.
### 2.2 Significance
- Safety: Reducing the likelihood of crowd crush by earlier, evidence-based recommendations.
- Scale: Supporting operators who cannot watch every feed; the system aggregates perception and risk into a single, interpretable state.
- Accountability: Every recommendation is logged with reasoning (risk events, reflection log, decisions, coordinator plans), supporting post-incident review and governance.
- Context: The system is designed to respect the religious and social context of Hajj (no individual identification, human authority over actions, alignment with consultation-based decision-making).
## 3. Solution / System Architecture
### 3.1 High-Level Design
HaramGuard uses a deterministic, unidirectional pipeline:
- PerceptionAgent turns each video frame into a structured `FrameResult` (person count, density, spacing, 3×3 spatial grid for hotspots, annotated frame). It uses a YOLO model (path and size in `backend/config.py`); an optional VisionCountAgent (Claude Vision) is available in code but disabled in the default pipeline.
- RiskAgent maintains a sliding window of `FrameResult`s and computes a risk score and level (LOW/MEDIUM/HIGH) using four paths: Fruin-style EMA of count, instant high-count floor, pre-emptive rate-of-change, and spatial clustering from the 3×3 grid. Output is a `RiskResult`.
- ReflectionAgent observes `RiskResult` and `FrameResult`, detects four bias patterns (chronic LOW, rising trend ignored, count–risk mismatch, over-estimation), and corrects the risk level/score when needed. All reflections are logged to the database.
- OperationsAgent emits a `Decision` only when the risk level changes (event-driven). Priority (P0/P1/P2) is derived from config thresholds aligned with RiskAgent. P0 decisions are rate-limited per zone. Decisions are stored in the database; actions and selected gates are left empty until the Coordinator fills them.
- CoordinatorAgent is invoked by the pipeline for every decision (P0, P1, P2). It calls an LLM (Groq API) to generate a structured plan (threat level, executive summary, selected gates, immediate actions, Arabic alert, confidence). A ReAct loop (reason → act → observe, max 3 iterations) validates output with six guardrails (GR-C1–C6); the pipeline then fills the decision's actions, justification, and selected_gates from the plan and stores the plan in the database.
Single state: One state dictionary is updated each frame and exposed to the FastAPI server; the React dashboard polls it. All numeric thresholds and caps live in backend/config.py.
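The per-frame orchestration described above can be sketched as follows. The agent callables here are stand-ins for the real modules, not the actual `RealTimePipeline` implementation:

```python
# Sketch of the fixed five-agent sequence with a single shared state dict.
# Agent interfaces are simplified stand-ins assumed from this README.
class Pipeline:
    def __init__(self, perception, risk, reflection, operations, coordinator):
        self.agents = (perception, risk, reflection, operations, coordinator)
        self.state = {"frame_id": 0, "risk_level": "LOW", "latest_decision": None}

    def process_frame(self, frame):
        perception, risk, reflection, operations, coordinator = self.agents
        fr = perception(frame)               # frame -> FrameResult
        rr = risk(fr)                        # FrameResult -> RiskResult
        rr = reflection(rr, fr)              # bias-corrected RiskResult
        decision = operations(rr)            # Decision, or None if level unchanged
        if decision is not None:
            decision["plan"] = coordinator(rr, decision)  # plan for every decision
        self.state.update(frame_id=self.state["frame_id"] + 1,
                          risk_level=rr["risk_level"], latest_decision=decision)
        return self.state
```

The single mutable `state` dict mirrors what the FastAPI server exposes to the dashboard poller.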
### 3.2 Data Flow
- Input: Video frames from a file or camera (path set by `VIDEO_PATH` in config).
- Output: Per-frame state (frame_id, person_count, density_score, risk_score, risk_level, trend, latest_decision, coordinator_plan, arabic_alert, reflection_summary, risk_history, decisions_log, etc.) plus persisted records in SQLite: `risk_events`, `op_decisions`, `coordinator_plans`, `reflection_log`.
- Interfaces: Backend: FastAPI on port 8000 (configurable via `API_PORT`). Frontend: Vite dev server (e.g. port 5173), with `VITE_API_BASE_URL` configuring the API base URL.
## 4. Agentic System Design (Agents Description)
The pipeline runs five agents in order on each frame. Each agent is implemented in a single module under `backend/agents/`. Data flows unidirectionally; agents do not call each other directly.
### 4.1 PerceptionAgent (`perception_agent.py`)
- Role: Convert a raw video frame into a `FrameResult`: person count, density score, average spacing, bounding boxes, annotated frame, track IDs, occupation percentage, and a 3×3 spatial grid (grid_counts, grid_max, hotspot_zone) for downstream hotspot detection. Based on Umm Al-Qura University (UQU) Haram crowd research: local clustering in one cell can indicate risk even when the global count is moderate.
- Design pattern: Tool use: YOLO for detection and tracking; optional VisionCountAgent (Claude Vision) for an alternative count when an Anthropic key is provided. In the current default pipeline, PerceptionAgent is instantiated without an Anthropic key (YOLO-only).
- Guardrails: GR1: person count capped at `MAX_PERSONS` (1000 in the agent class). GR2: density score capped at `MAX_DENSITY` (50.0).
- Input: Raw frame (NumPy array). Output: `FrameResult` (see `backend/core/models.py`).
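The 3×3 hotspot grid described above can be sketched from bounding-box centers like this. The cell indexing and box format are assumptions for illustration, not the actual `perception_agent.py` code:

```python
# Build a 3x3 per-cell person count from bounding-box centers, then find
# the densest cell (the "hotspot"). Box format (x1, y1, x2, y2) is assumed.
def grid_counts(boxes, width, height):
    """boxes: [(x1, y1, x2, y2), ...] -> 3x3 nested list of person counts."""
    grid = [[0] * 3 for _ in range(3)]
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2    # box center
        col = min(int(cx / width * 3), 2)        # clamp to last column at the edge
        row = min(int(cy / height * 3), 2)
        grid[row][col] += 1
    return grid

def hotspot(grid):
    """Return (row, col, count) of the densest cell."""
    r, c = max(((r, c) for r in range(3) for c in range(3)),
               key=lambda rc: grid[rc[0]][rc[1]])
    return r, c, grid[r][c]
```

This is the mechanism that lets local clustering in one cell trigger risk even when the global count is moderate.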
### 4.2 RiskAgent (`risk_agent.py`)
- Role: Maintain a sliding window (14 frames) of recent `FrameResult`s and compute a scalar risk score in [0, 1] and a discrete risk level (LOW / MEDIUM / HIGH), plus a trend (rising / stable / falling). The final score is the maximum of four paths: (1) Fruin smooth: EMA of the current person count normalized to `RISK_HIGH_COUNT` (50), with spacing and trend weights; (2) instant floor: if the current count ≥ HIGH_COUNT, score floor 0.70; (3) pre-emptive ROC: 5-frame growth and EMA thresholds; (4) spatial clustering: if any 3×3 grid cell has ≥ `GRID_CELL_HIGH` persons (from the FrameResult), floor 0.70. The score is clamped to [0, 1] (GR3).
- Design pattern: Sliding window + multi-path weighted scoring. Window size, thresholds, and weights are in `config.py`.
- Input: `FrameResult`. Output: `RiskResult` (frame_id, risk_score, risk_level, trend, level_changed, window_avg, window_max, density_ema, density_pct).
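A minimal sketch of the max-of-four-paths scoring. Threshold names mirror the description above, but the ROC rule, the `GRID_CELL_HIGH` value, and the omitted spacing/trend weights are illustrative assumptions, not the real `risk_agent.py` logic:

```python
# Max-of-four-paths risk score, simplified. GRID_CELL_HIGH and the 5-frame
# growth cutoff are assumed values for illustration.
RISK_HIGH_COUNT = 50     # count at which the smoothed path saturates
GRID_CELL_HIGH = 15      # assumed per-cell clustering threshold

def risk_score(ema_count, current_count, growth_5f, grid_max):
    fruin = min(ema_count / RISK_HIGH_COUNT, 1.0)                 # path 1: smoothed count
    instant = 0.70 if current_count >= RISK_HIGH_COUNT else 0.0   # path 2: instant floor
    roc = 0.70 if growth_5f >= 10 else 0.0                        # path 3: pre-emptive rate
    spatial = 0.70 if grid_max >= GRID_CELL_HIGH else 0.0         # path 4: clustering floor
    return max(0.0, min(max(fruin, instant, roc, spatial), 1.0))  # GR3 clamp to [0, 1]

def risk_level(score):
    """Discrete level using the 0.65/0.35 thresholds cited in this README."""
    return "HIGH" if score >= 0.65 else "MEDIUM" if score >= 0.35 else "LOW"
```

Taking the maximum means any single path can raise the alarm; no path can suppress another.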
### 4.3 ReflectionAgent (`reflection_agent.py`)
- Role: Critique the current risk assessment and correct it when one of four bias patterns is detected: (1) chronic LOW: N consecutive LOW frames with average person count above threshold → upgrade to MEDIUM; (2) rising trend ignored: trend=rising, risk=LOW, count above threshold → upgrade to MEDIUM; (3) count–risk mismatch: high person count but LOW risk → upgrade to MEDIUM or HIGH; (4) over-estimation: HIGH risk but person count below threshold (e.g. < 15) → downgrade to MEDIUM. All reflections are persisted to `reflection_log` by the pipeline.
- Design pattern: Reflection (observe → critique → correct → log). History window and thresholds in config (`REFLECTION_BIAS_WINDOW`, `REFLECTION_CROWD_LOW_THRESH`, `REFLECTION_HIGH_CROWD_THRESH`, `REFLECTION_OVER_EST_THRESH`).
- Input: `RiskResult`, `FrameResult`. Output: Reflection dict; the pipeline applies corrections to the `RiskResult` before passing it to OperationsAgent.
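The chronic-LOW detector (pattern 1 above) can be sketched as follows. The window size and count threshold are illustrative stand-ins for the `REFLECTION_*` config values:

```python
# Chronic-LOW bias detector: N consecutive LOW frames with a high average
# person count get upgraded to MEDIUM. Parameter values are illustrative.
from collections import deque

class ChronicLowDetector:
    def __init__(self, window=10, crowd_thresh=20):
        self.history = deque(maxlen=window)   # (level, count) per recent frame
        self.crowd_thresh = crowd_thresh

    def check(self, risk_level, person_count):
        """Return the (possibly corrected) risk level for this frame."""
        self.history.append((risk_level, person_count))
        if len(self.history) < self.history.maxlen:
            return risk_level                 # not enough history yet
        all_low = all(lvl == "LOW" for lvl, _ in self.history)
        avg = sum(c for _, c in self.history) / len(self.history)
        if all_low and avg > self.crowd_thresh:
            return "MEDIUM"                   # chronic LOW with a busy scene: upgrade
        return risk_level
```

The other three patterns follow the same observe-critique-correct shape over the same history window.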
### 4.4 OperationsAgent (`operations_agent.py`)
- Role: Map the (possibly reflection-corrected) risk level to an operational priority (P0 / P1 / P2) and emit a `Decision` only when the risk level changes. Priority is derived from config (`OPS_P0_SCORE`, `OPS_P1_SCORE`) aligned with RiskAgent thresholds. P0 emission is rate-limited per zone (300 s cooldown in the agent class). The decision's actions and selected_gates are left empty; the pipeline fills them via CoordinatorAgent and then stores the decision in `op_decisions`.
- Design pattern: Event-driven; no decision when the level is unchanged.
- Input: `RiskResult`, context string (e.g. `Mecca_Main_Area`). Output: `Decision` or None.
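A sketch of the event-driven emission with per-zone P0 rate limiting described above. The priority cutoffs echo the 0.65/0.35 thresholds cited elsewhere in this README; the real agent reads them from config:

```python
# Event-driven decision emission: emit only on level change, and rate-limit
# P0 per zone with a 300 s cooldown (as stated above).
import time

class OperationsSketch:
    P0_COOLDOWN = 300.0  # seconds, per zone

    def __init__(self):
        self.last_level = None
        self.last_p0 = {}          # zone -> timestamp of last emitted P0

    def maybe_decide(self, risk_level, risk_score, zone, now=None):
        now = time.time() if now is None else now
        if risk_level == self.last_level:
            return None            # no decision when the level is unchanged
        self.last_level = risk_level
        priority = "P0" if risk_score >= 0.65 else "P1" if risk_score >= 0.35 else "P2"
        if priority == "P0":
            if now - self.last_p0.get(zone, float("-inf")) < self.P0_COOLDOWN:
                return None        # rate-limited; risk itself is still logged upstream
            self.last_p0[zone] = now
        return {"priority": priority, "zone": zone,
                "actions": [], "selected_gates": []}   # Coordinator fills these later
```

Note that the cooldown suppresses only the decision record; the risk stream itself is unaffected.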
### 4.5 CoordinatorAgent (`coordinator_agent.py`)
- Role: For every decision (P0, P1, or P2), produce a structured action plan using the Groq LLM (model set in the agent, e.g. `openai/gpt-oss-120b`). The plan includes threat_level, executive_summary, selected_gates, immediate_actions, actions_justification, arabic_alert, and confidence_score. Implements a ReAct loop (max 3 iterations): reason (build a prompt from the RiskResult, Decision, and frame buffer, plus optional feedback from failed validation) → act (LLM call, parse JSON) → observe (run guardrails GR-C1–C6); repeat until valid or max iterations. The pipeline fills the decision's actions, justification, and selected_gates from the plan and stores the plan in `coordinator_plans`.
- Design pattern: ReAct (reason → act → observe) + output guardrails.
- Input: `RiskResult`, `Decision`, list of recent `FrameResult`s. Output: Plan dict.
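The ReAct loop can be sketched generically as below. `call_llm`, `build_prompt`, and `validate` are hypothetical stand-ins for the Groq call, the prompt builder, and the GR-C1–C6 guardrail pass:

```python
# Generic reason -> act -> observe loop with guardrail feedback, capped at
# three iterations as described above. All three callables are stand-ins.
def react_plan(call_llm, build_prompt, validate, max_iters=3):
    """Return (plan, ok). `validate` must return (corrected_plan, violations)."""
    feedback = None
    plan = {}
    for _ in range(max_iters):
        prompt = build_prompt(feedback)     # reason: fold prior violations into the prompt
        plan = call_llm(prompt)             # act: LLM call, parsed JSON
        plan, violations = validate(plan)   # observe: run the guardrail checks
        if not violations:
            return plan, True
        feedback = violations               # retry with explicit failure feedback
    return plan, False                      # give up with the last corrected plan
```

Because `validate` also returns a corrected plan, the loop degrades gracefully: even on failure the caller receives a guardrail-sanitized plan rather than raw LLM output.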
### 4.6 VisionCountAgent (`vision_count_agent.py`, optional)
- Role: Provide an alternative person count by sending a subset of frames to a vision API (e.g. Claude Vision). Designed to be called from PerceptionAgent in hybrid mode to improve counting in dense or occluded scenes. Not used in the default pipeline (PerceptionAgent is instantiated with `anthropic_key=None`).
- Design pattern: Tool use; sampling and rate limiting are internal to avoid API overload.
## 5. Human-in-the-Loop Design
HaramGuard is a decision-support system, not an autonomous enforcement system. Every action that affects the physical world (opening gates, dispatching security, broadcasting alerts) is a recommendation to a human operator. No such action is executed by the system alone.
- OperationsAgent emits prioritized decisions (P0/P1/P2) and stores them; CoordinatorAgent produces Arabic alert text and action plans (gates, immediate actions) for each decision. These are shown on the React dashboard as "proposed actions."
- The human operator approves or rejects each proposed action via the dashboard. The API records approve/reject via `/api/actions/{id}/approve` and `/api/actions/{id}/reject`; the system does not execute any action itself.
- Operator responsibilities: treat system silence (e.g. API down or pipeline stopped) as a trigger to switch to manual monitoring; review P0 recommendations and Arabic alerts before any broadcast or deployment; use the dashboard as one input among others (e.g. direct camera views, on-ground reports).
This design aligns with the principle of consultation (Shura) and with due diligence in decisions that affect lives: the machine informs, the human decides.
## 6. Guardrails
Guardrails are hard constraints and validations applied in code to keep outputs within safe and interpretable bounds. The following are implemented in the current repository.
| ID | Agent | Guardrail | Justification |
|---|---|---|---|
| GR1 | PerceptionAgent | Person count capped at MAX_PERSONS (1000 in agent) | Prevents implausibly high counts from YOLO artifacts from propagating to risk and alerts. |
| GR2 | PerceptionAgent | Density score capped at MAX_DENSITY (50.0) | Keeps density in a bounded range for downstream risk formulas. |
| GR3 | RiskAgent | Risk score clamped to [0.0, 1.0] | Ensures threshold comparisons (e.g. 0.35, 0.65) remain valid. |
| GR4 | OperationsAgent | P0 rate-limited per zone (cooldown 300 s in agent) | Reduces alert fatigue; risk is still logged; only decision emission is rate-limited. |
| GR-C1 | CoordinatorAgent | Required JSON fields enforced; missing set to safe defaults | Prevents dashboard or downstream logic from breaking when the LLM omits fields. |
| GR-C2 | CoordinatorAgent | threat_level whitelist (CRITICAL, HIGH, MEDIUM, LOW) | Avoids invalid or adversarial values that would break UI or logic. |
| GR-C3 | CoordinatorAgent | confidence_score in [0, 1]; otherwise 0.5 | Normalizes LLM output so confidence is comparable. |
| GR-C4 | CoordinatorAgent | Full range enforcement: threat_level overridden to match actual risk_score thresholds (LOW/MEDIUM/HIGH) | Prevents LLM from returning HIGH threat during MEDIUM risk or CRITICAL during LOW risk. |
| GR-C5 | CoordinatorAgent | Arabic alert fallback if empty | Ensures safety-critical Arabic alert is never empty on the dashboard. |
| GR-C6 | CoordinatorAgent | selected_gates must be non-empty list; otherwise fallback | Ensures operators receive concrete gate recommendations. |
| RF1 | ReflectionAgent | Chronic LOW bias: N consecutive LOW with avg count above threshold → MEDIUM | Addresses sliding-window lag during rapid escalation. |
| RF2 | ReflectionAgent | Rising trend ignored: trend=rising, LOW, count above threshold → MEDIUM | Corrects an inconsistent state (rising crowd with LOW risk). |
| RF3 | ReflectionAgent | Count–risk mismatch: high count but LOW risk → upgrade to MEDIUM/HIGH | Corrects mathematically inconsistent states. |
| RF4 | ReflectionAgent | Over-estimation: HIGH risk but count < threshold (e.g. 15) → MEDIUM | Reduces false HIGH from empty or near-empty frames. |
Each guardrail is implemented in the corresponding agent file; further justification is documented in `ethics_and_safety_report.txt`.
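As an illustration, GR-C1 through GR-C3 from the table above could be implemented roughly like this. The safe-default values are assumptions; see `coordinator_agent.py` for the actual code:

```python
# Coordinator output guardrails GR-C1..GR-C3, sketched. Field names come
# from the plan schema in this README; the defaults are assumed.
REQUIRED = {"threat_level": "LOW", "executive_summary": "", "selected_gates": [],
            "immediate_actions": [], "arabic_alert": "", "confidence_score": 0.5}
LEVELS = {"CRITICAL", "HIGH", "MEDIUM", "LOW"}

def validate_plan(plan):
    out = dict(plan)
    for key, default in REQUIRED.items():      # GR-C1: missing fields -> safe defaults
        out.setdefault(key, default)
    if out["threat_level"] not in LEVELS:      # GR-C2: whitelist the threat level
        out["threat_level"] = "LOW"
    c = out["confidence_score"]                # GR-C3: normalize confidence to [0, 1]
    if not isinstance(c, (int, float)) or not 0.0 <= c <= 1.0:
        out["confidence_score"] = 0.5
    return out
```

Each check rewrites the field to a safe value instead of rejecting the whole plan, so downstream consumers (dashboard, database) always receive a well-formed object.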
## 7. System Architecture Diagram

```
┌───────────────┐
│  Video Frame  │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│PerceptionAgent│
└───────┬───────┘
        │
        ▼
┌───────────────┐
│   RiskAgent   │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ReflectionAgent│
└───────┬───────┘
        │
        ▼
┌───────────────┐
│OperationsAgent│
└───────┬───────┘
        │
        ▼
┌────────────────┐
│CoordinatorAgent│
└───────┬────────┘
        │
        ▼
┌───────────────┐
│  HajjFlowDB   │
│   (SQLite)    │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│    FastAPI    │
│   REST API    │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│     React     │
│   Dashboard   │
│(Human-in-the- │
│     Loop)     │
└───────────────┘
```
## 8. Installation & Running
### 8.1 Prerequisites
- Python 3.9+ (backend)
- Node.js 18+ (frontend)
- Groq API key (required for CoordinatorAgent). Anthropic API key optional (only if enabling VisionCountAgent in the pipeline).
### 8.2 Backend

```bash
cd backend
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Set `GROQ_API_KEY` in the environment or in `backend/config.py`. Set `VIDEO_PATH` to a valid video file path (default: `hajj_real_video.mp4` in the backend directory). Set `MODEL_PATH` if using a different YOLO weight file (default: `yolo11l.pt`).

```bash
python api.py
```

The API listens on http://0.0.0.0:8000 by default (port configurable via `API_PORT` in config).
### 8.3 Frontend

```bash
cd frontend
npm install
npm run dev
```

The dashboard is at http://localhost:5173 (or the port Vite reports). If the API is not at http://localhost:8000, set `VITE_API_BASE_URL` in `frontend/.env` or the environment.
### 8.4 Evaluation

From the backend directory:

```bash
python evaluation.py
```

Outputs are written to `backend/outputs/eval/`.
## 9. Repository Structure

```
Haramguard/
├── README.md
├── ethics_and_safety_report.txt
├── backend/
│   ├── config.py
│   ├── requirements.txt
│   ├── api.py
│   ├── pipeline.py
│   ├── evaluation.py
│   ├── dashboard.py
│   ├── core/
│   │   ├── __init__.py
│   │   ├── models.py
│   │   └── database.py
│   ├── agents/
│   │   ├── __init__.py
│   │   ├── perception_agent.py
│   │   ├── risk_agent.py
│   │   ├── reflection_agent.py
│   │   ├── operations_agent.py
│   │   ├── coordinator_agent.py
│   │   └── vision_count_agent.py
│   └── outputs/
│       └── eval/
│           ├── summary.json
│           └── full_results.json
└── frontend/
    ├── package.json
    ├── package-lock.json
    ├── index.html
    ├── vite.config.js
    ├── tailwind.config.js
    ├── postcss.config.js
    ├── .env
    ├── STATE_REFERENCE.md
    ├── src/
    │   ├── main.jsx
    │   ├── App.jsx
    │   ├── index.css
    │   ├── pages/
    │   │   └── Dashboard.jsx
    │   └── Fin.svg
    └── dist/
        ├── index.html
        └── assets/
```
## 10. Iterative Improvements

HaramGuard was developed through 14 documented iterations, each addressing a measured problem with a verifiable before/after result. The first 10 iterations are documented in `ITERATIVE_IMPROVEMENT2.md`; the remaining 4 in `changes.md`.

### Summary Table
| # | Title | Problem | Before | After |
|---|---|---|---|---|
| 1 | YOLO Model Upgrade | Nano model detected 3–4 persons on 30+ person frames | ~10% recall | |
| 2 | Count-Based Risk Scoring | Density-based formula: HIGH risk mathematically unreachable on aerial cameras | Scene C accuracy: 0% | Scene C accuracy: 100% |
| 3 | ReflectionAgent Added | 30-frame sliding window caused a 20+ frame blind spot during rapid escalation | Uncorrected bias | 5/5 bias detectors passing |
| 4 | Risk–Priority Threshold Alignment | HIGH risk (score 0.66) incorrectly received P1 instead of P0 | Risk–priority alignment: ~75% | 100% alignment |
| 5 | Hybrid PerceptionAgent (YOLO + Claude Vision) | Dense-crowd under-count due to white ihram occlusion | 3–4 persons detected | Matches ground truth |
| 6 | Modular Architecture | Entire system in one notebook: untestable, unconfigurable | 0 isolated tests | 6 independent agent modules |
| 7 | SQLite Audit Trail | Reflection corrections lost after the session: no auditability | Console logs only | Full SQLite history |
| 8 | Evaluation Framework | No systematic metrics: manual testing only | Manual testing | 8 quantified metrics, 4 scenarios |
| 9 | Condition-Based Risk Factors | High compression + clustering still reported LOW risk | Compression undetected | Compression/clustering detected |
| 10 | Weight Recalibration | Condition factors weakened the primary count signal | System accuracy: 50% (2/4) | System accuracy: 75% (3/4) |
| 11 | Risk Index Direction Fix | 17 persons + shrinking crowd → 82% risk (peak window bug) | window_peak = max(counts) | current_count EMA: risk falls with the crowd |
| 12 | Trend Score Bidirectionality | t_score always ≥ 0.4 even during rapid crowd decrease | Decreasing crowd added constant risk | t_score = 0.0 when the crowd shrinks fast |
| 13 | Arabic UI & Decision Log | English-only labels; decisions replaced on each poll | English labels, lost history | Arabic labels (منخفض/متوسط/عالي), cumulative log |
| 14 | Clean Dashboard State | FALLBACK_STATE showed a fake HIGH emergency on load | Fake P0 alert on startup | ZERO_STATE: clean until real data arrives |
### Key Iterations in Detail

#### Iteration 2: Count-Based Scoring (most critical architectural fix)
The original formula computed pixel density: persons / (frame_pixels / 10,000). For a 1920×1080 aerial frame (~2M pixels), even 100 persons yields density = 0.5, far below the HIGH threshold of 20. The system was architecturally incapable of ever reporting HIGH risk. Replacing the primary signal with absolute person count normalized to a Hajj-calibrated threshold of 50 persons brought Scene C accuracy from 0% to 100%.
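The arithmetic behind this fix, spelled out with the frame size and thresholds stated above:

```python
# Why the old pixel-density formula could never reach HIGH on a full-HD
# aerial frame, and why the count-normalized score can.
frame_pixels = 1920 * 1080                     # ~2.07M pixels

def old_density(persons):
    return persons / (frame_pixels / 10_000)   # pre-fix pixel-density formula

def new_score(persons):
    return min(persons / 50, 1.0)              # count normalized to the 50-person threshold

# old_density(100) is about 0.48, far below the old HIGH threshold of 20;
# new_score(100) saturates at 1.0, correctly flagging HIGH.
```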
#### Iteration 4: Threshold Alignment (critical safety fix)
RiskAgent labeled scores ≥ 0.65 as HIGH, but OperationsAgent only issued P0 for scores ≥ 0.70. A score of 0.66, a genuine HIGH emergency, would receive P1 (routine monitoring) instead of P0 (immediate response). Aligning both thresholds to 0.65/0.35 in `config.py` fixed this safety gap and brought risk–priority alignment to 100%.
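The fix, reduced to code: one shared pair of thresholds drives both mappings, so level and priority cannot diverge:

```python
# One source of truth for the risk thresholds, as in the post-fix config.
RISK_HIGH, RISK_MED = 0.65, 0.35

def level(score):
    """Risk level as RiskAgent labels it."""
    return "HIGH" if score >= RISK_HIGH else "MEDIUM" if score >= RISK_MED else "LOW"

def priority(score):
    """Operational priority as OperationsAgent assigns it, from the same thresholds."""
    return "P0" if score >= RISK_HIGH else "P1" if score >= RISK_MED else "P2"

# Pre-fix, priority used a separate 0.70 cutoff, so 0.66 was HIGH yet only P1.
```

With the shared constants, a HIGH level implies a P0 priority by construction.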
#### Iteration 11: Risk Index Direction Fix
The EMA was computed using max(counts) over the last window, causing the risk to remain inflated long after the crowd had dispersed. Example: 70 persons 15 frames ago but only 17 now → 82% risk. Switching to current_count as the EMA input allows risk to decrease in proportion to the actual crowd, while the EMA still smooths out frame-to-frame noise.
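A small numeric illustration of the fix; the smoothing factor here is illustrative, not the project's actual value:

```python
# An EMA over the *current* count decays toward the real crowd size, while
# max() over the window stays pinned at the historical peak.
def ema(values, alpha=0.3):
    """Exponential moving average over a sequence (alpha is illustrative)."""
    s = values[0]
    for v in values[1:]:
        s = alpha * v + (1 - alpha) * s
    return s

counts = [70] * 8 + [17] * 10      # crowd peaks at 70, then disperses to 17
peak_based = max(counts[-14:])     # old input over a 14-frame window: stuck at 70
ema_based = ema(counts)            # new input: decays toward 17 as the crowd shrinks
```

Here `peak_based` still reports 70 even though only 17 people remain, while `ema_based` has already fallen into the low twenties and keeps falling.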
#### Iteration 14: Clean Dashboard State
FALLBACK_STATE was a hardcoded demo object showing a fake P0 HIGH emergency, designed for UI screenshots. It was left in production code and flashed on screen before the backend connected, showing operators a false emergency every time the dashboard loaded. Replacing it with ZERO_STATE (all zeros, LOW level, empty arrays) ensures the dashboard starts clean.
## 11. Ethics & Safety
### 11.1 Human-in-the-Loop Design
HaramGuard is a decision-support system, not an autonomous enforcement system. Every output is a recommendation to a human operator: no gate opens, no security is dispatched, no PA broadcast is made without a human approving the action. This design aligns with the Islamic principle of consultation (Shura) and with due diligence in decisions that affect the lives of millions of pilgrims.
### 11.2 Privacy & Surveillance
HaramGuard processes crowd count data only, not individual identities.
- No facial recognition is performed
- No biometric data is stored
- YOLO detects person bounding boxes (anonymous blobs only)
- Claude Vision counts persons without identification
- SQLite stores risk scores, counts, and timestamps; no personal data
None of the database tables (risk_events, op_decisions, coordinator_plans, reflection_log) contain personally identifiable information (PII). Bounding box data is discarded after spacing calculation and tracking IDs are not persisted.
### 11.3 Fairness & Bias
YOLO models trained on general datasets may under-detect pilgrims in white ihram clothing (domain shift). Two mitigations are implemented:
- VisionCountAgent (Claude Vision) provides a context-aware second counting layer that understands "pilgrims in white garments"
- ReflectionAgent detects chronic under-counting (CHRONIC_LOW_BIAS) and corrects upward, preventing model bias from causing under-response
Residual risk: under-count in extreme occlusion remains possible and is documented in evaluation.py Section 6 as a known limitation.
### 11.4 Transparency & Explainability
Every decision in HaramGuard is logged with human-readable reasoning:
- RiskAgent: logs risk_score, trend, window_avg
- ReflectionAgent: logs critique text explaining why bias was detected (e.g. "RISING_TREND_IGNORED: trend=rising, persons=25, but risk=LOW. Upgraded to MEDIUM.")
- OperationsAgent: logs which playbook actions were triggered and why
- CoordinatorAgent: logs GPT confidence score and any guardrail corrections applied
Operators can audit every decision post-incident by querying the SQLite database.
### 11.5 Conservative Bias by Design
The system is deliberately tuned to err toward higher risk:
- `HIGH_COUNT = 50` (not 100): triggers a HIGH alert at moderate crowd sizes
- ReflectionAgent corrections are biased upward; the only downgrade (RF4) removes false HIGH readings on near-empty frames
- Missing confidence score defaults to 0.5 (not 0)
Rationale: in crowd safety, a false alarm is far less costly than a missed stampede.
### 11.6 Potential Misuse Scenarios
| Scenario | Risk | Mitigation |
|---|---|---|
| Surveillance creep | System extended to track individuals | Bounding boxes discarded after use; no tracking IDs in DB; Vision prompt states "count only, do not identify" |
| False positive causing panic | Incorrect HIGH alert triggers overreaction | HITL design; ReflectionAgent monitors for oscillation; 30-frame window smooths spikes; P0 rate limiting |
| System failure during peak crowd | Pipeline crash: operators lose visibility | Fail-safe per-agent isolation; operators trained to treat silence as a trigger for manual monitoring |
| Adversarial prompt injection | Malicious input manipulates LLM output | Structured JSON-only output; guardrails GR-C1–C5 validate every field; threat_level constrained to a whitelist |
| Disproportionate security response | P0 triggers aggressive enforcement harming pilgrims | Playbook actions are crowd-management only (open gates, PA broadcast, crowd guides), not enforcement; the human operator has final authority |
### 11.7 Fail-Safe Behavior
If any agent fails, the pipeline continues with safe defaults:
- PerceptionAgent fails → returns an empty FrameResult (count=0)
- RiskAgent fails → the previous risk level is retained
- VisionCountAgent fails → falls back to the YOLO count (logged in flags)
- CoordinatorAgent fails → the P0 decision is still issued without a GPT plan
- DB write fails → logged to console; the pipeline continues
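This per-agent isolation can be sketched as a wrapper around each agent call; a simplified illustration, not the pipeline's actual error-handling code:

```python
# Wrap each agent call so one failure degrades to a safe default instead of
# crashing the whole pipeline; failures are recorded in a flags list.
def safe_call(agent_fn, fallback, flags, name, *args):
    """Run one agent; on any exception, log a flag and return its fallback."""
    try:
        return agent_fn(*args)
    except Exception as exc:
        flags.append(f"{name}_failed: {exc}")   # logged; the pipeline continues
        return fallback

flags = []

def broken_perception(frame):
    raise RuntimeError("decoder error")          # simulated agent failure

empty_frame_result = {"person_count": 0}         # safe default: count=0
result = safe_call(broken_perception, empty_frame_result, flags, "perception", None)
```

Each agent gets its own fallback (previous risk level, YOLO count, decision without a plan), so a single faulty stage never silences the whole system.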
## 12. Limitations & Future Work
Single-camera, single-zone: The pipeline processes one video stream per instance. Real deployment at the Grand Mosque would require multiple cameras and zones. Future work: multi-zone state and one pipeline per camera with aggregation at the API or dashboard layer.
Synthetic-only evaluation: Quantitative metrics in `evaluation.py` are computed on synthetically generated scenarios with known ground-truth counts. Real aerial footage has occlusion, blur, and lighting conditions not fully captured. Future work: annotate real frames and measure real-world accuracy and recall.

No Hajj-specific fine-tuning: The YOLO model is pre-trained on general datasets. Pilgrims in white ihram can be under-detected (domain shift). VisionCountAgent and ReflectionAgent mitigate this in part when enabled. Future work: fine-tune a detector on Hajj-annotated data to improve recall (estimated +15%).
Coordinator output quality not automatically measured: Evaluation covers risk levels, priorities, and guardrail compliance, but not the appropriateness or clarity of the generated Arabic plans and alerts. Future work: a human-expert rubric and sample-based evaluation of plan quality.
Production scaling: The architecture supports running multiple pipeline instances; the dashboard and API would need to be extended for per-zone or per-camera state and approve/reject controls per zone.
HaramGuard: Capstone Project · Tuwaiq Academy