Technical Report: Hybrid VLM-CNN Architecture for Real Estate Manipulation Detection

Team Name: Lina Alkhatib
Track: Track B (Real Estate)
Date: January 28, 2026

1. Executive Summary

This report details the development of an automated forensic system designed to detect and explain digital manipulations in real estate imagery. Addressing the challenge of "fake listings," our solution employs a Hybrid Vision-Language Architecture. By combining the high-speed pattern recognition of a Convolutional Neural Network (ResNet-18) with the semantic reasoning capabilities of a Vision-Language Model (BLIP), the system achieves both high detection accuracy and human-readable interpretability. The system specifically targets physical inconsistencies such as irregular shadows, conflicting illumination, and floating objects.

2. System Architecture

The system operates as a Serial Cascading Pipeline, prioritizing computational efficiency without sacrificing explanatory depth. The architecture is divided into two distinct modules that interact via a logic controller.

Module 1: The Detector (Quantitative Analysis)

• Architecture: ResNet-18 (Residual Neural Network).
• Role: Rapid real-vs-fake classification and manipulation-type scoring.
• Input: 224 × 224 RGB images.
• Classes: Real, Fake_AI (fully generated), Fake_Splice (copy-paste edits).
• Mechanism: The CNN extracts low-level latent features (texture anomalies, pixel noise, compression artifacts) that are often invisible to the human eye but indicative of digital tampering.
• Output: An Authenticity Score (0.0 to 1.0) and a predicted class label.

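A minimal sketch of how the detector's raw outputs might be reduced to an Authenticity Score and a class label. The class order and the logit values below are illustrative assumptions, not taken from the trained model; only the three class names come from the report.

```python
import math

CLASSES = ["Real", "Fake_AI", "Fake_Splice"]  # assumed class order

def summarize_logits(logits):
    """Reduce raw 3-class logits to (authenticity_score, predicted_label).

    The Authenticity Score is taken here as the softmax probability of
    "Real", so it naturally lies in the 0.0-1.0 range described above.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    authenticity = probs[CLASSES.index("Real")]
    label = CLASSES[max(range(len(probs)), key=probs.__getitem__)]
    return authenticity, label

score, label = summarize_logits([0.2, 2.5, 0.4])  # hypothetical logits
```

With these example logits the "Fake_AI" logit dominates, so the score is low and the label escalates the image to Module 2.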
Module 2: The Reasoner (Qualitative Forensics)

• Architecture: BLIP (Bootstrapping Language-Image Pre-training) for Visual Question Answering (VQA).
• Role: Semantic analysis and report generation.
• Mechanism: Unlike standard captioning models, this module acts as a forensic interrogator. It does not simply "describe" the image; it answers specific physics-based questions about the scene's geometry and lighting.
• Output: Natural language justifications (e.g., "Inconsistent lighting detected on the sofa").

3. The Fusion Strategy

The core innovation of our solution is the Conditional Logic Fusion Strategy. Instead of running both heavy models simultaneously (parallel fusion), we use a conditional dependency approach to optimize for inference speed and relevance.

Step 1: The Gatekeeper (ResNet-18)

The image is first passed through Module 1.

• If the predicted class is Real (with high confidence): The pipeline terminates early. The system outputs a high Authenticity Score and a standard verification message. This saves computational resources by not invoking the VLM for clean images.
• If the predicted class is Fake (AI or Splice): The image is flagged and passed to Module 2 for "interrogation."

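The routing above can be sketched as plain control flow. The 0.90 confidence threshold is a hypothetical value (the report does not specify one), and treating a low-confidence "Real" prediction as escalation-worthy is an assumption about the gate's behavior.

```python
REAL_CONFIDENCE_THRESHOLD = 0.90  # hypothetical cutoff, not from the report

def gatekeeper(predicted_class, confidence):
    """Decide whether the pipeline stops at Module 1 or escalates to Module 2."""
    if predicted_class == "Real" and confidence >= REAL_CONFIDENCE_THRESHOLD:
        # Early exit: the VLM is never invoked for confidently clean images.
        return {"verdict": "verified", "authenticity": confidence}
    # Any Fake class (and, in this sketch, a low-confidence "Real")
    # is flagged and handed to the BLIP interrogation stage.
    return {"verdict": "flagged", "next_stage": "vqa_interrogation"}
```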
Step 2: Forensic Interrogation (BLIP VLM)

Once an image is flagged as manipulated, Module 2 executes a Targeted VQA Protocol. Instead of generic prompts, we inject three specific probes into the VLM:

1. Object Identification: "What is the main furniture object in the center?" (context grounding).
2. Shadow Physics: "Does the [object] cast a shadow on the floor?" (grounding consistency).
3. Lighting Consistency: "Does the lighting on the [object] match the background?" (global illumination check).

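The three probes chain together: the answer to probe 1 fills the [object] slot in probes 2 and 3. A sketch of that templating, where `ask_vqa` is a stand-in for a real BLIP VQA call (its signature is assumed, not the library's API):

```python
# Prompt wording mirrors the three probes listed above.
PROBE_OBJECT = "What is the main furniture object in the center?"
PROBE_SHADOW = "Does the {obj} cast a shadow on the floor?"
PROBE_LIGHT = "Does the lighting on the {obj} match the background?"

def build_probes(ask_vqa, image):
    """Run probe 1, then ground probes 2 and 3 on the detected object.

    `ask_vqa(image, question) -> answer` is a placeholder for the actual
    VQA model invocation.
    """
    obj = ask_vqa(image, PROBE_OBJECT)  # e.g. "chair"
    return {
        "object": obj,
        "shadow_q": PROBE_SHADOW.format(obj=obj),
        "light_q": PROBE_LIGHT.format(obj=obj),
    }
```

For example, if probe 1 returns "chair", probe 2 becomes "Does the chair cast a shadow on the floor?".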
Step 3: Rule-Based Synthesis

The final output is not raw model text. A logic layer synthesizes the VQA answers into a coherent forensic report.

• Input: Shadow=No, Lighting=No, Object=Chair.
• Synthesized Output: "Manipulation detected: the chair lacks a grounded contact shadow; illumination on the chair contradicts the scene."

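A minimal version of this synthesis layer, with clause wording taken from the worked example above; the yes/no normalization of VQA answers is an assumption.

```python
def synthesize_report(obj, shadow_answer, lighting_answer):
    """Map VQA answers to a forensic report using fixed rules.

    Each "No" answer contributes one finding clause; clauses are joined
    into a single sentence, matching the example output in Step 3.
    """
    obj = obj.lower()
    findings = []
    if shadow_answer.strip().lower() == "no":
        findings.append(f"the {obj} lacks a grounded contact shadow")
    if lighting_answer.strip().lower() == "no":
        findings.append(f"illumination on the {obj} contradicts the scene")
    if not findings:
        return "No physical inconsistencies articulated by the VQA probes."
    return "Manipulation detected: " + "; ".join(findings) + "."
```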
4. Data Engineering

To ensure robustness, the models were trained and tested on a curated dataset representing real-world attack vectors:

• Real Sources: High-quality interior photography from the Places365 dataset (living rooms, bedrooms).
• Synthetic Attacks: Generated using Stable Diffusion Inpainting to insert objects (sofas, tables) into scenes without proper lighting integration.
• Splicing Attacks: "Dumb" copy-paste augmentations created programmatically to simulate amateur Photoshop errors (floating objects, broken perspective).

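The "dumb" splice augmentation can be as simple as a hard, unblended overwrite of one image region by another; the resulting seam and missing contact shadow are exactly the artifacts the detector trains on. This is an illustrative sketch, not the competition's actual generation script:

```python
import numpy as np

def naive_splice(scene, patch, top, left):
    """Paste `patch` into `scene` with no blending, shadow synthesis,
    or perspective correction, mimicking an amateur copy-paste edit."""
    out = scene.copy()
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w] = patch  # hard, unblended overwrite
    return out

# Toy example: a white 16x16 "object" dropped into a black 64x64 "room".
scene = np.zeros((64, 64, 3), dtype=np.uint8)
patch = np.full((16, 16, 3), 255, dtype=np.uint8)
spliced = naive_splice(scene, patch, 10, 20)
```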
5. Conclusion

This system successfully bridges the gap between "black box" AI detection and human interpretability. By using ResNet-18 to detect pixel-level artifacts and BLIP to reason about scene physics, the solution provides real estate platforms with a reliable, scalable, and explainable tool for verifying listing integrity.