Technical Report: Hybrid VLM-CNN Architecture for Real Estate Manipulation Detection

Team Name: Lina Alkhatib
Track: Track B (Real Estate)
Date: January 28, 2026

1. Executive Summary

This report details the development of an automated forensic system designed to detect and explain digital manipulations in real estate imagery. Addressing the challenge of "fake listings," our solution employs a Hybrid Vision-Language Architecture. By combining the high-speed pattern recognition of a Convolutional Neural Network (ResNet-18) with the semantic reasoning capabilities of a Vision-Language Model (BLIP), the system achieves both high detection accuracy and human-readable interpretability. The system specifically targets physical inconsistencies such as irregular shadows, conflicting illumination, and floating objects.

2. System Architecture

The system operates as a Serial Cascading Pipeline, prioritizing computational efficiency without sacrificing explanatory depth. The architecture is divided into two distinct modules that interact via a logic controller.

Module 1: The Detector (Quantitative Analysis)

• Architecture: ResNet-18 (Residual Neural Network).
• Role: Rapid real-vs-fake classification and manipulation-type scoring.
• Input: 224 × 224 RGB images.
• Classes: Real, Fake_AI (fully generated), Fake_Splice (copy-paste edits).
• Mechanism: The CNN extracts low-level latent features (texture anomalies, pixel noise, compression artifacts) that are often invisible to the human eye but indicative of digital tampering.
• Output: An Authenticity Score (0.0 to 1.0) and a predicted class label.

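A minimal sketch of how the detector's raw outputs might be reduced to an Authenticity Score and a class label. The class order and the logit values below are illustrative assumptions, not taken from the trained model; only the three class names come from the report.

```python
import math

CLASSES = ["Real", "Fake_AI", "Fake_Splice"]  # assumed class order

def summarize_logits(logits):
    """Reduce raw 3-class logits to (authenticity_score, predicted_label).

    The Authenticity Score is taken here as the softmax probability of
    "Real", so it naturally lies in the 0.0-1.0 range described above.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    authenticity = probs[CLASSES.index("Real")]
    label = CLASSES[max(range(len(probs)), key=probs.__getitem__)]
    return authenticity, label

score, label = summarize_logits([0.2, 2.5, 0.4])  # hypothetical logits
```

With these example logits the "Fake_AI" logit dominates, so the score is low and the label escalates the image to Module 2.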
Module 2: The Reasoner (Qualitative Forensics)

• Architecture: BLIP (Bootstrapping Language-Image Pre-training) for Visual Question Answering (VQA).
• Role: Semantic analysis and report generation.
• Mechanism: Unlike standard captioning models, this module acts as a forensic interrogator. It does not simply "describe" the image; it answers specific physics-based questions about the scene's geometry and lighting.
• Output: Natural language justifications (e.g., "Inconsistent lighting detected on the sofa").

3. The Fusion Strategy

The core innovation of our solution is the Conditional Logic Fusion Strategy. Instead of running both heavy models simultaneously (parallel fusion), we use a conditional dependency approach to optimize for inference speed and relevance.

Step 1: The Gatekeeper (ResNet-18)

The image is first passed through Module 1.

• If the predicted class is Real (with high confidence): The pipeline terminates early. The system outputs a high Authenticity Score and a standard verification message. This saves computational resources by not invoking the VLM for clean images.
• If the predicted class is Fake (AI or Splice): The image is flagged and passed to Module 2 for "interrogation."

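The routing above can be sketched as plain control flow. The 0.90 confidence threshold is a hypothetical value (the report does not specify one), and treating a low-confidence "Real" prediction as escalation-worthy is an assumption about the gate's behavior.

```python
REAL_CONFIDENCE_THRESHOLD = 0.90  # hypothetical cutoff, not from the report

def gatekeeper(predicted_class, confidence):
    """Decide whether the pipeline stops at Module 1 or escalates to Module 2."""
    if predicted_class == "Real" and confidence >= REAL_CONFIDENCE_THRESHOLD:
        # Early exit: the VLM is never invoked for confidently clean images.
        return {"verdict": "verified", "authenticity": confidence}
    # Any Fake class (and, in this sketch, a low-confidence "Real")
    # is flagged and handed to the BLIP interrogation stage.
    return {"verdict": "flagged", "next_stage": "vqa_interrogation"}
```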
Step 2: Forensic Interrogation (BLIP VLM)

Once an image is flagged as manipulated, Module 2 executes a Targeted VQA Protocol. Instead of generic prompts, we inject three specific probes into the VLM:

1. Object Identification: "What is the main furniture object in the center?" (context grounding).
2. Shadow Physics: "Does the [object] cast a shadow on the floor?" (grounding consistency).
3. Lighting Consistency: "Does the lighting on the [object] match the background?" (global illumination check).

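The three probes chain together: the answer to probe 1 fills the [object] slot in probes 2 and 3. A sketch of that templating, where `ask_vqa` is a stand-in for a real BLIP VQA call (its signature is assumed, not the library's API):

```python
# Prompt wording mirrors the three probes listed above.
PROBE_OBJECT = "What is the main furniture object in the center?"
PROBE_SHADOW = "Does the {obj} cast a shadow on the floor?"
PROBE_LIGHT = "Does the lighting on the {obj} match the background?"

def build_probes(ask_vqa, image):
    """Run probe 1, then ground probes 2 and 3 on the detected object.

    `ask_vqa(image, question) -> answer` is a placeholder for the actual
    VQA model invocation.
    """
    obj = ask_vqa(image, PROBE_OBJECT)  # e.g. "chair"
    return {
        "object": obj,
        "shadow_q": PROBE_SHADOW.format(obj=obj),
        "light_q": PROBE_LIGHT.format(obj=obj),
    }
```

For example, if probe 1 returns "chair", probe 2 becomes "Does the chair cast a shadow on the floor?".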
Step 3: Rule-Based Synthesis

The final output is not raw model text. A logic layer synthesizes the VQA answers into a coherent forensic report.

• Input: Shadow=No, Lighting=No, Object=Chair.
• Synthesized Output: "Manipulation detected: the chair lacks a grounded contact shadow; illumination on the chair contradicts the scene."

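A minimal version of this synthesis layer, with clause wording taken from the worked example above; the yes/no normalization of VQA answers is an assumption.

```python
def synthesize_report(obj, shadow_answer, lighting_answer):
    """Map VQA answers to a forensic report using fixed rules.

    Each "No" answer contributes one finding clause; clauses are joined
    into a single sentence, matching the example output in Step 3.
    """
    obj = obj.lower()
    findings = []
    if shadow_answer.strip().lower() == "no":
        findings.append(f"the {obj} lacks a grounded contact shadow")
    if lighting_answer.strip().lower() == "no":
        findings.append(f"illumination on the {obj} contradicts the scene")
    if not findings:
        return "No physical inconsistencies articulated by the VQA probes."
    return "Manipulation detected: " + "; ".join(findings) + "."
```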
4. Data Engineering

To ensure robustness, the models were trained and tested on a curated dataset representing real-world attack vectors:

• Real Sources: High-quality interior photography from the Places365 dataset (living rooms, bedrooms).
• Synthetic Attacks: Generated using Stable Diffusion Inpainting to insert objects (sofas, tables) into scenes without proper lighting integration.
• Splicing Attacks: "Dumb" copy-paste augmentations created programmatically to simulate amateur Photoshop errors (floating objects, broken perspective).

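The "dumb" splice augmentation can be as simple as a hard, unblended overwrite of one image region by another; the resulting seam and missing contact shadow are exactly the artifacts the detector trains on. This is an illustrative sketch, not the competition's actual generation script:

```python
import numpy as np

def naive_splice(scene, patch, top, left):
    """Paste `patch` into `scene` with no blending, shadow synthesis,
    or perspective correction, mimicking an amateur copy-paste edit."""
    out = scene.copy()
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w] = patch  # hard, unblended overwrite
    return out

# Toy example: a white 16x16 "object" dropped into a black 64x64 "room".
scene = np.zeros((64, 64, 3), dtype=np.uint8)
patch = np.full((16, 16, 3), 255, dtype=np.uint8)
spliced = naive_splice(scene, patch, 10, 20)
```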
5. Conclusion

This system successfully bridges the gap between "black box" AI detection and human interpretability. By using ResNet-18 to detect pixel-level artifacts and BLIP to reason about scene physics, the solution provides real estate platforms with a reliable, scalable, and explainable tool for verifying listing integrity.