LinaAlkh committed 591a8e4 (verified) · Parent: 2d96524
Upload README.txt.txt
Technical Report: Hybrid VLM-CNN Architecture for Real Estate Manipulation Detection
Team Name: Lina Alkhatib
Track: Track B (Real Estate)
Date: January 28, 2026

1. Executive Summary
This report details the development of an automated forensic system that detects and explains digital manipulations in real estate imagery. Addressing the challenge of "fake listings," our solution employs a hybrid vision-language architecture. By combining the high-speed pattern recognition of a convolutional neural network (ResNet-18) with the semantic reasoning capabilities of a vision-language model (BLIP), the system achieves both high detection accuracy and human-readable interpretability. It specifically targets physical inconsistencies such as irregular shadows, conflicting illumination, and floating objects.

2. System Architecture
The system operates as a serial cascading pipeline, prioritizing computational efficiency without sacrificing explanatory depth. The architecture is divided into two distinct modules that interact via a logic controller.

Module 1: The Detector (Quantitative Analysis)
• Architecture: ResNet-18 (residual neural network).
• Role: Rapid classification and manipulation-type scoring.
• Input: 224 × 224 RGB images.
• Classes: Real, Fake_AI (fully generated), Fake_Splice (copy-paste edits).
• Mechanism: The CNN extracts low-level latent features (texture anomalies, pixel noise, compression artifacts) that are often invisible to the human eye but indicative of digital tampering.
• Output: An Authenticity Score (0.0 to 1.0) and a predicted class label.

Module 2: The Reasoner (Qualitative Forensics)
• Architecture: BLIP (Bootstrapping Language-Image Pre-training) for Visual Question Answering (VQA).
• Role: Semantic analysis and report generation.
• Mechanism: Unlike standard captioning models, this module acts as a forensic interrogator. It does not simply describe the image; it answers specific physics-based questions about the scene's geometry and lighting.
• Output: Natural-language justifications (e.g., "Inconsistent lighting detected on the sofa").

3. The "Fusion Strategy"
The core innovation of our solution is the Conditional Logic Fusion Strategy. Instead of running both heavy models simultaneously (parallel fusion), we use a conditional dependency approach that optimizes for inference speed and relevance.

Step 1: The Gatekeeper (ResNet-18)
The image is first passed through Module 1.
• If the predicted class is Real (with high confidence): The pipeline terminates early. The system outputs a high Authenticity Score and a standard verification message, saving computational resources by not invoking the VLM on clean images.
• If the predicted class is Fake (AI or Splice): The image is flagged and passed to Module 2 for "interrogation."
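The gatekeeper logic above can be sketched as a small controller; the 0.90 confidence gate and the detector/reasoner callables are illustrative assumptions.

```python
from typing import Callable, Dict, Tuple

CONFIDENCE_GATE = 0.90  # assumed threshold; tune on a validation set

def cascade(image,
            detector: Callable[[object], Tuple[str, float]],
            reasoner: Callable[[object], str]) -> Dict[str, object]:
    """Serial cascading pipeline: run the cheap CNN first and invoke
    the expensive VLM only when the image is flagged as manipulated."""
    label, authenticity = detector(image)
    if label == "Real" and authenticity >= CONFIDENCE_GATE:
        # Early exit: clean image, the VLM is never invoked.
        return {"label": label, "score": authenticity,
                "report": "No manipulation detected."}
    # Flagged image: escalate to the VLM for interrogation.
    return {"label": label, "score": authenticity,
            "report": reasoner(image)}
```

Because the reasoner is passed in as a callable, it is only evaluated on the flagged branch, which is exactly the resource saving the step above describes.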

Step 2: Forensic Interrogation (BLIP VLM)
Once an image is flagged as manipulated, Module 2 executes a Targeted VQA Protocol. Instead of generic prompts, we inject three specific probes into the VLM:
1. Object identification: "What is the main furniture object in the center?" (context grounding).
2. Shadow physics: "Does the [object] cast a shadow on the floor?" (grounding consistency).
3. Lighting consistency: "Is the lighting on the [object] matching the background?" (global illumination check).
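The three probes above can be chained so that the answer to the first fills the [object] slot in the other two. The sketch below assumes a `vqa(question) -> answer` callable wrapping the VLM; `build_probes` and its templates are hypothetical helpers, not the shipped implementation.

```python
from typing import Callable, Dict

# Templates mirror the three-probe protocol; {obj} is the grounding slot.
PROBES = [
    "What is the main furniture object in the center?",
    "Does the {obj} cast a shadow on the floor?",
    "Is the lighting on the {obj} matching the background?",
]

def build_probes(vqa: Callable[[str], str]) -> Dict[str, str]:
    """Run the targeted VQA protocol: probe 1 grounds the object,
    probes 2 and 3 are filled with its answer and check physics."""
    obj = vqa(PROBES[0])
    return {
        "object": obj,
        "shadow": vqa(PROBES[1].format(obj=obj)),
        "lighting": vqa(PROBES[2].format(obj=obj)),
    }
```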

Step 3: Rule-Based Synthesis
The final output is not raw model text. A logic layer synthesizes the VQA answers into a coherent forensic report.
• Input: Shadow=No, Lighting=No, Object=Chair.
• Synthesized output: "Manipulation detected: the chair lacks a grounded contact shadow; illumination on the chair contradicts the scene."
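One possible synthesis layer for the example above; the yes/no normalization and the clean-image fallback message are assumptions.

```python
def synthesize_report(obj: str, shadow: str, lighting: str) -> str:
    """Rule-based synthesis: map raw yes/no VQA answers to a forensic report."""
    findings = []
    if shadow.strip().lower() == "no":
        findings.append(f"the {obj} lacks a grounded contact shadow")
    if lighting.strip().lower() == "no":
        findings.append(f"illumination on the {obj} contradicts the scene")
    if not findings:
        return "No physical inconsistencies found by the reasoner."
    return "Manipulation detected: " + "; ".join(findings) + "."
```

With the inputs Shadow=No, Lighting=No, Object=chair this reproduces the synthesized output shown above.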

4. Data Engineering
To ensure robustness, the models were trained and tested on a curated dataset representing real-world attack vectors:
• Real sources: High-quality interior photography from the Places365 dataset (living rooms, bedrooms).
• Synthetic attacks: Generated using Stable Diffusion inpainting to insert objects (sofas, tables) into scenes without proper lighting integration.
• Splicing attacks: "Dumb" copy-paste augmentations created programmatically to simulate amateur Photoshop errors (floating objects, broken perspective).
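A programmatic copy-paste augmentation of this kind can be sketched with Pillow; the patch size, random placement, and hard (unblended) paste are illustrative choices, which is precisely what produces the "floating object" look.

```python
import random
from PIL import Image

def splice(background: Image.Image, donor: Image.Image,
           size=(64, 64), seed=None) -> Image.Image:
    """'Dumb' splice: crop a patch from a donor image and paste it at a
    random location with no lighting, shadow, or perspective integration."""
    rng = random.Random(seed)
    w, h = size
    # Random crop from the donor image.
    dx = rng.randint(0, donor.width - w)
    dy = rng.randint(0, donor.height - h)
    patch = donor.crop((dx, dy, dx + w, dy + h))
    # Hard paste into a copy of the background: no blending at the seams.
    out = background.copy()
    px = rng.randint(0, background.width - w)
    py = rng.randint(0, background.height - h)
    out.paste(patch, (px, py))
    return out
```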

5. Conclusion
This system bridges the gap between "black box" AI detection and human interpretability. By using ResNet-18 to detect pixel-level artifacts and BLIP to reason about scene physics, the solution provides real estate platforms with a reliable, scalable, and explainable tool for verifying listing integrity.