---
title: Real Estate Manipulation Detector
emoji: 🏠
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
tags:
- computer-vision
- forensics
- real-estate
- blip
- resnet
license: mit
---

# 🕵️‍♂️ Real Estate Manipulation Detector (Hybrid VLM-CNN)

**Team Name:** Lina Alkhatib  
**Track:** Track B (Real Estate)  
**Date:** January 28, 2026

## 1. Executive Summary
This project implements an automated forensic system designed to detect and explain digital manipulations in real estate imagery. Addressing the challenge of "fake listings," our solution employs a **Hybrid Vision-Language Architecture**. By combining the high-speed pattern recognition of a Convolutional Neural Network (**ResNet-18**) with the semantic reasoning capabilities of a Vision-Language Model (**BLIP**), the system achieves both high detection accuracy and human-readable interpretability.

## 2. System Architecture
The system operates on a **Serial Cascading Pipeline**, utilizing two distinct modules:

### Module 1: The Detector (Quantitative Analysis)
* **Architecture:** ResNet-18 (Residual Neural Network).
* **Role:** Rapid three-way classification (authentic vs. manipulated) and manipulation-type scoring.
* **Classes:** `Real`, `Fake_AI`, `Fake_Splice`.
* **Output:** An `Authenticity Score` (0.0 - 1.0) and a predicted class label.

### Module 2: The Reasoner (Qualitative Forensics)
* **Architecture:** BLIP in Visual Question Answering (VQA) mode.
* **Role:** Semantic analysis and report generation.
* **Mechanism:** The model answers specific physics-based questions about shadows, lighting, and floating objects, and its answers are assembled into a forensic report.

## 3. The "Fusion Strategy"
We use a **Conditional Logic Fusion Strategy**:
1.  **Step 1:** The image is passed through ResNet-18.
2.  **Step 2:** If flagged as `Fake`, the image is passed to BLIP.
3.  **Step 3:** BLIP is "interrogated" with targeted prompts (*"Does the object cast a shadow?"*, *"Is lighting consistent?"*).
4.  **Step 4:** A logic layer synthesizes the answers into a final text report (e.g., *"Manipulation detected: the chair lacks a grounded contact shadow"*).
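The four steps above can be sketched as a small logic layer. This is a hedged, self-contained illustration: the detector and BLIP calls are represented only by their outputs (a label/score pair and a dict of yes/no answers), and the question keys and `fuse` function are hypothetical names, not the actual API of `predict.py`:

```python
# Illustrative prompts for Step 3 (the real interrogation set may differ).
QUESTIONS = {
    "shadow": "Does the object cast a shadow?",
    "lighting": "Is lighting consistent?",
    "floating": "Is any object floating above the floor?",
}

def fuse(detector_label: str, detector_score: float, blip_answers: dict) -> str:
    """Steps 2-4: gate BLIP on the detector verdict, then synthesize a report."""
    # Step 2: images the detector calls Real skip the BLIP stage entirely.
    if detector_label == "Real":
        return f"Authentic (score {detector_score:.2f}); BLIP not invoked."
    # Step 4: map yes/no answers to human-readable findings.
    findings = []
    if blip_answers.get("shadow") == "no":
        findings.append("an object lacks a grounded contact shadow")
    if blip_answers.get("lighting") == "no":
        findings.append("lighting is inconsistent across the scene")
    if blip_answers.get("floating") == "yes":
        findings.append("an object appears to float above the floor")
    detail = "; ".join(findings) if findings else "no single visual cue isolated"
    return (f"Manipulation detected ({detector_label}, "
            f"score {detector_score:.2f}): {detail}.")

report = fuse("Fake_Splice", 0.12,
              {"shadow": "no", "lighting": "yes", "floating": "no"})
print(report)
```

The conditional gate in Step 2 is the main design choice: BLIP is only paid for on suspicious images, keeping the common case (authentic listings) fast.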

## 4. How to Run
1.  Clone this repository.
2.  Install dependencies: `pip install -r requirements.txt`
3.  Run the inference script:
    ```bash
    python predict.py --input_dir ./test_images --output_file submission.json --model_path detector_model.pth
    ```

## 5. Files in this Repo
* `predict.py`: The main inference script.
* `detector_model.pth`: The trained ResNet-18 weights.
* `requirements.txt`: Python dependencies.