---
title: Face Matcher AI 
emoji: 🔍
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.0.0
python_version: "3.10"
app_file: app.py
pinned: false
---

# 🎭 Celebrity Twin Matcher & AI Face Analysis

**A Full-Stack Data Science Project implementing an end-to-end pipeline for facial recognition, synthetic data generation, and vector database search.**

## 📖 Project Overview
This project builds an AI-powered application that takes a user's photo and finds their closest celebrity lookalike. Unlike simple matching scripts, this project implements a robust **Data Engineering Pipeline**:
1.  **Hybrid Dataset:** Combines real celebrity photos (LFW) with **AI-generated synthetic faces** (Stable Diffusion).
2.  **Model Tournament:** Benchmarks three embedding models and selects the best one based on inference speed and clustering quality.
3.  **Vector Search:** Stores the embeddings in a **Parquet file** that serves as the vector database, and uses Cosine Similarity for matching.
4.  **Interactive App:** Deployed via a Gradio UI.

---

## 🛠️ Tech Stack
* **Core Logic:** Python, NumPy, Pandas
* **Computer Vision:** DeepFace, OpenCV
* **Generative AI:** Stable Diffusion (Diffusers), Torch
* **Machine Learning:** Scikit-Learn (PCA, t-SNE, Cosine Similarity)
* **Data Engineering:** Parquet (Arrow), ETL Pipelines
* **UI/Deployment:** Gradio

---

## 🚀 The Data Pipeline (Step-by-Step)

### 1. Synthetic Data Generation
We enhance the dataset by generating synthetic faces and appending them to the real images.
* **Engine:** Stable Diffusion v1.5
* **Method:** We generated photorealistic portraits using the prompt *"A highly detailed photorealistic portrait of a hollywood celebrity..."*
* **Outcome:** These images are labeled as "Synthetic" in our database to visualize how they cluster compared to real faces.

### 2. Data Acquisition & Cleaning
We aggregated data from three sources:
1.  **LFW (Labeled Faces in the Wild):** A benchmark dataset for face recognition.
2.  **User Data:** Custom uploads (e.g., test subjects).
3.  **Synthetic Data:** The AI images generated in Step 1.

**The Filter:** We implemented a `has_face()` function using **OpenCV Haar Cascades**. This ensures that every image entering our pipeline actually contains a readable face, removing bad crops or blurry background data.

![alt text](image-1.png)
*> Description: A distribution plot showing the balance of images per celebrity (capped at 40 per person).*

### 3. The Model "Battle" (Model Selection)
Instead of guessing which AI model to use, we ran a tournament comparing three state-of-the-art architectures:
* **Facenet512**
* **ArcFace**
* **GhostFaceNet**

We evaluated them on **Inference Speed** (seconds per image) and **Cluster Quality** (Silhouette Score). The winner was automatically selected to build the final database.
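
The selection step could be automated along these lines. This is a sketch under a stated assumption: the README does not say how speed and silhouette score are combined, so the rule here (highest silhouette wins, faster model breaks ties) is hypothetical, as are the names `ModelResult`, `seconds_per_image`, and `pick_winner`.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ModelResult:
    name: str
    seconds_per_image: float  # lower is better
    silhouette: float         # in [-1, 1], higher is better


def seconds_per_image(embed: Callable[[object], object], images: Sequence[object]) -> float:
    """Average wall-clock time of embed() over a non-empty sequence of images."""
    start = time.perf_counter()
    for image in images:
        embed(image)
    return (time.perf_counter() - start) / len(images)


def pick_winner(results: Sequence[ModelResult]) -> ModelResult:
    """Assumed rule: best silhouette score wins; the faster model breaks ties."""
    return max(results, key=lambda r: (r.silhouette, -r.seconds_per_image))
```

A weighted score would be an equally reasonable rule if speed matters more than cluster quality for deployment.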

![alt text](image-2.png)
*> Description: Bar chart comparing the inference speed of the different models.*

### 4. ETL Pipeline (Extract, Transform, Load)
* **Extract:** Loop through valid images.
* **Transform:** Convert images into 512-dimensional vector embeddings using `DeepFace.represent`. Normalize names and categorize as "Real" or "Synthetic".
* **Load:** Save the structured data into a **Parquet file** (`famous_faces.parquet`). This acts as our persistent Vector Database.
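
The three stages above can be sketched as follows. The folder layout (one directory per person), the column names, the `synthetic_marker` convention, and the helper names are assumptions made for illustration; only the output file name comes from this README. The `embed` callable stands in for whatever wraps `DeepFace.represent`.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Sequence


def normalize_name(raw: str) -> str:
    """Replace underscores and collapse whitespace: 'George_W_Bush' -> 'George W Bush'."""
    return " ".join(raw.replace("_", " ").split())


def build_rows(
    image_paths: Iterable[str],
    embed: Callable[[str], Sequence[float]],
    synthetic_marker: str = "synthetic",
) -> List[Dict[str, object]]:
    """Extract + Transform: one row per image with name, category and embedding."""
    rows = []
    for raw_path in image_paths:
        path = Path(raw_path)
        label = path.parent.name  # assumed layout: <root>/<person>/<image>
        rows.append({
            "name": normalize_name(label),
            "category": "Synthetic" if synthetic_marker in label.lower() else "Real",
            "path": str(path),
            "embedding": list(embed(str(path))),
        })
    return rows


def save_parquet(rows: List[Dict[str, object]], out_path: str = "famous_faces.parquet") -> None:
    """Load: persist the rows; needs pandas with a Parquet engine such as pyarrow."""
    import pandas as pd  # imported lazily so the transform step has no hard dependency

    pd.DataFrame(rows).to_parquet(out_path, index=False)
```

In recent DeepFace releases, `DeepFace.represent(img_path=..., model_name=...)` returns a list of dictionaries with an `embedding` key, so `embed` would likely be a thin wrapper that takes the first result's `embedding`.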

### 5. Advanced Visualization (EDA)
To inspect the quality of our embeddings, we projected the 512-dimensional vectors down to 2D using **PCA** (Principal Component Analysis) and **t-SNE**.
* This visualizes how the model groups similar faces.
* It allows us to see if "Synthetic" faces form their own cluster or blend in with "Real" faces.
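
To illustrate the PCA half of this projection, here is a small sketch that uses a plain NumPy SVD instead of the Scikit-Learn `PCA` class listed in the tech stack. The function name `pca_2d` is hypothetical.

```python
import numpy as np


def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project (n_samples, n_features) embeddings onto their top two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

The resulting `(n_samples, 2)` array can be scattered with one colour per category to check whether "Synthetic" points separate from "Real" ones. The t-SNE view would come from `sklearn.manifold.TSNE` applied to the same embeddings.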

![alt text](image-3.png)
*> Description: Scatter plots showing the vector space separation between Real and Synthetic faces.*

---

## 🧠 Core Logic: How Matching Works

1.  **Vectorization:** The user's image is converted into an embedding vector ($V_{user}$) using the winning model.
2.  **Cosine Similarity:** We compute the cosine of the angle between the user's vector and every vector in our Parquet database.
3.  **Ranking:** The system returns the top 3 images with the highest similarity score (closest to 1.0).
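
The ranking step can be sketched in a few lines of NumPy. The function name `top_matches` and its arguments are illustrative assumptions; the database vectors are assumed to be already loaded from the Parquet file into a 2D array, with `names` holding the matching labels.

```python
from typing import List, Sequence, Tuple

import numpy as np


def top_matches(
    user_vec: Sequence[float],
    db_vecs: Sequence[Sequence[float]],
    names: Sequence[str],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k database entries with the highest cosine similarity to user_vec."""
    user = np.asarray(user_vec, dtype=float)
    db = np.asarray(db_vecs, dtype=float)
    # Normalise to unit length so the dot product equals cosine similarity.
    user = user / np.linalg.norm(user)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    scores = db @ user
    best = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(names[i], float(scores[i])) for i in best]
```

Because every vector is normalised first, the scores fall in the range -1 to 1, and a score near 1.0 means the two embeddings point in almost the same direction.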

---

## 📸 Application Demo
The final application is built with **Gradio**, allowing users to upload a photo or use their webcam.

---


## 🏆 Credits
* **Project Author:** Matan Kriel
* **Project Author:** Odeya Shmuel