---
title: Face Matcher AI
emoji: 🔍
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.0.0
python_version: "3.10"
app_file: app.py
pinned: false
---
# 🎭 Celebrity Twin Matcher & AI Face Analysis
**A Full-Stack Data Science Project implementing an end-to-end pipeline for facial recognition, synthetic data generation, and vector database search.**
## 📖 Project Overview
This project builds an AI-powered application that takes a user's photo and finds their closest celebrity lookalike. Unlike simple matching scripts, this project implements a robust **Data Engineering Pipeline**:
1. **Hybrid Dataset:** Combines real celebrity photos (LFW) with **AI-generated synthetic faces** (Stable Diffusion).
2. **Model Tournament:** Selects the embedding model empirically, based on measured inference speed and clustering quality.
3. **Vector Search:** Stores the embeddings in a **Parquet Database** and ranks candidates by Cosine Similarity.
4. **Interactive App:** Deployed via a Gradio UI.
---
## 🛠️ Tech Stack
* **Core Logic:** Python, NumPy, Pandas
* **Computer Vision:** DeepFace, OpenCV
* **Generative AI:** Stable Diffusion (Diffusers), Torch
* **Machine Learning:** Scikit-Learn (PCA, t-SNE, Cosine Similarity)
* **Data Engineering:** Parquet (Arrow), ETL Pipelines
* **UI/Deployment:** Gradio
---
## 🚀 The Data Pipeline (Step-by-Step)
### 1. Synthetic Data Generation
We enhance the dataset by generating synthetic face images and appending them to the real ones.
* **Engine:** Stable Diffusion v1.5
* **Method:** We generated photorealistic portraits using the prompt *"A highly detailed photorealistic portrait of a hollywood celebrity..."*
* **Outcome:** These images are labeled as "Synthetic" in our database to visualize how they cluster compared to real faces.
### 2. Data Acquisition & Cleaning
We aggregated data from three sources:
1. **LFW (Labeled Faces in the Wild):** A benchmark dataset for face recognition.
2. **User Data:** Custom uploads (e.g., test subjects).
3. **Synthetic Data:** The AI images generated in Step 1.
**The Filter:** We implemented a `has_face()` function using **OpenCV Haar Cascades**. This ensures that every image entering our pipeline actually contains a readable face, removing bad crops or blurry background data.
![Distribution of images per celebrity](image-1.png)
*> Description: A distribution plot showing the balance of images per celebrity (capped at 40 per person).*
### 3. The Model "Battle" (Model Selection)
Instead of guessing which AI model to use, we ran a tournament comparing three state-of-the-art architectures:
* **Facenet512**
* **ArcFace**
* **GhostFaceNet**
We evaluated them on **Inference Speed** (seconds per image) and **Cluster Quality** (Silhouette Score). The winner was automatically selected to build the final database.
![Inference speed per model](image-2.png)
*> Description: Bar chart comparing the inference speed of the different models.*
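The two metrics can be sketched as follows. This is an illustration under assumptions: `embed_fn` stands for any function that maps an image path to an embedding, the silhouette score is computed with cosine distance, and the tie-breaking rule in `pick_winner` (best silhouette first, then fastest) is a hypothetical choice, since the exact selection rule is not spelled out here.

```python
import time

import numpy as np
from sklearn.metrics import silhouette_score


def seconds_per_image(embed_fn, image_paths):
    """Average wall-clock time of embed_fn over the given images."""
    start = time.perf_counter()
    for path in image_paths:
        embed_fn(path)
    return (time.perf_counter() - start) / len(image_paths)


def cluster_quality(embeddings, labels):
    """Silhouette score of the embeddings, grouped by person label."""
    return float(
        silhouette_score(np.asarray(embeddings), labels, metric="cosine")
    )


def pick_winner(results):
    """results maps model name -> {"silhouette": float, "sec_per_image": float}.

    Assumed rule: highest silhouette wins, ties go to the faster model.
    """
    return max(
        results,
        key=lambda name: (
            results[name]["silhouette"],
            -results[name]["sec_per_image"],
        ),
    )
```

A silhouette score close to 1 means images of the same person sit close together and far from other people; values near 0 or below indicate overlapping identities.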
### 4. ETL Pipeline (Extract, Transform, Load)
* **Extract:** Loop through valid images.
* **Transform:** Convert images into 512-dimensional vector embeddings using `DeepFace.represent`. Normalize names and categorize as "Real" or "Synthetic".
* **Load:** Save the structured data into a **Parquet file** (`famous_faces.parquet`). This acts as our persistent Vector Database.
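The three stages might be wired together roughly like this. Everything beyond `DeepFace.represent` and the `famous_faces.parquet` file name is an assumption made for illustration: the LFW-style folder layout (`<root>/<Person_Name>/<file>.jpg`), a `synthetic` folder for generated images, the column names, and the placeholder `model_name`, which should be replaced by whichever model won the tournament.

```python
from pathlib import Path

import pandas as pd


def deepface_embed(image_path, model_name="ArcFace"):
    """Embed one image with DeepFace (model_name here is a placeholder)."""
    from deepface import DeepFace  # imported lazily, heavy dependency

    result = DeepFace.represent(
        img_path=str(image_path),
        model_name=model_name,
        enforce_detection=False,
    )
    return result[0]["embedding"]  # first detected face


def normalize_name(raw):
    """Turn an LFW-style folder name such as 'Tom_Hanks' into 'Tom Hanks'."""
    return " ".join(raw.replace("_", " ").split())


def build_table(image_paths, embed_fn=deepface_embed, synthetic_dir="synthetic"):
    """Extract + Transform: one row per image with name, category, embedding."""
    rows = []
    for path in map(Path, image_paths):
        category = "Synthetic" if synthetic_dir in path.parts else "Real"
        rows.append(
            {
                "name": normalize_name(path.parent.name),
                "category": category,
                "image_path": str(path),
                "embedding": list(embed_fn(path)),
            }
        )
    return pd.DataFrame(rows)


def save_table(df, path="famous_faces.parquet"):
    """Load: persist the table (needs pyarrow or fastparquet installed)."""
    df.to_parquet(path, index=False)
```

Passing `embed_fn` as a parameter keeps the pipeline testable without loading a face model: any function that returns a list of floats can stand in for DeepFace.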
### 5. Advanced Visualization (EDA)
To inspect the quality of our embeddings, we projected the 512-dimensional vectors down to 2D using **PCA** (Principal Component Analysis) and **t-SNE**.
* This visualizes how the model groups similar faces.
* It allows us to see if "Synthetic" faces form their own cluster or blend in with "Real" faces.
![2D projections of Real and Synthetic face embeddings](image-3.png)
*> Description: Scatter plots showing the vector space separation between Real and Synthetic faces.*
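A small sketch of the two projections with scikit-learn. The function name, the seed, and the perplexity handling are assumptions for illustration; the resulting arrays can be passed to any scatter-plot routine and colored by the "Real" / "Synthetic" category.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def project_2d(embeddings, perplexity=30.0, seed=0):
    """Return (pca_2d, tsne_2d), each of shape (n_samples, 2)."""
    X = np.asarray(embeddings, dtype=np.float32)

    # Linear projection onto the two directions of highest variance.
    pca_2d = PCA(n_components=2, random_state=seed).fit_transform(X)

    # t-SNE requires perplexity < n_samples, so shrink it for small datasets.
    perplexity = min(perplexity, max(1.0, (len(X) - 1) / 3))
    tsne_2d = TSNE(
        n_components=2,
        perplexity=perplexity,
        init="pca",
        random_state=seed,
    ).fit_transform(X)
    return pca_2d, tsne_2d
```

PCA preserves global structure and is deterministic, while t-SNE emphasizes local neighborhoods, so distances between far-apart clusters in the t-SNE plot should not be over-interpreted.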
---
## 🧠 Core Logic: How Matching Works
1. **Vectorization:** The user's image is converted into an embedding vector ($V_{user}$) using the winning model.
2. **Cosine Similarity:** We compute the cosine of the angle between the user's vector and every vector $V_i$ in our Parquet database: $\cos\theta_i = \frac{V_{user} \cdot V_i}{\|V_{user}\| \, \|V_i\|}$.
3. **Ranking:** The system returns the top 3 images with the highest similarity score (closest to 1.0).
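The ranking step can be sketched with plain NumPy. The function name and argument layout are illustrative assumptions: `db_vecs` is the matrix of stored embeddings (one row per image) and `names` holds the matching labels.

```python
import numpy as np


def top_matches(user_vec, db_vecs, names, k=3):
    """Return the k (name, cosine similarity) pairs closest to user_vec."""
    user = np.asarray(user_vec, dtype=np.float64)
    db = np.asarray(db_vecs, dtype=np.float64)

    # After L2-normalization, a dot product equals the cosine similarity.
    user = user / np.linalg.norm(user)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = db @ user

    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-sims)[:k]
    return [(names[i], float(sims[i])) for i in order]
```

Because the stored vectors do not change between queries, they could be normalized once when the Parquet file is loaded, leaving a single matrix-vector product per request.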
---
## 📸 Application Demo
The final application is built with **Gradio**, allowing users to upload a photo or use their webcam.
---
## 🏆 Credits
* **Project Author:** Matan Kriel
* **Project Author:** Odeya Shmuel