---
title: Face Matcher AI
emoji: π
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.0.0
python_version: '3.10'
app_file: app.py
pinned: false
---
# Celebrity Twin Matcher & AI Face Analysis
A Full-Stack Data Science Project implementing an end-to-end pipeline for facial recognition, synthetic data generation, and vector database search.
## Project Overview
This project builds an AI-powered application that takes a user's photo and finds their closest celebrity lookalike. Unlike simple matching scripts, this project implements a robust Data Engineering Pipeline:
- Hybrid Dataset: Combines real celebrity photos (LFW) with AI-generated synthetic faces (Stable Diffusion).
- Model Tournament: Scientifically selects the best embedding model based on speed and clustering quality.
- Vector Search: Builds a professional Parquet Database and uses Cosine Similarity for matching.
- Interactive App: Deployed via a Gradio UI.
## Tech Stack
- Core Logic: Python, NumPy, Pandas
- Computer Vision: DeepFace, OpenCV
- Generative AI: Stable Diffusion (Diffusers), Torch
- Machine Learning: Scikit-Learn (PCA, t-SNE, Cosine Similarity)
- Data Engineering: Parquet (Arrow), ETL Pipelines
- UI/Deployment: Gradio
## The Data Pipeline (Step-by-Step)
### 1. Synthetic Data Generation
We enhance the dataset by generating synthetic portraits and appending them to the real data.
- Engine: Stable Diffusion v1.5
- Method: We generated photorealistic portraits using the prompt "A highly detailed photorealistic portrait of a hollywood celebrity..."
- Outcome: These images are labeled as "Synthetic" in our database to visualize how they cluster compared to real faces.
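The generation step can be sketched as follows. This is a minimal illustration assuming the `diffusers` library's `StableDiffusionPipeline`; the model ID, output directory, and full prompt text are placeholders, not the project's actual configuration.

```python
# Sketch of the synthetic-face generation step (illustrative only:
# model id, prompt wording, and output path are assumptions).

PROMPT = ("A highly detailed photorealistic portrait of a hollywood "
          "celebrity, studio lighting, 85mm lens")

def generate_synthetic_faces(n_images: int, out_dir: str = "synthetic") -> None:
    """Generate portraits with Stable Diffusion v1.5 and save them to disk."""
    import os
    import torch
    from diffusers import StableDiffusionPipeline  # heavy imports kept local

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

    os.makedirs(out_dir, exist_ok=True)
    for i in range(n_images):
        image = pipe(PROMPT).images[0]
        image.save(os.path.join(out_dir, f"synthetic_{i:03d}.png"))
```

Saving each image with an index-based filename makes the later "Synthetic" labeling step in the database trivial.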
### 2. Data Acquisition & Cleaning
We aggregated data from three sources:
- LFW (Labeled Faces in the Wild): A benchmark dataset for face recognition.
- User Data: Custom uploads (e.g., test subjects).
- Synthetic Data: The AI images generated in Step 1.
The Filter: We implemented a `has_face()` function using OpenCV Haar Cascades. This ensures that every image entering the pipeline actually contains a detectable face, filtering out bad crops and blurry background shots.
> Description: A distribution plot showing the balance of images per celebrity (capped at 40 per person).
### 3. The Model "Battle" (Model Selection)
Instead of guessing which AI model to use, we ran a tournament comparing three state-of-the-art architectures:
- Facenet512
- ArcFace
- GhostFaceNet
We evaluated them on Inference Speed (seconds per image) and Cluster Quality (Silhouette Score). The winner was automatically selected to build the final database.
> Description: Bar chart comparing the inference speed of the different models.
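The tournament loop can be sketched like this. The `fake_embed` function is a stand-in for calling `DeepFace.represent` with each backbone, and picking the winner purely by silhouette score is a simplification of the speed-plus-quality trade-off described above:

```python
import time
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

def fake_embed(image, dim=512):
    """Placeholder for a real model call (e.g. DeepFace.represent)."""
    return rng.normal(size=dim)

def evaluate(embed_fn, images, labels):
    """Score one model on inference speed and cluster quality."""
    start = time.perf_counter()
    vectors = np.stack([embed_fn(img) for img in images])
    seconds_per_image = (time.perf_counter() - start) / len(images)
    quality = silhouette_score(vectors, labels)  # higher = tighter clusters
    return seconds_per_image, quality

images = [None] * 20                 # stand-in image list
labels = [i % 4 for i in range(20)]  # 4 pseudo-identities
results = {"Facenet512": evaluate(fake_embed, images, labels)}
# ...repeat evaluate() for ArcFace and GhostFaceNet, then pick a winner:
winner = max(results, key=lambda m: results[m][1])
```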
### 4. ETL Pipeline (Extract, Transform, Load)
- Extract: Loop through valid images.
- Transform: Convert images into 512-dimensional vector embeddings using `DeepFace.represent`. Normalize names and categorize each image as "Real" or "Synthetic".
- Load: Save the structured data into a Parquet file (`famous_faces.parquet`). This acts as our persistent Vector Database.
### 5. Advanced Visualization (EDA)
To prove the quality of our embeddings, we projected the 512-dimensional vectors down to 2D using PCA (Principal Component Analysis) and t-SNE.
- This visualizes how the model groups similar faces.
- It allows us to see if "Synthetic" faces form their own cluster or blend in with "Real" faces.
> Description: Scatter plots showing the vector space separation between Real and Synthetic faces.
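The 2-D projections behind those plots can be sketched with scikit-learn; random vectors stand in here for the real 512-dimensional embeddings, and the t-SNE parameters are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 512))  # placeholder for real embeddings

# Linear projection: fast, preserves global variance structure.
pca_2d = PCA(n_components=2).fit_transform(embeddings)

# Non-linear projection: better at revealing local clusters.
tsne_2d = TSNE(n_components=2, perplexity=10, init="pca",
               random_state=0).fit_transform(embeddings)
```

Coloring these 2-D points by the "Real"/"Synthetic" label is what reveals whether the synthetic faces blend in or form their own cluster.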
## Core Logic: How Matching Works
- Vectorization: The user's image is converted into an embedding vector ($V_{user}$) using the winning model.
- Cosine Similarity: We calculate the angle between the user's vector and every vector in our Parquet database.
- Ranking: The system returns the top 3 images with the highest similarity score (closest to 1.0).
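The three steps above reduce to a few lines of NumPy. This sketch assumes the database vectors are already loaded into a 2-D array; function and variable names are illustrative:

```python
import numpy as np

def top_matches(v_user: np.ndarray, db_vectors: np.ndarray, k: int = 3):
    """Return indices and scores of the k most similar database vectors."""
    # Normalize so the dot product equals cosine similarity.
    db_norm = db_vectors / np.linalg.norm(db_vectors, axis=1, keepdims=True)
    u_norm = v_user / np.linalg.norm(v_user)
    scores = db_norm @ u_norm              # cosine similarity, in [-1, 1]
    order = np.argsort(scores)[::-1][:k]   # highest similarity first
    return order, scores[order]

# Tiny demo: entry 4 plus slight noise should be its own best match.
rng = np.random.default_rng(1)
db = rng.normal(size=(10, 512))
user = db[4] + 0.01 * rng.normal(size=512)
idx, sims = top_matches(user, db)
```

Because both sides are L2-normalized, a score near 1.0 means a near-identical face direction in embedding space.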
## Application Demo
The final application is built with Gradio, allowing users to upload a photo or use their webcam.
## Credits
- Project Author: Matan Kriel
- Project Author: Odeya Shmuel