---
title: Face Matcher AI
emoji: π
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.0.0
python_version: '3.10'
app_file: app.py
pinned: false
---
# Celebrity Twin Matcher & AI Face Analysis
A Full-Stack Data Science Project implementing an end-to-end pipeline for facial recognition, synthetic data generation, and vector database search.
## Project Overview
This project builds an AI-powered application that takes a user's photo and finds their closest celebrity lookalike. Unlike simple matching scripts, this project implements a robust Data Engineering Pipeline:
- Hybrid Dataset: Combines real celebrity photos (LFW) with AI-generated synthetic faces (Stable Diffusion).
- Model Tournament: Scientifically selects the best embedding model based on speed and clustering quality.
- Vector Search: Builds a professional Parquet Database and uses Cosine Similarity for matching.
- Interactive App: Deployed via a Gradio UI.
## Tech Stack
- Core Logic: Python, NumPy, Pandas
- Computer Vision: DeepFace, OpenCV
- Generative AI: Stable Diffusion (Diffusers), Torch
- Machine Learning: Scikit-Learn (PCA, t-SNE, Cosine Similarity)
- Data Engineering: Parquet (Arrow), ETL Pipelines
- UI/Deployment: Gradio
## The Data Pipeline (Step-by-Step)
### 1. Synthetic Data Generation
We enhance the dataset by generating synthetic portraits and appending them to the real data.
- Engine: Stable Diffusion v1.5
- Method: We generated photorealistic portraits using the prompt "A highly detailed photorealistic portrait of a hollywood celebrity..."
- Outcome: These images are labeled as "Synthetic" in our database to visualize how they cluster compared to real faces.
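The generation step can be sketched as follows. This is a minimal illustration assuming the `diffusers` library's `StableDiffusionPipeline`; the model ID, output directory, and full prompt text are placeholders, not the project's actual configuration.

```python
# Sketch of the synthetic-face generation step (illustrative only:
# model id, prompt wording, and output path are assumptions).

PROMPT = ("A highly detailed photorealistic portrait of a hollywood "
          "celebrity, studio lighting, 85mm lens")

def generate_synthetic_faces(n_images: int, out_dir: str = "synthetic") -> None:
    """Generate portraits with Stable Diffusion v1.5 and save them to disk."""
    import os
    import torch
    from diffusers import StableDiffusionPipeline  # heavy imports kept local

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

    os.makedirs(out_dir, exist_ok=True)
    for i in range(n_images):
        image = pipe(PROMPT).images[0]
        image.save(os.path.join(out_dir, f"synthetic_{i:03d}.png"))
```

Saving each image with an index-based filename makes the later "Synthetic" labeling step in the database trivial.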
### 2. Data Acquisition & Cleaning
We aggregated data from three sources:
- LFW (Labeled Faces in the Wild): A benchmark dataset for face recognition.
- User Data: Custom uploads (e.g., test subjects).
- Synthetic Data: The AI images generated in Step 1.
The Filter: We implemented a `has_face()` function using OpenCV Haar Cascades. This ensures that every image entering the pipeline actually contains a detectable face, filtering out bad crops and blurry background shots.
> Description: A distribution plot showing the balance of images per celebrity (capped at 40 per person).
### 3. The Model "Battle" (Model Selection)
Instead of guessing which AI model to use, we ran a tournament comparing three state-of-the-art architectures:
- Facenet512
- ArcFace
- GhostFaceNet
We evaluated them on Inference Speed (seconds per image) and Cluster Quality (Silhouette Score). The winner was automatically selected to build the final database.
> Description: Bar chart comparing the inference speed of the different models.
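The tournament loop can be sketched like this. The `fake_embed` function is a stand-in for calling `DeepFace.represent` with each backbone, and picking the winner purely by silhouette score is a simplification of the speed-plus-quality trade-off described above:

```python
import time
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

def fake_embed(image, dim=512):
    """Placeholder for a real model call (e.g. DeepFace.represent)."""
    return rng.normal(size=dim)

def evaluate(embed_fn, images, labels):
    """Score one model on inference speed and cluster quality."""
    start = time.perf_counter()
    vectors = np.stack([embed_fn(img) for img in images])
    seconds_per_image = (time.perf_counter() - start) / len(images)
    quality = silhouette_score(vectors, labels)  # higher = tighter clusters
    return seconds_per_image, quality

images = [None] * 20                 # stand-in image list
labels = [i % 4 for i in range(20)]  # 4 pseudo-identities
results = {"Facenet512": evaluate(fake_embed, images, labels)}
# ...repeat evaluate() for ArcFace and GhostFaceNet, then pick a winner:
winner = max(results, key=lambda m: results[m][1])
```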
### 4. ETL Pipeline (Extract, Transform, Load)
- Extract: Loop through valid images.
- Transform: Convert images into 512-dimensional vector embeddings using `DeepFace.represent`. Normalize names and categorize each image as "Real" or "Synthetic".
- Load: Save the structured data into a Parquet file (`famous_faces.parquet`). This acts as our persistent Vector Database.
### 5. Advanced Visualization (EDA)
To prove the quality of our embeddings, we projected the 512-dimensional vectors down to 2D using PCA (Principal Component Analysis) and t-SNE.
- This visualizes how the model groups similar faces.
- It allows us to see if "Synthetic" faces form their own cluster or blend in with "Real" faces.
> Description: Scatter plots showing the vector space separation between Real and Synthetic faces.
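The 2-D projections behind those plots can be sketched with scikit-learn; random vectors stand in here for the real 512-dimensional embeddings, and the t-SNE parameters are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 512))  # placeholder for real embeddings

# Linear projection: fast, preserves global variance structure.
pca_2d = PCA(n_components=2).fit_transform(embeddings)

# Non-linear projection: better at revealing local clusters.
tsne_2d = TSNE(n_components=2, perplexity=10, init="pca",
               random_state=0).fit_transform(embeddings)
```

Coloring these 2-D points by the "Real"/"Synthetic" label is what reveals whether the synthetic faces blend in or form their own cluster.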
## Core Logic: How Matching Works
- Vectorization: The user's image is converted into an embedding vector ($V_{user}$) using the winning model.
- Cosine Similarity: We calculate the angle between the user's vector and every vector in our Parquet database.
- Ranking: The system returns the top 3 images with the highest similarity score (closest to 1.0).
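The three steps above reduce to a few lines of NumPy. This sketch assumes the database vectors are already loaded into a 2-D array; function and variable names are illustrative:

```python
import numpy as np

def top_matches(v_user: np.ndarray, db_vectors: np.ndarray, k: int = 3):
    """Return indices and scores of the k most similar database vectors."""
    # Normalize so the dot product equals cosine similarity.
    db_norm = db_vectors / np.linalg.norm(db_vectors, axis=1, keepdims=True)
    u_norm = v_user / np.linalg.norm(v_user)
    scores = db_norm @ u_norm              # cosine similarity, in [-1, 1]
    order = np.argsort(scores)[::-1][:k]   # highest similarity first
    return order, scores[order]

# Tiny demo: entry 4 plus slight noise should be its own best match.
rng = np.random.default_rng(1)
db = rng.normal(size=(10, 512))
user = db[4] + 0.01 * rng.normal(size=512)
idx, sims = top_matches(user, db)
```

Because both sides are L2-normalized, a score near 1.0 means a near-identical face direction in embedding space.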
## Application Demo
The final application is built with Gradio, allowing users to upload a photo or use their webcam.
## Credits
- Project Author: Matan Kriel
- Project Author: Odeya Shmuel