metadata
title: Face Matcher AI
emoji: 🔍
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.0.0
python_version: '3.10'
app_file: app.py
pinned: false

🎭 Celebrity Twin Matcher & AI Face Analysis

A Full-Stack Data Science Project implementing an end-to-end pipeline for facial recognition, synthetic data generation, and vector database search.

📖 Project Overview

This project builds an AI-powered application that takes a user's photo and finds their closest celebrity lookalike. Unlike simple matching scripts, this project implements a robust Data Engineering Pipeline:

  1. Hybrid Dataset: Combines real celebrity photos (LFW) with AI-generated synthetic faces (Stable Diffusion).
  2. Model Tournament: Selects the embedding model by measured inference speed and clustering quality.
  3. Vector Search: Stores the embeddings in a Parquet file and uses Cosine Similarity for matching.
  4. Interactive App: Deployed via a Gradio UI.

🛠️ Tech Stack

  • Core Logic: Python, NumPy, Pandas
  • Computer Vision: DeepFace, OpenCV
  • Generative AI: Stable Diffusion (Diffusers), Torch
  • Machine Learning: Scikit-Learn (PCA, t-SNE, Cosine Similarity)
  • Data Engineering: Parquet (Arrow), ETL Pipelines
  • UI/Deployment: Gradio

🚀 The Data Pipeline (Step-by-Step)

1. Synthetic Data Generation

We enhance our dataset by generating synthetic faces and appending them to the real images.

  • Engine: Stable Diffusion v1.5
  • Method: We generated photorealistic portraits using the prompt "A highly detailed photorealistic portrait of a hollywood celebrity..."
  • Outcome: These images are labeled as "Synthetic" in our database to visualize how they cluster compared to real faces.

2. Data Acquisition & Cleaning

We aggregated data from three sources:

  1. LFW (Labeled Faces in the Wild): A benchmark dataset for face recognition.
  2. User Data: Custom uploads (e.g., test subjects).
  3. Synthetic Data: The AI images generated in Step 1.

The Filter: We implemented a has_face() function using OpenCV Haar Cascades. This ensures that every image entering our pipeline actually contains a readable face, removing bad crops or blurry background data.

Figure: Distribution plot showing the balance of images per celebrity (capped at 40 per person).

3. The Model "Battle" (Model Selection)

Instead of guessing which AI model to use, we ran a tournament comparing three state-of-the-art architectures:

  • Facenet512
  • ArcFace
  • GhostFaceNet

We evaluated them on Inference Speed (seconds per image) and Cluster Quality (Silhouette Score). The winner was automatically selected to build the final database.
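The evaluation could look roughly like the sketch below. It is written against a generic embed_fn callable so it stays independent of any one model; in the real pipeline that callable would presumably wrap DeepFace.represent with the relevant model name. The function names and the selection rule (highest silhouette, ties broken by speed) are assumptions, since the README does not state how the two metrics were combined.

```python
import time

import numpy as np
from sklearn.metrics import silhouette_score


def score_model(embed_fn, images, labels):
    """Time embed_fn over the images and score how well embeddings cluster by label."""
    start = time.perf_counter()
    embeddings = np.array([embed_fn(img) for img in images])
    seconds_per_image = (time.perf_counter() - start) / len(images)
    # Cosine distance matches the similarity measure used later for matching.
    quality = silhouette_score(embeddings, labels, metric="cosine")
    return {"seconds_per_image": seconds_per_image, "silhouette": quality}


def pick_winner(results):
    """Hypothetical rule: highest silhouette score, ties broken by faster inference."""
    return max(
        results,
        key=lambda name: (
            results[name]["silhouette"],
            -results[name]["seconds_per_image"],
        ),
    )
```

Here results is a dict mapping a model name such as "ArcFace" to the output of score_model for that model, and labels holds the person identity for each image.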

Figure: Bar chart comparing the inference speed of the different models.

4. ETL Pipeline (Extract, Transform, Load)

  • Extract: Loop through valid images.
  • Transform: Convert images into 512-dimensional vector embeddings using DeepFace.represent. Normalize names and categorize as "Real" or "Synthetic".
  • Load: Save the structured data into a Parquet file (famous_faces.parquet). This acts as our persistent Vector Database.
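The Transform and Load steps might be sketched as follows, assuming the embeddings have already been computed. The column names, the name-normalization rule (underscores to spaces, title case) and the helper names are illustrative assumptions, not the project's actual schema; writing Parquet with pandas also requires pyarrow or fastparquet to be installed.

```python
import pandas as pd


def build_records(items):
    """Build the table from (name, category, embedding) tuples.

    category is expected to be "Real" or "Synthetic".
    """
    rows = []
    for name, category, embedding in items:
        rows.append(
            {
                # Assumed normalization: "George_W_Bush" -> "George W Bush".
                "name": name.strip().replace("_", " ").title(),
                "category": category,
                "embedding": [float(x) for x in embedding],
            }
        )
    return pd.DataFrame(rows, columns=["name", "category", "embedding"])


def save_database(df, path="famous_faces.parquet"):
    """Persist the table as a Parquet file."""
    df.to_parquet(path, index=False)
```

Storing each embedding as a list column keeps one row per image, so the file can be reloaded with pd.read_parquet and stacked into a matrix for similarity search.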

5. Advanced Visualization (EDA)

To inspect the quality of our embeddings, we projected the 512-dimensional vectors down to 2D using PCA (Principal Component Analysis) and t-SNE.

  • This visualizes how the model groups similar faces.
  • It allows us to see if "Synthetic" faces form their own cluster or blend in with "Real" faces.
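A short sketch of the two projections using scikit-learn. The function name, the random seed and the perplexity cap are assumptions chosen for illustration; the project's actual settings are not given in this README.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def project_2d(embeddings, random_state=42):
    """Return (pca_2d, tsne_2d) projections of an (n_samples, n_dims) matrix."""
    X = np.asarray(embeddings, dtype=float)
    pca_2d = PCA(n_components=2, random_state=random_state).fit_transform(X)
    # t-SNE requires perplexity to be smaller than the number of samples.
    perplexity = min(30, len(X) - 1)
    tsne_2d = TSNE(
        n_components=2,
        perplexity=perplexity,
        init="pca",
        random_state=random_state,
    ).fit_transform(X)
    return pca_2d, tsne_2d
```

Each returned array has one 2D point per image, which can then be scatter-plotted and colored by the "Real" or "Synthetic" category.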

Figure: Scatter plots showing the vector space separation between Real and Synthetic faces.


🧠 Core Logic: How Matching Works

  1. Vectorization: The user's image is converted into an embedding vector ($V_{user}$) using the winning model.
  2. Cosine Similarity: We compute the cosine similarity (the cosine of the angle) between the user's vector and every vector in our Parquet database.
  3. Ranking: The system returns the top 3 images with the highest similarity score (closest to 1.0).
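The ranking step can be sketched with plain NumPy as below. The function name and the argument layout (a matrix of database vectors with a parallel list of names) are assumptions made for the example rather than the app's actual code.

```python
import numpy as np


def top_matches(user_vec, db_vecs, names, k=3):
    """Return the k (name, similarity) pairs most similar to user_vec."""
    user = np.asarray(user_vec, dtype=float)
    db = np.asarray(db_vecs, dtype=float)
    # Normalize to unit length so the dot product equals cosine similarity.
    user = user / np.linalg.norm(user)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = db @ user
    # Sort descending and keep the k best matches.
    order = np.argsort(sims)[::-1][:k]
    return [(names[i], float(sims[i])) for i in order]
```

Because every vector is normalized first, the scores fall in the range -1 to 1, and a score near 1.0 means the two faces point in almost the same direction in embedding space.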

📸 Application Demo

The final application is built with Gradio, allowing users to upload a photo or use their webcam.


πŸ† Credits

  • Project Author: Matan Kriel
  • Project Author: Odeya Shmuel