
	---
	title: Face Matcher AI
	emoji: 🍔
	colorFrom: green
	colorTo: yellow
	sdk: gradio
	sdk_version: 5.0.0
	python_version: "3.10"
	app_file: app.py
	pinned: false
	---

	# 🎭 Celebrity Twin Matcher & AI Face Analysis

	A Full-Stack Data Science Project implementing an end-to-end pipeline for facial recognition, synthetic data generation, and vector database search.

	## 📖 Project Overview
	This project builds an AI-powered application that takes a user's photo and finds their closest celebrity lookalike. Unlike simple matching scripts, this project implements a robust Data Engineering Pipeline:
	1. Hybrid Dataset: Combines real celebrity photos (LFW) with AI-generated synthetic faces (Stable Diffusion).
	2. Model Tournament: Selects the embedding model by benchmarking inference speed and clustering quality.
	3. Vector Search: Stores the embeddings in a Parquet file and matches faces by cosine similarity.
	4. Interactive App: Deployed via a Gradio UI.

	---

	## 🛠️ Tech Stack
	* Core Logic: Python, NumPy, Pandas
	* Computer Vision: DeepFace, OpenCV
	* Generative AI: Stable Diffusion (Diffusers), Torch
	* Machine Learning: Scikit-Learn (PCA, t-SNE, Cosine Similarity)
	* Data Engineering: Parquet (Arrow), ETL Pipelines
	* UI/Deployment: Gradio

	---

	## 🚀 The Data Pipeline (Step-by-Step)

	### 1. Synthetic Data Generation
	We enhance the dataset by generating synthetic faces and appending them to the real images.
	* Engine: Stable Diffusion v1.5
	* Method: We generated photorealistic portraits using the prompt "A highly detailed photorealistic portrait of a hollywood celebrity..."
	* Outcome: These images are labeled as "Synthetic" in our database to visualize how they cluster compared to real faces.

	### 2. Data Acquisition & Cleaning
	We aggregated data from three sources:
	1. LFW (Labeled Faces in the Wild): A benchmark dataset for face recognition.
	2. User Data: Custom uploads (e.g., test subjects).
	3. Synthetic Data: The AI images generated in Step 1.

	The Filter: We implemented a `has_face()` function using OpenCV Haar Cascades. This ensures that every image entering our pipeline actually contains a readable face, removing bad crops or blurry background data.

	![Distribution of images per celebrity](image-1.png)
	> Description: A distribution plot showing the balance of images per celebrity (capped at 40 per person).

	### 3. The Model "Battle" (Model Selection)
	Instead of guessing which AI model to use, we ran a tournament comparing three state-of-the-art architectures:
	* Facenet512
	* ArcFace
	* GhostFaceNet

	We evaluated them on Inference Speed (seconds per image) and Cluster Quality (Silhouette Score). The winner was automatically selected to build the final database.
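
The tournament idea can be sketched as follows. This is an illustration under assumptions: `benchmark` and `pick_winner` are hypothetical helpers, the silhouette score uses cosine distance, and the winner rule (best silhouette, speed as tie-breaker) is a guess, since the README does not state how the two metrics are combined. Stand-in embedding functions replace the real models so the snippet runs without DeepFace.

```python
import time

import numpy as np
from sklearn.metrics import silhouette_score


def benchmark(embed_fn, images, labels):
    """Time one embedding function and score how well identities cluster."""
    start = time.perf_counter()
    vectors = np.array([embed_fn(img) for img in images])
    seconds_per_image = (time.perf_counter() - start) / len(images)
    quality = silhouette_score(vectors, labels, metric="cosine")
    return {"seconds_per_image": seconds_per_image, "silhouette": quality}


def pick_winner(results):
    """Assumed rule: highest silhouette wins, the faster model breaks ties."""
    return max(
        results,
        key=lambda name: (
            results[name]["silhouette"],
            -results[name]["seconds_per_image"],
        ),
    )


# Toy data: two "identities", five noisy samples each.
rng = np.random.default_rng(0)
centers = np.eye(8)[:2]
images = [centers[i // 5] + 0.05 * rng.normal(size=8) for i in range(10)]
labels = [i // 5 for i in range(10)]

results = {
    "good": benchmark(lambda x: x, images, labels),                  # keeps structure
    "noisy": benchmark(lambda x: rng.normal(size=8), images, labels),  # ignores input
}
winner = pick_winner(results)
```

In the real pipeline each `embed_fn` would wrap a call to `DeepFace.represent` with the corresponding `model_name`.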

	![Inference speed per model](image-2.png)
	> Description: Bar chart comparing the inference speed of the different models.

	### 4. ETL Pipeline (Extract, Transform, Load)
	* Extract: Loop through valid images.
	* Transform: Convert images into 512-dimensional vector embeddings using `DeepFace.represent`. Normalize names and categorize as "Real" or "Synthetic".
	* Load: Save the structured data into a Parquet file (`famous_faces.parquet`). This acts as our persistent Vector Database.
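
A rough sketch of the three steps, under stated assumptions: images are stored as `<root>/<Person_Name>/<file>`, synthetic images live in a folder whose name starts with `synthetic`, and the column names are invented for illustration. `build_table` and `save_table` are hypothetical helpers, and a stand-in embedder is used so the snippet runs without DeepFace.

```python
from pathlib import Path

import pandas as pd


def build_table(image_paths, embed_fn):
    """Extract + Transform: one row per image with name, category and embedding."""
    rows = []
    for path in map(Path, image_paths):
        folder = path.parent.name  # assumed layout: <root>/<Person_Name>/<file>
        is_synthetic = folder.lower().startswith("synthetic")
        rows.append(
            {
                "name": folder.replace("_", " ").strip().title(),
                "category": "Synthetic" if is_synthetic else "Real",
                "image_path": str(path),
                "embedding": [float(x) for x in embed_fn(str(path))],
            }
        )
    return pd.DataFrame(rows, columns=["name", "category", "image_path", "embedding"])


def save_table(df, out_path="famous_faces.parquet"):
    """Load: persist the table (requires pyarrow or fastparquet)."""
    df.to_parquet(out_path, index=False)


# Stand-in embedder. With DeepFace installed, one could pass instead:
#   lambda p: DeepFace.represent(img_path=p, model_name=WINNER)[0]["embedding"]
# where WINNER is the model name chosen in the tournament.
fake_embed = lambda p: [0.0] * 512

table = build_table(
    ["data/George_W_Bush/0001.jpg", "data/synthetic_01/0001.png"], fake_embed
)
```

Calling `save_table(table)` would then write `famous_faces.parquet`, which `pd.read_parquet` can load back at app start-up.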

	### 5. Advanced Visualization (EDA)
	To check the quality of our embeddings, we projected the 512-dimensional vectors down to 2D using PCA (Principal Component Analysis) and t-SNE.
	* This visualizes how the model groups similar faces.
	* It allows us to see if "Synthetic" faces form their own cluster or blend in with "Real" faces.
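
The projection step could look roughly like this with scikit-learn. `project_2d` is a hypothetical helper, random vectors stand in for real embeddings, and the perplexity and seed are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def project_2d(embeddings, perplexity=30.0, seed=0):
    """Return (pca_xy, tsne_xy), two arrays of shape (n_samples, 2)."""
    X = np.asarray(embeddings, dtype=float)
    pca_xy = PCA(n_components=2, random_state=seed).fit_transform(X)
    # t-SNE requires perplexity < n_samples, so shrink it for small datasets.
    perplexity = min(perplexity, (len(X) - 1) / 3)
    tsne_xy = TSNE(
        n_components=2, perplexity=perplexity, init="pca", random_state=seed
    ).fit_transform(X)
    return pca_xy, tsne_xy


# Random stand-ins for 60 real 512-dimensional face embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(60, 512))
pca_xy, tsne_xy = project_2d(fake_embeddings)
```

Each 2D array can be passed to a scatter plot, coloured by the "Real" / "Synthetic" category.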

	![Real vs. Synthetic faces in embedding space](image-3.png)
	> Description: Scatter plots showing the vector space separation between Real and Synthetic faces.

	---

	## 🧠 Core Logic: How Matching Works

	1. Vectorization: The user's image is converted into an embedding vector ($V_{user}$) using the winning model.
	2. Cosine Similarity: We compute the cosine of the angle between the user's vector and every vector in our Parquet database. A score of 1.0 means the two vectors point in the same direction.
	3. Ranking: The system returns the top 3 images with the highest similarity score (closest to 1.0).
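
For a database vector $V_i$, the score is $\cos\theta = \frac{V_{user} \cdot V_i}{\|V_{user}\| \, \|V_i\|}$. The ranking steps above can be sketched in NumPy as follows; `top_k_matches` is a hypothetical helper, and toy 3-dimensional vectors stand in for the real 512-dimensional embeddings.

```python
import numpy as np


def top_k_matches(user_vec, db_vecs, names, k=3):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    user = np.asarray(user_vec, dtype=float)
    db = np.asarray(db_vecs, dtype=float)
    # Normalise to unit length so the dot product equals cos(theta).
    user = user / np.linalg.norm(user)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    scores = db @ user
    order = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return [(names[i], float(scores[i])) for i in order]


db = np.array(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
)
names = ["A", "B", "C", "D"]
matches = top_k_matches([2.0, 0.0, 0.0], db, names)
```

Because both sides are normalised first, the length of the user's vector does not affect the ranking, only its direction does.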

	---

	## 📸 Application Demo
	The final application is built with Gradio, allowing users to upload a photo or use their webcam.
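
A minimal sketch of how such an interface could be wired up, not the actual `app.py`. `to_gallery`, `build_app`, and the `find_matches` callback (assumed to return `(image_path, name, score)` triples) are hypothetical names introduced for illustration.

```python
def to_gallery(matches):
    """Turn (image_path, name, score) triples into (image, caption) pairs."""
    return [(path, f"{name} ({score:.2f})") for path, name, score in matches]


def build_app(find_matches):
    """Build a Gradio interface around a hypothetical find_matches(image_path)."""
    import gradio as gr  # imported lazily so to_gallery works without Gradio

    return gr.Interface(
        fn=lambda image_path: to_gallery(find_matches(image_path)),
        # Accept either an uploaded file or a webcam capture.
        inputs=gr.Image(type="filepath", sources=["upload", "webcam"], label="Your photo"),
        outputs=gr.Gallery(label="Top matches", columns=3),
        title="Face Matcher AI",
    )


# Usage, assuming my_find_matches is defined elsewhere:
#   build_app(my_find_matches).launch()
```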

	---


	## 🏆 Credits
	* Project Author: Matan Kriel
	* Project Author: Odeya Shmuel