---
title: Face Matcher AI
emoji: 🍔
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.0.0
python_version: "3.10"
app_file: app.py
pinned: false
---

# 🎭 Celebrity Twin Matcher & AI Face Analysis

**A Full-Stack Data Science Project implementing an end-to-end pipeline for facial recognition, synthetic data generation, and vector database search.**

## 📖 Project Overview

This project builds an AI-powered application that takes a user's photo and finds their closest celebrity lookalike. Rather than a simple matching script, it implements a full **Data Engineering Pipeline**:

1. **Hybrid Dataset:** Combines real celebrity photos (LFW) with **AI-generated synthetic faces** (Stable Diffusion).
2. **Model Tournament:** Selects the best embedding model empirically, based on speed and clustering quality.
3. **Vector Search:** Stores the embeddings in a **Parquet Database** and uses Cosine Similarity for matching.
4. **Interactive App:** Deployed via a Gradio UI.

---

## 🛠️ Tech Stack

* **Core Logic:** Python, NumPy, Pandas
* **Computer Vision:** DeepFace, OpenCV
* **Generative AI:** Stable Diffusion (Diffusers), Torch
* **Machine Learning:** Scikit-Learn (PCA, t-SNE, Cosine Similarity)
* **Data Engineering:** Parquet (Arrow), ETL Pipelines
* **UI/Deployment:** Gradio

---

## 🚀 The Data Pipeline (Step-by-Step)

### 1. Synthetic Data Generation

We enhance the dataset by generating synthetic faces and appending them to the real images.

* **Engine:** Stable Diffusion v1.5
* **Method:** We generated photorealistic portraits using the prompt *"A highly detailed photorealistic portrait of a hollywood celebrity..."*
* **Outcome:** These images are labeled as "Synthetic" in our database so we can visualize how they cluster compared to real faces.

### 2. Data Acquisition & Cleaning

We aggregated data from three sources:

1. **LFW (Labeled Faces in the Wild):** A benchmark dataset for face recognition.
2. **User Data:** Custom uploads (e.g., test subjects).
3. **Synthetic Data:** The AI images generated in Step 1.

**The Filter:** We implemented a `has_face()` function using **OpenCV Haar Cascades**. It checks that every image entering the pipeline contains a detectable face, which filters out bad crops and blurry background images.

![alt text](image-1.png)
*> Description: A distribution plot showing the balance of images per celebrity (capped at 40 per person).*

### 3. The Model "Battle" (Model Selection)

Instead of guessing which AI model to use, we ran a tournament comparing three state-of-the-art architectures:

* **Facenet512**
* **ArcFace**
* **GhostFaceNet**

We evaluated them on **Inference Speed** (seconds per image) and **Cluster Quality** (Silhouette Score). The winner was automatically selected to build the final database.

![alt text](image-2.png)
*> Description: Bar chart comparing the inference speed of the different models.*

### 4. ETL Pipeline (Extract, Transform, Load)

* **Extract:** Loop through the images that passed the face filter.
* **Transform:** Convert each image into a 512-dimensional embedding vector using `DeepFace.represent`. Normalize names and categorize each entry as "Real" or "Synthetic".
* **Load:** Save the structured data into a **Parquet file** (`famous_faces.parquet`). This acts as our persistent Vector Database.

### 5. Advanced Visualization (EDA)

To assess the quality of our embeddings, we projected the 512-dimensional vectors down to 2D using **PCA** (Principal Component Analysis) and **t-SNE**.

* This visualizes how the model groups similar faces.
* It lets us see whether "Synthetic" faces form their own cluster or blend in with "Real" faces.

![alt text](image-3.png)
*> Description: Scatter plots showing the vector space separation between Real and Synthetic faces.*

---

## 🧠 Core Logic: How Matching Works

1. **Vectorization:** The user's image is converted into an embedding vector ($V_{user}$) using the winning model.
2. **Cosine Similarity:** We compute the cosine of the angle between the user's vector and every vector $V_i$ in our Parquet database: $\text{sim}(V_{user}, V_i) = \frac{V_{user} \cdot V_i}{\|V_{user}\| \, \|V_i\|}$.
3. **Ranking:** The system returns the top 3 images with the highest similarity scores (closest to 1.0).

---

## 📸 Application Demo

The final application is built with **Gradio**, allowing users to upload a photo or use their webcam.

---

## 🏆 Credits

* **Project Author:** Matan Kriel
* **Project Author:** Odeya Shmuel
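## 🧪 Appendix: Matching Sketch

The three matching steps described under Core Logic (vectorization, cosine similarity, ranking) can be illustrated with a minimal sketch. This is not the project's actual code: the function names `cosine_similarity` and `top_matches`, the in-memory `toy_db` list, and the 3-dimensional vectors are all assumptions made for illustration. It uses only the standard library so it runs anywhere; the real pipeline would presumably load 512-dimensional embeddings from `famous_faces.parquet` and use NumPy or Scikit-Learn for the same computation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def top_matches(user_vec, database, k=3):
    """Return the k (name, score) pairs most similar to user_vec.

    `database` is a hypothetical list of (name, embedding) pairs standing in
    for the rows of the Parquet file.
    """
    scored = [(name, cosine_similarity(user_vec, emb)) for name, emb in database]
    # Highest similarity first (scores closest to 1.0 are the best matches).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional embeddings; real embeddings would be 512-dimensional.
toy_db = [
    ("Celebrity A", [1.0, 0.0, 0.0]),
    ("Celebrity B", [0.0, 1.0, 0.0]),
    ("Celebrity C", [1.0, 1.0, 0.0]),
    ("Celebrity D", [-1.0, 0.0, 0.0]),
]

matches = top_matches([2.0, 0.1, 0.0], toy_db)
for name, score in matches:
    print(f"{name}: {score:.3f}")
```

Because cosine similarity divides by both vector lengths, only the direction of an embedding matters: the query `[2.0, 0.1, 0.0]` ranks "Celebrity A" first even though its magnitude differs from the stored vector.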