| title: Face Matcher AI | |
| emoji: π | |
| colorFrom: green | |
| colorTo: yellow | |
| sdk: gradio | |
| sdk_version: 5.0.0 | |
| python_version: "3.10" | |
| app_file: app.py | |
| pinned: false | |
| # Celebrity Twin Matcher & AI Face Analysis | |
| **A Full-Stack Data Science Project implementing an end-to-end pipeline for facial recognition, synthetic data generation, and vector database search.** | |
| ## Project Overview | |
| This project builds an AI-powered application that takes a user's photo and finds their closest celebrity lookalike. Unlike simple matching scripts, this project implements a robust **Data Engineering Pipeline**: | |
| 1. **Hybrid Dataset:** Combines real celebrity photos (LFW) with **AI-generated synthetic faces** (Stable Diffusion). | |
| 2. **Model Tournament:** Selects the embedding model by measured inference speed and clustering quality (Silhouette Score). | |
| 3. **Vector Search:** Stores the embeddings in a **Parquet Database** and uses Cosine Similarity for matching. | |
| 4. **Interactive App:** Deployed via a Gradio UI. | |
| --- | |
| ## Tech Stack | |
| * **Core Logic:** Python, NumPy, Pandas | |
| * **Computer Vision:** DeepFace, OpenCV | |
| * **Generative AI:** Stable Diffusion (Diffusers), Torch | |
| * **Machine Learning:** Scikit-Learn (PCA, t-SNE, Cosine Similarity) | |
| * **Data Engineering:** Parquet (Arrow), ETL Pipelines | |
| * **UI/Deployment:** Gradio | |
| --- | |
| ## The Data Pipeline (Step-by-Step) | |
| ### 1. Synthetic Data Generation | |
| We enhance the dataset by generating synthetic faces and appending them to the real images. | |
| * **Engine:** Stable Diffusion v1.5 | |
| * **Method:** We generated photorealistic portraits using the prompt *"A highly detailed photorealistic portrait of a hollywood celebrity..."* | |
| * **Outcome:** These images are labeled as "Synthetic" in our database to visualize how they cluster compared to real faces. | |
| ### 2. Data Acquisition & Cleaning | |
| We aggregated data from three sources: | |
| 1. **LFW (Labeled Faces in the Wild):** A benchmark dataset for face recognition. | |
| 2. **User Data:** Custom uploads (e.g., test subjects). | |
| 3. **Synthetic Data:** The AI images generated in Step 1. | |
| **The Filter:** We implemented a `has_face()` function using **OpenCV Haar Cascades**. This ensures that every image entering our pipeline actually contains a readable face, removing bad crops or blurry background data. | |
|  | |
| *> Description: A distribution plot showing the balance of images per celebrity (capped at 40 per person).* | |
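The cap of 40 images per person mentioned in the description can be sketched as below. This is a minimal illustration, not the project's actual code: the `(name, path)` tuple layout and the function name `cap_per_person` are assumptions made for the example.

```python
from collections import defaultdict

MAX_PER_PERSON = 40  # cap stated in the plot description


def cap_per_person(samples, limit=MAX_PER_PERSON):
    """Keep at most `limit` (name, path) pairs per name, preserving input order."""
    counts = defaultdict(int)
    kept = []
    for name, path in samples:
        if counts[name] < limit:
            counts[name] += 1
            kept.append((name, path))
    return kept


# Hypothetical input: 55 images of one person, 1 image of another.
samples = [("Person A", f"a_{i}.jpg") for i in range(55)] + [("Person B", "b_0.jpg")]
balanced = cap_per_person(samples)
```

Capping keeps heavily photographed identities from dominating the database.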
| ### 3. The Model "Battle" (Model Selection) | |
| Instead of guessing which AI model to use, we ran a tournament comparing three state-of-the-art architectures: | |
| * **Facenet512** | |
| * **ArcFace** | |
| * **GhostFaceNet** | |
| We evaluated them on **Inference Speed** (seconds per image) and **Cluster Quality** (Silhouette Score). The winner was automatically selected to build the final database. | |
|  | |
| *> Description: Bar chart comparing the inference speed of the different models.* | |
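A tournament of this kind could look roughly like the sketch below. The README does not say how speed and Silhouette Score are combined, so the ranking rule here (highest Silhouette Score wins, faster model breaks ties) is an assumption. The model names and numbers are hypothetical placeholders, not measured results.

```python
import time


def seconds_per_image(embed, images):
    """Average wall-clock seconds that `embed` takes per image."""
    start = time.perf_counter()
    for image in images:
        embed(image)
    return (time.perf_counter() - start) / len(images)


def pick_winner(results):
    """Return the model name with the best score.

    Assumed rule: highest silhouette first, lower seconds-per-image breaks ties.
    """
    return max(
        results,
        key=lambda name: (results[name]["silhouette"], -results[name]["sec_per_image"]),
    )


# Placeholder values for illustration only.
results = {
    "model_a": {"silhouette": 0.30, "sec_per_image": 0.9},
    "model_b": {"silhouette": 0.30, "sec_per_image": 0.4},
    "model_c": {"silhouette": 0.10, "sec_per_image": 0.1},
}
winner = pick_winner(results)
```

In a real run, `embed` would wrap the embedding call for each candidate model, and the Silhouette Score would be computed on the resulting vectors with the celebrity names as cluster labels.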
| ### 4. ETL Pipeline (Extract, Transform, Load) | |
| * **Extract:** Loop through valid images. | |
| * **Transform:** Convert images into 512-dimensional vector embeddings using `DeepFace.represent`. Normalize names and categorize as "Real" or "Synthetic". | |
| * **Load:** Save the structured data into a **Parquet file** (`famous_faces.parquet`). This acts as our persistent Vector Database. | |
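The Transform and Load steps can be sketched as follows. The folder layout (`data/<label>/<file>`), the column names, and the helper names are assumptions for illustration. The embedding is assumed to have been produced already, for example by `DeepFace.represent`.

```python
from pathlib import Path


def normalize_name(raw):
    """Turn a folder-style label such as 'Jane_Doe' into 'Jane Doe'."""
    return " ".join(raw.replace("_", " ").split())


def build_record(image_path, embedding, synthetic_marker="synthetic"):
    """Build one database row from an image path and its embedding vector."""
    path = Path(image_path)
    label = path.parent.name  # assumes images live in one folder per identity
    is_synthetic = synthetic_marker in label.lower()
    return {
        "name": normalize_name(label),
        "category": "Synthetic" if is_synthetic else "Real",
        "path": str(path),
        "embedding": [float(x) for x in embedding],
    }


def save_parquet(records, out_path="famous_faces.parquet"):
    """Write the rows to a Parquet file (needs pandas plus an engine such as pyarrow)."""
    import pandas as pd

    pd.DataFrame(records).to_parquet(out_path, index=False)


record = build_record("data/Jane_Doe/img_001.jpg", [0.1, 0.2, 0.3])
```

Storing the embedding as a list column keeps each face on a single row, so the whole database loads back with one `pandas.read_parquet` call.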
| ### 5. Advanced Visualization (EDA) | |
| To inspect the quality of our embeddings, we projected the 512-dimensional vectors down to 2D using **PCA** (Principal Component Analysis) and **t-SNE**. | |
| * This visualizes how the model groups similar faces. | |
| * It allows us to see if "Synthetic" faces form their own cluster or blend in with "Real" faces. | |
|  | |
| *> Description: Scatter plots showing the vector space separation between Real and Synthetic faces.* | |
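The PCA half of this projection can be sketched in plain NumPy. This is a stand-in for scikit-learn's `PCA`, which the tech stack lists, and it runs on random vectors rather than real embeddings; the function name `pca_2d` is hypothetical.

```python
import numpy as np


def pca_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(embeddings, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 512))  # random stand-ins for face embeddings
points = pca_2d(vectors)  # shape (20, 2), ready for a scatter plot
```

Coloring the resulting points by the "Real" or "Synthetic" category shows whether the two groups separate or overlap.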
| --- | |
| ## Core Logic: How Matching Works | |
| 1. **Vectorization:** The user's image is converted into an embedding vector ($V_{user}$) using the winning model. | |
| 2. **Cosine Similarity:** We compute the cosine of the angle between the user's vector and every vector in our Parquet database. | |
| 3. **Ranking:** The system returns the top 3 images with the highest similarity score (closest to 1.0). | |
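For a database vector $V_i$ (a symbol introduced here for illustration), the score is $\cos\theta_i = \frac{V_{user} \cdot V_i}{\lVert V_{user} \rVert \, \lVert V_i \rVert}$. The three steps above can be sketched as below, assuming the database embeddings have been loaded into a 2D NumPy array. The function name and the toy 2D vectors are hypothetical.

```python
import numpy as np


def top_matches(user_vec, db_vecs, names, k=3):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    user = np.asarray(user_vec, dtype=float)
    db = np.asarray(db_vecs, dtype=float)
    # Cosine similarity is the dot product of L2-normalized vectors.
    user = user / np.linalg.norm(user)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    scores = db @ user
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(names[i], float(scores[i])) for i in order]


# Toy 2D vectors standing in for 512-dimensional embeddings.
db = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
names = ["a", "b", "c", "d"]
matches = top_matches([2.0, 0.0], db, names)
```

Normalizing both sides first means the ranking depends only on direction, so the length of the vectors does not affect the result.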
| --- | |
| ## Application Demo | |
| The final application is built with **Gradio**, allowing users to upload a photo or use their webcam. | |
| --- | |
| ## Credits | |
| * **Project Author:** Matan Kriel | |
| * **Project Author:** Odeya Shmuel |