Matan Kriel committed
Commit 9e1a324 · 1 Parent(s): 7dba396

Fix binary files with LFS
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,12 +1,84 @@
  ---
- title: Face Match
- emoji: 📈
- colorFrom: red
- colorTo: red
- sdk: gradio
- sdk_version: 6.3.0
- app_file: app.py
- pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🎭 Celebrity Twin Matcher & AI Face Analysis
+
+ **A Full-Stack Data Science Project implementing an end-to-end pipeline for facial recognition, synthetic data generation, and vector database search.**
+
+ ## 📖 Project Overview
+ This project builds an AI-powered application that takes a user's photo and finds their closest celebrity lookalike. Unlike simple matching scripts, this project implements a robust **Data Engineering Pipeline**:
+ 1. **Hybrid Dataset:** Combines real celebrity photos (LFW) with **AI-generated synthetic faces** (Stable Diffusion).
+ 2. **Model Tournament:** Systematically selects the best embedding model based on inference speed and clustering quality.
+ 3. **Vector Search:** Builds a **Parquet Database** and uses Cosine Similarity for matching.
+ 4. **Interactive App:** Deployed via a Gradio UI.
+
+ ---
+
+ ## 🛠️ Tech Stack
+ * **Core Logic:** Python, NumPy, Pandas
+ * **Computer Vision:** DeepFace, OpenCV
+ * **Generative AI:** Stable Diffusion (Diffusers), Torch
+ * **Machine Learning:** Scikit-Learn (PCA, t-SNE, Cosine Similarity)
+ * **Data Engineering:** Parquet (Arrow), ETL Pipelines
+ * **UI/Deployment:** Gradio
+
+ ---
+
+ ## 🚀 The Data Pipeline (Step-by-Step)
+
+ ### 1. Synthetic Data Generation
+ To test the model's ability to distinguish between real humans and AI, we integrated a Generative AI step.
+ * **Engine:** Stable Diffusion v1.5
+ * **Method:** We generated photorealistic portraits using the prompt *"A highly detailed photorealistic portrait of a hollywood celebrity..."*
+ * **Outcome:** These images are labeled as "Synthetic" in our database to visualize how they cluster compared to real faces.
+
+ ### 2. Data Acquisition & Cleaning
+ We aggregated data from three sources:
+ 1. **LFW (Labeled Faces in the Wild):** A benchmark dataset for face recognition.
+ 2. **User Data:** Custom uploads (e.g., test subjects).
+ 3. **Synthetic Data:** The AI images generated in Step 1.
+
+ **The Filter:** We implemented a `has_face()` function using **OpenCV Haar Cascades**. This ensures that every image entering the pipeline actually contains a detectable face, removing bad crops and blurry background shots.
+
+ ![alt text](image-1.png)
+ > *Description: A distribution plot showing the balance of images per celebrity (capped at 40 per person).*
+
+ ### 3. The Model "Battle" (Model Selection)
+ Instead of guessing which AI model to use, we ran a tournament comparing three state-of-the-art architectures:
+ * **Facenet512**
+ * **ArcFace**
+ * **GhostFaceNet**
+
+ We evaluated them on **Inference Speed** (seconds per image) and **Cluster Quality** (Silhouette Score). The winner was automatically selected to build the final database.
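The Silhouette Score used for cluster quality can be illustrated with scikit-learn on toy data. The "embeddings" below are synthetic stand-ins, not the project's real benchmark:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Toy "embeddings": two well-separated identity clusters in 512-D
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(40, 512))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(40, 512))
embeddings = np.vstack([cluster_a, cluster_b])
labels = np.array([0] * 40 + [1] * 40)

# Scores range from -1 (samples in the wrong cluster) to +1 (dense, well separated)
score = silhouette_score(embeddings, labels)
print(round(float(score), 2))  # close to 1.0 for well-separated clusters
```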
+
+ ![alt text](image-2.png)
+ > *Description: Bar chart comparing the inference speed of the different models.*
+
+ ### 4. ETL Pipeline (Extract, Transform, Load)
+ * **Extract:** Loop through the validated images.
+ * **Transform:** Convert each image into a 512-dimensional vector embedding using `DeepFace.represent`. Normalize names and categorize each record as "Real" or "Synthetic".
+ * **Load:** Save the structured data into a **Parquet file** (`famous_faces.parquet`). This acts as our persistent Vector Database.
+
+ ### 5. Advanced Visualization (EDA)
+ To verify the quality of our embeddings, we projected the 512-dimensional vectors down to 2D using **PCA** (Principal Component Analysis) and **t-SNE**.
+ * This visualizes how the model groups similar faces.
+ * It lets us see whether "Synthetic" faces form their own cluster or blend in with "Real" faces.
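Both projections are available in scikit-learn; a minimal sketch on random stand-in vectors (the real pipeline would pass the 512-D face embeddings instead):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 512))  # stand-in for the real 512-D vectors

# Linear projection: fast, preserves global variance structure
pca_2d = PCA(n_components=2).fit_transform(embeddings)

# Nonlinear projection: better at revealing local clusters
tsne_2d = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(embeddings)

print(pca_2d.shape, tsne_2d.shape)  # (100, 2) (100, 2)
```

The 2-D coordinates can then be scatter-plotted, colored by the "Real"/"Synthetic" label.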
+
+ ![alt text](image-3.png)
+ > *Description: Scatter plots showing the vector-space separation between Real and Synthetic faces.*
+
  ---
+
+ ## 🧠 Core Logic: How Matching Works
+
+ 1. **Vectorization:** The user's image is converted into an embedding vector ($V_{user}$) using the winning model.
+ 2. **Cosine Similarity:** We calculate the angle between the user's vector and every vector in our Parquet database.
+ 3. **Ranking:** The system returns the top 3 matches with the highest similarity scores (closest to 1.0).
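These three steps reduce to a few lines of NumPy. The sketch below normalizes vectors so cosine similarity becomes a plain dot product; `app.py` uses scikit-learn's `cosine_similarity`, which is equivalent:

```python
import numpy as np

def top_matches(user_vec: np.ndarray, db_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k database vectors most similar to the user's."""
    # Cosine similarity = dot product of L2-normalized vectors
    u = user_vec / np.linalg.norm(user_vec)
    d = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = d @ u
    # Sort descending and keep the k best
    return np.argsort(sims)[::-1][:k]

db = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(top_matches(np.array([1.0, 0.1]), db))  # index 0 (smallest angle) ranks first
```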
+
+ ---
+
+ ## 📸 Application Demo
+ The final application is built with **Gradio**, allowing users to upload a photo or use their webcam.
+
  ---

+
+ ## 🏆 Credits
+ * **Project Author:** Matan Kriel
app.py ADDED
@@ -0,0 +1,131 @@
+ # Cell 6: Gradio App (Parquet-Based Search Engine)
+ import gradio as gr
+ import pandas as pd
+ import numpy as np
+ import glob
+ import os
+ from deepface import DeepFace
+ from PIL import Image
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ # --- 1. Load the Knowledge Base (Specific Target) ---
+ # We prioritize the specific file, but keep a fallback just in case
+ TARGET_DB = "famous_faces_GhostFaceNet.parquet"
+
+ if os.path.exists(TARGET_DB):
+     DB_PATH = TARGET_DB
+ else:
+     # Fallback: find any parquet file if the specific one is missing
+     parquet_files = glob.glob("famous_faces_*.parquet")
+     if parquet_files:
+         # Sort by modification time to get the newest one
+         parquet_files.sort(key=os.path.getmtime, reverse=True)
+         DB_PATH = parquet_files[0]
+     else:
+         DB_PATH = None
+
+ if DB_PATH:
+     print(f"📂 Loaded Knowledge Base: {DB_PATH}")
+     df_db = pd.read_parquet(DB_PATH)
+     # Convert the embedding column to a clean NumPy matrix for fast math
+     DB_VECTORS = np.stack(df_db['embedding'].values)
+
+     # Identify the model name from the filename,
+     # e.g. "famous_faces_GhostFaceNet.parquet" -> "GhostFaceNet"
+     MODEL_NAME = DB_PATH.split("_")[-1].replace(".parquet", "")
+     print(f"⚙️ Model configured: {MODEL_NAME}")
+ else:
+     print("❌ CRITICAL: No Parquet file found! Please run the Model Battle step.")
+     df_db = None  # Fix: must be defined so the UI header below doesn't crash
+     DB_VECTORS = None
+     MODEL_NAME = "Unknown"
+
+ # --- 2. Define the Search Logic ---
+ def find_best_matches(user_image):
+     # Error handling for empty inputs
+     if user_image is None:
+         return None, "No Image", None, "No Image", None, "No Image"
+     if DB_VECTORS is None:
+         return None, "System Error: No DB", None, "", None, ""
+
+     try:
+         # A. Get the user embedding
+         user_embedding_obj = DeepFace.represent(
+             img_path=user_image,
+             model_name=MODEL_NAME,
+             enforce_detection=False
+         )
+         user_vector = user_embedding_obj[0]["embedding"]
+
+         # B. Calculate cosine similarity against the whole database
+         user_vector = np.array(user_vector).reshape(1, -1)
+         similarities = cosine_similarity(user_vector, DB_VECTORS)[0]
+
+         # C. Get the top 3 indices
+         top_indices = np.argsort(similarities)[::-1][:3]
+
+         # Prepare the output list
+         output_data = []
+
+         for idx in top_indices:
+             score = similarities[idx]
+             row = df_db.iloc[idx]
+
+             # Format the display name
+             display_name = f"{row['name']}\n(Match: {int(score*100)}%)"
+
+             # Load the matched image from disk
+             try:
+                 if os.path.exists(row['image_path']):
+                     img = Image.open(row['image_path'])
+                 else:
+                     img = None  # File missing
+             except Exception:
+                 img = None
+
+             output_data.append(img)
+             output_data.append(display_name)
+
+         # Pad with empty data if we found fewer than 3 matches
+         while len(output_data) < 6:
+             output_data.append(None)
+             output_data.append("No Match")
+
+         # Return 6 items: (Img1, Lbl1, Img2, Lbl2, Img3, Lbl3)
+         return output_data[0], output_data[1], output_data[2], output_data[3], output_data[4], output_data[5]
+
+     except Exception as e:
+         return None, f"Error: {str(e)}", None, "", None, ""
+
+ # --- 3. Build Interface ---
+ with gr.Blocks(title="Famous Face Matcher") as demo:
+     gr.Markdown("# 🎭 Who is your Celebrity Twin?")
+     gr.Markdown(f"Searching **{len(df_db) if df_db is not None else 0} faces** using **{MODEL_NAME}**.")
+
+     with gr.Row():
+         with gr.Column():
+             user_input = gr.Image(sources=["upload", "webcam"], type="numpy", label="Your Photo")
+             btn = gr.Button("Find Match", variant="primary")
+
+             # Safe example loading
+             if os.path.exists("test_image.png"):
+                 gr.Examples(examples=[["test_image.png"]], inputs=user_input)
+
+         with gr.Column():
+             with gr.Row():
+                 out1_img = gr.Image(label="#1 Match", type="pil")
+                 out1_lbl = gr.Label(label="Name")
+             with gr.Row():
+                 out2_img = gr.Image(label="#2 Match", type="pil")
+                 out2_lbl = gr.Label(label="Name")
+             with gr.Row():
+                 out3_img = gr.Image(label="#3 Match", type="pil")
+                 out3_lbl = gr.Label(label="Name")
+
+     # FIX: Outputs must be a FLAT list, not a list of tuples
+     btn.click(
+         fn=find_best_matches,
+         inputs=user_input,
+         outputs=[out1_img, out1_lbl, out2_img, out2_lbl, out3_img, out3_lbl]
+     )
+
+ demo.launch(debug=True, share=True)
face_match_kriel.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
famous_faces_GhostFaceNet.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2683e1a44df3c686cd25fc73b922b46628702947e8c2ddef27083342e16c8beb
+ size 3625095
image-1.png ADDED

Git LFS Details

  • SHA256: 2e695ccfbf31263a0e1ca76250594aa4cbd90c1c61a004da931409752c17e69b
  • Pointer size: 130 Bytes
  • Size of remote file: 73.9 kB
image-2.png ADDED

Git LFS Details

  • SHA256: 405925b82141be9ddbf1d95e854259215d59f215504aaea7874bea60998ef5ac
  • Pointer size: 130 Bytes
  • Size of remote file: 32.2 kB
image-3.png ADDED

Git LFS Details

  • SHA256: 970fab07aff989c9292c3fbf4d2dbafca24c6437c0b8984067bae47bdc1a8edf
  • Pointer size: 131 Bytes
  • Size of remote file: 221 kB
image.png ADDED

Git LFS Details

  • SHA256: 2e695ccfbf31263a0e1ca76250594aa4cbd90c1c61a004da931409752c17e69b
  • Pointer size: 130 Bytes
  • Size of remote file: 73.9 kB