Matan Kriel committed
Commit 9e1a324 · 1 Parent(s): 7dba396

Fix binary files with LFS
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,12 +1,84 @@
  ---
- title: Face Match
- emoji: 📈
- colorFrom: red
- colorTo: red
- sdk: gradio
- sdk_version: 6.3.0
- app_file: app.py
- pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🎭 Celebrity Twin Matcher & AI Face Analysis
+
+ **A Full-Stack Data Science Project implementing an end-to-end pipeline for facial recognition, synthetic data generation, and vector database search.**
+
+ ## 📖 Project Overview
+ This project builds an AI-powered application that takes a user's photo and finds their closest celebrity lookalike. Unlike simple matching scripts, this project implements a robust **Data Engineering Pipeline**:
+ 1. **Hybrid Dataset:** Combines real celebrity photos (LFW) with **AI-generated synthetic faces** (Stable Diffusion).
+ 2. **Model Tournament:** Systematically selects the best embedding model based on inference speed and clustering quality.
+ 3. **Vector Search:** Builds a **Parquet Database** and uses Cosine Similarity for matching.
+ 4. **Interactive App:** Deployed via a Gradio UI.
+
+ ---
+
+ ## 🛠️ Tech Stack
+ * **Core Logic:** Python, NumPy, Pandas
+ * **Computer Vision:** DeepFace, OpenCV
+ * **Generative AI:** Stable Diffusion (Diffusers), Torch
+ * **Machine Learning:** Scikit-Learn (PCA, t-SNE, Cosine Similarity)
+ * **Data Engineering:** Parquet (Arrow), ETL Pipelines
+ * **UI/Deployment:** Gradio
+
+ ---
+
+ ## 🚀 The Data Pipeline (Step-by-Step)
+
+ ### 1. Synthetic Data Generation
+ To test the model's ability to distinguish between real humans and AI, we integrated a Generative AI step.
+ * **Engine:** Stable Diffusion v1.5
+ * **Method:** We generated photorealistic portraits using the prompt *"A highly detailed photorealistic portrait of a hollywood celebrity..."*
+ * **Outcome:** These images are labeled as "Synthetic" in our database to visualize how they cluster compared to real faces.
+
+ ### 2. Data Acquisition & Cleaning
+ We aggregated data from three sources:
+ 1. **LFW (Labeled Faces in the Wild):** A benchmark dataset for face recognition.
+ 2. **User Data:** Custom uploads (e.g., test subjects).
+ 3. **Synthetic Data:** The AI images generated in Step 1.
+
+ **The Filter:** We implemented a `has_face()` function using **OpenCV Haar Cascades**. This ensures that every image entering the pipeline actually contains a detectable face, removing bad crops and blurry background shots.
+
+ ![alt text](image-1.png)
+ > *Description: A distribution plot showing the balance of images per celebrity (capped at 40 per person).*
+
+ ### 3. The Model "Battle" (Model Selection)
+ Instead of guessing which AI model to use, we ran a tournament comparing three state-of-the-art architectures:
+ * **Facenet512**
+ * **ArcFace**
+ * **GhostFaceNet**
+
+ We evaluated them on **Inference Speed** (seconds per image) and **Cluster Quality** (Silhouette Score). The winner was automatically selected to build the final database.
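The Silhouette Score used for cluster quality can be illustrated with scikit-learn on toy data. The "embeddings" below are synthetic stand-ins, not the project's real benchmark:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Toy "embeddings": two well-separated identity clusters in 512-D
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(40, 512))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(40, 512))
embeddings = np.vstack([cluster_a, cluster_b])
labels = np.array([0] * 40 + [1] * 40)

# Scores range from -1 (samples in the wrong cluster) to +1 (dense, well separated)
score = silhouette_score(embeddings, labels)
print(round(float(score), 2))  # close to 1.0 for well-separated clusters
```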
+
+ ![alt text](image-2.png)
+ > *Description: Bar chart comparing the inference speed of the different models.*
+
+ ### 4. ETL Pipeline (Extract, Transform, Load)
+ * **Extract:** Loop through the validated images.
+ * **Transform:** Convert each image into a 512-dimensional vector embedding using `DeepFace.represent`. Normalize names and categorize each record as "Real" or "Synthetic".
+ * **Load:** Save the structured data into a **Parquet file** (`famous_faces.parquet`). This acts as our persistent Vector Database.
+
+ ### 5. Advanced Visualization (EDA)
+ To verify the quality of our embeddings, we projected the 512-dimensional vectors down to 2D using **PCA** (Principal Component Analysis) and **t-SNE**.
+ * This visualizes how the model groups similar faces.
+ * It lets us see whether "Synthetic" faces form their own cluster or blend in with "Real" faces.
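Both projections are available in scikit-learn; a minimal sketch on random stand-in vectors (the real pipeline would pass the 512-D face embeddings instead):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 512))  # stand-in for the real 512-D vectors

# Linear projection: fast, preserves global variance structure
pca_2d = PCA(n_components=2).fit_transform(embeddings)

# Nonlinear projection: better at revealing local clusters
tsne_2d = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(embeddings)

print(pca_2d.shape, tsne_2d.shape)  # (100, 2) (100, 2)
```

The 2-D coordinates can then be scatter-plotted, colored by the "Real"/"Synthetic" label.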
+
+ ![alt text](image-3.png)
+ > *Description: Scatter plots showing the vector-space separation between Real and Synthetic faces.*
+
  ---
+
+ ## 🧠 Core Logic: How Matching Works
+
+ 1. **Vectorization:** The user's image is converted into an embedding vector ($V_{user}$) using the winning model.
+ 2. **Cosine Similarity:** We calculate the angle between the user's vector and every vector in our Parquet database.
+ 3. **Ranking:** The system returns the top 3 matches with the highest similarity scores (closest to 1.0).
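These three steps reduce to a few lines of NumPy. The sketch below normalizes vectors so cosine similarity becomes a plain dot product; `app.py` uses scikit-learn's `cosine_similarity`, which is equivalent:

```python
import numpy as np

def top_matches(user_vec: np.ndarray, db_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k database vectors most similar to the user's."""
    # Cosine similarity = dot product of L2-normalized vectors
    u = user_vec / np.linalg.norm(user_vec)
    d = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = d @ u
    # Sort descending and keep the k best
    return np.argsort(sims)[::-1][:k]

db = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(top_matches(np.array([1.0, 0.1]), db))  # index 0 (smallest angle) ranks first
```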
+
+ ---
+
+ ## 📸 Application Demo
+ The final application is built with **Gradio**, allowing users to upload a photo or use their webcam.
+
  ---

+
+ ## 🏆 Credits
+ * **Project Author:** Matan Kriel
app.py ADDED
@@ -0,0 +1,131 @@
+ # Cell 6: Gradio App (Parquet-Based Search Engine)
+ import gradio as gr
+ import pandas as pd
+ import numpy as np
+ import glob
+ import os
+ from deepface import DeepFace
+ from PIL import Image
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ # --- 1. Load the Knowledge Base (Specific Target) ---
+ # We prioritize the specific file, but keep a fallback just in case
+ TARGET_DB = "famous_faces_GhostFaceNet.parquet"
+
+ if os.path.exists(TARGET_DB):
+     DB_PATH = TARGET_DB
+ else:
+     # Fallback: find any parquet file if the specific one is missing
+     parquet_files = glob.glob("famous_faces_*.parquet")
+     if parquet_files:
+         # Sort by modification time to get the newest one
+         parquet_files.sort(key=os.path.getmtime, reverse=True)
+         DB_PATH = parquet_files[0]
+     else:
+         DB_PATH = None
+
+ if DB_PATH:
+     print(f"📂 Loaded Knowledge Base: {DB_PATH}")
+     df_db = pd.read_parquet(DB_PATH)
+     # Convert the embedding column to a clean NumPy matrix for fast math
+     DB_VECTORS = np.stack(df_db['embedding'].values)
+
+     # Identify the model name from the filename,
+     # e.g. "famous_faces_GhostFaceNet.parquet" -> "GhostFaceNet"
+     MODEL_NAME = DB_PATH.split("_")[-1].replace(".parquet", "")
+     print(f"⚙️ Model configured: {MODEL_NAME}")
+ else:
+     print("❌ CRITICAL: No Parquet file found! Please run the Model Battle step.")
+     df_db = None  # Fix: must be defined so the UI header below doesn't crash
+     DB_VECTORS = None
+     MODEL_NAME = "Unknown"
+
+ # --- 2. Define the Search Logic ---
+ def find_best_matches(user_image):
+     # Error handling for empty inputs
+     if user_image is None:
+         return None, "No Image", None, "No Image", None, "No Image"
+     if DB_VECTORS is None:
+         return None, "System Error: No DB", None, "", None, ""
+
+     try:
+         # A. Get the user embedding
+         user_embedding_obj = DeepFace.represent(
+             img_path=user_image,
+             model_name=MODEL_NAME,
+             enforce_detection=False
+         )
+         user_vector = user_embedding_obj[0]["embedding"]
+
+         # B. Calculate cosine similarity against the whole database
+         user_vector = np.array(user_vector).reshape(1, -1)
+         similarities = cosine_similarity(user_vector, DB_VECTORS)[0]
+
+         # C. Get the top 3 indices
+         top_indices = np.argsort(similarities)[::-1][:3]
+
+         # Prepare the output list
+         output_data = []
+
+         for idx in top_indices:
+             score = similarities[idx]
+             row = df_db.iloc[idx]
+
+             # Format the display name
+             display_name = f"{row['name']}\n(Match: {int(score*100)}%)"
+
+             # Load the matched image from disk
+             try:
+                 if os.path.exists(row['image_path']):
+                     img = Image.open(row['image_path'])
+                 else:
+                     img = None  # File missing
+             except Exception:
+                 img = None
+
+             output_data.append(img)
+             output_data.append(display_name)
+
+         # Pad with empty data if we found fewer than 3 matches
+         while len(output_data) < 6:
+             output_data.append(None)
+             output_data.append("No Match")
+
+         # Return 6 items: (Img1, Lbl1, Img2, Lbl2, Img3, Lbl3)
+         return output_data[0], output_data[1], output_data[2], output_data[3], output_data[4], output_data[5]
+
+     except Exception as e:
+         return None, f"Error: {str(e)}", None, "", None, ""
+
+ # --- 3. Build Interface ---
+ with gr.Blocks(title="Famous Face Matcher") as demo:
+     gr.Markdown("# 🎭 Who is your Celebrity Twin?")
+     gr.Markdown(f"Searching **{len(df_db) if df_db is not None else 0} faces** using **{MODEL_NAME}**.")
+
+     with gr.Row():
+         with gr.Column():
+             user_input = gr.Image(sources=["upload", "webcam"], type="numpy", label="Your Photo")
+             btn = gr.Button("Find Match", variant="primary")
+
+             # Safe example loading
+             if os.path.exists("test_image.png"):
+                 gr.Examples(examples=[["test_image.png"]], inputs=user_input)
+
+         with gr.Column():
+             with gr.Row():
+                 out1_img = gr.Image(label="#1 Match", type="pil")
+                 out1_lbl = gr.Label(label="Name")
+             with gr.Row():
+                 out2_img = gr.Image(label="#2 Match", type="pil")
+                 out2_lbl = gr.Label(label="Name")
+             with gr.Row():
+                 out3_img = gr.Image(label="#3 Match", type="pil")
+                 out3_lbl = gr.Label(label="Name")
+
+     # FIX: Outputs must be a FLAT list, not a list of tuples
+     btn.click(
+         fn=find_best_matches,
+         inputs=user_input,
+         outputs=[out1_img, out1_lbl, out2_img, out2_lbl, out3_img, out3_lbl]
+     )
+
+ demo.launch(debug=True, share=True)
face_match_kriel.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
famous_faces_GhostFaceNet.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2683e1a44df3c686cd25fc73b922b46628702947e8c2ddef27083342e16c8beb
+ size 3625095
image-1.png ADDED

Git LFS Details

  • SHA256: 2e695ccfbf31263a0e1ca76250594aa4cbd90c1c61a004da931409752c17e69b
  • Pointer size: 130 Bytes
  • Size of remote file: 73.9 kB
image-2.png ADDED

Git LFS Details

  • SHA256: 405925b82141be9ddbf1d95e854259215d59f215504aaea7874bea60998ef5ac
  • Pointer size: 130 Bytes
  • Size of remote file: 32.2 kB
image-3.png ADDED

Git LFS Details

  • SHA256: 970fab07aff989c9292c3fbf4d2dbafca24c6437c0b8984067bae47bdc1a8edf
  • Pointer size: 131 Bytes
  • Size of remote file: 221 kB
image.png ADDED

Git LFS Details

  • SHA256: 2e695ccfbf31263a0e1ca76250594aa4cbd90c1c61a004da931409752c17e69b
  • Pointer size: 130 Bytes
  • Size of remote file: 73.9 kB