File size: 4,882 Bytes
6b39516
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
# AI Face Search Engine ๐Ÿ”

A full-stack AI application that performs facial recognition and similarity search on large datasets. It uses **Vector Embeddings** to find matches in milliseconds, scalable to millions of images.

## ๐Ÿš€ Project Overview

This system allows users to upload a photo of a person and instantly find other photos of that same person from a database. Unlike simple filename matching, this uses **Deep Learning** to generate 128-dimensional vector embeddings for every face, allowing it to recognize people even with different lighting, poses, or expressions.

### Key Features
* **Vector Search:** Uses **Qdrant** (Vector Database) for high-performance similarity search.
* **Face Embeddings:** powered by `face_recognition` (dlib) to generate state-of-the-art face encodings.
* **Scalable Architecture:** Decoupled Microservices architecture (Ingestion, API, Frontend).
* **Smart Ingestion:** ETL pipeline that processes raw images, handles deduplication, and indexes metadata.
* **Interactive UI:** A clean frontend built with **Streamlit** for real-time interaction.

---

## ๐Ÿ› ๏ธ Tech Stack

| Component | Technology | Why this choice? |
| :--- | :--- | :--- |
| **Database** | **Qdrant** | Production-grade Vector DB (Rust-based), faster and more scalable than storing numpy arrays. |
| **Backend** | **FastAPI** | High-performance async API to serve model inference and static files. |
| **Frontend** | **Streamlit** | Rapid prototyping UI for data apps; handles file uploads and state management seamlessly. |
| **ML Model** | **dlib (HOG/CNN)** | Industry standard for face detection and encoding (99.38% accuracy on LFW benchmark). |
| **Dataset** | **CelebA** | Large-scale dataset (200k+ images) used to demonstrate "Wild" face recognition capabilities. |

---

## ๐Ÿ“‚ Project Structure

```text

โ”œโ”€โ”€ stored_images/       # The "S3 Bucket" (Local storage for indexed images)

โ”œโ”€โ”€ qdrant_db/           # Local vector database files (Auto-generated)

โ”œโ”€โ”€ raw_data/            # Staging area for raw datasets (e.g., CelebA)

โ”œโ”€โ”€ ingestion.py         # ETL pipeline: Detects faces -> Embeds -> Upserts to Qdrant

โ”œโ”€โ”€ api.py               # FastAPI Backend: Handles search requests & file serving

โ”œโ”€โ”€ frontend_streamlit.py# Streamlit Frontend: User Interface

โ”œโ”€โ”€ prepare_data.py      # Utility: Samples and cleans the raw CelebA dataset

โ””โ”€โ”€ requirements.txt     # Python dependencies



โšก Getting Started

1. Prerequisites

Python 3.9+



Visual Studio C++ Build Tools (Windows only, required for dlib)



2. Installation

Clone the repository and install dependencies:



Bash



git clone [https://github.com/yourusername/face-search-engine.git](https://github.com/yourusername/face-search-engine.git)

cd face-search-engine



# Create virtual environment

python -m venv venv

# Activate: source venv/bin/activate (Mac/Linux) or venv\Scripts\activate (Windows)



pip install -r requirements.txt

3. Data Setup (The ETL Pipeline)

This project uses the CelebA dataset.



Download img_align_celeba.zip and identity_CelebA.txt from Kaggle.



Extract them into a raw_data/ folder.



Run the preparation script to sample and clean the data:



Bash



python prepare_data.py

Run the ingestion script to generate embeddings and populate the Vector DB:



Bash



python ingestion.py

4. Running the Application

You need two terminal windows running simultaneously.



Terminal 1: The API Backend



Bash



uvicorn api:app --reload --port 8000

Runs on http://localhost:8000



Terminal 2: The Frontend



Bash



streamlit run frontend_streamlit.py

Opens in browser at http://localhost:8501



๐Ÿ“ธ Demo

Open the Streamlit UI.



Upload a photo of a celebrity (e.g., from Google Images).



The system converts the face to a vector, queries Qdrant, and returns the top 5 closest matches from the dataset.



๐Ÿง  How It Works (Under the Hood)

Preprocessing: When an image is ingested, the system uses HOG (Histogram of Oriented Gradients) to locate the face.



Encoding: The face pixels are mapped to a 128-dimensional vector space where similar faces are close together.



Indexing: These vectors are stored in Qdrant with a payload containing the filename and metadata.



Search: When you upload a query image, it is converted to a vector. We calculate the Euclidean Distance between the query vector and all vectors in the DB.



Ranking: Results with a distance score < 0.6 (threshold) are returned as matches.



๐Ÿ”ฎ Future Improvements

Dockerization: Containerize the API and Frontend for easy deployment.



Cloud Storage: Replace local stored_images with AWS S3.



GPU Acceleration: Enable CNN-based face detection for higher accuracy.



Author

Waseem Ashraf Khan, AI/ML Engineer, https://www.linkedin.com/in/waseem-khan-data-scientist/