---
title: CineMatch API
emoji: 🎬
colorFrom: purple
colorTo: pink
sdk: docker
app_port: 7860
---
# 🎬 CineMatch API
**CineMatch** is a content-based movie recommendation engine. It combines semantic search, vector embeddings, and personalization to recommend movies that match a user's query, chosen tags, or liked titles.
## 📋 Table of Contents
- [Features](#features)
- [Architecture](#architecture)
- [Tech Stack](#tech-stack)
- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
- [Running the Server](#running-the-server)
- [API Endpoints](#api-endpoints)
- [Examples](#examples)
- [Project Structure](#project-structure)
- [How It Works](#how-it-works)
- [Performance Considerations](#performance-considerations)
- [Deployment](#deployment)
- [Troubleshooting](#troubleshooting)
---
## ✨ Features
### 1. **Semantic Search** 🔍
Search for movies using natural language queries. The system converts your text into a vector embedding and finds semantically similar movies.
- Example: *"A romantic movie about a sinking ship"* → Returns *Titanic*
### 2. **Vibe-Based Recommendations** 🎯
Search by combining tags (genres, themes) and descriptions for more refined results.
- Example: Tags: `["Sci-Fi", "Action"]`, Description: `"Robots fighting in space"` → Returns relevant matches
### 3. **Personalized Recommendations** 👤
Provide a list of movies you've liked, and the system averages their vectors to create a personalized profile, then recommends similar movies.
- Example: Liked: `["The Matrix", "Inception"]` → Get similar mind-bending films
### 4. **Content-Based Similarity** 🔗
Find movies similar to a specific title already in the database.
- Example: Similar to *"Inception"* → Returns *"Interstellar"*, *"The Matrix"*, etc.
### 5. **Rich Movie Metadata** 📊
Each movie includes:
- Director information
- Top 4 cast members
- Keywords (e.g., "time travel", "dystopia")
- Genres
- Plot overview
- TMDB rating (vote average)
### 6. **Incremental Learning** 📈
Add new movies to the system without retraining; new vectors are appended to the index and are searchable right away.
---
## 🏗️ Architecture
```
┌─────────────────┐
│  User Request   │
└────────┬────────┘
         │
    ┌────▼─────────────────────┐
    │  FastAPI Server          │
    │  (Endpoint Handler)      │
    └────┬─────────────────────┘
         │
    ┌────▼──────────────────────────┐
    │  MovieRecommender Engine      │
    │  (FAISS + Vector Search)      │
    └────┬──────────────────────────┘
         │
    ┌────▼─────────────────────┐
    │  Embedding Model         │
    │  (SentenceTransformers)  │
    └────┬─────────────────────┘
         │
    ┌────▼─────────────────────┐
    │  FAISS Index             │
    │  (movie_index.faiss)     │
    └────┬─────────────────────┘
         │
    ┌────▼─────────────────────┐
    │  Movie Metadata          │
    │  (metadata.pkl)          │
    └──────────────────────────┘
```
---
## 🛠️ Tech Stack
| Component | Technology | Purpose |
|-----------|-----------|---------|
| **Backend Framework** | FastAPI | High-performance async API |
| **Vector Search** | FAISS | Fast similarity search on embeddings |
| **Embeddings** | SentenceTransformers (MiniLM-L6-v2) | Convert text to 384-dim vectors |
| **Data Source** | TMDB API | Movie metadata (titles, cast, genres, etc.) |
| **Data Processing** | Pandas, NumPy | Data cleaning & preprocessing |
| **Deployment** | Docker | Containerized deployment |
| **Python Version** | 3.9+ | Modern async/await support |
---
## 📦 Installation
### Prerequisites
- Python 3.9 or higher
- TMDB API Key (free, get it at [themoviedb.org](https://www.themoviedb.org/settings/api))
- ~2GB free disk space (for models and indices)
### Step 1: Clone & Setup
```bash
# Navigate to project directory
cd CineMatch
# Create virtual environment
python -m venv .venv
# Activate virtual environment
# On Windows:
.\.venv\Scripts\Activate.ps1
# On macOS/Linux:
source .venv/bin/activate
```
### Step 2: Install Dependencies
```bash
pip install -r requirements.txt
```
### Step 3: Configure Environment
Create a `.env` file in the project root:
```
TMDB_API_KEY=your_api_key_here
```
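How the application reads this file is not shown here; a common choice is the `python-dotenv` package. As a minimal standard-library sketch (the helper names `parse_env` and `load_api_key` are hypothetical and not part of the project), the key could be loaded and validated like this:

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_api_key(path=".env"):
    """Return TMDB_API_KEY from the environment, falling back to a .env file."""
    file_values = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            file_values = parse_env(fh.read())
    # A real environment variable takes precedence over the file.
    key = os.environ.get("TMDB_API_KEY") or file_values.get("TMDB_API_KEY", "")
    if not key or key == "your_api_key_here":
        raise RuntimeError("TMDB_API_KEY is missing or still set to the placeholder")
    return key
```

Calling `load_api_key()` at startup fails early with a clear message instead of surfacing later as a rejected TMDB request.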
---
## ⚙️ Configuration
### Environment Variables
| Variable | Description | Example |
|----------|-------------|---------|
| `TMDB_API_KEY` | Your TMDB API key | `abc123xyz...` |
### Model Configuration
The default embedding model is **`all-MiniLM-L6-v2`** from SentenceTransformers:
- **Embedding Dimension**: 384
- **Speed**: Very fast (optimized for CPU)
- **Quality**: High for semantic similarity
- **Memory**: ~80MB
To use a different model, change the model name in the `MovieRecommender.__init__()` method in [recommender.py](src/recommender.py#L6).
---
## 🚀 Usage
### Running the Server
```bash
# Make sure your virtual environment is activated
python app.py
```
The server will start at `http://localhost:8000`.
**API Documentation** (auto-generated by FastAPI):
- Swagger: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`
### Data Ingestion
Before using the API, you need to populate the FAISS index with movies:
```bash
python src/ingest.py
```
This will:
1. Fetch ~50 high-quality movies from TMDB (rating ≥ 7.0, votes ≥ 500)
2. Extract director, cast, and keywords for each movie
3. Generate embeddings
4. Save to `models/movie_index.faiss` and `models/metadata.pkl`
To reset and rebuild the index:
```python
# In src/ingest.py, modify the last line:
ingest_high_quality_movies(target_count=100, reset=True) # reset=True to rebuild
```
---
## 📡 API Endpoints
### 1. **Health Check**
```
GET /
```
**Response:**
```json
{
  "status": "online and active!!!",
  "model_loaded": true
}
```
---
### 2. **Semantic Search** 🔍
```
POST /search
```
**Request:**
```json
{
  "query": "A romantic movie about a sinking ship",
  "k": 5
}
```
**Response:**
```json
[
  {
    "movie_id": 597,
    "title": "Titanic",
    "score": 0.856
  },
  {
    "movie_id": 285,
    "title": "The Poseidon Adventure",
    "score": 0.743
  }
]
```
---
### 3. **Vibe-Based Search** 🎯
```
POST /recommend/vibe
```
**Request:**
```json
{
  "tags": ["Sci-Fi", "Action", "Space"],
  "description": "Robots fighting in space with stunning visuals",
  "k": 10
}
```
**Response:**
```json
{
  "interpreted_query": "Sci-Fi Action Space Sci-Fi Action Space Robots fighting in space with stunning visuals",
  "results": [
    {
      "movie_id": 58,
      "title": "The Fifth Element",
      "score": 0.912
    }
  ]
}
```
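The `interpreted_query` in the example response repeats the tag list twice before the description, which suggests the tags are duplicated to give them more weight in the embedding. The sketch below reproduces that string under this assumption; it is inferred from the example only, and `build_vibe_query` and `tag_weight` are hypothetical names, not the project's actual code.

```python
def build_vibe_query(tags, description, tag_weight=2):
    """Join the tags (repeated tag_weight times) with the free-text description."""
    tag_text = " ".join(tags)
    parts = [tag_text] * tag_weight if tag_text else []
    if description:
        parts.append(description.strip())
    return " ".join(parts)


query = build_vibe_query(
    ["Sci-Fi", "Action", "Space"],
    "Robots fighting in space with stunning visuals",
)
print(query)
```

With the request shown above, this prints the same text as the `interpreted_query` field in the example response.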
---
### 4. **Personalized Recommendations** 👤
```
POST /recommend/user
```
**Request:**
```json
{
  "liked_movies": ["The Matrix", "Inception", "Interstellar"],
  "k": 5
}
```
**Response:**
```json
[
  {
    "movie_id": 27205,
    "title": "Oblivion",
    "score": 0.834
  },
  {
    "movie_id": 284054,
    "title": "Doctor Strange",
    "score": 0.798
  }
]
```
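The Features section describes this endpoint as averaging the vectors of the liked movies into a single profile. A minimal pure-Python sketch of that step follows, using toy 3-dimensional vectors in place of real 384-dimensional embeddings. The function names are hypothetical, and how the service handles titles that are not in the index is not specified here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_profile(liked_vectors):
    """Average the liked movies' vectors, then re-normalize the mean."""
    if not liked_vectors:
        raise ValueError("at least one liked movie is required")
    count = len(liked_vectors)
    dim = len(liked_vectors[0])
    mean = [sum(v[i] for v in liked_vectors) / count for i in range(dim)]
    return normalize(mean)


# Toy vectors standing in for the embeddings of two liked movies.
liked = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
profile = build_profile(liked)
print([round(x, 3) for x in profile])  # [0.707, 0.707, 0.0]
```

Re-normalizing the mean keeps the profile at unit length, so its inner product with stored unit vectors can still be read as cosine similarity.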
---
### 5. **Similar Movies** 🔗
```
GET /recommend/movie/{title}
```
**Example:**
```
GET /recommend/movie/Inception
```
**Response:**
```json
[
  {
    "movie_id": 38372,
    "title": "Interstellar",
    "score": 0.891
  },
  {
    "movie_id": 603,
    "title": "The Matrix",
    "score": 0.867
  }
]
```
---
### 6. **Admin: Trigger Background Update** 🔄
```
POST /admin/trigger-update
```
**Response:**
```json
{
  "message": "Update process started in background. Check server logs for progress."
}
```
This endpoint triggers background ingestion without blocking the API.
---
## 📝 Examples
### Example 1: Find Movies Similar to Your Favorite
```python
import requests

BASE_URL = "http://localhost:8000"

# Get movies similar to "The Matrix"
response = requests.get(f"{BASE_URL}/recommend/movie/The Matrix")
response.raise_for_status()  # Stop here if the server returned an error
recommendations = response.json()
for movie in recommendations:
    print(f"{movie['title']} (Score: {movie['score']:.2f})")
```
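In Example 1, `requests` typically percent-encodes the space in the title for you, but characters such as `?` or `#` inside a title would be read as URL syntax. A small sketch of building the path explicitly with the standard library follows; `similar_movies_url` is a hypothetical helper, not part of the API.

```python
from urllib.parse import quote


def similar_movies_url(base_url, title):
    """Build the GET /recommend/movie/{title} URL with the title percent-encoded."""
    return f"{base_url.rstrip('/')}/recommend/movie/{quote(title, safe='')}"


print(similar_movies_url("http://localhost:8000", "The Matrix"))
# http://localhost:8000/recommend/movie/The%20Matrix
```

Titles that contain `/` may still fail to match the route on some servers, because the path is often decoded before routing.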
### Example 2: Semantic Search with Natural Language
```python
import requests

BASE_URL = "http://localhost:8000"

response = requests.post(
    f"{BASE_URL}/search",
    json={
        "query": "A thrilling space adventure with amazing visuals",
        "k": 5,
    },
)
for movie in response.json():
    print(f"✓ {movie['title']}")
```
### Example 3: Personalized Recommendations Based on History
```python
import requests

BASE_URL = "http://localhost:8000"

response = requests.post(
    f"{BASE_URL}/recommend/user",
    json={
        "liked_movies": ["Dune", "Blade Runner 2049", "Arrival"],
        "k": 10,
    },
)
for movie in response.json():
    print(f"★ {movie['title']}")
```
---
## 📂 Project Structure
```
CineMatch/
├── app.py                  # Main FastAPI application
├── main.py                 # (Optional) Alternative entry point
├── Dockerfile              # Docker configuration
├── requirements.txt        # Python dependencies
├── .env                    # API keys (create this)
│
├── src/
│   ├── __init__.py
│   ├── recommender.py      # Core FAISS-based recommendation engine
│   ├── ingest.py           # TMDB data ingestion pipeline
│   └── preprocessing.py    # Data cleaning & feature engineering
│
├── models/
│   ├── movie_index.faiss   # FAISS index (generated after ingestion)
│   └── metadata.pkl        # Movie metadata dataframe (generated)
│
├── eda/
│   └── Untitled.ipynb      # Exploratory data analysis notebook
│
└── README.md               # This file
```
---
## 🧠 How It Works
### The Embedding Pipeline
```
Raw Text Input (Movie Title + Metadata)
    ↓
[SentenceTransformers]
    ↓
384-Dimensional Vector
    ↓
[L2 Normalization]
    ↓
Normalized Vector (Unit Length)
    ↓
[FAISS IndexFlatIP]
    ↓
Stored in Index
```
### Recommendation Flow
1. **User provides query** (text, tags, or movie titles)
2. **Convert to vector** using SentenceTransformers
3. **Normalize vector** (for cosine similarity)
4. **FAISS search** finds K nearest neighbors in index
5. **Return results** with similarity scores
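
Steps 3 to 5 can be illustrated without FAISS installed. The sketch below is a pure-Python stand-in for an exhaustive inner-product index over unit vectors, which is the behavior the pipeline above attributes to `IndexFlatIP`. The class `TinyFlatIndex` and the 2-dimensional toy vectors are illustrative assumptions, not the project's code.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class TinyFlatIndex:
    """Exhaustive inner-product search over stored unit vectors."""

    def __init__(self):
        self.titles = []
        self.vectors = []

    def add(self, title, vec):
        # Adding is a plain append: nothing is retrained or rebuilt.
        self.titles.append(title)
        self.vectors.append(normalize(vec))

    def search(self, query_vec, k=5):
        q = normalize(query_vec)
        scored = [
            (sum(a * b for a, b in zip(q, v)), title)
            for title, v in zip(self.titles, self.vectors)
        ]
        scored.sort(reverse=True)  # Highest similarity first
        return [(title, round(score, 3)) for score, title in scored[:k]]


index = TinyFlatIndex()
index.add("Movie A", [1.0, 0.0])
index.add("Movie B", [0.0, 1.0])
index.add("Movie C", [1.0, 1.0])
print(index.search([2.0, 0.1], k=2))
```

The query points almost along the first axis, so "Movie A" ranks first and "Movie C" second. Every search compares the query with all stored vectors, which is why a flat index scales linearly with catalog size.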
### Why This Approach?
- **Fast**: FAISS is optimized for billion-scale vector search
- **Accurate**: Semantic embeddings capture meaning, not just keywords
- **Scalable**: Can handle millions of movies
- **CPU-Friendly**: MiniLM model is tiny but effective
- **Incremental**: Add movies without retraining
---
## ⚡ Performance Considerations
### Indexing Speed
- **MiniLM Model**: ~100-200 movies/second on modern CPU
- **FAISS Indexing**: Instant for additions
- **Memory**: ~1.5 KB per movie embedding (384 float32 values × 4 bytes each)
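
As a quick sanity check on index memory, the arithmetic can be written out. This sketch assumes vectors are stored as float32 (4 bytes per dimension), which is how FAISS flat indexes store them; it counts raw vectors only, not the metadata file. The function name is hypothetical.

```python
def index_size_bytes(num_movies, dim=384, bytes_per_value=4):
    """Raw vector storage for a flat float32 index (metadata not included)."""
    return num_movies * dim * bytes_per_value


for n in (50, 10_000, 1_000_000):
    print(f"{n:>9,} movies -> {index_size_bytes(n) / 1024**2:,.2f} MiB")
```

One embedding takes 384 × 4 = 1,536 bytes, so a million movies need roughly 1.5 GB of vector storage.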
### Search Speed
- **Single Query**: 1-5ms
- **Batch Queries**: Total time grows linearly with the number of queries; with a flat index, each query scans all n indexed vectors, O(n)
- **Max Practical Size**: 10+ million movies
### Optimization Tips
1. **Use Batch Processing**: Send multiple queries at once
2. **Tune k Parameter**: Lower k = faster results (typically k=5-10 is good)
3. **CPU**: The MiniLM model leverages BLAS libraries for speed
4. **GPU**: Optional; can speed up embedding generation by roughly 10x
---
## 🐳 Deployment
### Docker Build & Run
```bash
# Build image
docker build -t cinematch:latest .
# Run container
docker run -p 8000:8000 \
  -e TMDB_API_KEY=your_key \
  cinematch:latest
```
### Production Deployment
The project includes a `Dockerfile` configured for production use:
- **Base Image**: Python 3.9+
- **Port**: 8000 (configurable)
- **Entry**: `python app.py`
For production, consider:
- Using **Gunicorn** or **Uvicorn** with multiple workers
- Adding **Nginx** reverse proxy
- Implementing **authentication** (API keys)
- Using **cloud storage** for models (S3, GCS)
---
## 🐛 Troubleshooting
### Issue: "No model found" Error
**Solution**: Run data ingestion first:
```bash
python src/ingest.py
```
### Issue: TMDB API Key Invalid
**Solution**: Verify your `.env` file:
```bash
cat .env # Check the key is there
```
### Issue: Out of Memory
**Solution**: Reduce batch size in [recommender.py](src/recommender.py#L18):
```python
batch_size = 32 # Lower from 64
```
### Issue: Slow Embedding Generation
**Solution**:
- The MiniLM model is already optimized for CPU
- For GPU support, install PyTorch with CUDA:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
### Issue: CORS Errors
**Solution**: Already handled in [app.py](app.py#L15). The API allows all origins (`allow_origins=["*"]`). For production, restrict this:
```python
allow_origins=["https://yourdomain.com"]
```
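One way to avoid hard-coding the origin list is to read it from an environment variable. The sketch below assumes a hypothetical `ALLOWED_ORIGINS` variable holding a comma-separated list; neither the variable nor the helper exists in the project as described here. The returned list would be passed as `allow_origins` to the CORS middleware.

```python
import os


def parse_allowed_origins(raw):
    """Split a comma-separated origin list; fall back to '*' when unset."""
    origins = [item.strip() for item in raw.split(",") if item.strip()]
    return origins or ["*"]


# ALLOWED_ORIGINS is a hypothetical variable name used for illustration.
allowed = parse_allowed_origins(os.environ.get("ALLOWED_ORIGINS", ""))
print(allowed)
```

The `"*"` fallback mirrors the current permissive default; a production deployment would more likely refuse to start when the variable is missing.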
---
## 📊 Dataset Information
**Movie Source**: The Movie Database (TMDB) API
**Filtering Criteria**:
- Minimum Rating: 7.0 / 10.0
- Minimum Vote Count: 500 votes
- Sorted by: Popularity (descending)
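
The criteria above can be expressed as a small filter. This sketch applies them client-side to a list of TMDB-style records using the TMDB field names `vote_average`, `vote_count`, and `popularity`. Whether `ingest.py` filters this way or through TMDB query parameters is not stated here, and the sample records are made up for illustration.

```python
def select_high_quality(movies, min_rating=7.0, min_votes=500):
    """Keep well-rated, well-voted movies, most popular first."""
    kept = [
        m for m in movies
        if m.get("vote_average", 0) >= min_rating
        and m.get("vote_count", 0) >= min_votes
    ]
    return sorted(kept, key=lambda m: m.get("popularity", 0), reverse=True)


# Made-up records, not real TMDB data.
sample = [
    {"title": "A", "vote_average": 8.1, "vote_count": 1200, "popularity": 40.0},
    {"title": "B", "vote_average": 6.9, "vote_count": 9000, "popularity": 95.0},
    {"title": "C", "vote_average": 7.4, "vote_count": 300, "popularity": 60.0},
    {"title": "D", "vote_average": 7.0, "vote_count": 500, "popularity": 55.0},
]
print([m["title"] for m in select_high_quality(sample)])  # ['D', 'A']
```

"B" is dropped for its rating and "C" for its vote count; the two survivors are ordered by popularity.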
**Metadata Included**:
- Title
- Director
- Cast (top 4 actors)
- Keywords
- Genres
- Overview / Plot
- Vote Average
---
## 🔮 Future Enhancements
- [ ] User authentication & API key management
- [ ] Collaborative filtering (user-user similarity)
- [ ] Real-time model updates with webhooks
- [ ] Advanced filtering (year, rating, runtime)
- [ ] Movie rating & feedback loop for model improvement
- [ ] Multi-language support
- [ ] Mobile app integration
---
## 📄 License
This project is open source. Feel free to modify and extend it!
---
## 💬 Support
For issues, questions, or contributions:
1. Check the [Troubleshooting](#troubleshooting) section
2. Review the [API Documentation](http://localhost:8000/docs)
3. Examine the source code in `src/` directory
---
**Enjoy discovering your next favorite movie! 🍿🎬**