--- title: CineMatch API emoji: ๐ŸŽฌ colorFrom: purple colorTo: pink sdk: docker app_port: 7860 --- # ๐ŸŽฌ CineMatch API **CineMatch** is an intelligent, content-based movie recommendation engine powered by cutting-edge AI. It combines semantic search, vector embeddings, and personalization to deliver highly accurate movie recommendations tailored to user preferences. ## ๐Ÿ“‹ Table of Contents - [Features](#features) - [Architecture](#architecture) - [Tech Stack](#tech-stack) - [Installation](#installation) - [Configuration](#configuration) - [Usage](#usage) - [Running the Server](#running-the-server) - [API Endpoints](#api-endpoints) - [Examples](#examples) - [Project Structure](#project-structure) - [How It Works](#how-it-works) - [Performance Considerations](#performance-considerations) - [Deployment](#deployment) - [Troubleshooting](#troubleshooting) --- ## โœจ Features ### 1. **Semantic Search** ๐Ÿ” Search for movies using natural language queries. The system converts your text into a vector embedding and finds semantically similar movies. - Example: *"A romantic movie about a sinking ship"* โ†’ Returns *Titanic* ### 2. **Vibe-Based Recommendations** ๐ŸŽฏ Search by combining tags (genres, themes) and descriptions for more refined results. - Example: Tags: `["Sci-Fi", "Action"]`, Description: `"Robots fighting in space"` โ†’ Returns relevant matches ### 3. **Personalized Recommendations** ๐Ÿ‘ค Provide a list of movies you've liked, and the system averages their vectors to create a personalized profile, then recommends similar movies. - Example: Liked: `["The Matrix", "Inception"]` โ†’ Get similar mind-bending films ### 4. **Content-Based Similarity** ๐Ÿ”— Find movies similar to a specific title already in the database. - Example: Similar to *"Inception"* โ†’ Returns *"Interstellar"*, *"The Matrix"*, etc. ### 5. **Rich Movie Metadata** ๐Ÿ“Š Each movie includes: - Director information - Top 4 cast members - Keywords (e.g., "time travel", "dystopia") - Genres - Plot overview - IMDB ratings ### 6. **Incremental Learning** ๐Ÿ“ˆ Add new movies to the system without retrainingโ€”updates are instant! --- ## ๐Ÿ—๏ธ Architecture ``` โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ User Request โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ FastAPI Server โ”‚ โ”‚ (Endpoint Handler) โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ MovieRecommender Engine โ”‚ โ”‚ (FAISS + Vector Search) โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Embedding Model โ”‚ โ”‚ (SentenceTransformers) โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ FAISS Index โ”‚ โ”‚ (movie_index.faiss) โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Movie Metadata โ”‚ โ”‚ (metadata.pkl) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ``` --- ## ๐Ÿ› ๏ธ Tech Stack | Component | Technology | Purpose | |-----------|-----------|---------| | **Backend Framework** | FastAPI | High-performance async API | | **Vector Search** | FAISS | Fast similarity search on embeddings | | **Embeddings** | SentenceTransformers (MiniLM-L6-v2) | Convert text to 384-dim vectors | | **Data Source** | TMDB API | Movie metadata (titles, cast, genres, etc.) | | **Data Processing** | Pandas, NumPy | Data cleaning & preprocessing | | **Deployment** | Docker | Containerized deployment | | **Python Version** | 3.9+ | Modern async/await support | --- ## ๐Ÿ“ฆ Installation ### Prerequisites - Python 3.9 or higher - TMDB API Key (free, get it at [themoviedb.org](https://www.themoviedb.org/settings/api)) - ~2GB free disk space (for models and indices) ### Step 1: Clone & Setup ```bash # Navigate to project directory cd CineMatch # Create virtual environment python -m venv .venv # Activate virtual environment # On Windows: .\.venv\Scripts\Activate.ps1 # On macOS/Linux: source .venv/bin/activate ``` ### Step 2: Install Dependencies ```bash pip install -r requirements.txt ``` ### Step 3: Configure Environment Create a `.env` file in the project root: ``` TMDB_API_KEY=your_api_key_here ``` --- ## โš™๏ธ Configuration ### Environment Variables | Variable | Description | Example | |----------|-------------|---------| | `TMDB_API_KEY` | Your TMDB API key | `abc123xyz...` | ### Model Configuration The default embedding model is **`all-MiniLM-L6-v2`** from SentenceTransformers: - **Embedding Dimension**: 384 - **Speed**: Very fast (optimized for CPU) - **Quality**: High for semantic similarity - **Memory**: ~80MB To use a different model, modify [recommender.py](src/recommender.py#L6) in the `MovieRecommender.__init__()` method. --- ## ๐Ÿš€ Usage ### Running the Server ```bash # Make sure your virtual environment is activated python app.py ``` The server will start at `http://localhost:8000` **API Documentation** (auto-generated Swagger UI): - Swagger: `http://localhost:8000/docs` - ReDoc: `http://localhost:8000/redoc` ### Data Ingestion Before using the API, you need to populate the FAISS index with movies: ```bash python src/ingest.py ``` This will: 1. Fetch ~50 high-quality movies from TMDB (popularity โ‰ฅ 7.0, votes โ‰ฅ 500) 2. Extract director, cast, and keywords for each movie 3. Generate embeddings 4. Save to `models/movie_index.faiss` and `models/metadata.pkl` To reset and rebuild the index: ```python # In src/ingest.py, modify the last line: ingest_high_quality_movies(target_count=100, reset=True) # reset=True to rebuild ``` --- ## ๐Ÿ“ก API Endpoints ### 1. **Health Check** ``` GET / ``` **Response:** ```json { "status": "online and active!!!", "model_loaded": true } ``` --- ### 2. **Semantic Search** ๐Ÿ” ``` POST /search ``` **Request:** ```json { "query": "A romantic movie about a sinking ship", "k": 5 } ``` **Response:** ```json [ { "movie_id": 597, "title": "Titanic", "score": 0.856 }, { "movie_id": 285, "title": "The Poseidon Adventure", "score": 0.743 } ] ``` --- ### 3. **Vibe-Based Search** ๐ŸŽฏ ``` POST /recommend/vibe ``` **Request:** ```json { "tags": ["Sci-Fi", "Action", "Space"], "description": "Robots fighting in space with stunning visuals", "k": 10 } ``` **Response:** ```json { "interpreted_query": "Sci-Fi Action Space Sci-Fi Action Space Robots fighting in space with stunning visuals", "results": [ { "movie_id": 58, "title": "The Fifth Element", "score": 0.912 } ] } ``` --- ### 4. **Personalized Recommendations** ๐Ÿ‘ค ``` POST /recommend/user ``` **Request:** ```json { "liked_movies": ["The Matrix", "Inception", "Interstellar"], "k": 5 } ``` **Response:** ```json [ { "movie_id": 27205, "title": "Oblivion", "score": 0.834 }, { "movie_id": 284054, "title": "Doctor Strange", "score": 0.798 } ] ``` --- ### 5. **Similar Movies** ๐Ÿ”— ``` GET /recommend/movie/{title} ``` **Example:** ``` GET /recommend/movie/Inception ``` **Response:** ```json [ { "movie_id": 38372, "title": "Interstellar", "score": 0.891 }, { "movie_id": 603, "title": "The Matrix", "score": 0.867 } ] ``` --- ### 6. **Admin: Trigger Background Update** ๐Ÿ”„ ``` POST /admin/trigger-update ``` **Response:** ```json { "message": "Update process started in background. Check server logs for progress." } ``` This endpoint triggers background ingestion without blocking the API. --- ## ๐Ÿ“ Examples ### Example 1: Find Movies Similar to Your Favorite ```python import requests BASE_URL = "http://localhost:8000" # Get movies similar to "The Matrix" response = requests.get(f"{BASE_URL}/recommend/movie/The Matrix") recommendations = response.json() for movie in recommendations: print(f"{movie['title']} (Score: {movie['score']:.2f})") ``` ### Example 2: Semantic Search with Natural Language ```python response = requests.post( f"{BASE_URL}/search", json={ "query": "A thrilling space adventure with amazing visuals", "k": 5 } ) for movie in response.json(): print(f"โœ“ {movie['title']}") ``` ### Example 3: Personalized Recommendations Based on History ```python response = requests.post( f"{BASE_URL}/recommend/user", json={ "liked_movies": ["Dune", "Blade Runner 2049", "Arrival"], "k": 10 } ) for movie in response.json(): print(f"โ˜… {movie['title']}") ``` --- ## ๐Ÿ“‚ Project Structure ``` CineMatch/ โ”œโ”€โ”€ app.py # Main FastAPI application โ”œโ”€โ”€ main.py # (Optional) Alternative entry point โ”œโ”€โ”€ Dockerfile # Docker configuration โ”œโ”€โ”€ requirements.txt # Python dependencies โ”œโ”€โ”€ .env # API keys (create this) โ”‚ โ”œโ”€โ”€ src/ โ”‚ โ”œโ”€โ”€ __init__.py โ”‚ โ”œโ”€โ”€ recommender.py # Core FAISS-based recommendation engine โ”‚ โ”œโ”€โ”€ ingest.py # TMDB data ingestion pipeline โ”‚ โ””โ”€โ”€ preprocessing.py # Data cleaning & feature engineering โ”‚ โ”œโ”€โ”€ models/ โ”‚ โ”œโ”€โ”€ movie_index.faiss # FAISS index (generated after ingestion) โ”‚ โ””โ”€โ”€ metadata.pkl # Movie metadata dataframe (generated) โ”‚ โ”œโ”€โ”€ eda/ โ”‚ โ””โ”€โ”€ Untitled.ipynb # Exploratory data analysis notebook โ”‚ โ””โ”€โ”€ README.md # This file ``` --- ## ๐Ÿง  How It Works ### The Embedding Pipeline ``` Raw Text Input (Movie Title + Metadata) โ†“ [SentenceTransformers] โ†“ 384-Dimensional Vector โ†“ [L2 Normalization] โ†“ Normalized Vector (Unit Length) โ†“ [FAISS IndexFlatIP] โ†“ Stored in Index ``` ### Recommendation Flow 1. **User provides query** (text, tags, or movie titles) 2. **Convert to vector** using SentenceTransformers 3. **Normalize vector** (for cosine similarity) 4. **FAISS search** finds K nearest neighbors in index 5. **Return results** with similarity scores ### Why This Approach? - **Fast**: FAISS is optimized for billion-scale vector search - **Accurate**: Semantic embeddings capture meaning, not just keywords - **Scalable**: Can handle millions of movies - **CPU-Friendly**: MiniLM model is tiny but effective - **Incremental**: Add movies without retraining --- ## โšก Performance Considerations ### Indexing Speed - **MiniLM Model**: ~100-200 movies/second on modern CPU - **FAISS Indexing**: Instant for additions - **Memory**: ~384 bytes per movie embedding ### Search Speed - **Single Query**: 1-5ms - **Batch Queries**: Linear time complexity O(n) - **Max Practical Size**: 10+ million movies ### Optimization Tips 1. **Use Batch Processing**: Send multiple queries at once 2. **Tune k Parameter**: Lower k = faster results (typically k=5-10 is good) 3. **CPU**: The MiniLM model leverages BLAS libraries for speed 4. **GPU**: Optionalโ€”can speed up embedding generation 10x --- ## ๐Ÿณ Deployment ### Docker Build & Run ```bash # Build image docker build -t cinematch:latest . # Run container docker run -p 8000:8000 \ -e TMDB_API_KEY=your_key \ cinematch:latest ``` ### Production Deployment The project includes a `Dockerfile` configured for production use: - **Base Image**: Python 3.9+ - **Port**: 8000 (configurable) - **Entry**: `python app.py` For production, consider: - Using **Gunicorn** or **Uvicorn** with multiple workers - Adding **Nginx** reverse proxy - Implementing **authentication** (API keys) - Using **cloud storage** for models (S3, GCS) --- ## ๐Ÿ› Troubleshooting ### Issue: "No model found" Error **Solution**: Run data ingestion first: ```bash python src/ingest.py ``` ### Issue: TMDB API Key Invalid **Solution**: Verify your `.env` file: ```bash cat .env # Check the key is there ``` ### Issue: Out of Memory **Solution**: Reduce batch size in [recommender.py](src/recommender.py#L18): ```python batch_size = 32 # Lower from 64 ``` ### Issue: Slow Embedding Generation **Solution**: - The MiniLM model is already optimized for CPU - For GPU support, install PyTorch with CUDA: ```bash pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 ``` ### Issue: CORS Errors **Solution**: Already handled in [app.py](app.py#L15). The API allows all origins (`allow_origins=["*"]`). For production, restrict this: ```python allow_origins=["https://yourdomain.com"] ``` --- ## ๐Ÿ“Š Dataset Information **Movie Source**: The Movie Database (TMDB) API **Filtering Criteria**: - Minimum Rating: 7.0 / 10.0 - Minimum Vote Count: 500 votes - Sorted by: Popularity (descending) **Metadata Included**: - Title - Director - Cast (top 4 actors) - Keywords - Genres - Overview / Plot - Vote Average --- ## ๐Ÿ”ฎ Future Enhancements - [ ] User authentication & API key management - [ ] Collaborative filtering (user-user similarity) - [ ] Real-time model updates with webhooks - [ ] Advanced filtering (year, rating, runtime) - [ ] Movie rating & feedback loop for model improvement - [ ] Multi-language support - [ ] Mobile app integration --- ## ๐Ÿ“„ License This project is open source. Feel free to modify and extend it! --- ## ๐Ÿ’ฌ Support For issues, questions, or contributions: 1. Check the [Troubleshooting](#troubleshooting) section 2. Review the [API Documentation](http://localhost:8000/docs) 3. Examine the source code in `src/` directory --- **Enjoy discovering your next favorite movie! ๐Ÿฟ๐ŸŽฌ**