title: CineMatch API
emoji: 🎬
colorFrom: purple
colorTo: pink
sdk: docker
app_port: 7860

🎬 CineMatch API

CineMatch is a content-based movie recommendation engine. It combines semantic search, vector embeddings, and simple personalization to recommend movies that match a user's query or viewing preferences.

✨ Features

1. Semantic Search 🔍

Search for movies using natural language queries. The system converts your text into a vector embedding and finds semantically similar movies.

Example: "A romantic movie about a sinking ship" → Returns Titanic

2. Vibe-Based Recommendations 🎯

Search by combining tags (genres, themes) and descriptions for more refined results.

Example: Tags: ["Sci-Fi", "Action"], Description: "Robots fighting in space" → Returns relevant matches

3. Personalized Recommendations 👤

Provide a list of movies you've liked, and the system averages their vectors to create a personalized profile, then recommends similar movies.

Example: Liked: ["The Matrix", "Inception"] → Get similar mind-bending films
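The averaging step described above can be sketched with NumPy. This is a minimal illustration, not the project's actual code: the function name `build_user_profile` and the toy 3-dimensional vectors are assumptions standing in for the real 384-dimensional embeddings.

```python
import numpy as np


def build_user_profile(liked_vectors: np.ndarray) -> np.ndarray:
    """Average the liked-movie vectors and rescale the mean to unit length."""
    profile = liked_vectors.mean(axis=0)
    return profile / np.linalg.norm(profile)


# Toy vectors (hypothetical values), normalized to unit length like the real index.
raw_catalog = {
    "The Matrix": np.array([0.9, 0.1, 0.0]),
    "Inception": np.array([0.8, 0.2, 0.1]),
    "Interstellar": np.array([0.7, 0.3, 0.2]),
    "Titanic": np.array([0.0, 0.2, 0.9]),
}
catalog = {title: vec / np.linalg.norm(vec) for title, vec in raw_catalog.items()}

liked = ["The Matrix", "Inception"]
profile = build_user_profile(np.stack([catalog[title] for title in liked]))

# Rank movies the user has not already liked by inner product with the profile.
candidates = [title for title in catalog if title not in liked]
ranked = sorted(candidates, key=lambda t: float(catalog[t] @ profile), reverse=True)
print(ranked)  # ['Interstellar', 'Titanic']
```

Re-normalizing the mean keeps the profile comparable to the stored unit vectors, so its scores stay on the same cosine scale as the other endpoints.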

4. Content-Based Similarity 🔗

Find movies similar to a specific title already in the database.

Example: Similar to "Inception" → Returns "Interstellar", "The Matrix", etc.

5. Rich Movie Metadata 📊

Each movie includes:

Director information
Top 4 cast members
Keywords (e.g., "time travel", "dystopia")
Genres
Plot overview
TMDB rating (vote average)

6. Incremental Learning 📈

Add new movies to the system without retraining. New entries are searchable as soon as they are added to the index.
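A flat vector index needs no training step, which is why adding a movie is just an append. The sketch below uses a hypothetical `TinyFlatIndex` class built on NumPy as a stand-in for a flat inner-product index. It is not the project's `MovieRecommender`, and the vectors are toy values.

```python
import numpy as np


class TinyFlatIndex:
    """Brute-force stand-in for a flat inner-product index (hypothetical helper)."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.titles = []

    def add(self, title: str, vector: np.ndarray) -> None:
        """Append one unit-length vector; existing rows are left untouched."""
        vec = np.asarray(vector, dtype=np.float32).reshape(1, self.dim)
        vec = vec / np.linalg.norm(vec)
        self.vectors = np.vstack([self.vectors, vec])
        self.titles.append(title)


index = TinyFlatIndex(dim=3)
index.add("Inception", np.array([0.8, 0.2, 0.1]))
before = index.vectors.copy()

# Adding a second movie does not recompute or modify the first row.
index.add("Dune", np.array([0.6, 0.1, 0.5]))
print(len(index.titles), np.array_equal(index.vectors[:1], before))  # 2 True
```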

🏗️ Architecture

┌─────────────────┐
│   User Request  │
└────────┬────────┘
         │
    ┌────▼─────────────────────┐
    │   FastAPI Server         │
    │  (Endpoint Handler)       │
    └────┬──────────────────────┘
         │
    ┌────▼──────────────────────────┐
    │  MovieRecommender Engine      │
    │  (FAISS + Vector Search)      │
    └────┬───────────────────────────┘
         │
    ┌────▼──────────────────────┐
    │  Embedding Model          │
    │  (SentenceTransformers)   │
    └────┬──────────────────────┘
         │
    ┌────▼──────────────────────┐
    │  FAISS Index              │
    │  (movie_index.faiss)      │
    └────┬──────────────────────┘
         │
    ┌────▼──────────────────────┐
    │  Movie Metadata           │
    │  (metadata.pkl)           │
    └──────────────────────────┘

🛠️ Tech Stack

| Component | Technology | Purpose |
|---|---|---|
| Backend Framework | FastAPI | High-performance async API |
| Vector Search | FAISS | Fast similarity search on embeddings |
| Embeddings | SentenceTransformers (MiniLM-L6-v2) | Convert text to 384-dim vectors |
| Data Source | TMDB API | Movie metadata (titles, cast, genres, etc.) |
| Data Processing | Pandas, NumPy | Data cleaning & preprocessing |
| Deployment | Docker | Containerized deployment |
| Python Version | 3.9+ | Modern async/await support |

📦 Installation

Prerequisites

Python 3.9 or higher
TMDB API Key (free, get it at themoviedb.org)
~2GB free disk space (for models and indices)

Step 1: Clone & Setup

# Navigate to project directory
cd CineMatch

# Create virtual environment
python -m venv .venv

# Activate virtual environment
# On Windows:
.\.venv\Scripts\Activate.ps1

# On macOS/Linux:
source .venv/bin/activate

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Configure Environment

Create a .env file in the project root:

TMDB_API_KEY=your_api_key_here
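How the project reads this file is not shown here (a loader such as python-dotenv is a common choice). As a standard-library sketch of the idea, the hypothetical helpers below parse `KEY=value` lines and fail early if the key is missing or still set to the placeholder.

```python
def parse_env_text(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_tmdb_key(env: dict) -> str:
    """Return the TMDB key or raise if it is absent or a placeholder."""
    key = env.get("TMDB_API_KEY", "")
    if not key or key == "your_api_key_here":
        raise RuntimeError("TMDB_API_KEY is missing or still set to the placeholder")
    return key


# In practice the text would come from open(".env").read(); a literal is used here.
sample = "# local secrets\nTMDB_API_KEY=abc123\n"
settings = parse_env_text(sample)
print(require_tmdb_key(settings))  # abc123
```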

⚙️ Configuration

Environment Variables

| Variable | Description | Example |
|---|---|---|
| `TMDB_API_KEY` | Your TMDB API key | `abc123xyz...` |

Model Configuration

The default embedding model is all-MiniLM-L6-v2 from SentenceTransformers:

Embedding Dimension: 384
Speed: Very fast (optimized for CPU)
Quality: High for semantic similarity
Memory: ~80MB

To use a different model, edit the MovieRecommender.__init__() method in src/recommender.py.

🚀 Usage

Running the Server

# Make sure your virtual environment is activated
python app.py

The server will start at http://localhost:8000

Interactive API documentation (auto-generated by FastAPI):

Swagger: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Data Ingestion

Before using the API, you need to populate the FAISS index with movies:

python src/ingest.py

This will:

Fetch ~50 high-quality movies from TMDB (rating ≥ 7.0, votes ≥ 500)
Extract director, cast, and keywords for each movie
Generate embeddings
Save to models/movie_index.faiss and models/metadata.pkl

To reset and rebuild the index:

# In src/ingest.py, modify the last line:
ingest_high_quality_movies(target_count=100, reset=True)  # reset=True to rebuild

📡 API Endpoints

1. Health Check

GET /

Response:

{
  "status": "online and active!!!",
  "model_loaded": true
}

2. Semantic Search 🔍

POST /search

Request:

{
  "query": "A romantic movie about a sinking ship",
  "k": 5
}

Response:

[
  {
    "movie_id": 597,
    "title": "Titanic",
    "score": 0.856
  },
  {
    "movie_id": 285,
    "title": "The Poseidon Adventure",
    "score": 0.743
  }
]

3. Vibe-Based Search 🎯

POST /recommend/vibe

Request:

{
  "tags": ["Sci-Fi", "Action", "Space"],
  "description": "Robots fighting in space with stunning visuals",
  "k": 10
}

Response:

{
  "interpreted_query": "Sci-Fi Action Space Sci-Fi Action Space Robots fighting in space with stunning visuals",
  "results": [
    {
      "movie_id": 58,
      "title": "The Fifth Element",
      "score": 0.912
    }
  ]
}
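The `interpreted_query` in the sample response repeats the tags twice before the description, which suggests the tags are weighted by repetition before embedding. The function below is a guess at that step, with `build_vibe_query` and `tag_weight` as hypothetical names. The real implementation may differ.

```python
def build_vibe_query(tags, description, tag_weight=2):
    """Join tags and description into one string, repeating the tags to weight them."""
    tag_text = " ".join(tags)
    parts = [tag_text] * tag_weight + [description]
    # Drop empty parts so a request with no tags or no description stays clean.
    return " ".join(part for part in parts if part)


query = build_vibe_query(
    ["Sci-Fi", "Action", "Space"],
    "Robots fighting in space with stunning visuals",
)
print(query)
# Sci-Fi Action Space Sci-Fi Action Space Robots fighting in space with stunning visuals
```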

4. Personalized Recommendations 👤

POST /recommend/user

Request:

{
  "liked_movies": ["The Matrix", "Inception", "Interstellar"],
  "k": 5
}

Response:

[
  {
    "movie_id": 27205,
    "title": "Oblivion",
    "score": 0.834
  },
  {
    "movie_id": 284054,
    "title": "Doctor Strange",
    "score": 0.798
  }
]

5. Similar Movies 🔗

GET /recommend/movie/{title}

Example:

GET /recommend/movie/Inception

Response:

[
  {
    "movie_id": 38372,
    "title": "Interstellar",
    "score": 0.891
  },
  {
    "movie_id": 603,
    "title": "The Matrix",
    "score": 0.867
  }
]

6. Admin: Trigger Background Update 🔄

POST /admin/trigger-update

Response:

{
  "message": "Update process started in background. Check server logs for progress."
}

This endpoint triggers background ingestion without blocking the API.

📝 Examples

Example 1: Find Movies Similar to Your Favorite

import requests

BASE_URL = "http://localhost:8000"

# Get movies similar to "The Matrix"
response = requests.get(f"{BASE_URL}/recommend/movie/The Matrix")
recommendations = response.json()

for movie in recommendations:
    print(f"{movie['title']} (Score: {movie['score']:.2f})")
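Because the title travels in the URL path, titles with spaces or reserved characters such as `?` are safer when percent-encoded first. A small sketch using the standard library, where `similar_movies_url` is a hypothetical helper name:

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8000"


def similar_movies_url(title: str, base_url: str = BASE_URL) -> str:
    """Build the /recommend/movie/{title} URL with the title percent-encoded."""
    return f"{base_url}/recommend/movie/{quote(title, safe='')}"


print(similar_movies_url("The Matrix"))
# http://localhost:8000/recommend/movie/The%20Matrix
print(similar_movies_url("What If...?"))
# http://localhost:8000/recommend/movie/What%20If...%3F
```

Without encoding, the `?` in the second title would be read as the start of a query string and the server would receive a truncated title.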

Example 2: Semantic Search with Natural Language

import requests

BASE_URL = "http://localhost:8000"

response = requests.post(
    f"{BASE_URL}/search",
    json={
        "query": "A thrilling space adventure with amazing visuals",
        "k": 5
    }
)

for movie in response.json():
    print(f"✓ {movie['title']}")

Example 3: Personalized Recommendations Based on History

import requests

BASE_URL = "http://localhost:8000"

response = requests.post(
    f"{BASE_URL}/recommend/user",
    json={
        "liked_movies": ["Dune", "Blade Runner 2049", "Arrival"],
        "k": 10
    }
)

for movie in response.json():
    print(f"★ {movie['title']}")

📂 Project Structure

CineMatch/
├── app.py                    # Main FastAPI application
├── main.py                   # (Optional) Alternative entry point
├── Dockerfile                # Docker configuration
├── requirements.txt          # Python dependencies
├── .env                      # API keys (create this)
│
├── src/
│   ├── __init__.py
│   ├── recommender.py        # Core FAISS-based recommendation engine
│   ├── ingest.py             # TMDB data ingestion pipeline
│   └── preprocessing.py      # Data cleaning & feature engineering
│
├── models/
│   ├── movie_index.faiss     # FAISS index (generated after ingestion)
│   └── metadata.pkl          # Movie metadata dataframe (generated)
│
├── eda/
│   └── Untitled.ipynb        # Exploratory data analysis notebook
│
└── README.md                 # This file

🧠 How It Works

The Embedding Pipeline

Raw Text Input (Movie Title + Metadata)
          ↓
    [SentenceTransformers]
          ↓
    384-Dimensional Vector
          ↓
    [L2 Normalization]
          ↓
    Normalized Vector (Unit Length)
          ↓
    [FAISS IndexFlatIP]
          ↓
    Stored in Index
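The L2 normalization step is what lets an inner-product index return cosine similarity: for unit-length vectors the two quantities are equal. The NumPy check below illustrates this with random 384-dimensional vectors. `l2_normalize` is an illustrative helper, not a function from the project.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, guarding against division by zero."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


rng = np.random.default_rng(0)
raw = rng.normal(size=(2, 384)).astype(np.float32)
a, b = raw

# Cosine similarity computed directly from the raw vectors.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Plain inner product after normalization.
unit = l2_normalize(raw)
inner = float(unit[0] @ unit[1])

print(abs(cosine - inner) < 1e-5)  # True
```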

Recommendation Flow

User provides query (text, tags, or movie titles)
Convert to vector using SentenceTransformers
Normalize vector (for cosine similarity)
FAISS search finds K nearest neighbors in index
Return results with similarity scores
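The search step in the flow above amounts to scoring the query against every stored vector and keeping the k best. The brute-force NumPy version below is a stand-in for what a flat FAISS index does internally. The `top_k` function and the toy vectors are assumptions for illustration only.

```python
import numpy as np


def top_k(index_vectors: np.ndarray, query_vector: np.ndarray, k: int):
    """Return (row, score) pairs for the k highest inner products."""
    scores = index_vectors @ query_vector
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


def unit(v):
    """Normalize a vector to unit length."""
    v = np.asarray(v, dtype=np.float32)
    return v / np.linalg.norm(v)


titles = ["Titanic", "The Poseidon Adventure", "The Matrix"]
vectors = np.stack([unit([0.0, 0.2, 0.9]), unit([0.1, 0.3, 0.8]), unit([0.9, 0.1, 0.0])])
query = unit([0.0, 0.25, 0.9])  # pretend this came from the embedding model

results = top_k(vectors, query, k=2)
for row, score in results:
    print(f"{titles[row]}: {score:.3f}")
```

With these toy values, Titanic ranks first and The Poseidon Adventure second. Each query costs one pass over the whole index, which is fine at small scale and is where FAISS's optimized kernels matter as the index grows.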

Why This Approach?

Fast: FAISS is optimized for billion-scale vector search
Accurate: Semantic embeddings capture meaning, not just keywords
Scalable: Can handle millions of movies
CPU-Friendly: MiniLM model is tiny but effective
Incremental: Add movies without retraining

⚡ Performance Considerations

Indexing Speed

MiniLM Model: ~100-200 movies/second on modern CPU
FAISS Indexing: Instant for additions
Memory: ~1.5 KB per movie embedding (384 float32 values × 4 bytes = 1,536 bytes)

Search Speed

Single Query: 1-5ms
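Batch Queries: Cost grows linearly with the number of queries. With a flat index, each query is also compared against every stored vector, so per-query time is O(n) in the number of movies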
Max Practical Size: 10+ million movies

Optimization Tips

Use Batch Processing: Send multiple queries at once
Tune k Parameter: Lower k = faster results (typically k=5-10 is good)
CPU: The MiniLM model leverages BLAS libraries for speed
GPU: Optional. It can speed up embedding generation by up to roughly 10x, mainly for large batches

🐳 Deployment

Docker Build & Run

# Build image
docker build -t cinematch:latest .

# Run container
docker run -p 8000:8000 \
  -e TMDB_API_KEY=your_key \
  cinematch:latest

Production Deployment

The project includes a Dockerfile configured for production use:

Base Image: Python 3.9+
Port: 8000 (configurable)
Entry: python app.py

For production, consider:

Using Gunicorn or Uvicorn with multiple workers
Adding Nginx reverse proxy
Implementing authentication (API keys)
Using cloud storage for models (S3, GCS)

🐛 Troubleshooting

Issue: "No model found" Error

Solution: Run data ingestion first:

python src/ingest.py

Issue: TMDB API Key Invalid

Solution: Verify your .env file:

cat .env  # Check the key is there

Issue: Out of Memory

Solution: Reduce batch size in recommender.py:

batch_size = 32  # Lower from 64

Issue: Slow Embedding Generation

Solution:

The MiniLM model is already optimized for CPU
For GPU support, install PyTorch with CUDA:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Issue: CORS Errors

Solution: Already handled in app.py. The API allows all origins (allow_origins=["*"]). For production, restrict this:

allow_origins=["https://yourdomain.com"]

📊 Dataset Information

Movie Source: The Movie Database (TMDB) API

Filtering Criteria:

Minimum Rating: 7.0 / 10.0
Minimum Vote Count: 500 votes
Sorted by: Popularity (descending)
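These criteria can be expressed as a small filter over TMDB-style records. The sketch below assumes each movie is a dict with TMDB's `vote_average`, `vote_count`, and `popularity` fields. `select_high_quality` is a hypothetical name, and the sample records are made up.

```python
MIN_RATING = 7.0
MIN_VOTES = 500


def select_high_quality(movies, limit):
    """Keep movies meeting both thresholds, most popular first, up to `limit`."""
    kept = [
        m for m in movies
        if m.get("vote_average", 0) >= MIN_RATING and m.get("vote_count", 0) >= MIN_VOTES
    ]
    kept.sort(key=lambda m: m.get("popularity", 0), reverse=True)
    return kept[:limit]


sample = [
    {"title": "A", "vote_average": 8.1, "vote_count": 12000, "popularity": 40.0},
    {"title": "B", "vote_average": 6.5, "vote_count": 9000, "popularity": 90.0},  # rating too low
    {"title": "C", "vote_average": 7.4, "vote_count": 300, "popularity": 70.0},   # too few votes
    {"title": "D", "vote_average": 7.0, "vote_count": 500, "popularity": 55.0},
]
print([m["title"] for m in select_high_quality(sample, limit=50)])  # ['D', 'A']
```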

Metadata Included:

Title
Director
Cast (top 4 actors)
Keywords
Genres
Overview / Plot
Vote Average
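Before embedding, the text fields have to be combined into a single string. The exact format the project uses is not documented here, so the following is only one plausible approach. `build_embedding_text` and the record layout are hypothetical, and the vote average is left out of the text since it is numeric.

```python
def build_embedding_text(movie: dict) -> str:
    """Concatenate the text fields of a movie record into one string for embedding."""
    cast = ", ".join(movie.get("cast", [])[:4])  # keep only the top 4 cast members
    fields = [
        movie.get("title", ""),
        movie.get("director", ""),
        cast,
        ", ".join(movie.get("keywords", [])),
        ", ".join(movie.get("genres", [])),
        movie.get("overview", ""),
    ]
    # Skip empty fields so missing metadata does not leave stray separators.
    return ". ".join(field for field in fields if field)


example = {
    "title": "Example Movie",
    "director": "Jane Doe",
    "cast": ["A", "B", "C", "D", "E"],
    "keywords": ["time travel", "dystopia"],
    "genres": ["Sci-Fi"],
    "overview": "A short plot summary.",
}
text = build_embedding_text(example)
print(text)
# Example Movie. Jane Doe. A, B, C, D. time travel, dystopia. Sci-Fi. A short plot summary.
```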

🔮 Future Enhancements

User authentication & API key management
Collaborative filtering (user-user similarity)
Real-time model updates with webhooks
Advanced filtering (year, rating, runtime)
Movie rating & feedback loop for model improvement
Multi-language support
Mobile app integration

📄 License

This project is open source. Feel free to modify and extend it!

💬 Support

For issues, questions, or contributions:

Check the Troubleshooting section
Review the API Documentation
Examine the source code in src/ directory

Enjoy discovering your next favorite movie! 🍿🎬