deepshelf-api/docs/ARCHITECTURE.md

Architecture Overview: DeepShelf

This document provides a detailed overview of the DeepShelf application's architecture, outlining its components, data flow, and key technology decisions.

System Overview

DeepShelf is a content-based book recommendation system that leverages natural language processing (NLP) and vector similarity search to help users discover books. It comprises a Streamlit-based web application for user interaction, a FastAPI service for programmatic access, and a suite of Python scripts for data processing and embedding generation.

The core principle is to transform book descriptions into high-dimensional numerical vectors (embeddings) using a pre-trained sentence transformer model. These embeddings are then used to find semantically similar books or to cluster books into thematic collections.
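A minimal sketch of this principle, using toy three-dimensional vectors in place of real sentence-transformer embeddings (which typically have several hundred dimensions; the titles and vectors here are illustrative stand-ins):

```python
import numpy as np

# Toy stand-ins for pre-computed sentence-transformer embeddings.
titles = ["Dragon Quest", "Space Opera", "Dragon Lore"]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.2, 0.1],
])

def most_similar(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Rank books by cosine similarity to the query embedding."""
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    scores = embeddings @ query_vec / norms
    top = np.argsort(scores)[::-1][:k]
    return [titles[i] for i in top]

print(most_similar(np.array([1.0, 0.0, 0.0])))
# → ['Dragon Quest', 'Dragon Lore']
```

Both "dragon" books score highest because their vectors point in roughly the same direction as the query; semantic search over real embeddings works the same way at higher dimension.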

Component Descriptions

The system is structured into several modular components:

  1. Data Layer (data/): Stores raw, prepared, and processed data.

    • data/raw/: Original datasets (e.g., goodreads_data.csv).
    • data/processed/: Cleaned book metadata (books_cleaned.parquet), pre-computed embeddings (book_embeddings.npy), and embedding metadata (embedding_metadata.json).
  2. Scripts Layer (scripts/): Utility scripts for data handling and model preparation.

    • prepare_goodreads_data.py: Adapts raw data sources to a standardized format.
    • download_model.py (optional): For pre-downloading the sentence transformer model.
  3. Source Code Layer (src/book_recommender/): Contains the core logic of the application, organized into sub-packages.

    • src/book_recommender/core/: Core utilities and configurations.
      • config.py: Centralized configuration settings and constants.
      • exceptions.py: Custom exception definitions for error handling.
      • logging_config.py: Centralized logging configuration for the entire application.
    • src/book_recommender/data/: Data processing components.
      • processor.py: Handles data cleaning, normalization, deduplication, and feature engineering (e.g., creating combined_text for embeddings).
    • src/book_recommender/ml/: Machine Learning related components.
      • embedder.py: Manages the loading of the Sentence Transformer model and generation of embeddings for books and user queries. It uses lru_cache for efficient model loading.
      • recommender.py: Implements the core recommendation logic using a FAISS index for fast Approximate Nearest Neighbor (ANN) search on book embeddings.
      • clustering.py: Contains logic for K-Means clustering of book embeddings and generating descriptive names for these clusters based on common genres.
      • explainability.py: Provides rule-based explanations for recommendations, detailing contributing factors like genre, keywords, and author similarity.
      • feedback.py: Manages saving and retrieving user feedback on recommendations to a JSONL file.
    • src/book_recommender/utils.py: General utility functions, including book cover fetching from various APIs (Google Books, Open Library) and batch processing.
  4. User Interface Layer (src/book_recommender/apps/): Streamlit applications.

    • main_app.py: The main interactive web application where users can get book recommendations, browse by genre/query, view explanations, and provide feedback.
    • analytics_app.py: A separate Streamlit dashboard to visualize collected user feedback and system usage statistics.
  5. API Layer (src/book_recommender/api/): FastAPI application for programmatic access.

    • main.py: Defines the FastAPI application and its endpoints (recommendations, book listing, search, stats, clusters, explanations, feedback).
    • models.py: Pydantic models for request and response data validation and serialization.
    • dependencies.py: FastAPI dependency injection functions to manage and cache shared resources like the BookRecommender and embedding models.
  6. CI/CD (.github/workflows/ci-cd.yml): GitHub Actions workflow for automated testing and code quality checks.

  7. Containerization (streamlit.Dockerfile, api.Dockerfile, analytics.Dockerfile, docker-compose.yml): For packaging and orchestrating the application services.

    • streamlit.Dockerfile: Defines the build process for the main Streamlit application into a Docker image.
    • api.Dockerfile: Defines the build process for the FastAPI application into a Docker image.
    • analytics.Dockerfile: Defines the build process for the analytics Streamlit application into a Docker image.
    • docker-compose.yml: Orchestrates the streamlit, api, and analytics services for local development and deployment.
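The lru_cache pattern described for embedder.py can be sketched as follows. The loader below is illustrative only: a real implementation would return SentenceTransformer(model_name), and the model name is an assumed default.

```python
from functools import lru_cache

LOADS = 0  # counts how many times the expensive load actually runs

@lru_cache(maxsize=1)
def get_model(model_name: str = "all-MiniLM-L6-v2"):
    """Illustrative loader: a real one would return
    SentenceTransformer(model_name). The dict is a stand-in so the
    sketch runs without the library installed."""
    global LOADS
    LOADS += 1
    return {"name": model_name}

get_model()
get_model()
get_model()
print(LOADS)  # → 1: repeat calls hit the cache, so the model loads once
```

Caching the model object this way avoids re-loading hundreds of megabytes of weights on every request, which matters for both the Streamlit and FastAPI entry points.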

Data Flow

The primary data flow for generating recommendations and user interaction is as follows:

  1. Raw Data Ingestion: Raw CSV datasets (e.g., goodreads_data.csv) are placed in data/raw/.

  2. Data Preparation: The scripts/prepare_goodreads_data.py script and src/book_recommender/data/processor.py clean, standardize, and deduplicate the raw data, saving it as books_cleaned.parquet in data/processed/.

  3. Embedding Generation: src/book_recommender/ml/embedder.py loads books_cleaned.parquet and generates semantic embeddings for each book's combined_text, saving them as book_embeddings.npy in data/processed/.

  4. Application Startup:

    • The src/book_recommender/apps/main_app.py (Streamlit) and src/book_recommender/api/main.py (FastAPI) applications load books_cleaned.parquet and book_embeddings.npy into memory.
    • src/book_recommender/ml/recommender.py initializes a FAISS index with these embeddings for fast similarity search.
    • src/book_recommender/ml/clustering.py generates book clusters and their names from the embeddings.
  5. User Interaction (Streamlit src/book_recommender/apps/main_app.py):

    • User inputs a natural language query or browses clusters.
    • If a query, src/book_recommender/ml/embedder.py generates an embedding for the query.
    • src/book_recommender/ml/recommender.py uses the query embedding (or a book's embedding from a title search) to find similar books.
    • src/book_recommender/ml/explainability.py generates reasons for recommendations.
    • src/book_recommender/utils.py fetches book cover images.
    • src/book_recommender/ml/feedback.py stores user feedback on recommendations.
  6. API Interaction (FastAPI src/book_recommender/api/main.py):

    • External clients send requests to API endpoints (e.g., /recommend/query, /books, /feedback).

    • src/book_recommender/api/dependencies.py ensures efficient loading and caching of recommender and embedder instances.

    • Requests are validated using Pydantic models (src/book_recommender/api/models.py).

    • Core logic in src/book_recommender/ modules is invoked (e.g., recommender.py, explainability.py, feedback.py).

    • Responses are returned, adhering to defined Pydantic response models.

      Example API Calls:

      # Health Check
      curl http://localhost:8000/health
      
      # Recommend by Query
      curl -X POST "http://localhost:8000/recommend/query" \
        -H "Content-Type: application/json" \
        -d '{
          "query": "fantasy adventure with dragons",
          "top_k": 5
        }'
      
      # List all Clusters
      curl http://localhost:8000/clusters
      
      # Submit Feedback
      curl -X POST "http://localhost:8000/feedback" \
        -H "Content-Type: application/json" \
        -d '{
          "query": "fantasy adventure with dragons",
          "book_id": "example_book_id",
          "feedback_type": "positive",
          "session_id": "user_session_abc"
        }'
      
  7. Analytics (src/book_recommender/apps/analytics_app.py):

    • Loads accumulated feedback data from data/feedback/user_feedback.jsonl using src/book_recommender/ml/feedback.py.
    • Processes and visualizes statistics using Streamlit and Plotly.
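The similarity search at the heart of steps 4 and 5 can be sketched in NumPy. This reproduces the exact inner-product top-k search that faiss.IndexFlatIP computes over unit-normalized vectors (where inner product equals cosine similarity); the array sizes and the top_k helper are illustrative, not the actual recommender.py API.

```python
import numpy as np

# Stand-in corpus of 1000 book embeddings, 32-dimensional, unit-normalized
# so that inner product == cosine similarity.
rng = np.random.default_rng(0)
book_vecs = rng.normal(size=(1000, 32)).astype("float32")
book_vecs /= np.linalg.norm(book_vecs, axis=1, keepdims=True)

def top_k(query_vec: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact inner-product search over all books.

    The FAISS equivalent would be roughly:
        index = faiss.IndexFlatIP(32)
        index.add(book_vecs)
        scores, ids = index.search(query_vec[None, :], k)
    """
    scores = book_vecs @ query_vec
    return np.argsort(scores)[::-1][:k]

query = book_vecs[42]     # a book's own embedding as the query
ids = top_k(query, k=5)
print(ids[0])             # → 42: a book is always its own nearest neighbour
```

FAISS accelerates exactly this operation with optimized (and optionally approximate) index structures, which is what makes step 5's interactive latency possible at scale.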
```mermaid
graph TD
    subgraph "Data Flow & Processing"
        raw_data["Raw Data (CSV)"] --> A("scripts/prepare_goodreads_data.py");
        A --> B[books_prepared.csv];
        B --> C{"src/book_recommender/data/processor.py"};
        C --> D[books_cleaned.parquet];
        D -- contains text --> E{"src/book_recommender/ml/embedder.py"};
        E --> F[book_embeddings.npy];
    end

    subgraph "Application Runtime"
        G("src/book_recommender/apps/main_app.py - Streamlit UI") -- loads --> D & F;
        G -- uses --> H("src/book_recommender/ml/recommender.py");
        G -- uses --> I("src/book_recommender/ml/embedder.py");
        G -- uses --> J("src/book_recommender/ml/clustering.py");
        G -- uses --> K("src/book_recommender/ml/explainability.py");
        G -- uses --> L("src/book_recommender/ml/feedback.py");

        M("src/book_recommender/api/main.py - FastAPI") -- uses --> N("src/book_recommender/api/dependencies.py");
        N -- loads --> D & F;
        N -- uses --> H & I & J & K & L;

        O("src/book_recommender/apps/analytics_app.py - Streamlit Dashboard") -- loads --> P[user_feedback.jsonl];
        P -- via --> L;

        User[User] -- interacts with --> G;
        Client[External Client] -- interacts with --> M;
    end

    subgraph "Utilities & Configuration"
        Q["src/book_recommender/core/config.py"];
        R["src/book_recommender/utils.py"];
        S["src/book_recommender/core/logging_config.py"];
    end

    style raw_data fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#ccf,stroke:#333,stroke-width:2px
    style F fill:#ccf,stroke:#333,stroke-width:2px
    style P fill:#fcc,stroke:#333,stroke-width:2px
```

Technology Decisions

  • Python 3.10+: Modern, versatile language.
  • Streamlit: Chosen for rapid development of interactive web UIs with minimal frontend code. Its caching mechanisms (@st.cache_resource, @st.cache_data) are crucial for performance with ML models.
  • FastAPI: Selected for building a high-performance, asynchronous API with automatic Pydantic-based data validation and Swagger/OpenAPI documentation.
  • Sentence-Transformers: A powerful library for generating dense vector embeddings from text, suitable for semantic search.
  • FAISS: An efficient library for similarity search and clustering of dense vectors, essential for scaling recommendations.
  • Pandas / NumPy: Standard libraries for data manipulation and numerical operations.
  • Scikit-learn: Used for traditional machine learning tasks like K-Means clustering.
  • Plotly Express: For creating interactive and aesthetically pleasing visualizations in the analytics dashboard.
  • Pydantic: Data validation and settings management using Python type hints, integral to FastAPI.
  • python-dotenv: For managing environment variables, facilitating flexible configuration across environments.
  • GitHub Actions: For Continuous Integration/Continuous Deployment (CI/CD), automating testing, linting, and Docker image builds.
  • Docker / Docker Compose: For containerizing the application and orchestrating multi-service deployments, ensuring consistent environments.
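As an illustration of the Pydantic-based validation FastAPI relies on, here is a sketch of what a request model for the /recommend/query payload might look like. The field names mirror the curl example above, but the class name and constraints are assumptions; the real models live in src/book_recommender/api/models.py.

```python
from pydantic import BaseModel, Field, ValidationError

class QueryRequest(BaseModel):
    """Illustrative request model for a /recommend/query-style endpoint;
    field names follow the curl example, constraints are assumed."""
    query: str
    top_k: int = Field(default=5, ge=1, le=50)

req = QueryRequest(query="fantasy adventure with dragons")
print(req.top_k)  # → 5 (default applied)

try:
    QueryRequest(query="x", top_k=0)  # violates ge=1
except ValidationError:
    print("rejected")  # → rejected
```

FastAPI generates the OpenAPI schema and 422 error responses directly from models like this, which is why the endpoints need no hand-written validation code.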

Future Considerations

  • Data Version Control (DVC): Implement DVC for robust tracking of data and model versions, enhancing reproducibility in production.
  • Scalability: For extremely large datasets, consider distributed FAISS indexes or cloud-native vector databases.
  • Advanced Explanations: Explore more sophisticated XAI techniques beyond rule-based, potentially involving LLMs or specific feature attribution methods.
  • User Authentication: For multi-user scenarios, integrate an authentication system (e.g., OAuth2, JWT).
  • Database Integration: Replace JSONL feedback storage with a dedicated database (e.g., PostgreSQL) for more robust data management and querying.
  • Full UI Testing: Implement UI tests using tools like Playwright or Selenium to ensure frontend consistency and functionality.