Spaces:

nice-bill
/

deepshelf-api

Running

File size: 12,237 Bytes

cdb73a8

# Architecture Overview: DeepShelf

This document provides a detailed overview of the BookFinder application's architecture, outlining its components, data flow, and key technology decisions.

## System Overview

BookFinder is a content-based book recommendation system that leverages natural language processing (NLP) and vector similarity search to help users discover books. It comprises a Streamlit-based web application for user interaction, a FastAPI service for programmatic access, and a suite of Python scripts for data processing and embedding generation.

The core principle is to transform book descriptions into high-dimensional numerical vectors (embeddings) using a pre-trained sentence transformer model. These embeddings are then used to find semantically similar books or to cluster books into thematic collections.

## Component Descriptions

The system is structured into several modular components:

1.  **Data Layer (`data/`)**: Stores raw, prepared, and processed data.
    *   `data/raw/`: Original datasets (e.g., `goodreads_data.csv`).
    *   `data/processed/`: Cleaned book metadata (`books_cleaned.parquet`), pre-computed embeddings (`book_embeddings.npy`), and embedding metadata (`embedding_metadata.json`).

2.  **Scripts Layer (`scripts/`)**: Utility scripts for data handling and model preparation.
    *   `prepare_goodreads_data.py`: Adapts raw data sources to a standardized format.
    *   `download_model.py` (optional): For pre-downloading the sentence transformer model.

3.  **Source Code Layer (`src/book_recommender/`)**: Contains the core logic of the application, organized into sub-packages.
    *   **`src/book_recommender/core/`**: Core utilities and configurations.
        *   `config.py`: Centralized configuration settings and constants.
        *   `exceptions.py`: Custom exception definitions for error handling.
        *   `logging_config.py`: Centralized logging configuration for the entire application.
    *   **`src/book_recommender/data/`**: Data processing components.
        *   `processor.py`: Handles data cleaning, normalization, deduplication, and feature engineering (e.g., creating `combined_text` for embeddings).
    *   **`src/book_recommender/ml/`**: Machine Learning related components.
        *   `embedder.py`: Manages the loading of the Sentence Transformer model and generation of embeddings for books and user queries. It uses `lru_cache` for efficient model loading.
        *   `recommender.py`: Implements the core recommendation logic using a FAISS index for fast Approximate Nearest Neighbor (ANN) search on book embeddings.
        *   `clustering.py`: Contains logic for K-Means clustering of book embeddings and generating descriptive names for these clusters based on common genres.
        *   `explainability.py`: Provides rule-based explanations for recommendations, detailing contributing factors like genre, keywords, and author similarity.
        *   `feedback.py`: Manages saving and retrieving user feedback on recommendations to a JSONL file.
    *   `src/book_recommender/utils.py`: General utility functions, including book cover fetching from various APIs (Google Books, Open Library) and batch processing.

4.  **User Interface Layer (`src/book_recommender/apps/`)**: Streamlit applications.
    *   `main_app.py`: The main interactive web application where users can get book recommendations, browse by genre/query, view explanations, and provide feedback.
    *   `analytics_app.py`: A separate Streamlit dashboard to visualize collected user feedback and system usage statistics.

5.  **API Layer (`src/book_recommender/api/`)**: FastAPI application for programmatic access.
    *   `main.py`: Defines the FastAPI application and its endpoints (recommendations, book listing, search, stats, clusters, explanations, feedback).
    *   `models.py`: Pydantic models for request and response data validation and serialization.
    *   `dependencies.py`: FastAPI dependency injection functions to manage and cache shared resources like the `BookRecommender` and embedding models.

6.  **CI/CD (`.github/workflows/ci-cd.yml`)**: GitHub Actions workflow for automated testing and code quality checks.

7.  **Containerization (`streamlit.Dockerfile`, `api.Dockerfile`, `analytics.Dockerfile`, `docker-compose.yml`)**: For packaging and orchestrating the application services.
    *   `streamlit.Dockerfile`: Defines the build process for the main Streamlit application into a Docker image.
    *   `api.Dockerfile`: Defines the build process for the FastAPI application into a Docker image.
    *   `analytics.Dockerfile`: Defines the build process for the analytics Streamlit application into a Docker image.
    *   `docker-compose.yml`: Orchestrates the `streamlit`, `api`, and `analytics` services for local development and deployment.

## Data Flow

The primary data flow for generating recommendations and user interaction is as follows:

1.  **Raw Data Ingestion**: Raw CSV datasets (e.g., `goodreads_data.csv`) are placed in `data/raw/`.
2.  **Data Preparation**: The `scripts/prepare_goodreads_data.py` script and `src/book_recommender/data/processor.py` clean, standardize, and deduplicate the raw data, saving it as `books_cleaned.parquet` in `data/processed/`.
3.  **Embedding Generation**: `src/book_recommender/ml/embedder.py` loads `books_cleaned.parquet` and generates semantic embeddings for each book's `combined_text`, saving them as `book_embeddings.npy` in `data/processed/`.
4.  **Application Startup**:
    *   The `src/book_recommender/apps/main_app.py` (Streamlit) and `src/book_recommender/api/main.py` (FastAPI) applications load `books_cleaned.parquet` and `book_embeddings.npy` into memory.
    *   `src/book_recommender/ml/recommender.py` initializes a FAISS index with these embeddings for fast similarity search.
    *   `src/book_recommender/ml/clustering.py` generates book clusters and their names from the embeddings.
5.  **User Interaction (Streamlit `src/book_recommender/apps/main_app.py`)**:
    *   User inputs a natural language query or browses clusters.
    *   If a query, `src/book_recommender/ml/embedder.py` generates an embedding for the query.
    *   `src/book_recommender/ml/recommender.py` uses the query embedding (or a book's embedding from a title search) to find similar books.
    *   `src/book_recommender/ml/explainability.py` generates reasons for recommendations.
    *   `src/book_recommender/utils.py` fetches book cover images.
    *   `src/book_recommender/ml/feedback.py` stores user feedback on recommendations.
6.  **API Interaction (FastAPI `src/book_recommender/api/main.py`)**:
    *   External clients send requests to API endpoints (e.g., `/recommend/query`, `/books`, `/feedback`).
    *   `src/book_recommender/api/dependencies.py` ensures efficient loading and caching of `recommender` and `embedder` instances.
    *   Requests are validated using Pydantic models (`src/book_recommender/api/models.py`).
    *   Core logic in `src/book_recommender/` modules is invoked (e.g., `recommender.py`, `explainability.py`, `feedback.py`).
    *   Responses are returned, adhering to defined Pydantic response models.
    
        **Example API Calls:**

        ```bash
        # Health Check
        curl http://localhost:8000/health

        # Recommend by Query
        curl -X POST "http://localhost:8000/recommend/query" \
          -H "Content-Type: application/json" \
          -d '{
            "query": "fantasy adventure with dragons",
            "top_k": 5
          }'

        # List all Clusters
        curl http://localhost:8000/clusters

        # Submit Feedback
        curl -X POST "http://localhost:8000/feedback" \
          -H "Content-Type: application/json" \
          -d '{
            "query": "fantasy adventure with dragons",
            "book_id": "example_book_id",
            "feedback_type": "positive",
            "session_id": "user_session_abc"
          }'
        ```

7.  **Analytics (`src/book_recommender/apps/analytics_app.py`)**:
    *   Loads accumulated feedback data from `data/feedback/user_feedback.jsonl` using `src/book_recommender/ml/feedback.py`.
    *   Processes and visualizes statistics using Streamlit and Plotly.

```mermaid
graph TD
    subgraph Data Flow & Processing
        raw_data[Raw Data (CSV)] --> A(scripts/prepare_goodreads_data.py);
        A --> B[books_prepared.csv];
        B --> C{src/book_recommender/data/processor.py};
        C --> D[books_cleaned.parquet];
        D --contains text--> E{src/book_recommender/ml/embedder.py};
        E --> F[book_embeddings.npy];
    end

    subgraph Application Runtime
        G(src/book_recommender/apps/main_app.py - Streamlit UI) --loads--> D & F;
        G --uses--> H(src/book_recommender/ml/recommender.py);
        G --uses--> I(src/book_recommender/ml/embedder.py);
        G --uses--> J(src/book_recommender/ml/clustering.py);
        G --uses--> K(src/book_recommender/ml/explainability.py);
        G --uses--> L(src/book_recommender/ml/feedback.py);
        
        M(src/book_recommender/api/main.py - FastAPI) --uses--> N(src/book_recommender/api/dependencies.py);
        N --loads--> D & F;
        N --uses--> H & I & J & K & L;

        O(src/book_recommender/apps/analytics_app.py - Streamlit Dashboard) --loads--> P[user_feedback.jsonl];
        P --via--> L;

        User[User] --Interacts with--> G;
        Client[External Client] --Interacts with--> M;
    end

    subgraph Utilities & Configuration
        Q[src/book_recommender/core/config.py];
        R[src/book_recommender/utils.py];
        S[src/book_recommender/core/logging_config.py];
    end

    style raw_data fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#ccf,stroke:#333,stroke-width:2px
    style F fill:#ccf,stroke:#333,stroke-width:2px
    style P fill:#fcc,stroke:#333,stroke-width:2px
```

## Technology Decisions

*   **Python 3.10+**: Modern, versatile language.
*   **Streamlit**: Chosen for rapid development of interactive web UIs with minimal frontend code. Its caching mechanisms (`@st.cache_resource`, `@st.cache_data`) are crucial for performance with ML models.
*   **FastAPI**: Selected for building a high-performance, asynchronous API with automatic Pydantic-based data validation and Swagger/OpenAPI documentation.
*   **Sentence-Transformers**: A powerful library for generating dense vector embeddings from text, suitable for semantic search.
*   **FAISS**: An efficient library for similarity search and clustering of dense vectors, essential for scaling recommendations.
*   **Pandas / NumPy**: Standard libraries for data manipulation and numerical operations.
*   **Scikit-learn**: Used for traditional machine learning tasks like K-Means clustering.
*   **Plotly Express**: For creating interactive and aesthetically pleasing visualizations in the analytics dashboard.
*   **Pydantic**: Data validation and settings management using Python type hints, integral to FastAPI.
*   **python-dotenv**: For managing environment variables, facilitating flexible configuration across environments.
*   **GitHub Actions**: For Continuous Integration/Continuous Deployment (CI/CD), automating testing, linting, and Docker image builds.
*   **Docker / Docker Compose**: For containerizing the application and orchestrating multi-service deployments, ensuring consistent environments.

## Future Considerations

*   **Data Version Control (DVC)**: Implement DVC for robust tracking of data and model versions, enhancing reproducibility in production.
*   **Scalability**: For extremely large datasets, consider distributed FAISS indexes or cloud-native vector databases.
*   **Advanced Explanations**: Explore more sophisticated XAI techniques beyond rule-based, potentially involving LLMs or specific feature attribution methods.
*   **User Authentication**: For multi-user scenarios, integrate an authentication system (e.g., OAuth2, JWT).
*   **Database Integration**: Replace JSONL feedback storage with a dedicated database (e.g., PostgreSQL) for more robust data management and querying.
*   **Full UI Testing**: Implement UI tests using tools like Playwright or Selenium to ensure frontend consistency and functionality.