# Architecture Overview: DeepShelf

This document provides a detailed overview of the BookFinder application's architecture, outlining its components, data flow, and key technology decisions.

## System Overview

BookFinder is a content-based book recommendation system that leverages natural language processing (NLP) and vector similarity search to help users discover books. It comprises a Streamlit-based web application for user interaction, a FastAPI service for programmatic access, and a suite of Python scripts for data processing and embedding generation.

The core principle is to transform book descriptions into high-dimensional numerical vectors (embeddings) using a pre-trained sentence transformer model. These embeddings are then used to find semantically similar books or to cluster books into thematic collections.

## Component Descriptions

The system is structured into several modular components:

1. **Data Layer (`data/`)**: Stores raw, prepared, and processed data.
    * `data/raw/`: Original datasets (e.g., `goodreads_data.csv`).
    * `data/processed/`: Cleaned book metadata (`books_cleaned.parquet`), pre-computed embeddings (`book_embeddings.npy`), and embedding metadata (`embedding_metadata.json`).
2. **Scripts Layer (`scripts/`)**: Utility scripts for data handling and model preparation.
    * `prepare_goodreads_data.py`: Adapts raw data sources to a standardized format.
    * `download_model.py` (optional): For pre-downloading the sentence transformer model.
3. **Source Code Layer (`src/book_recommender/`)**: Contains the core logic of the application, organized into sub-packages.
    * **`src/book_recommender/core/`**: Core utilities and configurations.
        * `config.py`: Centralized configuration settings and constants.
        * `exceptions.py`: Custom exception definitions for error handling.
        * `logging_config.py`: Centralized logging configuration for the entire application.
    * **`src/book_recommender/data/`**: Data processing components.
        * `processor.py`: Handles data cleaning, normalization, deduplication, and feature engineering (e.g., creating `combined_text` for embeddings).
    * **`src/book_recommender/ml/`**: Machine Learning related components.
        * `embedder.py`: Manages the loading of the Sentence Transformer model and generation of embeddings for books and user queries. It uses `lru_cache` for efficient model loading.
        * `recommender.py`: Implements the core recommendation logic using a FAISS index for fast Approximate Nearest Neighbor (ANN) search on book embeddings.
        * `clustering.py`: Contains logic for K-Means clustering of book embeddings and generating descriptive names for these clusters based on common genres.
        * `explainability.py`: Provides rule-based explanations for recommendations, detailing contributing factors like genre, keywords, and author similarity.
        * `feedback.py`: Manages saving and retrieving user feedback on recommendations to a JSONL file.
    * `src/book_recommender/utils.py`: General utility functions, including book cover fetching from various APIs (Google Books, Open Library) and batch processing.
4. **User Interface Layer (`src/book_recommender/apps/`)**: Streamlit applications.
    * `main_app.py`: The main interactive web application where users can get book recommendations, browse by genre/query, view explanations, and provide feedback.
    * `analytics_app.py`: A separate Streamlit dashboard to visualize collected user feedback and system usage statistics.
5. **API Layer (`src/book_recommender/api/`)**: FastAPI application for programmatic access.
    * `main.py`: Defines the FastAPI application and its endpoints (recommendations, book listing, search, stats, clusters, explanations, feedback).
    * `models.py`: Pydantic models for request and response data validation and serialization.
    * `dependencies.py`: FastAPI dependency injection functions to manage and cache shared resources like the `BookRecommender` and embedding models.
6. **CI/CD (`.github/workflows/ci-cd.yml`)**: GitHub Actions workflow for automated testing and code quality checks.
7. **Containerization (`streamlit.Dockerfile`, `api.Dockerfile`, `analytics.Dockerfile`, `docker-compose.yml`)**: For packaging and orchestrating the application services.
    * `streamlit.Dockerfile`: Defines the build process for the main Streamlit application into a Docker image.
    * `api.Dockerfile`: Defines the build process for the FastAPI application into a Docker image.
    * `analytics.Dockerfile`: Defines the build process for the analytics Streamlit application into a Docker image.
    * `docker-compose.yml`: Orchestrates the `streamlit`, `api`, and `analytics` services for local development and deployment.

## Data Flow

The primary data flow for generating recommendations and user interaction is as follows:

1. **Raw Data Ingestion**: Raw CSV datasets (e.g., `goodreads_data.csv`) are placed in `data/raw/`.
2. **Data Preparation**: The `scripts/prepare_goodreads_data.py` script and `src/book_recommender/data/processor.py` clean, standardize, and deduplicate the raw data, saving it as `books_cleaned.parquet` in `data/processed/`.
3. **Embedding Generation**: `src/book_recommender/ml/embedder.py` loads `books_cleaned.parquet` and generates semantic embeddings for each book's `combined_text`, saving them as `book_embeddings.npy` in `data/processed/`.
4. **Application Startup**:
    * The `src/book_recommender/apps/main_app.py` (Streamlit) and `src/book_recommender/api/main.py` (FastAPI) applications load `books_cleaned.parquet` and `book_embeddings.npy` into memory.
    * `src/book_recommender/ml/recommender.py` initializes a FAISS index with these embeddings for fast similarity search.
    * `src/book_recommender/ml/clustering.py` generates book clusters and their names from the embeddings.
5. **User Interaction (Streamlit `src/book_recommender/apps/main_app.py`)**:
    * User inputs a natural language query or browses clusters.
    * If the input is a query, `src/book_recommender/ml/embedder.py` generates an embedding for it.
    * `src/book_recommender/ml/recommender.py` uses the query embedding (or a book's embedding from a title search) to find similar books.
    * `src/book_recommender/ml/explainability.py` generates reasons for recommendations.
    * `src/book_recommender/utils.py` fetches book cover images.
    * `src/book_recommender/ml/feedback.py` stores user feedback on recommendations.
6. **API Interaction (FastAPI `src/book_recommender/api/main.py`)**:
    * External clients send requests to API endpoints (e.g., `/recommend/query`, `/books`, `/feedback`).
    * `src/book_recommender/api/dependencies.py` ensures efficient loading and caching of `recommender` and `embedder` instances.
    * Requests are validated using Pydantic models (`src/book_recommender/api/models.py`).
    * Core logic in `src/book_recommender/` modules is invoked (e.g., `recommender.py`, `explainability.py`, `feedback.py`).
    * Responses are returned, adhering to defined Pydantic response models.

    **Example API Calls:**

    ```bash
    # Health Check
    curl http://localhost:8000/health

    # Recommend by Query
    curl -X POST "http://localhost:8000/recommend/query" \
         -H "Content-Type: application/json" \
         -d '{
               "query": "fantasy adventure with dragons",
               "top_k": 5
             }'

    # List all Clusters
    curl http://localhost:8000/clusters

    # Submit Feedback
    curl -X POST "http://localhost:8000/feedback" \
         -H "Content-Type: application/json" \
         -d '{
               "query": "fantasy adventure with dragons",
               "book_id": "example_book_id",
               "feedback_type": "positive",
               "session_id": "user_session_abc"
             }'
    ```

7. **Analytics (`src/book_recommender/apps/analytics_app.py`)**:
    * Loads accumulated feedback data from `data/feedback/user_feedback.jsonl` using `src/book_recommender/ml/feedback.py`.
    * Processes and visualizes statistics using Streamlit and Plotly.
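
The feedback storage and analytics steps above rely on a JSONL file: one JSON object per line, appended on write and read back line by line. The following is a minimal sketch of that pattern using only the standard library. The function names `save_feedback` and `load_feedback` and the `timestamp` field are illustrative assumptions; the other field names mirror the `/feedback` request body in the example API calls, and the project's actual `feedback.py` may be organized differently.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_feedback(path: Path, query: str, book_id: str,
                  feedback_type: str, session_id: str) -> dict:
    """Append one feedback record to a JSONL file (hypothetical helper)."""
    record = {
        "query": query,
        "book_id": book_id,
        "feedback_type": feedback_type,
        "session_id": session_id,
        # Assumed extra field; not confirmed by the source document.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier records intact; each record is one line.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def load_feedback(path: Path) -> list[dict]:
    """Read every record back, skipping blank lines (hypothetical helper)."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo in a temporary directory so no real project data is touched.
with tempfile.TemporaryDirectory() as tmp:
    feedback_file = Path(tmp) / "feedback" / "user_feedback.jsonl"
    save_feedback(feedback_file, "fantasy adventure with dragons",
                  "example_book_id", "positive", "user_session_abc")
    save_feedback(feedback_file, "fantasy adventure with dragons",
                  "another_book_id", "negative", "user_session_abc")
    records = load_feedback(feedback_file)
    positive_count = sum(r["feedback_type"] == "positive" for r in records)
    print(f"{len(records)} records, {positive_count} positive")
```

Appending one line per event keeps writes cheap and the file easy to inspect by hand, at the cost of having to scan the whole file for any aggregate.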

```mermaid
graph TD
    subgraph Data Flow & Processing
        raw_data["Raw Data (CSV)"] --> A(scripts/prepare_goodreads_data.py);
        A --> B[books_prepared.csv];
        B --> C{src/book_recommender/data/processor.py};
        C --> D[books_cleaned.parquet];
        D --contains text--> E{src/book_recommender/ml/embedder.py};
        E --> F[book_embeddings.npy];
    end

    subgraph Application Runtime
        G(src/book_recommender/apps/main_app.py - Streamlit UI) --loads--> D & F;
        G --uses--> H(src/book_recommender/ml/recommender.py);
        G --uses--> I(src/book_recommender/ml/embedder.py);
        G --uses--> J(src/book_recommender/ml/clustering.py);
        G --uses--> K(src/book_recommender/ml/explainability.py);
        G --uses--> L(src/book_recommender/ml/feedback.py);

        M(src/book_recommender/api/main.py - FastAPI) --uses--> N(src/book_recommender/api/dependencies.py);
        N --loads--> D & F;
        N --uses--> H & I & J & K & L;

        O(src/book_recommender/apps/analytics_app.py - Streamlit Dashboard) --loads--> P[user_feedback.jsonl];
        P --via--> L;

        User[User] --Interacts with--> G;
        Client[External Client] --Interacts with--> M;
    end

    subgraph Utilities & Configuration
        Q[src/book_recommender/core/config.py];
        R[src/book_recommender/utils.py];
        S[src/book_recommender/core/logging_config.py];
    end

    style raw_data fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#ccf,stroke:#333,stroke-width:2px
    style F fill:#ccf,stroke:#333,stroke-width:2px
    style P fill:#fcc,stroke:#333,stroke-width:2px
```

## Technology Decisions

* **Python 3.10+**: Modern, versatile language.
* **Streamlit**: Chosen for rapid development of interactive web UIs with minimal frontend code. Its caching mechanisms (`@st.cache_resource`, `@st.cache_data`) are crucial for performance with ML models.
* **FastAPI**: Selected for building a high-performance, asynchronous API with automatic Pydantic-based data validation and Swagger/OpenAPI documentation.
* **Sentence-Transformers**: A powerful library for generating dense vector embeddings from text, suitable for semantic search.
* **FAISS**: An efficient library for similarity search and clustering of dense vectors, essential for scaling recommendations.
* **Pandas / NumPy**: Standard libraries for data manipulation and numerical operations.
* **Scikit-learn**: Used for traditional machine learning tasks like K-Means clustering.
* **Plotly Express**: For creating interactive and aesthetically pleasing visualizations in the analytics dashboard.
* **Pydantic**: Data validation and settings management using Python type hints, integral to FastAPI.
* **python-dotenv**: For managing environment variables, facilitating flexible configuration across environments.
* **GitHub Actions**: For Continuous Integration/Continuous Deployment (CI/CD), automating testing, linting, and Docker image builds.
* **Docker / Docker Compose**: For containerizing the application and orchestrating multi-service deployments, ensuring consistent environments.

## Future Considerations

* **Data Version Control (DVC)**: Implement DVC for robust tracking of data and model versions, enhancing reproducibility in production.
* **Scalability**: For extremely large datasets, consider distributed FAISS indexes or cloud-native vector databases.
* **Advanced Explanations**: Explore more sophisticated XAI techniques beyond rule-based, potentially involving LLMs or specific feature attribution methods.
* **User Authentication**: For multi-user scenarios, integrate an authentication system (e.g., OAuth2, JWT).
* **Database Integration**: Replace JSONL feedback storage with a dedicated database (e.g., PostgreSQL) for more robust data management and querying.
* **Full UI Testing**: Implement UI tests using tools like Playwright or Selenium to ensure frontend consistency and functionality.
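
The Scalability item above concerns the nearest-neighbor search over book embeddings. As a reference point for what a FAISS index accelerates, the sketch below performs the same kind of lookup as an exhaustive cosine-similarity scan in pure Python. It is a stand-in for illustration only, not the project's `recommender.py`: the helper names `normalize` and `top_k_similar` and the toy three-dimensional vectors are hypothetical, and real sentence-transformer embeddings have hundreds of dimensions.

```python
import math
from typing import Sequence


def normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_similar(query: Sequence[float],
                  embeddings: Sequence[Sequence[float]],
                  k: int) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) for the k most similar rows.

    This scans every row, so cost grows linearly with the catalogue size;
    that linear cost is the reason a dedicated index becomes attractive.
    """
    q = normalize(query)
    scored = []
    for idx, emb in enumerate(embeddings):
        e = normalize(emb)
        scored.append((idx, sum(a * b for a, b in zip(q, e))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: each row stands in for one book's embedding.
book_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [0.0, 1.0, 0.0]

results = top_k_similar(query_embedding, book_embeddings, k=2)
for idx, score in results:
    print(f"book row {idx}: similarity {score:.3f}")
```

Normalizing both sides first means the ranking depends only on direction, not vector length, which is the usual choice for text embeddings.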