# Architecture Overview: DeepShelf

This document provides a detailed overview of the BookFinder application's architecture, outlining its components, data flow, and key technology decisions.

## System Overview

BookFinder is a content-based book recommendation system that leverages natural language processing (NLP) and vector similarity search to help users discover books. It comprises a Streamlit-based web application for user interaction, a FastAPI service for programmatic access, and a suite of Python scripts for data processing and embedding generation.

The core principle is to transform book descriptions into high-dimensional numerical vectors (embeddings) using a pre-trained sentence transformer model. These embeddings are then used to find semantically similar books or to cluster books into thematic collections.
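
As a minimal sketch of this principle, the snippet below ranks a few books by cosine similarity to a query vector. The three-dimensional vectors, the book titles, and the helper names `cosine_similarity` and `top_k_similar` are illustrative assumptions; real embeddings come from the sentence transformer model and have far more dimensions, and the application itself uses a FAISS index rather than a Python loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def top_k_similar(query, embeddings, k=2):
    """Return the k (title, score) pairs most similar to the query vector."""
    scored = [(title, cosine_similarity(query, vec)) for title, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for real book embeddings (hypothetical data).
toy_embeddings = {
    "Dragon Quest Saga": [0.9, 0.1, 0.0],
    "Space Opera": [0.1, 0.9, 0.1],
    "Wizard School": [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]  # pretend this encodes "fantasy adventure"

results = top_k_similar(query_vector, toy_embeddings, k=2)
for title, score in results:
    print(f"{title}: {score:.3f}")
```

The two fantasy-like vectors rank above the science-fiction one because they point in nearly the same direction as the query.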

## Component Descriptions

The system is structured into several modular components:

1.  **Data Layer (`data/`)**: Stores raw, prepared, and processed data.
    *   `data/raw/`: Original datasets (e.g., `goodreads_data.csv`).
    *   `data/processed/`: Cleaned book metadata (`books_cleaned.parquet`), pre-computed embeddings (`book_embeddings.npy`), and embedding metadata (`embedding_metadata.json`).

2.  **Scripts Layer (`scripts/`)**: Utility scripts for data handling and model preparation.
    *   `prepare_goodreads_data.py`: Adapts raw data sources to a standardized format.
    *   `download_model.py` (optional): Pre-downloads the sentence transformer model.

3.  **Source Code Layer (`src/book_recommender/`)**: Contains the core logic of the application, organized into sub-packages.
    *   **`src/book_recommender/core/`**: Core utilities and configurations.
        *   `config.py`: Centralized configuration settings and constants.
        *   `exceptions.py`: Custom exception definitions for error handling.
        *   `logging_config.py`: Centralized logging configuration for the entire application.
    *   **`src/book_recommender/data/`**: Data processing components.
        *   `processor.py`: Handles data cleaning, normalization, deduplication, and feature engineering (e.g., creating `combined_text` for embeddings).
    *   **`src/book_recommender/ml/`**: Machine Learning related components.
        *   `embedder.py`: Manages the loading of the Sentence Transformer model and generation of embeddings for books and user queries. It uses `lru_cache` for efficient model loading.
        *   `recommender.py`: Implements the core recommendation logic using a FAISS index for fast Approximate Nearest Neighbor (ANN) search on book embeddings.
        *   `clustering.py`: Contains logic for K-Means clustering of book embeddings and generating descriptive names for these clusters based on common genres.
        *   `explainability.py`: Provides rule-based explanations for recommendations, detailing contributing factors like genre, keywords, and author similarity.
        *   `feedback.py`: Manages saving and retrieving user feedback on recommendations to a JSONL file.
    *   `src/book_recommender/utils.py`: General utility functions, including book cover fetching from various APIs (Google Books, Open Library) and batch processing.

4.  **User Interface Layer (`src/book_recommender/apps/`)**: Streamlit applications.
    *   `main_app.py`: The main interactive web application where users can get book recommendations, browse by genre/query, view explanations, and provide feedback.
    *   `analytics_app.py`: A separate Streamlit dashboard to visualize collected user feedback and system usage statistics.

5.  **API Layer (`src/book_recommender/api/`)**: FastAPI application for programmatic access.
    *   `main.py`: Defines the FastAPI application and its endpoints (recommendations, book listing, search, stats, clusters, explanations, feedback).
    *   `models.py`: Pydantic models for request and response data validation and serialization.
    *   `dependencies.py`: FastAPI dependency injection functions to manage and cache shared resources like the `BookRecommender` and embedding models.

6.  **CI/CD (`.github/workflows/ci-cd.yml`)**: GitHub Actions workflow for automated testing and code quality checks.

7.  **Containerization (`streamlit.Dockerfile`, `api.Dockerfile`, `analytics.Dockerfile`, `docker-compose.yml`)**: For packaging and orchestrating the application services.
    *   `streamlit.Dockerfile`: Defines the build process for the main Streamlit application into a Docker image.
    *   `api.Dockerfile`: Defines the build process for the FastAPI application into a Docker image.
    *   `analytics.Dockerfile`: Defines the build process for the analytics Streamlit application into a Docker image.
    *   `docker-compose.yml`: Orchestrates the `streamlit`, `api`, and `analytics` services for local development and deployment.
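
The `lru_cache` pattern mentioned for `embedder.py` can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: `FakeModel` is a hypothetical stand-in for the sentence transformer, and `get_model` is an assumed function name. The point is only that a cached loader constructs the expensive object once and returns the same instance afterwards.

```python
from functools import lru_cache


class FakeModel:
    """Hypothetical stand-in for an expensive-to-load embedding model."""

    load_count = 0  # counts how many times a model was constructed

    def __init__(self, name):
        FakeModel.load_count += 1
        self.name = name


@lru_cache(maxsize=1)
def get_model(name="example-model"):
    """Load the model on first call; later calls reuse the cached instance."""
    return FakeModel(name)


first = get_model()
second = get_model()
print(first is second, FakeModel.load_count)
```

Note that `lru_cache` keys on the call arguments as written, so `get_model()` and `get_model("example-model")` would be cached separately; calling the loader one consistent way avoids accidental reloads.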

## Data Flow

The primary data flow for generating recommendations and user interaction is as follows:

1.  **Raw Data Ingestion**: Raw CSV datasets (e.g., `goodreads_data.csv`) are placed in `data/raw/`.
2.  **Data Preparation**: The `scripts/prepare_goodreads_data.py` script and `src/book_recommender/data/processor.py` clean, standardize, and deduplicate the raw data, saving it as `books_cleaned.parquet` in `data/processed/`.
3.  **Embedding Generation**: `src/book_recommender/ml/embedder.py` loads `books_cleaned.parquet` and generates semantic embeddings for each book's `combined_text`, saving them as `book_embeddings.npy` in `data/processed/`.
4.  **Application Startup**:
    *   The `src/book_recommender/apps/main_app.py` (Streamlit) and `src/book_recommender/api/main.py` (FastAPI) applications load `books_cleaned.parquet` and `book_embeddings.npy` into memory.
    *   `src/book_recommender/ml/recommender.py` initializes a FAISS index with these embeddings for fast similarity search.
    *   `src/book_recommender/ml/clustering.py` generates book clusters and their names from the embeddings.
5.  **User Interaction (Streamlit `src/book_recommender/apps/main_app.py`)**:
    *   User inputs a natural language query or browses clusters.
    *   If the user entered a query, `src/book_recommender/ml/embedder.py` generates an embedding for it.
    *   `src/book_recommender/ml/recommender.py` uses the query embedding (or a book's embedding from a title search) to find similar books.
    *   `src/book_recommender/ml/explainability.py` generates reasons for recommendations.
    *   `src/book_recommender/utils.py` fetches book cover images.
    *   `src/book_recommender/ml/feedback.py` stores user feedback on recommendations.
6.  **API Interaction (FastAPI `src/book_recommender/api/main.py`)**:
    *   External clients send requests to API endpoints (e.g., `/recommend/query`, `/books`, `/feedback`).
    *   `src/book_recommender/api/dependencies.py` ensures efficient loading and caching of `recommender` and `embedder` instances.
    *   Requests are validated using Pydantic models (`src/book_recommender/api/models.py`).
    *   Core logic in `src/book_recommender/` modules is invoked (e.g., `recommender.py`, `explainability.py`, `feedback.py`).
    *   Responses are returned, adhering to defined Pydantic response models.
    
        **Example API Calls:**

        ```bash
        # Health Check
        curl http://localhost:8000/health

        # Recommend by Query
        curl -X POST "http://localhost:8000/recommend/query" \
          -H "Content-Type: application/json" \
          -d '{
            "query": "fantasy adventure with dragons",
            "top_k": 5
          }'

        # List all Clusters
        curl http://localhost:8000/clusters

        # Submit Feedback
        curl -X POST "http://localhost:8000/feedback" \
          -H "Content-Type: application/json" \
          -d '{
            "query": "fantasy adventure with dragons",
            "book_id": "example_book_id",
            "feedback_type": "positive",
            "session_id": "user_session_abc"
          }'
        ```

7.  **Analytics (`src/book_recommender/apps/analytics_app.py`)**:
    *   Loads accumulated feedback data from `data/feedback/user_feedback.jsonl` using `src/book_recommender/ml/feedback.py`.
    *   Processes and visualizes statistics using Streamlit and Plotly.
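
The feedback path in steps 5 to 7 can be sketched with the standard library alone. The field names below match the example feedback request above; the function names `save_feedback` and `load_feedback`, the `"negative"` feedback type, and the use of a temporary directory are assumptions made for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def save_feedback(path, entry):
    """Append one feedback record as a single JSON line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def load_feedback(path):
    """Read all feedback records; a missing file yields an empty list."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    # Temporary location so the sketch does not touch real data.
    feedback_path = Path(tmp) / "feedback" / "user_feedback.jsonl"
    save_feedback(feedback_path, {
        "query": "fantasy adventure with dragons",
        "book_id": "example_book_id",
        "feedback_type": "positive",
        "session_id": "user_session_abc",
    })
    save_feedback(feedback_path, {
        "query": "fantasy adventure with dragons",
        "book_id": "another_book_id",
        "feedback_type": "negative",  # assumed value, not confirmed by the source
        "session_id": "user_session_abc",
    })
    entries = load_feedback(feedback_path)
    counts = Counter(entry["feedback_type"] for entry in entries)

print(counts)
```

Appending one JSON object per line keeps writes cheap and lets a dashboard aggregate the file in a single pass.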

```mermaid
graph TD
    subgraph Data Flow & Processing
        raw_data["Raw Data (CSV)"] --> A(scripts/prepare_goodreads_data.py);
        A --> B[books_prepared.csv];
        B --> C{src/book_recommender/data/processor.py};
        C --> D[books_cleaned.parquet];
        D --contains text--> E{src/book_recommender/ml/embedder.py};
        E --> F[book_embeddings.npy];
    end

    subgraph Application Runtime
        G(src/book_recommender/apps/main_app.py - Streamlit UI) --loads--> D & F;
        G --uses--> H(src/book_recommender/ml/recommender.py);
        G --uses--> I(src/book_recommender/ml/embedder.py);
        G --uses--> J(src/book_recommender/ml/clustering.py);
        G --uses--> K(src/book_recommender/ml/explainability.py);
        G --uses--> L(src/book_recommender/ml/feedback.py);
        
        M(src/book_recommender/api/main.py - FastAPI) --uses--> N(src/book_recommender/api/dependencies.py);
        N --loads--> D & F;
        N --uses--> H & I & J & K & L;

        O(src/book_recommender/apps/analytics_app.py - Streamlit Dashboard) --loads--> P[user_feedback.jsonl];
        P --via--> L;

        User[User] --Interacts with--> G;
        Client[External Client] --Interacts with--> M;
    end

    subgraph Utilities & Configuration
        Q[src/book_recommender/core/config.py];
        R[src/book_recommender/utils.py];
        S[src/book_recommender/core/logging_config.py];
    end

    style raw_data fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#ccf,stroke:#333,stroke-width:2px
    style F fill:#ccf,stroke:#333,stroke-width:2px
    style P fill:#fcc,stroke:#333,stroke-width:2px
```
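
The cleaning stage in the diagram (normalization, deduplication, and building `combined_text`) might look roughly like the sketch below. It uses plain dictionaries instead of Pandas to stay self-contained, and the column names `title`, `author`, and `description`, the deduplication key, and the `combined_text` template are all assumptions rather than the actual behavior of `processor.py`.

```python
def normalize(text):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


def clean_books(rows):
    """Drop untitled rows and duplicates, then build combined_text."""
    seen = set()
    cleaned = []
    for row in rows:
        title = normalize(row.get("title", ""))
        author = normalize(row.get("author", ""))
        if not title:
            continue  # a book without a title cannot be shown to users
        key = (title.lower(), author.lower())
        if key in seen:
            continue  # keep only the first occurrence of each book
        seen.add(key)
        description = normalize(row.get("description", ""))
        cleaned.append({
            "title": title,
            "author": author,
            "description": description,
            # Text that would be passed to the embedding model.
            "combined_text": f"{title} by {author}. {description}".strip(),
        })
    return cleaned


raw_rows = [
    {"title": "  The Hobbit ", "author": "J.R.R. Tolkien",
     "description": "A hobbit goes on an adventure."},
    {"title": "the hobbit", "author": "j.r.r. tolkien",
     "description": "Duplicate entry."},
    {"title": "", "author": "Unknown", "description": "No title."},
]
books = clean_books(raw_rows)
print(books[0]["combined_text"])
```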

## Technology Decisions

*   **Python 3.10+**: Modern, versatile language.
*   **Streamlit**: Chosen for rapid development of interactive web UIs with minimal frontend code. Its caching mechanisms (`@st.cache_resource`, `@st.cache_data`) are crucial for performance with ML models.
*   **FastAPI**: Selected for building a high-performance, asynchronous API with automatic Pydantic-based data validation and Swagger/OpenAPI documentation.
*   **Sentence-Transformers**: A powerful library for generating dense vector embeddings from text, suitable for semantic search.
*   **FAISS**: An efficient library for similarity search and clustering of dense vectors, essential for scaling recommendations.
*   **Pandas / NumPy**: Standard libraries for data manipulation and numerical operations.
*   **Scikit-learn**: Used for traditional machine learning tasks like K-Means clustering.
*   **Plotly Express**: For creating interactive and aesthetically pleasing visualizations in the analytics dashboard.
*   **Pydantic**: Data validation and settings management using Python type hints, integral to FastAPI.
*   **python-dotenv**: For managing environment variables, facilitating flexible configuration across environments.
*   **GitHub Actions**: For Continuous Integration/Continuous Deployment (CI/CD), automating testing, linting, and Docker image builds.
*   **Docker / Docker Compose**: For containerizing the application and orchestrating multi-service deployments, ensuring consistent environments.
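
Environment-driven configuration, as referred to in the python-dotenv entry, can be sketched like this. The variable names `EMBEDDING_MODEL`, `DEFAULT_TOP_K`, and `FEEDBACK_PATH`, the `Settings` class, and the default values are hypothetical. With python-dotenv, `load_dotenv()` would typically be called first to copy entries from a `.env` file into `os.environ`; the sketch omits that call to remain standard-library only.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical configuration values read from the environment."""

    embedding_model: str
    default_top_k: int
    feedback_path: str


def load_settings(env=None):
    """Build Settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return Settings(
        embedding_model=env.get("EMBEDDING_MODEL", "example-model"),
        default_top_k=int(env.get("DEFAULT_TOP_K", "5")),
        feedback_path=env.get("FEEDBACK_PATH", "data/feedback/user_feedback.jsonl"),
    )


# Passing a mapping explicitly makes the override easy to see and to test.
settings = load_settings({"DEFAULT_TOP_K": "10"})
print(settings)
```

Keeping defaults in one place means the same code can run locally, in CI, and inside the Docker services with only the environment changing.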

## Future Considerations

*   **Data Version Control (DVC)**: Implement DVC for robust tracking of data and model versions, enhancing reproducibility in production.
*   **Scalability**: For extremely large datasets, consider distributed FAISS indexes or cloud-native vector databases.
*   **Advanced Explanations**: Explore more sophisticated XAI techniques beyond rule-based explanations, potentially involving LLMs or feature attribution methods.
*   **User Authentication**: For multi-user scenarios, integrate an authentication system (e.g., OAuth2, JWT).
*   **Database Integration**: Replace JSONL feedback storage with a dedicated database (e.g., PostgreSQL) for more robust data management and querying.
*   **Full UI Testing**: Implement UI tests using tools like Playwright or Selenium to ensure frontend consistency and functionality.