# Architecture Overview: DeepShelf
This document provides a detailed overview of the DeepShelf application's architecture, outlining its components, data flow, and key technology decisions.
## System Overview
DeepShelf is a content-based book recommendation system that leverages natural language processing (NLP) and vector similarity search to help users discover books. It comprises a Streamlit-based web application for user interaction, a FastAPI service for programmatic access, and a suite of Python scripts for data processing and embedding generation.
The core principle is to transform book descriptions into high-dimensional numerical vectors (embeddings) using a pre-trained sentence transformer model. These embeddings are then used to find semantically similar books or to cluster books into thematic collections.
## Component Descriptions
The system is structured into several modular components:
1. **Data Layer (`data/`)**: Stores raw, prepared, and processed data.
* `data/raw/`: Original datasets (e.g., `goodreads_data.csv`).
* `data/processed/`: Cleaned book metadata (`books_cleaned.parquet`), pre-computed embeddings (`book_embeddings.npy`), and embedding metadata (`embedding_metadata.json`).
2. **Scripts Layer (`scripts/`)**: Utility scripts for data handling and model preparation.
* `prepare_goodreads_data.py`: Adapts raw data sources to a standardized format.
* `download_model.py` (optional): For pre-downloading the sentence transformer model.
3. **Source Code Layer (`src/book_recommender/`)**: Contains the core logic of the application, organized into sub-packages.
* **`src/book_recommender/core/`**: Core utilities and configurations.
* `config.py`: Centralized configuration settings and constants.
* `exceptions.py`: Custom exception definitions for error handling.
* `logging_config.py`: Centralized logging configuration for the entire application.
* **`src/book_recommender/data/`**: Data processing components.
* `processor.py`: Handles data cleaning, normalization, deduplication, and feature engineering (e.g., creating `combined_text` for embeddings).
* **`src/book_recommender/ml/`**: Machine Learning related components.
* `embedder.py`: Manages the loading of the Sentence Transformer model and generation of embeddings for books and user queries. It uses `lru_cache` for efficient model loading.
* `recommender.py`: Implements the core recommendation logic using a FAISS index for fast Approximate Nearest Neighbor (ANN) search on book embeddings.
* `clustering.py`: Contains logic for K-Means clustering of book embeddings and generating descriptive names for these clusters based on common genres.
* `explainability.py`: Provides rule-based explanations for recommendations, detailing contributing factors like genre, keywords, and author similarity.
* `feedback.py`: Manages saving and retrieving user feedback on recommendations to a JSONL file.
* `src/book_recommender/utils.py`: General utility functions, including book cover fetching from various APIs (Google Books, Open Library) and batch processing.
4. **User Interface Layer (`src/book_recommender/apps/`)**: Streamlit applications.
* `main_app.py`: The main interactive web application where users can get book recommendations, browse by genre/query, view explanations, and provide feedback.
* `analytics_app.py`: A separate Streamlit dashboard to visualize collected user feedback and system usage statistics.
5. **API Layer (`src/book_recommender/api/`)**: FastAPI application for programmatic access.
* `main.py`: Defines the FastAPI application and its endpoints (recommendations, book listing, search, stats, clusters, explanations, feedback).
* `models.py`: Pydantic models for request and response data validation and serialization.
* `dependencies.py`: FastAPI dependency injection functions to manage and cache shared resources like the `BookRecommender` and embedding models.
6. **CI/CD (`.github/workflows/ci-cd.yml`)**: GitHub Actions workflow for automated testing and code quality checks.
7. **Containerization (`streamlit.Dockerfile`, `api.Dockerfile`, `analytics.Dockerfile`, `docker-compose.yml`)**: For packaging and orchestrating the application services.
* `streamlit.Dockerfile`: Defines the build process for the main Streamlit application into a Docker image.
* `api.Dockerfile`: Defines the build process for the FastAPI application into a Docker image.
* `analytics.Dockerfile`: Defines the build process for the analytics Streamlit application into a Docker image.
* `docker-compose.yml`: Orchestrates the `streamlit`, `api`, and `analytics` services for local development and deployment.
## Data Flow
The primary data flow for generating recommendations and user interaction is as follows:
1. **Raw Data Ingestion**: Raw CSV datasets (e.g., `goodreads_data.csv`) are placed in `data/raw/`.
2. **Data Preparation**: The `scripts/prepare_goodreads_data.py` script and `src/book_recommender/data/processor.py` clean, standardize, and deduplicate the raw data, saving it as `books_cleaned.parquet` in `data/processed/`.
3. **Embedding Generation**: `src/book_recommender/ml/embedder.py` loads `books_cleaned.parquet` and generates semantic embeddings for each book's `combined_text`, saving them as `book_embeddings.npy` in `data/processed/`.
4. **Application Startup**:
* The `src/book_recommender/apps/main_app.py` (Streamlit) and `src/book_recommender/api/main.py` (FastAPI) applications load `books_cleaned.parquet` and `book_embeddings.npy` into memory.
* `src/book_recommender/ml/recommender.py` initializes a FAISS index with these embeddings for fast similarity search.
* `src/book_recommender/ml/clustering.py` generates book clusters and their names from the embeddings.
5. **User Interaction (Streamlit `src/book_recommender/apps/main_app.py`)**:
* User inputs a natural language query or browses clusters.
* If a query, `src/book_recommender/ml/embedder.py` generates an embedding for the query.
* `src/book_recommender/ml/recommender.py` uses the query embedding (or a book's embedding from a title search) to find similar books.
* `src/book_recommender/ml/explainability.py` generates reasons for recommendations.
* `src/book_recommender/utils.py` fetches book cover images.
* `src/book_recommender/ml/feedback.py` stores user feedback on recommendations.
6. **API Interaction (FastAPI `src/book_recommender/api/main.py`)**:
* External clients send requests to API endpoints (e.g., `/recommend/query`, `/books`, `/feedback`).
* `src/book_recommender/api/dependencies.py` ensures efficient loading and caching of `recommender` and `embedder` instances.
* Requests are validated using Pydantic models (`src/book_recommender/api/models.py`).
* Core logic in `src/book_recommender/` modules is invoked (e.g., `recommender.py`, `explainability.py`, `feedback.py`).
* Responses are returned, adhering to defined Pydantic response models.
**Example API Calls:**
```bash
# Health Check
curl http://localhost:8000/health
# Recommend by Query
curl -X POST "http://localhost:8000/recommend/query" \
  -H "Content-Type: application/json" \
  -d '{
        "query": "fantasy adventure with dragons",
        "top_k": 5
      }'
# List all Clusters
curl http://localhost:8000/clusters
# Submit Feedback
curl -X POST "http://localhost:8000/feedback" \
  -H "Content-Type: application/json" \
  -d '{
        "query": "fantasy adventure with dragons",
        "book_id": "example_book_id",
        "feedback_type": "positive",
        "session_id": "user_session_abc"
      }'
```
7. **Analytics (`src/book_recommender/apps/analytics_app.py`)**:
* Loads accumulated feedback data from `data/feedback/user_feedback.jsonl` using `src/book_recommender/ml/feedback.py`.
* Processes and visualizes statistics using Streamlit and Plotly.
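The append-only JSONL feedback store that steps 5 and 7 revolve around can be sketched as follows; the function names are illustrative and not the actual `feedback.py` API:

```python
# Sketch of JSONL feedback persistence: one JSON object per line,
# appended on write, parsed line-by-line on read.
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def save_feedback(path: Path, record: dict) -> None:
    """Append one feedback record as a JSON line."""
    record = {**record, "timestamp": datetime.now(timezone.utc).isoformat()}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def load_feedback(path: Path) -> list[dict]:
    """Read all feedback records back into memory."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = Path(tempfile.mkdtemp()) / "user_feedback.jsonl"
save_feedback(path, {"query": "dragons", "book_id": "b1", "feedback_type": "positive"})
save_feedback(path, {"query": "dragons", "book_id": "b2", "feedback_type": "negative"})
print(len(load_feedback(path)))  # 2
```

JSONL keeps writes cheap and append-safe, which is why the doc flags a real database only as a future consideration.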
```mermaid
graph TD
subgraph "Data Flow & Processing"
raw_data["Raw Data (CSV)"] --> A(scripts/prepare_goodreads_data.py);
A --> B[books_prepared.csv];
B --> C{src/book_recommender/data/processor.py};
C --> D[books_cleaned.parquet];
D --contains text--> E{src/book_recommender/ml/embedder.py};
E --> F[book_embeddings.npy];
end
subgraph Application Runtime
G(src/book_recommender/apps/main_app.py - Streamlit UI) --loads--> D & F;
G --uses--> H(src/book_recommender/ml/recommender.py);
G --uses--> I(src/book_recommender/ml/embedder.py);
G --uses--> J(src/book_recommender/ml/clustering.py);
G --uses--> K(src/book_recommender/ml/explainability.py);
G --uses--> L(src/book_recommender/ml/feedback.py);
M(src/book_recommender/api/main.py - FastAPI) --uses--> N(src/book_recommender/api/dependencies.py);
N --loads--> D & F;
N --uses--> H & I & J & K & L;
O(src/book_recommender/apps/analytics_app.py - Streamlit Dashboard) --loads--> P[user_feedback.jsonl];
P --via--> L;
User[User] --Interacts with--> G;
Client[External Client] --Interacts with--> M;
end
subgraph "Utilities & Configuration"
Q[src/book_recommender/core/config.py];
R[src/book_recommender/utils.py];
S[src/book_recommender/core/logging_config.py];
end
style raw_data fill:#f9f,stroke:#333,stroke-width:2px
style D fill:#ccf,stroke:#333,stroke-width:2px
style F fill:#ccf,stroke:#333,stroke-width:2px
style P fill:#fcc,stroke:#333,stroke-width:2px
```
## Technology Decisions
* **Python 3.10+**: Modern, versatile language.
* **Streamlit**: Chosen for rapid development of interactive web UIs with minimal frontend code. Its caching mechanisms (`@st.cache_resource`, `@st.cache_data`) are crucial for performance with ML models.
* **FastAPI**: Selected for building a high-performance, asynchronous API with automatic Pydantic-based data validation and Swagger/OpenAPI documentation.
* **Sentence-Transformers**: A powerful library for generating dense vector embeddings from text, suitable for semantic search.
* **FAISS**: An efficient library for similarity search and clustering of dense vectors, essential for scaling recommendations.
* **Pandas / NumPy**: Standard libraries for data manipulation and numerical operations.
* **Scikit-learn**: Used for traditional machine learning tasks like K-Means clustering.
* **Plotly Express**: For creating interactive and aesthetically pleasing visualizations in the analytics dashboard.
* **Pydantic**: Data validation and settings management using Python type hints, integral to FastAPI.
* **python-dotenv**: For managing environment variables, facilitating flexible configuration across environments.
* **GitHub Actions**: For Continuous Integration/Continuous Deployment (CI/CD), automating testing, linting, and Docker image builds.
* **Docker / Docker Compose**: For containerizing the application and orchestrating multi-service deployments, ensuring consistent environments.
## Future Considerations
* **Data Version Control (DVC)**: Implement DVC for robust tracking of data and model versions, enhancing reproducibility in production.
* **Scalability**: For extremely large datasets, consider distributed FAISS indexes or cloud-native vector databases.
* **Advanced Explanations**: Explore more sophisticated XAI techniques beyond rule-based, potentially involving LLMs or specific feature attribution methods.
* **User Authentication**: For multi-user scenarios, integrate an authentication system (e.g., OAuth2, JWT).
* **Database Integration**: Replace JSONL feedback storage with a dedicated database (e.g., PostgreSQL) for more robust data management and querying.
* **Full UI Testing**: Implement UI tests using tools like Playwright or Selenium to ensure frontend consistency and functionality. |