# Architecture Overview: DeepShelf
This document provides a detailed overview of the DeepShelf application's architecture, outlining its components, data flow, and key technology decisions.
## System Overview
DeepShelf is a content-based book recommendation system that leverages natural language processing (NLP) and vector similarity search to help users discover books. It comprises a Streamlit-based web application for user interaction, a FastAPI service for programmatic access, and a suite of Python scripts for data processing and embedding generation.
The core principle is to transform book descriptions into high-dimensional numerical vectors (embeddings) using a pre-trained sentence transformer model. These embeddings are then used to find semantically similar books or to cluster books into thematic collections.
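The principle can be sketched with toy vectors. A real deployment would use high-dimensional embeddings from a sentence-transformer model, but cosine similarity, the measure of "semantically similar" used here, works the same way at any dimension (the 3-dimensional vectors below are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real sentence-transformer vectors
# typically have several hundred dimensions.
dragon_fantasy = [0.9, 0.1, 0.2]
epic_fantasy   = [0.8, 0.2, 0.3]
cookbook       = [0.1, 0.9, 0.0]

# Books on related topics end up with similar vectors, so their
# cosine similarity is higher.
assert cosine_similarity(dragon_fantasy, epic_fantasy) > \
       cosine_similarity(dragon_fantasy, cookbook)
```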
## Component Descriptions
The system is structured into several modular components:
1. **Data Layer (`data/`)**: Stores raw, prepared, and processed data.
* `data/raw/`: Original datasets (e.g., `goodreads_data.csv`).
* `data/processed/`: Cleaned book metadata (`books_cleaned.parquet`), pre-computed embeddings (`book_embeddings.npy`), and embedding metadata (`embedding_metadata.json`).
2. **Scripts Layer (`scripts/`)**: Utility scripts for data handling and model preparation.
* `prepare_goodreads_data.py`: Adapts raw data sources to a standardized format.
* `download_model.py` (optional): For pre-downloading the sentence transformer model.
3. **Source Code Layer (`src/book_recommender/`)**: Contains the core logic of the application, organized into sub-packages.
* **`src/book_recommender/core/`**: Core utilities and configurations.
* `config.py`: Centralized configuration settings and constants.
* `exceptions.py`: Custom exception definitions for error handling.
* `logging_config.py`: Centralized logging configuration for the entire application.
* **`src/book_recommender/data/`**: Data processing components.
* `processor.py`: Handles data cleaning, normalization, deduplication, and feature engineering (e.g., creating `combined_text` for embeddings).
* **`src/book_recommender/ml/`**: Machine Learning related components.
* `embedder.py`: Manages the loading of the Sentence Transformer model and generation of embeddings for books and user queries. It uses `lru_cache` for efficient model loading.
* `recommender.py`: Implements the core recommendation logic using a FAISS index for fast Approximate Nearest Neighbor (ANN) search on book embeddings.
* `clustering.py`: Contains logic for K-Means clustering of book embeddings and generating descriptive names for these clusters based on common genres.
* `explainability.py`: Provides rule-based explanations for recommendations, detailing contributing factors like genre, keywords, and author similarity.
* `feedback.py`: Manages saving and retrieving user feedback on recommendations to a JSONL file.
* `src/book_recommender/utils.py`: General utility functions, including book cover fetching from various APIs (Google Books, Open Library) and batch processing.
4. **User Interface Layer (`src/book_recommender/apps/`)**: Streamlit applications.
* `main_app.py`: The main interactive web application where users can get book recommendations, browse by genre/query, view explanations, and provide feedback.
* `analytics_app.py`: A separate Streamlit dashboard to visualize collected user feedback and system usage statistics.
5. **API Layer (`src/book_recommender/api/`)**: FastAPI application for programmatic access.
* `main.py`: Defines the FastAPI application and its endpoints (recommendations, book listing, search, stats, clusters, explanations, feedback).
* `models.py`: Pydantic models for request and response data validation and serialization.
* `dependencies.py`: FastAPI dependency injection functions to manage and cache shared resources like the `BookRecommender` and embedding models.
6. **CI/CD (`.github/workflows/ci-cd.yml`)**: GitHub Actions workflow for automated testing and code quality checks.
7. **Containerization (`streamlit.Dockerfile`, `api.Dockerfile`, `analytics.Dockerfile`, `docker-compose.yml`)**: For packaging and orchestrating the application services.
* `streamlit.Dockerfile`: Defines the build process for the main Streamlit application into a Docker image.
* `api.Dockerfile`: Defines the build process for the FastAPI application into a Docker image.
* `analytics.Dockerfile`: Defines the build process for the analytics Streamlit application into a Docker image.
* `docker-compose.yml`: Orchestrates the `streamlit`, `api`, and `analytics` services for local development and deployment.
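The resource-caching pattern shared by `embedder.py` and `api/dependencies.py` can be sketched in isolation. The `load_recommender` name and its return value below are illustrative stand-ins, not the project's actual API; the point is the `lru_cache` idiom:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_recommender():
    """Expensive setup (model download, FAISS index build in the real
    app) runs only on the first call; every later call returns the
    same cached instance."""
    return {"model": "sentence-transformer", "index": "faiss"}

first = load_recommender()   # triggers the expensive load
second = load_recommender()  # served from the cache
assert first is second       # one shared instance for all callers
```

The same idea underlies Streamlit's `@st.cache_resource` in the UI layer: load heavy ML artifacts once per process and hand every request the same instance.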
## Data Flow
The primary data flow, from raw data through recommendation generation and user interaction, is as follows:
1. **Raw Data Ingestion**: Raw CSV datasets (e.g., `goodreads_data.csv`) are placed in `data/raw/`.
2. **Data Preparation**: The `scripts/prepare_goodreads_data.py` script and `src/book_recommender/data/processor.py` clean, standardize, and deduplicate the raw data, saving it as `books_cleaned.parquet` in `data/processed/`.
3. **Embedding Generation**: `src/book_recommender/ml/embedder.py` loads `books_cleaned.parquet` and generates semantic embeddings for each book's `combined_text`, saving them as `book_embeddings.npy` in `data/processed/`.
4. **Application Startup**:
* The `src/book_recommender/apps/main_app.py` (Streamlit) and `src/book_recommender/api/main.py` (FastAPI) applications load `books_cleaned.parquet` and `book_embeddings.npy` into memory.
* `src/book_recommender/ml/recommender.py` initializes a FAISS index with these embeddings for fast similarity search.
* `src/book_recommender/ml/clustering.py` generates book clusters and their names from the embeddings.
5. **User Interaction (Streamlit `src/book_recommender/apps/main_app.py`)**:
* User inputs a natural language query or browses clusters.
* If a query, `src/book_recommender/ml/embedder.py` generates an embedding for the query.
* `src/book_recommender/ml/recommender.py` uses the query embedding (or a book's embedding from a title search) to find similar books.
* `src/book_recommender/ml/explainability.py` generates reasons for recommendations.
* `src/book_recommender/utils.py` fetches book cover images.
* `src/book_recommender/ml/feedback.py` stores user feedback on recommendations.
6. **API Interaction (FastAPI `src/book_recommender/api/main.py`)**:
* External clients send requests to API endpoints (e.g., `/recommend/query`, `/books`, `/feedback`).
* `src/book_recommender/api/dependencies.py` ensures efficient loading and caching of `recommender` and `embedder` instances.
* Requests are validated using Pydantic models (`src/book_recommender/api/models.py`).
* Core logic in `src/book_recommender/` modules is invoked (e.g., `recommender.py`, `explainability.py`, `feedback.py`).
* Responses are returned, adhering to defined Pydantic response models.
**Example API Calls:**
```bash
# Health Check
curl http://localhost:8000/health
# Recommend by Query
curl -X POST "http://localhost:8000/recommend/query" \
-H "Content-Type: application/json" \
-d '{
"query": "fantasy adventure with dragons",
"top_k": 5
}'
# List all Clusters
curl http://localhost:8000/clusters
# Submit Feedback
curl -X POST "http://localhost:8000/feedback" \
-H "Content-Type: application/json" \
-d '{
"query": "fantasy adventure with dragons",
"book_id": "example_book_id",
"feedback_type": "positive",
"session_id": "user_session_abc"
}'
```
7. **Analytics (`src/book_recommender/apps/analytics_app.py`)**:
* Loads accumulated feedback data from `data/feedback/user_feedback.jsonl` using `src/book_recommender/ml/feedback.py`.
* Processes and visualizes statistics using Streamlit and Plotly.
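The feedback pipeline in step 7 amounts to reading one JSON object per line and aggregating. A stdlib-only sketch, using field names from the `/feedback` payload shown in the example API calls (the in-memory `sample_jsonl` stands in for `data/feedback/user_feedback.jsonl`):

```python
import io
import json
from collections import Counter

# Each line of the JSONL file is one feedback record, as submitted
# via the /feedback endpoint.
sample_jsonl = io.StringIO(
    '{"query": "dragons", "book_id": "b1", "feedback_type": "positive"}\n'
    '{"query": "dragons", "book_id": "b2", "feedback_type": "negative"}\n'
    '{"query": "space opera", "book_id": "b3", "feedback_type": "positive"}\n'
)

def feedback_counts(lines):
    """Tally feedback types across all records."""
    return Counter(json.loads(line)["feedback_type"] for line in lines)

counts = feedback_counts(sample_jsonl)
# counts == Counter({'positive': 2, 'negative': 1})
```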
```mermaid
graph TD
    subgraph "Data Flow & Processing"
        raw_data["Raw Data (CSV)"] --> A("scripts/prepare_goodreads_data.py");
        A --> B[books_prepared.csv];
        B --> C{"src/book_recommender/data/processor.py"};
        C --> D[books_cleaned.parquet];
        D -- contains text --> E{"src/book_recommender/ml/embedder.py"};
        E --> F[book_embeddings.npy];
    end
    subgraph "Application Runtime"
        G("src/book_recommender/apps/main_app.py - Streamlit UI") -- loads --> D & F;
        G -- uses --> H("src/book_recommender/ml/recommender.py");
        G -- uses --> I("src/book_recommender/ml/embedder.py");
        G -- uses --> J("src/book_recommender/ml/clustering.py");
        G -- uses --> K("src/book_recommender/ml/explainability.py");
        G -- uses --> L("src/book_recommender/ml/feedback.py");
        M("src/book_recommender/api/main.py - FastAPI") -- uses --> N("src/book_recommender/api/dependencies.py");
        N -- loads --> D & F;
        N -- uses --> H & I & J & K & L;
        O("src/book_recommender/apps/analytics_app.py - Streamlit Dashboard") -- loads --> P[user_feedback.jsonl];
        P -- via --> L;
        User[User] -- Interacts with --> G;
        Client[External Client] -- Interacts with --> M;
    end
    subgraph "Utilities & Configuration"
        Q["src/book_recommender/core/config.py"];
        R["src/book_recommender/utils.py"];
        S["src/book_recommender/core/logging_config.py"];
    end
    style raw_data fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#ccf,stroke:#333,stroke-width:2px
    style F fill:#ccf,stroke:#333,stroke-width:2px
    style P fill:#fcc,stroke:#333,stroke-width:2px
```
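At runtime, a recommendation reduces to a top-k nearest-neighbour lookup over the embedding matrix. FAISS performs this with an optimized (optionally approximate) index; a brute-force stdlib equivalent of the same query makes the operation concrete (the toy 2-D vectors are illustrative only):

```python
import math

def top_k_similar(query, embeddings, k=2):
    """Return the indices of the k embeddings most similar to the
    query by cosine similarity -- what a FAISS inner-product index
    computes on normalized vectors, just without the speed."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: cos(query, embeddings[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-D "book embeddings" and a query embedding.
books = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [0.9, 0.12]
assert top_k_similar(query, books) == [1, 0]  # book 1 is the closest match
```

For the dataset sizes here, exact search is cheap; FAISS matters once the catalogue grows to the point where brute force per query becomes the bottleneck.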
## Technology Decisions
* **Python 3.10+**: Modern language features (structural pattern matching, improved type hints) and the de facto ecosystem for ML and data tooling.
* **Streamlit**: Chosen for rapid development of interactive web UIs with minimal frontend code. Its caching mechanisms (`@st.cache_resource`, `@st.cache_data`) are crucial for performance with ML models.
* **FastAPI**: Selected for building a high-performance, asynchronous API with automatic Pydantic-based data validation and Swagger/OpenAPI documentation.
* **Sentence-Transformers**: A powerful library for generating dense vector embeddings from text, suitable for semantic search.
* **FAISS**: An efficient library for similarity search and clustering of dense vectors, essential for scaling recommendations.
* **Pandas / NumPy**: Standard libraries for data manipulation and numerical operations.
* **Scikit-learn**: Used for traditional machine learning tasks like K-Means clustering.
* **Plotly Express**: For creating interactive and aesthetically pleasing visualizations in the analytics dashboard.
* **Pydantic**: Data validation and settings management using Python type hints, integral to FastAPI.
* **python-dotenv**: For managing environment variables, facilitating flexible configuration across environments.
* **GitHub Actions**: For Continuous Integration/Continuous Deployment (CI/CD), automating testing, linting, and Docker image builds.
* **Docker / Docker Compose**: For containerizing the application and orchestrating multi-service deployments, ensuring consistent environments.
## Future Considerations
* **Data Version Control (DVC)**: Implement DVC for robust tracking of data and model versions, enhancing reproducibility in production.
* **Scalability**: For extremely large datasets, consider distributed FAISS indexes or cloud-native vector databases.
* **Advanced Explanations**: Explore more sophisticated XAI techniques beyond rule-based, potentially involving LLMs or specific feature attribution methods.
* **User Authentication**: For multi-user scenarios, integrate an authentication system (e.g., OAuth2, JWT).
* **Database Integration**: Replace JSONL feedback storage with a dedicated database (e.g., PostgreSQL) for more robust data management and querying.
* **Full UI Testing**: Implement UI tests using tools like Playwright or Selenium to ensure frontend consistency and functionality.