---
title: Book Recommendation Engine
emoji: π
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.39.0
app_file: app.py
pinned: false
---
This project implements a content-based book recommendation system using a hybrid of:
- Qwen3 8B Embedding Model for high-quality semantic embeddings
- HNSWLib for fast approximate nearest neighbor search
- Clean Architecture principles for modularity
- Streamlit frontend for user interaction
## Problem Overview
Given a book title, the system must:
- Understand the semantic meaning of its summary and metadata
- Compare it against all books in the dataset
- Retrieve and rank the top similar books
- Display them in a Streamlit UI
Dataset used: Kaggle Book Summary Dataset (contains duplicated entries, categories, metadata, and summaries).
## System Architecture
The project is structured using Clean Architecture:
Streamlit UI → Recommendation Layer → Embedder → Search Engine

- Streamlit UI: the user interacts here
- Recommendation Layer: orchestrates embedding + search
- Embedder: turns text into dense vectors (Qwen embeddings)
- Search Engine: indexes embeddings and retrieves nearest neighbors (HNSW)
## Text Preprocessing Strategy
For each book, we build a rich combined representation:
- Summary
- Categories / Genre
This combined text becomes the input for embedding generation.
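As a sketch, the combined representation might be built like this (the `summary` and `categories` field names are assumptions about the dataset schema, not the project's actual column names):

```python
def build_combined_text(book: dict) -> str:
    """Concatenate the fields used for embedding into one string.

    `summary` and `categories` are assumed field names; adapt them to
    the actual dataset columns.
    """
    parts = [
        book.get("summary", ""),
        "Categories: " + ", ".join(book.get("categories", [])),
    ]
    return " ".join(p for p in parts if p).strip()

text = build_combined_text({
    "summary": "A hobbit sets out to destroy a powerful ring.",
    "categories": ["Fantasy", "Adventure"],
})
```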
Duplicates were discovered in the dataset: multiple entries of the same book with different category combinations. The preprocessing step:
- Identifies duplicates
- Ensures each book is represented at least once
- Removes artificial repeated entries (e.g., index difference = 9)
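A minimal sketch of the dedup step, assuming titles identify duplicates (the real pipeline may also reconcile differing category sets):

```python
def dedupe_books(books):
    """Keep the first occurrence of each title, dropping repeats."""
    seen, unique = set(), []
    for book in books:
        key = book["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(book)
    return unique

books = [
    {"title": "Dune", "categories": ["Sci-Fi"]},
    {"title": "Dune", "categories": ["Sci-Fi", "Classics"]},
    {"title": "Emma", "categories": ["Romance"]},
]
deduped = dedupe_books(books)  # each book represented exactly once
```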
## Embedding Generation (Qwen3 8B Embedding Model)
- The Qwen embedding model provides hidden states; we apply last-token pooling, adjusting for left-padding.
- Vectors are normalized with the L2 norm.
- All embeddings are stored as a .pkl file for reuse.
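The pooling and normalization steps can be illustrated in NumPy (the real code operates on hidden states produced by the model; the shapes below are toy values):

```python
import numpy as np

def last_token_pool(hidden, attention_mask):
    """Pick the last *real* token's hidden state for each sequence.

    With left-padding every sequence ends at the final position; with
    right-padding the last real token sits at index mask.sum(axis=1) - 1.
    """
    left_padded = bool(attention_mask[:, -1].all())
    if left_padded:
        return hidden[:, -1]
    idx = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), idx]

def l2_normalize(v, eps=1e-12):
    """Scale each vector to unit L2 norm."""
    return v / np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)

# toy batch: 2 sequences, 4 positions, 3-dim hidden states, left-padded
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[0, 1, 1, 1], [1, 1, 1, 1]])
emb = l2_normalize(last_token_pool(hidden, mask))
```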
## ANN Search With HNSWLib
To retrieve top-k similar books efficiently, we use HNSWLib configured with:
- space="ip" (inner product, which equals cosine similarity for L2-normalized vectors)
- L2-normalized vectors
- Tuned index/search parameters (M=16 at construction, ef=50 at query time)
During search:
- Normalize query vector
- Perform knn_query
- Convert negative inner product → similarity
This ensures results match manual cosine similarity (q @ d.T).
## Recommendation Logic
Steps performed:
- Take user's book title
- Retrieve its precomputed embedding
- Search nearest neighbors using HNSW
- Filter out the query book itself
- Return top-k recommendations
Output includes:
- Book name
- Summary
- Categories
- Similarity score
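The steps above can be sketched end to end; exact brute-force search stands in for the HNSW index here, and the titles and embeddings are toy data:

```python
import numpy as np

def recommend(title, titles, embeddings, k=3):
    """Return top-k (title, similarity) pairs, excluding the query book."""
    i = titles.index(title)              # look up the precomputed embedding
    sims = embeddings @ embeddings[i]    # cosine similarity (unit vectors)
    order = np.argsort(-sims)            # highest similarity first
    results = [(titles[j], float(sims[j])) for j in order if j != i]
    return results[:k]

rng = np.random.default_rng(1)
titles = ["Dune", "Hyperion", "Emma", "Dracula"]
emb = rng.standard_normal((4, 8)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
recs = recommend("Dune", titles, emb, k=2)   # "Dune" itself is filtered out
```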
## Streamlit Frontend
A simple UI:
- Input text box for book title
- Submit button
- Calls recommender backend
- Displays top similar books with their summaries, categories, and similarity scores
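A minimal `app.py` sketch of this UI. The `get_recommendations` backend and its return format are assumptions for illustration, not the project's actual API:

```python
import streamlit as st

def get_recommendations(title, k=5):
    """Stub backend; the real app calls the recommender layer instead."""
    return [{"name": "Example Book", "summary": "A placeholder summary.",
             "categories": ["Fiction"], "score": 0.92}]

st.title("Book Recommendation Engine")
title = st.text_input("Enter a book title")

if st.button("Recommend") and title:
    for rec in get_recommendations(title, k=5):
        st.subheader(f"{rec['name']} (score: {rec['score']:.3f})")
        st.caption(", ".join(rec["categories"]))
        st.write(rec["summary"])
```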
## Evaluation Strategy
Evaluating a recommender with no labels is tricky. Approaches used:
- Manual qualitative checks
- Cross-model comparison (MiniLM vs Qwen)
- Human judgement to verify semantic correctness
- LLM-based evaluation