title: Book Recommendation Engine
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.39.0
app_file: app.py
pinned: false

This project implements a content-based book recommendation system that combines:

  • Qwen3 8B Embedding Model for high-quality semantic embeddings
  • HNSWLib for fast approximate nearest neighbor search
  • Clean Architecture principles for modularity
  • Streamlit frontend for user interaction

  1. Problem Overview

Given a book title, the system must:

  1. Understand the semantic meaning of its summary and metadata
  2. Compare it against all books in the dataset
  3. Retrieve and rank the top similar books
  4. Display them in a Streamlit UI

Dataset used: Kaggle Book Summary Dataset (contains summaries, categories, and metadata, along with duplicated entries).

  2. System Architecture

The project is structured using Clean Architecture:

--> Streamlit UI: where the user interacts

--> Recommendation Layer: orchestrates embedding and search

--> Embedder: turns text into dense vectors (Qwen embeddings)

--> Search Engine: indexes embeddings and retrieves nearest neighbors (HNSW)
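The layering above could be expressed with small interfaces, so that the recommendation layer depends only on abstractions. This is a minimal sketch: the names `Embedder`, `SearchEngine`, and `Recommender` and their method signatures are assumptions for illustration, not taken from the project code.

```python
from typing import Protocol, Sequence


class Embedder(Protocol):
    """Turns text into a dense vector (hypothetical interface)."""

    def embed(self, text: str) -> Sequence[float]: ...


class SearchEngine(Protocol):
    """Returns (item_id, similarity) pairs for a vector (hypothetical interface)."""

    def search(self, vector: Sequence[float], k: int) -> list[tuple[int, float]]: ...


class Recommender:
    """Recommendation layer: orchestrates embedding and search."""

    def __init__(self, embedder: Embedder, engine: SearchEngine) -> None:
        self.embedder = embedder
        self.engine = engine

    def recommend(self, text: str, k: int = 5) -> list[tuple[int, float]]:
        vector = self.embedder.embed(text)    # Embedder layer
        return self.engine.search(vector, k)  # Search Engine layer
```

Because `Recommender` only sees the two protocols, the Qwen embedder or the HNSW index could be swapped for a test double without touching the UI.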

  3. Text Preprocessing Strategy

For each book, we build a rich combined representation:

  • Summary
  • Categories / Genre

This combined text becomes the input for embedding generation.

Duplicates were discovered in the dataset: multiple entries of the same book with different category combinations. The preprocessing step:

  • Identifies duplicates
  • Ensures each book is represented at least once
  • Removes artificial repeated entries (e.g., index difference = 9)
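The preprocessing described above might look roughly like the sketch below. The row keys (`title`, `summary`, `categories`), the choice to merge categories across duplicates, and the exact text template are all assumptions; the project may identify duplicates differently.

```python
def build_corpus(rows):
    """Merge duplicate rows per title and build the text to embed.

    `rows` is an iterable of dicts with hypothetical keys
    'title', 'summary', and 'categories' (a list of strings).
    """
    merged = {}
    for row in rows:
        key = row["title"].strip().lower()  # normalized title identifies a book
        if key not in merged:
            merged[key] = {
                "title": row["title"].strip(),
                "summary": row["summary"].strip(),
                "categories": [],
            }
        # Keep every category seen for this book, without repeats.
        for category in row["categories"]:
            if category not in merged[key]["categories"]:
                merged[key]["categories"].append(category)

    # Combined representation: summary plus categories.
    for book in merged.values():
        categories = ", ".join(book["categories"])
        book["text"] = f"Summary: {book['summary']}\nCategories: {categories}"
    return list(merged.values())
```

Each book then appears once, and its `text` field is what would be passed to the embedding model.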

  4. Embedding Generation (Qwen3 8B Embedding Model)

  • The Qwen embedding model provides hidden states; we apply last token pooling, adjusting for left-padding.
  • Each pooled vector is L2-normalized.
  • All embeddings are stored as a .pkl file for reuse.
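As an illustration of the pooling and normalization steps, here is a NumPy sketch that operates on already-computed hidden states. The real pipeline would run the model first (typically in PyTorch); the function names, the array shapes, and the pickle layout are assumptions.

```python
import pickle

import numpy as np


def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of the last real token of each sequence.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    hidden_states = np.asarray(hidden_states)
    attention_mask = np.asarray(attention_mask)
    # With left-padding, every sequence ends on a real token at position -1.
    if attention_mask[:, -1].all():
        return hidden_states[:, -1]
    # With right-padding, the last real token sits at index (length - 1).
    last_index = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_index]


def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit length (eps guards against zero vectors)."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


def save_embeddings(path, titles, embeddings):
    """Store titles and vectors together in one .pkl file for reuse."""
    with open(path, "wb") as f:
        pickle.dump({"titles": list(titles), "embeddings": np.asarray(embeddings)}, f)
```

Checking the last column of the attention mask is what distinguishes left-padded batches from right-padded ones, so the same function works with either tokenizer setting.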

  5. ANN Search With HNSWLib

To retrieve top-k similar books efficiently, we use HNSWLib configured with:

  • space="ip" (inner product = cosine similarity when normalized)
  • L2 normalized vectors
  • Index and search parameters: M=16, ef=50

During search:

  1. Normalize query vector
  2. Perform knn_query
  3. Convert the returned distance to a similarity (for space="ip", hnswlib reports 1 - inner product, so similarity = 1 - distance)

This ensures results match manual cosine similarity (q @ d.T).
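A sketch of the index setup and the three search steps is shown below. It assumes the embeddings are already L2-normalized. The source gives ef=50 and M=16 but does not say whether ef is the construction-time or query-time value, so `ef_construction=200` here is an assumed placeholder; the helper names are hypothetical.

```python
import numpy as np


def build_index(embeddings, ef_construction=200, M=16, ef=50):
    """Build an inner-product HNSW index over L2-normalized embeddings."""
    import hnswlib  # imported lazily so the helper below works without it

    data = np.asarray(embeddings, dtype=np.float32)
    index = hnswlib.Index(space="ip", dim=data.shape[1])
    index.init_index(max_elements=data.shape[0], ef_construction=ef_construction, M=M)
    index.add_items(data, np.arange(data.shape[0]))
    index.set_ef(ef)  # query-time accuracy/speed trade-off
    return index


def distances_to_similarities(distances):
    """For space='ip', hnswlib reports 1 - <q, d>; undo that to get similarity."""
    return 1.0 - np.asarray(distances)


def search(index, query, k=5):
    query = np.asarray(query, dtype=np.float32)
    query = query / np.linalg.norm(query)            # step 1: normalize the query
    labels, distances = index.knn_query(query, k=k)  # step 2: ANN lookup
    return labels[0], distances_to_similarities(distances[0])  # step 3: similarity
```

Since HNSW is approximate, the returned neighbors can occasionally differ from an exact `q @ d.T` ranking, but the similarity value for any returned item should agree with the manual computation.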

  6. Recommendation Logic

Steps performed:

  1. Take user's book title
  2. Retrieve its precomputed embedding
  3. Search nearest neighbors using HNSW
  4. Filter out the query book itself
  5. Return top-k recommendations

Output includes:

  • Book name
  • Summary
  • Categories
  • Similarity score
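The five steps above can be sketched as a single function. The signature is hypothetical: `search` stands for any callable that returns ids and similarities sorted by decreasing similarity, such as a wrapper around the HNSW index.

```python
def recommend(title, titles, embeddings, search, k=5):
    """Return up to k (title, similarity) pairs for the given book title.

    `titles[i]` names the book whose precomputed vector is `embeddings[i]`;
    `search(vector, n)` is a hypothetical callable returning (ids, similarities).
    """
    # Steps 1-2: resolve the title and fetch its precomputed embedding.
    lookup = {t.lower(): i for i, t in enumerate(titles)}
    query_id = lookup.get(title.strip().lower())
    if query_id is None:
        raise KeyError(f"Unknown book title: {title!r}")

    # Step 3: ask for one extra neighbor, since the query book matches itself.
    ids, sims = search(embeddings[query_id], k + 1)

    # Steps 4-5: drop the query book and keep the top k.
    results = [(titles[i], float(s)) for i, s in zip(ids, sims) if i != query_id]
    return results[:k]
```

Requesting `k + 1` neighbors keeps the result list at full length after the query book is filtered out.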

  7. Streamlit Frontend

A simple UI:

  • Input text box for book title
  • Submit button
  • Calls recommender backend
  • Displays top similar books with:
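A possible shape for this UI is sketched below. The function name `render_app`, the result dictionary keys, and passing the Streamlit module in as a parameter (so the function can be exercised without a running app) are all assumptions; the displayed fields follow the output list in the Recommendation Logic section.

```python
def render_app(st, recommend, top_k=5):
    """Draw the page using the Streamlit module `st` and a backend callable.

    `recommend(title, k)` is a hypothetical backend returning a list of dicts
    with keys 'title', 'summary', 'categories', and 'score'.
    """
    st.title("Book Recommendation Engine")
    title = st.text_input("Book title")

    if st.button("Submit") and title:
        results = recommend(title, top_k)
        if not results:
            st.warning("No recommendations found for that title.")
        for rec in results:
            st.subheader(rec["title"])
            categories = ", ".join(rec["categories"])
            st.caption(f"Similarity: {rec['score']:.3f} | Categories: {categories}")
            st.write(rec["summary"])


# In app.py this could be wired up as:
#   import streamlit as st
#   render_app(st, recommender.recommend)
```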

  8. Evaluation Strategy

Evaluating a recommender without ground-truth labels is tricky, since there is no reference ranking to score against. Approaches used:

  • Manual qualitative checks
  • Cross-model comparison (MiniLM vs Qwen)
  • Human judgement to verify semantic correctness
  • LLM-based evaluation
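The source does not say how the cross-model comparison was scored. One assumed option is to measure how much the top-k lists from two models overlap for the same query book; the function below is an illustrative sketch of that idea, not the project's metric.

```python
def overlap_at_k(recs_a, recs_b, k=10):
    """Jaccard overlap between the top-k items of two ranked lists."""
    top_a, top_b = set(recs_a[:k]), set(recs_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty lists agree trivially
    return len(top_a & top_b) / len(top_a | top_b)
```

A low overlap does not say which model is better, only that they disagree, so such queries are good candidates for the manual or LLM-based checks listed above.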