Nirman Patel

title: semantic-book-recommender
app_file: gradio_dashboard.py
sdk: gradio
sdk_version: 5.38.0

📚 Semantic Book Recommendation System


A book recommendation system that combines semantic search with emotion analysis to provide personalized book suggestions. It uses vector embeddings, zero-shot classification, and emotion detection to match a free-text query against book descriptions and recommend books based on content similarity and emotional tone.

🌟 Features

  • Semantic Search: Uses HuggingFace embeddings and ChromaDB for vector-based similarity search
  • Emotion Analysis: Analyzes book descriptions for emotional content (joy, sadness, anger, fear, surprise, disgust, neutral)
  • Zero-Shot Classification: Automatically categorizes books into Fiction/Non-Fiction using BART-large-MNLI
  • Interactive Dashboard: Gradio-based web interface for easy book discovery
  • Advanced Filtering: Filter by category, emotional tone, and rating
  • Data Visualization: Statistical insights and data exploration tools

🏗️ System Architecture

books.csv → Data Cleaning → Category Classification → Emotion Analysis → Vector Database → Gradio UI

Pipeline Components:

  1. Data Exploration & Cleaning (data_exploration.py)

    • Handles missing values and data quality issues
    • Filters books with substantial descriptions (25+ words)
    • Creates correlation analysis and visualizations
  2. Text Classification (text_classification.py)

    • Zero-shot classification for Fiction/Non-Fiction categorization
    • Uses Facebook's BART-large-MNLI model
    • Reaches roughly 85% accuracy on Fiction/Non-Fiction categorization (see Model Performance)
  3. Sentiment Analysis (sentiment_analysis.py)

    • Emotion detection using DistilRoBERTa model
    • Analyzes 7 emotions: anger, disgust, fear, joy, sadness, surprise, neutral
    • Sentence-level emotion scoring with max aggregation
  4. Vector Search (vector_search.py)

    • Creates embeddings using HuggingFace sentence-transformers
    • Implements ChromaDB for efficient similarity search
    • Supports semantic book discovery
  5. Gradio Dashboard (gradio_dashboard.py)

    • Interactive web interface for book recommendations
    • Real-time filtering and visualization
    • Statistical dashboards and data insights
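
The sentence-level scoring with max aggregation listed for sentiment_analysis.py can be illustrated as follows. This is a minimal sketch, assuming the classifier has already produced one score dictionary per sentence; the function name `aggregate_max` and the sample scores are hypothetical and not taken from the project.

```python
from typing import Dict, List

# The seven emotion labels listed in this README.
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def aggregate_max(sentence_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Return, for each emotion, the highest score seen across all sentences."""
    return {
        emotion: max(
            (scores.get(emotion, 0.0) for scores in sentence_scores),
            default=0.0,
        )
        for emotion in EMOTIONS
    }


# Hypothetical per-sentence scores for a two-sentence description.
example = [
    {"joy": 0.80, "fear": 0.05, "neutral": 0.15},
    {"joy": 0.10, "fear": 0.70, "neutral": 0.20},
]
book_scores = aggregate_max(example)
```

Taking the maximum rather than the mean keeps a single strongly emotional sentence from being diluted by neutral ones, which is one plausible reason for this aggregation choice.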

📁 Project Structure

semantic-book-recommender/
├── 📄 Core Files
│   ├── .env.example                 # Template for environment variables
│   ├── .gitignore                   # Git ignore file (IMPORTANT!)
│   ├── README.md                    # This file
│   └── requirements.txt             # Python dependencies
│
├── 🐍 Python Scripts
│   ├── data_exploration.py          # Data cleaning and exploration
│   ├── text_classification.py       # Zero-shot classification
│   ├── sentiment_analysis.py        # Emotion analysis
│   ├── vector_search.py             # Vector database operations
│   └── gradio_dashboard.py          # Web interface
│
├── 📊 Data Files (Generated/Input)
│   ├── books.csv                    # Input dataset (not included in repo)
│   ├── books_cleaned.csv            # Cleaned dataset
│   ├── books_with_categories.csv    # Dataset with categories
│   ├── books_with_emotions.csv      # Final dataset with emotions
│   ├── tagged_description.txt       # Generated text file for embeddings
│   └── predictions_results.csv      # Classification results
│
├── 🖼️ Assets
│   └── cover-not-found.jpg          # Default book cover image
│
├── 🗄️ Vector Databases (Auto-generated)
│   ├── chroma_db_books/             # OpenAI embeddings vector DB
│   └── chroma_db_books_hf/          # HuggingFace embeddings vector DB
│
└── 🔧 Environment (Ignored)
    ├── .env                         # Your API keys (NEVER commit!)
    └── .venv/                       # Virtual environment (ignored)

📋 File Descriptions

| File | Purpose | Generated By |
|------|---------|--------------|
| data_exploration.py | Data cleaning, missing value analysis, correlation heatmaps | Manual |
| text_classification.py | Zero-shot classification (Fiction/Non-Fiction) | Manual |
| sentiment_analysis.py | Emotion analysis (7 emotions) | Manual |
| vector_search.py | Vector embeddings and similarity search | Manual |
| gradio_dashboard.py | Interactive web interface | Manual |
| books.csv | Original dataset | User provided |
| books_cleaned.csv | Cleaned dataset (25+ word descriptions) | data_exploration.py |
| books_with_categories.csv | Dataset with Fiction/Non-Fiction labels | text_classification.py |
| books_with_emotions.csv | Final dataset with emotion scores | sentiment_analysis.py |
| tagged_description.txt | Text file for vector embeddings | vector_search.py |
| predictions_results.csv | Classification accuracy results | text_classification.py |
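
The 25+ word rule behind books_cleaned.csv can be sketched in a few lines. This is an illustration only: the project does its cleaning with pandas in data_exploration.py, whereas this sketch uses plain dictionaries, and the helper name `has_substantial_description` and the sample rows are hypothetical.

```python
def has_substantial_description(description, min_words=25):
    """True when the description has at least `min_words` whitespace-separated words."""
    if not description:
        return False  # missing or empty descriptions never qualify
    return len(description.split()) >= min_words


# Hypothetical rows standing in for records read from books.csv.
rows = [
    {"title": "Short Blurb", "description": "Too brief to be useful."},
    {"title": "Long Blurb", "description": " ".join(["word"] * 30)},
    {"title": "No Blurb", "description": None},
]
kept = [row for row in rows if has_substantial_description(row["description"])]
```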

🔄 Processing Pipeline

books.csv 
    ↓ (data_exploration.py)
books_cleaned.csv 
    ↓ (text_classification.py)
books_with_categories.csv 
    ↓ (sentiment_analysis.py)
books_with_emotions.csv 
    ↓ (vector_search.py)
tagged_description.txt + Vector DB
    ↓ (gradio_dashboard.py)
📱 Web Interface
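
The stage order above could be driven by a small runner script. The following is a sketch under stated assumptions, not part of the project: the names `STAGES`, `pending_stages`, and `run_pipeline` are hypothetical, and skipping a stage when its output file already exists is a convenience this sketch introduces.

```python
import subprocess
import sys
from pathlib import Path

# (script, required input, expected output), in the order of the pipeline diagram.
STAGES = [
    ("data_exploration.py", "books.csv", "books_cleaned.csv"),
    ("text_classification.py", "books_cleaned.csv", "books_with_categories.csv"),
    ("sentiment_analysis.py", "books_with_categories.csv", "books_with_emotions.csv"),
    ("vector_search.py", "books_with_emotions.csv", "tagged_description.txt"),
]


def pending_stages(workdir="."):
    """Return the scripts whose output file does not exist yet, in pipeline order."""
    root = Path(workdir)
    return [script for script, _, output in STAGES if not (root / output).exists()]


def run_pipeline(workdir="."):
    """Run each pending stage, stopping at the first failure or missing input."""
    root = Path(workdir)
    for script, required, output in STAGES:
        if (root / output).exists():
            continue  # assumed up to date; delete the file to force a rerun
        if not (root / required).exists():
            raise FileNotFoundError(f"{script} needs {required}, which is missing")
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run([sys.executable, script], cwd=root, check=True)
```

Calling `run_pipeline()` from the repository root would then leave only the dashboard launch as a manual step.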

🔒 Security Setup (IMPORTANT!)

Before uploading to GitHub:

  1. Create .gitignore file (copy the one provided below)
  2. Never commit .env files - they contain your API keys
  3. Use .env.example as a template for others
  4. Remove any API keys from code files
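
Step 4 can be partly automated with a quick scan before the first push. This is a rough sketch, assuming OpenAI-style keys that begin with `sk-`; the pattern, the skipped directory names, and the function name `find_hardcoded_keys` are assumptions of this example, and a clean result does not prove that no secret is present.

```python
import re
from pathlib import Path

# Assumed pattern: secret keys that start with "sk-". Adjust for other providers.
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{20,}")
# Virtual environment folders are skipped because they hold third-party code.
SKIP_DIRS = {"venv", ".venv", "env", "ENV"}


def find_hardcoded_keys(root="."):
    """Return (path, line number) pairs for lines in .py files matching KEY_PATTERN."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*.py")):
        if SKIP_DIRS & set(path.relative_to(root).parts):
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if KEY_PATTERN.search(line):
                hits.append((str(path), number))
    return hits
```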

Required .gitignore file:

# Environment variables (NEVER commit these!)
.env
.env.local
.env.development.local
.env.test.local
.env.production.local

# Virtual environment
venv/
.venv/
env/
ENV/

# Python cache
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Vector databases (large files)
chroma_db_books/
chroma_db_books_hf/
*.db
*.sqlite

# Data files (keep these entries if the data is sensitive or large)
books.csv
books_cleaned.csv
books_with_categories.csv
books_with_emotions.csv
tagged_description.txt
predictions_results.csv

# IDE files
.vscode/
.idea/
*.swp
*.swo
*~

# OS files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Jupyter Notebook checkpoints
.ipynb_checkpoints

# PyTorch model files
*.pth
*.pt

# Logs
*.log
logs/

🚀 Quick Start

Prerequisites

  • Python 3.10 or higher (Gradio 5.x, the SDK version pinned in the metadata above, requires at least 3.10)
  • Virtual environment (recommended)

Installation

  1. Clone the repository:
     git clone https://github.com/yourusername/semantic-book-recommender.git
     cd semantic-book-recommender
  2. Create and activate a virtual environment:
     python -m venv venv
     source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
     pip install -r requirements.txt
  4. Set up environment variables:
     cp .env.example .env
     # Edit .env with your OpenAI API key (optional, for OpenAI embeddings)
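
Because the OpenAI key is optional, a script could fall back to the HuggingFace vector store when no key is configured. The sketch below shows one way to make that choice; the variable name `OPENAI_API_KEY` and the function `choose_vector_store` are assumptions of this example, while the two directory names come from the project structure listing.

```python
import os


def choose_vector_store(env=None):
    """Pick the persisted Chroma directory based on whether an OpenAI key is set."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY", "").strip():
        return "chroma_db_books"     # OpenAI embeddings
    return "chroma_db_books_hf"      # HuggingFace embeddings, no key required
```

With python-dotenv (listed under Dependencies), calling `load_dotenv()` first copies the values from `.env` into the process environment so that a check like this can see them.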

Running the System

  1. Data Processing Pipeline:
     # Step 1: Clean and explore data
     python data_exploration.py

     # Step 2: Classify books into categories
     python text_classification.py

     # Step 3: Analyze emotions in book descriptions
     python sentiment_analysis.py

     # Step 4: Create vector database
     python vector_search.py
  2. Launch Dashboard:
     python gradio_dashboard.py

Access the dashboard at http://localhost:7860

📊 Data Requirements

The system expects a books.csv file with the following columns:

  • isbn13: Unique book identifier
  • title: Book title
  • subtitle: Book subtitle (optional)
  • authors: Author names (semicolon-separated)
  • categories: Book categories
  • description: Book description
  • num_pages: Number of pages
  • average_rating: Average rating (1-5 scale)
  • published_year: Publication year
  • thumbnail: Book cover image URL
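
A quick header check before running the pipeline can catch a mismatched dataset early. This is a minimal sketch using only the standard library; the helper names `missing_columns` and `split_authors` are hypothetical, and treating `subtitle` as a column that may be absent is this sketch's reading of "optional" in the list above.

```python
import csv
import io

# Columns from the list above; subtitle is left out because it is marked optional.
REQUIRED_COLUMNS = [
    "isbn13", "title", "authors", "categories", "description",
    "num_pages", "average_rating", "published_year", "thumbnail",
]


def missing_columns(csv_file):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


def split_authors(authors):
    """Split the semicolon-separated authors field into a clean list of names."""
    return [name.strip() for name in (authors or "").split(";") if name.strip()]


# In-memory sample with a deliberately incomplete header.
sample = io.StringIO("isbn13,title,authors,description\n9780000000000,Example,A. One;B. Two,Text\n")
print(missing_columns(sample))
```

For the real file, pass an open handle instead, for example `open("books.csv", newline="", encoding="utf-8")`.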

🎯 Usage Examples

Semantic Search

from vector_search import retrieve_semantic_recommendations

# Find books similar to a free-text query.
# final_top_k (see API Reference) caps the number of results returned.
results = retrieve_semantic_recommendations(
    "A mystery novel about redemption and forgiveness",
    final_top_k=10
)

Emotion-Based Filtering

from vector_search import retrieve_semantic_recommendations

# Get happy books in the Fiction category
recommendations = retrieve_semantic_recommendations(
    query="adventure story",
    category="Fiction",
    tone="Happy"
)

🔧 Configuration

Model Settings

  • Embedding Model: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
  • Classification Model: facebook/bart-large-mnli
  • Emotion Model: j-hartmann/emotion-english-distilroberta-base

Performance Tuning

  • Adjust initial_top_k and final_top_k in recommendation functions
  • Modify chunk size and overlap in text splitting
  • Configure vector database persistence settings
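
To make the chunk size and overlap trade-off concrete, here is a plain-Python sketch of character-based splitting. It is not the splitter the project uses; LangChain's text splitters expose similarly named `chunk_size` and `chunk_overlap` parameters, and the function `split_with_overlap` below is a hypothetical stand-in for illustration.

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=20):
    """Split text into character chunks; neighbours share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


pieces = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
```

Larger chunks keep more context in each embedding, while more overlap reduces the chance that a sentence is cut between two chunks at the cost of storing more vectors.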

📈 Model Performance

  • Zero-Shot Classification Accuracy: ~85% on Fiction/Non-Fiction categorization
  • Emotion Detection: 7-class emotion classification with confidence scores
  • Semantic Search: Cosine similarity-based ranking with embedding vectors
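
The cosine-similarity ranking mentioned for semantic search can be written out directly. In the project this work is delegated to ChromaDB; the sketch below only illustrates the measure itself, and the function names and toy two-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, book_vectors):
    """Return book ids sorted from most to least similar to the query vector."""
    scored = {
        book_id: cosine_similarity(query_vector, vector)
        for book_id, vector in book_vectors.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Toy vectors; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.
order = rank_by_similarity([1.0, 0.0], {"a": [0.0, 1.0], "b": [1.0, 0.0], "c": [1.0, 1.0]})
```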

🛠️ Technical Details

Dependencies

  • Core ML: transformers, torch, sentence-transformers
  • Vector Database: chromadb, langchain
  • Data Processing: pandas, numpy
  • Visualization: matplotlib, seaborn, gradio
  • Utilities: tqdm, tabulate, python-dotenv

Hardware Requirements

  • RAM: 8GB+ recommended for model loading
  • GPU: Optional, supports CUDA/MPS for faster inference
  • Storage: 2GB+ for model weights and vector database

📝 API Reference

Main Functions

retrieve_semantic_recommendations(query, category, tone, initial_top_k, final_top_k)

Returns book recommendations based on semantic similarity and filters.

Parameters:

  • query (str): Search query describing desired book
  • category (str): Book category filter ("All", "Fiction", "Non-Fiction", etc.)
  • tone (str): Emotional tone filter ("Happy", "Sad", "Suspenseful", etc.)
  • initial_top_k (int): Initial number of candidates to retrieve
  • final_top_k (int): Final number of recommendations to return

Returns:

  • pandas.DataFrame: Filtered book recommendations with metadata
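
The roles of `initial_top_k` and `final_top_k` are easiest to see in a two-stage sketch: take a generous candidate pool, filter and re-sort it, then trim. The code below is an illustration under assumptions, not the project's implementation. It works on a list of dictionaries rather than a pandas.DataFrame, and the function name `filter_and_rank`, the `category` key, and the tone-to-emotion mapping are all hypothetical.

```python
# Assumed mapping from tone labels to emotion score keys. Only the labels
# "Happy", "Sad" and "Suspenseful" appear in this README; the keys are guesses.
TONE_TO_EMOTION = {"Happy": "joy", "Sad": "sadness", "Suspenseful": "fear"}


def filter_and_rank(candidates, category, tone, initial_top_k, final_top_k):
    """Narrow similarity-ordered candidates by category, then re-sort by tone.

    `candidates` is a list of dicts already ordered by semantic similarity.
    """
    pool = candidates[:initial_top_k]
    if category != "All":
        pool = [book for book in pool if book.get("category") == category]
    emotion = TONE_TO_EMOTION.get(tone)
    if emotion is not None:
        # sorted() is stable, so ties keep their similarity order.
        pool = sorted(pool, key=lambda book: book.get(emotion, 0.0), reverse=True)
    return pool[:final_top_k]


# Hypothetical candidates, most similar first.
books = [
    {"title": "A", "category": "Fiction", "joy": 0.2},
    {"title": "B", "category": "Non-Fiction", "joy": 0.9},
    {"title": "C", "category": "Fiction", "joy": 0.8},
    {"title": "D", "category": "Fiction", "joy": 0.95},
]
top = filter_and_rank(books, "Fiction", "Happy", initial_top_k=3, final_top_k=2)
```

Keeping `initial_top_k` well above `final_top_k` leaves enough candidates after the category filter for the tone sort to have a meaningful effect.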

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • HuggingFace for providing pre-trained models
  • OpenAI for embedding models
  • ChromaDB for vector database functionality
  • Gradio for the intuitive web interface
  • The open-source community for various Python libraries


Note: This system is designed for educational and research purposes. Ensure compliance with data usage policies and model licenses when deploying in production environments.