
Nirman Patel

Upload folder using huggingface_hub

f632dba verified 8 months ago

title: semantic-book-recommender
app_file: gradio_dashboard.py
sdk: gradio
sdk_version: 5.38.0

📚 Semantic Book Recommendation System

A book recommendation system that combines semantic search with emotion analysis. It uses vector embeddings, zero-shot classification, and emotion detection to match a free-text query against book descriptions, then filters and ranks the results by category and emotional tone.

🌟 Features

Semantic Search: Uses HuggingFace embeddings and ChromaDB for vector-based similarity search
Emotion Analysis: Analyzes book descriptions for emotional content (joy, sadness, anger, fear, surprise, disgust, neutral)
Zero-Shot Classification: Automatically categorizes books into Fiction/Non-Fiction using BART-large-MNLI
Interactive Dashboard: Gradio-based web interface for easy book discovery
Advanced Filtering: Filter by category, emotional tone, and rating
Data Visualization: Statistical insights and data exploration tools

🏗️ System Architecture

books.csv → Data Cleaning → Category Classification → Emotion Analysis → Vector Database → Gradio UI

Pipeline Components:

Data Exploration & Cleaning (data_exploration.py)
- Handles missing values and data quality issues
- Filters books with substantial descriptions (25+ words)
- Creates correlation analysis and visualizations
Text Classification (text_classification.py)
- Zero-shot classification for Fiction/Non-Fiction categorization
- Uses Facebook's BART-large-MNLI model
- Reaches roughly 85% accuracy on Fiction/Non-Fiction labels (see Model Performance)
Sentiment Analysis (sentiment_analysis.py)
- Emotion detection using DistilRoBERTa model
- Analyzes 7 emotions: anger, disgust, fear, joy, sadness, surprise, neutral
- Sentence-level emotion scoring with max aggregation
Vector Search (vector_search.py)
- Creates embeddings using HuggingFace sentence-transformers
- Implements ChromaDB for efficient similarity search
- Supports semantic book discovery
Gradio Dashboard (gradio_dashboard.py)
- Interactive web interface for book recommendations
- Real-time filtering and visualization
- Statistical dashboards and data insights
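The "sentence-level emotion scoring with max aggregation" step can be pictured with a small sketch. It assumes each sentence of a description has already been scored by the emotion model and that the scores arrive as one dict per sentence; the function name `aggregate_max_scores` and the sample numbers are illustrative, not taken from `sentiment_analysis.py`.

```python
from typing import Dict, List

# The seven emotion labels listed above.
EMOTION_LABELS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def aggregate_max_scores(sentence_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Return, for each emotion, the highest score seen across all sentences.

    `sentence_scores` holds one {label: score} dict per sentence.
    Labels missing from a sentence count as 0.0.
    """
    return {
        label: max((scores.get(label, 0.0) for scores in sentence_scores), default=0.0)
        for label in EMOTION_LABELS
    }


# Made-up scores for a two-sentence description.
example = [
    {"joy": 0.80, "fear": 0.05, "neutral": 0.15},
    {"joy": 0.10, "fear": 0.70, "neutral": 0.20},
]
book_emotions = aggregate_max_scores(example)
print(book_emotions["joy"], book_emotions["fear"])  # 0.8 0.7
```

Taking the maximum rather than the mean means a single strongly emotional sentence is enough to give the whole book a high score for that emotion.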

📁 Project Structure

semantic-book-recommender/
├── 📄 Core Files
│   ├── .env.example                 # Template for environment variables
│   ├── .gitignore                   # Git ignore file (IMPORTANT!)
│   ├── README.md                    # This file
│   └── requirements.txt             # Python dependencies
│
├── 🐍 Python Scripts
│   ├── data_exploration.py          # Data cleaning and exploration
│   ├── text_classification.py       # Zero-shot classification
│   ├── sentiment_analysis.py        # Emotion analysis
│   ├── vector_search.py            # Vector database operations
│   └── gradio_dashboard.py         # Web interface
│
├── 📊 Data Files (Generated/Input)
│   ├── books.csv                   # Input dataset (not included in repo)
│   ├── books_cleaned.csv           # Cleaned dataset
│   ├── books_with_categories.csv   # Dataset with categories
│   ├── books_with_emotions.csv     # Final dataset with emotions
│   ├── tagged_description.txt      # Generated text file for embeddings
│   └── predictions_results.csv     # Classification results
│
├── 🖼️ Assets
│   └── cover-not-found.jpg         # Default book cover image
│
├── 🗄️ Vector Databases (Auto-generated)
│   ├── chroma_db_books/            # OpenAI embeddings vector DB
│   └── chroma_db_books_hf/         # HuggingFace embeddings vector DB
│
└── 🔧 Environment (Ignored)
    ├── .env                        # Your API keys (NEVER commit!)
    └── .venv/                      # Virtual environment (ignored)

📋 File Descriptions

| File | Purpose | Generated By |
| --- | --- | --- |
| `data_exploration.py` | Data cleaning, missing value analysis, correlation heatmaps | Manual |
| `text_classification.py` | Zero-shot classification (Fiction/Non-Fiction) | Manual |
| `sentiment_analysis.py` | Emotion analysis (7 emotions) | Manual |
| `vector_search.py` | Vector embeddings and similarity search | Manual |
| `gradio_dashboard.py` | Interactive web interface | Manual |
| `books.csv` | Original dataset | User provided |
| `books_cleaned.csv` | Cleaned dataset (25+ word descriptions) | `data_exploration.py` |
| `books_with_categories.csv` | Dataset with Fiction/Non-Fiction labels | `text_classification.py` |
| `books_with_emotions.csv` | Final dataset with emotion scores | `sentiment_analysis.py` |
| `tagged_description.txt` | Text file for vector embeddings | `vector_search.py` |
| `predictions_results.csv` | Classification accuracy results | `text_classification.py` |

🔄 Processing Pipeline

books.csv 
    ↓ (data_exploration.py)
books_cleaned.csv 
    ↓ (text_classification.py)
books_with_categories.csv 
    ↓ (sentiment_analysis.py)
books_with_emotions.csv 
    ↓ (vector_search.py)
tagged_description.txt + Vector DB
    ↓ (gradio_dashboard.py)
📱 Web Interface

🔒 Security Setup (IMPORTANT!)

Before uploading to GitHub:

Create .gitignore file (copy the one provided below)
Never commit .env files - they contain your API keys
Use .env.example as a template for others
Remove any API keys from code files

Required `.gitignore` file:

# Environment variables (NEVER commit these!)
.env
.env.local
.env.development.local
.env.test.local
.env.production.local

# Virtual environment
venv/
.venv/
env/
ENV/

# Python cache
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Vector databases (large files)
chroma_db_books/
chroma_db_books_hf/
*.db
*.sqlite

# Data files (remove these entries only if the data is safe to publish)
books.csv
books_cleaned.csv
books_with_categories.csv
books_with_emotions.csv
tagged_description.txt
predictions_results.csv

# IDE files
.vscode/
.idea/
*.swp
*.swo
*~

# OS files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Jupyter Notebook checkpoints
.ipynb_checkpoints

# PyTorch model files
*.pth
*.pt

# Logs
*.log
logs/

🚀 Quick Start

Prerequisites

Python 3.10 or higher (Gradio 5.x, which this Space's sdk_version pins, requires Python 3.10+)
Virtual environment (recommended)

Installation

Clone the repository:

git clone https://github.com/yourusername/semantic-book-recommender.git
cd semantic-book-recommender

Create and activate virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Set up environment variables:

cp .env.example .env
# Edit .env with your OpenAI API key (optional, for OpenAI embeddings)
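To keep keys out of source files, scripts can read them from the environment at run time. The sketch below is one way to do that with the standard library only. The variable name `OPENAI_API_KEY` and the helper `get_openai_api_key` are assumptions; check `.env.example` for the name the project actually uses.

```python
import os
from typing import Optional


def get_openai_api_key() -> Optional[str]:
    """Return the key from the environment, or None if it is unset or blank.

    OPENAI_API_KEY is an assumed variable name, not confirmed by the project.
    """
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    return key or None


if get_openai_api_key() is None:
    # The OpenAI key is optional, so fall back to the HuggingFace embeddings.
    print("No OpenAI key found; using HuggingFace embeddings.")
```

Calling `load_dotenv()` from python-dotenv (listed under Utilities in the dependencies) before this function copies values from a `.env` file into `os.environ`, so the same lookup works locally and in deployment.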

Running the System

Data Processing Pipeline:

# Step 1: Clean and explore data
python data_exploration.py

# Step 2: Classify books into categories
python text_classification.py

# Step 3: Analyze emotions in book descriptions
python sentiment_analysis.py

# Step 4: Create vector database
python vector_search.py

Launch Dashboard:

python gradio_dashboard.py

Access the dashboard at http://localhost:7860

📊 Data Requirements

The system expects a books.csv file with the following columns:

isbn13: Unique book identifier
title: Book title
subtitle: Book subtitle (optional)
authors: Author names (semicolon-separated)
categories: Book categories
description: Book description
num_pages: Number of pages
average_rating: Average rating (1-5 scale)
published_year: Publication year
thumbnail: Book cover image URL
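Before running the pipeline it can help to confirm that `books.csv` has the expected header. The sketch below checks the columns listed above and applies the 25-word description filter described for `books_cleaned.csv`. It uses only the standard library so it runs without the project's dependencies; `load_books` is a hypothetical helper, and it assumes the `subtitle` column is present even when its values are empty.

```python
import csv
import io
from typing import Dict, List

REQUIRED_COLUMNS = [
    "isbn13", "title", "subtitle", "authors", "categories",
    "description", "num_pages", "average_rating", "published_year", "thumbnail",
]
MIN_DESCRIPTION_WORDS = 25  # threshold stated for books_cleaned.csv


def load_books(handle) -> List[Dict[str, str]]:
    """Read rows from an open CSV file and keep those with 25+ word descriptions.

    Raises ValueError if an expected column is missing from the header.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"books.csv is missing columns: {missing}")
    return [
        row for row in reader
        if len((row["description"] or "").split()) >= MIN_DESCRIPTION_WORDS
    ]


# Tiny in-memory stand-in for books.csv (made-up rows).
sample = io.StringIO(
    ",".join(REQUIRED_COLUMNS) + "\n"
    + "9780000000001,Short,,A. Author,Fiction,Too short.,100,4.0,2001,\n"
    + "9780000000002,Long,,B. Author,Fiction," + " ".join(["word"] * 30) + ",200,3.5,2002,\n"
)
books = load_books(sample)
print([b["title"] for b in books])  # ['Long']
```

With a real file, pass `open("books.csv", newline="", encoding="utf-8")` instead of the in-memory sample.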

🎯 Usage Examples

Semantic Search

from vector_search import retrieve_semantic_recommendations

# Find the 10 books most similar to a query
results = retrieve_semantic_recommendations(
    query="A mystery novel about redemption and forgiveness",
    final_top_k=10
)

Emotion-Based Filtering

# Get happy books in fiction category
recommendations = retrieve_semantic_recommendations(
    query="adventure story",
    category="Fiction",
    tone="Happy"
)

🔧 Configuration

Model Settings

Embedding Model: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
Classification Model: facebook/bart-large-mnli
Emotion Model: j-hartmann/emotion-english-distilroberta-base

Performance Tuning

Adjust initial_top_k and final_top_k in recommendation functions
Modify chunk size and overlap in text splitting
Configure vector database persistence settings
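To make the chunk size and overlap settings concrete, here is a simplified character-based splitter. It is a sketch of what the two parameters mean, not the splitter the project uses; `split_with_overlap` and its parameter names are illustrative.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters, so content near a
    boundary appears in both neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once a chunk has reached the end of the text.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

A larger overlap preserves more context across boundaries at the cost of more chunks to embed and store.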

📈 Model Performance

Zero-Shot Classification Accuracy: ~85% on Fiction/Non-Fiction categorization
Emotion Detection: 7-class emotion classification with confidence scores
Semantic Search: Cosine similarity-based ranking with embedding vectors
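Cosine similarity-based ranking can be illustrated in a few lines of plain Python. In the real system the vector database performs this search over 384-dimensional embeddings; the toy 3-dimensional vectors and the helper names below are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float],
    book_vecs: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (title, score) pairs, most similar first."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in book_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for real sentence-transformer vectors.
toy_vectors = {
    "Book A": [1.0, 0.0, 0.0],
    "Book B": [0.0, 1.0, 0.0],
    "Book C": [0.7, 0.7, 0.0],
}
ranking = rank_by_similarity([1.0, 0.1, 0.0], toy_vectors, top_k=2)
print([title for title, _ in ranking])  # ['Book A', 'Book C']
```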

🛠️ Technical Details

Dependencies

Core ML: transformers, torch, sentence-transformers
Vector Database: chromadb, langchain
Data Processing: pandas, numpy
Visualization: matplotlib, seaborn, gradio
Utilities: tqdm, tabulate, python-dotenv

Hardware Requirements

RAM: 8GB+ recommended for model loading
GPU: Optional, supports CUDA/MPS for faster inference
Storage: 2GB+ for model weights and vector database

📝 API Reference

Main Functions

`retrieve_semantic_recommendations(query, category, tone, initial_top_k, final_top_k)`

Returns book recommendations based on semantic similarity and filters.

Parameters:

query (str): Search query describing desired book
category (str): Book category filter ("All", "Fiction", "Non-Fiction", etc.)
tone (str): Emotional tone filter ("Happy", "Sad", "Suspenseful", etc.)
initial_top_k (int): Initial number of candidates to retrieve
final_top_k (int): Final number of recommendations to return

Returns:

pandas.DataFrame: Filtered book recommendations with metadata
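The split between `initial_top_k` and `final_top_k` suggests a two-stage flow: retrieve a wide candidate set by similarity, then filter and re-rank it. The sketch below shows only the second stage, on plain dicts rather than a DataFrame. The tone-to-emotion mapping, the `category` key, and the `filter_and_rank` helper are assumptions for illustration, not the project's confirmed logic.

```python
from typing import Dict, List

# Assumed mapping from UI tone names to emotion score fields.
TONE_TO_EMOTION = {"Happy": "joy", "Sad": "sadness", "Suspenseful": "fear"}


def filter_and_rank(
    candidates: List[Dict],
    category: str,
    tone: str,
    final_top_k: int,
) -> List[Dict]:
    """Narrow similarity-ordered candidates by category, then sort by tone.

    `candidates` plays the role of the initial_top_k search results.
    Unknown tones (for example "All") leave the similarity order unchanged.
    """
    if category != "All":
        candidates = [book for book in candidates if book["category"] == category]
    emotion = TONE_TO_EMOTION.get(tone)
    if emotion is not None:
        candidates = sorted(candidates, key=lambda book: book[emotion], reverse=True)
    return candidates[:final_top_k]


# Made-up candidates, already ordered by semantic similarity.
candidates = [
    {"title": "A", "category": "Fiction", "joy": 0.2, "sadness": 0.7, "fear": 0.1},
    {"title": "B", "category": "Non-Fiction", "joy": 0.9, "sadness": 0.0, "fear": 0.1},
    {"title": "C", "category": "Fiction", "joy": 0.8, "sadness": 0.1, "fear": 0.1},
    {"title": "D", "category": "Fiction", "joy": 0.5, "sadness": 0.2, "fear": 0.3},
]
picks = filter_and_rank(candidates, category="Fiction", tone="Happy", final_top_k=2)
print([book["title"] for book in picks])  # ['C', 'D']
```

Retrieving more candidates than are finally shown leaves enough books after the category filter to fill the result list.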

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

HuggingFace for providing pre-trained models
OpenAI for embedding models
ChromaDB for vector database functionality
Gradio for the intuitive web interface
The open-source community for various Python libraries

Note: This system is designed for educational and research purposes. Ensure compliance with data usage policies and model licenses when deploying in production environments.