title: semantic-book-recommender
app_file: gradio_dashboard.py
sdk: gradio
sdk_version: 5.38.0
Semantic Book Recommendation System
A sophisticated book recommendation system that combines semantic search with emotion analysis to provide personalized book suggestions. The system uses vector embeddings, zero-shot classification, and emotion detection to understand user preferences and recommend books based on content similarity and emotional tone.
Features
- Semantic Search: Uses HuggingFace embeddings and ChromaDB for vector-based similarity search
- Emotion Analysis: Analyzes book descriptions for emotional content (joy, sadness, anger, fear, surprise, disgust, neutral)
- Zero-Shot Classification: Automatically categorizes books into Fiction/Non-Fiction using BART-large-MNLI
- Interactive Dashboard: Gradio-based web interface for easy book discovery
- Advanced Filtering: Filter by category, emotional tone, and rating
- Data Visualization: Statistical insights and data exploration tools
System Architecture
books.csv → Data Cleaning → Category Classification → Emotion Analysis → Vector Database → Gradio UI
Pipeline Components:
Data Exploration & Cleaning (`data_exploration.py`)
- Handles missing values and data quality issues
- Filters books with substantial descriptions (25+ words)
- Creates correlation analysis and visualizations

Text Classification (`text_classification.py`)
- Zero-shot classification for Fiction/Non-Fiction categorization
- Uses Facebook's BART-large-MNLI model
- Achieves roughly 85% accuracy on Fiction/Non-Fiction categorization

Sentiment Analysis (`sentiment_analysis.py`)
- Emotion detection using a DistilRoBERTa model
- Analyzes 7 emotions: anger, disgust, fear, joy, sadness, surprise, neutral
- Sentence-level emotion scoring with max aggregation

Vector Search (`vector_search.py`)
- Creates embeddings using HuggingFace sentence-transformers
- Implements ChromaDB for efficient similarity search
- Supports semantic book discovery

Gradio Dashboard (`gradio_dashboard.py`)
- Interactive web interface for book recommendations
- Real-time filtering and visualization
- Statistical dashboards and data insights
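As a rough illustration of "sentence-level emotion scoring with max aggregation", the sketch below takes per-sentence scores and keeps the highest score seen for each emotion. The function name `aggregate_max` and the sample scores are hypothetical; the actual logic in `sentiment_analysis.py` may differ.

```python
from typing import Dict, List

EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def aggregate_max(sentence_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Return, for each emotion, the highest score found in any sentence.

    Emotions that never appear default to 0.0.
    """
    return {
        emotion: max(
            (scores.get(emotion, 0.0) for scores in sentence_scores),
            default=0.0,
        )
        for emotion in EMOTIONS
    }


# Hypothetical scores for a two-sentence book description.
example = [
    {"joy": 0.10, "fear": 0.70, "neutral": 0.20},
    {"joy": 0.80, "fear": 0.05, "neutral": 0.15},
]
book_scores = aggregate_max(example)
```

Taking the maximum means a single strongly emotional sentence is enough to mark the whole description, which suits short blurbs where most sentences are neutral.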
Project Structure
semantic-book-recommender/
├── Core Files
│   ├── .env.example                 # Template for environment variables
│   ├── .gitignore                   # Git ignore file (IMPORTANT!)
│   ├── README.md                    # This file
│   └── requirements.txt             # Python dependencies
│
├── Python Scripts
│   ├── data_exploration.py          # Data cleaning and exploration
│   ├── text_classification.py       # Zero-shot classification
│   ├── sentiment_analysis.py        # Emotion analysis
│   ├── vector_search.py             # Vector database operations
│   └── gradio_dashboard.py          # Web interface
│
├── Data Files (Generated/Input)
│   ├── books.csv                    # Input dataset (not included in repo)
│   ├── books_cleaned.csv            # Cleaned dataset
│   ├── books_with_categories.csv    # Dataset with categories
│   ├── books_with_emotions.csv      # Final dataset with emotions
│   ├── tagged_description.txt       # Generated text file for embeddings
│   └── predictions_results.csv      # Classification results
│
├── Assets
│   └── cover-not-found.jpg          # Default book cover image
│
├── Vector Databases (Auto-generated)
│   ├── chroma_db_books/             # OpenAI embeddings vector DB
│   └── chroma_db_books_hf/          # HuggingFace embeddings vector DB
│
└── Environment (Ignored)
    ├── .env                         # Your API keys (NEVER commit!)
    └── .venv/                       # Virtual environment (ignored)
File Descriptions
| File | Purpose | Generated By |
|---|---|---|
| `data_exploration.py` | Data cleaning, missing value analysis, correlation heatmaps | Manual |
| `text_classification.py` | Zero-shot classification (Fiction/Non-Fiction) | Manual |
| `sentiment_analysis.py` | Emotion analysis (7 emotions) | Manual |
| `vector_search.py` | Vector embeddings and similarity search | Manual |
| `gradio_dashboard.py` | Interactive web interface | Manual |
| `books.csv` | Original dataset | User provided |
| `books_cleaned.csv` | Cleaned dataset (25+ word descriptions) | `data_exploration.py` |
| `books_with_categories.csv` | Dataset with Fiction/Non-Fiction labels | `text_classification.py` |
| `books_with_emotions.csv` | Final dataset with emotion scores | `sentiment_analysis.py` |
| `tagged_description.txt` | Text file for vector embeddings | `vector_search.py` |
| `predictions_results.csv` | Classification accuracy results | `text_classification.py` |
Processing Pipeline
books.csv
↓ (data_exploration.py)
books_cleaned.csv
↓ (text_classification.py)
books_with_categories.csv
↓ (sentiment_analysis.py)
books_with_emotions.csv
↓ (vector_search.py)
tagged_description.txt + Vector DB
↓ (gradio_dashboard.py)
Web Interface
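Because each stage reads the file written by the previous one, the scripts have to run in this order. A minimal sketch of a runner that stops at the first failing stage is shown below; the names `PIPELINE_STEPS`, `build_commands`, and `run_pipeline` are hypothetical and not part of the repository.

```python
import subprocess
import sys
from typing import List

# Script order taken from the pipeline diagram above.
PIPELINE_STEPS = [
    "data_exploration.py",
    "text_classification.py",
    "sentiment_analysis.py",
    "vector_search.py",
]


def build_commands(python: str = sys.executable) -> List[List[str]]:
    """Build one command per pipeline stage, in execution order."""
    return [[python, script] for script in PIPELINE_STEPS]


def run_pipeline() -> None:
    """Run every stage in order; a failing stage aborts the run."""
    for command in build_commands():
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later stages never run on missing or partial output.
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` from the project root would then regenerate all intermediate files before the dashboard is launched.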
Security Setup (IMPORTANT!)
Before uploading to GitHub:
- Create a `.gitignore` file (copy the one provided below)
- Never commit `.env` files - they contain your API keys
- Use `.env.example` as a template for others
- Remove any API keys from code files
Required .gitignore file:
# Environment variables (NEVER commit these!)
.env
.env.local
.env.development.local
.env.test.local
.env.production.local
# Virtual environment
venv/
.venv/
env/
ENV/
# Python cache
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# Vector databases (large files)
chroma_db_books/
chroma_db_books_hf/
*.db
*.sqlite
# Data files (add to .gitignore if sensitive)
books.csv
books_cleaned.csv
books_with_categories.csv
books_with_emotions.csv
tagged_description.txt
predictions_results.csv
# IDE files
.vscode/
.idea/
*.swp
*.swo
*~
# OS files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db
# Jupyter Notebook checkpoints
.ipynb_checkpoints
# PyTorch model files
*.pth
*.pt
# Logs
*.log
logs/
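To keep API keys out of code files, scripts can read them from the environment at runtime instead of hard-coding them. The sketch below assumes the variable is named `OPENAI_API_KEY`, which this README does not confirm; check `.env.example` for the actual name. The helper `get_openai_api_key` is hypothetical.

```python
import os
from typing import Optional


def get_openai_api_key() -> Optional[str]:
    """Read the key from the environment; return None when it is unset or blank.

    The variable name OPENAI_API_KEY is an assumption, not taken from the repo.
    """
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    return key or None
```

Since `python-dotenv` is listed among the dependencies, calling its `load_dotenv()` before this helper would populate the environment from a local `.env` file.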
Quick Start
Prerequisites
- Python 3.8 or higher
- Virtual environment (recommended)
Installation
- Clone the repository:
git clone https://github.com/yourusername/semantic-book-recommender.git
cd semantic-book-recommender
- Create and activate virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
cp .env.example .env
# Edit .env with your OpenAI API key (optional, for OpenAI embeddings)
Running the System
- Data Processing Pipeline:
# Step 1: Clean and explore data
python data_exploration.py
# Step 2: Classify books into categories
python text_classification.py
# Step 3: Analyze emotions in book descriptions
python sentiment_analysis.py
# Step 4: Create vector database
python vector_search.py
- Launch Dashboard:
python gradio_dashboard.py
Access the dashboard at http://localhost:7860
Data Requirements
The system expects a books.csv file with the following columns:
- `isbn13`: Unique book identifier
- `title`: Book title
- `subtitle`: Book subtitle (optional)
- `authors`: Author names (semicolon-separated)
- `categories`: Book categories
- `description`: Book description
- `num_pages`: Number of pages
- `average_rating`: Average rating (1-5 scale)
- `published_year`: Publication year
- `thumbnail`: Book cover image URL
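Before running the pipeline it can help to confirm that the input file has these columns and to see how many rows would survive the 25-word description filter. The following is a minimal standard-library sketch; the helper names are hypothetical, and `data_exploration.py` may implement the check differently (for example with pandas).

```python
import csv
from typing import Dict, Iterable, List

REQUIRED_COLUMNS = [
    "isbn13", "title", "subtitle", "authors", "categories", "description",
    "num_pages", "average_rating", "published_year", "thumbnail",
]


def missing_columns(header: Iterable[str]) -> List[str]:
    """Return the required columns that are absent from the CSV header."""
    present = set(header)
    return [column for column in REQUIRED_COLUMNS if column not in present]


def has_substantial_description(row: Dict[str, str], min_words: int = 25) -> bool:
    """True when the description has at least `min_words` whitespace-separated words."""
    return len((row.get("description") or "").split()) >= min_words


def load_valid_rows(csv_file) -> List[Dict[str, str]]:
    """Read an open CSV file and keep rows with substantial descriptions."""
    reader = csv.DictReader(csv_file)
    missing = missing_columns(reader.fieldnames or [])
    if missing:
        raise ValueError(f"books.csv is missing columns: {missing}")
    return [row for row in reader if has_substantial_description(row)]


# Usage (assuming books.csv sits in the working directory):
# with open("books.csv", newline="", encoding="utf-8") as f:
#     rows = load_valid_rows(f)
```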
Usage Examples
Semantic Search
from vector_search import retrieve_semantic_recommendations
# Find the 10 books most similar to a query
results = retrieve_semantic_recommendations(
    "A mystery novel about redemption and forgiveness",
    final_top_k=10
)
Emotion-Based Filtering
# Get happy books in fiction category
recommendations = retrieve_semantic_recommendations(
    query="adventure story",
    category="Fiction",
    tone="Happy"
)
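One plausible way a tone filter like this could work is to map each tone name to an emotion score and sort the candidates by it. The sketch below is an assumption, not the repository's code: the mapping `TONE_TO_EMOTION`, the helper `filter_and_rank`, and the dictionary keys `category`, `joy`, `sadness`, and `fear` are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from dashboard tone names to emotion score fields.
TONE_TO_EMOTION = {"Happy": "joy", "Sad": "sadness", "Suspenseful": "fear"}


def filter_and_rank(
    books: List[Dict],
    category: str = "All",
    tone: Optional[str] = None,
) -> List[Dict]:
    """Keep books in `category`, then sort by the emotion behind `tone`."""
    if category != "All":
        books = [book for book in books if book["category"] == category]
    emotion = TONE_TO_EMOTION.get(tone or "")
    if emotion:
        # Highest emotion score first; unknown tones leave the order unchanged.
        books = sorted(books, key=lambda book: book[emotion], reverse=True)
    return list(books)
```

Sorting rather than thresholding keeps every candidate in the result, so a tone choice only changes the order and never returns an empty list on its own.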
Configuration
Model Settings
- Embedding Model: `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions)
- Classification Model: `facebook/bart-large-mnli`
- Emotion Model: `j-hartmann/emotion-english-distilroberta-base`
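As an illustration of how the classification model might be used, the sketch below wraps the `transformers` zero-shot pipeline for `facebook/bart-large-mnli`. The helper names and the exact candidate label strings are assumptions; `text_classification.py` may use different labels or batching. The first call to `build_classifier()` downloads the model weights.

```python
from typing import Callable, Dict, List

# Assumed label strings; the repository may spell them differently.
CANDIDATE_LABELS = ["Fiction", "Non-Fiction"]


def top_label(result: Dict[str, List]) -> str:
    """Pick the label with the highest score from a zero-shot result dict."""
    scores = result["scores"]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return result["labels"][best]


def build_classifier() -> Callable:
    """Create the zero-shot pipeline (imported lazily, since loading is slow)."""
    from transformers import pipeline

    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def classify_description(classifier: Callable, description: str) -> str:
    """Classify one book description as Fiction or Non-Fiction."""
    result = classifier(description, candidate_labels=CANDIDATE_LABELS)
    return top_label(result)


# Usage (downloads the model on first run):
# classifier = build_classifier()
# print(classify_description(classifier, "A detective hunts a killer in Oslo."))
```

Building the classifier once and passing it in avoids reloading the model for every description.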
Performance Tuning
- Adjust `initial_top_k` and `final_top_k` in recommendation functions
- Modify chunk size and overlap in text splitting
- Configure vector database persistence settings
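The two top-k settings interact: filters are applied to the first `initial_top_k` candidates, and only then is the list cut to `final_top_k`. A simplified sketch of that two-stage idea follows; the function `two_stage_select` and its default values are hypothetical and not taken from the repository.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def two_stage_select(
    ranked_candidates: List[T],
    keep: Callable[[T], bool],
    initial_top_k: int = 50,   # hypothetical default
    final_top_k: int = 16,     # hypothetical default
) -> List[T]:
    """Shortlist by similarity rank, apply a filter, then truncate."""
    shortlisted = ranked_candidates[:initial_top_k]
    filtered = [item for item in shortlisted if keep(item)]
    return filtered[:final_top_k]


# Candidates are assumed to be sorted by similarity already.
candidates = list(range(100))
is_even = lambda n: n % 2 == 0

wide = two_stage_select(candidates, is_even, initial_top_k=10, final_top_k=3)
narrow = two_stage_select(candidates, is_even, initial_top_k=4, final_top_k=3)
```

With the narrow shortlist, the filter leaves fewer than `final_top_k` items, which is why `initial_top_k` is usually set well above `final_top_k` when filters are strict.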
Model Performance
- Zero-Shot Classification Accuracy: ~85% on Fiction/Non-Fiction categorization
- Emotion Detection: 7-class emotion classification with confidence scores
- Semantic Search: Cosine similarity-based ranking with embedding vectors
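For readers unfamiliar with cosine similarity ranking, the following self-contained sketch shows the underlying arithmetic on toy 2-dimensional vectors. In the real system the vector database computes this over 384-dimensional embeddings; the function names and vectors here are purely illustrative.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float], docs: Dict[str, Sequence[float]]) -> List[str]:
    """Return document names ordered from most to least similar to the query."""
    return sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)


# Toy embeddings standing in for real sentence-transformer vectors.
toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
ranking = rank_by_similarity([1.0, 0.1], toy_docs)
```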
Technical Details
Dependencies
- Core ML: `transformers`, `torch`, `sentence-transformers`
- Vector Database: `chromadb`, `langchain`
- Data Processing: `pandas`, `numpy`
- Visualization: `matplotlib`, `seaborn`, `gradio`
- Utilities: `tqdm`, `tabulate`, `python-dotenv`
Hardware Requirements
- RAM: 8GB+ recommended for model loading
- GPU: Optional, supports CUDA/MPS for faster inference
- Storage: 2GB+ for model weights and vector database
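Device selection for CUDA or MPS could look roughly like the sketch below. The helper `pick_device` is hypothetical; it falls back to the CPU when PyTorch is not installed or no accelerator is available.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing on older PyTorch builds, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
```

The resulting string can be passed as the `device` argument when constructing a `transformers` pipeline.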
API Reference
Main Functions
retrieve_semantic_recommendations(query, category, tone, initial_top_k, final_top_k)
Returns book recommendations based on semantic similarity and filters.
Parameters:
- `query` (str): Search query describing the desired book
- `category` (str): Book category filter ("All", "Fiction", "Non-Fiction", etc.)
- `tone` (str): Emotional tone filter ("Happy", "Sad", "Suspenseful", etc.)
- `initial_top_k` (int): Initial number of candidates to retrieve
- `final_top_k` (int): Final number of recommendations to return
Returns:
pandas.DataFrame: Filtered book recommendations with metadata
Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- HuggingFace for providing pre-trained models
- OpenAI for embedding models
- ChromaDB for vector database functionality
- Gradio for the intuitive web interface
- The open-source community for various Python libraries
Note: This system is designed for educational and research purposes. Ensure compliance with data usage policies and model licenses when deploying in production environments.