---
title: semantic-book-recommender
app_file: gradio_dashboard.py
sdk: gradio
sdk_version: 5.38.0
---
# 📚 Semantic Book Recommendation System

[![Python](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://www.python.org/downloads/)
[![Transformers](https://img.shields.io/badge/transformers-4.21.0-orange.svg)](https://huggingface.co/transformers/)
[![Gradio](https://img.shields.io/badge/gradio-5.38.0-green.svg)](https://gradio.app/)
[![LangChain](https://img.shields.io/badge/langchain-0.1.0-red.svg)](https://python.langchain.com/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

A book recommendation system that combines semantic search with emotion analysis. It uses vector embeddings to match a free-text query against book descriptions, zero-shot classification to label books as Fiction or Non-Fiction, and emotion detection to filter or rank results by emotional tone.

## 🌟 Features

- **Semantic Search**: Uses HuggingFace embeddings and ChromaDB for vector-based similarity search
- **Emotion Analysis**: Analyzes book descriptions for emotional content (joy, sadness, anger, fear, surprise, disgust, neutral)
- **Zero-Shot Classification**: Automatically categorizes books into Fiction/Non-Fiction using BART-large-MNLI
- **Interactive Dashboard**: Gradio-based web interface for easy book discovery
- **Advanced Filtering**: Filter by category, emotional tone, and rating
- **Data Visualization**: Statistical insights and data exploration tools

## 🏗️ System Architecture

```
books.csv → Data Cleaning → Category Classification → Emotion Analysis → Vector Database → Gradio UI
```

### Pipeline Components:

1. **Data Exploration & Cleaning** (`data_exploration.py`)
   - Handles missing values and data quality issues
   - Filters books with substantial descriptions (25+ words)
   - Creates correlation analysis and visualizations

2. **Text Classification** (`text_classification.py`)
   - Zero-shot classification for Fiction/Non-Fiction categorization
   - Uses Facebook's BART-large-MNLI model
   - Reaches roughly 85% accuracy on Fiction/Non-Fiction categorization

3. **Sentiment Analysis** (`sentiment_analysis.py`)
   - Emotion detection using DistilRoBERTa model
   - Analyzes 7 emotions: anger, disgust, fear, joy, sadness, surprise, neutral
   - Sentence-level emotion scoring with max aggregation

4. **Vector Search** (`vector_search.py`)
   - Creates embeddings using HuggingFace sentence-transformers
   - Implements ChromaDB for efficient similarity search
   - Supports semantic book discovery

5. **Gradio Dashboard** (`gradio_dashboard.py`)
   - Interactive web interface for book recommendations
   - Real-time filtering and visualization
   - Statistical dashboards and data insights
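
The sentence-level scoring with max aggregation from step 3 can be sketched in plain Python. This is an illustration only, not the code in `sentiment_analysis.py`: the function name `aggregate_max` and the input shape (one dict of emotion label to score per sentence) are assumptions.

```python
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def aggregate_max(sentence_scores):
    """Collapse per-sentence emotion scores into one score per emotion.

    sentence_scores: list of dicts mapping emotion label -> score,
    one dict per sentence of a description (assumed input shape).
    Returns a dict holding the maximum score seen for each emotion.
    """
    return {
        emotion: max((s.get(emotion, 0.0) for s in sentence_scores), default=0.0)
        for emotion in EMOTIONS
    }


# Scores for two sentences of one hypothetical description
scores = [
    {"joy": 0.10, "fear": 0.70, "neutral": 0.20},
    {"joy": 0.85, "fear": 0.05, "neutral": 0.10},
]
book_level = aggregate_max(scores)
print(book_level["joy"], book_level["fear"])
```

Taking the maximum means a single strongly emotional sentence determines the book-level score for that emotion, even when the rest of the description is neutral.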

## 📁 Project Structure

```
semantic-book-recommender/
├── 📄 Core Files
│   ├── .env.example                 # Template for environment variables
│   ├── .gitignore                   # Git ignore file (IMPORTANT!)
│   ├── README.md                    # This file
│   └── requirements.txt             # Python dependencies
│
├── 🐍 Python Scripts
│   ├── data_exploration.py          # Data cleaning and exploration
│   ├── text_classification.py       # Zero-shot classification
│   ├── sentiment_analysis.py        # Emotion analysis
│   ├── vector_search.py            # Vector database operations
│   └── gradio_dashboard.py         # Web interface
│
├── 📊 Data Files (Generated/Input)
│   ├── books.csv                   # Input dataset (not included in repo)
│   ├── books_cleaned.csv           # Cleaned dataset
│   ├── books_with_categories.csv   # Dataset with categories
│   ├── books_with_emotions.csv     # Final dataset with emotions
│   ├── tagged_description.txt      # Generated text file for embeddings
│   └── predictions_results.csv     # Classification results
│
├── 🖼️ Assets
│   └── cover-not-found.jpg         # Default book cover image
│
├── 🗄️ Vector Databases (Auto-generated)
│   ├── chroma_db_books/            # OpenAI embeddings vector DB
│   └── chroma_db_books_hf/         # HuggingFace embeddings vector DB
│
└── 🔧 Environment (Ignored)
    ├── .env                        # Your API keys (NEVER commit!)
    └── .venv/                      # Virtual environment (ignored)
```

### 📋 File Descriptions

| File | Purpose | Generated By |
|------|---------|--------------|
| `data_exploration.py` | Data cleaning, missing value analysis, correlation heatmaps | Manual |
| `text_classification.py` | Zero-shot classification (Fiction/Non-Fiction) | Manual |
| `sentiment_analysis.py` | Emotion analysis (7 emotions) | Manual |
| `vector_search.py` | Vector embeddings and similarity search | Manual |
| `gradio_dashboard.py` | Interactive web interface | Manual |
| `books.csv` | Original dataset | User provided |
| `books_cleaned.csv` | Cleaned dataset (25+ word descriptions) | `data_exploration.py` |
| `books_with_categories.csv` | Dataset with Fiction/Non-Fiction labels | `text_classification.py` |
| `books_with_emotions.csv` | Final dataset with emotion scores | `sentiment_analysis.py` |
| `tagged_description.txt` | Text file for vector embeddings | `vector_search.py` |
| `predictions_results.csv` | Classification accuracy results | `text_classification.py` |
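
The 25-word cutoff behind `books_cleaned.csv` amounts to a word count on each description. A minimal sketch, assuming whitespace tokenization and plain dict rows; the helper name `has_substantial_description` is hypothetical, and `data_exploration.py` may implement the filter differently (for example with pandas):

```python
MIN_WORDS = 25  # threshold stated in this README


def has_substantial_description(row, min_words=MIN_WORDS):
    """Return True if the row's description has at least min_words words."""
    description = row.get("description") or ""  # treat missing values as empty
    return len(description.split()) >= min_words


rows = [
    {"title": "Short", "description": "Too brief to be useful."},
    {"title": "Long", "description": " ".join(["word"] * 30)},
    {"title": "Missing", "description": None},
]
kept = [r["title"] for r in rows if has_substantial_description(r)]
print(kept)
```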

### 🔄 Processing Pipeline

```
books.csv 
    ↓ (data_exploration.py)
books_cleaned.csv 
    ↓ (text_classification.py)
books_with_categories.csv 
    ↓ (sentiment_analysis.py)
books_with_emotions.csv 
    ↓ (vector_search.py)
tagged_description.txt + Vector DB
    ↓ (gradio_dashboard.py)
📱 Web Interface
```
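
Because each stage reads the previous stage's output, a small helper can report which script to run next. The sketch below is not part of the repository; the function name `next_stage` is hypothetical, and it only checks whether the output files listed above exist.

```python
from pathlib import Path

# (script, output file) pairs in the order shown in the pipeline above
STAGES = [
    ("data_exploration.py", "books_cleaned.csv"),
    ("text_classification.py", "books_with_categories.csv"),
    ("sentiment_analysis.py", "books_with_emotions.csv"),
    ("vector_search.py", "tagged_description.txt"),
]


def next_stage(workdir="."):
    """Return the first script whose output file is missing, or None."""
    base = Path(workdir)
    for script, output in STAGES:
        if not (base / output).exists():
            return script
    return None


if __name__ == "__main__":
    pending = next_stage()
    print(pending or "All stages complete; run gradio_dashboard.py")
```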

## 🔒 Security Setup (IMPORTANT!)

### Before uploading to GitHub:

1. **Create `.gitignore` file** (copy the one provided below)
2. **Never commit `.env` files** - they contain your API keys
3. **Use `.env.example`** as a template for others
4. **Remove any API keys** from code files

### Required `.gitignore` file:
```gitignore
# Environment variables (NEVER commit these!)
.env
.env.local
.env.development.local
.env.test.local
.env.production.local

# Virtual environment
venv/
.venv/
env/
ENV/

# Python cache
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Vector databases (large files)
chroma_db_books/
chroma_db_books_hf/
*.db
*.sqlite

# Data files (remove these entries if you intend to commit the data)
books.csv
books_cleaned.csv
books_with_categories.csv
books_with_emotions.csv
tagged_description.txt
predictions_results.csv

# IDE files
.vscode/
.idea/
*.swp
*.swo
*~

# OS files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Jupyter Notebook checkpoints
.ipynb_checkpoints

# PyTorch model files
*.pth
*.pt

# Logs
*.log
logs/
```

## 🚀 Quick Start

### Prerequisites

- Python 3.8 or higher
- Virtual environment (recommended)

### Installation

1. Clone the repository:
```bash
git clone https://github.com/yourusername/semantic-book-recommender.git
cd semantic-book-recommender
```

2. Create and activate virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

4. Set up environment variables:
```bash
cp .env.example .env
# Edit .env with your OpenAI API key (optional, for OpenAI embeddings)
```
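
To keep keys out of source files, code can read the key from the environment at runtime. The sketch below assumes the variable is named `OPENAI_API_KEY` and that the HuggingFace embeddings are the fallback when no key is set; both are assumptions about this project. It uses only the standard library. In practice, `load_dotenv()` from `python-dotenv` (listed in the dependencies) would first load `.env` into the environment.

```python
import os


def get_openai_key(env=os.environ):
    """Return the OpenAI key if set and non-empty, else None."""
    key = env.get("OPENAI_API_KEY", "").strip()  # variable name is an assumption
    return key or None


key = get_openai_key()
# Hypothetical selection logic: fall back to HuggingFace embeddings without a key
backend = "openai" if key else "huggingface"
print(f"Embedding backend: {backend}")
```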

### Running the System

1. **Data Processing Pipeline**:
```bash
# Step 1: Clean and explore data
python data_exploration.py

# Step 2: Classify books into categories
python text_classification.py

# Step 3: Analyze emotions in book descriptions
python sentiment_analysis.py

# Step 4: Create vector database
python vector_search.py
```

2. **Launch Dashboard**:
```bash
python gradio_dashboard.py
```

Access the dashboard at `http://localhost:7860`

## 📊 Data Requirements

The system expects a `books.csv` file with the following columns:
- `isbn13`: Unique book identifier
- `title`: Book title
- `subtitle`: Book subtitle (optional)
- `authors`: Author names (semicolon-separated)
- `categories`: Book categories
- `description`: Book description
- `num_pages`: Number of pages
- `average_rating`: Average rating (1-5 scale)
- `published_year`: Publication year
- `thumbnail`: Book cover image URL
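
Before running the pipeline, it can help to check that the input file has the expected header. A minimal sketch using only the standard library; the function name `missing_columns` is hypothetical, and treating `subtitle` as an optional column is one reading of the list above.

```python
import csv
import io

REQUIRED_COLUMNS = {
    "isbn13", "title", "authors", "categories", "description",
    "num_pages", "average_rating", "published_year", "thumbnail",
}  # "subtitle" is treated as optional here


def missing_columns(csv_file):
    """Return the sorted list of required columns absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return sorted(REQUIRED_COLUMNS - header)


# In practice: with open("books.csv", newline="", encoding="utf-8") as f: ...
sample = io.StringIO(
    "isbn13,title,authors,description\n"
    "9780000000000,Example,Jane Doe,Text\n"  # made-up sample row
)
print(missing_columns(sample))
```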

## 🎯 Usage Examples

### Semantic Search
```python
from vector_search import retrieve_semantic_recommendations

# Find books similar to a query
results = retrieve_semantic_recommendations(
    "A mystery novel about redemption and forgiveness",
    final_top_k=10,
)
```

### Emotion-Based Filtering
```python
from vector_search import retrieve_semantic_recommendations

# Get happy books in fiction category
recommendations = retrieve_semantic_recommendations(
    query="adventure story",
    category="Fiction",
    tone="Happy"
)
```

## 🔧 Configuration

### Model Settings
- **Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions)
- **Classification Model**: `facebook/bart-large-mnli`
- **Emotion Model**: `j-hartmann/emotion-english-distilroberta-base`

### Performance Tuning
- Adjust `initial_top_k` and `final_top_k` in recommendation functions
- Modify chunk size and overlap in text splitting
- Configure vector database persistence settings

## 📈 Model Performance

- **Zero-Shot Classification Accuracy**: ~85% on Fiction/Non-Fiction categorization
- **Emotion Detection**: 7-class emotion classification with confidence scores
- **Semantic Search**: Cosine similarity-based ranking with embedding vectors
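
For readers unfamiliar with cosine similarity ranking, the idea can be shown with toy vectors. This sketch is independent of ChromaDB, which performs the comparison internally; the three-dimensional vectors and the names `cosine_similarity` and `rank` are illustrative only (the real embeddings have 384 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query."""
    return sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )


query = [1.0, 0.0, 1.0]
docs = {
    "book_a": [1.0, 0.0, 0.9],
    "book_b": [0.0, 1.0, 0.0],
    "book_c": [0.5, 0.5, 0.5],
}
print(rank(query, docs))  # ['book_a', 'book_c', 'book_b']
```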

## 🛠️ Technical Details

### Dependencies
- **Core ML**: `transformers`, `torch`, `sentence-transformers`
- **Vector Database**: `chromadb`, `langchain`
- **Data Processing**: `pandas`, `numpy`
- **Visualization**: `matplotlib`, `seaborn`, `gradio`
- **Utilities**: `tqdm`, `tabulate`, `python-dotenv`

### Hardware Requirements
- **RAM**: 8GB+ recommended for model loading
- **GPU**: Optional, supports CUDA/MPS for faster inference
- **Storage**: 2GB+ for model weights and vector database

## 📝 API Reference

### Main Functions

#### `retrieve_semantic_recommendations(query, category, tone, initial_top_k, final_top_k)`
Returns book recommendations based on semantic similarity and filters.

**Parameters:**
- `query` (str): Search query describing desired book
- `category` (str): Book category filter ("All", "Fiction", "Non-Fiction", etc.)
- `tone` (str): Emotional tone filter ("Happy", "Sad", "Suspenseful", etc.)
- `initial_top_k` (int): Initial number of candidates to retrieve
- `final_top_k` (int): Final number of recommendations to return

**Returns:**
- `pandas.DataFrame`: Filtered book recommendations with metadata
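
The interplay of `initial_top_k`, the filters, and `final_top_k` can be sketched as a two-stage process: take a wide candidate pool, narrow it, then truncate. The code below is a simplified stand-in, not the project's implementation. It works on plain lists of dicts rather than a DataFrame, and the function name `filter_and_rank`, the default values, the `category` key, and the tone-to-emotion mapping are all assumptions.

```python
# Hypothetical mapping from UI tone labels to emotion score fields
TONE_TO_EMOTION = {"Happy": "joy", "Sad": "sadness", "Suspenseful": "fear"}


def filter_and_rank(candidates, category="All", tone="All",
                    initial_top_k=50, final_top_k=16):
    """Narrow similarity-ordered candidates by category, then re-rank by tone.

    candidates: list of dicts already ordered by semantic similarity.
    """
    pool = candidates[:initial_top_k]  # stage 1: wide candidate pool
    if category != "All":
        pool = [b for b in pool if b["category"] == category]
    emotion = TONE_TO_EMOTION.get(tone)
    if emotion is not None:
        # Highest score for the requested emotion first (stable sort)
        pool = sorted(pool, key=lambda b: b[emotion], reverse=True)
    return pool[:final_top_k]  # stage 2: truncate to the final size


candidates = [
    {"title": "A", "category": "Fiction", "joy": 0.2, "sadness": 0.1, "fear": 0.6},
    {"title": "B", "category": "Non-Fiction", "joy": 0.9, "sadness": 0.0, "fear": 0.1},
    {"title": "C", "category": "Fiction", "joy": 0.8, "sadness": 0.3, "fear": 0.2},
]
picks = filter_and_rank(candidates, category="Fiction", tone="Happy", final_top_k=2)
print([b["title"] for b in picks])  # ['C', 'A']
```

Retrieving more candidates than needed in the first stage leaves enough results after the category filter to fill `final_top_k`.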

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- HuggingFace for providing pre-trained models
- OpenAI for embedding models
- ChromaDB for vector database functionality
- Gradio for the intuitive web interface
- The open-source community for various Python libraries

## 📚 References

- [Sentence Transformers Documentation](https://www.sbert.net/)
- [LangChain Documentation](https://python.langchain.com/)
- [Gradio Documentation](https://gradio.app/docs/)
- [ChromaDB Documentation](https://docs.trychroma.com/)

---

**Note**: This system is designed for educational and research purposes. Ensure compliance with data usage policies and model licenses when deploying in production environments.