---
title: semantic-book-recommender
app_file: gradio_dashboard.py
sdk: gradio
sdk_version: 5.38.0
---

# 📚 Semantic Book Recommendation System

[![Python](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://www.python.org/downloads/)
[![Transformers](https://img.shields.io/badge/transformers-4.21.0-orange.svg)](https://huggingface.co/transformers/)
[![Gradio](https://img.shields.io/badge/gradio-3.40.0-green.svg)](https://gradio.app/)
[![LangChain](https://img.shields.io/badge/langchain-0.1.0-red.svg)](https://langchain.readthedocs.io/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

A book recommendation system that combines semantic search with emotion analysis to provide personalized book suggestions. The system uses vector embeddings, zero-shot classification, and emotion detection to understand user preferences and recommend books based on content similarity and emotional tone.

## 🌟 Features

- **Semantic Search**: Uses HuggingFace embeddings and ChromaDB for vector-based similarity search
- **Emotion Analysis**: Analyzes book descriptions for emotional content (joy, sadness, anger, fear, surprise, disgust, neutral)
- **Zero-Shot Classification**: Automatically categorizes books into Fiction/Non-Fiction using BART-large-MNLI
- **Interactive Dashboard**: Gradio-based web interface for easy book discovery
- **Advanced Filtering**: Filter by category, emotional tone, and rating
- **Data Visualization**: Statistical insights and data exploration tools

## 🏗️ System Architecture

```
books.csv → Data Cleaning → Category Classification → Emotion Analysis → Vector Database → Gradio UI
```

### Pipeline Components:

1. **Data Exploration & Cleaning** (`data_exploration.py`)
   - Handles missing values and data quality issues
   - Keeps only books with substantial descriptions (25+ words)
   - Produces correlation analysis and visualizations

2. **Text Classification** (`text_classification.py`)
   - Zero-shot classification for Fiction/Non-Fiction categorization
   - Uses Facebook's BART-large-MNLI model
   - Reaches roughly 85% accuracy on the Fiction/Non-Fiction split (see Model Performance)

3. **Sentiment Analysis** (`sentiment_analysis.py`)
   - Emotion detection using a DistilRoBERTa model
   - Analyzes 7 emotions: anger, disgust, fear, joy, sadness, surprise, neutral
   - Scores each sentence separately, then keeps the maximum score per emotion for the book

4. **Vector Search** (`vector_search.py`)
   - Creates embeddings using HuggingFace sentence-transformers
   - Uses ChromaDB for efficient similarity search
   - Supports semantic book discovery

5. **Gradio Dashboard** (`gradio_dashboard.py`)
   - Interactive web interface for book recommendations
   - Real-time filtering and visualization
   - Statistical dashboards and data insights

## 📁 Project Structure

```
semantic-book-recommender/
├── 📄 Core Files
│   ├── .env.example                # Template for environment variables
│   ├── .gitignore                  # Git ignore file (IMPORTANT!)
│   ├── README.md                   # This file
│   └── requirements.txt            # Python dependencies
│
├── 🐍 Python Scripts
│   ├── data_exploration.py         # Data cleaning and exploration
│   ├── text_classification.py      # Zero-shot classification
│   ├── sentiment_analysis.py       # Emotion analysis
│   ├── vector_search.py            # Vector database operations
│   └── gradio_dashboard.py         # Web interface
│
├── 📊 Data Files (Generated/Input)
│   ├── books.csv                   # Input dataset (not included in repo)
│   ├── books_cleaned.csv           # Cleaned dataset
│   ├── books_with_categories.csv   # Dataset with categories
│   ├── books_with_emotions.csv     # Final dataset with emotions
│   ├── tagged_description.txt      # Generated text file for embeddings
│   └── predictions_results.csv     # Classification results
│
├── 🖼️ Assets
│   └── cover-not-found.jpg         # Default book cover image
│
├── 🗄️ Vector Databases (Auto-generated)
│   ├── chroma_db_books/            # OpenAI embeddings vector DB
│   └── chroma_db_books_hf/         # HuggingFace embeddings vector DB
│
└── 🔧 Environment (Ignored)
    ├── .env                        # Your API keys (NEVER commit!)
    └── .venv/                      # Virtual environment (ignored)
```

### 📋 File Descriptions

| File | Purpose | Generated By |
|------|---------|--------------|
| `data_exploration.py` | Data cleaning, missing value analysis, correlation heatmaps | Manual |
| `text_classification.py` | Zero-shot classification (Fiction/Non-Fiction) | Manual |
| `sentiment_analysis.py` | Emotion analysis (7 emotions) | Manual |
| `vector_search.py` | Vector embeddings and similarity search | Manual |
| `gradio_dashboard.py` | Interactive web interface | Manual |
| `books.csv` | Original dataset | User provided |
| `books_cleaned.csv` | Cleaned dataset (25+ word descriptions) | `data_exploration.py` |
| `books_with_categories.csv` | Dataset with Fiction/Non-Fiction labels | `text_classification.py` |
| `books_with_emotions.csv` | Final dataset with emotion scores | `sentiment_analysis.py` |
| `tagged_description.txt` | Text file for vector embeddings | `vector_search.py` |
| `predictions_results.csv` | Classification accuracy results | `text_classification.py` |

### 🔄 Processing Pipeline

```
books.csv
    ↓ (data_exploration.py)
books_cleaned.csv
    ↓ (text_classification.py)
books_with_categories.csv
    ↓ (sentiment_analysis.py)
books_with_emotions.csv
    ↓ (vector_search.py)
tagged_description.txt + Vector DB
    ↓ (gradio_dashboard.py)
📱 Web Interface
```

## 🔒 Security Setup (IMPORTANT!)

### Before uploading to GitHub:

1. **Create a `.gitignore` file** (copy the one provided below)
2. **Never commit `.env` files** - they contain your API keys
3. **Use `.env.example`** as a template for others
4. **Remove any API keys** from code files

### Required `.gitignore` file:

```gitignore
# Environment variables (NEVER commit these!)
.env
.env.local
.env.development.local
.env.test.local
.env.production.local

# Virtual environment
venv/
.venv/
env/
ENV/

# Python cache
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Vector databases (large files)
chroma_db_books/
chroma_db_books_hf/
*.db
*.sqlite

# Data files (add to .gitignore if sensitive)
books.csv
books_cleaned.csv
books_with_categories.csv
books_with_emotions.csv
tagged_description.txt
predictions_results.csv

# IDE files
.vscode/
.idea/
*.swp
*.swo
*~

# OS files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Jupyter Notebook checkpoints
.ipynb_checkpoints

# PyTorch model files
*.pth
*.pt

# Logs
*.log
logs/
```

## 🚀 Quick Start

### Prerequisites

- Python 3.8 or higher
- Virtual environment (recommended)

### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/semantic-book-recommender.git
   cd semantic-book-recommender
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up environment variables:

   ```bash
   cp .env.example .env
   # Edit .env with your OpenAI API key (optional, for OpenAI embeddings)
   ```

### Running the System

1. **Data Processing Pipeline**:

   ```bash
   # Step 1: Clean and explore data
   python data_exploration.py

   # Step 2: Classify books into categories
   python text_classification.py

   # Step 3: Analyze emotions in book descriptions
   python sentiment_analysis.py

   # Step 4: Create vector database
   python vector_search.py
   ```

2. **Launch Dashboard**:

   ```bash
   python gradio_dashboard.py
   ```

Access the dashboard at `http://localhost:7860`

## 📊 Data Requirements

The system expects a `books.csv` file with the following columns:

- `isbn13`: Unique book identifier
- `title`: Book title
- `subtitle`: Book subtitle (optional)
- `authors`: Author names (semicolon-separated)
- `categories`: Book categories
- `description`: Book description
- `num_pages`: Number of pages
- `average_rating`: Average rating (1-5 scale)
- `published_year`: Publication year
- `thumbnail`: Book cover image URL

## 🎯 Usage Examples

### Semantic Search

```python
from vector_search import retrieve_semantic_recommendations

# Find the 10 books most similar to a query
results = retrieve_semantic_recommendations(
    "A mystery novel about redemption and forgiveness",
    final_top_k=10
)
```

### Emotion-Based Filtering

```python
# Get happy books in the Fiction category
# (reuses the import from the Semantic Search example)
recommendations = retrieve_semantic_recommendations(
    query="adventure story",
    category="Fiction",
    tone="Happy"
)
```

## 🔧 Configuration

### Model Settings

- **Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions)
- **Classification Model**: `facebook/bart-large-mnli`
- **Emotion Model**: `j-hartmann/emotion-english-distilroberta-base`

### Performance Tuning

- Adjust `initial_top_k` and `final_top_k` in recommendation functions
- Modify chunk size and overlap in text splitting
- Configure vector database persistence settings

## 📈 Model Performance

- **Zero-Shot Classification Accuracy**: ~85% on Fiction/Non-Fiction categorization
- **Emotion Detection**: 7-class emotion classification with confidence scores
- **Semantic Search**: Cosine similarity-based ranking with embedding vectors

## 🛠️ Technical Details

### Dependencies

- **Core ML**: `transformers`, `torch`, `sentence-transformers`
- **Vector Database**: `chromadb`, `langchain`
- **Data Processing**: `pandas`, `numpy`
- **Visualization**: `matplotlib`, `seaborn`, `gradio`
- **Utilities**: `tqdm`, `tabulate`, `python-dotenv`

### Hardware Requirements

- **RAM**: 8GB+ recommended for model loading
- **GPU**: Optional, supports CUDA/MPS for faster inference
- **Storage**: 2GB+ for model weights and vector database

## 📝 API Reference

### Main Functions

#### `retrieve_semantic_recommendations(query, category, tone, initial_top_k, final_top_k)`

Returns book recommendations based on semantic similarity and filters.

**Parameters:**

- `query` (str): Search query describing the desired book
- `category` (str): Book category filter ("All", "Fiction", "Non-Fiction", etc.)
- `tone` (str): Emotional tone filter ("Happy", "Sad", "Suspenseful", etc.)
- `initial_top_k` (int): Number of candidates to retrieve before filtering
- `final_top_k` (int): Number of recommendations to return after filtering

**Returns:**

- `pandas.DataFrame`: Filtered book recommendations with metadata

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- HuggingFace for providing pre-trained models
- OpenAI for embedding models
- ChromaDB for vector database functionality
- Gradio for the intuitive web interface
- The open-source community for various Python libraries

## 📚 References

- [Sentence Transformers Documentation](https://www.sbert.net/)
- [LangChain Documentation](https://python.langchain.com/)
- [Gradio Documentation](https://gradio.app/docs/)
- [ChromaDB Documentation](https://docs.trychroma.com/)

---

**Note**: This system is designed for educational and research purposes. Ensure compliance with data usage policies and model licenses when deploying in production environments.
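
## 🧪 Appendix: Max Aggregation Sketch

The pipeline description mentions sentence-level emotion scoring with max aggregation. The sketch below illustrates one way that step could work, using only the standard library. It is not taken from `sentiment_analysis.py`: the function name `max_aggregate`, the `EMOTIONS` list constant, and the sample scores are all hypothetical, and the real script may structure this differently.

```python
from typing import Dict, List

# The seven emotion labels listed in this README.
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def max_aggregate(sentence_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Return, for each emotion, the highest score seen in any sentence.

    Hypothetical helper: `sentence_scores` holds one {emotion: score} dict
    per sentence of a book description. Missing emotions count as 0.0.
    """
    return {
        emotion: max(
            (scores.get(emotion, 0.0) for scores in sentence_scores),
            default=0.0,  # covers a description with no sentences
        )
        for emotion in EMOTIONS
    }


# Made-up classifier output for a two-sentence description.
example = [
    {"joy": 0.10, "fear": 0.70, "neutral": 0.20},
    {"joy": 0.65, "fear": 0.05, "neutral": 0.30},
]
book_scores = max_aggregate(example)
print(book_scores)
```

Taking the maximum rather than the mean means a single strongly emotional sentence is enough to give the book a high score for that emotion, which suits tone filters such as "Happy" or "Suspenseful".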