---
title: Scikit-learn Documentation Q&A Bot
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Scikit-learn Documentation Q&A Bot 🤖

A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Scikit-learn using the official documentation.

## Features

- **🔍 Smart Retrieval**: Searches through 1,249+ documentation chunks using semantic similarity
- **📝 Context-Aware**: Provides relevant documentation context to the AI model
- **🤖 AI-Powered**: Uses OpenAI's GPT models for accurate, helpful answers
- **🎯 Source Attribution**: Shows the exact documentation sources for each answer
- **💻 User-Friendly**: Clean Streamlit web interface
- **⚡ Fast**: Efficient vector search with ChromaDB

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Build the Vector Database (First Time Only)

```bash
python scraper.py          # Scrape Scikit-learn documentation
python chunker.py          # Split text into chunks
python build_vector_db.py  # Create vector embeddings
```

### 3. Run the Application

```bash
streamlit run app.py
```

### 4. Get Your OpenAI API Key

1. Go to [OpenAI API Keys](https://platform.openai.com/api-keys)
2. Create a new API key
3. Enter it in the sidebar of the app

## How It Works

### The RAG Pipeline

1. **📄 Document Processing**:
   - Scrapes official Scikit-learn documentation
   - Splits into 1000-character chunks with 150-character overlap
   - Creates semantic embeddings using `all-MiniLM-L6-v2`

2. **🔍 Retrieval**:
   - User asks a question
   - Question is embedded using the same model
   - Top 3 most relevant chunks are retrieved from ChromaDB

3. **📝 Augmentation**:
   - Retrieved chunks are formatted as context
   - Detailed prompt is created with context and question

4. **🤖 Generation**:
   - OpenAI GPT model generates answer based on context
   - Sources are displayed for verification

## Project Structure

```
├── app.py                 # Main Streamlit application
├── scraper.py             # Documentation scraper
├── chunker.py             # Text chunking utility
├── build_vector_db.py     # Vector database builder
├── requirements.txt       # Python dependencies
├── scraped_content.json   # Raw scraped content
├── chunks.json            # Processed text chunks
├── chroma_db/             # Vector database
└── README.md              # This file
```

## Usage Examples

### Example Questions You Can Ask:

- "How do I perform cross-validation in scikit-learn?"
- "What is the difference between Ridge and Lasso regression?"
- "How do I use GridSearchCV for parameter tuning?"
- "What clustering algorithms are available in scikit-learn?"
- "How do I preprocess data using StandardScaler?"
- "What is feature selection and how do I use it?"

### Configuration Options:

- **AI Model**: Choose between GPT-3.5-turbo, GPT-4, or GPT-4-turbo
- **Context Chunks**: Adjust the number of relevant chunks (1-5)
- **Chat History**: View and clear previous conversations

## Technical Details

### Vector Database

- **Database**: ChromaDB with SQLite backend
- **Embeddings**: 384-dimensional vectors from `all-MiniLM-L6-v2`
- **Total Documents**: 1,249 chunks
- **Database Size**: ~15 MB

### Performance

- **Processing Speed**: ~56 docs/second during build
- **Query Time**: <2 seconds for most questions
- **Model Device**: Optimized for Apple Silicon (MPS)

## Requirements

- Python 3.9+
- OpenAI API key
- ~200 MB disk space for dependencies
- ~15 MB for vector database

## Troubleshooting

### Common Issues:

1. **"OpenAI API key invalid"**
   - Make sure your API key is correct and has sufficient credits
   - Check that the key starts with "sk-"

2. **"ChromaDB collection not found"**
   - Run `python build_vector_db.py` to create the vector database
   - Make sure the `chroma_db` directory exists

3. **"Import errors"**
   - Run `pip install -r requirements.txt` to install all dependencies
   - Make sure you're using Python 3.9+

### Getting Help:

1. Check the chat history for similar questions
2. Try rephrasing your question
3. Make sure your question is about Scikit-learn
4. Check the source links for additional context

## License

This project is for educational and research purposes. The Scikit-learn documentation is under BSD license.

## Contributing

Feel free to submit issues and enhancement requests!

---

**Happy Learning with Scikit-learn! 🚀**
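
## Appendix: Chunking Sketch

The document-processing step in the RAG pipeline splits text into 1000-character chunks with a 150-character overlap. The snippet below is a minimal sketch of that idea using a plain sliding window over characters. It is not the code in `chunker.py`: the function name `chunk_text` and its parameters are hypothetical, and the real script may split on sentence or paragraph boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; the defaults mirror the
    1000/150 values stated in this README.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk consists purely of already-covered text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(2000))
    pieces = chunk_text(sample)
    print([len(p) for p in pieces])
```

With the default settings, a 2000-character input yields three chunks of 1000, 1000, and 300 characters, and the last 150 characters of each chunk repeat at the start of the next. The overlap is what lets a sentence that straddles a chunk boundary still appear whole in at least one chunk.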