---
title: Scikit-learn Documentation Q&A Bot
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Scikit-learn Documentation Q&A Bot 🤖

A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Scikit-learn using the official documentation.

## Features

- **🔍 Smart Retrieval**: Searches through 1,249+ documentation chunks using semantic similarity
- **📚 Context-Aware**: Provides relevant documentation context to the AI model
- **🤖 AI-Powered**: Uses OpenAI's GPT models for accurate, helpful answers
- **🎯 Source Attribution**: Shows the exact documentation sources for each answer
- **💻 User-Friendly**: Clean Streamlit web interface
- **⚡ Fast**: Efficient vector search with ChromaDB

## Quick Start

### 1. Install Dependencies
```bash
pip install -r requirements.txt
```

### 2. Build the Vector Database (First Time Only)

```bash
python scraper.py         # Scrape the Scikit-learn documentation
python chunker.py         # Split the text into chunks
python build_vector_db.py # Create vector embeddings
```
### 3. Run the Application

```bash
streamlit run app.py
```

### 4. Get Your OpenAI API Key

1. Go to [OpenAI API Keys](https://platform.openai.com/api-keys)
2. Create a new API key
3. Enter it in the sidebar of the app
## How It Works

### The RAG Pipeline

1. **📄 Document Processing**:
   - Scrapes the official Scikit-learn documentation
   - Splits it into 1000-character chunks with 150-character overlap
   - Creates semantic embeddings using `all-MiniLM-L6-v2`
2. **🔍 Retrieval**:
   - The user asks a question
   - The question is embedded using the same model
   - The top 3 most relevant chunks are retrieved from ChromaDB
3. **📝 Augmentation**:
   - Retrieved chunks are formatted as context
   - A detailed prompt is built from the context and the question
4. **🤖 Generation**:
   - An OpenAI GPT model generates an answer based on the context
   - Sources are displayed for verification
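The augmentation step above can be sketched as follows. The `text`/`url` field names and the prompt wording are assumptions for illustration, not the app's actual prompt:

```python
def build_prompt(question, chunks):
    """Format retrieved chunks as numbered context and append the question."""
    context = "\n\n".join(
        f"[Source {i + 1}] {chunk['text']}\n(from: {chunk['url']})"
        for i, chunk in enumerate(chunks)
    )
    return (
        "Answer the question using only the scikit-learn documentation "
        "excerpts below. If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the sources in the prompt is what lets the answer be traced back to specific documentation pages for attribution.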
## Project Structure

```
├── app.py                 # Main Streamlit application
├── scraper.py             # Documentation scraper
├── chunker.py             # Text chunking utility
├── build_vector_db.py     # Vector database builder
├── requirements.txt       # Python dependencies
├── scraped_content.json   # Raw scraped content
├── chunks.json            # Processed text chunks
├── chroma_db/             # Vector database
└── README.md              # This file
```
## Usage Examples

### Example Questions You Can Ask:

- "How do I perform cross-validation in scikit-learn?"
- "What is the difference between Ridge and Lasso regression?"
- "How do I use GridSearchCV for parameter tuning?"
- "What clustering algorithms are available in scikit-learn?"
- "How do I preprocess data using StandardScaler?"
- "What is feature selection and how do I use it?"

### Configuration Options:

- **AI Model**: Choose between GPT-3.5-turbo, GPT-4, and GPT-4-turbo
- **Context Chunks**: Adjust the number of retrieved chunks (1-5)
- **Chat History**: View and clear previous conversations
## Technical Details

### Vector Database

- **Database**: ChromaDB with a SQLite backend
- **Embeddings**: 384-dimensional vectors from `all-MiniLM-L6-v2`
- **Total Documents**: 1,249 chunks
- **Database Size**: ~15 MB
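ChromaDB performs the similarity search internally; as a minimal, dependency-free illustration of what cosine-similarity top-k retrieval computes over those embedding vectors (shown here with 2-dimensional toy vectors rather than the real 384-dimensional ones):

```python
import math

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k documents most similar to the query (cosine)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0
    # Score every document, then keep the indices of the k best scores.
    scores = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]
```

In the app, `k` corresponds to the "Context Chunks" setting (1-5), and the returned indices map back to documentation chunks and their source URLs.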
### Performance

- **Processing Speed**: ~56 docs/second during the build
- **Query Time**: <2 seconds for most questions
- **Model Device**: Optimized for Apple Silicon (MPS)

## Requirements

- Python 3.9+
- OpenAI API key
- ~200 MB disk space for dependencies
- ~15 MB for the vector database
## Troubleshooting

### Common Issues:

1. **"OpenAI API key invalid"**
   - Make sure your API key is correct and has sufficient credits
   - Check that the key starts with "sk-"
2. **"ChromaDB collection not found"**
   - Run `python build_vector_db.py` to create the vector database
   - Make sure the `chroma_db` directory exists
3. **Import errors**
   - Run `pip install -r requirements.txt` to install all dependencies
   - Make sure you're using Python 3.9+

### Getting Help:

1. Check the chat history for similar questions
2. Try rephrasing your question
3. Make sure your question is about Scikit-learn
4. Check the source links for additional context
## License

This project is for educational and research purposes. The Scikit-learn documentation is licensed under the BSD 3-Clause License.

## Contributing

Feel free to submit issues and enhancement requests!

---

**Happy Learning with Scikit-learn! 🎉**