---
title: Scikit-learn Documentation Q&A Bot
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: mit
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Scikit-learn Documentation Q&A Bot 🤖
A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Scikit-learn using the official documentation.
## Features
- **Smart Retrieval**: Searches 1,249 documentation chunks using semantic similarity
- **Context-Aware**: Provides relevant documentation context to the AI model
- **🤖 AI-Powered**: Uses OpenAI's GPT models to generate answers grounded in the retrieved context
- **🎯 Source Attribution**: Shows the exact documentation sources for each answer
- **💻 User-Friendly**: Clean Streamlit web interface
- **⚡ Fast**: Efficient vector search with ChromaDB
## Quick Start
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Build the Vector Database (First Time Only)
```bash
python scraper.py # Scrape Scikit-learn documentation
python chunker.py # Split text into chunks
python build_vector_db.py # Create vector embeddings
```
### 3. Run the Application
```bash
streamlit run app.py
```
### 4. Get Your OpenAI API Key
1. Go to [OpenAI API Keys](https://platform.openai.com/api-keys)
2. Create a new API key
3. Enter it in the sidebar of the app
## How It Works
### The RAG Pipeline
1. **Document Processing**:
   - Scrapes official Scikit-learn documentation
   - Splits into 1000-character chunks with 150-character overlap
   - Creates semantic embeddings using `all-MiniLM-L6-v2`
2. **Retrieval**:
   - User asks a question
   - Question is embedded using the same model
   - Top 3 most relevant chunks are retrieved from ChromaDB
3. **Augmentation**:
   - Retrieved chunks are formatted as context
   - Detailed prompt is created with context and question
4. **🤖 Generation**:
   - OpenAI GPT model generates answer based on context
   - Sources are displayed for verification
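
The chunking, retrieval, and augmentation stages above can be sketched in plain Python. This is a minimal illustration, not the code in `chunker.py` or `app.py`: the function names, the prompt wording, and the use of cosine similarity are assumptions, and the vectors would in practice come from `all-MiniLM-L6-v2` and be searched by ChromaDB rather than by a Python loop.

```python
from math import sqrt

CHUNK_SIZE = 1000  # characters per chunk, as stated above
OVERLAP = 150      # characters shared by neighbouring chunks
TOP_K = 3          # chunks retrieved per question


def chunk_text(text, size=CHUNK_SIZE, overlap=OVERLAP):
    """Split text into fixed-size windows that overlap by `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=TOP_K):
    """Return the indices of the k chunk vectors most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, chunks):
    """Format retrieved chunks as numbered context followed by the question."""
    context = "\n\n".join(
        f"[Source {n}]\n{chunk}" for n, chunk in enumerate(chunks, start=1)
    )
    # Hypothetical prompt wording; the app's actual prompt is not shown in this README.
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

With these defaults a 2,000-character document yields three chunks starting at offsets 0, 850, and 1700, so the last 150 characters of one chunk reappear at the start of the next.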
## Project Structure
```
├── app.py                 # Main Streamlit application
├── scraper.py             # Documentation scraper
├── chunker.py             # Text chunking utility
├── build_vector_db.py     # Vector database builder
├── requirements.txt       # Python dependencies
├── scraped_content.json   # Raw scraped content
├── chunks.json            # Processed text chunks
├── chroma_db/             # Vector database
└── README.md              # This file
```
## Usage Examples
### Example Questions You Can Ask:
- "How do I perform cross-validation in scikit-learn?"
- "What is the difference between Ridge and Lasso regression?"
- "How do I use GridSearchCV for parameter tuning?"
- "What clustering algorithms are available in scikit-learn?"
- "How do I preprocess data using StandardScaler?"
- "What is feature selection and how do I use it?"
### Configuration Options:
- **AI Model**: Choose between GPT-3.5-turbo, GPT-4, or GPT-4-turbo
- **Context Chunks**: Adjust the number of relevant chunks (1-5)
- **Chat History**: View and clear previous conversations
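
One way to keep these options within their documented ranges is a small validated settings object. The sketch below is illustrative only: `ChatSettings` and `ALLOWED_MODELS` are hypothetical names, the lowercase model identifiers are assumed to be what the app passes to the OpenAI API, and the defaults are assumptions (three chunks matches the pipeline description above).

```python
from dataclasses import dataclass

# Assumed API identifiers for the three models listed above.
ALLOWED_MODELS = ("gpt-3.5-turbo", "gpt-4", "gpt-4-turbo")
MIN_CHUNKS, MAX_CHUNKS = 1, 5  # documented range for context chunks


@dataclass(frozen=True)
class ChatSettings:
    """Hypothetical container for the sidebar options."""
    model: str = "gpt-3.5-turbo"  # assumed default
    context_chunks: int = 3       # assumed default

    def __post_init__(self):
        # Reject values outside the documented choices.
        if self.model not in ALLOWED_MODELS:
            raise ValueError(f"model must be one of {ALLOWED_MODELS}")
        if not MIN_CHUNKS <= self.context_chunks <= MAX_CHUNKS:
            raise ValueError(
                f"context_chunks must be between {MIN_CHUNKS} and {MAX_CHUNKS}"
            )
```

Validating once at construction means the retrieval and generation code can trust the values it receives.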
## Technical Details
### Vector Database
- **Database**: ChromaDB with SQLite backend
- **Embeddings**: 384-dimensional vectors from `all-MiniLM-L6-v2`
- **Total Documents**: 1,249 chunks
- **Database Size**: ~15 MB
### Performance
- **Processing Speed**: ~56 docs/second during build
- **Query Time**: <2 seconds for most questions
- **Model Device**: Optimized for Apple Silicon (MPS)
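
As a rough sanity check on the figures above, the raw embedding size and the build time can be estimated from the stated counts. The 4-bytes-per-value figure assumes embeddings are stored as 32-bit floats, which this README does not state.

```python
NUM_CHUNKS = 1_249      # total chunks, from the Vector Database section
EMBEDDING_DIM = 384     # dimensions per embedding
BYTES_PER_VALUE = 4     # assumption: 32-bit floats
BUILD_RATE = 56         # docs per second during the build

raw_embedding_bytes = NUM_CHUNKS * EMBEDDING_DIM * BYTES_PER_VALUE
build_seconds = NUM_CHUNKS / BUILD_RATE

print(f"Raw embeddings: {raw_embedding_bytes:,} bytes")  # 1,918,464 bytes
print(f"Estimated build time: {build_seconds:.1f} s")    # 22.3 s
```

Under that assumption the raw vectors account for roughly 2 MB of the ~15 MB database; the rest would presumably be chunk text, metadata, and index structures.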
## Requirements
- Python 3.9+
- OpenAI API key
- ~200 MB disk space for dependencies
- ~15 MB for vector database
## Troubleshooting
### Common Issues:
1. **"OpenAI API key invalid"**
   - Make sure your API key is correct and has sufficient credits
   - Check that the key starts with "sk-"
2. **"ChromaDB collection not found"**
   - Run `python build_vector_db.py` to create the vector database
   - Make sure the `chroma_db` directory exists
3. **"Import errors"**
   - Run `pip install -r requirements.txt` to install all dependencies
   - Make sure you're using Python 3.9+
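
The checks above can be automated with a small preflight function. This is a sketch under assumptions: `preflight` is a hypothetical helper that is not part of the project, and it only tests the surface conditions listed here (key prefix, database directory, Python version), not whether the key is actually valid or the collection is populated.

```python
import sys
from pathlib import Path


def preflight(api_key, db_dir="chroma_db", min_python=(3, 9)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ is required")
    if not api_key.startswith("sk-"):
        problems.append('API key should start with "sk-"')
    if not Path(db_dir).is_dir():
        problems.append(f"{db_dir} not found; run python build_vector_db.py")
    return problems


if __name__ == "__main__":
    # Example run with a placeholder key; replace with your own.
    for problem in preflight("sk-placeholder"):
        print("Problem:", problem)
```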
### Getting Help:
1. Check the chat history for similar questions
2. Try rephrasing your question
3. Make sure your question is about Scikit-learn
4. Check the source links for additional context
## License
This project is for educational and research purposes. The Scikit-learn documentation is distributed under the BSD 3-Clause license.
## Contributing
Feel free to submit issues and enhancement requests!
---
**Happy Learning with Scikit-learn!**