---
title: Scikit-learn Documentation Q&A Bot
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: mit
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Scikit-learn Documentation Q&A Bot 🤖
A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Scikit-learn using the official documentation.
## Features
- **🔍 Smart Retrieval**: Searches through 1,249+ documentation chunks using semantic similarity
- **📝 Context-Aware**: Provides relevant documentation context to the AI model
- **🤖 AI-Powered**: Uses OpenAI's GPT models for accurate, helpful answers
- **🎯 Source Attribution**: Shows the exact documentation sources for each answer
- **💻 User-Friendly**: Clean Streamlit web interface
- **⚡ Fast**: Efficient vector search with ChromaDB
## Quick Start
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Build the Vector Database (First Time Only)
```bash
python scraper.py # Scrape Scikit-learn documentation
python chunker.py # Split text into chunks
python build_vector_db.py # Create vector embeddings
```
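Because the build only has to happen once, it can help to check which outputs already exist before rerunning anything. The following is a minimal sketch, assuming each script writes the artifact with the matching name under Project Structure (`scraper.py` writes `scraped_content.json`, `chunker.py` writes `chunks.json`, `build_vector_db.py` writes `chroma_db/`). That mapping and the `pending_build_steps` helper are illustrative assumptions, not code from this repository.

```python
from pathlib import Path

# Assumed mapping of build script -> artifact it produces (inferred from file names).
BUILD_STEPS = [
    ("scraper.py", "scraped_content.json"),
    ("chunker.py", "chunks.json"),
    ("build_vector_db.py", "chroma_db"),
]


def pending_build_steps(root="."):
    """Return the scripts that still need to run, in pipeline order.

    Once one artifact is missing, every later step is returned as well,
    because its output would be derived from absent or outdated input.
    """
    root = Path(root)
    pending = []
    rebuild_rest = False
    for script, artifact in BUILD_STEPS:
        if rebuild_rest or not (root / artifact).exists():
            rebuild_rest = True
            pending.append(script)
    return pending


if __name__ == "__main__":
    # Print the commands that would still need to be run by hand.
    for script in pending_build_steps():
        print(f"python {script}")
```

An empty result means all three artifacts are present and the app can be started directly.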
### 3. Run the Application
```bash
streamlit run app.py
```
### 4. Get Your OpenAI API Key
1. Go to [OpenAI API Keys](https://platform.openai.com/api-keys)
2. Create a new API key
3. Enter it in the sidebar of the app
## How It Works
### The RAG Pipeline
1. **📄 Document Processing**:
   - Scrapes the official Scikit-learn documentation
   - Splits it into 1000-character chunks with a 150-character overlap
   - Creates semantic embeddings using `all-MiniLM-L6-v2`
2. **🔍 Retrieval**:
   - The user asks a question
   - The question is embedded using the same model
   - The top 3 most relevant chunks are retrieved from ChromaDB
3. **📝 Augmentation**:
   - Retrieved chunks are formatted as context
   - A detailed prompt is built from the context and the question
4. **🤖 Generation**:
   - An OpenAI GPT model generates the answer based on the context
   - Sources are displayed for verification
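The chunking and augmentation steps above can be sketched in plain Python. This is a simplified illustration, not the code in `chunker.py` or `app.py`: it uses fixed-size character windows (the real chunker may split on natural boundaries), and the `chunk_text` and `build_prompt` names and the prompt wording are assumptions made for the example.

```python
def chunk_text(text, chunk_size=1000, overlap=150):
    """Split text into character chunks where neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


def build_prompt(question, retrieved_chunks):
    """Format retrieved chunks as numbered context, followed by the question."""
    context = "\n\n".join(
        f"[Source {i}]\n{chunk}"
        for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the documentation context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.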
## Project Structure
```
├── app.py # Main Streamlit application
├── scraper.py # Documentation scraper
├── chunker.py # Text chunking utility
├── build_vector_db.py # Vector database builder
├── requirements.txt # Python dependencies
├── scraped_content.json # Raw scraped content
├── chunks.json # Processed text chunks
├── chroma_db/ # Vector database
└── README.md # This file
```
## Usage Examples
### Example Questions You Can Ask:
- "How do I perform cross-validation in scikit-learn?"
- "What is the difference between Ridge and Lasso regression?"
- "How do I use GridSearchCV for parameter tuning?"
- "What clustering algorithms are available in scikit-learn?"
- "How do I preprocess data using StandardScaler?"
- "What is feature selection and how do I use it?"
### Configuration Options:
- **AI Model**: Choose between GPT-3.5-turbo, GPT-4, or GPT-4-turbo
- **Context Chunks**: Adjust the number of relevant chunks (1-5)
- **Chat History**: View and clear previous conversations
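The model and chunk-count options above amount to a small, validated settings object. The sketch below is hypothetical: the `ChatSettings` class, the lowercase model identifiers, and the default model are assumptions for illustration, while the default of 3 chunks mirrors the retrieval step described under How It Works.

```python
from dataclasses import dataclass

# Model names from the list above, written as lowercase identifiers (assumed form).
ALLOWED_MODELS = ("gpt-3.5-turbo", "gpt-4", "gpt-4-turbo")


@dataclass(frozen=True)
class ChatSettings:
    model: str = "gpt-3.5-turbo"  # assumed default, not confirmed by the app
    n_chunks: int = 3             # matches "top 3" in the retrieval step

    def __post_init__(self):
        # Reject values outside the documented options.
        if self.model not in ALLOWED_MODELS:
            raise ValueError(f"model must be one of {ALLOWED_MODELS}")
        if not 1 <= self.n_chunks <= 5:
            raise ValueError("n_chunks must be between 1 and 5")
```

In a Streamlit app these values would typically come from sidebar widgets rather than being constructed by hand.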
## Technical Details
### Vector Database
- **Database**: ChromaDB with SQLite backend
- **Embeddings**: 384-dimensional vectors from `all-MiniLM-L6-v2`
- **Total Documents**: 1,249 chunks
- **Database Size**: ~15 MB
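To make the idea of semantic search over embedding vectors concrete, here is a brute-force stand-in for what the vector database does. It is not ChromaDB's implementation (ChromaDB uses its own index and configured distance metric); cosine similarity and the `top_k` helper are assumptions chosen for clarity, and the toy vectors in the usage note have 3 dimensions rather than 384.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, `top_k([1, 0, 0], [[1, 0, 0], [0, 1, 0], [0.9, 0.1, 0]], k=2)` ranks the first vector highest and the third one second, since both point in nearly the same direction as the query.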
### Performance
- **Processing Speed**: ~56 docs/second during build
- **Query Time**: <2 seconds for most questions
- **Model Device**: Optimized for Apple Silicon (MPS)
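How the app selects a device is not shown in this README. One common pattern, sketched here as an assumption, is to prefer MPS on Apple Silicon and fall back to CUDA or CPU elsewhere; `pick_device` is a hypothetical helper.

```python
def pick_device():
    """Return "mps", "cuda", or "cpu" based on what PyTorch reports as available."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch releases
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"
```

The returned string could then be passed as the `device` argument when loading the embedding model, for instance `SentenceTransformer("all-MiniLM-L6-v2", device=pick_device())`.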
## Requirements
- Python 3.9+
- OpenAI API key
- ~200 MB disk space for dependencies
- ~15 MB for vector database
## Troubleshooting
### Common Issues:
1. **"OpenAI API key invalid"**
   - Make sure your API key is correct and has sufficient credits
   - Check that the key starts with "sk-"
2. **"ChromaDB collection not found"**
   - Run `python build_vector_db.py` to create the vector database
   - Make sure the `chroma_db` directory exists
3. **"Import errors"**
   - Run `pip install -r requirements.txt` to install all dependencies
   - Make sure you're using Python 3.9+
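The checks above can also be run programmatically before starting the app. This is a minimal sketch under stated assumptions: `preflight_problems` is a hypothetical helper, and the key check only looks at the "sk-" prefix, so it cannot tell whether a key is actually valid or has credits.

```python
import sys
from pathlib import Path


def preflight_problems(api_key, db_dir="chroma_db", python_version=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Allow the version to be injected so the check can be exercised in isolation.
    version = tuple(python_version) if python_version is not None else sys.version_info[:2]
    if version < (3, 9):
        problems.append("Python 3.9+ is required")
    if not api_key or not api_key.startswith("sk-"):
        problems.append('OpenAI API key is missing or does not start with "sk-"')
    if not Path(db_dir).is_dir():
        problems.append(f"{db_dir} not found; run: python build_vector_db.py")
    return problems
```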
### Getting Help:
1. Check the chat history for similar questions
2. Try rephrasing your question
3. Make sure your question is about Scikit-learn
4. Check the source links for additional context
## License
This project is for educational and research purposes. The Scikit-learn documentation is distributed under the BSD 3-Clause license.
## Contributing
Feel free to submit issues and enhancement requests!
---
**Happy Learning with Scikit-learn! 🚀**