metadata
title: Scikit-learn Documentation Q&A Bot
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: mit
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
Scikit-learn Documentation Q&A Bot
A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Scikit-learn using the official documentation.
Features
- Smart Retrieval: Searches through 1,249+ documentation chunks using semantic similarity
- Context-Aware: Provides relevant documentation context to the AI model
- AI-Powered: Uses OpenAI's GPT models for accurate, helpful answers
- Source Attribution: Shows the exact documentation sources for each answer
- User-Friendly: Clean Streamlit web interface
- Fast: Efficient vector search with ChromaDB
Quick Start
1. Install Dependencies
pip install -r requirements.txt
2. Build the Vector Database (First Time Only)
python scraper.py # Scrape Scikit-learn documentation
python chunker.py # Split text into chunks
python build_vector_db.py # Create vector embeddings
3. Run the Application
streamlit run app.py
4. Get Your OpenAI API Key
- Go to the OpenAI API Keys page
- Create a new API key
- Enter it in the sidebar of the app
How It Works
The RAG Pipeline
Document Processing:
- Scrapes official Scikit-learn documentation
- Splits into 1000-character chunks with 150-character overlap
- Creates semantic embeddings using all-MiniLM-L6-v2
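The chunking step above can be sketched as a sliding window over the text. This is a minimal illustration, assuming plain fixed-size character windows; the function name chunk_text is hypothetical, and the project's chunker.py may split on sentence or paragraph boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not taken from chunker.py.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # with the defaults, a new window starts every 850 characters
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2000))
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # [1000, 1000, 300]
```

The overlap means the last 150 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still intact in at least one chunk.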
Retrieval:
- User asks a question
- Question is embedded using the same model
- Top 3 most relevant chunks are retrieved from ChromaDB
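To make the ranking idea concrete, here is a small sketch of top-k retrieval using plain cosine similarity in pure Python. It is a stand-in, not the app's code: ChromaDB performs this search internally with an approximate index, the distance metric used by the app's collection is not stated here and may not be cosine, and the function names and toy 2-dimensional vectors are made up for illustration (real embeddings from all-MiniLM-L6-v2 have 384 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy vectors standing in for embedded documentation chunks.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.5, 0.5]]
print(top_k([1.0, 0.0], chunk_vecs))  # [0, 2, 4]
```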
Augmentation:
- Retrieved chunks are formatted as context
- Detailed prompt is created with context and question
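The augmentation step can be sketched as a function that joins the retrieved chunks into a context block and wraps it in instructions. The prompt wording, the function name build_prompt, and the "text" and "source" dictionary keys are all assumptions for illustration; the actual prompt in app.py is not shown in this README.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a RAG prompt from a question and retrieved chunks.

    Each chunk is assumed to be a dict with hypothetical "text" and "source" keys.
    """
    context_parts = []
    for i, chunk in enumerate(chunks, start=1):
        # Label every excerpt so the answer can point back to its source.
        context_parts.append(f"[Source {i}: {chunk['source']}]\n{chunk['text']}")
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the documentation excerpts below. "
        "If the excerpts do not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_chunks = [
    {"source": "modules/cross_validation", "text": "cross_val_score evaluates a score by cross-validation."},
]
print(build_prompt("How do I perform cross-validation?", example_chunks))
```

Labeling each excerpt with its source is what lets the interface show where an answer came from.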
Generation:
- OpenAI GPT model generates answer based on context
- Sources are displayed for verification
Project Structure
├── app.py                  # Main Streamlit application
├── scraper.py              # Documentation scraper
├── chunker.py              # Text chunking utility
├── build_vector_db.py      # Vector database builder
├── requirements.txt        # Python dependencies
├── scraped_content.json    # Raw scraped content
├── chunks.json             # Processed text chunks
├── chroma_db/              # Vector database
└── README.md               # This file
Usage Examples
Example Questions You Can Ask:
- "How do I perform cross-validation in scikit-learn?"
- "What is the difference between Ridge and Lasso regression?"
- "How do I use GridSearchCV for parameter tuning?"
- "What clustering algorithms are available in scikit-learn?"
- "How do I preprocess data using StandardScaler?"
- "What is feature selection and how do I use it?"
Configuration Options:
- AI Model: Choose between GPT-3.5-turbo, GPT-4, or GPT-4-turbo
- Context Chunks: Adjust the number of relevant chunks (1-5)
- Chat History: View and clear previous conversations
Technical Details
Vector Database
- Database: ChromaDB with SQLite backend
- Embeddings: 384-dimensional vectors from all-MiniLM-L6-v2
- Total Documents: 1,249 chunks
- Database Size: ~15 MB
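As a rough sanity check on these figures, the raw size of the embedding vectors alone can be estimated. The sketch below assumes each dimension is stored as a 4-byte float32, which is an assumption about the storage format rather than something this README states.

```python
NUM_CHUNKS = 1249        # total documents listed above
DIMENSIONS = 384         # embedding size of all-MiniLM-L6-v2
BYTES_PER_FLOAT32 = 4    # assumption: vectors are stored as float32

raw_bytes = NUM_CHUNKS * DIMENSIONS * BYTES_PER_FLOAT32
raw_mib = raw_bytes / (1024 ** 2)
print(raw_bytes, round(raw_mib, 2))  # 1918464 1.83
```

Under that assumption the vectors account for under 2 MB of the ~15 MB total; the remainder is presumably the stored chunk text, metadata, and index overhead.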
Performance
- Processing Speed: ~56 docs/second during build
- Query Time: <2 seconds for most questions
- Model Device: Optimized for Apple Silicon (MPS)
Requirements
- Python 3.9+
- OpenAI API key
- ~200 MB disk space for dependencies
- ~15 MB for vector database
Troubleshooting
Common Issues:
"OpenAI API key invalid"
- Make sure your API key is correct and has sufficient credits
- Check that the key starts with "sk-"
"ChromaDB collection not found"
- Run python build_vector_db.py to create the vector database
- Make sure the chroma_db directory exists
"Import errors"
- Run pip install -r requirements.txt to install all dependencies
- Make sure you're using Python 3.9+
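The checks above could be automated before the app starts. The following is a hypothetical helper, not part of the project: the function name preflight_issues and its messages are made up, and the "sk-" prefix test only catches obvious paste errors, not invalid keys or missing credits.

```python
import os
import sys


def preflight_issues(api_key: str, db_dir: str = "chroma_db") -> list[str]:
    """Return human-readable problems found; an empty list means all checks passed."""
    issues = []
    if sys.version_info < (3, 9):
        issues.append("Python 3.9+ is required")
    if not api_key.startswith("sk-"):
        issues.append('API key does not start with "sk-"')
    if not os.path.isdir(db_dir):
        issues.append(f"{db_dir} directory not found; run python build_vector_db.py")
    return issues


for problem in preflight_issues(os.environ.get("OPENAI_API_KEY", "")):
    print("Problem:", problem)
```

Reading the key from the OPENAI_API_KEY environment variable is only for this example; the app itself takes the key from the sidebar.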
Getting Help:
- Check the chat history for similar questions
- Try rephrasing your question
- Make sure your question is about Scikit-learn
- Check the source links for additional context
License
This project is intended for educational and research purposes. The Scikit-learn documentation is distributed under the BSD license.
Contributing
Feel free to submit issues and enhancement requests!
Happy Learning with Scikit-learn!