---
title: Scikit-learn Documentation Q&A Bot
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: mit
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Scikit-learn Documentation Q&A Bot 🤖

A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Scikit-learn using the official documentation.

## Features

- 🔍 **Smart Retrieval**: searches 1,249+ documentation chunks using semantic similarity
- 📝 **Context-Aware**: supplies relevant documentation context to the AI model
- 🤖 **AI-Powered**: uses OpenAI's GPT models for accurate, helpful answers
- 🎯 **Source Attribution**: shows the exact documentation sources for each answer
- 💻 **User-Friendly**: clean Streamlit web interface
- ⚡ **Fast**: efficient vector search with ChromaDB

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Build the Vector Database (First Time Only)

```bash
python scraper.py          # Scrape Scikit-learn documentation
python chunker.py          # Split text into chunks
python build_vector_db.py  # Create vector embeddings
```
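The splitting step in the middle of this pipeline (1000-character chunks with a 150-character overlap, as described under "How It Works" below) can be sketched in plain Python. The function name and return shape here are illustrative, not the actual `chunker.py` API:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `overlap` characters, so sentences cut at a chunk boundary still
    appear whole in at least one chunk."""
    step = chunk_size - overlap  # advance 850 characters per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk already reaches the end of the text
    return chunks
```

Each resulting chunk is then embedded and stored; the overlap is what keeps a definition that straddles a boundary retrievable from either side.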

### 3. Run the Application

```bash
streamlit run app.py
```

### 4. Get Your OpenAI API Key

1. Go to the [OpenAI API Keys page](https://platform.openai.com/api-keys)
2. Create a new API key
3. Enter it in the sidebar of the app

## How It Works

### The RAG Pipeline

1. 📄 **Document Processing**
   - Scrapes the official Scikit-learn documentation
   - Splits the text into 1000-character chunks with a 150-character overlap
   - Creates semantic embeddings using all-MiniLM-L6-v2
2. 🔍 **Retrieval**
   - The user asks a question
   - The question is embedded using the same model
   - The top 3 most relevant chunks are retrieved from ChromaDB
3. 📝 **Augmentation**
   - Retrieved chunks are formatted as context
   - A detailed prompt is built from the context and the question
4. 🤖 **Generation**
   - An OpenAI GPT model generates an answer grounded in the context
   - Sources are displayed for verification
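Steps 2 and 3 above amount to assembling a single prompt from the retrieved chunks. A minimal sketch of that augmentation step follows; the field names `source` and `text` and the prompt wording are assumptions for illustration, not the actual `app.py` code:

```python
def build_prompt(question: str, retrieved_chunks: list[dict]) -> str:
    """Format retrieved chunks as context and combine them with the
    user's question into one prompt string for the chat model."""
    context = "\n\n".join(
        f"[Source: {chunk['source']}]\n{chunk['text']}" for chunk in retrieved_chunks
    )
    return (
        "Answer the question using only the scikit-learn documentation "
        "excerpts below. If the answer is not in the excerpts, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The resulting string is what would be sent as the user message in the OpenAI chat completion call, while the chunk sources are kept alongside for display in the UI.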

## Project Structure

```text
├── app.py                # Main Streamlit application
├── scraper.py            # Documentation scraper
├── chunker.py            # Text chunking utility
├── build_vector_db.py    # Vector database builder
├── requirements.txt      # Python dependencies
├── scraped_content.json  # Raw scraped content
├── chunks.json           # Processed text chunks
├── chroma_db/            # Vector database
└── README.md             # This file
```

## Usage Examples

### Example Questions You Can Ask

- "How do I perform cross-validation in scikit-learn?"
- "What is the difference between Ridge and Lasso regression?"
- "How do I use GridSearchCV for parameter tuning?"
- "What clustering algorithms are available in scikit-learn?"
- "How do I preprocess data using StandardScaler?"
- "What is feature selection and how do I use it?"

### Configuration Options

- **AI Model**: choose between GPT-3.5-turbo, GPT-4, and GPT-4-turbo
- **Context Chunks**: adjust the number of retrieved chunks (1-5)
- **Chat History**: view and clear previous conversations

## Technical Details

### Vector Database

- **Database**: ChromaDB with a SQLite backend
- **Embeddings**: 384-dimensional vectors from all-MiniLM-L6-v2
- **Total Documents**: 1,249 chunks
- **Database Size**: ~15 MB
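The semantic search over those 384-dimensional vectors boils down to ranking stored embeddings by similarity to the query embedding. Here is a dependency-free sketch using cosine similarity, one common metric for sentence-transformer embeddings; the metric ChromaDB actually applies depends on the collection configuration, and it handles this internally with an index rather than a linear scan:

```python
import math

def top_k_by_cosine(query: list[float], docs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k document vectors most similar to `query`."""
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    ranked = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:k]
```

With the app's defaults, k=3 of the 1,249 stored chunk embeddings would be returned and passed on as context.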

### Performance

- **Processing Speed**: ~56 docs/second during the build
- **Query Time**: <2 seconds for most questions
- **Model Device**: optimized for Apple Silicon (MPS)

## Requirements

- Python 3.9+
- OpenAI API key
- ~200 MB disk space for dependencies
- ~15 MB for the vector database

## Troubleshooting

### Common Issues

1. **"OpenAI API key invalid"**
   - Make sure your API key is correct and has sufficient credits
   - Check that the key starts with `sk-`
2. **"ChromaDB collection not found"**
   - Run `python build_vector_db.py` to create the vector database
   - Make sure the `chroma_db/` directory exists
3. **Import errors**
   - Run `pip install -r requirements.txt` to install all dependencies
   - Make sure you're using Python 3.9+

### Getting Help

1. Check the chat history for similar questions
2. Try rephrasing your question
3. Make sure your question is about Scikit-learn
4. Check the source links for additional context

## License

This project is for educational and research purposes. The Scikit-learn documentation itself is under the BSD license.

## Contributing

Feel free to submit issues and enhancement requests!

Happy Learning with Scikit-learn! 🚀