---
title: Scikit-learn Documentation Q&A Bot
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: mit
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Scikit-learn Documentation Q&A Bot 🤖

A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Scikit-learn using the official documentation.

## Features

- 🔍 **Smart Retrieval**: searches 1,249+ documentation chunks using semantic similarity
- 📝 **Context-Aware**: supplies relevant documentation context to the AI model
- 🤖 **AI-Powered**: uses OpenAI's GPT models for accurate, helpful answers
- 🎯 **Source Attribution**: shows the exact documentation sources for each answer
- 💻 **User-Friendly**: clean Streamlit web interface
- ⚡ **Fast**: efficient vector search with ChromaDB

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Build the Vector Database (First Time Only)

```bash
python scraper.py          # Scrape Scikit-learn documentation
python chunker.py          # Split text into chunks
python build_vector_db.py  # Create vector embeddings
```
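The splitting step in the middle of this pipeline (1000-character chunks with a 150-character overlap, as described under "How It Works" below) can be sketched in plain Python. The function name and return shape here are illustrative, not the actual `chunker.py` API:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `overlap` characters, so sentences cut at a chunk boundary still
    appear whole in at least one chunk."""
    step = chunk_size - overlap  # advance 850 characters per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk already reaches the end of the text
    return chunks
```

Each resulting chunk is then embedded and stored; the overlap is what keeps a definition that straddles a boundary retrievable from either side.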

### 3. Run the Application

```bash
streamlit run app.py
```

### 4. Get Your OpenAI API Key

1. Go to the [OpenAI API Keys page](https://platform.openai.com/api-keys)
2. Create a new API key
3. Enter it in the sidebar of the app

## How It Works

### The RAG Pipeline

1. 📄 **Document Processing**
   - Scrapes the official Scikit-learn documentation
   - Splits the text into 1000-character chunks with a 150-character overlap
   - Creates semantic embeddings using all-MiniLM-L6-v2
2. 🔍 **Retrieval**
   - The user asks a question
   - The question is embedded using the same model
   - The top 3 most relevant chunks are retrieved from ChromaDB
3. 📝 **Augmentation**
   - Retrieved chunks are formatted as context
   - A detailed prompt is built from the context and the question
4. 🤖 **Generation**
   - An OpenAI GPT model generates an answer grounded in the context
   - Sources are displayed for verification
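Steps 2 and 3 above amount to assembling a single prompt from the retrieved chunks. A minimal sketch of that augmentation step follows; the field names `source` and `text` and the prompt wording are assumptions for illustration, not the actual `app.py` code:

```python
def build_prompt(question: str, retrieved_chunks: list[dict]) -> str:
    """Format retrieved chunks as context and combine them with the
    user's question into one prompt string for the chat model."""
    context = "\n\n".join(
        f"[Source: {chunk['source']}]\n{chunk['text']}" for chunk in retrieved_chunks
    )
    return (
        "Answer the question using only the scikit-learn documentation "
        "excerpts below. If the answer is not in the excerpts, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The resulting string is what would be sent as the user message in the OpenAI chat completion call, while the chunk sources are kept alongside for display in the UI.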

## Project Structure

```text
├── app.py                # Main Streamlit application
├── scraper.py            # Documentation scraper
├── chunker.py            # Text chunking utility
├── build_vector_db.py    # Vector database builder
├── requirements.txt      # Python dependencies
├── scraped_content.json  # Raw scraped content
├── chunks.json           # Processed text chunks
├── chroma_db/            # Vector database
└── README.md             # This file
```

## Usage Examples

### Example Questions You Can Ask

- "How do I perform cross-validation in scikit-learn?"
- "What is the difference between Ridge and Lasso regression?"
- "How do I use GridSearchCV for parameter tuning?"
- "What clustering algorithms are available in scikit-learn?"
- "How do I preprocess data using StandardScaler?"
- "What is feature selection and how do I use it?"

### Configuration Options

- **AI Model**: choose between GPT-3.5-turbo, GPT-4, and GPT-4-turbo
- **Context Chunks**: adjust the number of retrieved chunks (1-5)
- **Chat History**: view and clear previous conversations

## Technical Details

### Vector Database

- **Database**: ChromaDB with a SQLite backend
- **Embeddings**: 384-dimensional vectors from all-MiniLM-L6-v2
- **Total Documents**: 1,249 chunks
- **Database Size**: ~15 MB
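The semantic search over those 384-dimensional vectors boils down to ranking stored embeddings by similarity to the query embedding. Here is a dependency-free sketch using cosine similarity, one common metric for sentence-transformer embeddings; the metric ChromaDB actually applies depends on the collection configuration, and it handles this internally with an index rather than a linear scan:

```python
import math

def top_k_by_cosine(query: list[float], docs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k document vectors most similar to `query`."""
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    ranked = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:k]
```

With the app's defaults, k=3 of the 1,249 stored chunk embeddings would be returned and passed on as context.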

### Performance

- **Processing Speed**: ~56 docs/second during the build
- **Query Time**: <2 seconds for most questions
- **Model Device**: optimized for Apple Silicon (MPS)

## Requirements

- Python 3.9+
- OpenAI API key
- ~200 MB disk space for dependencies
- ~15 MB for the vector database

## Troubleshooting

### Common Issues

1. **"OpenAI API key invalid"**
   - Make sure your API key is correct and has sufficient credits
   - Check that the key starts with `sk-`
2. **"ChromaDB collection not found"**
   - Run `python build_vector_db.py` to create the vector database
   - Make sure the `chroma_db/` directory exists
3. **Import errors**
   - Run `pip install -r requirements.txt` to install all dependencies
   - Make sure you're using Python 3.9+

### Getting Help

1. Check the chat history for similar questions
2. Try rephrasing your question
3. Make sure your question is about Scikit-learn
4. Check the source links for additional context

## License

This project is for educational and research purposes. The Scikit-learn documentation itself is under the BSD license.

## Contributing

Feel free to submit issues and enhancement requests!

Happy Learning with Scikit-learn! 🚀