---
title: Scikit-learn Documentation Q&A Bot
emoji: πŸ€–
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Scikit-learn Documentation Q&A Bot πŸ€–

A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Scikit-learn using the official documentation.

## Features

- **πŸ” Smart Retrieval**: Searches through 1,249+ documentation chunks using semantic similarity
- **πŸ“ Context-Aware**: Provides relevant documentation context to the AI model
- **πŸ€– AI-Powered**: Uses OpenAI's GPT models for accurate, helpful answers
- **🎯 Source Attribution**: Shows the exact documentation sources for each answer
- **πŸ’» User-Friendly**: Clean Streamlit web interface
- **⚑ Fast**: Efficient vector search with ChromaDB

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Build the Vector Database (First Time Only)

```bash
python scraper.py      # Scrape Scikit-learn documentation
python chunker.py      # Split text into chunks
python build_vector_db.py  # Create vector embeddings
```
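The chunking step (`chunker.py`) uses fixed-size chunks with overlap, as described under "How It Works" below. A minimal sketch of that idea in pure Python follows; the function name and exact splitting rule are illustrative, not the actual `chunker.py` implementation:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks with overlapping windows.

    Each chunk starts `chunk_size - overlap` characters after the
    previous one, so consecutive chunks share `overlap` characters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary intact in at least one chunk, which improves retrieval quality.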

### 3. Run the Application

```bash
streamlit run app.py
```

### 4. Get Your OpenAI API Key

1. Go to [OpenAI API Keys](https://platform.openai.com/api-keys)
2. Create a new API key
3. Enter it in the sidebar of the app

## How It Works

### The RAG Pipeline

1. **πŸ“„ Document Processing**:
   - Scrapes official Scikit-learn documentation
   - Splits into 1000-character chunks with 150-character overlap
   - Creates semantic embeddings using `all-MiniLM-L6-v2`

2. **πŸ” Retrieval**:
   - User asks a question
   - Question is embedded using the same model
   - Top 3 most relevant chunks are retrieved from ChromaDB

3. **πŸ“ Augmentation**:
   - Retrieved chunks are formatted as context
   - Detailed prompt is created with context and question

4. **πŸ€– Generation**:
   - OpenAI GPT model generates answer based on context
   - Sources are displayed for verification
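The retrieval and augmentation steps above can be sketched end to end in plain Python. Here cosine similarity over pre-computed vectors stands in for the ChromaDB query, and the prompt template is illustrative rather than the one used in `app.py`:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Format retrieved chunks as numbered context ahead of the question."""
    context = "\n\n".join(
        f"[Source {i + 1}]\n{chunk}" for i, chunk in enumerate(context_chunks)
    )
    return (
        "Answer the question using only the documentation excerpts below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

In the real app, the query embedding comes from `all-MiniLM-L6-v2` and the nearest-neighbor search is delegated to ChromaDB; the resulting prompt is what gets sent to the OpenAI model in the generation step.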

## Project Structure

```
β”œβ”€β”€ app.py                   # Main Streamlit application
β”œβ”€β”€ scraper.py               # Documentation scraper
β”œβ”€β”€ chunker.py               # Text chunking utility
β”œβ”€β”€ build_vector_db.py       # Vector database builder
β”œβ”€β”€ requirements.txt         # Python dependencies
β”œβ”€β”€ scraped_content.json     # Raw scraped content
β”œβ”€β”€ chunks.json              # Processed text chunks
β”œβ”€β”€ chroma_db/               # Vector database
└── README.md                # This file
```

## Usage Examples

### Example Questions You Can Ask:

- "How do I perform cross-validation in scikit-learn?"
- "What is the difference between Ridge and Lasso regression?"
- "How do I use GridSearchCV for parameter tuning?"
- "What clustering algorithms are available in scikit-learn?"
- "How do I preprocess data using StandardScaler?"
- "What is feature selection and how do I use it?"

### Configuration Options:

- **AI Model**: Choose between GPT-3.5-turbo, GPT-4, or GPT-4-turbo
- **Context Chunks**: Adjust the number of relevant chunks (1-5)
- **Chat History**: View and clear previous conversations

## Technical Details

### Vector Database
- **Database**: ChromaDB with SQLite backend
- **Embeddings**: 384-dimensional vectors from `all-MiniLM-L6-v2`
- **Total Documents**: 1,249 chunks
- **Database Size**: ~15 MB

### Performance
- **Processing Speed**: ~56 docs/second during build
- **Query Time**: <2 seconds for most questions
- **Model Device**: Optimized for Apple Silicon (MPS)

## Requirements

- Python 3.9+
- OpenAI API key
- ~200 MB disk space for dependencies
- ~15 MB for vector database

## Troubleshooting

### Common Issues:

1. **"OpenAI API key invalid"**
   - Make sure your API key is correct and has sufficient credits
   - Check that the key starts with "sk-"

2. **"ChromaDB collection not found"**
   - Run `python build_vector_db.py` to create the vector database
   - Make sure the `chroma_db` directory exists

3. **"Import errors"**
   - Run `pip install -r requirements.txt` to install all dependencies
   - Make sure you're using Python 3.9+
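The key-format check above can be expressed as a small helper (illustrative only, not part of `app.py`):

```python
def looks_like_openai_key(key: str) -> bool:
    """Quick sanity check on the shape of an OpenAI API key.

    Only verifies the well-known "sk-" prefix and a plausible length;
    it cannot confirm the key is actually valid or has credits.
    """
    return key.startswith("sk-") and len(key) > 20
```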

### Getting Help:

1. Check the chat history for similar questions
2. Try rephrasing your question
3. Make sure your question is about Scikit-learn
4. Check the source links for additional context

## License

This project is for educational and research purposes. The Scikit-learn documentation itself is distributed under the BSD 3-Clause license.

## Contributing

Feel free to submit issues and enhancement requests!

---

**Happy Learning with Scikit-learn! πŸš€**