scikit-rag / README.md

	---
	title: Scikit-learn Documentation Q&A Bot
	emoji: 🤖
	colorFrom: blue
	colorTo: green
	sdk: streamlit
	sdk_version: 1.28.0
	app_file: app.py
	pinned: false
	license: mit
	---

	Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
	# Scikit-learn Documentation Q&A Bot 🤖

	A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Scikit-learn using the official documentation.

	## Features

	- 🔍 Smart Retrieval: Searches 1,249 documentation chunks by semantic similarity
	- 📝 Context-Aware: Provides relevant documentation context to the AI model
	- 🤖 AI-Powered: Uses OpenAI's GPT models for accurate, helpful answers
	- 🎯 Source Attribution: Shows the exact documentation sources for each answer
	- 💻 User-Friendly: Clean Streamlit web interface
	- ⚡ Fast: Efficient vector search with ChromaDB

	## Quick Start

	### 1. Install Dependencies

	```bash
	pip install -r requirements.txt
	```

	### 2. Build the Vector Database (First Time Only)

	```bash
	python scraper.py # Scrape Scikit-learn documentation
	python chunker.py # Split text into chunks
	python build_vector_db.py # Create vector embeddings
	```

	### 3. Run the Application

	```bash
	streamlit run app.py
	```

	### 4. Get Your OpenAI API Key

	1. Go to [OpenAI API Keys](https://platform.openai.com/api-keys)
	2. Create a new API key
	3. Enter it in the sidebar of the app

	## How It Works

	### The RAG Pipeline

	1. 📄 Document Processing:
	   - Scrapes the official Scikit-learn documentation
	   - Splits the text into 1000-character chunks with a 150-character overlap
	   - Creates semantic embeddings using `all-MiniLM-L6-v2`

	2. 🔍 Retrieval:
	   - The user asks a question
	   - The question is embedded using the same model
	   - The top 3 most relevant chunks are retrieved from ChromaDB

	3. 📝 Augmentation:
	   - The retrieved chunks are formatted as context
	   - A detailed prompt is built from the context and the question

	4. 🤖 Generation:
	   - An OpenAI GPT model generates an answer based on the context
	   - The sources are displayed for verification
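
The four steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in `chunker.py` or `app.py`: the function names `chunk_text`, `top_k`, and `build_prompt` are hypothetical, cosine similarity is an assumed metric (the README only says "semantic similarity"), and the short toy vectors stand in for the real 384-dimensional embeddings.

```python
import math


def chunk_text(text, size=1000, overlap=150):
    """Split text into windows of `size` characters; neighbours share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunk vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Format retrieved chunks as numbered sources, followed by the question."""
    context = "\n\n".join(
        f"[Source {i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the documentation excerpts below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

In the real pipeline the vectors would come from `all-MiniLM-L6-v2` and the nearest-neighbour search would be delegated to ChromaDB; the sketch only shows how the pieces fit together.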

	## Project Structure

	```
	├── app.py # Main Streamlit application
	├── scraper.py # Documentation scraper
	├── chunker.py # Text chunking utility
	├── build_vector_db.py # Vector database builder
	├── requirements.txt # Python dependencies
	├── scraped_content.json # Raw scraped content
	├── chunks.json # Processed text chunks
	├── chroma_db/ # Vector database
	└── README.md # This file
	```
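
The layout above suggests that each build script produces one artifact. Assuming that mapping holds (it is inferred from the comments in the tree, not confirmed by the scripts themselves), a small helper can report which build steps are still pending. The name `pending_steps` is hypothetical.

```python
from pathlib import Path

# Assumed script -> artifact mapping, inferred from the project tree.
BUILD_STEPS = [
    ("scraper.py", "scraped_content.json"),
    ("chunker.py", "chunks.json"),
    ("build_vector_db.py", "chroma_db"),
]


def pending_steps(project_dir="."):
    """Return the scripts to run, starting at the first missing artifact."""
    root = Path(project_dir)
    for index, (_, artifact) in enumerate(BUILD_STEPS):
        if not (root / artifact).exists():
            # Later artifacts depend on this one, so rerun everything from here.
            return [script for script, _ in BUILD_STEPS[index:]]
    return []
```

An empty list means all three artifacts are present and `streamlit run app.py` should find the vector database.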

	## Usage Examples

	### Example Questions You Can Ask:

	- "How do I perform cross-validation in scikit-learn?"
	- "What is the difference between Ridge and Lasso regression?"
	- "How do I use GridSearchCV for parameter tuning?"
	- "What clustering algorithms are available in scikit-learn?"
	- "How do I preprocess data using StandardScaler?"
	- "What is feature selection and how do I use it?"

	### Configuration Options:

	- AI Model: Choose between GPT-3.5-turbo, GPT-4, or GPT-4-turbo
	- Context Chunks: Adjust the number of relevant chunks (1-5)
	- Chat History: View and clear previous conversations
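
As a rough sketch of how these options could be validated before a request is sent, the helper below rejects unknown models and clamps the chunk count to the 1-5 range. The function name `normalize_settings` and the returned dictionary shape are assumptions for illustration; the app's own handling may differ.

```python
# Model identifiers corresponding to the options listed above.
ALLOWED_MODELS = ("gpt-3.5-turbo", "gpt-4", "gpt-4-turbo")
MIN_CHUNKS, MAX_CHUNKS = 1, 5


def normalize_settings(model, n_chunks):
    """Validate the model name and clamp the chunk count to the supported range."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unsupported model: {model!r}")
    clamped = max(MIN_CHUNKS, min(MAX_CHUNKS, int(n_chunks)))
    return {"model": model, "n_chunks": clamped}
```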

	## Technical Details

	### Vector Database
	- Database: ChromaDB with SQLite backend
	- Embeddings: 384-dimensional vectors from `all-MiniLM-L6-v2`
	- Total Documents: 1,249 chunks
	- Database Size: ~15 MB
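
A quick back-of-the-envelope check of these figures, assuming the embeddings are stored as 32-bit floats (the storage format is not stated here):

```python
N_CHUNKS = 1249
DIMENSIONS = 384
BYTES_PER_FLOAT32 = 4  # assumption: embeddings are stored as float32

raw_bytes = N_CHUNKS * DIMENSIONS * BYTES_PER_FLOAT32
raw_mb = raw_bytes / 1_000_000
print(f"{raw_bytes:,} bytes = {raw_mb:.2f} MB")  # prints "1,918,464 bytes = 1.92 MB"
```

Under that assumption the raw vectors account for only about 2 MB of the ~15 MB on disk; the remainder is presumably chunk text, metadata, and index structures.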

	### Performance
	- Processing Speed: ~56 docs/second during build
	- Query Time: <2 seconds for most questions
	- Model Device: Optimized for Apple Silicon (MPS)

	## Requirements

	- Python 3.9+
	- OpenAI API key
	- ~200 MB disk space for dependencies
	- ~15 MB for vector database

	## Troubleshooting

	### Common Issues:

	1. "OpenAI API key invalid"
	   - Make sure your API key is correct and that your account has sufficient credits
	   - Check that the key starts with "sk-"

	2. "ChromaDB collection not found"
	   - Run `python build_vector_db.py` to create the vector database
	   - Make sure the `chroma_db` directory exists

	3. "Import errors"
	   - Run `pip install -r requirements.txt` to install all dependencies
	   - Make sure you're using Python 3.9+
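
The local checks above can be bundled into a small preflight script. This is a sketch under stated assumptions: `preflight` is a hypothetical helper, not part of the project, and it only checks what can be verified offline (the Python version, the key prefix, and the presence of the `chroma_db` directory). It cannot tell whether a key is actually valid or has credits.

```python
import sys
from pathlib import Path


def preflight(api_key, project_dir=".", version=None):
    """Return a list of problems found; an empty list means every check passed."""
    if version is None:
        version = sys.version_info[:2]  # (major, minor) of the running interpreter
    problems = []
    if tuple(version) < (3, 9):
        problems.append("Python 3.9+ is required")
    if not api_key.startswith("sk-"):
        problems.append('OpenAI API key should start with "sk-"')
    if not (Path(project_dir) / "chroma_db").is_dir():
        problems.append("chroma_db directory not found; run python build_vector_db.py")
    return problems
```

Calling `preflight(key)` from the project root before launching the app gives a quick list of anything that still needs fixing.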

	### Getting Help:

	1. Check the chat history for similar questions
	2. Try rephrasing your question
	3. Make sure your question is about Scikit-learn
	4. Check the source links for additional context

	## License

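	This project is released under the MIT license, as declared in the Space metadata above, and is intended for educational and research purposes. The Scikit-learn documentation is distributed under the BSD 3-Clause license.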

	## Contributing

	Feel free to submit issues and enhancement requests!

	---

	Happy Learning with Scikit-learn! 🚀