---
title: Scikit-learn Documentation Q&A Bot
emoji: πŸ€–
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Scikit-learn Documentation Q&A Bot πŸ€–

A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Scikit-learn using the official documentation.

## Features

- **πŸ” Smart Retrieval**: Searches through 1,249+ documentation chunks using semantic similarity
- **πŸ“ Context-Aware**: Provides relevant documentation context to the AI model
- **πŸ€– AI-Powered**: Uses OpenAI's GPT models for accurate, helpful answers
- **🎯 Source Attribution**: Shows the exact documentation sources for each answer
- **πŸ’» User-Friendly**: Clean Streamlit web interface
- **⚑ Fast**: Efficient vector search with ChromaDB

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Build the Vector Database (First Time Only)

```bash
python scraper.py      # Scrape Scikit-learn documentation
python chunker.py      # Split text into chunks
python build_vector_db.py  # Create vector embeddings
```
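The chunking step (`chunker.py`) uses fixed-size chunks with overlap, as described under "How It Works" below. A minimal sketch of that idea in pure Python follows; the function name and exact splitting rule are illustrative, not the actual `chunker.py` implementation:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks with overlapping windows.

    Each chunk starts `chunk_size - overlap` characters after the
    previous one, so consecutive chunks share `overlap` characters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary intact in at least one chunk, which improves retrieval quality.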

### 3. Run the Application

```bash
streamlit run app.py
```

### 4. Get Your OpenAI API Key

1. Go to [OpenAI API Keys](https://platform.openai.com/api-keys)
2. Create a new API key
3. Enter it in the sidebar of the app

## How It Works

### The RAG Pipeline

1. **πŸ“„ Document Processing**:
   - Scrapes official Scikit-learn documentation
   - Splits into 1000-character chunks with 150-character overlap
   - Creates semantic embeddings using `all-MiniLM-L6-v2`

2. **πŸ” Retrieval**:
   - User asks a question
   - Question is embedded using the same model
   - Top 3 most relevant chunks are retrieved from ChromaDB

3. **πŸ“ Augmentation**:
   - Retrieved chunks are formatted as context
   - Detailed prompt is created with context and question

4. **πŸ€– Generation**:
   - OpenAI GPT model generates answer based on context
   - Sources are displayed for verification
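The retrieval and augmentation steps above can be sketched end to end in plain Python. Here cosine similarity over pre-computed vectors stands in for the ChromaDB query, and the prompt template is illustrative rather than the one used in `app.py`:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Format retrieved chunks as numbered context ahead of the question."""
    context = "\n\n".join(
        f"[Source {i + 1}]\n{chunk}" for i, chunk in enumerate(context_chunks)
    )
    return (
        "Answer the question using only the documentation excerpts below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

In the real app, the query embedding comes from `all-MiniLM-L6-v2` and the nearest-neighbor search is delegated to ChromaDB; the resulting prompt is what gets sent to the OpenAI model in the generation step.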

## Project Structure

```
β”œβ”€β”€ app.py                   # Main Streamlit application
β”œβ”€β”€ scraper.py               # Documentation scraper
β”œβ”€β”€ chunker.py               # Text chunking utility
β”œβ”€β”€ build_vector_db.py       # Vector database builder
β”œβ”€β”€ requirements.txt         # Python dependencies
β”œβ”€β”€ scraped_content.json     # Raw scraped content
β”œβ”€β”€ chunks.json              # Processed text chunks
β”œβ”€β”€ chroma_db/               # Vector database
└── README.md                # This file
```

## Usage Examples

### Example Questions You Can Ask:

- "How do I perform cross-validation in scikit-learn?"
- "What is the difference between Ridge and Lasso regression?"
- "How do I use GridSearchCV for parameter tuning?"
- "What clustering algorithms are available in scikit-learn?"
- "How do I preprocess data using StandardScaler?"
- "What is feature selection and how do I use it?"

### Configuration Options:

- **AI Model**: Choose between GPT-3.5-turbo, GPT-4, or GPT-4-turbo
- **Context Chunks**: Adjust the number of relevant chunks (1-5)
- **Chat History**: View and clear previous conversations

## Technical Details

### Vector Database
- **Database**: ChromaDB with SQLite backend
- **Embeddings**: 384-dimensional vectors from `all-MiniLM-L6-v2`
- **Total Documents**: 1,249 chunks
- **Database Size**: ~15 MB

### Performance
- **Processing Speed**: ~56 docs/second during build
- **Query Time**: <2 seconds for most questions
- **Model Device**: Optimized for Apple Silicon (MPS)

## Requirements

- Python 3.9+
- OpenAI API key
- ~200 MB disk space for dependencies
- ~15 MB for vector database

## Troubleshooting

### Common Issues:

1. **"OpenAI API key invalid"**
   - Make sure your API key is correct and has sufficient credits
   - Check that the key starts with "sk-"

2. **"ChromaDB collection not found"**
   - Run `python build_vector_db.py` to create the vector database
   - Make sure the `chroma_db` directory exists

3. **"Import errors"**
   - Run `pip install -r requirements.txt` to install all dependencies
   - Make sure you're using Python 3.9+
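The key-format check above can be expressed as a small helper (illustrative only, not part of `app.py`):

```python
def looks_like_openai_key(key: str) -> bool:
    """Quick sanity check on the shape of an OpenAI API key.

    Only verifies the well-known "sk-" prefix and a plausible length;
    it cannot confirm the key is actually valid or has credits.
    """
    return key.startswith("sk-") and len(key) > 20
```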

### Getting Help:

1. Check the chat history for similar questions
2. Try rephrasing your question
3. Make sure your question is about Scikit-learn
4. Check the source links for additional context

## License

This project is for educational and research purposes. The Scikit-learn documentation itself is distributed under the BSD 3-Clause license.

## Contributing

Feel free to submit issues and enhancement requests!

---

**Happy Learning with Scikit-learn! πŸš€**