---
title: Ask the Guru
emoji: 🧘
colorFrom: yellow
colorTo: blue
sdk: docker
app_port: 7860
---

# RAG Q&A Assistant

A retrieval-augmented generation (RAG) question-answering system built on curated YouTube subtitle transcripts.

The project provides:
- A FastAPI backend (`/ask`) for question answering.
- A static frontend served by FastAPI.
- A data pipeline to download subtitles, preprocess text, embed transcripts, and retrieve relevant context.
- A CLI flow for local/offline querying.

## Table of Contents

- [Architecture](#architecture)
- [Project Structure](#project-structure)
- [Tech Stack](#tech-stack)
- [Prerequisites](#prerequisites)
- [Configuration](#configuration)
- [Quick Start](#quick-start)
- [Run with Docker](#run-with-docker)
- [API Reference](#api-reference)
- [Data Pipeline](#data-pipeline)
- [Deployment](#deployment)
- [Operational Notes](#operational-notes)
- [Troubleshooting](#troubleshooting)

## Architecture

1. User asks a question from the UI or directly through `POST /ask`.
2. Query is embedded using `all-MiniLM-L6-v2`.
3. Top-K transcript chunks are retrieved from the FAISS index.
4. Retrieved context is token-trimmed (`MAX_CONTEXT_TOKENS`).
5. Groq chat completion API generates the final answer using a domain-aligned system prompt.

Core runtime flow:
- `app.py` loads `data/file_paths.pkl` and `data/transcripts.pkl` at startup.
- `api/retrieve_context.py` handles vector retrieval.
- `api/generate_response.py` handles LLM generation.
- `frontend/index.html` is mounted and served from `/`.
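Steps 2–3 of the flow can be illustrated with a toy similarity search. This is a hedged sketch, not the project's code: plain cosine similarity over made-up 2-D vectors stands in for the real `all-MiniLM-L6-v2` embeddings and the FAISS index.

```python
# Toy sketch of steps 2-3: embed the query, rank transcript chunks by
# similarity, keep the top-k. The real system uses all-MiniLM-L6-v2
# vectors and FAISS; cosine similarity over tiny vectors stands in here.
import math

def cosine(a, b):
    # Cosine similarity between two vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, chunk_vecs, k=2):
    # Rank chunk indices by similarity to the query; return the best k.
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(retrieve([1.0, 0.05], chunk_vecs, k=2))  # -> [0, 1]
```

In the real flow the retrieved chunk indices map back to transcript text, which is then token-trimmed to `MAX_CONTEXT_TOKENS` before being sent to the LLM.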

## Project Structure

```text
.
├── api/
│   ├── embed_transcripts.py
│   ├── generate_response.py
│   └── retrieve_context.py
├── data/
│   ├── subtitles_vtt/
│   ├── transcripts_txt/
│   ├── file_paths.pkl
│   ├── transcript_index.faiss
│   └── transcripts.pkl
├── frontend/
│   ├── assets/images/
│   └── index.html
├── outputs/
│   ├── generated_response.txt
│   └── retrieved_transcripts.txt
├── utils/
│   ├── download_vtt.py
│   ├── preprocess.py
│   ├── token.py
│   └── vtt_to_txt.py
├── app.py
├── config.py
├── main.py
├── Dockerfile
├── pyproject.toml
├── requirements.txt
└── uv.lock
```

## Tech Stack

- Python 3.11+ (per project metadata), FastAPI, Uvicorn
- FAISS (`faiss-cpu`) for vector search
- Sentence Transformers (`all-MiniLM-L6-v2`) for embeddings
- Groq API for response generation (`llama-3.1-8b-instant`)
- Static HTML/CSS/JS frontend

## Prerequisites

- Python 3.11 or later
- `pip` or `uv`
- `yt-dlp` (required only when running the subtitle download stage)
- A valid `GROQ_API_KEY`

## Configuration

Environment variables read by the app:

- `GROQ_API_KEY`: required for answer generation
- `GITHUB_TOKEN`: optional; present in config but not required for runtime flow
- `HF_API_TOKEN`: optional; present in config but not required for runtime flow

Important runtime paths are defined in `config.py`, including:
- `data/file_paths.pkl`
- `data/transcripts.pkl`
- `data/transcript_index.faiss`
- `outputs/generated_response.txt`
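A minimal startup check in the spirit of this configuration (a sketch only; `config.py` may structure this differently):

```python
# Sketch of validating the environment variables listed above.
# The variable names come from this README; the helper is illustrative.
import os

REQUIRED = ["GROQ_API_KEY"]
OPTIONAL = ["GITHUB_TOKEN", "HF_API_TOKEN"]  # present in config, not needed at runtime

def missing_required(env=os.environ):
    # Names of required variables that are absent or empty in the mapping.
    return [name for name in REQUIRED if not env.get(name)]
```

Calling `missing_required()` at startup and failing fast with a clear message avoids the less obvious `Error: AI client not configured.` later in the request path.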

## Quick Start

### 1. Install dependencies

Using `uv`:

```bash
uv sync
```

Using `pip`:

```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

### 2. Set the environment variable

```bash
export GROQ_API_KEY="your_groq_api_key"
```

### 3. Start API + frontend

```bash
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
```

Open `http://localhost:7860`.

## Run with Docker

Build:

```bash
docker build -t rag-qa-assistant .
```

Run:

```bash
docker run --rm -p 7860:7860 -e GROQ_API_KEY="your_groq_api_key" rag-qa-assistant
```

## API Reference

### `POST /ask`

Request body:

```json
{
  "query": "How do I deal with fear?"
}
```

Success response (`200`):

```json
{
  "answer": "..."
}
```

Error responses:
- `400`: missing or empty `query`
- `404`: no relevant transcripts retrieved
- `500`: internal error

Example:

```bash
curl -X POST "http://localhost:7860/ask" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is desire?"}'
```
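The same request from Python, using only the standard library (a sketch against the endpoint described above; the base URL is the local default):

```python
# Calling POST /ask from Python with the standard library only.
import json
import urllib.request

def build_ask_request(query, base_url="http://localhost:7860"):
    # Package the JSON body and headers for POST /ask.
    return urllib.request.Request(
        f"{base_url}/ask",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(query):
    # Send the request and return the "answer" field on success (HTTP 200).
    # 400/404/500 raise urllib.error.HTTPError, matching the errors above.
    with urllib.request.urlopen(build_ask_request(query)) as resp:
        return json.load(resp)["answer"]
```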

## Data Pipeline

`main.py` includes stages for data preparation and querying.

Pipeline stages:
1. Download subtitles from configured channels (`utils/download_vtt.py`)
2. Convert `.vtt` to cleaned `.txt` (`utils/vtt_to_txt.py`, `utils/preprocess.py`)
3. Load and persist transcript corpus (`data/*.pkl`)
4. Create FAISS index (`api/embed_transcripts.py`)
5. Retrieve context + generate response

Current state of `main.py`:
- Download/preprocess/embed stages are present but commented out in `main()`.
- Default execution expects prebuilt artifacts in `data/`.

Run CLI query flow:

```bash
python main.py
```
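The five stages can be pictured as one orchestration function. This is a hypothetical sketch: the real stage functions live in the modules named above, and their names and signatures are assumptions, not the project's API.

```python
# Hypothetical orchestration of the pipeline stages; the actual function
# names inside utils/ and api/ are assumptions, not the real interfaces.
from pathlib import Path

ARTIFACTS = [
    Path("data/file_paths.pkl"),
    Path("data/transcripts.pkl"),
    Path("data/transcript_index.faiss"),
]

def run_pipeline(rebuild=False):
    if not rebuild and all(p.exists() for p in ARTIFACTS):
        # Default flow: prebuilt artifacts are present, skip preparation.
        return
    # Stage 1: download .vtt subtitles        (utils/download_vtt.py)
    # Stage 2: convert and clean to .txt      (utils/vtt_to_txt.py, utils/preprocess.py)
    # Stage 3: pickle the transcript corpus   (data/*.pkl)
    # Stage 4: embed and build the FAISS index (api/embed_transcripts.py)
    raise NotImplementedError("wire in the stage functions from utils/ and api/")
```

This mirrors the current state of `main.py`: with the preparation stages commented out, execution assumes the artifacts already exist.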

## Deployment

This repository is configured for Hugging Face Spaces (Docker SDK):
- README front matter defines Space metadata.
- `.github/workflows/main.yml` syncs `main` branch to HF Space.
- `.github/workflows/space-keepalive.yml` pings the deployed Space every 12 hours.

## Operational Notes

- Data artifacts are currently committed to the repository (`data/*.pkl`, `.faiss`).
- CORS in `app.py` is permissive (`allow_origins=["*"]`) and suitable for dev/demo, not strict production hardening.
- `frontend/index.html` references `assets/images/hero-background.jpg`, but this file is not present in `frontend/assets/images/`.
- `api/embed_transcripts.py` currently treats `transcript_index` as a directory path (`mkdir`) though it is configured as a file path; this affects index regeneration workflows.
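The file-vs-directory issue in the last note can be sketched as follows (a hedged illustration; the actual code in `api/embed_transcripts.py` may differ in detail):

```python
# transcript_index is configured as a FILE path, not a directory.
from pathlib import Path

index_path = Path("data/transcript_index.faiss")

# Problematic pattern: mkdir on the file path itself creates a directory
# named transcript_index.faiss, blocking the index write:
#   index_path.mkdir(parents=True, exist_ok=True)

# Safer pattern: create only the parent directory, then write the file:
index_path.parent.mkdir(parents=True, exist_ok=True)
# faiss.write_index(index, str(index_path))  # requires faiss-cpu
```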

## Troubleshooting

- `Error: AI client not configured.`
  - Ensure `GROQ_API_KEY` is set in the shell/container before startup.

- `No relevant transcripts found` (`404` from `/ask`)
  - Check that `data/transcript_index.faiss`, `data/file_paths.pkl`, and `data/transcripts.pkl` exist and are compatible.

- API starts but UI looks incomplete
  - Verify static assets under `frontend/assets/images/`.

- Subtitle download stage fails
  - Install `yt-dlp` and verify network access and YouTube rate limits.