---
title: Ask the Guru
emoji: 🧠
colorFrom: yellow
colorTo: blue
sdk: docker
app_port: 7860
---

# RAG Q&A Assistant

A retrieval-augmented generation (RAG) system for question answering, built on curated YouTube subtitle transcripts.

The project provides:
- A FastAPI backend (`/ask`) for question answering.
- A static frontend served by FastAPI.
- A data pipeline to download subtitles, preprocess text, embed transcripts, and retrieve relevant context.
- A CLI flow for local/offline querying.


## Table of Contents

- [Architecture](#architecture)
- [Project Structure](#project-structure)
- [Tech Stack](#tech-stack)
- [Prerequisites](#prerequisites)
- [Configuration](#configuration)
- [Quick Start](#quick-start)
- [Run with Docker](#run-with-docker)
- [API Reference](#api-reference)
- [Data Pipeline](#data-pipeline)
- [Deployment](#deployment)
- [Operational Notes](#operational-notes)
- [Troubleshooting](#troubleshooting)

## Architecture

1. The user submits a question from the UI or directly via `POST /ask`.
2. The query is embedded with `all-MiniLM-L6-v2`.
3. The top-K transcript chunks are retrieved from the FAISS index.
4. The retrieved context is token-trimmed to fit `MAX_CONTEXT_TOKENS`.
5. The Groq chat completion API generates the final answer using a domain-aligned system prompt.

Core runtime flow:
- `app.py` loads `data/file_paths.pkl` and `data/transcripts.pkl` at startup.
- `api/retrieve_context.py` handles vector retrieval.
- `api/generate_response.py` handles LLM generation.
- `frontend/index.html` is mounted and served from `/`.

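The retrieval step (2–3 above) amounts to nearest-neighbor search over normalized embeddings. A minimal sketch of the same idea in plain NumPy (illustrative only; the app uses Sentence Transformers and a FAISS index via `api/retrieve_context.py`, and `top_k` here is a hypothetical helper):

```python
import numpy as np

def top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k transcript chunks most similar to the query.

    Brute-force cosine similarity; a flat FAISS index produces the same
    ranking, just faster and without Python-level overhead.
    """
    q = query_vec / np.linalg.norm(query_vec)                           # unit query vector
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)  # unit chunk vectors
    scores = c @ q                                                      # cosine similarity per chunk
    return np.argsort(-scores)[:k].tolist()                             # best-first indices
```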

## Project Structure

```text
.
├── api/
│   ├── embed_transcripts.py
│   ├── generate_response.py
│   └── retrieve_context.py
├── data/
│   ├── subtitles_vtt/
│   ├── transcripts_txt/
│   ├── file_paths.pkl
│   ├── transcript_index.faiss
│   └── transcripts.pkl
├── frontend/
│   ├── assets/images/
│   └── index.html
├── outputs/
│   ├── generated_response.txt
│   └── retrieved_transcripts.txt
├── utils/
│   ├── download_vtt.py
│   ├── preprocess.py
│   ├── token.py
│   └── vtt_to_txt.py
├── app.py
├── config.py
├── main.py
├── Dockerfile
├── pyproject.toml
├── requirements.txt
└── uv.lock
```

## Tech Stack

- Python 3.11+ (per project metadata), FastAPI, Uvicorn
- FAISS (`faiss-cpu`) for vector search
- Sentence Transformers (`all-MiniLM-L6-v2`) for embeddings
- Groq API (`llama-3.1-8b-instant`) for response generation
- Static HTML/CSS/JS frontend

## Prerequisites

- Python 3.11 or later
- `pip` or `uv`
- `yt-dlp` (required only when running the subtitle download stage)
- A valid `GROQ_API_KEY`

## Configuration

Environment variables read by the app:

- `GROQ_API_KEY`: required for answer generation
- `GITHUB_TOKEN`: optional; present in config but not required for the runtime flow
- `HF_API_TOKEN`: optional; present in config but not required for the runtime flow

Important runtime paths are defined in `config.py`, including:
- `data/file_paths.pkl`
- `data/transcripts.pkl`
- `data/transcript_index.faiss`
- `outputs/generated_response.txt`

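Because a missing `GROQ_API_KEY` only surfaces when the first answer is requested, a fail-fast check at startup can be useful. A sketch (the `require_env` helper is hypothetical, not part of the actual `config.py`):

```python
import os

def require_env(name: str) -> str:
    """Return a required environment variable, failing loudly at startup if unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; answer generation will fail at request time.")
    return value
```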
## Quick Start

### 1. Install dependencies

Using `uv`:

```bash
uv sync
```

Using `pip`:

```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

### 2. Set environment variable

```bash
export GROQ_API_KEY="your_groq_api_key"
```

### 3. Start API + frontend

```bash
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
```

Then open `http://localhost:7860`.

## Run with Docker

Build:

```bash
docker build -t rag-qa-assistant .
```

Run:

```bash
docker run --rm -p 7860:7860 -e GROQ_API_KEY="your_groq_api_key" rag-qa-assistant
```

## API Reference

### `POST /ask`

Request body:

```json
{
  "query": "How do I deal with fear?"
}
```

Success response (`200`):

```json
{
  "answer": "..."
}
```

Error responses:
- `400`: missing or empty `query`
- `404`: no relevant transcripts retrieved
- `500`: internal error

Example:

```bash
curl -X POST "http://localhost:7860/ask" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is desire?"}'
```


## Data Pipeline

`main.py` includes stages for data preparation and querying.

Pipeline stages:
1. Download subtitles from the configured channels (`utils/download_vtt.py`)
2. Convert `.vtt` files to cleaned `.txt` (`utils/vtt_to_txt.py`, `utils/preprocess.py`)
3. Load and persist the transcript corpus (`data/*.pkl`)
4. Create the FAISS index (`api/embed_transcripts.py`)
5. Retrieve context and generate a response

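Conceptually, stage 2 drops the WebVTT header, cue numbers, and timestamp lines and keeps only the spoken text. A minimal sketch (illustrative; the real logic lives in `utils/vtt_to_txt.py` and `utils/preprocess.py` and may handle more cases, such as non-numeric cue identifiers):

```python
def vtt_to_text(vtt: str) -> str:
    """Strip WebVTT headers, cue numbers, and timestamps, keeping only speech."""
    kept = []
    for line in vtt.splitlines():
        line = line.strip()
        # Skip blanks, the WEBVTT header, timestamp lines, and numeric cue ids.
        if not line or line.startswith("WEBVTT") or "-->" in line or line.isdigit():
            continue
        kept.append(line)
    return " ".join(kept)
```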
Current state of `main.py`:
- The download/preprocess/embed stages are present but commented out in `main()`.
- Default execution expects prebuilt artifacts in `data/`.

Run the CLI query flow:

```bash
python main.py
```

## Deployment

This repository is configured for Hugging Face Spaces (Docker SDK):
- The README front matter defines the Space metadata.
- `.github/workflows/main.yml` syncs the `main` branch to the HF Space.
- `.github/workflows/space-keepalive.yml` pings the deployed Space every 12 hours.

## Operational Notes

- Data artifacts are currently committed to the repository (`data/*.pkl`, `.faiss`).
- CORS in `app.py` is permissive (`allow_origins=["*"]`), which is fine for dev/demo use but is not production hardening.
- `frontend/index.html` references `assets/images/hero-background.jpg`, but this file is not present in `frontend/assets/images/`.
- `api/embed_transcripts.py` currently treats `transcript_index` as a directory path (`mkdir`) even though it is configured as a file path; this breaks index regeneration workflows.

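For the index-path issue, the usual fix is to create the parent directory of the index file rather than treating the file path itself as a directory. A sketch (the `prepare_index_path` helper is hypothetical):

```python
from pathlib import Path

def prepare_index_path(index_path: str) -> Path:
    """Ensure the directory that will contain the FAISS index file exists."""
    path = Path(index_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # not path.mkdir()
    return path
```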

## Troubleshooting

- `Error: AI client not configured.`
  - Ensure `GROQ_API_KEY` is set in the shell/container before startup.

- `No relevant transcripts found` (`404` from `/ask`)
  - Check that `data/transcript_index.faiss`, `data/file_paths.pkl`, and `data/transcripts.pkl` exist and are compatible with each other.

- The API starts but the UI looks incomplete
  - Verify the static assets under `frontend/assets/images/`.

- The subtitle download stage fails
  - Install `yt-dlp` and verify network access and YouTube rate limits.