---
title: Ask the Guru
emoji: 🧠
colorFrom: yellow
colorTo: blue
sdk: docker
app_port: 7860
---

# RAG Q&A Assistant
A retrieval-augmented generation (RAG) question-answering system built on curated YouTube subtitle transcripts.
The project provides:

- A FastAPI backend (`POST /ask`) for question answering.
- A static frontend served by FastAPI.
- A data pipeline to download subtitles, preprocess text, embed transcripts, and retrieve relevant context.
- A CLI flow for local/offline querying.
## Table of Contents

- Architecture
- Project Structure
- Tech Stack
- Prerequisites
- Configuration
- Quick Start
- Run with Docker
- API Reference
- Data Pipeline
- Deployment
- Operational Notes
- Troubleshooting
## Architecture

- The user asks a question from the UI or directly through `POST /ask`.
- The query is embedded using `all-MiniLM-L6-v2`.
- Top-K transcript chunks are retrieved from the FAISS index.
- Retrieved context is token-trimmed (`MAX_CONTEXT_TOKENS`).
- The Groq chat completion API generates the final answer using a domain-aligned system prompt.
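The token-trim step can be sketched as follows. This is a hypothetical illustration that counts whitespace tokens; the project's `utils/token.py` presumably counts actual model tokens, so treat the function name and logic as assumptions:

```python
def trim_context(chunks, max_tokens):
    """Keep retrieved chunks, in order, until the token budget is spent.

    Whitespace-split words stand in for real tokens here; the actual
    implementation may trim with a proper tokenizer.
    """
    kept, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if used + n > max_tokens:
            break  # adding this chunk would exceed MAX_CONTEXT_TOKENS
        kept.append(chunk)
        used += n
    return kept
```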
Core runtime flow:

- `app.py` loads `data/file_paths.pkl` and `data/transcripts.pkl` at startup.
- `api/retrieve_context.py` handles vector retrieval.
- `api/generate_response.py` handles LLM generation.
- `frontend/index.html` is mounted and served from `/`.
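Conceptually, the retrieval step reduces to a nearest-neighbor search over chunk embeddings. A minimal NumPy sketch of that idea (the real code uses FAISS; the function and variable names here are illustrative only):

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k transcript chunks closest to the query embedding.

    Assumes all vectors are L2-normalized, so the inner product equals
    cosine similarity -- mirroring what an inner-product FAISS search
    over normalized all-MiniLM-L6-v2 embeddings would compute.
    """
    scores = chunk_vecs @ query_vec            # similarity per chunk
    order = np.argsort(scores)[::-1][:k]       # indices of top-k scores
    return [chunks[i] for i in order]
```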
## Project Structure

```text
.
├── api/
│   ├── embed_transcripts.py
│   ├── generate_response.py
│   └── retrieve_context.py
├── data/
│   ├── subtitles_vtt/
│   ├── transcripts_txt/
│   ├── file_paths.pkl
│   ├── transcript_index.faiss
│   └── transcripts.pkl
├── frontend/
│   ├── assets/images/
│   └── index.html
├── outputs/
│   ├── generated_response.txt
│   └── retrieved_transcripts.txt
├── utils/
│   ├── download_vtt.py
│   ├── preprocess.py
│   ├── token.py
│   └── vtt_to_txt.py
├── app.py
├── config.py
├── main.py
├── Dockerfile
├── pyproject.toml
├── requirements.txt
└── uv.lock
```
## Tech Stack

- Python 3.11+ (project metadata), FastAPI, Uvicorn
- FAISS (`faiss-cpu`) for vector search
- Sentence Transformers (`all-MiniLM-L6-v2`) for embeddings
- Groq API for response generation (`llama-3.1-8b-instant`)
- Static HTML/CSS/JS frontend
## Prerequisites

- Python 3.11 or later
- `pip` or `uv`
- `yt-dlp` (required only when running the subtitle download stage)
- A valid `GROQ_API_KEY`
## Configuration

Environment variables read by the app:

- `GROQ_API_KEY`: required for answer generation
- `GITHUB_TOKEN`: optional; present in config but not required for the runtime flow
- `HF_API_TOKEN`: optional; present in config but not required for the runtime flow

Important runtime paths are defined in `config.py`, including:

- `data/file_paths.pkl`
- `data/transcripts.pkl`
- `data/transcript_index.faiss`
- `outputs/generated_response.txt`
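For orientation, a `config.py` along these lines would cover the variables and paths listed above. This is a sketch, not the repository's actual file; the constant names are assumptions:

```python
import os
from pathlib import Path

# Environment variables (names from the README; layout is hypothetical).
GROQ_API_KEY = os.getenv("GROQ_API_KEY")   # required for answer generation
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")   # optional
HF_API_TOKEN = os.getenv("HF_API_TOKEN")   # optional

# Runtime artifact paths.
DATA_DIR = Path("data")
FILE_PATHS_PKL = DATA_DIR / "file_paths.pkl"
TRANSCRIPTS_PKL = DATA_DIR / "transcripts.pkl"
FAISS_INDEX = DATA_DIR / "transcript_index.faiss"
GENERATED_RESPONSE = Path("outputs") / "generated_response.txt"
```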
## Quick Start

### 1. Install dependencies

Using `uv`:

```bash
uv sync
```

Using `pip`:

```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

### 2. Set the environment variable

```bash
export GROQ_API_KEY="your_groq_api_key"
```

### 3. Start the API + frontend

```bash
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
```

Open http://localhost:7860.
## Run with Docker

Build:

```bash
docker build -t rag-qa-assistant .
```

Run:

```bash
docker run --rm -p 7860:7860 -e GROQ_API_KEY="your_groq_api_key" rag-qa-assistant
```
## API Reference

### POST /ask

Request body:

```json
{
  "query": "How do I deal with fear?"
}
```

Success response (200):

```json
{
  "answer": "..."
}
```

Error responses:

- `400`: missing or empty `query`
- `404`: no relevant transcripts retrieved
- `500`: internal error

Example:

```bash
curl -X POST "http://localhost:7860/ask" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is desire?"}'
```
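The same call from Python, using only the standard library. This sketch assumes the server from Quick Start is running on localhost; the helper names are illustrative:

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"

def build_ask_request(query: str) -> urllib.request.Request:
    """Build the POST /ask request with a JSON body."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/ask",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(query: str) -> str:
    """Send the question and return the generated answer string."""
    with urllib.request.urlopen(build_ask_request(query)) as resp:
        return json.loads(resp.read())["answer"]
```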
## Data Pipeline

`main.py` includes stages for data preparation and querying.

Pipeline stages:

- Download subtitles from configured channels (`utils/download_vtt.py`)
- Convert `.vtt` to cleaned `.txt` (`utils/vtt_to_txt.py`, `utils/preprocess.py`)
- Load and persist the transcript corpus (`data/*.pkl`)
- Create the FAISS index (`api/embed_transcripts.py`)
- Retrieve context + generate a response
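The `.vtt` → `.txt` conversion stage amounts to stripping the WebVTT header, cue timestamps, and repeated caption lines. A minimal stand-in; the repository's `utils/vtt_to_txt.py` may differ in its details:

```python
import re

# Matches WebVTT cue timing lines like "00:00:01.000 --> 00:00:03.000".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}[.,]\d{3} --> ")

def vtt_to_text(vtt: str) -> str:
    """Convert a WebVTT document to a single line of plain text."""
    lines, prev = [], None
    for line in vtt.splitlines():
        line = re.sub(r"<[^>]+>", "", line).strip()  # drop inline tags
        if not line or line == "WEBVTT" or TIMESTAMP.match(line):
            continue
        if line.isdigit():       # skip bare cue numbers
            continue
        if line != prev:         # collapse consecutive duplicate captions
            lines.append(line)
        prev = line
    return " ".join(lines)
```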
Current state of `main.py`:

- The download/preprocess/embed stages are present but commented out in `main()`.
- Default execution expects prebuilt artifacts in `data/`.

Run the CLI query flow:

```bash
python main.py
```
## Deployment

This repository is configured for Hugging Face Spaces (Docker SDK):

- The README front matter defines the Space metadata.
- `.github/workflows/main.yml` syncs the `main` branch to the HF Space.
- `.github/workflows/space-keepalive.yml` pings the deployed Space every 12 hours.
## Operational Notes

- Data artifacts are currently committed to the repository (`data/*.pkl`, `.faiss`).
- CORS in `app.py` is permissive (`allow_origins=["*"]`), which is suitable for dev/demo use but not for strict production hardening.
- `frontend/index.html` references `assets/images/hero-background.jpg`, but this file is not present in `frontend/assets/images/`.
- `api/embed_transcripts.py` currently treats `transcript_index` as a directory path (`mkdir`) even though it is configured as a file path; this affects index-regeneration workflows.
## Troubleshooting

**Error: AI client not configured.**

- Ensure `GROQ_API_KEY` is set in the shell/container before startup.

**No relevant transcripts found (`404` from `/ask`)**

- Check that `data/transcript_index.faiss`, `data/file_paths.pkl`, and `data/transcripts.pkl` exist and are compatible.

**API starts but UI looks incomplete**

- Verify static assets under `frontend/assets/images/`.

**Subtitle download stage fails**

- Install `yt-dlp` and verify network access and YouTube rate limits.