---
title: Ask the Guru
emoji: 🧘
colorFrom: yellow
colorTo: blue
sdk: docker
app_port: 7860
---

RAG Q&A Assistant

A retrieval-augmented generation (RAG) question-answering system built on curated YouTube subtitle transcripts.

The project provides:

  • A FastAPI backend (/ask) for question answering.
  • A static frontend served by FastAPI.
  • A data pipeline to download subtitles, preprocess text, embed transcripts, and retrieve relevant context.
  • A CLI flow for local/offline querying.


Architecture

  1. A user asks a question through the UI or directly via POST /ask.
  2. The query is embedded with all-MiniLM-L6-v2.
  3. The top-K transcript chunks are retrieved from the FAISS index.
  4. The retrieved context is trimmed to the MAX_CONTEXT_TOKENS budget.
  5. The Groq chat-completion API generates the final answer using a domain-aligned system prompt.
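Step 4's token trimming can be sketched as a greedy whole-chunk fill. This is a minimal sketch, assuming a crude whitespace token count; the actual counting lives in utils/token.py, and the MAX_CONTEXT_TOKENS value below is illustrative, not the project's real setting:

```python
MAX_CONTEXT_TOKENS = 1000  # illustrative budget; the real value is set in config.py

def trim_context(chunks: list[str], max_tokens: int = MAX_CONTEXT_TOKENS) -> str:
    """Greedily keep whole retrieved chunks until the token budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())  # crude whitespace count, a stand-in for a real tokenizer
        if used + n > max_tokens:
            break
        kept.append(chunk)
        used += n
    return "\n\n".join(kept)
```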

Core runtime flow:

  • app.py loads data/file_paths.pkl and data/transcripts.pkl at startup.
  • api/retrieve_context.py handles vector retrieval.
  • api/generate_response.py handles LLM generation.
  • frontend/index.html is mounted and served from /.

Project Structure

.
├── api/
│   ├── embed_transcripts.py
│   ├── generate_response.py
│   └── retrieve_context.py
├── data/
│   ├── subtitles_vtt/
│   ├── transcripts_txt/
│   ├── file_paths.pkl
│   ├── transcript_index.faiss
│   └── transcripts.pkl
├── frontend/
│   ├── assets/images/
│   └── index.html
├── outputs/
│   ├── generated_response.txt
│   └── retrieved_transcripts.txt
├── utils/
│   ├── download_vtt.py
│   ├── preprocess.py
│   ├── token.py
│   └── vtt_to_txt.py
β”œβ”€β”€ app.py
β”œβ”€β”€ config.py
β”œβ”€β”€ main.py
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ requirements.txt
└── uv.lock

Tech Stack

  • Python 3.11+ (per project metadata), FastAPI, Uvicorn
  • FAISS (faiss-cpu) for vector search
  • Sentence Transformers (all-MiniLM-L6-v2) for embeddings
  • Groq API for response generation (llama-3.1-8b-instant)
  • Static HTML/CSS/JS frontend

Prerequisites

  • Python 3.11 or later
  • pip or uv
  • yt-dlp (required only when running subtitle download stage)
  • A valid GROQ_API_KEY

Configuration

Environment variables read by the app:

  • GROQ_API_KEY: required for answer generation
  • GITHUB_TOKEN: optional; present in config but not required for runtime flow
  • HF_API_TOKEN: optional; present in config but not required for runtime flow
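Since only GROQ_API_KEY is required, a fail-fast check at startup avoids a confusing mid-request failure. A minimal sketch; the helper name is illustrative, not from the codebase:

```python
import os

def require_groq_key() -> str:
    """Fail fast if the one required environment variable is missing."""
    key = os.getenv("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; answer generation will fail.")
    return key
```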

Important runtime paths are defined in config.py, including:

  • data/file_paths.pkl
  • data/transcripts.pkl
  • data/transcript_index.faiss
  • outputs/generated_response.txt

Quick Start

1. Install dependencies

Using uv:

uv sync

Using pip:

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

2. Set environment variable

export GROQ_API_KEY="your_groq_api_key"

3. Start API + frontend

uvicorn app:app --host 0.0.0.0 --port 7860 --reload

Open http://localhost:7860.

Run with Docker

Build:

docker build -t rag-qa-assistant .

Run:

docker run --rm -p 7860:7860 -e GROQ_API_KEY="your_groq_api_key" rag-qa-assistant

API Reference

POST /ask

Request body:

{
  "query": "How do I deal with fear?"
}

Success response (200):

{
  "answer": "..."
}

Error responses:

  • 400: missing or empty query
  • 404: no relevant transcripts retrieved
  • 500: internal error

Example:

curl -X POST "http://localhost:7860/ask" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is desire?"}'
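The same request can be issued from Python with only the standard library. A sketch; build_ask_request and ask are illustrative helpers, not part of the repo:

```python
import json
import urllib.request

def build_ask_request(query: str, base_url: str = "http://localhost:7860"):
    """Build the POST /ask request with a JSON body."""
    return urllib.request.Request(
        f"{base_url}/ask",
        data=json.dumps({"query": query}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(query: str, base_url: str = "http://localhost:7860") -> str:
    """Send the question and return the 'answer' field of the response."""
    with urllib.request.urlopen(build_ask_request(query, base_url)) as resp:
        return json.loads(resp.read())["answer"]
```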

Data Pipeline

main.py includes stages for data preparation and querying.

Pipeline stages:

  1. Download subtitles from configured channels (utils/download_vtt.py)
  2. Convert .vtt to cleaned .txt (utils/vtt_to_txt.py, utils/preprocess.py)
  3. Load and persist transcript corpus (data/*.pkl)
  4. Create FAISS index (api/embed_transcripts.py)
  5. Retrieve context + generate response
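Stage 2's .vtt-to-text conversion can be sketched as a simple line filter that drops the WEBVTT header, cue numbers, and timestamp lines. This is a simplified stand-in for utils/vtt_to_txt.py, not its actual implementation:

```python
import re

def vtt_to_text(vtt: str) -> str:
    """Keep only caption text from a WebVTT document (simplified)."""
    lines = []
    for line in vtt.splitlines():
        line = line.strip()
        if not line or line.startswith("WEBVTT") or "-->" in line:
            continue  # header, blank lines, and cue timestamps
        if re.fullmatch(r"\d+", line):
            continue  # numeric cue identifiers
        lines.append(line)
    return " ".join(lines)
```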

Current state of main.py:

  • Download/preprocess/embed stages are present but commented out in main().
  • Default execution expects prebuilt artifacts in data/.

Run CLI query flow:

python main.py

Deployment

This repository is configured for Hugging Face Spaces (Docker SDK):

  • README front matter defines Space metadata.
  • .github/workflows/main.yml syncs main branch to HF Space.
  • .github/workflows/space-keepalive.yml pings the deployed Space every 12 hours.

Operational Notes

  • Data artifacts are currently committed to the repository (data/*.pkl, .faiss).
  • CORS in app.py is permissive (allow_origins=["*"]) and suitable for dev/demo, not strict production hardening.
  • frontend/index.html references assets/images/hero-background.jpg, but this file is not present in frontend/assets/images/.
  • api/embed_transcripts.py currently treats transcript_index as a directory path (mkdir) though it is configured as a file path; this affects index regeneration workflows.

Troubleshooting

  • Error: AI client not configured.

    • Ensure GROQ_API_KEY is set in the shell/container before startup.
  • No relevant transcripts found (404 from /ask)

    • Check that data/transcript_index.faiss, data/file_paths.pkl, and data/transcripts.pkl exist and are compatible.
  • API starts but UI looks incomplete

    • Verify static assets under frontend/assets/images/.
  • Subtitle download stage fails

    • Install yt-dlp and verify network access and YouTube rate limits.