NotebookLMClone / docs /STREAMLIT_ARCHITECTURE_SPEC.md
github-actions[bot]
Sync from GitHub e48aa5f27523b35a22c1a01acbb2b835cdc28984
aacd162

NotebookLM Clone Streamlit Architecture Spec

1. Scope

This spec defines the target MVP for a Streamlit-based NotebookLM clone with:

  • source ingestion (.pdf, .pptx, .txt, URL)
  • retrieval-augmented chat with citations
  • artifact generation (report, quiz, podcast transcript/audio)
  • strict per-user and per-notebook data isolation
  • CI/CD deployment from GitHub to Hugging Face Spaces

This document aligns your course requirements, the initial plan PDF, and the current repository implementation.

2. Current Baseline (Repo Audit)

Implemented now:

  • FastAPI backend with notebook/source/thread/chat endpoints in app.py
  • Streamlit frontend in frontend/app.py
  • ingestion pipeline in src/ingestion/ (extract, chunk, embed, Chroma upsert/query)
  • SQLite schema and CRUD in data/models.py and data/crud.py
  • artifact endpoints for report, quiz, and podcast in app.py
  • HF OAuth + session bridge integration in auth/oauth.py and auth/session.py
  • notebook create/rename/delete and notebook-scoped source/thread/artifact routes
  • URL ingestion safety controls (scheme allowlist, DNS/IP checks, redirect/body limits)
  • URL source auto-ingestion (processing -> ready/failed) and file upload ingestion
  • per-user authorization checks via require_current_user
  • Streamlit artifact panel with preview/download/playback controls
  • GitHub Actions workflow for deploy to Hugging Face Space

Remaining MVP gaps / hardening:

  • citation history display should persist clearly when reloading existing threads
  • operational docs/runbook should be updated with final artifact output formats and auth/deploy setup

3. Target Architecture (Streamlit + FastAPI)

3.1 Frontend (Streamlit)

  • frontend/app.py remains the primary UI.
  • Pages/sections:
    • Auth/session status
    • Notebook manager (create, rename, delete, switch)
    • Source ingestion (upload + URL)
    • Chat panel with citations
    • Artifact panel (generate/list/download/playback)
  • Session state stores selected notebook/thread and user identity from OAuth.

3.2 Backend (FastAPI)

  • Keep app.py routers for API boundaries:
    • /auth/*
    • /notebooks/*
    • /threads/*
    • /sources/*
    • /notebooks/{id}/artifacts/*
  • Service layer responsibilities:
    • ingestion orchestration (src/ingestion/service.py)
    • RAG retrieval + prompt construction (query_notebook_chunks, prompt templates)
    • artifact generation (src/artifacts/*)
  • Authorization rule: every notebook/thread/source/artifact operation must verify ownership against authenticated user.

3.3 Storage

Primary metadata store:

  • SQLite (users, notebooks, sources, chat_threads, messages, message_citations, artifacts)

Vector store:

  • ChromaDB collections with notebook scoping metadata (user_id, notebook_id, source_id, chunk refs)

File/object storage layout (MVP local/HF /data):

/data/users/<username>/notebooks/<notebook_uuid>/
  files_raw/
  files_extracted/
  chroma/
  chat/messages.jsonl
  artifacts/reports/
  artifacts/quizzes/
  artifacts/podcasts/

4. Identity, Auth, and Isolation Plan

4.1 Authentication

  • Integrate Hugging Face OAuth for user login.
  • Map provider identity (hf_sub or stable username) to internal users row.
  • Store session in secure cookie/server session.

4.2 Authorization

  • Replace free-form owner_user_id from UI with server-derived user ID from session.
  • Add shared helper (dependency/middleware) to resolve current_user.
  • Enforce ownership checks in every read/write endpoint.

4.3 Isolation invariants

  • DB queries always include ownership constraints.
  • Vector queries include user_id and notebook_id metadata filters.
  • File paths are derived from trusted IDs only (never direct user path input).

5. Functional Requirements and API Plan

5.1 Notebook lifecycle

Required:

  • create notebook
  • list notebooks for current user
  • rename notebook
  • delete notebook

Backend additions:

  • PATCH /notebooks/{notebook_id}
  • DELETE /notebooks/{notebook_id}

5.2 Source ingestion

Required:

  • upload .pdf/.pptx/.txt files
  • ingest URL sources with safe fetch rules
  • extract, chunk, embed, store, mark ready/failed

Backend additions:

  • URL validator + SSRF guardrail module (block private IP ranges, non-http(s), large responses)

5.3 RAG chat with citations

Required:

  • retrieve top-k notebook chunks
  • generate answer grounded in retrieved context
  • return citation metadata and persist messages + citations

Current state:

  • mostly implemented in POST /threads/{thread_id}/chat

Hardening needed:

  • stronger citation formatting in responses
  • conversation token budgeting and truncation policy

5.4 Artifact generation

Required outputs:

  • report (.md)
  • quiz (.md + answer key)
  • podcast transcript (.md) + audio (.mp3)

Current state:

  • all three artifact endpoints exist and are wired in Streamlit
  • report output is persisted as Markdown (.md)
  • quiz output is persisted as Markdown (.md) including answer key
  • podcast persists transcript Markdown (.md) and audio (.mp3)

Backend additions:

  • standard artifact serialization + saved output files under artifact subfolders

5.5 UI requirements

Required frontend features:

  • notebook manager with switching
  • source upload + URL ingest
  • chat with visible citations
  • artifact generate buttons (report/quiz/podcast)
  • artifact list with download links
  • podcast playback component
  • explicit error/retry states

6. CI/CD Requirements (GitHub -> HF Space)

  • Trigger on push to main.
  • Sync repository to HF Space via token auth.
  • Use GitHub Secrets:
    • HF_TOKEN
    • HF_SPACE_REPO (example: username/space-name)
    • optional: HF_SPACE_BRANCH (default main)
  • Optional pre-deploy check: run tests before sync.

7. Milestone Plan

Milestone 1: Auth + Isolation foundation

  • Implement HF OAuth and session plumbing.
  • Remove manual owner_user_id UI field.
  • Add authorization dependency and enforce route coverage.

Exit criteria:

  • no endpoint accepts cross-user notebook/thread access.

Milestone 2: Notebook + Ingestion completeness

  • add notebook rename/delete APIs and UI actions.
  • add SSRF-safe URL ingestion policy.
  • improve ingestion status feedback in UI.

Exit criteria:

  • complete notebook lifecycle and safe ingestion of all required source types.

Milestone 3: RAG + Artifacts

  • improve chat citation UX and persistence views.
  • add report artifact generation + storage.
  • finalize artifact browser/download/audio playback in Streamlit.

Exit criteria:

  • all three artifact types are generated, listed, and downloadable/playable.

Milestone 4: Deployment hardening

  • enable GitHub Actions HF deploy.
  • add smoke test steps and env validation.
  • document operational runbook.

Exit criteria:

  • push to main updates HF Space automatically.

8. Risk Controls

  • Cost control: cap tokens, default economical model, per-request limits.
  • Ephemeral storage: keep extracted text/chunks to rebuild vectors.
  • Prompt injection: treat source text as untrusted and constrain system prompts.
  • URL ingestion abuse: protocol allowlist, IP range blocklist, timeout/size caps.
  • Dependency risk: pin versions, scan vulnerabilities in CI periodically.

9. Build Order (Recommended Next 10 Tasks)

  1. implement current_user auth dependency
  2. wire HF OAuth callbacks
  3. replace UI owner_user_id with authenticated identity
  4. add notebook rename API + UI
  5. add notebook delete API + UI confirmation
  6. add report artifact generator + endpoint
  7. add artifact list/download/playback panel in Streamlit
  8. add URL safety validator module for ingestion
  9. add integration tests for cross-user isolation
  10. enforce CI deploy workflow and add README deployment setup