# NotebookLM Clone Streamlit Architecture Spec
## 1. Scope
This spec defines the target MVP for a Streamlit-based NotebookLM clone with:
- source ingestion (`.pdf`, `.pptx`, `.txt`, URL)
- retrieval-augmented chat with citations
- artifact generation (report, quiz, podcast transcript/audio)
- strict per-user and per-notebook data isolation
- CI/CD deployment from GitHub to Hugging Face Spaces
This document reconciles the course requirements, the initial plan PDF, and the current repository implementation.
## 2. Current Baseline (Repo Audit)
Implemented now:
- FastAPI backend with notebook/source/thread/chat endpoints in `app.py`
- Streamlit frontend in `frontend/app.py`
- ingestion pipeline in `src/ingestion/` (extract, chunk, embed, Chroma upsert/query)
- SQLite schema and CRUD in `data/models.py` and `data/crud.py`
- artifact endpoints for report, quiz, and podcast in `app.py`
- HF OAuth + session bridge integration in `auth/oauth.py` and `auth/session.py`
- notebook create/rename/delete and notebook-scoped source/thread/artifact routes
- URL ingestion safety controls (scheme allowlist, DNS/IP checks, redirect/body limits)
- URL source auto-ingestion (`processing -> ready/failed`) and file upload ingestion
- per-user authorization checks via `require_current_user`
- Streamlit artifact panel with preview/download/playback controls
- GitHub Actions workflow for deploy to Hugging Face Space
Remaining MVP gaps / hardening:
- citation history display should persist clearly when reloading existing threads
- operational docs/runbook should be updated with final artifact output formats and auth/deploy setup
## 3. Target Architecture (Streamlit + FastAPI)
### 3.1 Frontend (Streamlit)
- `frontend/app.py` remains the primary UI.
- Pages/sections:
  - Auth/session status
  - Notebook manager (create, rename, delete, switch)
  - Source ingestion (upload + URL)
  - Chat panel with citations
  - Artifact panel (generate/list/download/playback)
- Session state stores selected notebook/thread and user identity from OAuth.
### 3.2 Backend (FastAPI)
- Keep `app.py` routers for API boundaries:
  - `/auth/*`
  - `/notebooks/*`
  - `/threads/*`
  - `/sources/*`
  - `/notebooks/{id}/artifacts/*`
- Service layer responsibilities:
  - ingestion orchestration (`src/ingestion/service.py`)
  - RAG retrieval + prompt construction (`query_notebook_chunks`, prompt templates)
  - artifact generation (`src/artifacts/*`)
- Authorization rule: every notebook/thread/source/artifact operation must verify ownership against authenticated user.
### 3.3 Storage
Primary metadata store:
- SQLite (`users`, `notebooks`, `sources`, `chat_threads`, `messages`, `message_citations`, `artifacts`)
Vector store:
- ChromaDB collections with notebook scoping metadata (`user_id`, `notebook_id`, `source_id`, chunk refs)
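
A minimal sketch of how the scoping metadata and the matching query filter could look. The helper names and the `chunk_index` key are hypothetical (the repository may name its chunk references differently), and the filter shape assumes Chroma's `$and` / `$eq` metadata operators:

```python
from typing import Any


def chunk_metadata(user_id: str, notebook_id: str, source_id: str,
                   chunk_index: int) -> dict[str, Any]:
    """Metadata attached to every stored chunk so queries can be scoped later."""
    return {
        "user_id": user_id,
        "notebook_id": notebook_id,
        "source_id": source_id,
        "chunk_index": chunk_index,  # hypothetical name for the chunk reference
    }


def notebook_scope_filter(user_id: str, notebook_id: str) -> dict[str, Any]:
    """Build a metadata filter that matches exactly one user's notebook."""
    if not user_id or not notebook_id:
        # Refuse to build an unscoped filter rather than query everything.
        raise ValueError("user_id and notebook_id are both required")
    return {
        "$and": [
            {"user_id": {"$eq": user_id}},
            {"notebook_id": {"$eq": notebook_id}},
        ]
    }
```

The returned dictionary would typically be passed as the `where` argument of a collection query, so that a missing ID fails loudly instead of silently widening the search.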
File/object storage layout (MVP: local disk or HF `/data`):
```
/data/users/<username>/notebooks/<notebook_uuid>/
  files_raw/
  files_extracted/
  chroma/
  chat/messages.jsonl
  artifacts/reports/
  artifacts/quizzes/
  artifacts/podcasts/
```
## 4. Identity, Auth, and Isolation Plan
### 4.1 Authentication
- Integrate Hugging Face OAuth for user login.
- Map provider identity (`hf_sub` or stable username) to internal `users` row.
- Store session in secure cookie/server session.
### 4.2 Authorization
- Replace free-form `owner_user_id` from UI with server-derived user ID from session.
- Add shared helper (dependency/middleware) to resolve `current_user`.
- Enforce ownership checks in every read/write endpoint.
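
The ownership rule can be illustrated with a framework-agnostic sketch. It assumes a `notebooks` table with `id`, `owner_user_id`, and `title` columns (the column names are guesses, not the actual schema in `data/models.py`) and that `current_user_id` comes from the session, for example via `require_current_user`:

```python
import sqlite3


class NotFoundOrForbidden(Exception):
    """Raised when a notebook is missing or belongs to another user."""


def get_owned_notebook(conn: sqlite3.Connection, notebook_id: str,
                       current_user_id: str) -> sqlite3.Row:
    # The ownership constraint is part of the query, so a row owned by
    # someone else is indistinguishable from a row that does not exist.
    row = conn.execute(
        "SELECT id, owner_user_id, title FROM notebooks "
        "WHERE id = ? AND owner_user_id = ?",
        (notebook_id, current_user_id),
    ).fetchone()
    if row is None:
        raise NotFoundOrForbidden(notebook_id)
    return row


def demo_connection() -> sqlite3.Connection:
    """In-memory database with two notebooks owned by different users."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE notebooks ("
        "id TEXT PRIMARY KEY, owner_user_id TEXT NOT NULL, title TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO notebooks VALUES (?, ?, ?)",
        [("nb-1", "alice", "Alice notes"), ("nb-2", "bob", "Bob notes")],
    )
    return conn
```

Returning the same error for "missing" and "not yours" is a deliberate choice here: it avoids confirming to one user that another user's notebook ID exists.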
### 4.3 Isolation invariants
- DB queries always include ownership constraints.
- Vector queries include `user_id` and `notebook_id` metadata filters.
- File paths are derived from trusted IDs only (never direct user path input).
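
As one possible way to keep file paths derived from trusted IDs only, the sketch below builds a notebook directory from a validated username and a parsed UUID. The username pattern and the `notebook_dir` helper are assumptions for illustration, not the repository's actual rules:

```python
import re
import uuid
from pathlib import Path

DATA_ROOT = Path("/data/users")
# Assumed username rule; adjust to whatever the identity provider guarantees.
USERNAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_-]{0,63}$")


def notebook_dir(username: str, notebook_id: str, root: Path = DATA_ROOT) -> Path:
    """Return the storage directory for a notebook, or raise ValueError."""
    if not USERNAME_RE.fullmatch(username):
        raise ValueError("invalid username")
    # uuid.UUID raises ValueError for anything that is not a UUID string;
    # str() then yields the canonical hyphenated lowercase form.
    canonical_id = str(uuid.UUID(notebook_id))
    path = (root / username / "notebooks" / canonical_id).resolve()
    # Defense in depth: the resolved path must still sit under the data root.
    if not path.is_relative_to(root.resolve()):
        raise ValueError("path escapes data root")
    return path
```

Because both path components are validated before they are joined, a value such as `../../etc/passwd` is rejected at the parsing step rather than relying on the final containment check alone.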
## 5. Functional Requirements and API Plan
### 5.1 Notebook lifecycle
Required:
- create notebook
- list notebooks for current user
- rename notebook
- delete notebook
Backend additions:
- `PATCH /notebooks/{notebook_id}`
- `DELETE /notebooks/{notebook_id}`
### 5.2 Source ingestion
Required:
- upload `.pdf`, `.pptx`, and `.txt` files
- ingest URL sources with safe fetch rules
- extract, chunk, embed, store, mark ready/failed
Backend additions:
- URL validator + SSRF guardrail module (block private IP ranges, non-http(s), large responses)
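
A rough sketch of the URL validation step, using only the standard library. The function names are hypothetical and the injectable `resolve` parameter exists mainly to make the check testable without DNS. It covers the scheme allowlist and IP range checks only; redirect handling, timeouts, and response size caps would belong in the fetch step:

```python
import ipaddress
import socket
from typing import Callable, Iterable
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_public_ip(address: str) -> bool:
    """True only for addresses outside private and special-purpose ranges."""
    ip = ipaddress.ip_address(address.split("%", 1)[0])  # drop any IPv6 zone id
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
    )


def resolve_host(host: str) -> list[str]:
    """Resolve a hostname to the set of IP addresses it currently maps to."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})


def validate_source_url(
    url: str, resolve: Callable[[str], Iterable[str]] = resolve_host
) -> str:
    """Return the URL if it passes the checks, otherwise raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    addresses = list(resolve(host))
    if not addresses:
        raise ValueError("host did not resolve")
    # Every resolved address must be public, not just the first one.
    for address in addresses:
        if not is_public_ip(address):
            raise ValueError(f"blocked address: {address}")
    return parts.geturl()
```

Note that validating before the request does not by itself stop DNS rebinding or redirects to internal hosts; a hardened fetcher would re-run the same check on each redirect target and connect to the address it validated.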
### 5.3 RAG chat with citations
Required:
- retrieve top-k notebook chunks
- generate answer grounded in retrieved context
- return citation metadata and persist messages + citations
Current state:
- mostly implemented in `POST /threads/{thread_id}/chat`
Hardening needed:
- stronger citation formatting in responses
- conversation token budgeting and truncation policy
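
One simple truncation policy, sketched under stated assumptions: keep the most recent messages that fit a fixed budget and drop older ones. The four-characters-per-token estimate is a crude placeholder (a real tokenizer would be more accurate), and both function names are hypothetical:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about four characters per token."""
    return len(text) // 4 + 1


def trim_history(messages: list[dict[str, str]],
                 max_tokens: int) -> list[dict[str, str]]:
    """Keep the newest messages whose combined estimate fits the budget."""
    kept: list[dict[str, str]] = []
    used = 0
    # Walk from newest to oldest so recent context is preferred.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order for the prompt
    return kept
```

The budget passed in would be whatever remains after reserving room for the system prompt, the retrieved chunks, and the expected answer length.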
### 5.4 Artifact generation
Required outputs:
- report (`.md`)
- quiz (`.md` + answer key)
- podcast transcript (`.md`) + audio (`.mp3`)
Current state:
- all three artifact endpoints exist and are wired in Streamlit
- report output is persisted as Markdown (`.md`)
- quiz output is persisted as Markdown (`.md`) including answer key
- podcast persists transcript Markdown (`.md`) and audio (`.mp3`)
Backend additions:
- standard artifact serialization + saved output files under artifact subfolders
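
A minimal sketch of saving a Markdown artifact under the matching artifact subfolder. The `save_text_artifact` helper and the file naming scheme are assumptions for illustration; the subfolder names follow the storage layout described earlier in this spec:

```python
import uuid
from pathlib import Path

# Maps an artifact kind to its subfolder under <notebook>/artifacts/.
ARTIFACT_SUBDIRS = {"report": "reports", "quiz": "quizzes", "podcast": "podcasts"}


def save_text_artifact(notebook_root: Path, kind: str, markdown: str) -> Path:
    """Write Markdown content to a new file and return its path."""
    try:
        subdir = ARTIFACT_SUBDIRS[kind]
    except KeyError:
        raise ValueError(f"unknown artifact kind: {kind!r}") from None
    target_dir = notebook_root / "artifacts" / subdir
    target_dir.mkdir(parents=True, exist_ok=True)
    # A random suffix keeps repeated generations from overwriting each other.
    path = target_dir / f"{kind}-{uuid.uuid4().hex}.md"
    path.write_text(markdown, encoding="utf-8")
    return path
```

The returned path could then be recorded in the `artifacts` table so the Streamlit panel can list and download the file later.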
### 5.5 UI requirements
Required frontend features:
- notebook manager with switching
- source upload + URL ingest
- chat with visible citations
- artifact generate buttons (report/quiz/podcast)
- artifact list with download links
- podcast playback component
- explicit error/retry states
## 6. CI/CD Requirements (GitHub -> HF Space)
- Trigger on push to `main`.
- Sync repository to HF Space via token auth.
- Use GitHub Secrets:
  - `HF_TOKEN`
  - `HF_SPACE_REPO` (example: `username/space-name`)
  - optional: `HF_SPACE_BRANCH` (default `main`)
- Optional pre-deploy check: run tests before sync.
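
A pre-deploy environment check could look roughly like the following. The `load_deploy_config` helper and the repo-name pattern are hypothetical; only the three variable names and the `main` default come from this spec:

```python
import os
import re
from typing import Mapping

# Assumed shape of a Space identifier: "<owner>/<space-name>".
REPO_RE = re.compile(r"^[\w.-]+/[\w.-]+$")


def load_deploy_config(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Validate deploy settings and return the non-secret ones."""
    missing = [name for name in ("HF_TOKEN", "HF_SPACE_REPO") if not env.get(name)]
    if missing:
        raise RuntimeError("missing required secrets: " + ", ".join(missing))
    repo = env["HF_SPACE_REPO"]
    if not REPO_RE.fullmatch(repo):
        raise RuntimeError("HF_SPACE_REPO must look like 'username/space-name'")
    branch = env.get("HF_SPACE_BRANCH") or "main"
    # The token is checked for presence but never returned, so it cannot
    # end up in logs by accident.
    return {"repo": repo, "branch": branch}
```

Running a check like this as the first workflow step makes a misconfigured secret fail fast with a readable message instead of surfacing later as an opaque push error.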
## 7. Milestone Plan
### Milestone 1: Auth + Isolation foundation
- Implement HF OAuth and session plumbing.
- Remove manual `owner_user_id` UI field.
- Add authorization dependency and enforce route coverage.
Exit criteria:
- no endpoint accepts cross-user notebook/thread access.
### Milestone 2: Notebook + Ingestion completeness
- Add notebook rename/delete APIs and UI actions.
- Add SSRF-safe URL ingestion policy.
- Improve ingestion status feedback in UI.
Exit criteria:
- complete notebook lifecycle and safe ingestion of all required source types.
### Milestone 3: RAG + Artifacts
- Improve chat citation UX and persistence views.
- Add report artifact generation + storage.
- Finalize artifact browser/download/audio playback in Streamlit.
Exit criteria:
- all three artifact types are generated, listed, and downloadable/playable.
### Milestone 4: Deployment hardening
- Enable GitHub Actions HF deploy.
- Add smoke test steps and env validation.
- Document operational runbook.
Exit criteria:
- push to `main` updates HF Space automatically.
## 8. Risk Controls
- Cost control: cap tokens, default economical model, per-request limits.
- Ephemeral storage: keep extracted text/chunks to rebuild vectors.
- Prompt injection: treat source text as untrusted and constrain system prompts.
- URL ingestion abuse: protocol allowlist, IP range blocklist, timeout/size caps.
- Dependency risk: pin versions, scan vulnerabilities in CI periodically.
## 9. Build Order (Recommended Next 10 Tasks)
1. implement `current_user` auth dependency
2. wire HF OAuth callbacks
3. replace UI `owner_user_id` with authenticated identity
4. add notebook rename API + UI
5. add notebook delete API + UI confirmation
6. add report artifact generator + endpoint
7. add artifact list/download/playback panel in Streamlit
8. add URL safety validator module for ingestion
9. add integration tests for cross-user isolation
10. enforce CI deploy workflow and add README deployment setup