Spaces:

pbichpur
/

NotebookLMClone

Sleeping

File size: 7,770 Bytes

aacd162

# NotebookLM Clone Streamlit Architecture Spec

## 1. Scope
This spec defines the target MVP for a Streamlit-based NotebookLM clone with:
- source ingestion (`.pdf`, `.pptx`, `.txt`, URL)
- retrieval-augmented chat with citations
- artifact generation (report, quiz, podcast transcript/audio)
- strict per-user and per-notebook data isolation
- CI/CD deployment from GitHub to Hugging Face Spaces

This document aligns your course requirements, the initial plan PDF, and the current repository implementation.

## 2. Current Baseline (Repo Audit)
Implemented now:
- FastAPI backend with notebook/source/thread/chat endpoints in `app.py`
- Streamlit frontend in `frontend/app.py`
- ingestion pipeline in `src/ingestion/` (extract, chunk, embed, Chroma upsert/query)
- SQLite schema and CRUD in `data/models.py` and `data/crud.py`
- artifact endpoints for report, quiz, and podcast in `app.py`
- HF OAuth + session bridge integration in `auth/oauth.py` and `auth/session.py`
- notebook create/rename/delete and notebook-scoped source/thread/artifact routes
- URL ingestion safety controls (scheme allowlist, DNS/IP checks, redirect/body limits)
- URL source auto-ingestion (`processing -> ready/failed`) and file upload ingestion
- per-user authorization checks via `require_current_user`
- Streamlit artifact panel with preview/download/playback controls
- GitHub Actions workflow for deploy to Hugging Face Space

Remaining MVP gaps / hardening:
- citation history display should persist clearly when reloading existing threads
- operational docs/runbook should be updated with final artifact output formats and auth/deploy setup

## 3. Target Architecture (Streamlit + FastAPI)

### 3.1 Frontend (Streamlit)
- `frontend/app.py` remains the primary UI.
- Pages/sections:
  - Auth/session status
  - Notebook manager (create, rename, delete, switch)
  - Source ingestion (upload + URL)
  - Chat panel with citations
  - Artifact panel (generate/list/download/playback)
- Session state stores selected notebook/thread and user identity from OAuth.

### 3.2 Backend (FastAPI)
- Keep `app.py` routers for API boundaries:
  - `/auth/*`
  - `/notebooks/*`
  - `/threads/*`
  - `/sources/*`
  - `/notebooks/{id}/artifacts/*`
- Service layer responsibilities:
  - ingestion orchestration (`src/ingestion/service.py`)
  - RAG retrieval + prompt construction (`query_notebook_chunks`, prompt templates)
  - artifact generation (`src/artifacts/*`)
- Authorization rule: every notebook/thread/source/artifact operation must verify ownership against authenticated user.

### 3.3 Storage
Primary metadata store:
- SQLite (`users`, `notebooks`, `sources`, `chat_threads`, `messages`, `message_citations`, `artifacts`)

Vector store:
- ChromaDB collections with notebook scoping metadata (`user_id`, `notebook_id`, `source_id`, chunk refs)

File/object storage layout (MVP local/HF `/data`):
```
/data/users/<username>/notebooks/<notebook_uuid>/
  files_raw/
  files_extracted/
  chroma/
  chat/messages.jsonl
  artifacts/reports/
  artifacts/quizzes/
  artifacts/podcasts/
```

## 4. Identity, Auth, and Isolation Plan

### 4.1 Authentication
- Integrate Hugging Face OAuth for user login.
- Map provider identity (`hf_sub` or stable username) to internal `users` row.
- Store session in secure cookie/server session.

### 4.2 Authorization
- Replace free-form `owner_user_id` from UI with server-derived user ID from session.
- Add shared helper (dependency/middleware) to resolve `current_user`.
- Enforce ownership checks in every read/write endpoint.

### 4.3 Isolation invariants
- DB queries always include ownership constraints.
- Vector queries include `user_id` and `notebook_id` metadata filters.
- File paths are derived from trusted IDs only (never direct user path input).

## 5. Functional Requirements and API Plan

### 5.1 Notebook lifecycle
Required:
- create notebook
- list notebooks for current user
- rename notebook
- delete notebook

Backend additions:
- `PATCH /notebooks/{notebook_id}`
- `DELETE /notebooks/{notebook_id}`

### 5.2 Source ingestion
Required:
- upload `.pdf/.pptx/.txt` files
- ingest URL sources with safe fetch rules
- extract, chunk, embed, store, mark ready/failed

Backend additions:
- URL validator + SSRF guardrail module (block private IP ranges, non-http(s), large responses)

### 5.3 RAG chat with citations
Required:
- retrieve top-k notebook chunks
- generate answer grounded in retrieved context
- return citation metadata and persist messages + citations

Current state:
- mostly implemented in `POST /threads/{thread_id}/chat`

Hardening needed:
- stronger citation formatting in responses
- conversation token budgeting and truncation policy

### 5.4 Artifact generation
Required outputs:
- report (`.md`)
- quiz (`.md` + answer key)
- podcast transcript (`.md`) + audio (`.mp3`)

Current state:
- all three artifact endpoints exist and are wired in Streamlit
- report output is persisted as Markdown (`.md`)
- quiz output is persisted as Markdown (`.md`) including answer key
- podcast persists transcript Markdown (`.md`) and audio (`.mp3`)

Backend additions:
- standard artifact serialization + saved output files under artifact subfolders

### 5.5 UI requirements
Required frontend features:
- notebook manager with switching
- source upload + URL ingest
- chat with visible citations
- artifact generate buttons (report/quiz/podcast)
- artifact list with download links
- podcast playback component
- explicit error/retry states

## 6. CI/CD Requirements (GitHub -> HF Space)
- Trigger on push to `main`.
- Sync repository to HF Space via token auth.
- Use GitHub Secrets:
  - `HF_TOKEN`
  - `HF_SPACE_REPO` (example: `username/space-name`)
  - optional: `HF_SPACE_BRANCH` (default `main`)
- Optional pre-deploy check: run tests before sync.

## 7. Milestone Plan

### Milestone 1: Auth + Isolation foundation
- Implement HF OAuth and session plumbing.
- Remove manual `owner_user_id` UI field.
- Add authorization dependency and enforce route coverage.

Exit criteria:
- no endpoint accepts cross-user notebook/thread access.

### Milestone 2: Notebook + Ingestion completeness
- add notebook rename/delete APIs and UI actions.
- add SSRF-safe URL ingestion policy.
- improve ingestion status feedback in UI.

Exit criteria:
- complete notebook lifecycle and safe ingestion of all required source types.

### Milestone 3: RAG + Artifacts
- improve chat citation UX and persistence views.
- add report artifact generation + storage.
- finalize artifact browser/download/audio playback in Streamlit.

Exit criteria:
- all three artifact types are generated, listed, and downloadable/playable.

### Milestone 4: Deployment hardening
- enable GitHub Actions HF deploy.
- add smoke test steps and env validation.
- document operational runbook.

Exit criteria:
- push to `main` updates HF Space automatically.

## 8. Risk Controls
- Cost control: cap tokens, default economical model, per-request limits.
- Ephemeral storage: keep extracted text/chunks to rebuild vectors.
- Prompt injection: treat source text as untrusted and constrain system prompts.
- URL ingestion abuse: protocol allowlist, IP range blocklist, timeout/size caps.
- Dependency risk: pin versions, scan vulnerabilities in CI periodically.

## 9. Build Order (Recommended Next 10 Tasks)
1. implement `current_user` auth dependency
2. wire HF OAuth callbacks
3. replace UI `owner_user_id` with authenticated identity
4. add notebook rename API + UI
5. add notebook delete API + UI confirmation
6. add report artifact generator + endpoint
7. add artifact list/download/playback panel in Streamlit
8. add URL safety validator module for ingestion
9. add integration tests for cross-user isolation
10. enforce CI deploy workflow and add README deployment setup