# NotebookLM Clone Streamlit Architecture Spec
## 1. Scope
This spec defines the target MVP for a Streamlit-based NotebookLM clone with:
- source ingestion (`.pdf`, `.pptx`, `.txt`, URL)
- retrieval-augmented chat with citations
- artifact generation (report, quiz, podcast transcript/audio)
- strict per-user and per-notebook data isolation
- CI/CD deployment from GitHub to Hugging Face Spaces
This document aligns the course requirements, the initial plan PDF, and the current repository implementation.
## 2. Current Baseline (Repo Audit)
Implemented now:
- FastAPI backend with notebook/source/thread/chat endpoints in `app.py`
- Streamlit frontend in `frontend/app.py`
- ingestion pipeline in `src/ingestion/` (extract, chunk, embed, Chroma upsert/query)
- SQLite schema and CRUD in `data/models.py` and `data/crud.py`
- artifact endpoints for report, quiz, and podcast in `app.py`
- HF OAuth + session bridge integration in `auth/oauth.py` and `auth/session.py`
- notebook create/rename/delete and notebook-scoped source/thread/artifact routes
- URL ingestion safety controls (scheme allowlist, DNS/IP checks, redirect/body limits)
- URL source auto-ingestion (`processing -> ready/failed`) and file upload ingestion
- per-user authorization checks via `require_current_user`
- Streamlit artifact panel with preview/download/playback controls
- GitHub Actions workflow for deploy to Hugging Face Space
Remaining MVP gaps / hardening:
- citation history display should persist clearly when reloading existing threads
- operational docs/runbook should be updated with final artifact output formats and auth/deploy setup
## 3. Target Architecture (Streamlit + FastAPI)
### 3.1 Frontend (Streamlit)
- `frontend/app.py` remains the primary UI.
- Pages/sections:
  - Auth/session status
  - Notebook manager (create, rename, delete, switch)
  - Source ingestion (upload + URL)
  - Chat panel with citations
  - Artifact panel (generate/list/download/playback)
- Session state stores selected notebook/thread and user identity from OAuth.
### 3.2 Backend (FastAPI)
- Keep `app.py` routers for API boundaries:
  - `/auth/*`
  - `/notebooks/*`
  - `/threads/*`
  - `/sources/*`
  - `/notebooks/{id}/artifacts/*`
- Service layer responsibilities:
  - ingestion orchestration (`src/ingestion/service.py`)
  - RAG retrieval + prompt construction (`query_notebook_chunks`, prompt templates)
  - artifact generation (`src/artifacts/*`)
- Authorization rule: every notebook/thread/source/artifact operation must verify ownership against authenticated user.
### 3.3 Storage
Primary metadata store:
- SQLite (`users`, `notebooks`, `sources`, `chat_threads`, `messages`, `message_citations`, `artifacts`)
Vector store:
- ChromaDB collections with notebook scoping metadata (`user_id`, `notebook_id`, `source_id`, chunk refs)
File/object storage layout (MVP local/HF `/data`):
```
/data/users/<username>/notebooks/<notebook_uuid>/
  files_raw/
  files_extracted/
  chroma/
  chat/messages.jsonl
  artifacts/reports/
  artifacts/quizzes/
  artifacts/podcasts/
```
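
A minimal sketch of how the layout above could be built from trusted IDs. The function names (`notebook_root`, `ensure_layout`) and the username pattern are assumptions for illustration, not names taken from the repository.

```python
import re
import uuid
from pathlib import Path

# Subfolders copied from the layout above.
SUBDIRS = (
    "files_raw",
    "files_extracted",
    "chroma",
    "chat",
    "artifacts/reports",
    "artifacts/quizzes",
    "artifacts/podcasts",
)

# Assumption: usernames are limited to a conservative character set.
_USERNAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_-]{0,63}$")


def notebook_root(data_root: Path, username: str, notebook_uuid: str) -> Path:
    """Return the notebook directory, rejecting IDs that could escape data_root."""
    if not _USERNAME_RE.fullmatch(username):
        raise ValueError(f"invalid username: {username!r}")
    # uuid.UUID raises ValueError for anything that is not a well-formed UUID.
    canonical = str(uuid.UUID(notebook_uuid))
    return data_root / "users" / username / "notebooks" / canonical


def ensure_layout(root: Path) -> list[Path]:
    """Create every subfolder of the layout and return the resulting paths."""
    created = []
    for sub in SUBDIRS:
        path = root / sub
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Validating both IDs before joining them means a value such as `../etc` never reaches the filesystem call.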
## 4. Identity, Auth, and Isolation Plan
### 4.1 Authentication
- Integrate Hugging Face OAuth for user login.
- Map provider identity (`hf_sub` or stable username) to internal `users` row.
- Store session in secure cookie/server session.
### 4.2 Authorization
- Replace free-form `owner_user_id` from UI with server-derived user ID from session.
- Add shared helper (dependency/middleware) to resolve `current_user`.
- Enforce ownership checks in every read/write endpoint.
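
One way the ownership check could look at the query level, sketched with the standard-library `sqlite3` module. The `title` column, the helper name `get_owned_notebook`, and the exception class are hypothetical; only `owner_user_id` and the `notebooks` table come from this spec.

```python
import sqlite3


class NotFoundOrForbidden(Exception):
    """Raised when a notebook does not exist or belongs to another user."""


def get_owned_notebook(
    conn: sqlite3.Connection, notebook_id: int, current_user_id: int
) -> dict:
    """Fetch a notebook only if it belongs to the session-derived user."""
    row = conn.execute(
        "SELECT id, title, owner_user_id FROM notebooks "
        "WHERE id = ? AND owner_user_id = ?",
        (notebook_id, current_user_id),
    ).fetchone()
    if row is None:
        # Same error for "missing" and "not yours" so IDs cannot be probed.
        raise NotFoundOrForbidden(f"notebook {notebook_id} not available")
    return {"id": row[0], "title": row[1], "owner_user_id": row[2]}
```

Because `current_user_id` is passed in from the session rather than read from the request body, a client cannot claim another user's notebooks by editing a form field.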
### 4.3 Isolation invariants
- DB queries always include ownership constraints.
- Vector queries include `user_id` and `notebook_id` metadata filters.
- File paths are derived from trusted IDs only (never direct user path input).
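
The vector-query invariant can be enforced by building the metadata filter in one place. The sketch below assumes Chroma's `where` filter syntax with `$and` and `$eq` operators; the helper name `notebook_where` is hypothetical.

```python
from typing import Any, Optional


def notebook_where(
    user_id: str, notebook_id: str, source_id: Optional[str] = None
) -> dict[str, Any]:
    """Build a metadata filter that always pins user_id and notebook_id."""
    if not user_id or not notebook_id:
        raise ValueError("user_id and notebook_id are both required")
    clauses: list[dict[str, Any]] = [
        {"user_id": {"$eq": user_id}},
        {"notebook_id": {"$eq": notebook_id}},
    ]
    if source_id is not None:
        # Optional narrowing to a single source within the notebook.
        clauses.append({"source_id": {"$eq": source_id}})
    return {"$and": clauses}
```

The returned dictionary would then be passed as the `where` argument of a collection query, so no call site can forget the scoping filters.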
## 5. Functional Requirements and API Plan
### 5.1 Notebook lifecycle
Required:
- create notebook
- list notebooks for current user
- rename notebook
- delete notebook
Backend additions:
- `PATCH /notebooks/{notebook_id}`
- `DELETE /notebooks/{notebook_id}`
### 5.2 Source ingestion
Required:
- upload `.pdf`, `.pptx`, and `.txt` files
- ingest URL sources with safe fetch rules
- extract, chunk, embed, store, mark ready/failed
Backend additions:
- URL validator + SSRF guardrail module (reject non-http(s) schemes, block private IP ranges, cap response size)
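
A minimal sketch of such a guardrail using only the standard library. The function names, the injectable `resolver` parameter, and the 5 MiB cap are assumptions for illustration; the existing module in the repository may differ.

```python
import ipaddress
import socket
from typing import BinaryIO, Callable, Iterable
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}
MAX_BODY_BYTES = 5 * 1024 * 1024  # assumed cap; tune for the deployment


def resolve_host(host: str) -> list[str]:
    """Resolve a hostname to its IP addresses using the system resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_public_ip(address: str) -> bool:
    """True only for globally routable addresses (not private, loopback, link-local)."""
    return ipaddress.ip_address(address).is_global


def validate_url(
    url: str, resolver: Callable[[str], Iterable[str]] = resolve_host
) -> str:
    """Return the URL unchanged if it passes scheme and address checks."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    addresses = list(resolver(host))
    if not addresses:
        raise ValueError("host did not resolve")
    for address in addresses:
        # Reject if ANY resolved address is non-public.
        if not is_public_ip(address):
            raise ValueError(f"blocked address: {address}")
    return url


def read_capped(stream: BinaryIO, limit: int = MAX_BODY_BYTES) -> bytes:
    """Read at most `limit` bytes, failing if the body is larger."""
    data = stream.read(limit + 1)
    if len(data) > limit:
        raise ValueError("response body exceeds size cap")
    return data
```

The same validation would need to run again on every redirect hop, since a public URL can redirect to an internal address.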
### 5.3 RAG chat with citations
Required:
- retrieve top-k notebook chunks
- generate answer grounded in retrieved context
- return citation metadata and persist messages + citations
Current state:
- mostly implemented in `POST /threads/{thread_id}/chat`
Hardening needed:
- stronger citation formatting in responses
- conversation token budgeting and truncation policy
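
One possible truncation policy, sketched under the assumption that a rough character-based estimate is acceptable: keep the newest messages that fit within a fixed budget. The four-characters-per-token heuristic and the function names are illustrative; a real tokenizer for the chosen model would be more accurate.

```python
from typing import TypedDict


class Message(TypedDict):
    role: str
    content: str


def estimate_tokens(text: str) -> int:
    """Crude heuristic: about four characters per token, never less than one."""
    return max(1, len(text) // 4)


def truncate_history(messages: list[Message], budget: int) -> list[Message]:
    """Keep the newest messages whose estimated tokens fit within budget."""
    kept: list[Message] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that overflows.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Dropping whole messages from the oldest end keeps the most recent turns intact; summarizing the dropped prefix instead is an alternative this sketch does not cover.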
### 5.4 Artifact generation
Required outputs:
- report (`.md`)
- quiz (`.md` + answer key)
- podcast transcript (`.md`) + audio (`.mp3`)
Current state:
- all three artifact endpoints exist and are wired in Streamlit
- report output is persisted as Markdown (`.md`)
- quiz output is persisted as Markdown (`.md`) including answer key
- podcast persists transcript Markdown (`.md`) and audio (`.mp3`)
Backend additions:
- standard artifact serialization + saved output files under artifact subfolders
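
A sketch of saving a Markdown artifact under the matching subfolder. The filename scheme (UTC timestamp plus slug) and the names `save_markdown_artifact` and `slugify` are assumptions, not the repository's actual serialization code.

```python
import re
from datetime import datetime, timezone
from pathlib import Path

# Subfolder per artifact type, matching the storage layout in section 3.3.
ARTIFACT_DIRS = {"report": "reports", "quiz": "quizzes", "podcast": "podcasts"}


def slugify(title: str) -> str:
    """Reduce a title to a filesystem-safe slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def save_markdown_artifact(
    notebook_root: Path, kind: str, title: str, body: str
) -> Path:
    """Write the artifact body as .md and return the path that was written."""
    if kind not in ARTIFACT_DIRS:
        raise ValueError(f"unknown artifact kind: {kind!r}")
    folder = notebook_root / "artifacts" / ARTIFACT_DIRS[kind]
    folder.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: sortable UTC timestamp, then the slug.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = folder / f"{stamp}-{slugify(title)}.md"
    path.write_text(body, encoding="utf-8")
    return path
```

The returned path can be stored in the `artifacts` table so the Streamlit panel can offer it for download.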
### 5.5 UI requirements
Required frontend features:
- notebook manager with switching
- source upload + URL ingest
- chat with visible citations
- artifact generate buttons (report/quiz/podcast)
- artifact list with download links
- podcast playback component
- explicit error/retry states
## 6. CI/CD Requirements (GitHub -> HF Space)
- Trigger on push to `main`.
- Sync repository to HF Space via token auth.
- Use GitHub Secrets:
  - `HF_TOKEN`
  - `HF_SPACE_REPO` (example: `username/space-name`)
  - optional: `HF_SPACE_BRANCH` (default `main`)
- Optional pre-deploy check: run tests before sync.
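
A pre-deploy step could validate these settings before any sync runs. The sketch below is a hypothetical helper, not part of the existing workflow; the repo ID pattern is an assumption based on the `username/space-name` example.

```python
import os
import re
from typing import Mapping

# Assumption: a Space repo ID has the form "owner/name".
_REPO_RE = re.compile(r"^[A-Za-z0-9][\w.-]*/[A-Za-z0-9][\w.-]*$")


def load_deploy_config(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Validate deploy settings and return them, raising on the first problem."""
    token = env.get("HF_TOKEN", "").strip()
    repo = env.get("HF_SPACE_REPO", "").strip()
    # HF_SPACE_BRANCH is optional and falls back to "main".
    branch = env.get("HF_SPACE_BRANCH", "").strip() or "main"
    if not token:
        raise RuntimeError("HF_TOKEN is not set")
    if not _REPO_RE.fullmatch(repo):
        raise RuntimeError("HF_SPACE_REPO must look like 'username/space-name'")
    return {"token": token, "repo": repo, "branch": branch}
```

Failing fast here gives a clear error in the workflow log instead of an opaque authentication failure during the push.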
## 7. Milestone Plan
### Milestone 1: Auth + Isolation foundation
- Implement HF OAuth and session plumbing.
- Remove manual `owner_user_id` UI field.
- Add authorization dependency and enforce route coverage.
Exit criteria:
- no endpoint accepts cross-user notebook/thread access.
### Milestone 2: Notebook + Ingestion completeness
- Add notebook rename/delete APIs and UI actions.
- Add SSRF-safe URL ingestion policy.
- Improve ingestion status feedback in UI.
Exit criteria:
- complete notebook lifecycle and safe ingestion of all required source types.
### Milestone 3: RAG + Artifacts
- Improve chat citation UX and persistence views.
- Add report artifact generation + storage.
- Finalize artifact browser/download/audio playback in Streamlit.
Exit criteria:
- all three artifact types are generated, listed, and downloadable/playable.
### Milestone 4: Deployment hardening
- Enable GitHub Actions HF deploy.
- Add smoke test steps and env validation.
- Document operational runbook.
Exit criteria:
- push to `main` updates HF Space automatically.
## 8. Risk Controls
- Cost control: cap tokens, default economical model, per-request limits.
- Ephemeral storage: keep extracted text/chunks to rebuild vectors.
- Prompt injection: treat source text as untrusted and constrain system prompts.
- URL ingestion abuse: protocol allowlist, IP range blocklist, timeout/size caps.
- Dependency risk: pin versions, scan vulnerabilities in CI periodically.
## 9. Build Order (Recommended Next 10 Tasks)
1. implement `current_user` auth dependency
2. wire HF OAuth callbacks
3. replace UI `owner_user_id` with authenticated identity
4. add notebook rename API + UI
5. add notebook delete API + UI confirmation
6. add report artifact generator + endpoint
7. add artifact list/download/playback panel in Streamlit
8. add URL safety validator module for ingestion
9. add integration tests for cross-user isolation
10. enforce CI deploy workflow and add README deployment setup