# NotebookLM Clone Streamlit Architecture Spec

## 1. Scope
This spec defines the target MVP for a Streamlit-based NotebookLM clone with:
- source ingestion (`.pdf`, `.pptx`, `.txt`, URL)
- retrieval-augmented chat with citations
- artifact generation (report, quiz, podcast transcript/audio)
- strict per-user and per-notebook data isolation
- CI/CD deployment from GitHub to Hugging Face Spaces

This document aligns the course requirements, the initial plan PDF, and the current repository implementation.

## 2. Current Baseline (Repo Audit)
Implemented now:
- FastAPI backend with notebook/source/thread/chat endpoints in `app.py`
- Streamlit frontend in `frontend/app.py`
- ingestion pipeline in `src/ingestion/` (extract, chunk, embed, Chroma upsert/query)
- SQLite schema and CRUD in `data/models.py` and `data/crud.py`
- artifact endpoints for report, quiz, and podcast in `app.py`
- HF OAuth + session bridge integration in `auth/oauth.py` and `auth/session.py`
- notebook create/rename/delete and notebook-scoped source/thread/artifact routes
- URL ingestion safety controls (scheme allowlist, DNS/IP checks, redirect/body limits)
- URL source auto-ingestion (`processing -> ready/failed`) and file upload ingestion
- per-user authorization checks via `require_current_user`
- Streamlit artifact panel with preview/download/playback controls
- GitHub Actions workflow for deploy to Hugging Face Space

Remaining MVP gaps / hardening:
- citation history display should persist clearly when reloading existing threads
- operational docs/runbook should be updated with final artifact output formats and auth/deploy setup

## 3. Target Architecture (Streamlit + FastAPI)

### 3.1 Frontend (Streamlit)
- `frontend/app.py` remains the primary UI.
- Pages/sections:
  - Auth/session status
  - Notebook manager (create, rename, delete, switch)
  - Source ingestion (upload + URL)
  - Chat panel with citations
  - Artifact panel (generate/list/download/playback)
- Session state stores selected notebook/thread and user identity from OAuth.

### 3.2 Backend (FastAPI)
- Keep `app.py` routers for API boundaries:
  - `/auth/*`
  - `/notebooks/*`
  - `/threads/*`
  - `/sources/*`
  - `/notebooks/{id}/artifacts/*`
- Service layer responsibilities:
  - ingestion orchestration (`src/ingestion/service.py`)
  - RAG retrieval + prompt construction (`query_notebook_chunks`, prompt templates)
  - artifact generation (`src/artifacts/*`)
- Authorization rule: every notebook/thread/source/artifact operation must verify ownership against the authenticated user.

### 3.3 Storage
Primary metadata store:
- SQLite (`users`, `notebooks`, `sources`, `chat_threads`, `messages`, `message_citations`, `artifacts`)

Vector store:
- ChromaDB collections with notebook scoping metadata (`user_id`, `notebook_id`, `source_id`, chunk refs)

File/object storage layout (MVP local/HF `/data`):
```
/data/users/<username>/notebooks/<notebook_uuid>/
  files_raw/
  files_extracted/
  chroma/
  chat/messages.jsonl
  artifacts/reports/
  artifacts/quizzes/
  artifacts/podcasts/
```
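
A minimal sketch of how this layout could be created from validated identifiers only. The function names, the username pattern, and the `root` parameter are assumptions for illustration, not the repository's actual helpers.

```python
import re
import uuid
from pathlib import Path

DATA_ROOT = Path("/data")  # storage root taken from the layout above

# Hypothetical username allowlist; adjust to the identity provider's real rules.
_USERNAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,63}")

# Subdirectories taken from the layout above.
SUBDIRS = (
    "files_raw",
    "files_extracted",
    "chroma",
    "chat",
    "artifacts/reports",
    "artifacts/quizzes",
    "artifacts/podcasts",
)


def notebook_root(username: str, notebook_id: str, root: Path = DATA_ROOT) -> Path:
    """Build the notebook directory from validated identifiers only."""
    if not _USERNAME_RE.fullmatch(username):
        raise ValueError(f"invalid username: {username!r}")
    # uuid.UUID raises ValueError for anything that is not a UUID,
    # which rules out path separators and ".." segments.
    canonical_id = str(uuid.UUID(notebook_id))
    return root / "users" / username / "notebooks" / canonical_id


def ensure_layout(username: str, notebook_id: str, root: Path = DATA_ROOT) -> Path:
    """Create every subdirectory of the layout and return the notebook root."""
    base = notebook_root(username, notebook_id, root)
    for sub in SUBDIRS:
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base
```

Passing `root` explicitly keeps the helper testable against a temporary directory instead of the real `/data` volume.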

## 4. Identity, Auth, and Isolation Plan

### 4.1 Authentication
- Integrate Hugging Face OAuth for user login.
- Map provider identity (`hf_sub` or stable username) to internal `users` row.
- Store the session in a secure cookie or a server-side session.

### 4.2 Authorization
- Replace the free-form `owner_user_id` supplied by the UI with a user ID derived server-side from the session.
- Add shared helper (dependency/middleware) to resolve `current_user`.
- Enforce ownership checks in every read/write endpoint.

### 4.3 Isolation invariants
- DB queries always include ownership constraints.
- Vector queries include `user_id` and `notebook_id` metadata filters.
- File paths are derived from trusted IDs only (never direct user path input).
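
The first two invariants can be sketched as follows. The `notebooks` table and the `owner_user_id` name come from this spec; the `id` and `title` columns and both function names are assumptions. The filter uses Chroma's `where` operator syntax (`$and`, `$eq`) but is built as a plain dict, so nothing here depends on the vector store being installed.

```python
import sqlite3


def get_owned_notebook(conn: sqlite3.Connection, notebook_id: str, user_id: int):
    """Return the notebook row only if it belongs to user_id, else None."""
    # The ownership constraint is part of the WHERE clause, so a notebook
    # owned by someone else is indistinguishable from a missing one.
    return conn.execute(
        "SELECT id, title FROM notebooks WHERE id = ? AND owner_user_id = ?",
        (notebook_id, user_id),
    ).fetchone()


def chunk_filter(user_id: int, notebook_id: str) -> dict:
    """Metadata filter that scopes a vector query to one user's notebook."""
    return {
        "$and": [
            {"user_id": {"$eq": user_id}},
            {"notebook_id": {"$eq": notebook_id}},
        ]
    }
```

A route handler would call `get_owned_notebook` first and respond with "not found" when it returns `None`, then pass `chunk_filter(...)` as the `where` argument of the collection query.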

## 5. Functional Requirements and API Plan

### 5.1 Notebook lifecycle
Required:
- create notebook
- list notebooks for current user
- rename notebook
- delete notebook

Backend additions:
- `PATCH /notebooks/{notebook_id}`
- `DELETE /notebooks/{notebook_id}`

### 5.2 Source ingestion
Required:
- upload `.pdf`, `.pptx`, and `.txt` files
- ingest URL sources with safe fetch rules
- extract, chunk, embed, store, mark ready/failed

Backend additions:
- URL validator + SSRF guardrail module (block private IP ranges, non-http(s) schemes, and oversized responses)
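
One way such a guardrail could look, using only the standard library. This is a sketch rather than the module in the repository: `check_url`, `resolve_host`, and the injectable `resolver` parameter are hypothetical names, and response-size and redirect limits would still have to be enforced by the code that performs the fetch.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def resolve_host(host: str) -> list[str]:
    """Resolve a hostname to every address it maps to (IPv4 and IPv6)."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def check_url(url: str, resolver=resolve_host) -> None:
    """Raise ValueError unless the URL is http(s) and resolves only to public IPs."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    addresses = resolver(parts.hostname)
    if not addresses:
        raise ValueError("host did not resolve")
    for address in addresses:
        # is_global is False for loopback, private, link-local, and other
        # special-purpose ranges, which covers cloud metadata endpoints too.
        if not ipaddress.ip_address(address).is_global:
            raise ValueError(f"non-public address: {address}")
```

Every resolved address is checked, not just the first, because a hostname can map to a mix of public and private records. The same check should be repeated for each redirect target, since a public URL can redirect to an internal one.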

### 5.3 RAG chat with citations
Required:
- retrieve top-k notebook chunks
- generate answer grounded in retrieved context
- return citation metadata and persist messages + citations
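
The three requirements above can be sketched as a small pure function pair. The `Chunk` fields, the function names, and the prompt wording are assumptions; the sketch also assumes the vector store reports a distance where lower means more similar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source_id: str
    chunk_index: int
    text: str
    distance: float  # assumed: lower distance means a closer match


def top_k(chunks, k=4):
    """Return the k closest chunks, nearest first."""
    return sorted(chunks, key=lambda c: c.distance)[:k]


def build_prompt(question, chunks):
    """Return (prompt, citations); citation n refers to marker [n] in the prompt."""
    lines = ["Answer using only the numbered context. Cite sources as [n].", ""]
    citations = []
    for n, chunk in enumerate(chunks, start=1):
        lines.append(f"[{n}] {chunk.text}")
        citations.append(
            {"marker": n, "source_id": chunk.source_id, "chunk_index": chunk.chunk_index}
        )
    lines += ["", f"Question: {question}"]
    return "\n".join(lines), citations
```

Returning the citation list alongside the prompt lets the endpoint persist exactly the markers the model was shown, so `[n]` in a stored answer can always be mapped back to a source chunk.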

Current state:
- mostly implemented in `POST /threads/{thread_id}/chat`

Hardening needed:
- stronger citation formatting in responses
- conversation token budgeting and truncation policy
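
A truncation policy for conversation history could be as simple as keeping the newest messages that fit a budget. The sketch below assumes messages are dicts with a `content` key and uses a rough characters-per-token heuristic in place of a real tokenizer; both are illustrative choices.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def truncate_history(messages, budget):
    """Keep the newest messages whose estimated tokens fit within budget."""
    kept = []
    used = 0
    # Walk from newest to oldest and stop at the first message that does not fit,
    # so the retained history is always a contiguous tail of the conversation.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Stopping at the first message that does not fit, rather than skipping it and continuing, avoids handing the model a history with gaps in the middle.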

### 5.4 Artifact generation
Required outputs:
- report (`.md`)
- quiz (`.md` + answer key)
- podcast transcript (`.md`) + audio (`.mp3`)

Current state:
- all three artifact endpoints exist and are wired in Streamlit
- report output is persisted as Markdown (`.md`)
- quiz output is persisted as Markdown (`.md`) including answer key
- podcast persists transcript Markdown (`.md`) and audio (`.mp3`)

Backend additions:
- standard artifact serialization + saved output files under artifact subfolders
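
As an illustration of what a standard serialization step might involve, the sketch below writes a Markdown artifact into its subfolder and returns a record suitable for the `artifacts` table. The function name, the record fields, and the assumption that `artifact_id` is generated server-side are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path

# Artifact type -> subfolder, following the storage layout in this spec.
ARTIFACT_DIRS = {"report": "reports", "quiz": "quizzes", "podcast": "podcasts"}


def save_markdown_artifact(notebook_dir: Path, kind: str, artifact_id: str, markdown: str) -> dict:
    """Write the Markdown output and return a metadata record for it."""
    if kind not in ARTIFACT_DIRS:
        raise ValueError(f"unknown artifact type: {kind!r}")
    out_dir = notebook_dir / "artifacts" / ARTIFACT_DIRS[kind]
    out_dir.mkdir(parents=True, exist_ok=True)
    # artifact_id is assumed to be server-generated, never raw user input.
    path = out_dir / f"{artifact_id}.md"
    path.write_text(markdown, encoding="utf-8")
    return {
        "id": artifact_id,
        "type": kind,
        # Store a path relative to the notebook so records survive a moved root.
        "path": path.relative_to(notebook_dir).as_posix(),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```

Podcast audio would need a parallel helper that writes bytes to an `.mp3` file next to the transcript; it is left out here to keep the sketch short.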

### 5.5 UI requirements
Required frontend features:
- notebook manager with switching
- source upload + URL ingest
- chat with visible citations
- artifact generate buttons (report/quiz/podcast)
- artifact list with download links
- podcast playback component
- explicit error/retry states

## 6. CI/CD Requirements (GitHub -> HF Space)
- Trigger on push to `main`.
- Sync repository to HF Space via token auth.
- Use GitHub Secrets:
  - `HF_TOKEN`
  - `HF_SPACE_REPO` (example: `username/space-name`)
  - optional: `HF_SPACE_BRANCH` (default `main`)
- Optional pre-deploy check: run tests before sync.
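
A pre-deploy step could validate these settings before any sync is attempted. The sketch below is one possible shape: `load_deploy_config` and the repository-name pattern are assumptions, and the function takes a mapping so it can be called with `os.environ` in CI or a plain dict in tests.

```python
import re

# Hypothetical pattern for "username/space-name"; loosen or tighten as needed.
REPO_RE = re.compile(r"[A-Za-z0-9][\w.-]*/[A-Za-z0-9][\w.-]*")


def load_deploy_config(env) -> dict:
    """Validate deploy settings from an environment-like mapping."""
    missing = [name for name in ("HF_TOKEN", "HF_SPACE_REPO") if not env.get(name)]
    if missing:
        # Report names only; never echo secret values into CI logs.
        raise ValueError("missing required secrets: " + ", ".join(missing))
    repo = env["HF_SPACE_REPO"]
    if not REPO_RE.fullmatch(repo):
        raise ValueError("HF_SPACE_REPO must look like username/space-name")
    return {
        "token": env["HF_TOKEN"],
        "repo": repo,
        "branch": env.get("HF_SPACE_BRANCH") or "main",  # optional, defaults to main
    }
```

Failing fast here produces a clear error in the workflow log instead of an opaque authentication or push failure later in the job.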

## 7. Milestone Plan

### Milestone 1: Auth + Isolation foundation
- Implement HF OAuth and session plumbing.
- Remove manual `owner_user_id` UI field.
- Add authorization dependency and enforce route coverage.

Exit criteria:
- no endpoint accepts cross-user notebook/thread access.

### Milestone 2: Notebook + Ingestion completeness
- add notebook rename/delete APIs and UI actions.
- add SSRF-safe URL ingestion policy.
- improve ingestion status feedback in UI.

Exit criteria:
- complete notebook lifecycle and safe ingestion of all required source types.

### Milestone 3: RAG + Artifacts
- improve chat citation UX and persistence views.
- add report artifact generation + storage.
- finalize artifact browser/download/audio playback in Streamlit.

Exit criteria:
- all three artifact types are generated, listed, and downloadable/playable.

### Milestone 4: Deployment hardening
- enable GitHub Actions HF deploy.
- add smoke test steps and env validation.
- document operational runbook.

Exit criteria:
- push to `main` updates HF Space automatically.

## 8. Risk Controls
- Cost control: cap tokens, default to an economical model, and enforce per-request limits.
- Ephemeral storage: keep extracted text/chunks so vectors can be rebuilt if the vector store is lost.
- Prompt injection: treat source text as untrusted and constrain system prompts.
- URL ingestion abuse: protocol allowlist, IP range blocklist, timeout/size caps.
- Dependency risk: pin versions, scan vulnerabilities in CI periodically.

## 9. Build Order (Recommended Next 10 Tasks)
1. implement `current_user` auth dependency
2. wire HF OAuth callbacks
3. replace UI `owner_user_id` with authenticated identity
4. add notebook rename API + UI
5. add notebook delete API + UI confirmation
6. add report artifact generator + endpoint
7. add artifact list/download/playback panel in Streamlit
8. add URL safety validator module for ingestion
9. add integration tests for cross-user isolation
10. enforce CI deploy workflow and add README deployment setup