File size: 6,272 Bytes
69068b7 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 | # NotebookLM Clone - Handoff Document
## Stack
- **Auth:** Hugging Face OAuth (`gr.LoginButton`, `user_id` = HF username)
- **Metadata:** Supabase (notebooks, messages, artifacts)
- **Files:** Supabase Storage bucket `notebooklm`
- **Vectors:** Supabase pgvector (chunks table)
## Setup
### 1. Supabase
- Run `db/schema.sql` in SQL Editor
- Create Storage bucket: **Storage** β **New bucket** β name `notebooklm`, set public/private as needed
- Add RLS policies for the bucket if using private access
### 2. HF Space
- Add `hf_oauth: true` in README (already done)
- Add `SUPABASE_URL`, `SUPABASE_KEY` (service role) as Space secrets
- Optional: `SUPABASE_BUCKET` (default: notebooklm)
### 3. Local
- `HF_TOKEN` env var or `huggingface-cli login` (required for OAuth mock)
- `.env` with `SUPABASE_URL`, `SUPABASE_KEY`
- `pip install gradio[oauth]` (or `itsdangerous`) for LoginButton
## Storage (Supabase Storage)
```python
from backend.storage import get_sources_path, save_file, load_file
# Ingestion: save uploaded PDF
prefix = get_sources_path(user_id, notebook_id) # "user_id/notebook_id/sources"
path = f"{prefix}/document.pdf"
save_file(path, file_bytes)
# Load
data = load_file(path)
```
Paths: `{user_id}/{notebook_id}/sources|embeddings|chats|artifacts}/{filename}`
## Notebook API
- `create_notebook(user_id, name)`
- `list_notebooks(user_id)`
- `rename_notebook(user_id, notebook_id, new_name)`
- `delete_notebook(user_id, notebook_id)`
## Chat (Supabase messages table)
- `save_message(notebook_id, role, content)`
- `load_chat(notebook_id)`
## Embeddings (pgvector)
Table `chunks`: id, notebook_id, source_id, content, embedding vector(1536), metadata, created_at.
Ingestion team: embed chunks, insert into `chunks`, filter by `notebook_id` for retrieval.
---
## Handover: Ingestion & RAG Builders
### Where to Write Your Code
| Responsibility | File / Location | Purpose |
|----------------|-----------------|---------|
| **Ingestion** | `backend/ingestion_service.py` (create this) | Parse uploaded files, chunk text, compute embeddings, insert into `chunks` |
| **RAG** | `backend/rag_service.py` (create this) | Embed query β similarity search β build context β call LLM β return answer |
| **Storage** | `backend/storage.py` (existing) | Save/load files in Supabase Storage; do not modify |
| **Chat** | `backend/chat_service.py` (existing) | Save/load messages; RAG calls `save_message` and `load_chat` |
| **UI** | `app.py` | Add upload component + chat interface; wire to ingestion and RAG |
---
### Ingestion Builder
**Write your code in:** `backend/ingestion_service.py`
**Flow:**
1. Receive: `user_id`, `notebook_id`, uploaded file bytes, and filename.
2. Save raw file via storage:
```python
from backend.storage import get_sources_path, save_file
prefix = get_sources_path(user_id, notebook_id) # β "user_id/notebook_id/sources"
path = f"{prefix}/{filename}"
save_file(path, file_bytes)
```
3. Parse file (PDF, DOCX, TXT, etc.) and extract text.
4. Chunk text (e.g., 512β1024 tokens with overlap).
5. Compute embeddings (e.g., OpenAI `text-embedding-3-small` β 1536 dims, or compatible).
6. Insert rows into `chunks`:
```python
supabase.table("chunks").insert({
"notebook_id": notebook_id,
"source_id": path, # or your source identifier
"content": chunk_text,
"embedding": embedding_list, # list of 1536 floats
"metadata": {"page": 1, "chunk_idx": 0} # optional
}).execute()
```
**Integrate in app:**
- Add `gr.File` or `gr.Upload` in `app.py` for the selected notebook.
- On upload, call `ingest_file(user_id, notebook_id, file_bytes, filename)` from your new service.
**Existing helpers:** `backend/storage` (save_file, load_file, list_files, get_sources_path).
---
### RAG Builder
**Write your code in:** `backend/rag_service.py`
**Flow:**
1. Receive: `notebook_id`, user query.
2. Embed the query (same model/dims as ingestion, e.g. 1536).
3. Similarity search in `chunks`:
```python
# Supabase pgvector example (cosine similarity)
result = supabase.rpc(
"match_chunks",
{"query_embedding": embedding, "match_count": 5, "p_notebook_id": notebook_id}
).execute()
```
- You must add a Supabase function `match_chunks` that filters by `notebook_id` and runs vector similarity (or use raw SQL).
- Alternative: use `supabase.table("chunks").select("*").eq("notebook_id", notebook_id)` and do similarity in Python (less efficient).
4. Build context from top-k chunks.
5. Call LLM (Hugging Face Inference API, OpenAI, etc.) with context + history.
6. Persist messages via `chat_service`:
```python
from backend.chat_service import save_message, load_chat
save_message(notebook_id, "user", query)
save_message(notebook_id, "assistant", answer)
```
**Integrate in app:**
- Add a chat block in `app.py` (Chatbot component) tied to `selected_notebook_id`.
- On submit: call `rag_chat(notebook_id, query, chat_history)` β returns assistant reply; update history using `load_chat(notebook_id)` or append locally.
**Existing helpers:** `backend/chat_service` (save_message, load_chat), `backend/db` (supabase).
---
### Schema Reference (for both)
```sql
-- chunks table (db/schema.sql)
chunks (
id uuid,
notebook_id uuid,
source_id text,
content text,
embedding vector(1536),
metadata jsonb,
created_at timestamptz
)
```
**Required:** `embedding` must be 1536 dimensions (or update schema if using a different model).
---
### Suggested RPC for RAG (optional)
Add this in Supabase SQL Editor if you prefer server-side similarity:
```sql
create or replace function match_chunks(
query_embedding vector(1536),
match_count int,
p_notebook_id uuid
)
returns table (id uuid, content text, metadata jsonb, similarity float)
language plpgsql as $$
begin
return query
select c.id, c.content, c.metadata,
1 - (c.embedding <=> query_embedding) as similarity
from chunks c
where c.notebook_id = p_notebook_id
order by c.embedding <=> query_embedding
limit match_count;
end;
$$;
```
Ingestion writes to `chunks`; RAG reads via `match_chunks` or equivalent.
|