# NotebookLM Clone - Handoff Document


## Stack


- **Auth:** Hugging Face OAuth (`gr.LoginButton`, `user_id` = HF username)
- **Metadata:** Supabase (notebooks, messages, artifacts)
- **Files:** Supabase Storage bucket `notebooklm`
- **Vectors:** Supabase pgvector (`chunks` table)

## Setup


### 1. Supabase


- Run `db/schema.sql` in the SQL Editor
- Create a Storage bucket: **Storage** → **New bucket** → name it `notebooklm`, set public/private as needed
- Add RLS policies for the bucket if using private access

### 2. HF Space


- Add `hf_oauth: true` to the README front matter (already done)
- Add `SUPABASE_URL` and `SUPABASE_KEY` (service role) as Space secrets
- Optional: `SUPABASE_BUCKET` (default: `notebooklm`)


### 3. Local


- Set the `HF_TOKEN` env var or run `huggingface-cli login` (required for the OAuth mock)
- Create a `.env` with `SUPABASE_URL` and `SUPABASE_KEY`
- `pip install gradio[oauth]` (or `itsdangerous`) for `LoginButton`

## Storage (Supabase Storage)


```python
from backend.storage import get_sources_path, save_file, load_file

# Ingestion: save an uploaded PDF
prefix = get_sources_path(user_id, notebook_id)  # "user_id/notebook_id/sources"
path = f"{prefix}/document.pdf"
save_file(path, file_bytes)

# Load it back
data = load_file(path)
```


Paths: `{user_id}/{notebook_id}/{sources|embeddings|chats|artifacts}/(unknown)`

## Notebook API


- `create_notebook(user_id, name)`
- `list_notebooks(user_id)`
- `rename_notebook(user_id, notebook_id, new_name)`
- `delete_notebook(user_id, notebook_id)`


## Chat (Supabase `messages` table)


- `save_message(notebook_id, role, content)`
- `load_chat(notebook_id)`


## Embeddings (pgvector)


Table `chunks`: `id`, `notebook_id`, `source_id`, `content`, `embedding vector(1536)`, `metadata`, `created_at`.

Ingestion team: embed chunks, insert them into `chunks`, and filter by `notebook_id` at retrieval time.

---


## Handover: Ingestion & RAG Builders


### Where to Write Your Code


| Responsibility | File / Location | Purpose |
|----------------|-----------------|---------|
| **Ingestion** | `backend/ingestion_service.py` (create this) | Parse uploaded files, chunk text, compute embeddings, insert into `chunks` |
| **RAG** | `backend/rag_service.py` (create this) | Embed query → similarity search → build context → call LLM → return answer |
| **Storage** | `backend/storage.py` (existing) | Save/load files in Supabase Storage; do not modify |
| **Chat** | `backend/chat_service.py` (existing) | Save/load messages; RAG calls `save_message` and `load_chat` |
| **UI** | `app.py` | Add upload component + chat interface; wire them to ingestion and RAG |

---


### Ingestion Builder


**Write your code in:** `backend/ingestion_service.py`


**Flow:**
1. Receive: `user_id`, `notebook_id`, the uploaded file bytes, and the filename.
2. Save the raw file via storage:
   ```python
   from backend.storage import get_sources_path, save_file
   prefix = get_sources_path(user_id, notebook_id)  # → "user_id/notebook_id/sources"
   path = f"{prefix}/(unknown)"
   save_file(path, file_bytes)
   ```
3. Parse the file (PDF, DOCX, TXT, etc.) and extract the text.
4. Chunk the text (e.g., 512–1024 tokens with overlap).
5. Compute embeddings (e.g., OpenAI `text-embedding-3-small` → 1536 dims, or compatible).
6. Insert rows into `chunks`:
   ```python
   supabase.table("chunks").insert({
       "notebook_id": notebook_id,
       "source_id": path,            # or your source identifier
       "content": chunk_text,
       "embedding": embedding_list,  # list of 1536 floats
       "metadata": {"page": 1, "chunk_idx": 0},  # optional
   }).execute()
   ```


**Integrate in app:**
- Add a `gr.File` or `gr.UploadButton` component in `app.py` for the selected notebook.
- On upload, call `ingest_file(user_id, notebook_id, file_bytes, filename)` from your new service.


**Existing helpers:** `backend/storage` (`save_file`, `load_file`, `list_files`, `get_sources_path`).

---

### RAG Builder

**Write your code in:** `backend/rag_service.py`


**Flow:**
1. Receive: `notebook_id` and the user query.
2. Embed the query (same model/dims as ingestion, e.g. 1536).
3. Similarity search in `chunks`:
   ```python
   # Supabase pgvector example (cosine similarity)
   result = supabase.rpc(
       "match_chunks",
       {"query_embedding": embedding, "match_count": 5, "p_notebook_id": notebook_id}
   ).execute()
   ```
   - You must add a Supabase function `match_chunks` that filters by `notebook_id` and runs the vector similarity (or use raw SQL).
   - Alternative: `supabase.table("chunks").select("*").eq("notebook_id", notebook_id)` and do the similarity in Python (less efficient).
4. Build the context from the top-k chunks.
5. Call an LLM (Hugging Face Inference API, OpenAI, etc.) with context + history.
6. Persist messages via `chat_service`:
   ```python
   from backend.chat_service import save_message, load_chat
   save_message(notebook_id, "user", query)
   save_message(notebook_id, "assistant", answer)
   ```
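The "do the similarity in Python" fallback from step 3 can be sketched with plain cosine similarity; `rows` stands in for the result of the `chunks` select, and both function names are illustrative:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_chunks(query_embedding: list[float], rows: list[dict], k: int = 5) -> list[dict]:
    """Rank chunk rows (each carrying an 'embedding' list) against the query."""
    ranked = sorted(
        rows,
        key=lambda r: cosine_similarity(query_embedding, r["embedding"]),
        reverse=True,
    )
    return ranked[:k]
```

This scans every row for the notebook on each query, which is fine for small corpora; the `match_chunks` RPC below pushes the same ranking into Postgres where the vector index can help.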


**Integrate in app:**
- Add a chat block in `app.py` (a `Chatbot` component) tied to `selected_notebook_id`.
- On submit: call `rag_chat(notebook_id, query, chat_history)` → it returns the assistant reply; update the history via `load_chat(notebook_id)` or append locally.


**Existing helpers:** `backend/chat_service` (`save_message`, `load_chat`), `backend/db` (`supabase`).

---


### Schema Reference (for both)


```sql
-- chunks table (db/schema.sql)
chunks (
  id uuid,
  notebook_id uuid,
  source_id text,
  content text,
  embedding vector(1536),
  metadata jsonb,
  created_at timestamptz
)
```


**Required:** `embedding` must be 1536 dimensions (or update the schema if you use a different model).
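Since pgvector rejects inserts whose vectors don't match the declared column width, a cheap guard before writing rows surfaces model mix-ups early (the helper name is illustrative, not part of the existing codebase):

```python
EXPECTED_DIMS = 1536  # must match embedding vector(1536) in db/schema.sql

def validate_embedding(embedding: list[float], dims: int = EXPECTED_DIMS) -> list[float]:
    """Fail fast if an embedding doesn't match the chunks.embedding column width."""
    if len(embedding) != dims:
        raise ValueError(f"expected {dims}-dim embedding, got {len(embedding)}")
    return embedding
```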


---


### Suggested RPC for RAG (optional)


Add this in the Supabase SQL Editor if you prefer server-side similarity:


```sql
create or replace function match_chunks(
  query_embedding vector(1536),
  match_count int,
  p_notebook_id uuid
)
returns table (id uuid, content text, metadata jsonb, similarity float)
language plpgsql as $$
begin
  return query
  select c.id, c.content, c.metadata,
         1 - (c.embedding <=> query_embedding) as similarity
  from chunks c
  where c.notebook_id = p_notebook_id
  order by c.embedding <=> query_embedding
  limit match_count;
end;
$$;
```


Ingestion writes to `chunks`; RAG reads via `match_chunks` or equivalent.