Spaces:

notebooklm-group2
/

NotebookLM-Clone

Running

File size: 6,272 Bytes

69068b7

# NotebookLM Clone - Handoff Document

## Stack

- **Auth:** Hugging Face OAuth (`gr.LoginButton`, `user_id` = HF username)
- **Metadata:** Supabase (notebooks, messages, artifacts)
- **Files:** Supabase Storage bucket `notebooklm`
- **Vectors:** Supabase pgvector (chunks table)

## Setup

### 1. Supabase

- Run `db/schema.sql` in SQL Editor
- Create Storage bucket: **Storage** → **New bucket** → name `notebooklm`, set public/private as needed
- Add RLS policies for the bucket if using private access

### 2. HF Space

- Add `hf_oauth: true` in README (already done)
- Add `SUPABASE_URL`, `SUPABASE_KEY` (service role) as Space secrets
- Optional: `SUPABASE_BUCKET` (default: notebooklm)

### 3. Local

- `HF_TOKEN` env var or `huggingface-cli login` (required for OAuth mock)
- `.env` with `SUPABASE_URL`, `SUPABASE_KEY`
- `pip install gradio[oauth]` (or `itsdangerous`) for LoginButton

## Storage (Supabase Storage)

```python
from backend.storage import get_sources_path, save_file, load_file

# Ingestion: save uploaded PDF
prefix = get_sources_path(user_id, notebook_id)  # "user_id/notebook_id/sources"
path = f"{prefix}/document.pdf"
save_file(path, file_bytes)

# Load
data = load_file(path)
```

Paths: `{user_id}/{notebook_id}/sources|embeddings|chats|artifacts}/{filename}`

## Notebook API

- `create_notebook(user_id, name)` 
- `list_notebooks(user_id)`
- `rename_notebook(user_id, notebook_id, new_name)`
- `delete_notebook(user_id, notebook_id)`

## Chat (Supabase messages table)

- `save_message(notebook_id, role, content)`
- `load_chat(notebook_id)`

## Embeddings (pgvector)

Table `chunks`: id, notebook_id, source_id, content, embedding vector(1536), metadata, created_at.

Ingestion team: embed chunks, insert into `chunks`, filter by `notebook_id` for retrieval.

---

## Handover: Ingestion & RAG Builders

### Where to Write Your Code

| Responsibility | File / Location | Purpose |
|----------------|-----------------|---------|
| **Ingestion**  | `backend/ingestion_service.py` (create this) | Parse uploaded files, chunk text, compute embeddings, insert into `chunks` |
| **RAG**        | `backend/rag_service.py` (create this)      | Embed query → similarity search → build context → call LLM → return answer |
| **Storage**    | `backend/storage.py` (existing)             | Save/load files in Supabase Storage; do not modify |
| **Chat**       | `backend/chat_service.py` (existing)        | Save/load messages; RAG calls `save_message` and `load_chat` |
| **UI**         | `app.py`                                   | Add upload component + chat interface; wire to ingestion and RAG |

---

### Ingestion Builder

**Write your code in:** `backend/ingestion_service.py`

**Flow:**
1. Receive: `user_id`, `notebook_id`, uploaded file bytes, and filename.
2. Save raw file via storage:
   ```python
   from backend.storage import get_sources_path, save_file
   prefix = get_sources_path(user_id, notebook_id)  # → "user_id/notebook_id/sources"
   path = f"{prefix}/{filename}"
   save_file(path, file_bytes)
   ```
3. Parse file (PDF, DOCX, TXT, etc.) and extract text.
4. Chunk text (e.g., 512–1024 tokens with overlap).
5. Compute embeddings (e.g., OpenAI `text-embedding-3-small` → 1536 dims, or compatible).
6. Insert rows into `chunks`:
   ```python
   supabase.table("chunks").insert({
       "notebook_id": notebook_id,
       "source_id": path,  # or your source identifier
       "content": chunk_text,
       "embedding": embedding_list,  # list of 1536 floats
       "metadata": {"page": 1, "chunk_idx": 0}  # optional
   }).execute()
   ```

**Integrate in app:**
- Add `gr.File` or `gr.Upload` in `app.py` for the selected notebook.
- On upload, call `ingest_file(user_id, notebook_id, file_bytes, filename)` from your new service.

**Existing helpers:** `backend/storage` (save_file, load_file, list_files, get_sources_path).

---

### RAG Builder

**Write your code in:** `backend/rag_service.py`

**Flow:**
1. Receive: `notebook_id`, user query.
2. Embed the query (same model/dims as ingestion, e.g. 1536).
3. Similarity search in `chunks`:
   ```python
   # Supabase pgvector example (cosine similarity)
   result = supabase.rpc(
       "match_chunks",
       {"query_embedding": embedding, "match_count": 5, "p_notebook_id": notebook_id}
   ).execute()
   ```
   - You must add a Supabase function `match_chunks` that filters by `notebook_id` and runs vector similarity (or use raw SQL).
   - Alternative: use `supabase.table("chunks").select("*").eq("notebook_id", notebook_id)` and do similarity in Python (less efficient).
4. Build context from top-k chunks.
5. Call LLM (Hugging Face Inference API, OpenAI, etc.) with context + history.
6. Persist messages via `chat_service`:
   ```python
   from backend.chat_service import save_message, load_chat
   save_message(notebook_id, "user", query)
   save_message(notebook_id, "assistant", answer)
   ```

**Integrate in app:**
- Add a chat block in `app.py` (Chatbot component) tied to `selected_notebook_id`.
- On submit: call `rag_chat(notebook_id, query, chat_history)` → returns assistant reply; update history using `load_chat(notebook_id)` or append locally.

**Existing helpers:** `backend/chat_service` (save_message, load_chat), `backend/db` (supabase).

---

### Schema Reference (for both)

```sql
-- chunks table (db/schema.sql)
chunks (
  id uuid,
  notebook_id uuid,
  source_id text,
  content text,
  embedding vector(1536),
  metadata jsonb,
  created_at timestamptz
)
```

**Required:** `embedding` must be 1536 dimensions (or update schema if using a different model).

---

### Suggested RPC for RAG (optional)

Add this in Supabase SQL Editor if you prefer server-side similarity:

```sql
create or replace function match_chunks(
  query_embedding vector(1536),
  match_count int,
  p_notebook_id uuid
)
returns table (id uuid, content text, metadata jsonb, similarity float)
language plpgsql as $$
begin
  return query
  select c.id, c.content, c.metadata,
         1 - (c.embedding <=> query_embedding) as similarity
  from chunks c
  where c.notebook_id = p_notebook_id
  order by c.embedding <=> query_embedding
  limit match_count;
end;
$$;
```

Ingestion writes to `chunks`; RAG reads via `match_chunks` or equivalent.