# Architecture: NotebookLM Clone
## Data flow
```
User (HF OAuth / MOCK_USER)
  → username
  → /data/users/<username>/notebooks/
      → index.json (list of notebooks)
      → <notebook-uuid>/
          → files_raw/ (uploaded PDF/PPTX/TXT)
          → files_extracted/ (extracted text JSON per source)
          → sources.json (source registry: id, filename/url, type, enabled)
          → chroma/ (ChromaDB persistence)
          → chat/messages.jsonl (conversation history)
          → artifacts/
              → reports/
              → quizzes/
              → podcasts/ (transcript_*.md, podcast_*.mp3)
              → index.json (artifact metadata)
```
## Modules
| Module | Responsibility |
|--------|----------------|
| `backend/config.py` | Env vars, paths, constants. No global mutable state. |
| `backend/auth.py` | Derive username from `gr.Request` or `MOCK_USER`. |
| `backend/storage.py` | Path helpers for user/notebook dirs and files. |
| `backend/notebooks.py` | CRUD: list, create, rename, delete notebooks. |
| `backend/ingestion.py` | File/URL ingestion: extract text → chunk → embed → Chroma upsert. Source enable/disable. |
| `backend/retriever.py` | Embedding function (HF API or sentence-transformers). Chroma collection. Retrieval: similarity or MMR. |
| `backend/rag.py` | Retrieve → build prompt → LLM (HF API or local) → format answer with citations. Timing. Chat only. |
| `backend/gemini_client.py` | Gemini API client for artifact generation only (context-only; no API key logged). |
| `backend/artifacts.py` | Report, quiz, podcast transcript via Gemini; citations from chunk metadata; TTS for podcast .mp3; persist under artifacts/. |
| `backend/tts.py` | TTS: HF Inference API or gTTS fallback. |
| `backend/utils.py` | Logging, user_data_dir, read_json/write_json/read_jsonl/append_jsonl, normalize_text. |
| `app.py` | Gradio UI: notebooks, sources, chat, citations, artifacts. All handlers take `request` for username. |
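
The JSONL helpers listed for `backend/utils.py` could look roughly like the sketch below. Only the names `read_jsonl` and `append_jsonl` come from the table; the signatures, the UTF-8 encoding, and the "missing file means empty list" behaviour are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any


def append_jsonl(path: Path, record: dict[str, Any]) -> None:
    """Append one record as a single JSON line, creating parent dirs if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict[str, Any]]:
    """Return all records in file order; a missing file yields an empty list (assumed)."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break parsing.
        return [json.loads(line) for line in f if line.strip()]
```

An append-only format of this kind suits `chat/messages.jsonl`, since each new message is written without rewriting the earlier history.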
## Storage tree (exact)
```
/data/
└── users/
    └── <username>/
        └── notebooks/
            ├── index.json
            └── <notebook-uuid>/
                ├── files_raw/
                ├── files_extracted/
                ├── sources.json
                ├── chroma/
                ├── chat/
                │   └── messages.jsonl
                └── artifacts/
                    ├── index.json
                    ├── reports/
                    ├── quizzes/
                    └── podcasts/
```
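
As a rough illustration of how path helpers might map onto this tree, here is a minimal sketch using `pathlib`. `user_data_dir` is named in the module table, but its signature here (including the `data_root` parameter) is an assumption, and `notebook_paths` is a hypothetical helper that does not necessarily exist in `backend/storage.py`.

```python
from pathlib import Path


def user_data_dir(username: str, data_root: Path = Path("/data")) -> Path:
    """Root directory for one user (signature assumed for this sketch)."""
    return data_root / "users" / username


def notebook_paths(username: str, notebook_id: str,
                   data_root: Path = Path("/data")) -> dict[str, Path]:
    """Hypothetical helper: resolve every path in one notebook's tree."""
    root = user_data_dir(username, data_root) / "notebooks" / notebook_id
    artifacts = root / "artifacts"
    return {
        "root": root,
        "files_raw": root / "files_raw",
        "files_extracted": root / "files_extracted",
        "sources": root / "sources.json",
        "chroma": root / "chroma",
        "messages": root / "chat" / "messages.jsonl",
        "artifact_index": artifacts / "index.json",
        "reports": artifacts / "reports",
        "quizzes": artifacts / "quizzes",
        "podcasts": artifacts / "podcasts",
    }
```

The function only computes paths and never touches the disk, which keeps it easy to test; directory creation would be left to the caller.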
## Request flow
1. **Auth**: Every state-changing handler receives `gr.Request`; `get_username_from_request(request)` returns username (or `MOCK_USER` / `anonymous`). All paths are under `user_data_dir(username)`.
2. **Notebook**: User selects/creates/renames/deletes notebooks. The current `notebook_id` is kept in state and in a hidden textbox; all source/chat/artifact ops use `(username, notebook_id)`.
3. **Ingestion**: Upload or URL → extract text (pypdf/python-pptx/readability) → chunk (recursive split, overlap) → embed (HF or local) → upsert Chroma with metadata (source_id, source_name, page_or_slide, enabled). sources.json updated.
4. **RAG**: Query → embed → Chroma query (filter enabled sources) → optional MMR → build context string → LLM with citation instructions → append to messages.jsonl. Retrieval and generation times logged and shown in UI.
5. **Artifacts**: Report/quiz/podcast use retrieval again (artifact-specific query + extra instruction). **Gemini** generates Markdown (context-only; citations [1], [2] mapped from chunk metadata). Podcast transcript from Gemini; TTS (HF or gTTS) for .mp3. Entries appended to artifacts/index.json.
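
The chunking step in the ingestion flow can be illustrated with a simplified sketch. The document describes a recursive split with overlap; the code below swaps in a plain fixed-size sliding window, which shows the role of `CHUNK_SIZE` and `CHUNK_OVERLAP` but does not try to break on paragraph or sentence boundaries. The function name `chunk_text` and the default values are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters.

    Simplified stand-in for a recursive splitter; defaults are illustrative only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.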
## Chroma
- One collection per notebook: name `chunks`.
- Documents stored with metadata: `source_id`, `source_name`, `source_type`, `page_or_slide`, `chunk_index`, `enabled`.
- Chunk IDs: `{source_id}::{chunk_index}`.
- Retrieval returns only chunks from enabled sources and supports either plain similarity search or MMR.
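
To make the difference between similarity-only retrieval and MMR concrete, here is a small pure-Python sketch of maximal marginal relevance over chunk IDs in the `{source_id}::{chunk_index}` format. It is not taken from `backend/retriever.py`: the names `cosine` and `mmr_select` are hypothetical, and treating `lambda_mult` as the counterpart of `MMR_LAMBDA` is an assumption.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query: list[float], candidates: dict[str, list[float]],
               k: int, lambda_mult: float = 0.5) -> list[str]:
    """Greedily pick k chunk IDs, trading relevance against redundancy."""
    relevance = {cid: cosine(query, vec) for cid, vec in candidates.items()}
    remaining = set(candidates)
    selected: list[str] = []

    def score(cid: str) -> float:
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max((cosine(candidates[cid], candidates[s]) for s in selected),
                         default=0.0)
        return lambda_mult * relevance[cid] - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


vectors = {"src-a::0": [1.0, 0.2], "src-a::1": [1.0, 0.25], "src-b::0": [1.0, -0.6]}
print(mmr_select([1.0, 0.0], vectors, k=2, lambda_mult=1.0))  # relevance only
print(mmr_select([1.0, 0.0], vectors, k=2, lambda_mult=0.5))  # penalises near-duplicates
```

With `lambda_mult=1.0` the two near-identical chunks from `src-a` are returned; with `lambda_mult=0.5` the second pick shifts to `src-b::0` because it adds information the first pick does not already cover.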
## Configuration (env)
See README and `backend/config.py`: `GEMINI_API_KEY`, `GEMINI_MODEL` (artifacts); `HF_TOKEN`, `HF_LLM_MODEL`, `HF_EMBED_MODEL`, `HF_TTS_MODEL` (chat, embeddings, TTS); `CHUNK_SIZE`, `CHUNK_OVERLAP`, `TOP_K`, `MMR_LAMBDA` (chunking and retrieval); `MOCK_USER`, `DATA_ROOT` (auth and storage).
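
One way to read these variables without global mutable state is a frozen dataclass built from the environment, sketched below. The variable names come from the list above; the `Settings` class, the `load_settings` function, and every default value are assumptions for illustration and may differ from `backend/config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)  # frozen: settings cannot be mutated after loading
class Settings:
    chunk_size: int
    chunk_overlap: int
    top_k: int
    mmr_lambda: float
    data_root: str
    mock_user: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ). Defaults are illustrative."""
    env = os.environ if env is None else env
    return Settings(
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k=int(env.get("TOP_K", "5")),
        mmr_lambda=float(env.get("MMR_LAMBDA", "0.5")),
        data_root=env.get("DATA_ROOT", "/data"),
        mock_user=env.get("MOCK_USER") or None,  # empty string counts as unset
    )
```

Passing an explicit mapping, for example `load_settings({"CHUNK_SIZE": "500"})`, makes the loader testable without touching the real process environment.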