# Architecture: NotebookLM Clone
## Data flow
```
User (HF OAuth / MOCK_USER)
  → username
  → /data/users/<username>/notebooks/
      → index.json (list of notebooks)
      → <notebook-uuid>/
          → files_raw/ (uploaded PDF/PPTX/TXT)
          → files_extracted/ (extracted text JSON per source)
          → sources.json (source registry: id, filename/url, type, enabled)
          → chroma/ (ChromaDB persistence)
          → chat/messages.jsonl (conversation history)
          → artifacts/
              → reports/
              → quizzes/
              → podcasts/ (transcript_*.md, podcast_*.mp3)
              → index.json (artifact metadata)
```
## Modules
| Module | Responsibility |
|--------|----------------|
| `backend/config.py` | Env vars, paths, constants. No global mutable state. |
| `backend/auth.py` | Derive username from `gr.Request` or `MOCK_USER`. |
| `backend/storage.py` | Path helpers for user/notebook dirs and files. |
| `backend/notebooks.py` | CRUD: list, create, rename, delete notebooks. |
| `backend/ingestion.py` | File/URL ingestion: extract text → chunk → embed → Chroma upsert. Source enable/disable. |
| `backend/retriever.py` | Embedding function (HF API or sentence-transformers). Chroma collection. Retrieval: similarity or MMR. |
| `backend/rag.py` | Retrieve → build prompt → LLM (HF API or local) → format answer with citations. Timing. Chat only. |
| `backend/gemini_client.py` | Gemini API client for artifact generation only (context-only; no API key logged). |
| `backend/artifacts.py` | Report, quiz, podcast transcript via Gemini; citations from chunk metadata; TTS for podcast .mp3; persist under artifacts/. |
| `backend/tts.py` | TTS: HF Inference API or gTTS fallback. |
| `backend/utils.py` | Logging, user_data_dir, read_json/write_json/read_jsonl/append_jsonl, normalize_text. |
| `app.py` | Gradio UI: notebooks, sources, chat, citations, artifacts. All handlers take `request` for username. |
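
The table lists `read_jsonl` and `append_jsonl` in `backend/utils.py`, but their signatures are not shown. The following is a minimal sketch of how such helpers might look for `chat/messages.jsonl`, assuming they take a `pathlib.Path` and plain dict records; the real helpers may differ.

```python
import json
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    """Append one JSON object as a single line (assumed signature)."""
    # Create chat/ (or any parent directory) on first write.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Return all records in file order; a missing file yields an empty list."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break parsing.
        return [json.loads(line) for line in f if line.strip()]
```

Appending one line per message keeps each write small and leaves earlier history untouched, which suits a conversation log.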
## Storage tree (exact)
```
/data/
└── users/
    └── <username>/
        └── notebooks/
            ├── index.json
            └── <notebook-uuid>/
                ├── files_raw/
                ├── files_extracted/
                ├── sources.json
                ├── chroma/
                ├── chat/
                │   └── messages.jsonl
                └── artifacts/
                    ├── index.json
                    ├── reports/
                    ├── quizzes/
                    └── podcasts/
```
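
As a rough illustration of how the path helpers in `backend/storage.py` could map onto this tree, here is a sketch built on `pathlib`. `user_data_dir` is named in the module table, but its signature here is assumed; `notebook_dir` and `notebook_paths` are hypothetical names, and the `/data` default for `DATA_ROOT` is an assumption taken from the tree above.

```python
import os
from pathlib import Path

# Assumption: DATA_ROOT falls back to /data, matching the storage tree.
DATA_ROOT = Path(os.environ.get("DATA_ROOT", "/data"))


def user_data_dir(username: str, root: Path = DATA_ROOT) -> Path:
    """Root directory for one user (signature assumed)."""
    return root / "users" / username


def notebook_dir(username: str, notebook_id: str, root: Path = DATA_ROOT) -> Path:
    """Directory of a single notebook (hypothetical helper)."""
    return user_data_dir(username, root) / "notebooks" / notebook_id


def notebook_paths(username: str, notebook_id: str, root: Path = DATA_ROOT) -> dict[str, Path]:
    """All per-notebook locations from the storage tree (hypothetical helper)."""
    base = notebook_dir(username, notebook_id, root)
    return {
        "files_raw": base / "files_raw",
        "files_extracted": base / "files_extracted",
        "sources": base / "sources.json",
        "chroma": base / "chroma",
        "messages": base / "chat" / "messages.jsonl",
        "artifacts_index": base / "artifacts" / "index.json",
        "reports": base / "artifacts" / "reports",
        "quizzes": base / "artifacts" / "quizzes",
        "podcasts": base / "artifacts" / "podcasts",
    }
```

Deriving every path from `(username, notebook_id)` in one place keeps handlers from assembling paths by hand.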
## Request flow
1. **Auth**: Every state-changing handler receives `gr.Request`; `get_username_from_request(request)` returns username (or `MOCK_USER` / `anonymous`). All paths are under `user_data_dir(username)`.
2. **Notebook**: User selects/creates/renames/deletes notebooks. The current `notebook_id` is kept in state and in a hidden textbox; all source/chat/artifact ops use `(username, notebook_id)`.
3. **Ingestion**: Upload or URL → extract text (pypdf/python-pptx/readability) → chunk (recursive split, overlap) → embed (HF or local) → upsert Chroma with metadata (source_id, source_name, page_or_slide, enabled). sources.json updated.
4. **RAG**: Query → embed → Chroma query (filter enabled sources) → optional MMR → build context string → LLM with citation instructions → append to messages.jsonl. Retrieval and generation times logged and shown in UI.
5. **Artifacts**: Report/quiz/podcast use retrieval again (artifact-specific query + extra instruction). **Gemini** generates Markdown (context-only; citations [1], [2] mapped from chunk metadata). Podcast transcript from Gemini; TTS (HF or gTTS) for .mp3. Entries appended to artifacts/index.json.
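
To make the overlap in the ingestion step concrete, here is a simplified sketch. It uses a fixed-size sliding window over characters, not the recursive splitter the ingestion step describes. `chunk_text` is a hypothetical name, and the default sizes are assumptions standing in for `CHUNK_SIZE` and `CHUNK_OVERLAP`.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a tail that only repeats overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, `chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)` yields `["abcd", "defg", "ghij"]`: each chunk starts with the last character of the previous one.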
## Chroma
- One collection per notebook: name `chunks`.
- Documents stored with metadata: `source_id`, `source_name`, `source_type`, `page_or_slide`, `chunk_index`, `enabled`.
- Chunk IDs: `{source_id}::{chunk_index}`.
- Retrieval is restricted to chunks from enabled sources; ranking is either similarity-only or MMR.
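
The MMR option can be sketched in plain Python. This is a generic maximal marginal relevance selection over already-retrieved candidate embeddings, not necessarily how `backend/retriever.py` does it. `mmr_select` and `cosine` are hypothetical names, and `mmr_lambda` is assumed to correspond to the `MMR_LAMBDA` setting.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query_vec, candidate_vecs, top_k=4, mmr_lambda=0.5) -> list[int]:
    """Return indices of up to top_k candidates, balancing relevance against redundancy."""
    relevance = [cosine(query_vec, c) for c in candidate_vecs]
    selected: list[int] = []
    remaining = list(range(len(candidate_vecs)))

    while remaining and len(selected) < top_k:
        def score(i: int) -> float:
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return mmr_lambda * relevance[i] - (1 - mmr_lambda) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `mmr_lambda=1.0` this reduces to plain similarity ranking. Lower values penalize candidates that resemble chunks already chosen, so near-duplicate chunks are less likely to fill the context.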
## Configuration (env)
See the README and `backend/config.py`: `GEMINI_API_KEY`, `GEMINI_MODEL` (artifacts); `HF_TOKEN`, `HF_LLM_MODEL`, `HF_EMBED_MODEL`, `HF_TTS_MODEL` (chat, embeddings, TTS); `CHUNK_SIZE`, `CHUNK_OVERLAP`, `TOP_K`, `MMR_LAMBDA`, `MOCK_USER`, `DATA_ROOT`.
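
One way to read these variables without global mutable state is a frozen dataclass built once from the environment. The sketch below covers only the non-secret tuning settings. `Settings` and `load_settings` are hypothetical names, and every default value is an assumption; the real defaults live in `backend/config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Immutable snapshot of configuration (hypothetical container)."""
    data_root: str
    mock_user: Optional[str]
    chunk_size: int
    chunk_overlap: int
    top_k: int
    mmr_lambda: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an env-like mapping; defaults are illustrative assumptions."""
    return Settings(
        data_root=env.get("DATA_ROOT", "/data"),
        mock_user=env.get("MOCK_USER") or None,  # empty string counts as unset
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k=int(env.get("TOP_K", "5")),
        mmr_lambda=float(env.get("MMR_LAMBDA", "0.5")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"CHUNK_SIZE": "500"})`.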