Architecture: NotebookLM Clone
Data flow
User (HF OAuth / MOCK_USER)
  → username
  → /data/users/<username>/notebooks/
      ├── index.json (list of notebooks)
      └── <notebook-uuid>/
          ├── files_raw/ (uploaded PDF/PPTX/TXT)
          ├── files_extracted/ (extracted text JSON per source)
          ├── sources.json (source registry: id, filename/url, type, enabled)
          ├── chroma/ (ChromaDB persistence)
          ├── chat/messages.jsonl (conversation history)
          └── artifacts/
              ├── reports/
              ├── quizzes/
              ├── podcasts/ (transcript_*.md, podcast_*.mp3)
              └── index.json (artifact metadata)
Modules
| Module | Responsibility |
|---|---|
| `backend/config.py` | Env vars, paths, constants. No global mutable state. |
| `backend/auth.py` | Derive username from `gr.Request` or `MOCK_USER`. |
| `backend/storage.py` | Path helpers for user/notebook dirs and files. |
| `backend/notebooks.py` | CRUD: list, create, rename, delete notebooks. |
| `backend/ingestion.py` | File/URL ingestion: extract text → chunk → embed → Chroma upsert. Source enable/disable. |
| `backend/retriever.py` | Embedding function (HF API or sentence-transformers). Chroma collection. Retrieval: similarity or MMR. |
| `backend/rag.py` | Retrieve → build prompt → LLM (HF API or local) → format answer with citations. Timing. Chat only. |
| `backend/gemini_client.py` | Gemini API client for artifact generation only (context-only; no API key logged). |
| `backend/artifacts.py` | Report, quiz, podcast transcript via Gemini; citations from chunk metadata; TTS for podcast .mp3; persist under `artifacts/`. |
| `backend/tts.py` | TTS: HF Inference API or gTTS fallback. |
| `backend/utils.py` | Logging, `user_data_dir`, `read_json`/`write_json`/`read_jsonl`/`append_jsonl`, `normalize_text`. |
| `app.py` | Gradio UI: notebooks, sources, chat, citations, artifacts. All handlers take `request` for username. |
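
The JSONL helpers listed for `backend/utils.py` could look roughly like the sketch below. Only the names `read_jsonl` and `append_jsonl` come from the table; the signatures, the UTF-8 encoding, and the behavior on a missing file are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def append_jsonl(path: Path, record: Dict[str, Any]) -> None:
    """Append one record as a single JSON line, creating parent dirs if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read all records; a missing file is treated as empty (assumed behavior)."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break parsing.
        return [json.loads(line) for line in f if line.strip()]
```

An append-only format suits `chat/messages.jsonl` because each new message is one write and earlier lines are never rewritten.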
Storage tree (exact)
/data/
└── users/
    └── <username>/
        └── notebooks/
            ├── index.json
            └── <notebook-uuid>/
                ├── files_raw/
                ├── files_extracted/
                ├── sources.json
                ├── chroma/
                ├── chat/
                │   └── messages.jsonl
                └── artifacts/
                    ├── index.json
                    ├── reports/
                    ├── quizzes/
                    └── podcasts/
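
A minimal sketch of path helpers that mirror the storage tree above. The function names (`user_notebooks_dir`, `notebook_dir`, `notebook_paths`) and the dictionary keys are hypothetical and not taken from `backend/storage.py`; the hard-coded `/data` root stands in for whatever `DATA_ROOT` resolves to.

```python
from pathlib import Path
from typing import Dict

# Assumption: the real root is configured through the DATA_ROOT env var.
DATA_ROOT = Path("/data")


def user_notebooks_dir(username: str) -> Path:
    """Directory holding one user's notebooks and their index.json."""
    return DATA_ROOT / "users" / username / "notebooks"


def notebook_dir(username: str, notebook_id: str) -> Path:
    """Root directory of a single notebook."""
    return user_notebooks_dir(username) / notebook_id


def notebook_paths(username: str, notebook_id: str) -> Dict[str, Path]:
    """Every per-notebook location from the storage tree, keyed by a short name."""
    root = notebook_dir(username, notebook_id)
    artifacts = root / "artifacts"
    return {
        "files_raw": root / "files_raw",
        "files_extracted": root / "files_extracted",
        "sources": root / "sources.json",
        "chroma": root / "chroma",
        "messages": root / "chat" / "messages.jsonl",
        "artifacts_index": artifacts / "index.json",
        "reports": artifacts / "reports",
        "quizzes": artifacts / "quizzes",
        "podcasts": artifacts / "podcasts",
    }
```

Deriving every path from `(username, notebook_id)` keeps per-user isolation in one place, which matches the way all source, chat, and artifact operations are keyed.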
Request flow
- Auth: Every state-changing handler receives `gr.Request`; `get_username_from_request(request)` returns the username (or `MOCK_USER`/anonymous). All paths are under `user_data_dir(username)`.
- Notebook: User selects/creates/renames/deletes notebooks. The current `notebook_id` is kept in state and in a hidden textbox; all source/chat/artifact ops use `(username, notebook_id)`.
- Ingestion: Upload or URL → extract text (pypdf/python-pptx/readability) → chunk (recursive split, overlap) → embed (HF or local) → upsert into Chroma with metadata (source_id, source_name, page_or_slide, enabled). `sources.json` is updated.
- RAG: Query → embed → Chroma query (filter enabled sources) → optional MMR → build context string → LLM with citation instructions → append to `messages.jsonl`. Retrieval and generation times are logged and shown in the UI.
- Artifacts: Report/quiz/podcast run retrieval again (artifact-specific query + extra instruction). Gemini generates Markdown (context-only; citations [1], [2] mapped from chunk metadata). The podcast transcript comes from Gemini; TTS (HF or gTTS) produces the .mp3. Entries are appended to `artifacts/index.json`.
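
The chunking part of the ingestion step can be illustrated with the sketch below. It uses a fixed-size sliding window with overlap, which is a simplification of the recursive split named above, and the helper names `chunk_text` and `build_records` are hypothetical. The ID format and metadata keys follow the Chroma conventions documented in this file; the record layout itself is an assumption.

```python
from typing import Any, Dict, List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def build_records(source_id: str, source_name: str, source_type: str,
                  page_or_slide: int, text: str,
                  chunk_size: int, overlap: int) -> List[Dict[str, Any]]:
    """Pair each chunk with an ID and metadata ready for a vector-store upsert."""
    return [
        {
            "id": f"{source_id}::{i}",
            "document": chunk,
            "metadata": {
                "source_id": source_id,
                "source_name": source_name,
                "source_type": source_type,
                "page_or_slide": page_or_slide,
                "chunk_index": i,
                "enabled": True,  # assumption: new sources start enabled
            },
        }
        for i, chunk in enumerate(chunk_text(text, chunk_size, overlap))
    ]
```

For example, `chunk_text("abcdefghij", 4, 2)` yields four windows, each starting two characters after the previous one.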
Chroma
- One collection per notebook: name `chunks`.
- Documents stored with metadata: `source_id`, `source_name`, `source_type`, `page_or_slide`, `chunk_index`, `enabled`.
- Chunk IDs: `{source_id}::{chunk_index}`.
- Retrieval filters by `enabled` `source_id` (only enabled sources are queried); supports similarity-only or MMR.
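
To make the MMR option concrete, here is a dependency-free sketch of maximal marginal relevance re-ranking over already-retrieved candidates. It is not the code in `backend/retriever.py`; the function names are hypothetical, and the `lam` parameter is assumed to play the role of `MMR_LAMBDA` (1.0 means pure similarity, lower values favor diversity).

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query_vec: Sequence[float], candidate_vecs: List[Sequence[float]],
        k: int, lam: float) -> List[int]:
    """Return indices of up to k candidates, trading relevance against redundancy."""
    relevance = [cosine(query_vec, v) for v in candidate_vecs]
    selected: List[int] = []
    remaining = list(range(len(candidate_vecs)))

    def score(i: int) -> float:
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max(
            (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
            default=0.0,
        )
        return lam * relevance[i] - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a low `lam`, a near-duplicate of an already selected chunk is passed over in favor of a less similar but more novel one, which helps when several chunks from the same page match the query.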
Configuration (env)
See README and backend/config.py: GEMINI_API_KEY, GEMINI_MODEL (artifacts); HF_TOKEN, HF_LLM_MODEL, HF_EMBED_MODEL, HF_TTS_MODEL (chat, embeddings, TTS); CHUNK_SIZE, CHUNK_OVERLAP, TOP_K, MMR_LAMBDA, MOCK_USER, DATA_ROOT.
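
One way to read the non-secret settings without global mutable state is a frozen dataclass built from the environment, sketched below. The `Settings` class, the `load_settings` function, and every default value are assumptions for illustration only; the actual defaults live in `backend/config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    data_root: str
    mock_user: Optional[str]
    chunk_size: int
    chunk_overlap: int
    top_k: int
    mmr_lambda: float


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build an immutable Settings object from env vars (defaults are placeholders)."""
    env = os.environ if env is None else env
    return Settings(
        data_root=env.get("DATA_ROOT", "/data"),
        mock_user=env.get("MOCK_USER") or None,
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k=int(env.get("TOP_K", "5")),
        mmr_lambda=float(env.get("MMR_LAMBDA", "0.5")),
    )
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the process environment.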