Architecture: NotebookLM Clone

Data flow

User (HF OAuth / MOCK_USER)
    → username
    → /data/users/<username>/notebooks/
        → index.json (list of notebooks)
        → <notebook-uuid>/
            → files_raw/          (uploaded PDF/PPTX/TXT)
            → files_extracted/    (extracted text JSON per source)
            → sources.json        (source registry: id, filename/url, type, enabled)
            → chroma/             (ChromaDB persistence)
            → chat/messages.jsonl (conversation history)
            → artifacts/
                → reports/
                → quizzes/
                → podcasts/       (transcript_*.md, podcast_*.mp3)
                → index.json      (artifact metadata)
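
The source registry in sources.json can be pictured with a small sketch. The key names used here (`id`, `filename`, `type`, `enabled`) are assumptions based on the field list above, and the helpers `make_source`, `set_enabled`, and `enabled_ids` are hypothetical, not taken from `backend/ingestion.py`.

```python
from typing import Any, Dict, List


def make_source(source_id: str, filename: str, source_type: str) -> Dict[str, Any]:
    """Build one registry entry; new sources start enabled (an assumption)."""
    return {"id": source_id, "filename": filename, "type": source_type, "enabled": True}


def set_enabled(sources: List[Dict[str, Any]], source_id: str, enabled: bool) -> List[Dict[str, Any]]:
    """Return a new registry list with one source toggled; unknown ids raise KeyError."""
    found = False
    updated = []
    for entry in sources:
        if entry["id"] == source_id:
            entry = {**entry, "enabled": enabled}  # copy, do not mutate the input
            found = True
        updated.append(entry)
    if not found:
        raise KeyError(source_id)
    return updated


def enabled_ids(sources: List[Dict[str, Any]]) -> List[str]:
    """List the ids of sources that should take part in retrieval."""
    return [entry["id"] for entry in sources if entry["enabled"]]
```

Returning a new list rather than mutating in place keeps the toggle easy to test; the caller would then write the result back to sources.json.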

Modules

Module	Responsibility
`backend/config.py`	Env vars, paths, constants. No global mutable state.
`backend/auth.py`	Derive username from `gr.Request` or `MOCK_USER`.
`backend/storage.py`	Path helpers for user/notebook dirs and files.
`backend/notebooks.py`	CRUD: list, create, rename, delete notebooks.
`backend/ingestion.py`	File/URL ingestion: extract text → chunk → embed → Chroma upsert. Source enable/disable.
`backend/retriever.py`	Embedding function (HF API or sentence-transformers). Chroma collection. Retrieval: similarity or MMR.
`backend/rag.py`	Retrieve → build prompt → LLM (HF API or local) → format answer with citations. Timing. Chat only.
`backend/gemini_client.py`	Gemini API client for artifact generation only (context-only; no API key logged).
`backend/artifacts.py`	Report, quiz, podcast transcript via Gemini; citations from chunk metadata; TTS for podcast .mp3; persist under artifacts/.
`backend/tts.py`	TTS: HF Inference API or gTTS fallback.
`backend/utils.py`	Logging, user_data_dir, read_json/write_json/read_jsonl/append_jsonl, normalize_text.
`app.py`	Gradio UI: notebooks, sources, chat, citations, artifacts. All handlers take `request` for username.
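
As an illustration of the JSONL helpers listed for `backend/utils.py`, here is a minimal sketch of `append_jsonl` and `read_jsonl`. The signatures and the missing-file behaviour are assumptions; the real helpers may differ.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def append_jsonl(path: Path, record: Dict[str, Any]) -> None:
    """Append one JSON object as a single line, creating parent dirs if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Return all records in file order; a missing file yields an empty list."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break parsing.
        return [json.loads(line) for line in f if line.strip()]
```

An append-only format suits chat/messages.jsonl because each new message costs one write and never requires rewriting earlier history.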

Storage tree (exact)

/data/
  └── users/
      └── <username>/
          └── notebooks/
              ├── index.json
              └── <notebook-uuid>/
                  ├── files_raw/
                  ├── files_extracted/
                  ├── sources.json
                  ├── chroma/
                  ├── chat/
                  │   └── messages.jsonl
                  └── artifacts/
                      ├── index.json
                      ├── reports/
                      ├── quizzes/
                      └── podcasts/
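
The tree above maps naturally onto a handful of path helpers. The following is a sketch only: `user_data_dir` is named in the modules table, but `notebook_dir`, `notebook_paths`, the dictionary keys, and the `/data` default for `DATA_ROOT` are assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Dict

# Assumption: DATA_ROOT defaults to /data, matching the tree above.
DATA_ROOT = Path(os.environ.get("DATA_ROOT", "/data"))


def user_data_dir(username: str) -> Path:
    """Root directory for everything owned by one user."""
    return DATA_ROOT / "users" / username


def notebook_dir(username: str, notebook_id: str) -> Path:
    """Directory for a single notebook, keyed by its UUID."""
    return user_data_dir(username) / "notebooks" / notebook_id


def notebook_paths(username: str, notebook_id: str) -> Dict[str, Path]:
    """Named paths for the files and folders shown in the storage tree."""
    root = notebook_dir(username, notebook_id)
    artifacts = root / "artifacts"
    return {
        "files_raw": root / "files_raw",
        "files_extracted": root / "files_extracted",
        "sources": root / "sources.json",
        "chroma": root / "chroma",
        "messages": root / "chat" / "messages.jsonl",
        "artifact_index": artifacts / "index.json",
        "reports": artifacts / "reports",
        "quizzes": artifacts / "quizzes",
        "podcasts": artifacts / "podcasts",
    }
```

Deriving every path from `(username, notebook_id)` in one place keeps handlers from assembling paths by hand, so a change to the layout touches a single module.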

Request flow

Auth: Every state-changing handler receives a gr.Request; get_username_from_request(request) returns the username (or MOCK_USER / anonymous). All paths are under user_data_dir(username).
Notebook: The user selects, creates, renames, or deletes notebooks. The current notebook_id is kept in state and in a hidden textbox; all source, chat, and artifact operations are keyed by (username, notebook_id).
Ingestion: Upload or URL → extract text (pypdf/python-pptx/readability) → chunk (recursive split, overlap) → embed (HF or local) → upsert Chroma with metadata (source_id, source_name, page_or_slide, enabled). sources.json updated.
RAG: Query → embed → Chroma query (filter enabled sources) → optional MMR → build context string → LLM with citation instructions → append to messages.jsonl. Retrieval and generation times logged and shown in UI.
Artifacts: Report/quiz/podcast use retrieval again (artifact-specific query + extra instruction). Gemini generates Markdown (context-only; citations [1], [2] mapped from chunk metadata). Podcast transcript from Gemini; TTS (HF or gTTS) for .mp3. Entries appended to artifacts/index.json.
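
The "build context string" step and the mapping of [1], [2] citations to chunk metadata can be sketched as follows. The chunk shape (`text` plus a `metadata` dict) and the function name `build_context` are assumptions for illustration; the actual code in `backend/rag.py` and `backend/artifacts.py` may be organised differently.

```python
from typing import Any, Dict, List, Tuple


def build_context(chunks: List[Dict[str, Any]]) -> Tuple[str, List[Dict[str, Any]]]:
    """Number retrieved chunks and return (context string, citation list).

    Each chunk is assumed to look like:
        {"text": "...", "metadata": {"source_name": "...", "page_or_slide": 3}}
    """
    blocks = []
    citations = []
    for number, chunk in enumerate(chunks, start=1):
        meta = chunk["metadata"]
        # The marker [n] in the prompt is what the model is asked to cite.
        blocks.append(f"[{number}] {chunk['text']}")
        citations.append(
            {
                "n": number,
                "source_name": meta["source_name"],
                "page_or_slide": meta.get("page_or_slide"),
            }
        )
    return "\n\n".join(blocks), citations
```

Because the numbering comes from retrieval order rather than from the model, a marker such as [2] in the generated answer can be resolved back to a source name and page or slide without parsing anything beyond the bracketed number.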

Chroma

One collection per notebook, named chunks.
Documents stored with metadata: source_id, source_name, source_type, page_or_slide, chunk_index, enabled.
Chunk IDs: {source_id}::{chunk_index}.
Retrieval filters by enabled source_id; supports similarity-only or MMR.
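
The ID scheme, metadata, and source filter described above can be sketched in plain Python. The helper names are hypothetical, and the filter assumes that retrieval passes the list of enabled source IDs to Chroma's `where` argument using the `$in` operator; the project may instead filter on the `enabled` metadata flag.

```python
from typing import Any, Dict, Iterable, List, Tuple


def chunk_id(source_id: str, chunk_index: int) -> str:
    """Chunk IDs follow the {source_id}::{chunk_index} pattern."""
    return f"{source_id}::{chunk_index}"


def build_upsert_payload(
    source_id: str,
    source_name: str,
    source_type: str,
    chunks: List[Tuple[str, int]],
) -> Dict[str, List[Any]]:
    """Turn (text, page_or_slide) pairs into ids, documents, and metadatas."""
    ids, documents, metadatas = [], [], []
    for index, (text, page_or_slide) in enumerate(chunks):
        ids.append(chunk_id(source_id, index))
        documents.append(text)
        metadatas.append(
            {
                "source_id": source_id,
                "source_name": source_name,
                "source_type": source_type,
                "page_or_slide": page_or_slide,
                "chunk_index": index,
                "enabled": True,
            }
        )
    return {"ids": ids, "documents": documents, "metadatas": metadatas}


def enabled_where(enabled_source_ids: Iterable[str]) -> Dict[str, Any]:
    """Build a metadata filter restricting results to enabled sources."""
    ids = list(enabled_source_ids)
    if not ids:
        # Assumption: with nothing enabled, the caller should skip retrieval.
        raise ValueError("no enabled sources to search")
    return {"source_id": {"$in": ids}}
```

The payload keys match the keyword arguments of a Chroma collection's `upsert`, so under these assumptions it could be passed as `collection.upsert(**payload)`. Deterministic IDs mean re-ingesting a source overwrites its chunks instead of duplicating them.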

Configuration (env)

See README and backend/config.py: GEMINI_API_KEY, GEMINI_MODEL (artifacts); HF_TOKEN, HF_LLM_MODEL, HF_EMBED_MODEL, HF_TTS_MODEL (chat/embeddings); CHUNK_SIZE, CHUNK_OVERLAP, TOP_K, MMR_LAMBDA, MOCK_USER, DATA_ROOT.
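
A sketch of how the retrieval-related variables might be read is shown below. The `Settings` class, the `load_settings` function, and every default value are placeholders chosen for illustration; the real defaults live in `backend/config.py` and the README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    chunk_size: int
    chunk_overlap: int
    top_k: int
    mmr_lambda: float
    mock_user: Optional[str]
    data_root: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment (or a mapping, which eases testing)."""
    env = os.environ if env is None else env
    settings = Settings(
        # All defaults below are placeholder assumptions, not the project's values.
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k=int(env.get("TOP_K", "5")),
        mmr_lambda=float(env.get("MMR_LAMBDA", "0.5")),
        mock_user=env.get("MOCK_USER") or None,
        data_root=env.get("DATA_ROOT", "/data"),
    )
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings
```

A frozen dataclass fits the "no global mutable state" note for `backend/config.py`: values are parsed once and cannot be changed afterwards.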