Ingestion Pipeline Specification

Summary

  • Purpose: build the ingestion pipeline that extracts text from PDFs, PPTX, TXT, and URLs; chunks text; computes embeddings; and stores embeddings + metadata in a vector database with per-user and per-notebook isolation for RAG with citations.

Goals & Scope

  • Support file and URL ingestion for per-user notebooks.
  • Produce chunked, embedded documents with provenance metadata (source, page/slide, offsets).
  • Provide an API/worker flow with status tracking and retries.

High-level Architecture

  • Ingestion API: Accept uploads/URLs, validate, create source record, enqueue ingestion job.
  • Worker: Extract -> Preprocess -> Chunk -> Embed -> Upsert to vector DB -> Persist metadata.
  • Storage: raw files in data/raw/{user_id}/{notebook_id}/, metadata in a lightweight DB (SQLite/JSON), embeddings in a vector DB (Chroma/FAISS).
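
The raw-file layout above can be sketched as a small path builder. This is only an illustration: `safe_component` and `raw_file_path` are hypothetical names, and the character whitelist is an assumption chosen to keep client-supplied values from escaping `data/raw/`.

```python
import re
from pathlib import Path

RAW_ROOT = Path("data/raw")  # layout taken from the spec

# Assumption: a conservative whitelist; anything else becomes "_".
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]")


def safe_component(value: str) -> str:
    """Sanitise one path component and reject empty or dot-only names."""
    cleaned = _UNSAFE.sub("_", value)
    if cleaned.strip(".") == "":
        raise ValueError(f"unusable path component: {value!r}")
    return cleaned


def raw_file_path(user_id: str, notebook_id: str, filename: str) -> Path:
    """Build data/raw/{user_id}/{notebook_id}/{filename}."""
    # Path(...).name drops any directory part supplied by the client.
    name = safe_component(Path(filename).name)
    return RAW_ROOT / safe_component(user_id) / safe_component(notebook_id) / name
```

Sanitising every component separately means a value such as `..` for `user_id` is rejected rather than silently resolved to a parent directory.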

Components

  • Upload Handler: validates file type/size and writes raw file.
  • Extractors: modular functions for each type (PDF, PPTX, TXT, URL).
  • Preprocessing: normalize whitespace, remove boilerplate, preserve page/slide indices and char offsets.
  • Chunker: token-aware sliding window with overlap; store chunk-level metadata.
  • Embedder: pluggable adapter (local sentence-transformers or API-based embeddings).
  • Vectorstore Adapter: pluggable (default Chroma). Upserts include chunk text + metadata.
  • Metadata Store: tracks notebooks, sources, ingestion status, timestamps, and source enable/disable flags.
  • Job Orchestrator: simple background queue with retries (Redis/RQ or asyncio-based worker).
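
One way to keep the embedder and vector store pluggable is to have the worker depend only on small interfaces. The sketch below assumes hypothetical `Embedder` and `VectorStore` protocols and a `run_ingestion` function; none of these names come from the spec, and the `source_id:index` chunk id format is likewise an assumption.

```python
from typing import Callable, Protocol, Sequence


class Embedder(Protocol):
    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class VectorStore(Protocol):
    def upsert(
        self, ids: list[str], vectors: list[list[float]], metadatas: list[dict]
    ) -> None: ...


def run_ingestion(
    source_id: str,
    raw: bytes,
    extract: Callable[[bytes], str],
    preprocess: Callable[[str], str],
    chunk: Callable[[str], list[str]],
    embedder: Embedder,
    store: VectorStore,
) -> int:
    """Extract -> Preprocess -> Chunk -> Embed -> Upsert; returns the chunk count."""
    text = preprocess(extract(raw))
    chunks = chunk(text)
    vectors = embedder.embed(chunks)
    # Assumed id scheme: "<source_id>:<chunk index>".
    ids = [f"{source_id}:{i}" for i in range(len(chunks))]
    metadatas = [{"source_id": source_id, "text": c} for c in chunks]
    store.upsert(ids, vectors, metadatas)
    return len(chunks)
```

Because every stage is passed in, unit tests can substitute fakes for the embedder and store without touching a model or a database.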

Extractors (recommended implementations)

  • PDF: PyMuPDF (fitz) or pdfminer.six; extract per-page text and char offsets.
  • PPTX: python-pptx; extract slide text and speaker notes per slide.
  • TXT: read with encoding detection (chardet / charset-normalizer).
  • URL: requests + BeautifulSoup with readability extraction; sanitize HTML.
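
The "modular functions for each type" idea can be sketched as a registry keyed by file extension. The example below only implements TXT and uses a standard-library decoding fallback (UTF-8, then Latin-1) as a stand-in for chardet / charset-normalizer; `EXTRACTORS`, `extract`, and the segment shape `{"page", "text"}` are assumptions for illustration.

```python
from pathlib import Path
from typing import Callable


def decode_text(data: bytes) -> str:
    """Try UTF-8 (with optional BOM); fall back to Latin-1, which never fails."""
    try:
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        return data.decode("latin-1")


def extract_txt(data: bytes) -> list[dict]:
    # Plain text has no pages, so a single segment with page=None is returned.
    return [{"page": None, "text": decode_text(data)}]


# PDF, PPTX and URL extractors would be registered here in the same way.
EXTRACTORS: dict[str, Callable[[bytes], list[dict]]] = {".txt": extract_txt}


def extract(filename: str, data: bytes) -> list[dict]:
    """Dispatch to the extractor registered for the file extension."""
    suffix = Path(filename).suffix.lower()
    extractor = EXTRACTORS.get(suffix)
    if extractor is None:
        raise ValueError(f"unsupported file type: {filename!r}")
    return extractor(data)
```

Returning a list of page-tagged segments from every extractor keeps the page/slide index available to the chunker regardless of source type.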

Chunking

  • Default: 500 tokens per chunk with 50 token overlap (configurable via CHUNK_TOKENS, CHUNK_OVERLAP).
  • Fallback for token-agnostic environments: 2000 characters per chunk with a 400-character overlap.
  • Store for each chunk: chunk_id, source_id, char_start, char_end, page, text_preview.

Embeddings

  • Adapter interface supports local models (sentence-transformers/all-MiniLM) or API-based embeddings (OpenAI/HF).
  • Configurable batch size (default 64) and rate limiting.
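
Batching and rate limiting can be layered on top of any embedding backend. In the sketch below, `embed_batch` stands for whatever callable the chosen adapter exposes, and `min_interval` (a fixed pause between batches) is a deliberately simple, assumed form of rate limiting; a production limiter would likely be more elaborate.

```python
import time
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 64,
    min_interval: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[list[float]]:
    """Embed texts batch by batch, pausing between batches if requested."""
    vectors: list[list[float]] = []
    for index, batch in enumerate(batched(texts, batch_size)):
        if index > 0 and min_interval > 0:
            sleep(min_interval)  # crude rate limit: wait before every batch after the first
        vectors.extend(embed_batch(batch))
    return vectors
```

Injecting `sleep` keeps tests fast, since a fake can record the requested pauses instead of waiting.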

Vector DB

  • Default: Chroma for HF Spaces compatibility; can swap to FAISS or Weaviate later.
  • Use namespacing by user_id + notebook_id or include those fields in metadata for isolation.
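
The metadata-based isolation option can be illustrated with a toy in-memory store. `InMemoryStore` is hypothetical and scores by plain dot product; it is not the Chroma API. The point it shows is that `user_id` and `notebook_id` are mandatory query arguments, so a caller cannot forget the filter.

```python
class InMemoryStore:
    """Toy vector store that filters on user_id and notebook_id at query time."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, chunk_id: str, vector: list[float], metadata: dict) -> None:
        # Refuse rows that could not be isolated later.
        if "user_id" not in metadata or "notebook_id" not in metadata:
            raise ValueError("metadata must include user_id and notebook_id")
        self._rows[chunk_id] = (vector, metadata)

    def query(
        self, vector: list[float], *, user_id: str, notebook_id: str, k: int = 3
    ) -> list[tuple[str, dict]]:
        """Return the top-k chunks belonging to this user and notebook only."""
        scored = [
            (sum(a * b for a, b in zip(vector, stored)), chunk_id, meta)
            for chunk_id, (stored, meta) in self._rows.items()
            if meta["user_id"] == user_id and meta["notebook_id"] == notebook_id
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(chunk_id, meta) for _, chunk_id, meta in scored[:k]]
```

With a real vector database, the same contract would translate into either one collection per user and notebook or a metadata filter applied on every query.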

Metadata Schema (suggested)

  • Notebook: { notebook_id, user_id, name, created_at, updated_at }
  • Source: { source_id, notebook_id, user_id, filename, url?, status, pages, size_bytes, created_at, error? }
  • Chunk: { chunk_id, source_id, notebook_id, user_id, char_start, char_end, page, text_preview }
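
If SQLite is chosen for the metadata store, the suggested schema might map to tables as sketched below. Column types, the `'queued'` default status, and the `enabled` column (taken from the enable/disable flag mentioned under Components) are assumptions rather than part of the schema above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS notebooks (
    notebook_id TEXT PRIMARY KEY,
    user_id     TEXT NOT NULL,
    name        TEXT NOT NULL,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS sources (
    source_id   TEXT PRIMARY KEY,
    notebook_id TEXT NOT NULL REFERENCES notebooks(notebook_id),
    user_id     TEXT NOT NULL,
    filename    TEXT,
    url         TEXT,                              -- optional, URL sources only
    status      TEXT NOT NULL DEFAULT 'queued',    -- assumed initial status
    pages       INTEGER,
    size_bytes  INTEGER,
    enabled     INTEGER NOT NULL DEFAULT 1,        -- assumed enable/disable flag
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    error       TEXT                               -- optional, set on failure
);
CREATE TABLE IF NOT EXISTS chunks (
    chunk_id     TEXT PRIMARY KEY,
    source_id    TEXT NOT NULL REFERENCES sources(source_id),
    notebook_id  TEXT NOT NULL,
    user_id      TEXT NOT NULL,
    char_start   INTEGER NOT NULL,
    char_end     INTEGER NOT NULL,
    page         INTEGER,
    text_preview TEXT
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database, enable foreign keys, and create missing tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn
```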

API Contract (ingest endpoints)

  • POST /ingest/upload: multipart file, user_id, notebook_id → returns { source_id, job_id }.
  • POST /ingest/url: body { url, user_id, notebook_id } → returns { source_id, job_id }.
  • GET /ingest/status?job_id=... → returns status and error if failed.
  • POST /ingest/enable_source: enable/disable source for retrieval.
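
The request and response shapes for the URL and status endpoints can be sketched independently of any web framework. The handler names, the in-memory `JOBS` registry, and the `"queued"` status value are assumptions; a real service would wire these functions into its router and persist jobs in the metadata store.

```python
import uuid

# Assumption: an in-process registry standing in for the metadata store.
JOBS: dict[str, dict] = {}


def ingest_url(body: dict) -> dict:
    """Handle POST /ingest/url: validate the body and register a job."""
    missing = [key for key in ("url", "user_id", "notebook_id") if not body.get(key)]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    source_id = uuid.uuid4().hex
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"source_id": source_id, "status": "queued", "error": None}
    return {"source_id": source_id, "job_id": job_id}


def ingest_status(job_id: str) -> dict:
    """Handle GET /ingest/status: report status and the error, if any."""
    job = JOBS.get(job_id)
    if job is None:
        raise KeyError(f"unknown job_id: {job_id}")
    return {"status": job["status"], "error": job["error"]}
```

A client would call `ingest_url`, keep the returned `job_id`, and poll `ingest_status` until the job leaves the queued state.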

Citation Strategy

  • Store source_name, file_path or url, and page/slide in each chunk's metadata.
  • Retrieval returns top-k chunks with metadata; present citations like [SourceName - page 3].
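
Turning chunk metadata into a citation label could look like the sketch below. `format_citation` is a hypothetical helper; the `slide` key, the plain hyphen separator, and the fallback order (source_name, then url, then file_path) are assumptions.

```python
def format_citation(meta: dict) -> str:
    """Render chunk metadata as a bracketed citation label."""
    name = (
        meta.get("source_name")
        or meta.get("url")
        or meta.get("file_path")
        or "unknown source"
    )
    # Prefer a page number; fall back to a slide number for presentations.
    if meta.get("page") is not None:
        return f"[{name} - page {meta['page']}]"
    if meta.get("slide") is not None:
        return f"[{name} - slide {meta['slide']}]"
    return f"[{name}]"


def cite_all(chunks: list[dict]) -> list[str]:
    """Build de-duplicated citations for retrieved chunks, keeping rank order."""
    seen: set[str] = set()
    citations: list[str] = []
    for chunk in chunks:
        label = format_citation(chunk)
        if label not in seen:
            seen.add(label)
            citations.append(label)
    return citations
```

De-duplicating matters because overlapping chunks from the same page frequently appear together in the top-k results.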

Job Orchestration

  • Small files: synchronous ingestion (fast path).
  • Large files: enqueue to background worker with retry/backoff and status updates.
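
Retry with backoff and status updates might be sketched as follows. `run_with_retries`, the status strings, and the exponential schedule (`base_delay` doubling per attempt) are assumptions; queue libraries such as RQ provide their own retry facilities that could replace this.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    job: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    on_status: Callable[[str], None] = lambda status: None,
) -> T:
    """Run job(); on failure wait base_delay * 2**(attempt - 1) and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        on_status("running")
        try:
            result = job()
        except Exception:
            if attempt == max_attempts:
                on_status("failed")
                raise  # surface the last error to the caller
            on_status("retrying")
            sleep(base_delay * 2 ** (attempt - 1))
        else:
            on_status("done")
            return result
    raise AssertionError("unreachable")
```

The `on_status` callback is where a worker would write the job status back to the metadata store so that the status endpoint can report it.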

Security & Operational Considerations

  • Sanitize HTML from URLs; validate file types and size limits (configurable, e.g., 100MB).
  • Rate-limit embedding calls to control cost.
  • Log ingestion events and basic metrics.
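
Upload validation can be kept to a few explicit checks. In this sketch, `validate_upload` is a hypothetical helper, the allowed extensions follow the file types named in this spec, and reading "100MB" as 100 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".pdf", ".pptx", ".txt"}
MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # assumption: "100MB" interpreted as 100 MiB


def validate_upload(
    filename: str, size_bytes: int, max_bytes: int = MAX_UPLOAD_BYTES
) -> str:
    """Return the normalised extension, or raise ValueError if the upload is rejected."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    if size_bytes <= 0:
        raise ValueError("empty upload")
    if size_bytes > max_bytes:
        raise ValueError(f"file too large: {size_bytes} > {max_bytes} bytes")
    return suffix
```

An extension check is only a first filter; inspecting the file's leading bytes before extraction would catch renamed files.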

File Layout (suggested)

  • ingest/
    • api.py: upload + status endpoints
    • worker.py: ingestion worker and job runner
    • extractors.py: file/URL extraction functions
    • chunker.py: token-aware chunking utilities
    • embedder.py: embedding adapter
    • vectorstore.py: vector DB adapter
    • metadata.py: metadata store (SQLite/TinyDB)
  • data/raw/ and data/meta/

Config / Environment Variables (examples)

  • EMBEDDING_PROVIDER=local|openai|hf
  • CHUNK_TOKENS=500
  • CHUNK_OVERLAP=50
  • EMBED_BATCH_SIZE=64
  • VECTORSTORE=chroma|faiss|weaviate
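
These variables could be read into a typed, validated config object at startup. `IngestConfig` and `load_config` are hypothetical names, and `"local"` as the default provider is an assumption, since the spec leaves the default embedding provider open.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PROVIDERS = {"local", "openai", "hf"}
VECTORSTORES = {"chroma", "faiss", "weaviate"}


@dataclass(frozen=True)
class IngestConfig:
    embedding_provider: str = "local"  # assumed default; not fixed by the spec
    chunk_tokens: int = 500
    chunk_overlap: int = 50
    embed_batch_size: int = 64
    vectorstore: str = "chroma"


def load_config(env: Optional[Mapping[str, str]] = None) -> IngestConfig:
    """Build the config from environment variables and validate it."""
    env = os.environ if env is None else env
    defaults = IngestConfig()
    cfg = IngestConfig(
        embedding_provider=env.get("EMBEDDING_PROVIDER", defaults.embedding_provider),
        chunk_tokens=int(env.get("CHUNK_TOKENS", defaults.chunk_tokens)),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", defaults.chunk_overlap)),
        embed_batch_size=int(env.get("EMBED_BATCH_SIZE", defaults.embed_batch_size)),
        vectorstore=env.get("VECTORSTORE", defaults.vectorstore),
    )
    if cfg.embedding_provider not in PROVIDERS:
        raise ValueError(f"unknown EMBEDDING_PROVIDER: {cfg.embedding_provider}")
    if cfg.vectorstore not in VECTORSTORES:
        raise ValueError(f"unknown VECTORSTORE: {cfg.vectorstore}")
    if cfg.chunk_tokens <= 0 or not 0 <= cfg.chunk_overlap < cfg.chunk_tokens:
        raise ValueError("require CHUNK_TOKENS > 0 and 0 <= CHUNK_OVERLAP < CHUNK_TOKENS")
    if cfg.embed_batch_size <= 0:
        raise ValueError("EMBED_BATCH_SIZE must be positive")
    return cfg
```

Failing fast on an overlap that is not smaller than the chunk size prevents a chunker from looping without advancing.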

Acceptance Criteria

  • API accepts PDF/PPTX/TXT/URL and returns source_id + job_id.
  • Worker extracts text with page/slide metadata and chunks according to config.
  • Embeddings stored in vector DB with metadata linking back to source and offsets.
  • Retrieval returns chunk text + metadata for citation.
  • Per-user and per-notebook isolation enforced.

Testing

  • Unit tests for extractors (use small sample files), chunker determinism, embedder adapter (mocked), and vectorstore upsert/retrieve (in-memory).

CI / Deployment

  • Add a GitHub Actions workflow that runs the tests and pushes to the Hugging Face Space when the main branch is updated.

Next Steps (immediate)

  1. Scaffold ingest/ module files and a small runner script.
  2. Add unit tests and sample files for extractors and chunker.
  3. Implement Chroma adapter and local sentence-transformers embedding support.

Open question: confirm the preferred default embedding provider, either sentence-transformers (local, free) or openai/hf (API-based).