title: NotebookLMClone
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false

Ingestion Module - NotebookLM Clone (MVP)

This repository contains the ingestion module for a NotebookLM-style project. The ingestion pipeline extracts text from multiple source types, chunks text intelligently, computes embeddings (with provider flexibility), and stores vectors in Chroma for later RAG use.

Features

  • Multi-format source extraction: TXT, PDF (with optional OCR via pytesseract), PPTX, and URLs
  • Token-aware intelligent chunking: Sentence-based splitting with configurable overlap and token limits
  • Flexible embedding providers: Switch between local (sentence-transformers), OpenAI, and HuggingFace APIs via env vars
  • Local-first by default: Runs fully offline with no API keys required
  • Structured storage: File-based raw/extracted organization + Chroma vectors with user/notebook isolation
  • CLI interface: Simple commands for upload, URL extraction, and end-to-end ingestion
  • Comprehensive testing: Unit tests + integration tests covering the full pipeline

Quick Start

1. Install Dependencies

# Create and activate virtual environment
python -m venv .venv
# Windows PowerShell:
. .venv\Scripts\Activate.ps1
# macOS/Linux:
# source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

2. Configure Embedding Provider (Optional)

Copy .env.example to .env and set your preferred provider:

cp .env.example .env
# Edit .env to choose provider: "local" (default), "openai", or "huggingface"
  • Local (default): Uses sentence-transformers (offline, no API key)
  • OpenAI: Set OPENAI_API_KEY (requires active OpenAI account)
  • HuggingFace: Set HF_API_TOKEN (requires HF account)
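
The provider rules above can be sketched as a small validation step. This is a minimal illustration, not the code in src/ingestion/embeddings.py: the environment variable names come from this README, while the helper resolve_embedding_provider and the REQUIRED_KEY table are hypothetical.

```python
import os

# Hypothetical lookup: the credential each provider needs (None = no key).
# Variable names are taken from this README; the helper itself is illustrative.
REQUIRED_KEY = {
    "local": None,
    "openai": "OPENAI_API_KEY",
    "huggingface": "HF_API_TOKEN",
}


def resolve_embedding_provider(env=None):
    """Return (provider, api_key); raise if the provider is unknown or its key is missing."""
    env = os.environ if env is None else env
    provider = env.get("EMBEDDING_PROVIDER", "local").strip().lower()
    if provider not in REQUIRED_KEY:
        raise ValueError(f"Unknown EMBEDDING_PROVIDER: {provider!r}")
    key_name = REQUIRED_KEY[provider]
    if key_name is None:
        return provider, None  # the local provider needs no credentials
    api_key = env.get(key_name, "").strip()
    if not api_key:
        raise ValueError(f"{key_name} must be set when EMBEDDING_PROVIDER={provider}")
    return provider, api_key
```

Failing early like this surfaces a missing key before any text is chunked or embedded.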

3. CLI Usage Examples

Upload and extract a text file:

python -m src.ingestion.cli upload --user alice --notebook nb1 --path tests/data/sample.txt

Extract from a URL:

python -m src.ingestion.cli url --user alice --notebook nb1 --url https://example.com/article

Ingest into Chroma (chunk, embed, store):

python -m src.ingestion.cli ingest --user alice --notebook nb1 --source-id <source-id>

Ingest with custom embedding provider:

python -m src.ingestion.cli ingest --user alice --notebook nb1 \
  --source-id <source-id> \
  --embedding-provider openai \
  --embedding-model text-embedding-3-large

4. Run Tests

pytest -v   # Verbose
pytest -q   # Quiet

Supported File Types

| Format | Extractor | Notes |
| --- | --- | --- |
| .txt | extract_text_from_txt() | UTF-8 or latin-1 encoding |
| .pdf | extract_text_from_pdf() | Optional OCR with --ocr flag (requires pytesseract) |
| .pptx | extract_text_from_pptx() | Extracts text from all slides |
| URL | extract_text_from_url() | Fetches & uses readability for main content |
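
As a rough illustration of how a source could be routed to one of the extractors listed above, and of what "UTF-8 or latin-1" decoding might look like, here is a minimal sketch. The extractor names come from this README; pick_extractor, decode_text, and the suffix lookup are hypothetical and may differ from the real code in src/ingestion/extractors.py.

```python
from pathlib import Path

# Extractor names are taken from this README; the lookup itself is hypothetical.
EXTRACTOR_BY_SUFFIX = {
    ".txt": "extract_text_from_txt",
    ".pdf": "extract_text_from_pdf",
    ".pptx": "extract_text_from_pptx",
}


def pick_extractor(source: str) -> str:
    """Return the extractor name for a URL or a file path."""
    if source.startswith(("http://", "https://")):
        return "extract_text_from_url"
    suffix = Path(source).suffix.lower()
    try:
        return EXTRACTOR_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


def decode_text(raw: bytes) -> str:
    """Decode as UTF-8, falling back to latin-1, which accepts any byte sequence."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")
```

Trying UTF-8 first matters because latin-1 never fails: if it were tried first, multi-byte UTF-8 text would be silently decoded into the wrong characters.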

Architecture

Core Modules

  • src/ingestion/storage.py: LocalStorageAdapter for file organization (raw/extracted/chunks)
  • src/ingestion/extractors.py: Multi-format text extraction (TXT, PDF, PPTX, URL)
  • src/ingestion/chunker.py: Token-aware intelligent chunking with NLTK & transformers
  • src/ingestion/embeddings.py: Provider-switching embedding adapter (local/OpenAI/HF)
  • src/ingestion/vectorstore.py: ChromaDB wrapper with user/notebook isolation
  • src/ingestion/cli.py: Full-featured CLI for upload, URL, and ingest operations
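
This README does not say how vectorstore.py implements user/notebook isolation. One possible approach, sketched here purely as an assumption, is to keep one Chroma collection per (user, notebook) pair and derive its name deterministically. The helper collection_name, the "nb" prefix, and the 63-character limit are all hypothetical; check the naming rules of the installed Chroma version before relying on them.

```python
import hashlib
import re

MAX_LEN = 63  # assumed upper bound on collection-name length


def collection_name(user_id: str, notebook_id: str) -> str:
    """Derive a stable name for one user/notebook pair (illustrative only)."""
    # Hash the pair with an unambiguous separator so ("a/b", "c") and
    # ("a", "b/c") cannot collide after sanitizing.
    raw = f"{user_id}\x00{notebook_id}"
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
    # Replace anything outside a conservative character set with "-".
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", f"{user_id}-{notebook_id}").strip("-_")
    head = f"nb-{slug}"[: MAX_LEN - len(digest) - 1].rstrip("-_")
    return f"{head}-{digest}"
```

The readable prefix helps when inspecting stored collections, while the digest suffix keeps names unique even when long or unusual IDs are truncated or sanitized.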

Storage Layout

data/
  users/
    <user_id>/
      notebooks/
        <notebook_id>/
          files_raw/              # Original file uploads
          files_extracted/        # Extracted text
          chroma/                 # Persistent Chroma data
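
The layout above can be expressed as a small path builder. This is a minimal sketch, assuming the directory names shown in the tree; notebook_dirs is a hypothetical helper, not the LocalStorageAdapter API.

```python
from pathlib import Path


def notebook_dirs(base_dir: str, user_id: str, notebook_id: str) -> dict:
    """Build the per-notebook directories from the layout above (no disk access)."""
    for part in (user_id, notebook_id):
        # Reject separators and dot segments so an ID cannot escape its directory.
        if not part or part in {".", ".."} or "/" in part or "\\" in part:
            raise ValueError(f"Invalid identifier: {part!r}")
    root = Path(base_dir) / "users" / user_id / "notebooks" / notebook_id
    return {
        "raw": root / "files_raw",              # original file uploads
        "extracted": root / "files_extracted",  # extracted text
        "chroma": root / "chroma",              # persistent Chroma data
    }
```

For example, notebook_dirs("./data", "alice", "nb1")["raw"] points at data/users/alice/notebooks/nb1/files_raw. Validating the IDs before joining paths keeps one user's input from addressing another user's directory.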

Configuration & Environment Variables

See .env.example for all options:

# Embedding configuration
EMBEDDING_PROVIDER=local              # [local|openai|huggingface]
EMBEDDING_MODEL=all-MiniLM-L6-v2     # Model identifier
OPENAI_API_KEY=sk-...                 # For OpenAI provider
HF_API_TOKEN=hf_...                   # For HuggingFace provider

# Storage configuration
STORAGE_BASE_DIR=./data               # Base directory for file storage
CHROMA_PERSIST_DIR=./chroma_data      # Chroma persistence (optional)

Optional Dependencies

For enhanced functionality, install optional packages:

# PDF with OCR (requires system tesseract installation)
pip install pytesseract pillow pdf2image

# LangChain integration (future)
pip install langchain

# Additional models
pip install openai tiktoken

Testing

# Run all tests
pytest -v

# Run specific test module
pytest tests/test_storage_and_chunker.py -v

# Run integration tests only
pytest tests/test_integration.py -v

# Check coverage
pytest --cov=src tests/

API Examples

Python API

from src.ingestion.extractors import extract_text_from_txt, extract_text_from_pdf
from src.ingestion.chunker import chunk_text
from src.ingestion.embeddings import EmbeddingAdapter
from src.ingestion.vectorstore import ChromaAdapter

# Extract text
result = extract_text_from_txt("path/to/file.txt")
text = result["text"]

# Chunk
chunks = chunk_text(text, chunk_size_tokens=500, overlap_tokens=50)

# Embed (with provider switching)
embedder = EmbeddingAdapter(provider="local", model_name="all-MiniLM-L6-v2")
embeddings = embedder.embed_texts([c["text"] for c in chunks])

# Store in Chroma
store = ChromaAdapter(persist_directory="./data/chroma")
store.upsert_chunks("alice", "notebook1", chunks, embeddings)
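
To make the chunk_size_tokens and overlap_tokens parameters used above more concrete, here is a minimal sketch of sentence-based chunking with overlap. It is not the implementation in src/ingestion/chunker.py: it assumes a regular-expression sentence splitter and a whitespace word count in place of NLTK and a real tokenizer, and chunk_sentences and count_tokens are hypothetical names.

```python
import re


def count_tokens(text: str) -> int:
    """Stand-in token counter: whitespace-separated words, not a real tokenizer."""
    return len(text.split())


def chunk_sentences(text, chunk_size_tokens=500, overlap_tokens=50):
    """Pack whole sentences into chunks, repeating trailing sentences as overlap."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        if current and count_tokens(" ".join(current + [sentence])) > chunk_size_tokens:
            chunks.append({"index": len(chunks), "text": " ".join(current)})
            # Carry over trailing sentences until the overlap budget is used up.
            carried, budget = [], overlap_tokens
            for prev in reversed(current):
                cost = count_tokens(prev)
                if cost > budget:
                    break
                carried.insert(0, prev)
                budget -= cost
            # Drop the overlap if it would push the next chunk over the limit.
            if count_tokens(" ".join(carried + [sentence])) > chunk_size_tokens:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append({"index": len(chunks), "text": " ".join(current)})
    return chunks
```

In this sketch a single sentence longer than chunk_size_tokens becomes its own oversized chunk rather than being split mid-sentence; a production chunker would need an explicit policy for that case.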

Notes

  • Default stack is local-first: no API keys required. All processing happens offline using sentence-transformers.
  • PDF OCR: Requires system tesseract installation. See pytesseract docs for setup.
  • Chunking: Token counts are approximate. Adjust chunk_size_tokens and overlap_tokens for your use case.
  • Embedding dimensions: all-MiniLM-L6-v2 produces 384-dim vectors. OpenAI text-embedding-3-small produces 1536-dim vectors, and text-embedding-3-large produces 3072-dim vectors by default.
  • Chroma persistence: Data is written to disk when persist_directory is set (DuckDB+Parquet in Chroma releases before 0.4, SQLite from 0.4 onward). Ephemeral (in-memory) mode is used for testing.
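
Because vector length depends on the embedding model, switching providers for a notebook that already holds vectors can produce a dimension mismatch. A simple guard before storing, sketched here under the assumption that embeddings arrive as a list of equal-length sequences, could look like this; check_dimensions is a hypothetical helper and not part of this repository.

```python
def check_dimensions(embeddings, expected_dim=None):
    """Return the shared vector length; raise if vectors disagree or miss expected_dim."""
    if not embeddings:
        raise ValueError("No embeddings to check")
    dims = {len(vec) for vec in embeddings}
    if len(dims) != 1:
        raise ValueError(f"Mixed embedding dimensions: {sorted(dims)}")
    (dim,) = dims
    if expected_dim is not None and dim != expected_dim:
        raise ValueError(f"Expected {expected_dim}-dim vectors, got {dim}")
    return dim
```

Calling check_dimensions(embeddings, expected_dim=384) before an upsert would catch, for instance, OpenAI vectors being written to a notebook originally embedded with all-MiniLM-L6-v2.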
  1. Install Python 3.10.11, create a virtual environment, and install dependencies:

# Install Python 3.10.11 (use the installer from python.org or your package manager)
python --version  # should report 3.10.11
python -m venv .venv
# macOS / Linux
source .venv/bin/activate
# Windows PowerShell
. .venv\Scripts\Activate.ps1
# Then install dependencies
pip install -r requirements.txt

  2. CLI usage examples (run from the repo root):

  • Upload a text file (saves the raw file and, for .txt files, extracts the text):

python -m src.ingestion.cli upload --user alice --notebook nb1 --path tests/data/sample.txt

  • Ingest an extracted source into Chroma (run after upload or url):

# Supply the source ID printed during upload, or omit it to let the CLI create one
python -m src.ingestion.cli ingest --user alice --notebook nb1 --source-id <source-id>

  3. Run tests:

pytest -q
Files of interest

  • src/ingestion/storage.py: LocalStorageAdapter and storage layout.
  • src/ingestion/extractors.py: TXT, PDF, PPTX, and URL extractors.
  • src/ingestion/chunker.py: Token-aware chunker.
  • src/ingestion/embeddings.py: Embedding adapter (local sentence-transformers by default; OpenAI and HuggingFace optional).
  • src/ingestion/vectorstore.py: Chroma adapter.
  • src/ingestion/cli.py: Simple CLI to exercise upload, url, and ingest flows.

Notes

  • Default stack is local-first (no API keys required). If you enable OpenAI/HF embedding providers or cloud storage, set OPENAI_API_KEY, HF_API_TOKEN, or cloud credentials as appropriate.
  • For large PDFs that require OCR, install tesseract and the optional Python packages listed in the comment section of requirements.txt.

Hugging Face Docker Space (Full Stack)

This repo now includes:

  • Dockerfile
  • start_hf.sh (starts FastAPI on :8000 and Streamlit on ${PORT:-7860})
  • .dockerignore

Deploy Steps

  1. Create or switch your Space to Docker SDK.
  2. Push this repo to your Space (or use the GitHub Action sync workflow already in .github/workflows/deploy-hf-space.yml).
  3. In Space variables/secrets, set at minimum:
    • AUTH_MODE=dev (or hf_oauth)
    • APP_SESSION_SECRET=<strong-random-secret>
    • STORAGE_BASE_DIR=/data
    • OPENAI_API_KEY=<key> (if using OpenAI features)
  4. For HF OAuth mode, also set:
    • HF_OAUTH_CLIENT_ID
    • HF_OAUTH_CLIENT_SECRET
    • HF_OAUTH_REDIRECT_URI (Space URL registered in your HF Connected App, e.g. https://<space>.hf.space/)
    • AUTH_SUCCESS_REDIRECT_URL (your Space URL)
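
The variables listed in the deploy steps above can be checked at startup with a few lines of Python. This is a minimal sketch, assuming that only the keys named in this README are required; missing_space_settings is a hypothetical helper and is not part of start_hf.sh or the app.

```python
import os

# Keys taken from the deploy steps in this README.
BASE_KEYS = ("AUTH_MODE", "APP_SESSION_SECRET", "STORAGE_BASE_DIR")
OAUTH_KEYS = (
    "HF_OAUTH_CLIENT_ID",
    "HF_OAUTH_CLIENT_SECRET",
    "HF_OAUTH_REDIRECT_URI",
    "AUTH_SUCCESS_REDIRECT_URL",
)


def missing_space_settings(env=None):
    """Return the required variables that are unset or empty, in a stable order."""
    env = os.environ if env is None else env
    required = list(BASE_KEYS)
    if env.get("AUTH_MODE") == "hf_oauth":
        required += OAUTH_KEYS  # OAuth mode needs the extra client settings
    return [key for key in required if not env.get(key)]
```

An empty list means the Space has everything this README asks for; OPENAI_API_KEY is left out of the check because it is only needed when OpenAI features are used.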

The container exposes Streamlit on port 7860 and points the frontend to the internal backend via BACKEND_URL=http://127.0.0.1:8000.