Spaces:
Sleeping
Sleeping
metadata
title: NotebookLMClone
emoji: ๐
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
Ingestion Module โ NotebookLM Clone (MVP)
This repository contains the ingestion module for a NotebookLM-style project. The ingestion pipeline extracts text from multiple source types, chunks text intelligently, computes embeddings (with provider flexibility), and stores vectors in Chroma for later RAG use.
Features
- Multi-format source extraction: TXT, PDF (with optional OCR via pytesseract), PPTX, and URLs
- Token-aware intelligent chunking: Sentence-based splitting with configurable overlap and token limits
- Flexible embedding providers: Switch between local (sentence-transformers), OpenAI, and HuggingFace APIs via env vars
- Local-first by default: Runs fully offline with no API keys required
- Structured storage: File-based raw/extracted organization + Chroma vectors with user/notebook isolation
- CLI interface: Simple commands for upload, URL extraction, and end-to-end ingestion
- Comprehensive testing: Unit tests + integration tests covering the full pipeline
Quick Start
1. Install Dependencies
# Create and activate virtual environment
python -m venv .venv
# Windows PowerShell:
. .venv\Scripts\Activate.ps1
# macOS/Linux:
# source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
2. Configure Embedding Provider (Optional)
Copy .env.example to .env and set your preferred provider:
cp .env.example .env
# Edit .env to choose provider: "local" (default), "openai", or "huggingface"
- Local (default): Uses sentence-transformers (offline, no API key)
- OpenAI: Set
OPENAI_API_KEY(requires active OpenAI account) - HuggingFace: Set
HF_API_TOKEN(requires HF account)
3. CLI Usage Examples
Upload and extract a text file:
python -m src.ingestion.cli upload --user alice --notebook nb1 --path tests/data/sample.txt
Extract from a URL:
python -m src.ingestion.cli url --user alice --notebook nb1 --url https://example.com/article
Ingest into Chroma (chunk, embed, store):
python -m src.ingestion.cli ingest --user alice --notebook nb1 --source-id <source-id>
Ingest with custom embedding provider:
python -m src.ingestion.cli ingest --user alice --notebook nb1 \
--source-id <source-id> \
--embedding-provider openai \
--embedding-model text-embedding-3-large
4. Run Tests
pytest -v # Verbose
pytest -q # Quiet
Supported File Types
| Format | Extractor | Notes |
|---|---|---|
.txt |
extract_text_from_txt() |
UTF-8 or latin-1 encoding |
.pdf |
extract_text_from_pdf() |
Optional OCR with --ocr flag (requires pytesseract) |
.pptx |
extract_text_from_pptx() |
Extracts text from all slides |
| URL | extract_text_from_url() |
Fetches & uses readability for main content |
Architecture
Core Modules
src/ingestion/storage.py: LocalStorageAdapter for file organization (raw/extracted/chunks)src/ingestion/extractors.py: Multi-format text extraction (TXT, PDF, PPTX, URL)src/ingestion/chunker.py: Token-aware intelligent chunking with NLTK & transformerssrc/ingestion/embeddings.py: Provider-switching embedding adapter (local/OpenAI/HF)src/ingestion/vectorstore.py: ChromaDB wrapper with user/notebook isolationsrc/ingestion/cli.py: Full-featured CLI for upload, URL, and ingest operations
Storage Layout
data/
users/
<user_id>/
notebooks/
<notebook_id>/
files_raw/ # Original file uploads
files_extracted/ # Extracted text
chroma/ # Persistent Chroma data
Configuration & Environment Variables
See .env.example for all options:
# Embedding configuration
EMBEDDING_PROVIDER=local # [local|openai|huggingface]
EMBEDDING_MODEL=all-MiniLM-L6-v2 # Model identifier
OPENAI_API_KEY=sk-... # For OpenAI provider
HF_API_TOKEN=hf_... # For HuggingFace provider
# Storage configuration
STORAGE_BASE_DIR=./data # Base directory for file storage
CHROMA_PERSIST_DIR=./chroma_data # Chroma persistence (optional)
Optional Dependencies
For enhanced functionality, install optional packages:
# PDF with OCR (requires system tesseract installation)
pip install pytesseract pillow pdf2image
# LangChain integration (future)
pip install langchain
# Additional models
pip install openai tiktoken
Testing
# Run all tests
pytest -v
# Run specific test module
pytest tests/test_storage_and_chunker.py -v
# Run integration tests only
pytest tests/test_integration.py -v
# Check coverage
pytest --cov=src tests/
API Examples
Python API
from src.ingestion.extractors import extract_text_from_txt, extract_text_from_pdf
from src.ingestion.chunker import chunk_text
from src.ingestion.embeddings import EmbeddingAdapter
from src.ingestion.vectorstore import ChromaAdapter
# Extract text
result = extract_text_from_txt("path/to/file.txt")
text = result["text"]
# Chunk
chunks = chunk_text(text, chunk_size_tokens=500, overlap_tokens=50)
# Embed (with provider switching)
embedder = EmbeddingAdapter(provider="local", model_name="all-MiniLM-L6-v2")
embeddings = embedder.embed_texts([c["text"] for c in chunks])
# Store in Chroma
store = ChromaAdapter(persist_directory="./data/chroma")
store.upsert_chunks("alice", "notebook1", chunks, embeddings)
Notes
- Default stack is local-first โ no API keys required. All processing happens offline using sentence-transformers.
- PDF OCR: Requires system
tesseractinstallation. See pytesseract docs for setup. - Chunking: Token counts approximate document length. Adjust
chunk_size_tokensandoverlap_tokensfor your use case. - Embedding dimensions: all-MiniLM-L6-v2 produces 384-dim vectors. OpenAI text-embedding-3-small produces 1536-dim.
- Chroma persistence: Uses DuckDB+Parquet backend when
persist_directoryis set. Ephemeral (in-memory) mode for testing.
- Install Python 3.10.19, create a virtual environment, and install dependencies:
# install Python 3.10.11 (use installer from python.org or your package manager)
python --version # should report 3.10.11
python -m venv .venv
# macOS / Linux
source .venv/bin/activate
# Windows PowerShell
. .venv\Scripts\Activate.ps1
# then install dependencies
pip install -r requirements.txt
- CLI usage examples (from repo root):
- Upload a text file (saves raw file and extracts text for .txt files):
python -m src.ingestion.cli upload --user alice --notebook nb1 --path tests/data/sample.txt
- Ingest an extracted source into Chroma (run after upload/url):
# supply the source-id printed during upload or omit to let CLI create one
python -m src.ingestion.cli ingest --user alice --notebook nb1 --source-id <source_id>
- Run tests:
pytest -q
Files of interest
src/ingestion/storage.py: LocalStorageAdapter and storage layout.src/ingestion/extractors.py: TXT and URL extractors.src/ingestion/chunker.py: Token-aware chunker.src/ingestion/embeddings.py: Local sentence-transformers adapter.src/ingestion/vectorstore.py: Chroma adapter.src/ingestion/cli.py: Simple CLI to exercise upload, url, and ingest flows.
Notes
- Default stack is local-first (no API keys required). If you enable OpenAI/HF embedding providers or cloud storage, set
OPENAI_API_KEY,HF_API_TOKEN, or cloud credentials as appropriate. - For large PDFs requiring OCR, install
tesseractand the optional Python packages listed inrequirements.txtcomment section.
Hugging Face Docker Space (Full Stack)
This repo now includes:
Dockerfilestart_hf.sh(starts FastAPI on:8000and Streamlit on${PORT:-7860}).dockerignore
Deploy Steps
- Create or switch your Space to Docker SDK.
- Push this repo to your Space (or use the GitHub Action sync workflow already in
.github/workflows/deploy-hf-space.yml). - In Space variables/secrets, set at minimum:
AUTH_MODE=dev(orhf_oauth)APP_SESSION_SECRET=<strong-random-secret>STORAGE_BASE_DIR=/dataOPENAI_API_KEY=<key>(if using OpenAI features)
- For HF OAuth mode, also set:
HF_OAUTH_CLIENT_IDHF_OAUTH_CLIENT_SECRETHF_OAUTH_REDIRECT_URI(Space URL registered in your HF Connected App, e.g.https://<space>.hf.space/)AUTH_SUCCESS_REDIRECT_URL(your Space URL)
The container exposes Streamlit on port 7860 and points the frontend to the internal backend via BACKEND_URL=http://127.0.0.1:8000.