Spaces:

pbichpur
/

NotebookLMClone

Running

App Files Files Community

NotebookLMClone / README.md

github-actions[bot]

Sync from GitHub 907b7448edad59db074f2417e42629ba5c3f1cc7

dde0c6d 15 days ago

preview code

raw

history blame contribute delete

8.89 kB

	---
	title: NotebookLMClone
	emoji: 📚
	colorFrom: blue
	colorTo: indigo
	sdk: docker
	app_file: app.py
	pinned: false
	---

	# Ingestion Module — NotebookLM Clone (MVP)

	This repository contains the ingestion module for a NotebookLM-style project. The ingestion pipeline extracts text from multiple source types, chunks text intelligently, computes embeddings (with provider flexibility), and stores vectors in Chroma for later RAG use.

	## Features

	- Multi-format source extraction: TXT, PDF (with optional OCR via pytesseract), PPTX, and URLs
	- Token-aware intelligent chunking: Sentence-based splitting with configurable overlap and token limits
	- Flexible embedding providers: Switch between local (sentence-transformers), OpenAI, and HuggingFace APIs via env vars
	- Local-first by default: Runs fully offline with no API keys required
	- Structured storage: File-based raw/extracted organization + Chroma vectors with user/notebook isolation
	- CLI interface: Simple commands for upload, URL extraction, and end-to-end ingestion
	- Comprehensive testing: Unit tests + integration tests covering the full pipeline


	## Quick Start

	### 1. Install Dependencies

	```bash
	# Create and activate virtual environment
	python -m venv .venv
	# Windows PowerShell:
	. .venv\Scripts\Activate.ps1
	# macOS/Linux:
	# source .venv/bin/activate

	# Install dependencies
	pip install -r requirements.txt
	```

	### 2. Configure Embedding Provider (Optional)

	Copy `.env.example` to `.env` and set your preferred provider:

	```bash
	cp .env.example .env
	# Edit .env to choose provider: "local" (default), "openai", or "huggingface"
	```

	- Local (default): Uses sentence-transformers (offline, no API key)
	- OpenAI: Set `OPENAI_API_KEY` (requires active OpenAI account)
	- HuggingFace: Set `HF_API_TOKEN` (requires HF account)

	### 3. CLI Usage Examples

	#### Upload and extract a text file:
	```bash
	python -m src.ingestion.cli upload --user alice --notebook nb1 --path tests/data/sample.txt
	```

	#### Extract from a URL:
	```bash
	python -m src.ingestion.cli url --user alice --notebook nb1 --url https://example.com/article
	```

	#### Ingest into Chroma (chunk, embed, store):
	```bash
	python -m src.ingestion.cli ingest --user alice --notebook nb1 --source-id <source-id>
	```

	#### Ingest with custom embedding provider:
	```bash
	python -m src.ingestion.cli ingest --user alice --notebook nb1 \
	--source-id <source-id> \
	--embedding-provider openai \
	--embedding-model text-embedding-3-large
	```

	### 4. Run Tests

	```bash
	pytest -v # Verbose
	pytest -q # Quiet
	```


	## Supported File Types

	\| Format \| Extractor \| Notes \|
	\|--------\|-----------\|-------\|
	\| `.txt` \| `extract_text_from_txt()` \| UTF-8 or latin-1 encoding \|
	\| `.pdf` \| `extract_text_from_pdf()` \| Optional OCR with `--ocr` flag (requires pytesseract) \|
	\| `.pptx` \| `extract_text_from_pptx()` \| Extracts text from all slides \|
	\| URL \| `extract_text_from_url()` \| Fetches & uses readability for main content \|


	## Architecture

	### Core Modules

	- `src/ingestion/storage.py`: LocalStorageAdapter for file organization (raw/extracted/chunks)
	- `src/ingestion/extractors.py`: Multi-format text extraction (TXT, PDF, PPTX, URL)
	- `src/ingestion/chunker.py`: Token-aware intelligent chunking with NLTK & transformers
	- `src/ingestion/embeddings.py`: Provider-switching embedding adapter (local/OpenAI/HF)
	- `src/ingestion/vectorstore.py`: ChromaDB wrapper with user/notebook isolation
	- `src/ingestion/cli.py`: Full-featured CLI for upload, URL, and ingest operations

	### Storage Layout

	```
	data/
	users/
	<user_id>/
	notebooks/
	<notebook_id>/
	files_raw/ # Original file uploads
	files_extracted/ # Extracted text
	chroma/ # Persistent Chroma data
	```


	## Configuration & Environment Variables

	See `.env.example` for all options:

	```bash
	# Embedding configuration
	EMBEDDING_PROVIDER=local # [local\|openai\|huggingface]
	EMBEDDING_MODEL=all-MiniLM-L6-v2 # Model identifier
	OPENAI_API_KEY=sk-... # For OpenAI provider
	HF_API_TOKEN=hf_... # For HuggingFace provider

	# Storage configuration
	STORAGE_BASE_DIR=./data # Base directory for file storage
	CHROMA_PERSIST_DIR=./chroma_data # Chroma persistence (optional)
	```


	## Optional Dependencies

	For enhanced functionality, install optional packages:

	```bash
	# PDF with OCR (requires system tesseract installation)
	pip install pytesseract pillow pdf2image

	# LangChain integration (future)
	pip install langchain

	# Additional models
	pip install openai tiktoken
	```


	## Testing

	```bash
	# Run all tests
	pytest -v

	# Run specific test module
	pytest tests/test_storage_and_chunker.py -v

	# Run integration tests only
	pytest tests/test_integration.py -v

	# Check coverage
	pytest --cov=src tests/
	```


	## API Examples

	### Python API

	```python
	from src.ingestion.extractors import extract_text_from_txt, extract_text_from_pdf
	from src.ingestion.chunker import chunk_text
	from src.ingestion.embeddings import EmbeddingAdapter
	from src.ingestion.vectorstore import ChromaAdapter

	# Extract text
	result = extract_text_from_txt("path/to/file.txt")
	text = result["text"]

	# Chunk
	chunks = chunk_text(text, chunk_size_tokens=500, overlap_tokens=50)

	# Embed (with provider switching)
	embedder = EmbeddingAdapter(provider="local", model_name="all-MiniLM-L6-v2")
	embeddings = embedder.embed_texts([c["text"] for c in chunks])

	# Store in Chroma
	store = ChromaAdapter(persist_directory="./data/chroma")
	store.upsert_chunks("alice", "notebook1", chunks, embeddings)
	```


	## Notes

	- Default stack is local-first — no API keys required. All processing happens offline using sentence-transformers.
	- PDF OCR: Requires system `tesseract` installation. See [pytesseract docs](https://github.com/madmaze/pytesseract) for setup.
	- Chunking: Token counts approximate document length. Adjust `chunk_size_tokens` and `overlap_tokens` for your use case.
	- Embedding dimensions: all-MiniLM-L6-v2 produces 384-dim vectors. OpenAI text-embedding-3-small produces 1536-dim.
	- Chroma persistence: Uses DuckDB+Parquet backend when `persist_directory` is set. Ephemeral (in-memory) mode for testing.

	1. Install Python 3.10.19, create a virtual environment, and install dependencies:

	```bash
	# install Python 3.10.11 (use installer from python.org or your package manager)
	python --version # should report 3.10.11
	python -m venv .venv
	# macOS / Linux
	source .venv/bin/activate
	# Windows PowerShell
	. .venv\Scripts\Activate.ps1
	# then install dependencies
	pip install -r requirements.txt
	```

	2. CLI usage examples (from repo root):

	- Upload a text file (saves raw file and extracts text for .txt files):

	```bash
	python -m src.ingestion.cli upload --user alice --notebook nb1 --path tests/data/sample.txt
	```

	- Ingest an extracted source into Chroma (run after upload/url):

	```bash
	# supply the source-id printed during upload or omit to let CLI create one
	python -m src.ingestion.cli ingest --user alice --notebook nb1 --source-id <source_id>
	```

	3. Run tests:

	```bash
	pytest -q
	```

	Files of interest

	- `src/ingestion/storage.py`: LocalStorageAdapter and storage layout.
	- `src/ingestion/extractors.py`: TXT and URL extractors.
	- `src/ingestion/chunker.py`: Token-aware chunker.
	- `src/ingestion/embeddings.py`: Local sentence-transformers adapter.
	- `src/ingestion/vectorstore.py`: Chroma adapter.
	- `src/ingestion/cli.py`: Simple CLI to exercise upload, url, and ingest flows.

	Notes

	- Default stack is local-first (no API keys required). If you enable OpenAI/HF embedding providers or cloud storage, set `OPENAI_API_KEY`, `HF_API_TOKEN`, or cloud credentials as appropriate.
	- For large PDFs requiring OCR, install `tesseract` and the optional Python packages listed in `requirements.txt` comment section.

	## Hugging Face Docker Space (Full Stack)

	This repo now includes:
	- `Dockerfile`
	- `start_hf.sh` (starts FastAPI on `:8000` and Streamlit on `${PORT:-7860}`)
	- `.dockerignore`

	### Deploy Steps

	1. Create or switch your Space to Docker SDK.
	2. Push this repo to your Space (or use the GitHub Action sync workflow already in `.github/workflows/deploy-hf-space.yml`).
	3. In Space variables/secrets, set at minimum:
	- `AUTH_MODE=dev` (or `hf_oauth`)
	- `APP_SESSION_SECRET=<strong-random-secret>`
	- `STORAGE_BASE_DIR=/data`
	- `OPENAI_API_KEY=<key>` (if using OpenAI features)
	4. For HF OAuth mode, also set:
	- `HF_OAUTH_CLIENT_ID`
	- `HF_OAUTH_CLIENT_SECRET`
	- `HF_OAUTH_REDIRECT_URI` (Space URL registered in your HF Connected App, e.g. `https://<space>.hf.space/`)
	- `AUTH_SUCCESS_REDIRECT_URL` (your Space URL)

	The container exposes Streamlit on port `7860` and points the frontend to the internal backend via `BACKEND_URL=http://127.0.0.1:8000`.