---
title: NotebookLMClone
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
---

# Ingestion Module — NotebookLM Clone (MVP)

This repository contains the ingestion module for a NotebookLM-style project. The ingestion pipeline extracts text from multiple source types, chunks text intelligently, computes embeddings (with provider flexibility), and stores vectors in Chroma for later RAG use.

## Features

- **Multi-format source extraction**: TXT, PDF (with optional OCR via pytesseract), PPTX, and URLs
- **Token-aware intelligent chunking**: Sentence-based splitting with configurable overlap and token limits
- **Flexible embedding providers**: Switch between local (sentence-transformers), OpenAI, and HuggingFace APIs via env vars
- **Local-first by default**: Runs fully offline with no API keys required
- **Structured storage**: File-based raw/extracted organization + Chroma vectors with user/notebook isolation
- **CLI interface**: Simple commands for upload, URL extraction, and end-to-end ingestion
- **Comprehensive testing**: Unit tests + integration tests covering the full pipeline


## Quick Start

### 1. Install Dependencies

```bash
# Create and activate virtual environment
python -m venv .venv
# Windows PowerShell:
. .venv\Scripts\Activate.ps1
# macOS/Linux:
# source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Configure Embedding Provider (Optional)

Copy `.env.example` to `.env` and set your preferred provider:

```bash
cp .env.example .env
# Edit .env to choose provider: "local" (default), "openai", or "huggingface"
```

- **Local** (default): Uses sentence-transformers (offline, no API key)
- **OpenAI**: Set `OPENAI_API_KEY` (requires active OpenAI account)
- **HuggingFace**: Set `HF_API_TOKEN` (requires HF account)

### 3. CLI Usage Examples

#### Upload and extract a text file:
```bash
python -m src.ingestion.cli upload --user alice --notebook nb1 --path tests/data/sample.txt
```

#### Extract from a URL:
```bash
python -m src.ingestion.cli url --user alice --notebook nb1 --url https://example.com/article
```

#### Ingest into Chroma (chunk, embed, store):
```bash
python -m src.ingestion.cli ingest --user alice --notebook nb1 --source-id <source-id>
```

#### Ingest with custom embedding provider:
```bash
python -m src.ingestion.cli ingest --user alice --notebook nb1 \
  --source-id <source-id> \
  --embedding-provider openai \
  --embedding-model text-embedding-3-large
```
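
The flags used in the commands above can be summarised as an `argparse` layout. This is only a sketch of how such a CLI might be wired, not the contents of `src/ingestion/cli.py`: the program name, the placement of `--ocr` on `upload`, and treating `--source-id` as optional are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the upload / url / ingest commands shown above."""
    parser = argparse.ArgumentParser(prog="ingestion-cli")  # prog name is hypothetical
    sub = parser.add_subparsers(dest="command", required=True)

    # Options shared by every subcommand.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--user", required=True)
    common.add_argument("--notebook", required=True)

    upload = sub.add_parser("upload", parents=[common])
    upload.add_argument("--path", required=True)
    upload.add_argument("--ocr", action="store_true")  # assumed to belong to `upload`

    url = sub.add_parser("url", parents=[common])
    url.add_argument("--url", required=True)

    ingest = sub.add_parser("ingest", parents=[common])
    ingest.add_argument("--source-id")  # assumed optional
    ingest.add_argument(
        "--embedding-provider", choices=["local", "openai", "huggingface"]
    )
    ingest.add_argument("--embedding-model")
    return parser


args = build_parser().parse_args(
    ["ingest", "--user", "alice", "--notebook", "nb1",
     "--source-id", "abc123", "--embedding-provider", "openai"]
)
print(args.command, args.source_id, args.embedding_provider)
```

Sharing `--user` and `--notebook` through a parent parser keeps the three subcommands consistent without repeating the definitions.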

### 4. Run Tests

```bash
pytest -v   # Verbose
pytest -q   # Quiet
```


## Supported File Types

| Format | Extractor | Notes |
|--------|-----------|-------|
| `.txt` | `extract_text_from_txt()` | UTF-8 or latin-1 encoding |
| `.pdf` | `extract_text_from_pdf()` | Optional OCR with `--ocr` flag (requires pytesseract) |
| `.pptx` | `extract_text_from_pptx()` | Extracts text from all slides |
| URL | `extract_text_from_url()` | Fetches the page and extracts the main content with readability |
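
The table above suggests a simple dispatch from source type to extractor. The sketch below only returns the extractor's name so that it runs without the project installed; `pick_extractor` and `EXTRACTOR_BY_SUFFIX` are hypothetical helpers, and the real module may route sources differently.

```python
from pathlib import Path
from urllib.parse import urlparse

# Suffix-to-extractor mapping taken from the table above.
EXTRACTOR_BY_SUFFIX = {
    ".txt": "extract_text_from_txt",
    ".pdf": "extract_text_from_pdf",
    ".pptx": "extract_text_from_pptx",
}


def pick_extractor(source: str) -> str:
    """Return the name of the extractor that would handle `source`."""
    # Anything with an http(s) scheme is treated as a URL.
    if urlparse(source).scheme in ("http", "https"):
        return "extract_text_from_url"
    suffix = Path(source).suffix.lower()
    try:
        return EXTRACTOR_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported source type: {source!r}") from None


print(pick_extractor("tests/data/sample.txt"))
print(pick_extractor("https://example.com/article"))
```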


## Architecture

### Core Modules

- **`src/ingestion/storage.py`**: LocalStorageAdapter for file organization (raw/extracted/chunks)
- **`src/ingestion/extractors.py`**: Multi-format text extraction (TXT, PDF, PPTX, URL)
- **`src/ingestion/chunker.py`**: Token-aware intelligent chunking with NLTK & transformers
- **`src/ingestion/embeddings.py`**: Provider-switching embedding adapter (local/OpenAI/HF)
- **`src/ingestion/vectorstore.py`**: ChromaDB wrapper with user/notebook isolation
- **`src/ingestion/cli.py`**: Full-featured CLI for upload, URL, and ingest operations
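
One common way to get user/notebook isolation in Chroma is to tag every chunk with owner metadata and filter on it at query time. The sketch below assumes that approach; the metadata keys (`user_id`, `notebook_id`, `source_id`, `chunk_index`) and both helper names are hypothetical, and `vectorstore.py` may instead use separate collections or per-notebook persistence directories.

```python
def chunk_metadata(user_id: str, notebook_id: str, source_id: str, chunk_index: int) -> dict:
    """Metadata stored alongside each vector so it can be filtered later."""
    return {
        "user_id": user_id,
        "notebook_id": notebook_id,
        "source_id": source_id,
        "chunk_index": chunk_index,
    }


def notebook_filter(user_id: str, notebook_id: str) -> dict:
    """Chroma-style `where` filter matching only one user's notebook."""
    return {"$and": [{"user_id": user_id}, {"notebook_id": notebook_id}]}


print(notebook_filter("alice", "nb1"))
```

A filter like this would be passed as the `where=` argument of a Chroma collection's `query()` or `get()` call, so results from other users or notebooks are never returned.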

### Storage Layout

```
data/
  users/
    <user_id>/
      notebooks/
        <notebook_id>/
          files_raw/              # Original file uploads
          files_extracted/        # Extracted text
          chroma/                 # Persistent Chroma data
```
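
The layout above maps directly onto `pathlib`. The following is a minimal sketch of how the three per-notebook directories could be resolved; `notebook_dirs` and the ID validation are illustrative assumptions, not the `LocalStorageAdapter` API.

```python
from pathlib import Path

SUBDIRS = ("files_raw", "files_extracted", "chroma")


def _safe_segment(value: str) -> str:
    """Reject IDs that could escape the notebook directory (assumed policy)."""
    if not value or value in {".", ".."} or "/" in value or "\\" in value:
        raise ValueError(f"Invalid path segment: {value!r}")
    return value


def notebook_dirs(base_dir: str, user_id: str, notebook_id: str) -> dict[str, Path]:
    """Return the raw, extracted, and Chroma directories for one notebook."""
    root = (
        Path(base_dir)
        / "users" / _safe_segment(user_id)
        / "notebooks" / _safe_segment(notebook_id)
    )
    return {name: root / name for name in SUBDIRS}


for name, path in notebook_dirs("data", "alice", "nb1").items():
    print(name, path)
```

Calling `path.mkdir(parents=True, exist_ok=True)` on each returned path would create the tree on first use.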


## Configuration & Environment Variables

See `.env.example` for all options:

```bash
# Embedding configuration
EMBEDDING_PROVIDER=local              # [local|openai|huggingface]
EMBEDDING_MODEL=all-MiniLM-L6-v2     # Model identifier
OPENAI_API_KEY=sk-...                 # For OpenAI provider
HF_API_TOKEN=hf_...                   # For HuggingFace provider

# Storage configuration
STORAGE_BASE_DIR=./data               # Base directory for file storage
CHROMA_PERSIST_DIR=./chroma_data      # Chroma persistence (optional)
```
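
As an illustration of how these variables fit together, the sketch below reads them into a small settings object and checks that the selected provider has its API key. `IngestionSettings` and `load_settings` are hypothetical names, and the fallback values are simply copied from the example above; the project's own defaults may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Which credential each provider needs (None means no key is required).
REQUIRED_KEY = {"local": None, "openai": "OPENAI_API_KEY", "huggingface": "HF_API_TOKEN"}


@dataclass(frozen=True)
class IngestionSettings:
    embedding_provider: str
    embedding_model: str
    storage_base_dir: str
    chroma_persist_dir: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> IngestionSettings:
    """Read settings from `env`, failing early on an unusable provider setup."""
    provider = env.get("EMBEDDING_PROVIDER", "local").strip().lower()
    if provider not in REQUIRED_KEY:
        raise ValueError(f"Unknown EMBEDDING_PROVIDER: {provider!r}")
    key_name = REQUIRED_KEY[provider]
    if key_name and not env.get(key_name):
        raise ValueError(f"{key_name} must be set when EMBEDDING_PROVIDER={provider}")
    return IngestionSettings(
        embedding_provider=provider,
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        storage_base_dir=env.get("STORAGE_BASE_DIR", "./data"),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR") or None,
    )


print(load_settings({"EMBEDDING_PROVIDER": "local"}))
```

Note that the fallback model name only makes sense for the local provider; with `openai` or `huggingface`, `EMBEDDING_MODEL` should be set explicitly.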


## Optional Dependencies

For enhanced functionality, install optional packages:

```bash
# PDF with OCR (requires system tesseract installation)
pip install pytesseract pillow pdf2image

# LangChain integration (future)
pip install langchain

# Additional models
pip install openai tiktoken
```


## Testing

```bash
# Run all tests
pytest -v

# Run specific test module
pytest tests/test_storage_and_chunker.py -v

# Run integration tests only
pytest tests/test_integration.py -v

# Check coverage
pytest --cov=src tests/
```


## API Examples

### Python API

```python
from src.ingestion.extractors import extract_text_from_txt, extract_text_from_pdf
from src.ingestion.chunker import chunk_text
from src.ingestion.embeddings import EmbeddingAdapter
from src.ingestion.vectorstore import ChromaAdapter

# Extract text
result = extract_text_from_txt("path/to/file.txt")
text = result["text"]

# Chunk
chunks = chunk_text(text, chunk_size_tokens=500, overlap_tokens=50)

# Embed (with provider switching)
embedder = EmbeddingAdapter(provider="local", model_name="all-MiniLM-L6-v2")
embeddings = embedder.embed_texts([c["text"] for c in chunks])

# Store in Chroma
store = ChromaAdapter(persist_directory="./data/chroma")
store.upsert_chunks("alice", "notebook1", chunks, embeddings)
```
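
To make the `chunk_size_tokens` and `overlap_tokens` parameters concrete, here is a simplified, dependency-free sketch of sentence-based chunking with overlap. It is not the project's `chunk_text`: it splits sentences with a regex and counts whitespace-separated words as tokens, whereas the real chunker is described as using NLTK and a transformers tokenizer. The function name and the single `"text"` key in the output are assumptions.

```python
import re


def chunk_by_sentences(text: str, chunk_size_tokens: int = 500, overlap_tokens: int = 50) -> list[dict]:
    """Group sentences into chunks of roughly `chunk_size_tokens` tokens.

    Trailing sentences of each chunk (up to `overlap_tokens` tokens) are
    repeated at the start of the next chunk. A sentence is never split,
    so a single very long sentence can exceed the limit.
    """
    def count(sentence: str) -> int:
        return len(sentence.split())  # crude stand-in for a real tokenizer

    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    groups: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0

    for sentence in sentences:
        n = count(sentence)
        if current and current_tokens + n > chunk_size_tokens:
            groups.append(current)
            # Walk backwards to collect the sentences carried over as overlap.
            carried: list[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                m = count(prev)
                if carried_tokens + m > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_tokens += m
            current, current_tokens = carried, carried_tokens
        current.append(sentence)
        current_tokens += n

    if current:
        groups.append(current)
    return [{"text": " ".join(group)} for group in groups]


sample = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_by_sentences(sample, chunk_size_tokens=6, overlap_tokens=3):
    print(chunk["text"])
```

Overlapping at sentence boundaries keeps each chunk readable on its own, at the cost of chunk sizes that only approximate the requested limit.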


## Notes

- **Default stack is local-first**: no API keys are required. Embedding runs locally with sentence-transformers; the model is downloaded on first use, after which processing works offline.
- **PDF OCR**: Requires system `tesseract` installation. See [pytesseract docs](https://github.com/madmaze/pytesseract) for setup.
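- **Chunking**: Token counts are approximate and depend on the tokenizer in use. Adjust `chunk_size_tokens` and `overlap_tokens` for your use case.
- **Embedding dimensions**: all-MiniLM-L6-v2 produces 384-dim vectors, OpenAI text-embedding-3-small produces 1536-dim vectors, and text-embedding-3-large produces 3072-dim vectors. Vectors of different sizes cannot share one Chroma collection, so re-ingest when switching models.
- **Chroma persistence**: Data is written to disk when `persist_directory` is set (DuckDB+Parquet in Chroma releases before 0.4, SQLite from 0.4 onward). Tests use ephemeral (in-memory) mode.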
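
Because embedding size depends on the model, a cheap sanity check before writing to the vector store can catch a provider or model mix-up early. This is a sketch under stated assumptions: `check_dimensions` and `KNOWN_DIMS` are hypothetical helpers, and the table only contains the two sizes listed in the notes above.

```python
from typing import Sequence

# Vector sizes quoted in the notes above.
KNOWN_DIMS = {"all-MiniLM-L6-v2": 384, "text-embedding-3-small": 1536}


def check_dimensions(embeddings: Sequence[Sequence[float]], model_name: str) -> int:
    """Verify all vectors share one size, matching the model when it is known."""
    sizes = {len(vector) for vector in embeddings}
    if len(sizes) != 1:
        raise ValueError(f"Expected exactly one vector size, got {sorted(sizes)}")
    (dim,) = sizes
    expected = KNOWN_DIMS.get(model_name)
    if expected is not None and dim != expected:
        raise ValueError(f"{model_name} should yield {expected}-dim vectors, got {dim}")
    return dim


fake_embeddings = [[0.0] * 384, [0.1] * 384]  # stand-in data, not real model output
print(check_dimensions(fake_embeddings, "all-MiniLM-L6-v2"))
```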

1. Install Python 3.10.11, create a virtual environment, and install dependencies:

```bash
# install Python 3.10.11 (use installer from python.org or your package manager)
python --version  # should report 3.10.11
python -m venv .venv
# macOS / Linux
source .venv/bin/activate
# Windows PowerShell
. .venv\Scripts\Activate.ps1
# then install dependencies
pip install -r requirements.txt
```

2. CLI usage examples (from repo root):

- Upload a text file (saves raw file and extracts text for .txt files):

```bash
python -m src.ingestion.cli upload --user alice --notebook nb1 --path tests/data/sample.txt
```

- Ingest an extracted source into Chroma (run after upload/url):

```bash
# Supply the source ID printed during upload, or omit --source-id to let the CLI create one
python -m src.ingestion.cli ingest --user alice --notebook nb1 --source-id <source-id>
```

3. Run tests:

```bash
pytest -q
```

Files of interest

- `src/ingestion/storage.py`: LocalStorageAdapter and storage layout.
- `src/ingestion/extractors.py`: TXT, PDF, PPTX, and URL extractors.
- `src/ingestion/chunker.py`: Token-aware chunker.
- `src/ingestion/embeddings.py`: Embedding adapter (local sentence-transformers by default; OpenAI and HuggingFace optional).
- `src/ingestion/vectorstore.py`: Chroma adapter.
- `src/ingestion/cli.py`: Simple CLI to exercise upload, url, and ingest flows.

Notes

- Default stack is local-first (no API keys required). If you enable OpenAI/HF embedding providers or cloud storage, set `OPENAI_API_KEY`, `HF_API_TOKEN`, or cloud credentials as appropriate.
- For PDFs that require OCR, install `tesseract` and the optional Python packages listed in the comment section of `requirements.txt`.

## Hugging Face Docker Space (Full Stack)

This repo now includes:
- `Dockerfile`
- `start_hf.sh` (starts FastAPI on `:8000` and Streamlit on `${PORT:-7860}`)
- `.dockerignore`

### Deploy Steps

1. Create or switch your Space to **Docker** SDK.
2. Push this repo to your Space (or use the GitHub Action sync workflow already in `.github/workflows/deploy-hf-space.yml`).
3. In Space variables/secrets, set at minimum:
   - `AUTH_MODE=dev` (or `hf_oauth`)
   - `APP_SESSION_SECRET=<strong-random-secret>`
   - `STORAGE_BASE_DIR=/data`
   - `OPENAI_API_KEY=<key>` (if using OpenAI features)
4. For HF OAuth mode, also set:
   - `HF_OAUTH_CLIENT_ID`
   - `HF_OAUTH_CLIENT_SECRET`
   - `HF_OAUTH_REDIRECT_URI` (Space URL registered in your HF Connected App, e.g. `https://<space>.hf.space/`)
   - `AUTH_SUCCESS_REDIRECT_URL` (your Space URL)

The container exposes Streamlit on port `7860` and points the frontend to the internal backend via `BACKEND_URL=http://127.0.0.1:8000`.
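
Before deploying, it can help to check that the variables listed in the deploy steps are present. The sketch below is a hypothetical preflight helper (`missing_space_settings` is not part of the repo) and encodes only the requirements stated above; `OPENAI_API_KEY` is left out because it is needed only when OpenAI features are used.

```python
from typing import Mapping

BASE_REQUIRED = ("APP_SESSION_SECRET", "STORAGE_BASE_DIR")
OAUTH_REQUIRED = (
    "HF_OAUTH_CLIENT_ID",
    "HF_OAUTH_CLIENT_SECRET",
    "HF_OAUTH_REDIRECT_URI",
    "AUTH_SUCCESS_REDIRECT_URL",
)


def missing_space_settings(env: Mapping[str, str]) -> list[str]:
    """Return the names of required Space variables that are unset or invalid."""
    missing = []
    mode = env.get("AUTH_MODE", "")
    if mode not in ("dev", "hf_oauth"):
        missing.append("AUTH_MODE")
    required = list(BASE_REQUIRED)
    if mode == "hf_oauth":
        required.extend(OAUTH_REQUIRED)  # OAuth mode needs the extra settings
    missing.extend(name for name in required if not env.get(name))
    return missing


example_env = {"AUTH_MODE": "hf_oauth", "APP_SESSION_SECRET": "change-me", "STORAGE_BASE_DIR": "/data"}
print(missing_space_settings(example_env))
```

Running such a check at container start, with `os.environ` as the argument, would surface a missing secret as one clear message instead of a later runtime failure.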