---
title: NotebookLMClone
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
---

# Ingestion Module: NotebookLM Clone (MVP)

This repository contains the ingestion module for a NotebookLM-style project. The ingestion pipeline extracts text from multiple source types, chunks text intelligently, computes embeddings (with provider flexibility), and stores vectors in Chroma for later RAG use.

## Features

- **Multi-format source extraction**: TXT, PDF (with optional OCR via pytesseract), PPTX, and URLs
- **Token-aware intelligent chunking**: Sentence-based splitting with configurable overlap and token limits
- **Flexible embedding providers**: Switch between local (sentence-transformers), OpenAI, and HuggingFace APIs via env vars
- **Local-first by default**: Runs fully offline with no API keys required
- **Structured storage**: File-based raw/extracted organization + Chroma vectors with user/notebook isolation
- **CLI interface**: Simple commands for upload, URL extraction, and end-to-end ingestion
- **Comprehensive testing**: Unit tests + integration tests covering the full pipeline


## Quick Start

### 1. Install Dependencies

```bash
# Create and activate virtual environment
python -m venv .venv
# Windows PowerShell:
. .venv\Scripts\Activate.ps1
# macOS/Linux:
# source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Configure Embedding Provider (Optional)

Copy `.env.example` to `.env` and set your preferred provider:

```bash
cp .env.example .env
# Edit .env to choose provider: "local" (default), "openai", or "huggingface"
```

- **Local** (default): Uses sentence-transformers (offline, no API key)
- **OpenAI**: Set `OPENAI_API_KEY` (requires active OpenAI account)
- **HuggingFace**: Set `HF_API_TOKEN` (requires HF account)

### 3. CLI Usage Examples

#### Upload and extract a text file:
```bash
python -m src.ingestion.cli upload --user alice --notebook nb1 --path tests/data/sample.txt
```

#### Extract from a URL:
```bash
python -m src.ingestion.cli url --user alice --notebook nb1 --url https://example.com/article
```

#### Ingest into Chroma (chunk, embed, store):
```bash
python -m src.ingestion.cli ingest --user alice --notebook nb1 --source-id <source-id>
```

#### Ingest with custom embedding provider:
```bash
python -m src.ingestion.cli ingest --user alice --notebook nb1 \
  --source-id <source-id> \
  --embedding-provider openai \
  --embedding-model text-embedding-3-large
```

### 4. Run Tests

```bash
pytest -v   # Verbose
pytest -q   # Quiet
```


## Supported File Types

| Format | Extractor | Notes |
|--------|-----------|-------|
| `.txt` | `extract_text_from_txt()` | UTF-8 or latin-1 encoding |
| `.pdf` | `extract_text_from_pdf()` | Optional OCR with `--ocr` flag (requires pytesseract) |
| `.pptx` | `extract_text_from_pptx()` | Extracts text from all slides |
| URL | `extract_text_from_url()` | Fetches & uses readability for main content |


## Architecture

### Core Modules

- **`src/ingestion/storage.py`**: LocalStorageAdapter for file organization (raw/extracted/chunks)
- **`src/ingestion/extractors.py`**: Multi-format text extraction (TXT, PDF, PPTX, URL)
- **`src/ingestion/chunker.py`**: Token-aware intelligent chunking with NLTK & transformers
- **`src/ingestion/embeddings.py`**: Provider-switching embedding adapter (local/OpenAI/HF)
- **`src/ingestion/vectorstore.py`**: ChromaDB wrapper with user/notebook isolation
- **`src/ingestion/cli.py`**: Full-featured CLI for upload, URL, and ingest operations
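
To illustrate what sentence-based, token-aware chunking with overlap can look like, here is a minimal sketch. It is not the code in `chunker.py`: the function names are hypothetical, sentences are split with a simple regex rather than NLTK, and tokens are approximated by whitespace-separated words rather than a transformers tokenizer.

```python
import re
from typing import List


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def sketch_chunk_text(text: str, chunk_size_tokens: int = 500,
                      overlap_tokens: int = 50) -> List[str]:
    """Group whole sentences into chunks, carrying trailing sentences forward."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_tokens + n > chunk_size_tokens:
            chunks.append(" ".join(current))
            # Carry over trailing sentences worth at most overlap_tokens.
            carried: List[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                t = count_tokens(prev)
                if carried_tokens + t > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_tokens += t
            current, current_tokens = carried, carried_tokens
        # Sentences are never split, so a very long one may exceed the limit.
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Keeping sentences whole trades strict size limits for readable chunk boundaries; a production chunker would likely also split oversized sentences.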

### Storage Layout

```
data/
  users/
    <user_id>/
      notebooks/
        <notebook_id>/
          files_raw/              # Original file uploads
          files_extracted/        # Extracted text
          chroma/                 # Persistent Chroma data
```
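
The layout above can be expressed as a small path helper. This is a sketch only: `notebook_dirs` and `ensure_notebook_dirs` are hypothetical names, not methods of `LocalStorageAdapter`.

```python
from pathlib import Path
from typing import Dict


def notebook_dirs(base_dir: str, user_id: str, notebook_id: str) -> Dict[str, Path]:
    """Return the per-notebook directories from the layout (hypothetical helper)."""
    root = Path(base_dir) / "users" / user_id / "notebooks" / notebook_id
    return {
        "raw": root / "files_raw",              # original file uploads
        "extracted": root / "files_extracted",  # extracted text
        "chroma": root / "chroma",              # persistent Chroma data
    }


def ensure_notebook_dirs(base_dir: str, user_id: str, notebook_id: str) -> Dict[str, Path]:
    """Create the directories if they do not exist yet and return them."""
    dirs = notebook_dirs(base_dir, user_id, notebook_id)
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

Nesting every path under `<user_id>/notebooks/<notebook_id>` is what keeps one notebook's files and vectors separate from another's.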


## Configuration & Environment Variables

See `.env.example` for all options:

```bash
# Embedding configuration
EMBEDDING_PROVIDER=local              # [local|openai|huggingface]
EMBEDDING_MODEL=all-MiniLM-L6-v2     # Model identifier
OPENAI_API_KEY=sk-...                 # For OpenAI provider
HF_API_TOKEN=hf_...                   # For HuggingFace provider

# Storage configuration
STORAGE_BASE_DIR=./data               # Base directory for file storage
CHROMA_PERSIST_DIR=./chroma_data      # Chroma persistence (optional)
```
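
As a rough illustration of how these variables might be consumed, the sketch below reads them into a settings object and checks that the key for the selected provider is present. `IngestionSettings` and `load_settings` are hypothetical names, and the fallback values are assumptions taken from the sample values above, not confirmed defaults of the module.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Credential each remote provider needs, per the provider list in this README.
REQUIRED_KEY = {"openai": "OPENAI_API_KEY", "huggingface": "HF_API_TOKEN"}


@dataclass(frozen=True)
class IngestionSettings:
    embedding_provider: str
    embedding_model: str
    storage_base_dir: str
    chroma_persist_dir: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> IngestionSettings:
    """Build settings from environment variables (hypothetical helper)."""
    provider = env.get("EMBEDDING_PROVIDER", "local").lower()
    if provider not in ("local", "openai", "huggingface"):
        raise ValueError(f"Unknown EMBEDDING_PROVIDER: {provider!r}")
    key = REQUIRED_KEY.get(provider)
    if key and not env.get(key):
        raise ValueError(f"{key} must be set for provider {provider!r}")
    return IngestionSettings(
        embedding_provider=provider,
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        storage_base_dir=env.get("STORAGE_BASE_DIR", "./data"),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR") or None,
    )
```

Failing early on a missing key gives a clearer error than a rejected API call halfway through ingestion.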


## Optional Dependencies

For enhanced functionality, install optional packages:

```bash
# PDF with OCR (requires system tesseract installation)
pip install pytesseract pillow pdf2image

# LangChain integration (future)
pip install langchain

# Additional models
pip install openai tiktoken
```


## Testing

```bash
# Run all tests
pytest -v

# Run specific test module
pytest tests/test_storage_and_chunker.py -v

# Run integration tests only
pytest tests/test_integration.py -v

# Check coverage
pytest --cov=src tests/
```


## API Examples

### Python API

```python
from src.ingestion.extractors import extract_text_from_txt, extract_text_from_pdf
from src.ingestion.chunker import chunk_text
from src.ingestion.embeddings import EmbeddingAdapter
from src.ingestion.vectorstore import ChromaAdapter

# Extract text
result = extract_text_from_txt("path/to/file.txt")
text = result["text"]

# Chunk
chunks = chunk_text(text, chunk_size_tokens=500, overlap_tokens=50)

# Embed (with provider switching)
embedder = EmbeddingAdapter(provider="local", model_name="all-MiniLM-L6-v2")
embeddings = embedder.embed_texts([c["text"] for c in chunks])

# Store in Chroma
store = ChromaAdapter(persist_directory="./data/chroma")
store.upsert_chunks("alice", "notebook1", chunks, embeddings)
```


## Notes

- **Default stack is local-first**: No API keys required. All processing happens offline using sentence-transformers.
- **PDF OCR**: Requires system `tesseract` installation. See [pytesseract docs](https://github.com/madmaze/pytesseract) for setup.
- **Chunking**: Token counts approximate document length. Adjust `chunk_size_tokens` and `overlap_tokens` for your use case.
- **Embedding dimensions**: all-MiniLM-L6-v2 produces 384-dim vectors. OpenAI text-embedding-3-small produces 1536-dim.
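- **Chroma persistence**: Data is written to disk when `persist_directory` is set (DuckDB+Parquet in Chroma releases before 0.4, SQLite from 0.4 onward). Without it, Chroma runs in ephemeral (in-memory) mode, which suits testing.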
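
Because vectors from different models have different sizes, switching providers for an existing notebook can produce embeddings that no longer match the ones already stored. The following sketch shows a simple pre-storage check; `check_embedding_dimensions` is a hypothetical helper, and the table holds only the two dimensions quoted above.

```python
from typing import Optional, Sequence

# Dimensions quoted in the notes above; other models are treated as unknown.
KNOWN_DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "text-embedding-3-small": 1536,
}


def check_embedding_dimensions(model_name: str,
                               embeddings: Sequence[Sequence[float]]) -> Optional[int]:
    """Verify all vectors share one size and return it (hypothetical helper)."""
    expected = KNOWN_DIMENSIONS.get(model_name)
    for index, vector in enumerate(embeddings):
        if expected is None:
            # Unknown model: the first vector defines the expected size.
            expected = len(vector)
        if len(vector) != expected:
            raise ValueError(
                f"Vector {index} has {len(vector)} dimensions, expected {expected}"
            )
    return expected
```

Running a check like this before storing chunks surfaces a model mismatch as a clear error instead of a failure inside the vector store.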

1. Install Python 3.10 (the commands below assume 3.10.11), create a virtual environment, and install dependencies:

```bash
# install Python 3.10.11 (use installer from python.org or your package manager)
python --version  # should report 3.10.11
python -m venv .venv
# macOS / Linux
source .venv/bin/activate
# Windows PowerShell
. .venv\Scripts\Activate.ps1
# then install dependencies
pip install -r requirements.txt
```

2. CLI usage examples (from repo root):

- Upload a text file (saves raw file and extracts text for .txt files):

```bash
python -m src.ingestion.cli upload --user alice --notebook nb1 --path tests/data/sample.txt
```

- Ingest an extracted source into Chroma (run after upload/url):

```bash
# supply the source-id printed during upload or omit to let CLI create one
python -m src.ingestion.cli ingest --user alice --notebook nb1 --source-id <source_id>
```

3. Run tests:

```bash
pytest -q
```

Files of interest

- `src/ingestion/storage.py`: LocalStorageAdapter and storage layout.
- `src/ingestion/extractors.py`: TXT, PDF, PPTX, and URL extractors.
- `src/ingestion/chunker.py`: Token-aware chunker.
- `src/ingestion/embeddings.py`: Embedding adapter (local sentence-transformers by default; OpenAI and HuggingFace optional).
- `src/ingestion/vectorstore.py`: Chroma adapter.
- `src/ingestion/cli.py`: Simple CLI to exercise upload, url, and ingest flows.

Notes

- Default stack is local-first (no API keys required). If you enable OpenAI/HF embedding providers or cloud storage, set `OPENAI_API_KEY`, `HF_API_TOKEN`, or cloud credentials as appropriate.
- For large PDFs requiring OCR, install `tesseract` and the optional Python packages listed in the comment section of `requirements.txt`.

## Hugging Face Docker Space (Full Stack)

This repo now includes:
- `Dockerfile`
- `start_hf.sh` (starts FastAPI on `:8000` and Streamlit on `${PORT:-7860}`)
- `.dockerignore`

### Deploy Steps

1. Create or switch your Space to **Docker** SDK.
2. Push this repo to your Space (or use the GitHub Action sync workflow already in `.github/workflows/deploy-hf-space.yml`).
3. In Space variables/secrets, set at minimum:
   - `AUTH_MODE=dev` (or `hf_oauth`)
   - `APP_SESSION_SECRET=<strong-random-secret>`
   - `STORAGE_BASE_DIR=/data`
   - `OPENAI_API_KEY=<key>` (if using OpenAI features)
4. For HF OAuth mode, also set:
   - `HF_OAUTH_CLIENT_ID`
   - `HF_OAUTH_CLIENT_SECRET`
   - `HF_OAUTH_REDIRECT_URI` (Space URL registered in your HF Connected App, e.g. `https://<space>.hf.space/`)
   - `AUTH_SUCCESS_REDIRECT_URL` (your Space URL)

The container exposes Streamlit on port `7860` and points the frontend to the internal backend via `BACKEND_URL=http://127.0.0.1:8000`.
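
For illustration, frontend code could resolve these values from the environment as sketched below. The helper names and the `/health` path in the usage note are hypothetical; only `BACKEND_URL`, `PORT`, and the default values come from the description above.

```python
import os
from typing import Mapping


def resolve_runtime(env: Mapping[str, str] = os.environ) -> dict:
    """Read the backend URL and Streamlit port, using the defaults named above."""
    return {
        "backend_url": env.get("BACKEND_URL", "http://127.0.0.1:8000").rstrip("/"),
        "streamlit_port": int(env.get("PORT", "7860")),
    }


def backend_endpoint(path: str, env: Mapping[str, str] = os.environ) -> str:
    """Join a request path onto the backend base URL (hypothetical helper)."""
    return resolve_runtime(env)["backend_url"] + "/" + path.lstrip("/")
```

With no variables set, `backend_endpoint("/health")` resolves to `http://127.0.0.1:8000/health`, which stays inside the container because only the Streamlit port is exposed.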