github-actions[bot] committed on
Commit ca91c6e · 1 Parent(s): 68a9a37

Sync from GitHub f6751e9420b77f679eade7ba6322118a6ae93b9f

Files changed (2)
  1. DhakshiVSWorkspace.code-workspace +0 -7
  2. INTEGRATION.md +0 -419
DhakshiVSWorkspace.code-workspace DELETED
@@ -1,7 +0,0 @@
{
  "folders": [
    {
      "path": "."
    }
  ]
}
 
 
 
 
 
 
 
 
INTEGRATION.md DELETED
@@ -1,419 +0,0 @@
# Integration Guide: Ingestion Module

## Overview

The ingestion module (`src/ingestion/`) implements the **complete source ingestion pipeline** for the NotebookLM project. It handles:

- **Source extraction**: PDF, PPTX, TXT files, and web URLs
- **Text chunking**: Token-aware, sentence-based chunking
- **Embedding**: Local (offline) or cloud-based (OpenAI, HuggingFace) embeddings
- **Vector storage**: Persistent Chroma database per user/notebook
- **CLI interface**: For testing and direct integration

## Architecture

```
┌─────────────┐     ┌────────────┐     ┌──────────┐     ┌──────────┐
│   Upload    │────▶│  Extract   │────▶│  Chunk   │────▶│  Embed   │
│  File/URL   │     │   Text     │     │   Text   │     │   Text   │
└─────────────┘     └────────────┘     └──────────┘     └────┬─────┘
                                                             │
                                                             ▼
                                                        ┌──────────┐
                                                        │  Chroma  │
                                                        │ VectorDB │
                                                        └──────────┘
```

**Module Files:**
- `extractors.py`: Text extraction for multiple formats
- `chunker.py`: Token-aware chunking with NLTK
- `embeddings.py`: Embedding provider abstraction (local/OpenAI/HuggingFace)
- `vectorstore.py`: Chroma wrapper with per-user/notebook isolation
- `storage.py`: File system storage adapter
- `cli.py`: CLI interface for testing

## Storage Structure

Data is stored in a per-user, per-notebook structure:

```
data/
└── users/
    └── {username}/
        └── notebooks/
            └── {notebook-uuid}/
                ├── files_raw/          ← Original uploaded files
                │   └── {source-id}/
                │       └── (unknown)
                ├── files_extracted/    ← Plain text extracted from sources
                │   └── {source-id}/
                │       └── content.txt
                ├── chroma/             ← Vector database
                │   ├── chroma.sqlite3
                │   └── {collection-uuid}/
                │       ├── data_level0.bin
                │       ├── header.bin
                │       ├── length.bin
                │       └── link_lists.bin
                ├── chat/               ← (for the RAG chat module)
                │   └── messages.jsonl
                └── artifacts/          ← (for the artifact generation module)
                    ├── reports/
                    ├── quizzes/
                    └── podcasts/
```
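
The layout above can be navigated programmatically. A minimal sketch, assuming the structure shown (the `notebook_paths` helper is hypothetical, not part of the module):

```python
from pathlib import Path

def notebook_paths(data_dir: str, user_id: str, notebook_id: str) -> dict:
    """Map the per-user/per-notebook layout shown above to concrete paths."""
    nb = Path(data_dir) / "users" / user_id / "notebooks" / notebook_id
    return {
        "raw": nb / "files_raw",
        "extracted": nb / "files_extracted",
        "chroma": nb / "chroma",
        "chat": nb / "chat",
        "artifacts": nb / "artifacts",
    }

paths = notebook_paths("data", "alice", "notebook-123")
```

For example, `paths["chroma"]` is the directory you would hand to `ChromaAdapter` as its `persist_directory`.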

## Core APIs

### 1. Ingest Sources (CLI)

**Option A: Upload a file with auto-ingest (recommended for the UI)**
```bash
python -m src.ingestion.cli upload \
  --user alice \
  --notebook notebook-123 \
  --path /path/to/document.pdf \
  --auto-ingest
```

**Option B: Upload + extract, then ingest separately**
```bash
# Step 1: Upload and extract
python -m src.ingestion.cli upload \
  --user alice \
  --notebook notebook-123 \
  --path /path/to/document.pdf

# Step 2: Ingest the extracted content
python -m src.ingestion.cli ingest \
  --user alice \
  --notebook notebook-123 \
  --source-id <source-id-from-step-1>
```

**Option C: Ingest from a URL with auto-ingest**
```bash
python -m src.ingestion.cli url \
  --user alice \
  --notebook notebook-123 \
  --url https://example.com/article \
  --auto-ingest
```

### 2. Query the Vector Database (RAG)

**Import and use ChromaAdapter directly:**

```python
from src.ingestion.vectorstore import ChromaAdapter

# Initialize the adapter
user_id = "alice"
notebook_id = "notebook-123"
chroma_dir = f"data/users/{user_id}/notebooks/{notebook_id}/chroma"

store = ChromaAdapter(persist_directory=chroma_dir)

# Query for similar chunks
query_text = "What is machine learning?"
top_k = 5
results = store.query(user_id, notebook_id, query_text, top_k=top_k)

# Process the results
for chunk_id, distance, chunk_data in results:
    source_id = chunk_data["metadata"]["source_id"]
    text = chunk_data["document"]
    print(f"Source: {source_id}")
    print(f"Distance: {distance}")
    print(f"Text: {text}")
    print("---")
```

**`ChromaAdapter.query()` returns:**
```python
List[Tuple[str, float, Dict[str, Any]]]
# (chunk_id, distance_score, {
#     "document": str,            # Actual text chunk
#     "metadata": {
#         "source_id": str,       # Which source this came from
#         "page": int | None,     # Page number (PDFs)
#         "text_preview": str,    # First 100 chars
#         "char_start": int,      # Position in the original text
#         "char_end": int         # Position in the original text
#     }
# })
```

### 3. Pythonic Integration (Programmatic)

If you need to ingest without using the CLI:

```python
import uuid

from src.ingestion.storage import LocalStorageAdapter
from src.ingestion.extractors import extract_text_from_pdf
from src.ingestion.chunker import chunk_text
from src.ingestion.embeddings import EmbeddingAdapter
from src.ingestion.vectorstore import ChromaAdapter

user_id = "alice"
notebook_id = "nb-123"
file_path = "path/to/document.pdf"

# Step 1: Extract
adapter = LocalStorageAdapter()
source_id = str(uuid.uuid4())
result = extract_text_from_pdf(file_path, use_ocr=False)
text = result["text"]

# Step 2: Save raw and extracted
adapter.save_raw_file(user_id, notebook_id, source_id, file_path)
adapter.save_extracted_text(user_id, notebook_id, source_id, "content", text)

# Step 3: Chunk
chunks = chunk_text(text, model_name="sentence-transformers/all-MiniLM-L6-v2")
for c in chunks:
    c["source_id"] = source_id
    c["page"] = None

# Step 4: Embed (using a provider)
embedder = EmbeddingAdapter(
    model_name="all-MiniLM-L6-v2",
    provider="local"  # or "openai", "huggingface"
)
texts = [c["text"] for c in chunks]
embeddings = embedder.embed_texts(texts, batch_size=32)

# Step 5: Store in Chroma
nb = adapter.ensure_notebook(user_id, notebook_id)
chroma_dir = str((nb / "chroma").resolve())
store = ChromaAdapter(persist_directory=chroma_dir)
store.upsert_chunks(user_id, notebook_id, chunks, embeddings)

print(f"✓ Ingested {len(chunks)} chunks")
```

## Configuration

### Environment Variables

Create a `.env` file in the project root (see `.env.example`):

```bash
# Embedding provider: "local" (default), "openai", or "huggingface"
EMBEDDING_PROVIDER=local

# OpenAI (if provider=openai)
OPENAI_API_KEY=sk-...

# HuggingFace (if provider=huggingface)
HF_API_TOKEN=hf_...

# Model names (optional, defaults shown)
EMBEDDING_MODEL=all-MiniLM-L6-v2
CHUNK_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Storage
DATA_DIR=data/
```
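
A small sanity check on these variables can catch misconfiguration early. A stdlib-only sketch mirroring the defaults above (`embedding_config` is a hypothetical helper, not part of the module):

```python
import os

def embedding_config(env: dict) -> dict:
    """Validate the provider/key pairing described above (hypothetical helper)."""
    provider = env.get("EMBEDDING_PROVIDER", "local")
    if provider == "openai" and not env.get("OPENAI_API_KEY"):
        raise ValueError("EMBEDDING_PROVIDER=openai requires OPENAI_API_KEY")
    if provider == "huggingface" and not env.get("HF_API_TOKEN"):
        raise ValueError("EMBEDDING_PROVIDER=huggingface requires HF_API_TOKEN")
    return {
        "provider": provider,
        "model": env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        "data_dir": env.get("DATA_DIR", "data/"),
    }

cfg = embedding_config(dict(os.environ))
```

Calling this once at startup turns a missing API key into an immediate, explicit error instead of a failure mid-ingestion.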

### Embedding Providers

**Local (recommended for the MVP)**
```bash
EMBEDDING_PROVIDER=local
# Downloads sentence-transformers/all-MiniLM-L6-v2 (~91MB) on first run
# No API keys required; works offline
```

**OpenAI**
```bash
EMBEDDING_PROVIDER=openai
OPENAI_API_KEY=sk-...
EMBEDDING_MODEL=text-embedding-3-small  # or text-embedding-3-large
```

**HuggingFace**
```bash
EMBEDDING_PROVIDER=huggingface
HF_API_TOKEN=hf_...
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
```

## For the RAG Chat Module

Once sources are ingested, the RAG chat module can:

1. **Accept a user query** from the frontend
2. **Retrieve context** using `ChromaAdapter.query(user, notebook, query_text, top_k=5)`
3. **Build a prompt** from the retrieved chunks plus the user query
4. **Call an LLM** (GPT-4, Claude, etc.)
5. **Return the answer** with source citations (using `source_id` + `text_preview`)

### Example RAG Flow

```python
from src.ingestion.vectorstore import ChromaAdapter

def rag_chat(user_id: str, notebook_id: str, query: str):
    """Retrieval-augmented chat with citations."""

    # 1. Get the chroma dir for this user/notebook
    chroma_dir = f"data/users/{user_id}/notebooks/{notebook_id}/chroma"
    store = ChromaAdapter(persist_directory=chroma_dir)

    # 2. Retrieve the top-5 relevant chunks
    results = store.query(user_id, notebook_id, query, top_k=5)

    # 3. Format context with citations
    context = ""
    citations = []
    for i, (chunk_id, distance, chunk_data) in enumerate(results, 1):
        text = chunk_data["document"]
        source_id = chunk_data["metadata"]["source_id"]
        context += f"[{i}] {text}\n\n"
        citations.append({
            "id": i,
            "source_id": source_id,
            "preview": chunk_data["metadata"]["text_preview"],
        })

    # 4. Build the prompt (with your system prompt)
    prompt = f"""
Context from uploaded sources:
{context}

User question: {query}

Answer based on the context above. If the answer is not in the context, say so.
"""

    # 5. Call the LLM (e.g., OpenAI, Claude)
    response = call_llm(prompt)  # Your LLM integration

    # 6. Return the answer with citations
    return {
        "answer": response,
        "citations": citations,
    }
```

## Testing

All modules are unit- and integration-tested:

```bash
# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_integration.py -v

# Run a specific test
pytest tests/test_integration.py::test_txt_upload_extract_ingest -v
```

Current test coverage:
- ✅ Text extraction (TXT, PDF, PPTX, URL)
- ✅ Chunking with NLTK
- ✅ Embedding (local + provider switching)
- ✅ Chroma isolation by user/notebook
- ✅ End-to-end ingestion pipeline

## Troubleshooting

### "No text extracted"
- **PDF**: Scanned images without OCR → add the `--ocr` flag (requires pytesseract)
- **URL**: Retry with a User-Agent header (handled automatically) or check network access

### "Chroma collection not found"
- Check that the data folder exists and has the correct user/notebook structure
- Try re-ingesting the sources or check the chroma_dir path
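
Before re-ingesting, it can help to confirm whether the Chroma directory was ever initialized. A stdlib-only check based on the storage tree shown earlier (the helper name is hypothetical):

```python
from pathlib import Path

def chroma_looks_initialized(chroma_dir: str) -> bool:
    """True if the directory contains Chroma's sqlite file (see the storage tree)."""
    return (Path(chroma_dir) / "chroma.sqlite3").is_file()
```

If this returns False for the path you are passing to `ChromaAdapter`, the problem is the path or a missed ingestion, not the query code.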

### "Embedding provider error"
- **Missing package**: `pip install openai` or `pip install huggingface_hub`
- **Missing API key**: check the `.env` file for `OPENAI_API_KEY` or `HF_API_TOKEN`

### "NLTK punkt tokenizer not found"
- Downloaded automatically on first run
- If that fails: `python -c "import nltk; nltk.download('punkt_tab')"`

## Development Notes

### Extending Extractors

Add support for a new format in `src/ingestion/extractors.py`:

```python
def extract_text_from_docx(file_path: Path) -> Dict[str, str]:
    """Extract text from DOCX files."""
    from docx import Document  # requires python-docx
    doc = Document(file_path)
    text = "\n".join(p.text for p in doc.paragraphs)
    return {"text": text}

# Register in cli.py's _EXTRACTORS
_EXTRACTORS = {
    ".txt": extract_text_from_txt,
    ".pdf": extract_text_from_pdf,
    ".pptx": extract_text_from_pptx,
    ".docx": extract_text_from_docx,  # NEW
}
```

### Custom Embedding Models

Pass any HuggingFace sentence-transformers model name:

```bash
python -m src.ingestion.cli ingest \
  --user alice \
  --notebook nb1 \
  --source-id abc123 \
  --embedding-model all-mpnet-base-v2  # A different model
```

### Batch Processing

For bulk ingestion:

```bash
# Process multiple files
for file in documents/*.pdf; do
  python -m src.ingestion.cli upload \
    --user alice \
    --notebook batch-$(date +%s) \
    --path "$file" \
    --auto-ingest
done
```

## Next Steps for Teammates

### Frontend Team
- Invoke the CLI from a Gradio callback: `subprocess.run(["python", "-m", "src.ingestion.cli", ...])`
- Display upload progress using the progress-bar output
- Parse source_ids from the CLI output for metadata storage
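
The subprocess call above could be wrapped like this; a sketch only, using the CLI flags documented in this guide (source-id parsing depends on the CLI's actual output format, so it is left to the caller):

```python
import subprocess
import sys

def build_upload_cmd(user: str, notebook: str, path: str) -> list:
    """argv for the upload command documented in this guide."""
    return [
        sys.executable, "-m", "src.ingestion.cli", "upload",
        "--user", user,
        "--notebook", notebook,
        "--path", path,
        "--auto-ingest",
    ]

def upload_source(user: str, notebook: str, path: str) -> str:
    """Blocking call; a real UI would stream output to drive a progress bar."""
    result = subprocess.run(
        build_upload_cmd(user, notebook, path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Using `sys.executable` instead of a bare `"python"` keeps the subprocess in the same interpreter/venv as the Gradio app.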
397
-
398
- ### RAG Chat Team
399
- - Use `ChromaAdapter.query()` to retrieve context
400
- - Implement prompt engineering with citations
401
- - Integrate with LLM (OpenAI, Claude, etc.)
402
-
403
- ### Artifact Generation Team
404
- - Query chunks for context using `ChromaAdapter`
405
- - Generate reports/quizzes using retrieved sources
406
- - Save outputs to `artifacts/{reports,quizzes,podcasts}/`
407
-
408
- ### Deployment Team
409
- - Ensure `data/` directory is persistent (not ephemeral)
410
- - Set `EMBEDDING_PROVIDER=local` for HF Spaces (no API costs)
411
- - Pre-download models: `python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"`
412
-
413
- ## Contact & Debugging
414
-
415
- If integration issues arise:
416
- 1. Check `README.md` for dependency installation
417
- 2. Run `pytest tests/ -v` to verify module health
418
- 3. Check `.env` file and required API keys
419
- 4. Review storage folder structure: `ls -R data/users/`