# Ingestion Pipeline Specification
## Summary
- Purpose: build the ingestion pipeline that extracts text from PDFs, PPTX, TXT, and URLs; chunks text; computes embeddings; and stores embeddings + metadata in a vector database with per-user and per-notebook isolation for RAG with citations.
## Goals & Scope
- Support file and URL ingestion for per-user notebooks.
- Produce chunked, embedded documents with provenance metadata (source, page/slide, offsets).
- Provide an API/worker flow with status tracking and retries.
## High-level Architecture
- Ingestion API: Accept uploads/URLs, validate, create `source` record, enqueue ingestion job.
- Worker: Extract -> Preprocess -> Chunk -> Embed -> Upsert to vector DB -> Persist metadata.
- Storage: raw files in `data/raw/{user_id}/{notebook_id}/`, metadata in a lightweight DB (SQLite/JSON), embeddings in a vector DB (Chroma/FAISS).
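The worker flow above can be sketched as a chain of pluggable stages; the names below (`IngestContext`, `run_pipeline`) are illustrative, not a fixed API:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class IngestContext:
    """Carries intermediate results between pipeline stages."""
    source_id: str
    raw_path: str
    text: str = ""
    chunks: List[str] = field(default_factory=list)
    vectors: List[List[float]] = field(default_factory=list)

def run_pipeline(ctx: IngestContext,
                 stages: List[Callable[[IngestContext], None]]) -> IngestContext:
    """Run Extract -> Preprocess -> Chunk -> Embed -> Upsert in order.

    Each stage mutates the shared context, so a failed stage can be
    retried without re-running earlier ones.
    """
    for stage in stages:
        stage(ctx)
    return ctx
```

Keeping stages as plain callables makes the fast path (synchronous ingestion for small files) and the background-worker path share the same code.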
## Components
- Upload Handler: validates file type/size and writes raw file.
- Extractors: modular functions for each type (PDF, PPTX, TXT, URL).
- Preprocessing: normalize whitespace, remove boilerplate, preserve page/slide indices and char offsets.
- Chunker: token-aware sliding window with overlap; store chunk-level metadata.
- Embedder: pluggable adapter (local `sentence-transformers` or API-based embeddings).
- Vectorstore Adapter: pluggable (default Chroma). Upserts include chunk text + metadata.
- Metadata Store: tracks notebooks, sources, ingestion status, timestamps, and source enable/disable flags.
- Job Orchestrator: simple background queue with retries (Redis/RQ or asyncio-based worker).
## Extractors (recommended implementations)
- PDF: `PyMuPDF` (fitz) or `pdfminer.six`; extract per-page text and char offsets.
- PPTX: `python-pptx`; extract slide text and speaker notes.
- TXT: read with encoding detection (`chardet` / `charset-normalizer`).
- URL: `requests` + `BeautifulSoup` with readability extraction; sanitize HTML.
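For TXT files, a stdlib-only sketch illustrates the encoding-fallback idea; in practice `chardet` or `charset-normalizer` would detect the encoding, and the function name here is an assumption:

```python
def read_text_with_fallback(data: bytes,
                            encodings=("utf-8", "utf-16", "latin-1")) -> str:
    """Decode raw bytes by trying encodings in order.

    A minimal stand-in for proper detection via chardet/charset-normalizer.
    latin-1 accepts any byte sequence, so it acts as the last resort.
    """
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("undecodable input")
```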
## Chunking
- Default: 500 tokens per chunk with 50 token overlap (configurable via `CHUNK_TOKENS`, `CHUNK_OVERLAP`).
- Alternative simple default: 2000 characters with 400 char overlap for token-agnostic environments.
- Store for each chunk: `chunk_id`, `source_id`, `char_start`, `char_end`, `page`, `text_preview`.
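A minimal sketch of the token-agnostic character chunker, using the 2000/400 defaults above and emitting the per-chunk offset fields (`chunk_id`, `source_id`, and `page` omitted for brevity):

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 400):
    """Sliding-window chunker over characters with overlap.

    Returns dicts mirroring the chunk metadata fields: char_start,
    char_end, and a short text_preview.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({
            "char_start": start,
            "char_end": end,
            "text_preview": text[start:start + 80],
        })
        if end == len(text):
            break
        start = end - overlap  # step back by the overlap for the next window
    return chunks
```

A token-aware variant would follow the same loop but count tokens (e.g. via a tokenizer) instead of characters, honoring `CHUNK_TOKENS`/`CHUNK_OVERLAP`.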
## Embeddings
- Adapter interface supports local models (`sentence-transformers/all-MiniLM`) or API-based embeddings (OpenAI/HF).
- Configurable batch size (default 64) and rate limiting.
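The adapter interface might look like the sketch below; `Embedder` and `embed_batch` are hypothetical names, and batching follows the default of 64:

```python
from abc import ABC, abstractmethod
from typing import List

class Embedder(ABC):
    """Pluggable adapter; concrete subclasses would wrap
    sentence-transformers locally or an API-based provider."""

    def __init__(self, batch_size: int = 64):
        self.batch_size = batch_size

    @abstractmethod
    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        """Embed one batch; implemented per provider."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        """Split input into batches of batch_size; rate limiting for
        API providers would hook in between batches here."""
        out: List[List[float]] = []
        for i in range(0, len(texts), self.batch_size):
            out.extend(self.embed_batch(texts[i:i + self.batch_size]))
        return out
```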
## Vector DB
- Default: Chroma for HF Spaces compatibility; can swap to FAISS or Weaviate later.
- Use namespacing by `user_id` + `notebook_id` or include those fields in metadata for isolation.
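To illustrate metadata-based isolation, here is a toy in-memory store that filters on `user_id` and `notebook_id` at query time; it is a stand-in for the Chroma adapter, not Chroma's real API:

```python
import math

class InMemoryVectorStore:
    """Toy vector store: cosine similarity over a list, with per-user
    and per-notebook isolation enforced by a metadata filter."""

    def __init__(self):
        self._rows = []  # (vector, text, metadata)

    def upsert(self, vector, text, metadata):
        self._rows.append((vector, text, metadata))

    def query(self, vector, user_id, notebook_id, k=5):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        # Isolation: only rows matching both IDs are candidates.
        rows = [r for r in self._rows
                if r[2]["user_id"] == user_id
                and r[2]["notebook_id"] == notebook_id]
        rows.sort(key=lambda r: cos(vector, r[0]), reverse=True)
        return rows[:k]
```

With Chroma, the same effect comes from a `where` filter or per-notebook collections; the principle is identical.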
## Metadata Schema (suggested)
- Notebook: `{ notebook_id, user_id, name, created_at, updated_at }`
- Source: `{ source_id, notebook_id, user_id, filename, url?, status, pages, size_bytes, created_at, error? }`
- Chunk: `{ chunk_id, source_id, notebook_id, user_id, char_start, char_end, page, text_preview }`
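The `Source` record could be modeled as a dataclass; field names mirror the schema above, and the status values shown are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Source:
    """Suggested Source schema; url and error are optional fields."""
    source_id: str
    notebook_id: str
    user_id: str
    filename: str
    status: str            # e.g. "queued" | "processing" | "done" | "failed"
    size_bytes: int
    created_at: str        # ISO 8601 timestamp
    pages: int = 0
    url: Optional[str] = None
    error: Optional[str] = None
```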
## API Contract (ingest endpoints)
- `POST /ingest/upload` → multipart file, `user_id`, `notebook_id` → returns `{ source_id, job_id }`.
- `POST /ingest/url` → body `{ url, user_id, notebook_id }` → returns `{ source_id, job_id }`.
- `GET /ingest/status?job_id=...` → returns status and error if failed.
- `POST /ingest/enable_source` → enable/disable a source for retrieval.
## Citation Strategy
- Store `source_name`, `file_path` or `url`, and `page/slide` in each chunk's metadata.
- Retrieval returns top-k chunks with metadata; present citations like `[SourceName - page 3]`.
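A small helper (hypothetical name) shows how chunk metadata maps to the citation string:

```python
def format_citation(metadata: dict) -> str:
    """Render the citation shown next to a retrieved chunk.

    Expects the chunk metadata fields stored at ingestion time:
    source_name and, when available, a page/slide number.
    """
    loc = metadata.get("page")
    if loc is not None:
        return f"[{metadata['source_name']} - page {loc}]"
    return f"[{metadata['source_name']}]"
```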
## Job Orchestration
- Small files: synchronous ingestion (fast path).
- Large files: enqueue to background worker with retry/backoff and status updates.
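The retry/backoff behavior for background jobs can be sketched as follows; names and defaults are illustrative, and `sleep` is injectable so tests avoid real waits:

```python
import time

def run_with_retries(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run a job callable, retrying with exponential backoff on failure.

    Delays are base_delay * 2**(attempt-1); the last failure re-raises
    so the orchestrator can mark the source record as failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A Redis/RQ worker gets similar behavior from its built-in retry settings; this sketch is for the asyncio/in-process option.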
## Security & Operational Considerations
- Sanitize HTML from URLs; validate file types and size limits (configurable, e.g., 100MB).
- Rate-limit embedding calls to control cost.
- Log ingestion events and basic metrics.
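A sketch of the upload validation step, assuming the 100MB default limit mentioned above (constant and function names are illustrative):

```python
import os

ALLOWED_EXTENSIONS = {".pdf", ".pptx", ".txt"}
MAX_SIZE_BYTES = 100 * 1024 * 1024  # 100MB default; make configurable

def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject unsupported types and oversized files before writing
    anything to data/raw/. Raises ValueError on rejection."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        raise ValueError("file exceeds size limit")
```

Extension checks alone are not sufficient for untrusted input; pairing this with content sniffing (e.g. magic bytes) would be a reasonable hardening step.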
## File Layout (suggested)
- `ingest/`
  - `api.py` – upload + status endpoints
  - `worker.py` – ingestion worker and job runner
  - `extractors.py` – file/URL extraction functions
  - `chunker.py` – token-aware chunking utilities
  - `embedder.py` – embedding adapter
  - `vectorstore.py` – vector DB adapter
  - `metadata.py` – metadata store (SQLite/TinyDB)
- `data/raw/` and `data/meta/`
## Config / Environment Variables (examples)
- `EMBEDDING_PROVIDER=local|openai|hf`
- `CHUNK_TOKENS=500`
- `CHUNK_OVERLAP=50`
- `EMBED_BATCH_SIZE=64`
- `VECTORSTORE=chroma|faiss|weaviate`
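Reading these variables with their defaults might look like this sketch (function name is an assumption):

```python
import os

def load_config(env=os.environ):
    """Read pipeline settings from environment variables, falling back
    to the defaults listed above."""
    return {
        "embedding_provider": env.get("EMBEDDING_PROVIDER", "local"),
        "chunk_tokens": int(env.get("CHUNK_TOKENS", "500")),
        "chunk_overlap": int(env.get("CHUNK_OVERLAP", "50")),
        "embed_batch_size": int(env.get("EMBED_BATCH_SIZE", "64")),
        "vectorstore": env.get("VECTORSTORE", "chroma"),
    }
```

Passing `env` explicitly keeps the loader testable without mutating the process environment.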
## Acceptance Criteria
- API accepts PDF/PPTX/TXT/URL and returns `source_id` + `job_id`.
- Worker extracts text with page/slide metadata and chunks according to config.
- Embeddings stored in vector DB with metadata linking back to source and offsets.
- Retrieval returns chunk text + metadata for citation.
- Per-user and per-notebook isolation enforced.
## Testing
- Unit tests for extractors (use small sample files), chunker determinism, embedder adapter (mocked), and vectorstore upsert/retrieve (in-memory).
## CI / Deployment
- Add GitHub Actions to run tests and push to Hugging Face Space on main branch updates.
## Next Steps (immediate)
1. Scaffold `ingest/` module files and a small runner script.
2. Add unit tests and sample files for extractors and chunker.
3. Implement Chroma adapter and local `sentence-transformers` embedding support.
---
Open question: confirm the preferred default embedding provider: `sentence-transformers` (local, free) or API-based embeddings (OpenAI/HF).