Ingestion Pipeline Specification
Summary
- Purpose: build the ingestion pipeline that extracts text from PDFs, PPTX, TXT, and URLs; chunks text; computes embeddings; and stores embeddings + metadata in a vector database with per-user and per-notebook isolation for RAG with citations.
Goals & Scope
- Support file and URL ingestion for per-user notebooks.
- Produce chunked, embedded documents with provenance metadata (source, page/slide, offsets).
- Provide an API/worker flow with status tracking and retries.
High-level Architecture
- Ingestion API: accept uploads/URLs, validate, create source record, enqueue ingestion job.
- Worker: Extract -> Preprocess -> Chunk -> Embed -> Upsert to vector DB -> Persist metadata.
- Storage: raw files in `data/raw/{user_id}/{notebook_id}/`, metadata in a lightweight DB (SQLite/JSON), embeddings in a vector DB (Chroma/FAISS).
Components
- Upload Handler: validates file type/size and writes raw file.
- Extractors: modular functions for each type (PDF, PPTX, TXT, URL).
- Preprocessing: normalize whitespace, remove boilerplate, preserve page/slide indices and char offsets.
- Chunker: token-aware sliding window with overlap; store chunk-level metadata.
- Embedder: pluggable adapter (local `sentence-transformers` or API-based embeddings).
- Vectorstore Adapter: pluggable (default Chroma). Upserts include chunk text + metadata.
- Metadata Store: tracks notebooks, sources, ingestion status, timestamps, and source enable/disable flags.
- Job Orchestrator: simple background queue with retries (Redis/RQ or asyncio-based worker).
Extractors (recommended implementations)
- PDF: `PyMuPDF` (fitz) or `pdfminer.six`; extract per-page text and char offsets.
- PPTX: `python-pptx`; extract slide text + speaker notes.
- TXT: read with encoding detection (`chardet`/`charset-normalizer`).
- URL: `requests` + `BeautifulSoup` with readability extraction; sanitize HTML.
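A minimal sketch of the extractor registry, assuming a dispatch-by-extension design (the function and variable names are illustrative, not from the spec). Only the TXT extractor is implemented here, using a stdlib encoding fallback as a stand-in for `chardet`/`charset-normalizer`; the PDF and PPTX entries would wrap `PyMuPDF` and `python-pptx`:

```python
from pathlib import Path


def extract_txt(path):
    """Read a text file, trying UTF-8 first and falling back to Latin-1.

    Stand-in for chardet/charset-normalizer detection; stdlib only.
    Returns a list of page dicts to match the per-page PDF/PPTX shape."""
    raw = Path(path).read_bytes()
    for encoding in ("utf-8", "latin-1"):
        try:
            return [{"page": 1, "text": raw.decode(encoding)}]
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")


# .pdf / .pptx would map to PyMuPDF / python-pptx wrappers with the same shape.
EXTRACTORS = {".txt": extract_txt}


def extract(path):
    """Dispatch to the extractor registered for the file's extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"unsupported file type: {suffix}")
    return EXTRACTORS[suffix](path)
```

Returning a uniform list of `{page, text}` dicts lets the downstream chunker treat all source types identically.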
Chunking
- Default: 500 tokens per chunk with 50-token overlap (configurable via `CHUNK_TOKENS`, `CHUNK_OVERLAP`).
- Alternative simple default: 2000 characters with 400-character overlap for token-agnostic environments.
- Store for each chunk: `chunk_id`, `source_id`, `char_start`, `char_end`, `page`, `text_preview`.
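The character-based fallback can be sketched as a sliding window that records offsets for citations (function name and chunk-ID scheme are illustrative assumptions):

```python
import hashlib


def chunk_text(text, source_id, chunk_size=2000, overlap=400, page=None):
    """Split text into overlapping character windows, keeping offsets.

    char_start/char_end allow citations to point back into the source."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(text), 1), step):
        end = min(start + chunk_size, len(text))
        body = text[start:end]
        chunks.append({
            # Deterministic id derived from source + offset, so re-ingestion upserts cleanly.
            "chunk_id": hashlib.sha1(f"{source_id}:{start}".encode()).hexdigest()[:12],
            "source_id": source_id,
            "char_start": start,
            "char_end": end,
            "page": page,
            "text_preview": body[:80],
        })
        if end == len(text):
            break
    return chunks
```

A token-aware version would follow the same structure but step in tokenizer tokens rather than characters.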
Embeddings
- Adapter interface supports local models (`sentence-transformers`/`all-MiniLM`) or API-based embeddings (OpenAI/HF).
- Configurable batch size (default 64) and rate limiting.
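One way to sketch the pluggable adapter interface, assuming a `Protocol`-based design (all names here are illustrative). The `HashEmbedder` is a deterministic test double, not a real model; a production adapter would wrap `sentence-transformers` or an embeddings API behind the same `embed` signature:

```python
import hashlib
import math
from typing import Protocol


class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Deterministic stand-in for a real model; useful in unit tests."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def embed(self, texts):
        out = []
        for text in texts:
            digest = hashlib.sha256(text.encode()).digest()
            vec = [digest[i % len(digest)] / 255.0 for i in range(self.dim)]
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            out.append([v / norm for v in vec])  # unit-normalised vector
        return out


def embed_in_batches(embedder: Embedder, texts: list[str], batch_size: int = 64):
    """Apply EMBED_BATCH_SIZE-style batching regardless of backend."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embedder.embed(texts[i:i + batch_size]))
    return vectors
```

Keeping batching outside the adapter means rate limiting and retries can wrap `embed_in_batches` without each backend reimplementing them.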
Vector DB
- Default: Chroma for HF Spaces compatibility; can swap to FAISS or Weaviate later.
- Use namespacing by `user_id` + `notebook_id`, or include those fields in metadata, for isolation.
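The namespacing scheme can be illustrated with an in-memory stand-in for the Chroma adapter (class and method names are assumptions; Chroma's own API differs). Each `user_id:notebook_id` pair gets its own collection, so queries can never cross tenant boundaries:

```python
import math
from collections import defaultdict


class InMemoryVectorStore:
    """Minimal stand-in for the Chroma adapter, showing per-user/notebook isolation."""

    def __init__(self):
        self._collections = defaultdict(list)

    @staticmethod
    def _namespace(user_id, notebook_id):
        return f"{user_id}:{notebook_id}"

    def upsert(self, user_id, notebook_id, chunk_id, vector, metadata):
        col = self._collections[self._namespace(user_id, notebook_id)]
        # Upsert semantics: replace any existing entry with the same chunk_id.
        col[:] = [item for item in col if item["chunk_id"] != chunk_id]
        col.append({"chunk_id": chunk_id, "vector": vector, "metadata": metadata})

    def query(self, user_id, notebook_id, vector, top_k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a)) or 1.0
            nb = math.sqrt(sum(y * y for y in b)) or 1.0
            return dot / (na * nb)

        col = self._collections[self._namespace(user_id, notebook_id)]
        return sorted(col, key=lambda item: cosine(item["vector"], vector), reverse=True)[:top_k]
```

The same interface maps onto Chroma collections or FAISS indexes later, which is the point of keeping the adapter pluggable.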
Metadata Schema (suggested)
- Notebook: `{ notebook_id, user_id, name, created_at, updated_at }`
- Source: `{ source_id, notebook_id, user_id, filename, url?, status, pages, size_bytes, created_at, error? }`
- Chunk: `{ chunk_id, source_id, notebook_id, user_id, char_start, char_end, page, text_preview }`
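The Source and Chunk records map naturally onto dataclasses. A sketch, assuming UTC ISO timestamps and a `pending -> processing -> done | failed` status lifecycle (the lifecycle values are an assumption, not specified above):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class Source:
    source_id: str
    notebook_id: str
    user_id: str
    filename: str
    url: Optional[str] = None          # set for URL ingestion, None for uploads
    status: str = "pending"            # assumed: pending -> processing -> done | failed
    pages: int = 0
    size_bytes: int = 0
    created_at: str = field(default_factory=_utc_now)
    error: Optional[str] = None        # populated only when status == "failed"


@dataclass
class Chunk:
    chunk_id: str
    source_id: str
    notebook_id: str
    user_id: str
    char_start: int
    char_end: int
    page: Optional[int] = None
    text_preview: str = ""
```

Carrying `notebook_id` and `user_id` on every chunk is what makes metadata-based isolation possible when the vector store is not namespaced.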
API Contract (ingest endpoints)
- `POST /ingest/upload` -> multipart file, `user_id`, `notebook_id`; returns `{ source_id, job_id }`.
- `POST /ingest/url` -> body `{ url, user_id, notebook_id }`; returns `{ source_id, job_id }`.
- `GET /ingest/status?job_id=...` -> returns status, and error if failed.
- `POST /ingest/enable_source` -> enable/disable a source for retrieval.
Citation Strategy
- Store `source_name`, `file_path` or `url`, and `page`/`slide` in each chunk's metadata.
- Retrieval returns top-k chunks with metadata; present citations like `[SourceName - page 3]`.
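Rendering that citation format from chunk metadata is a small pure function (the function name is illustrative):

```python
def format_citation(metadata: dict) -> str:
    """Render a retrieval hit's metadata as an inline citation string."""
    name = metadata.get("source_name", "Unknown source")
    if metadata.get("page") is not None:
        return f"[{name} - page {metadata['page']}]"
    if metadata.get("slide") is not None:
        return f"[{name} - slide {metadata['slide']}]"
    return f"[{name}]"  # URLs and plain TXT have no page/slide locator
```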
Job Orchestration
- Small files: synchronous ingestion (fast path).
- Large files: enqueue to background worker with retry/backoff and status updates.
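For the asyncio-based worker option, the retry/backoff loop can be sketched like this (function name and return shape are assumptions; statuses mirror the source lifecycle):

```python
import asyncio


async def run_with_retries(job, max_attempts=3, base_delay=0.1):
    """Run an ingestion job coroutine, retrying with exponential backoff.

    `job` is any zero-argument coroutine function. Returns a status dict
    the API's /ingest/status endpoint could serve."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "done", "result": await job()}
        except Exception as exc:
            if attempt == max_attempts:
                return {"status": "failed", "error": str(exc)}
            # 0.1s, 0.2s, 0.4s, ... between attempts
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

A Redis/RQ deployment would move this loop into the queue's retry configuration, but the status transitions stay the same.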
Security & Operational Considerations
- Sanitize HTML from URLs; validate file types and size limits (configurable, e.g., 100MB).
- Rate-limit embedding calls to control cost.
- Log ingestion events and basic metrics.
File Layout (suggested)
- `ingest/api.py`: upload + status endpoints
- `ingest/worker.py`: ingestion worker and job runner
- `ingest/extractors.py`: file/URL extraction functions
- `ingest/chunker.py`: token-aware chunking utilities
- `ingest/embedder.py`: embedding adapter
- `ingest/vectorstore.py`: vector DB adapter
- `ingest/metadata.py`: metadata store (SQLite/TinyDB)
- `data/raw/` and `data/meta/`: raw file and metadata directories
Config / Environment Variables (examples)
- `EMBEDDING_PROVIDER=local|openai|hf`
- `CHUNK_TOKENS=500`
- `CHUNK_OVERLAP=50`
- `EMBED_BATCH_SIZE=64`
- `VECTORSTORE=chroma|faiss|weaviate`
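Loading these with their documented defaults is straightforward; passing the environment mapping in explicitly (an illustrative design choice) keeps the loader testable:

```python
import os


def load_config(env=os.environ):
    """Read pipeline settings from environment variables with the documented defaults."""
    return {
        "embedding_provider": env.get("EMBEDDING_PROVIDER", "local"),
        "chunk_tokens": int(env.get("CHUNK_TOKENS", "500")),
        "chunk_overlap": int(env.get("CHUNK_OVERLAP", "50")),
        "embed_batch_size": int(env.get("EMBED_BATCH_SIZE", "64")),
        "vectorstore": env.get("VECTORSTORE", "chroma"),
    }
```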
Acceptance Criteria
- API accepts PDF/PPTX/TXT/URL and returns `source_id` + `job_id`.
- Worker extracts text with page/slide metadata and chunks according to config.
- Embeddings stored in vector DB with metadata linking back to source and offsets.
- Retrieval returns chunk text + metadata for citation.
- Per-user and per-notebook isolation enforced.
Testing
- Unit tests for extractors (use small sample files), chunker determinism, embedder adapter (mocked), and vectorstore upsert/retrieve (in-memory).
CI / Deployment
- Add GitHub Actions to run tests and push to Hugging Face Space on main branch updates.
Next Steps (immediate)
- Scaffold `ingest/` module files and a small runner script.
- Add unit tests and sample files for extractors and chunker.
- Implement Chroma adapter and local `sentence-transformers` embedding support.
If you'd like, I can now scaffold the ingest/ module files and tests. Also confirm preferred default embedding provider: sentence-transformers (local, free) or openai/hf (API-based).