RAG PoC — Ingestion

This folder contains the Day-3 ingestion pipeline components.

Files:

  • load_docs.py : Markdown loader -> returns cleaned text + metadata
  • chunker.py : Deterministic whitespace chunker (chunk size given in approximate tokens, converted to characters)
  • test_ingestion.py : End-to-end loader -> chunker smoke test
  • embeddings.py : Offline deterministic pseudo-embedding stub (provider="local")
  • save_embeddings.py : Persist chunk embeddings to data/embeddings.jsonl
  • search_local.py : Local cosine-similarity retrieval against embeddings.jsonl
  • data/embeddings.jsonl : Generated embeddings (JSONL)
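
To make the chunking step concrete, here is a minimal sketch of a deterministic whitespace chunker. It is not the code in chunker.py: the function name chunk_text, its parameters, and the 4-characters-per-token ratio are assumptions made for illustration.

```python
def chunk_text(text: str, max_tokens: int = 200, chars_per_token: int = 4) -> list[str]:
    """Split text on whitespace into chunks of roughly max_tokens tokens.

    The token budget is approximated as max_tokens * chars_per_token characters
    (the 4:1 ratio is an assumed rule of thumb, not a value from this repo).
    """
    max_chars = max_tokens * chars_per_token
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for word in text.split():
        # Count one extra character for the joining space when the chunk is non-empty.
        added = len(word) + (1 if current else 0)
        if current and current_len + added > max_chars:
            # The budget would be exceeded: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            # Note: a single word longer than max_chars still becomes its own chunk.
            current.append(word)
            current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the split depends only on the input text and the two size parameters, the same document always yields the same chunks, which keeps chunk IDs stable between runs.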

Quick run (from RAG-document-assistant/ingestion):

  1. Activate venv: source ~/aienv/bin/activate

  2. Load & summarize docs: python3 load_docs.py /full/path/to/your/markdown_folder

  3. End-to-end ingestion test: python3 test_ingestion.py /full/path/to/your/markdown_folder

  4. Generate & save embeddings: python3 save_embeddings.py /full/path/to/your/markdown_folder local 64

  5. Search locally: python3 search_local.py data/embeddings.jsonl "your query" 3 64
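
The retrieval in step 5 can be sketched as a brute-force cosine-similarity ranking over the JSONL records. This is an illustration, not the contents of search_local.py: the helper names (load_jsonl, cosine, top_k) and the record layout with an "embedding" key are assumptions, so check them against the actual file format.

```python
import json
import math


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], records: list[dict], k: int = 3) -> list[tuple[float, dict]]:
    """Score every record against the query and return the k best (score, record) pairs."""
    # Assumed schema: each record stores its vector under the "embedding" key.
    scored = [(cosine(query_vec, rec["embedding"]), rec) for rec in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A linear scan like this is adequate for a proof of concept with a small embeddings file. The query must be embedded with the same provider and dimension used when the file was generated, otherwise the scores are not comparable.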

Notes:

  • Replace /full/path/to/your/markdown_folder with your real path (e.g. /home/vn6295337/RAG-document-assistant/sample_docs).
  • This pipeline uses a local pseudo-embedding for offline testing. Replace the provider branches in embeddings.py when you are ready to use real embedding APIs.
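
As an illustration of the last note, a deterministic pseudo-embedding with a provider branch might look like the sketch below. The names pseudo_embed and embed, the SHA-256 hashing scheme, and the default dimension of 64 are assumptions chosen for this example, not the implementation in embeddings.py.

```python
import hashlib
import math


def pseudo_embed(text: str, dim: int = 64) -> list[float]:
    """Map text to a fixed, unit-length vector derived from SHA-256 digests."""
    values: list[float] = []
    counter = 0
    while len(values) < dim:
        # Hash the text with a counter prefix to get as many bytes as dim requires.
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        # Scale each byte (0..255) into the range [-1.0, 1.0].
        values.extend(b / 127.5 - 1.0 for b in digest)
        counter += 1
    vec = values[:dim]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def embed(text: str, provider: str = "local", dim: int = 64) -> list[float]:
    """Dispatch on provider; only the offline 'local' branch exists in this sketch."""
    if provider == "local":
        return pseudo_embed(text, dim)
    # A real API client would be added here as another branch.
    raise NotImplementedError(f"provider {provider!r} is not wired up in this sketch")
```

Vectors produced this way are repeatable but carry no semantic meaning, so they are useful for exercising the save and search plumbing offline and not for judging retrieval quality.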