Saurab Mishra commited on
Commit
34dcea4
Β·
0 Parent(s):

Initial open source release

Browse files
.dockerignore ADDED
@@ -0,0 +1,24 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ .venv/
2
+ __pycache__/
3
+ *.pyc
4
+ *.pyo
5
+ data_raw/
6
+ data_processed/
7
+ embeddings/
8
+ .git/
9
+ .qdrant/
10
+ *.db
11
+ *.index
12
+ *.db-shm
13
+ *.db-wal
14
+ faraday-server/
15
+ scripts/
16
+ 00 - Inbox/
17
+ 01 - Raw Sources/
18
+ 02 - Wiki/
19
+ 03 - Data Lake/
20
+ .obsidian/
21
+ *.md
22
+ !README.md
23
+ test_*.py
24
+ results.txt
.gitignore ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Python environments
2
+ .venv/
3
+ __pycache__/
4
+ *.pyc
5
+
6
+ # Local data / DBs
7
+ data_raw/
8
+ data_processed/
9
+ embeddings/
10
+
11
+ # Keys / Config
12
+ .env
13
+ secret
DATA_INGESTION.md ADDED
@@ -0,0 +1,64 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Faraday Memory: Data Ingestion Guide
2
+
3
+ Your AI memory system has built-in smart parsers designed to ingest various file formats. You do not need to convert your documents manually β€” the ingestion pipeline handles extraction, chunking, and vector embedding automatically based on file extensions.
4
+
5
+ ## πŸ“₯ Where to Drop Files
6
+
7
+ You have two primary designated "Drop Zones". The `sync.py push` command recursively scans these directories and all their subfolders:
8
+
9
+ 1. **Your Obsidian Vault's Inbox or Reference Folders**
10
+ - `00 - Inbox/`
11
+ - `01 - Raw Sources/`
12
+ 2. **The Dedicated Raw Data Directory**
13
+ - `Faraday/ai-memory-mcp/data_raw/`
14
+
15
+ ## πŸ“„ Supported Formats & Naming Rules
16
+
17
+ ### 1. ChatGPT Exports (JSON)
18
+ * **Format:** `.json`
19
+ * **Rule:** The filename **must** contain the word `conversations`.
20
+ * **Example:** `chatgpt_conversations.json` or `conversations_2025.json`
21
+ * *Behavior:* Explodes the dump and parses it specifically tracking Human vs AI messages.
22
+
23
+ ### 2. Gemini Exports (HTML / Takeout)
24
+ * **Format:** `.html`
25
+ * **Rule:** The filename **must** contain the word `activity` or `gemini`.
26
+ * **Example:** `My_Gemini_Activity.html`
27
+ * *Behavior:* Strips out Google tracking headers and parses out prompts and responses cleanly.
28
+
29
+ ### 3. PDF Documents (Coursework, Papers, E-books)
30
+ * **Format:** `.pdf`
31
+ * **Rule:** Standard PDF documents (text-based).
32
+ * **Example:** `Linear_Algebra_Notes.pdf` or `Attention_Is_All_You_Need.pdf`
33
+ * *Behavior:* Extracts multi-page text using PDFMiner/PyMuPDF into logical reading chunks.
34
+
35
+ ### 4. Images & Scans (Diagrams, Receipts, Screenshots)
36
+ * **Format:** `.png`, `.jpg`, `.jpeg`, `.webp`, `.bmp`, `.tiff`
37
+ * **Rule:** Can be named anything.
38
+ * **Requirements:** You must have [Tesseract-OCR](https://github.com/UB-Mannheim/tesseract/wiki) installed on your system.
39
+ * *Behavior:* Automatically performs Optical Character Recognition (OCR) to rip the text out of the image and inject it into your vector database.
40
+
41
+ ### 5. Plain Text, Emails, Code, CSVs, and Markdown
42
+ * **Format:** `.md`, `.txt`, `.csv`, `.log`, `.rst`
43
+ * **Rule:** Can be named anything.
44
+ * **Example:** `Meeting_Notes.md` or `email_invoice.txt`
45
+ * *Behavior:* The default fallback. Reads the file as plain structured text and intelligently slices it up. Large CSVs are skipped if they exceed 15MB to prevent clogging the embedding memory.
46
+
47
+ ---
48
+
49
+ ## πŸš€ How to Add It to the Cloud
50
+
51
+ Once your files are dropped into the folders:
52
+ 1. Open your terminal in the `ai-memory-mcp` folder.
53
+ 2. Run the ingestion and cloud synchronization command:
54
+ ```bash
55
+ python sync.py push
56
+ ```
57
+ 3. The script will:
58
+ - Identify *only* the new files.
59
+ - Run the appropriate parser (OCR, HTML extractor, etc.).
60
+ - Break them down into semantic chunks and generate embeddings.
61
+ - Compress the new database.
62
+ - Securely upload the compressed database to Supabase.
63
+
64
+ Your Claude mobile app (and Antigravity!) will automatically have access to this new data upon their next query!
Dockerfile ADDED
@@ -0,0 +1,49 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ─────────────────────────────────────────────────────
2
+ # Faraday AI Memory β€” Cloud Run Container
3
+ # ─────────────────────────────────────────────────────
4
+ # Multi-stage build:
5
+ # 1. Pre-download the embedding model at build time
6
+ # 2. Install all Python dependencies
7
+ # 3. Runs cloud_server.py on port 8080
8
+
9
+ FROM python:3.11-slim AS base
10
+
11
+ # System dependencies for faiss-cpu
12
+ RUN apt-get update && apt-get install -y --no-install-recommends \
13
+ build-essential \
14
+ && rm -rf /var/lib/apt/lists/*
15
+
16
+ # Set working directory
17
+ WORKDIR /app
18
+
19
+ # Copy requirements first (Docker layer caching)
20
+ COPY requirements.txt .
21
+ RUN pip install --no-cache-dir -r requirements.txt
22
+
23
+ # Pre-download the embedding model at build time
24
+ # This avoids a ~100MB download on every cold start
25
+ RUN python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
26
+
27
+ # Copy application code
28
+ COPY config.py .
29
+ COPY database/ database/
30
+ COPY processing/ processing/
31
+ COPY ingestion/ ingestion/
32
+ COPY mcp_server/ mcp_server/
33
+
34
+ # Create data directories
35
+ RUN mkdir -p /tmp/faraday-data data_raw data_processed embeddings
36
+
37
+ # Hugging Face Spaces uses PORT env var (default 7860)
38
+ ENV PORT=7860
39
+ ENV CLOUD_DATA_DIR=/tmp/faraday-data
40
+
41
+ # Expose the port
42
+ EXPOSE 7860
43
+
44
+ # Health check
45
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
46
+ CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/sse')" || exit 1
47
+
48
+ # Run the cloud MCP server
49
+ CMD ["python", "mcp_server/cloud_server.py"]
README.md ADDED
@@ -0,0 +1,76 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: Faraday Memory
3
+ emoji: 🧠
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: docker
7
+ app_port: 7860
8
+ ---
9
+
10
+ # Faraday AI Memory
11
+
12
+ **Author:** Saurab Mishra
13
+
14
+ Look, I got tired of switching between five different "AI" coding tools and repeating the same context a million times like a broken record. Claude Code, Cursor, Antigravity, Copilotβ€”they all suffer from the exact same severe, chronic amnesia every time you close the tab. You tell them your tech stack on Monday, and by Tuesday they're confidently rewriting your backend in PHP.
15
+
16
+ So I built **Faraday**.
17
+
18
+ Faraday is a brutally fast, brutally simple Model Context Protocol (MCP) server that slams your local documents, chat histories, emails, and PDFs directly into a vector database, compresses it, and force-feeds it to whatever vibe-coding AI toy you are currently obsessing over.
19
+
20
+ You drop files in a folder, run a script, and suddenly your AI actually remembers who you are and what you're working on. Groundbreaking concept, I know.
21
+
22
+ ---
23
+
24
+ ## What It Actually Does
25
+
26
+ 1. **Ingests literally everything:** Drop `.md`, `.json` (ChatGPT exports), `.html` (Gemini exports), `.pdf`, or even `.png` (images with text) into the `data_raw` folder.
27
+ 2. **Chunking & Vectorization:** It chunks the files, runs them through an extremely lightweight open-source embedding model (`all-MiniLM-L6-v2`), and indexes them in FAISS.
28
+ 3. **Cloud Synchronization:** Compresses the gigantic, bloated vector indexes down to a few megabytes and shoves them into a private Supabase bucket.
29
+ 4. **SSE Server:** Boots up an HTTP/SSE server (perfect for Hugging Face Spaces or Cloud Run) that safely pulls your index from Supabase into memory, hiding it behind a simple API key firewall.
30
+
31
+ Your data stays entirely under your control (no relying on bloated SaaS corporate subscriptions).
32
+
33
+ ## The Local Dev Setup (If you want to run it on your own laptop)
34
+
35
+ If you're deploying this to the cloud, the `Dockerfile` takes care of the heavy lifting. But if you want to run the ingestion and test locally:
36
+
37
+ ### 1. Requirements
38
+
39
+ Install the stuff. You probably already have half of this globally installed anyway.
40
+ ```bash
41
+ pip install -r requirements.txt
42
+ ```
43
+ *(Pro-tip: If you want it to actually read the text out of your images, you need to suffer through installing `Tesseract-OCR` on your OS level. Don't blame me, complain to Google).*
44
+
45
+ ### 2. Configuration
46
+
47
+ Open `config.py`. It's stupidly simple.
48
+ - Put your raw messy files in the `data_raw/` directory.
49
+ - For the Supabase cloud connection, set these environment variables (or hardcode them if you like playing with fire):
50
+ - `SUPABASE_URL`
51
+ - `SUPABASE_KEY` (Needs to be the service key to bypass Row Level Security, since we use private storage buckets)
52
+
53
+ ### 3. Update the Brain
54
+
55
+ Whenever you hoard more documents, just run:
56
+ ```bash
57
+ python sync.py push
58
+ ```
59
+ It will automatically find the new files, ignore the ones it already did, run the embeddings, compress the database, and shoot it to the cloud.
60
+
61
+ ### 4. Connect to your AI
62
+
63
+ Connect any MCP-compatible agent directly to the cloud server, or run it locally:
64
+ ```bash
65
+ # Local standard I/O (For Cursor / Desktop apps)
66
+ python mcp_server/main.py
67
+
68
+ # Or connect to the Cloud SSE endpoint
69
+ URL: https://<your-huggingface-space>.hf.space/sse
70
+ Header: X-API-Key: <your_secret_password>
71
+ ```
72
+
73
+ Enjoy not having to constantly remind your AI what programming language you're using.
74
+
75
+ ---
76
+ *No personal data, keys, or vector blobs are included in this repo. I `.gitignore`'d all of it. If you manage to leak your API keys, that's entirely on you.*
config.py ADDED
@@ -0,0 +1,110 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ config.py β€” Central configuration for the AI Memory MCP system.
3
+
4
+ All paths, model settings, and tuning constants live here.
5
+ Every module imports from this file; nothing is hardcoded elsewhere.
6
+ """
7
+
8
+ import os
9
+ from pathlib import Path
10
+
11
+ # ─────────────────────────────────────────────────────
12
+ # Directory Layout
13
+ # ─────────────────────────────────────────────────────
14
+
15
+ BASE_DIR = Path(__file__).parent.resolve()
16
+
17
+ # Where raw exports / documents are dropped for ingestion
18
+ DATA_RAW = BASE_DIR / "data_raw"
19
+
20
+ # Processed artefacts (SQLite DB, FAISS index)
21
+ DATA_PROCESSED = BASE_DIR / "data_processed"
22
+
23
+ # FAISS index file
24
+ EMBEDDINGS_DIR = BASE_DIR / "embeddings"
25
+
26
+ # Obsidian vault root (parent of ai-memory-mcp/)
27
+ OBSIDIAN_VAULT = BASE_DIR.parent
28
+
29
+ # Obsidian source directories to scan during update
30
+ OBSIDIAN_SCAN_DIRS = [
31
+ OBSIDIAN_VAULT / "00 - Inbox",
32
+ OBSIDIAN_VAULT / "01 - Raw Sources",
33
+ OBSIDIAN_VAULT / "02 - Wiki",
34
+ OBSIDIAN_VAULT / "03 - Data Lake",
35
+ ]
36
+
37
+ # Directories / patterns to skip during recursive scanning
38
+ SKIP_PATTERNS = [
39
+ ".qdrant", ".chroma", ".venv", "__pycache__",
40
+ "faraday-server", "ai-memory-mcp", "node_modules",
41
+ ".git", ".obsidian",
42
+ ]
43
+
44
+ # Ensure critical dirs exist
45
+ for _d in [DATA_RAW, DATA_PROCESSED, EMBEDDINGS_DIR]:
46
+ _d.mkdir(parents=True, exist_ok=True)
47
+
48
+ # ─────────────────────────────────────────────────────
49
+ # Database Paths
50
+ # ─────────────────────────────────────────────────────
51
+
52
+ SQLITE_DB_PATH = DATA_PROCESSED / "memory.db"
53
+ FAISS_INDEX_PATH = EMBEDDINGS_DIR / "memory.index"
54
+
55
+ # ─────────────────────────────────────────────────────
56
+ # Embedding Model
57
+ # ─────────────────────────────────────────────────────
58
+
59
+ EMBEDDING_MODEL = "all-MiniLM-L6-v2"
60
+ EMBEDDING_DIM = 384 # Output dimension for all-MiniLM-L6-v2
61
+ BATCH_SIZE = 64 # SentenceTransformer encode batch size
62
+
63
+ # ─────────────────────────────────────────────────────
64
+ # Chunking
65
+ # ─────────────────────────────────────────────────────
66
+
67
+ CHUNK_MAX_WORDS = 300 # Target words per chunk
68
+ CHUNK_OVERLAP_WORDS = 50 # Overlap between adjacent chunks
69
+ CHUNK_MIN_LENGTH = 30 # Minimum characters β€” skip trivially short chunks
70
+
71
+ # ─────────────────────────────────────────────────────
72
+ # Search / Scoring
73
+ # ─────────────────────────────────────────────────────
74
+
75
+ SEMANTIC_WEIGHT = 0.7 # Weight for vector similarity in hybrid score
76
+ RECENCY_WEIGHT = 0.3 # Weight for timestamp recency in hybrid score
77
+ DEFAULT_TOP_K = 5 # Default number of results
78
+
79
+ # ─────────────────────────────────────────────────────
80
+ # FAISS Tuning
81
+ # ─────────────────────────────────────────────────────
82
+
83
+ # When total vectors exceed this threshold, rebuild as IVFFlat
84
+ # Below this, IndexFlatIP is fine (brute-force is fast enough)
85
+ FAISS_IVF_THRESHOLD = 10_000
86
+ FAISS_NLIST = 100 # Number of Voronoi cells for IVFFlat
87
+ FAISS_NPROBE = 10 # Cells to search at query time (speed/accuracy tradeoff)
88
+
89
+ # ─────────────────────────────────────────────────────
90
+ # OCR (optional)
91
+ # ─────────────────────────────────────────────────────
92
+
93
+ OCR_ENABLED = True # Set False to skip image ingestion entirely
94
+ # If tesseract is not in PATH, set the full path here:
95
+ # e.g. r"C:\Program Files\Tesseract-OCR\tesseract.exe"
96
+ TESSERACT_CMD = None
97
+
98
+ # ─────────────────────────────────────────────────────
99
+ # Cloud Sync (optional β€” Supabase)
100
+ # ─────────────────────────────────────────────────────
101
+
102
+ SUPABASE_URL = os.environ.get("SUPABASE_URL", "")
103
+ SUPABASE_KEY = os.environ.get("SUPABASE_KEY", "")
104
+ SUPABASE_BUCKET = os.environ.get("SUPABASE_BUCKET", "faraday-memory")
105
+
106
+ # ─────────────────────────────────────────────────────
107
+ # Logging
108
+ # ─────────────────────────────────────────────────────
109
+
110
+ LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
database/__init__.py ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ """
2
+ database β€” Storage layer with SQLite (metadata + FTS5) and FAISS (vectors).
3
+ """
database/faiss_db.py ADDED
@@ -0,0 +1,203 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ database.faiss_db β€” Production FAISS vector index with adaptive indexing.
3
+
4
+ Key design decisions:
5
+ - Uses IndexFlatIP (inner product on L2-normalized vectors) for small datasets.
6
+ - Automatically trains and rebuilds as IVFFlat when vectors exceed threshold
7
+ (default 10K), giving O(√N) search instead of O(N).
8
+ - Deferred disk writes: caller must explicitly call save() β€” eliminates
9
+ the O(NΒ²) disk I/O from writing after every batch.
10
+ - IndexIDMap wrapper maps FAISS positions to SQLite row IDs.
11
+ """
12
+
13
+ import os
14
+ import sys
15
+ from pathlib import Path
16
+ from typing import List, Optional, Tuple
17
+
18
+ import faiss
19
+ import numpy as np
20
+
21
+ # Import from root config
22
+ sys.path.insert(0, str(Path(__file__).parent.parent))
23
+ from config import (
24
+ EMBEDDING_DIM,
25
+ FAISS_INDEX_PATH,
26
+ FAISS_IVF_THRESHOLD,
27
+ FAISS_NLIST,
28
+ FAISS_NPROBE,
29
+ )
30
+
31
+
32
+ class VectorDB:
33
+ """
34
+ FAISS-backed vector index with adaptive index type selection.
35
+
36
+ For < FAISS_IVF_THRESHOLD vectors: brute-force IndexFlatIP.
37
+ For >= threshold: IVFFlat with configurable nlist/nprobe.
38
+ """
39
+
40
+ def __init__(
41
+ self,
42
+ dim: int = EMBEDDING_DIM,
43
+ index_path: Optional[str] = None,
44
+ ):
45
+ self.dim = dim
46
+ self.index_path = index_path or str(FAISS_INDEX_PATH)
47
+ self.index: Optional[faiss.Index] = None
48
+ self._load_or_create()
49
+
50
+ def _load_or_create(self):
51
+ """Load existing index from disk or create a fresh one."""
52
+ if os.path.exists(self.index_path):
53
+ self.index = faiss.read_index(self.index_path)
54
+ print(
55
+ f"[VectorDB] Loaded FAISS index: {self.index.ntotal} vectors",
56
+ file=sys.stderr,
57
+ )
58
+ else:
59
+ print("[VectorDB] Creating new FAISS IndexFlatIP.", file=sys.stderr)
60
+ base = faiss.IndexFlatIP(self.dim)
61
+ self.index = faiss.IndexIDMap(base)
62
+
63
+ def add_embeddings(self, embeddings: np.ndarray, ids: np.ndarray):
64
+ """
65
+ Add a batch of embeddings with their corresponding SQLite IDs.
66
+
67
+ Args:
68
+ embeddings: (N, dim) float32 array β€” MUST be L2-normalized.
69
+ ids: (N,) int64 array of SQLite row IDs.
70
+
71
+ NOTE: Does NOT write to disk. Call save() explicitly when done.
72
+ """
73
+ if embeddings.shape[0] == 0:
74
+ return
75
+
76
+ assert embeddings.shape[1] == self.dim, (
77
+ f"Dimension mismatch: got {embeddings.shape[1]}, expected {self.dim}"
78
+ )
79
+
80
+ embeddings = embeddings.astype(np.float32)
81
+ ids = ids.astype(np.int64)
82
+
83
+ # L2-normalize for cosine similarity via inner product
84
+ faiss.normalize_L2(embeddings)
85
+
86
+ self.index.add_with_ids(embeddings, ids)
87
+
88
+ def search(
89
+ self, query_embedding: np.ndarray, top_k: int = 5
90
+ ) -> List[Tuple[int, float]]:
91
+ """
92
+ Find the top_k most similar vectors.
93
+
94
+ Args:
95
+ query_embedding: (1, dim) or (dim,) float32 array.
96
+ top_k: Number of results.
97
+
98
+ Returns:
99
+ List of (sqlite_id, similarity_score) tuples.
100
+ Score is cosine similarity (higher = better, range 0-1).
101
+ """
102
+ if self.index is None or self.index.ntotal == 0:
103
+ return []
104
+
105
+ query_embedding = query_embedding.astype(np.float32)
106
+ if query_embedding.ndim == 1:
107
+ query_embedding = query_embedding.reshape(1, -1)
108
+
109
+ # Normalize query for cosine similarity
110
+ faiss.normalize_L2(query_embedding)
111
+
112
+ # Set nprobe for IVF indices (no-op for flat indices)
113
+ try:
114
+ # Access the underlying IVF quantizer if it exists
115
+ ivf = faiss.extract_index_ivf(self.index)
116
+ if ivf is not None:
117
+ ivf.nprobe = FAISS_NPROBE
118
+ except Exception:
119
+ pass
120
+
121
+ distances, indices = self.index.search(query_embedding, top_k)
122
+
123
+ results = []
124
+ for i in range(len(indices[0])):
125
+ idx = int(indices[0][i])
126
+ score = float(distances[0][i])
127
+ if idx != -1: # FAISS returns -1 for missing matches
128
+ results.append((idx, score))
129
+
130
+ return results
131
+
132
+ def maybe_rebuild_ivf(self):
133
+ """
134
+ If total vectors exceed the IVF threshold, rebuild as IVFFlat
135
+ for O(√N) search performance. Called after a full update cycle.
136
+
137
+ This is an expensive operation (re-indexes everything) but only
138
+ happens once when crossing the threshold.
139
+ """
140
+ if self.index.ntotal < FAISS_IVF_THRESHOLD:
141
+ return
142
+
143
+ # Check if already IVF
144
+ try:
145
+ ivf = faiss.extract_index_ivf(self.index)
146
+ if ivf is not None:
147
+ return # Already IVF, no rebuild needed
148
+ except Exception:
149
+ pass
150
+
151
+ print(
152
+ f"[VectorDB] Rebuilding as IVFFlat ({self.index.ntotal} vectors, "
153
+ f"nlist={FAISS_NLIST})...",
154
+ file=sys.stderr,
155
+ )
156
+
157
+ n = self.index.ntotal
158
+
159
+ try:
160
+ # IndexIDMap stores vectors in the sub-index and IDs in id_map
161
+ # Access the underlying flat index vectors directly
162
+ sub_index = self.index.index
163
+ all_vectors = faiss.vector_to_array(sub_index.xb).reshape(n, self.dim).copy()
164
+
165
+ # Extract the ID mapping
166
+ all_ids = faiss.vector_to_array(self.index.id_map).copy()
167
+
168
+ # Build new IVFFlat index
169
+ quantizer = faiss.IndexFlatIP(self.dim)
170
+ ivf_index = faiss.IndexIVFFlat(
171
+ quantizer, self.dim, FAISS_NLIST, faiss.METRIC_INNER_PRODUCT
172
+ )
173
+
174
+ # Train on existing vectors
175
+ ivf_index.train(all_vectors)
176
+
177
+ # Wrap in IDMap and add with original IDs
178
+ new_index = faiss.IndexIDMap(ivf_index)
179
+ new_index.add_with_ids(all_vectors, all_ids)
180
+
181
+ self.index = new_index
182
+ print(f"[VectorDB] IVFFlat rebuild complete.", file=sys.stderr)
183
+
184
+ except Exception as e:
185
+ print(
186
+ f"[VectorDB] IVF rebuild failed: {e}. Keeping flat index.",
187
+ file=sys.stderr,
188
+ )
189
+
190
+ def save(self):
191
+ """Persist the index to disk. Call once at end of update pipeline."""
192
+ if self.index is not None:
193
+ # Ensure parent directory exists
194
+ os.makedirs(os.path.dirname(self.index_path), exist_ok=True)
195
+ faiss.write_index(self.index, self.index_path)
196
+ print(
197
+ f"[VectorDB] Saved index ({self.index.ntotal} vectors) to {self.index_path}",
198
+ file=sys.stderr,
199
+ )
200
+
201
+ def count(self) -> int:
202
+ """Total number of indexed vectors."""
203
+ return self.index.ntotal if self.index else 0
database/sqlite_db.py ADDED
@@ -0,0 +1,246 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ database.sqlite_db β€” Production SQLite metadata store with FTS5 full-text search.
3
+
4
+ Features:
5
+ - WAL journal mode for concurrent read/write
6
+ - FTS5 virtual table for keyword search
7
+ - Hash-based deduplication (UNIQUE constraint on hash column)
8
+ - Time-range queries via indexed timestamp column
9
+ - Tag filtering
10
+ - Batch insert via executemany
11
+ - Preserves ordered retrieval by maintaining original ID sequence
12
+ """
13
+
14
+ import sqlite3
15
+ import sys
16
+ from pathlib import Path
17
+ from typing import Dict, List, Optional, Set
18
+
19
+ # Import from root config
20
+ sys.path.insert(0, str(Path(__file__).parent.parent))
21
+ from config import SQLITE_DB_PATH
22
+
23
+
24
+ class MemoryDB:
25
+ """
26
+ SQLite-backed metadata store for memory chunks.
27
+ Each chunk gets a monotonically increasing integer ID that maps
28
+ exactly to its FAISS vector index position.
29
+ """
30
+
31
+ def __init__(self, db_path: Optional[Path] = None, readonly: bool = False):
32
+ path = str(db_path or SQLITE_DB_PATH)
33
+ if readonly:
34
+ # Read-only URI connection β€” safe for concurrent MCP server reads
35
+ uri = f"file:{path}?mode=ro"
36
+ self.conn = sqlite3.connect(uri, uri=True, check_same_thread=False)
37
+ else:
38
+ self.conn = sqlite3.connect(path, check_same_thread=False)
39
+ # WAL mode: allows concurrent readers while writing
40
+ self.conn.execute("PRAGMA journal_mode=WAL;")
41
+ self.conn.execute("PRAGMA synchronous=NORMAL;")
42
+
43
+ self.conn.row_factory = sqlite3.Row
44
+ self._init_tables()
45
+
46
+ def _init_tables(self):
47
+ """Create core tables and FTS5 index if they don't exist."""
48
+ with self.conn:
49
+ # Core metadata table
50
+ self.conn.execute("""
51
+ CREATE TABLE IF NOT EXISTS memories (
52
+ id INTEGER PRIMARY KEY AUTOINCREMENT,
53
+ hash TEXT UNIQUE NOT NULL,
54
+ text TEXT NOT NULL,
55
+ source TEXT NOT NULL DEFAULT '',
56
+ timestamp TEXT NOT NULL DEFAULT '',
57
+ tags TEXT NOT NULL DEFAULT '',
58
+ created TEXT NOT NULL DEFAULT (datetime('now'))
59
+ )
60
+ """)
61
+
62
+ # Indices for fast lookups
63
+ self.conn.execute(
64
+ "CREATE INDEX IF NOT EXISTS idx_hash ON memories(hash);"
65
+ )
66
+ self.conn.execute(
67
+ "CREATE INDEX IF NOT EXISTS idx_timestamp ON memories(timestamp);"
68
+ )
69
+ self.conn.execute(
70
+ "CREATE INDEX IF NOT EXISTS idx_tags ON memories(tags);"
71
+ )
72
+
73
+ # FTS5 virtual table for full-text keyword search
74
+ # content= makes it an external-content FTS table (no data duplication)
75
+ self.conn.execute("""
76
+ CREATE VIRTUAL TABLE IF NOT EXISTS memories_fts
77
+ USING fts5(text, source, tags, content=memories, content_rowid=id)
78
+ """)
79
+
80
+ # Triggers to keep FTS5 in sync with the main table
81
+ self.conn.executescript("""
82
+ CREATE TRIGGER IF NOT EXISTS memories_ai AFTER INSERT ON memories BEGIN
83
+ INSERT INTO memories_fts(rowid, text, source, tags)
84
+ VALUES (new.id, new.text, new.source, new.tags);
85
+ END;
86
+
87
+ CREATE TRIGGER IF NOT EXISTS memories_ad AFTER DELETE ON memories BEGIN
88
+ INSERT INTO memories_fts(memories_fts, rowid, text, source, tags)
89
+ VALUES ('delete', old.id, old.text, old.source, old.tags);
90
+ END;
91
+
92
+ CREATE TRIGGER IF NOT EXISTS memories_au AFTER UPDATE ON memories BEGIN
93
+ INSERT INTO memories_fts(memories_fts, rowid, text, source, tags)
94
+ VALUES ('delete', old.id, old.text, old.source, old.tags);
95
+ INSERT INTO memories_fts(rowid, text, source, tags)
96
+ VALUES (new.id, new.text, new.source, new.tags);
97
+ END;
98
+ """)
99
+
100
+ # ─────────────────────────────────────────────
101
+ # Deduplication
102
+ # ─────────────────────────────────────────────
103
+
104
+ def get_existing_hashes(self) -> Set[str]:
105
+ """
106
+ Fetch all known content hashes.
107
+ Used by update.py to skip already-processed chunks.
108
+ """
109
+ cur = self.conn.execute("SELECT hash FROM memories")
110
+ return {row["hash"] for row in cur.fetchall()}
111
+
112
+ # ─────────────────────────────────────────────
113
+ # Batch Insert
114
+ # ─────────────────────────────────────────────
115
+
116
+ def insert_memories(self, data: List[Dict]) -> List[int]:
117
+ """
118
+ Insert a batch of chunks. Returns list of SQLite row IDs
119
+ that map exactly to FAISS vector positions.
120
+
121
+ Uses INSERT OR IGNORE to safely skip duplicates within
122
+ the same batch.
123
+ """
124
+ ids: List[int] = []
125
+ with self.conn:
126
+ for item in data:
127
+ cur = self.conn.execute(
128
+ """
129
+ INSERT OR IGNORE INTO memories (hash, text, source, timestamp, tags)
130
+ VALUES (?, ?, ?, ?, ?)
131
+ """,
132
+ (
133
+ item.get("hash", ""),
134
+ item.get("text", ""),
135
+ item.get("source", ""),
136
+ item.get("timestamp", ""),
137
+ item.get("tags", ""),
138
+ ),
139
+ )
140
+
141
+ if cur.lastrowid and cur.rowcount > 0:
142
+ ids.append(cur.lastrowid)
143
+ else:
144
+ # Retrieve existing ID for this hash (it was a duplicate)
145
+ existing = self.conn.execute(
146
+ "SELECT id FROM memories WHERE hash=?",
147
+ (item["hash"],),
148
+ ).fetchone()
149
+ if existing:
150
+ ids.append(existing["id"])
151
+
152
+ return ids
153
+
154
+ # ─────────────────────────────────────────────
155
+ # Retrieval
156
+ # ─────────────────────────────────────────────
157
+
158
+ def get_memories_by_ids(self, ids: List[int]) -> List[Dict]:
159
+ """
160
+ Retrieve full metadata for a list of FAISS-matched IDs.
161
+ Returns results in the SAME ORDER as the input IDs.
162
+ """
163
+ if not ids:
164
+ return []
165
+
166
+ placeholders = ",".join(["?"] * len(ids))
167
+ query = f"SELECT * FROM memories WHERE id IN ({placeholders})"
168
+ cur = self.conn.execute(query, ids)
169
+
170
+ row_dict = {row["id"]: dict(row) for row in cur.fetchall()}
171
+ return [row_dict[i] for i in ids if i in row_dict]
172
+
173
+ def search_by_time_range(
174
+ self, start_iso: str, end_iso: str, limit: int = 100
175
+ ) -> List[Dict]:
176
+ """
177
+ Retrieve memories within a time range (ISO 8601 strings).
178
+ """
179
+ cur = self.conn.execute(
180
+ """
181
+ SELECT id FROM memories
182
+ WHERE timestamp >= ? AND timestamp <= ?
183
+ ORDER BY timestamp DESC
184
+ LIMIT ?
185
+ """,
186
+ (start_iso, end_iso, limit),
187
+ )
188
+ return [row["id"] for row in cur.fetchall()]
189
+
190
+ def search_by_tags(self, tag: str, limit: int = 100) -> List[int]:
191
+ """Return IDs of memories matching a tag substring."""
192
+ cur = self.conn.execute(
193
+ "SELECT id FROM memories WHERE tags LIKE ? LIMIT ?",
194
+ (f"%{tag}%", limit),
195
+ )
196
+ return [row["id"] for row in cur.fetchall()]
197
+
198
+ def keyword_search(self, query: str, limit: int = 20) -> List[int]:
199
+ """
200
+ Full-text keyword search via FTS5.
201
+ Returns matching memory IDs ranked by BM25 relevance.
202
+ """
203
+ try:
204
+ cur = self.conn.execute(
205
+ """
206
+ SELECT rowid FROM memories_fts
207
+ WHERE memories_fts MATCH ?
208
+ ORDER BY rank
209
+ LIMIT ?
210
+ """,
211
+ (query, limit),
212
+ )
213
+ return [row["rowid"] for row in cur.fetchall()]
214
+ except Exception:
215
+ # FTS match syntax can fail on certain special characters
216
+ return []
217
+
218
+ # ─────────────────────────────────────────────
219
+ # Stats
220
+ # ─────────────────────────────────────────────
221
+
222
+ def count(self) -> int:
223
+ """Total number of stored memory chunks."""
224
+ return self.conn.execute("SELECT COUNT(*) FROM memories").fetchone()[0]
225
+
226
+ def get_stats(self) -> Dict:
227
+ """Return diagnostic statistics."""
228
+ row = self.conn.execute(
229
+ """
230
+ SELECT
231
+ COUNT(*) as total,
232
+ MIN(timestamp) as earliest,
233
+ MAX(timestamp) as latest,
234
+ COUNT(DISTINCT source) as sources
235
+ FROM memories
236
+ """
237
+ ).fetchone()
238
+ return dict(row) if row else {}
239
+
240
+ # ─────────────────────────────────────────────
241
+ # Lifecycle
242
+ # ─────────────────────────────────────────────
243
+
244
+ def close(self):
245
+ """Close the database connection."""
246
+ self.conn.close()
deploy.ps1 ADDED
@@ -0,0 +1,135 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <#
2
+ deploy.ps1 β€” One-command Faraday Cloud deployment.
3
+
4
+ Usage:
5
+ .\deploy.ps1 push # Upload data to Supabase only
6
+ .\deploy.ps1 deploy # Push data + deploy to Cloud Run
7
+ .\deploy.ps1 full # Push + build + deploy + test
8
+
9
+ Prerequisites:
10
+ - Python venv activated with supabase installed
11
+ - gcloud CLI authenticated (for Cloud Run deployment)
12
+ #>
13
+
14
+ param(
15
+ [Parameter(Position=0)]
16
+ [ValidateSet("push", "deploy", "full")]
17
+ [string]$Action = "full"
18
+ )
19
+
20
+ $ErrorActionPreference = "Stop"
21
+ $ProjectRoot = Split-Path -Parent $PSScriptRoot
22
+ $AiMemoryDir = $PSScriptRoot
23
+
24
+ # ─────────────────────────────────────────────────────
25
+ # Config
26
+ # ─────────────────────────────────────────────────────
27
+ $GCP_PROJECT = "faraday-memory-cloud"
28
+ $GCP_REGION = "asia-south1" # Mumbai β€” closest to India
29
+ $SERVICE_NAME = "faraday-mcp"
30
+ $FARADAY_API_KEY = "frdy_" + [System.Guid]::NewGuid().ToString("N").Substring(0, 24)
31
+
32
+ # ─────────────────────────────────────────────────────
33
+ # Step 1: Push data to Supabase
34
+ # ─────────────────────────────────────────────────────
35
+ function Push-Data {
36
+ Write-Host "`nπŸ“¦ Pushing data to Supabase Storage..." -ForegroundColor Cyan
37
+
38
+ $venvPython = Join-Path $ProjectRoot ".venv\Scripts\python.exe"
39
+ if (Test-Path $venvPython) {
40
+ & $venvPython (Join-Path $AiMemoryDir "sync.py") push
41
+ } else {
42
+ python (Join-Path $AiMemoryDir "sync.py") push
43
+ }
44
+
45
+ if ($LASTEXITCODE -ne 0) {
46
+ Write-Host "❌ Data push failed!" -ForegroundColor Red
47
+ exit 1
48
+ }
49
+ Write-Host "βœ… Data pushed to Supabase." -ForegroundColor Green
50
+ }
51
+
52
+ # ─────────────────────────────────────────────────────
53
+ # Step 2: Deploy to Cloud Run
54
+ # ─────────────────────────────────────────────────────
55
+ function Deploy-CloudRun {
56
+ Write-Host "`nπŸš€ Deploying to Google Cloud Run..." -ForegroundColor Cyan
57
+
58
+ # Check gcloud auth
59
+ $account = gcloud auth list --filter="status=ACTIVE" --format="value(account)" 2>$null
60
+ if (-not $account) {
61
+ Write-Host "⚠️ Not authenticated with gcloud. Running 'gcloud auth login'..." -ForegroundColor Yellow
62
+ gcloud auth login
63
+ }
64
+
65
+ # Set project
66
+ gcloud config set project $GCP_PROJECT 2>$null
67
+
68
+ # Enable required APIs
69
+ Write-Host " Enabling Cloud Run API..." -ForegroundColor Gray
70
+ gcloud services enable run.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com 2>$null
71
+
72
+ # Deploy from source (Cloud Build handles Dockerfile)
73
+ Write-Host " Building and deploying container..." -ForegroundColor Gray
74
+ gcloud run deploy $SERVICE_NAME `
75
+ --source $AiMemoryDir `
76
+ --region $GCP_REGION `
77
+ --platform managed `
78
+ --allow-unauthenticated `
79
+ --memory 2Gi `
80
+ --cpu 1 `
81
+ --min-instances 0 `
82
+ --max-instances 2 `
83
+ --timeout 300 `
84
+ --set-env-vars "SUPABASE_URL=https://qwxagrmoryojholseclm.supabase.co" `
85
+ --set-env-vars "SUPABASE_KEY=$env:SUPABASE_KEY" `
86
+ --set-env-vars "FARADAY_API_KEY=$FARADAY_API_KEY" `
87
+ --set-env-vars "SUPABASE_BUCKET=faraday-memory"
88
+
89
+ if ($LASTEXITCODE -ne 0) {
90
+ Write-Host "❌ Cloud Run deployment failed!" -ForegroundColor Red
91
+ exit 1
92
+ }
93
+
94
+ # Get service URL
95
+ $serviceUrl = gcloud run services describe $SERVICE_NAME --region $GCP_REGION --format="value(status.url)"
96
+
97
+ Write-Host "`n" -NoNewline
98
+ Write-Host "═══════════════════════════════════════════════════" -ForegroundColor Green
99
+ Write-Host " βœ… DEPLOYMENT COMPLETE!" -ForegroundColor Green
100
+ Write-Host "═══════════════════════════════════════════════════" -ForegroundColor Green
101
+ Write-Host "`n Service URL: $serviceUrl" -ForegroundColor White
102
+ Write-Host " SSE Endpoint: $serviceUrl/sse" -ForegroundColor White
103
+ Write-Host " API Key: $FARADAY_API_KEY" -ForegroundColor Yellow
104
+ Write-Host "`n πŸ“± Claude Phone Config:" -ForegroundColor Cyan
105
+ Write-Host @"
106
+ {
107
+ "mcpServers": {
108
+ "faraday": {
109
+ "command": "npx",
110
+ "args": [
111
+ "-y", "mcp-remote",
112
+ "$serviceUrl/sse"
113
+ ]
114
+ }
115
+ }
116
+ }
117
+ "@ -ForegroundColor Gray
118
+ Write-Host ""
119
+ }
120
+
121
+ # ─────────────────────────────────────────────────────
122
+ # Execute
123
+ # ─────────────────────────────────────────────────────
124
+ switch ($Action) {
125
+ "push" {
126
+ Push-Data
127
+ }
128
+ "deploy" {
129
+ Deploy-CloudRun
130
+ }
131
+ "full" {
132
+ Push-Data
133
+ Deploy-CloudRun
134
+ }
135
+ }
ingestion/__init__.py ADDED
@@ -0,0 +1,54 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ingestion β€” Multi-format document parser package.
3
+
4
+ Each parser is a generator that yields standardized document dicts:
5
+ {"text": str, "source": str, "timestamp": str, "tags": str}
6
+
7
+ The `process_file` router dispatches files to the correct parser
8
+ based on extension and filename heuristics.
9
+ """
10
+
11
+ from pathlib import Path
12
+ from typing import Dict, Generator
13
+
14
+ from .markdown import parse_markdown
15
+ from .chatgpt import parse_chatgpt_export
16
+ from .gemini import parse_gemini_html
17
+ from .pdf import parse_pdf
18
+ from .image import parse_image
19
+
20
+
21
+ def process_file(filepath: Path) -> Generator[Dict, None, None]:
22
+ """
23
+ Route a file to the appropriate parser based on extension
24
+ and filename heuristics. Yields standardized document dicts.
25
+ """
26
+ ext = filepath.suffix.lower()
27
+ name = filepath.name.lower()
28
+
29
+ # ChatGPT JSON export
30
+ if ext == ".json" and "conversations" in name:
31
+ yield from parse_chatgpt_export(filepath)
32
+
33
+ # Gemini HTML takeout
34
+ elif ext == ".html" and ("activity" in name or "gemini" in name):
35
+ yield from parse_gemini_html(filepath)
36
+
37
+ # PDF documents
38
+ elif ext == ".pdf":
39
+ yield from parse_pdf(filepath)
40
+
41
+ # Images (OCR)
42
+ elif ext in (".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"):
43
+ yield from parse_image(filepath)
44
+
45
+ # Markdown / plain text (default)
46
+ elif ext in (".md", ".txt", ".rst", ".log", ".csv"):
47
+ yield from parse_markdown(filepath)
48
+
49
+ # Unknown β€” attempt as plain text
50
+ else:
51
+ try:
52
+ yield from parse_markdown(filepath)
53
+ except Exception:
54
+ pass
ingestion/chatgpt.py ADDED
@@ -0,0 +1,88 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ingestion.chatgpt β€” Parse ChatGPT conversation exports.
3
+
4
+ Standard ChatGPT data export produces a `conversations.json` file
5
+ containing an array of conversation objects, each with a nested
6
+ `mapping` dict of message nodes.
7
+
8
+ This parser uses streaming JSON for memory efficiency on large exports
9
+ (some users have 500 MB+ conversation files).
10
+ """
11
+
12
+ import datetime
13
+ import json
14
+ import sys
15
+ from pathlib import Path
16
+ from typing import Dict, Generator
17
+
18
+
19
+ def parse_chatgpt_export(filepath: Path) -> Generator[Dict, None, None]:
20
+ """
21
+ Parse a ChatGPT conversations.json export.
22
+ Yields one document per non-system message.
23
+ """
24
+ try:
25
+ with open(filepath, "r", encoding="utf-8") as f:
26
+ data = json.load(f)
27
+
28
+ if not isinstance(data, list):
29
+ print(
30
+ f"[ingestion.chatgpt] Expected top-level list in {filepath}",
31
+ file=sys.stderr,
32
+ )
33
+ return
34
+
35
+ for convo in data:
36
+ title = convo.get("title", "Untitled")
37
+ create_time = convo.get("create_time", 0)
38
+ mapping = convo.get("mapping", {})
39
+
40
+ for _node_id, node in mapping.items():
41
+ msg = node.get("message")
42
+ if not msg:
43
+ continue
44
+
45
+ # Extract text content
46
+ parts = msg.get("content", {}).get("parts", [])
47
+ if not parts:
48
+ continue
49
+
50
+ # Filter: only string parts (skip tool_use, image refs, etc.)
51
+ text_parts = [p for p in parts if isinstance(p, str)]
52
+ text = "".join(text_parts).strip()
53
+
54
+ if not text:
55
+ continue
56
+
57
+ role = msg.get("author", {}).get("role", "unknown")
58
+
59
+ # Skip system messages β€” they're boilerplate
60
+ if role == "system":
61
+ continue
62
+
63
+ # Timestamp: prefer message-level, fallback to conversation-level
64
+ msg_time = msg.get("create_time") or create_time or 0
65
+ if msg_time:
66
+ timestamp = datetime.datetime.fromtimestamp(
67
+ msg_time
68
+ ).isoformat()
69
+ else:
70
+ timestamp = "Unknown"
71
+
72
+ yield {
73
+ "text": f"[{role}] {text}",
74
+ "source": f"ChatGPT: {title}",
75
+ "timestamp": timestamp,
76
+ "tags": "chatgpt,chat",
77
+ }
78
+
79
+ except json.JSONDecodeError as e:
80
+ print(
81
+ f"[ingestion.chatgpt] JSON decode error in {filepath}: {e}",
82
+ file=sys.stderr,
83
+ )
84
+ except Exception as e:
85
+ print(
86
+ f"[ingestion.chatgpt] Error parsing {filepath}: {e}",
87
+ file=sys.stderr,
88
+ )
ingestion/gemini.py ADDED
@@ -0,0 +1,97 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ingestion.gemini β€” Parse Google Gemini takeout HTML exports.
3
+
4
+ Google Takeout for Gemini produces an "My Activity.html" file
5
+ with interaction cards wrapped in `<div class="outer-cell">` elements.
6
+ This parser streams through those cards and yields one document
7
+ per interaction.
8
+ """
9
+
10
+ import datetime
11
+ import sys
12
+ from pathlib import Path
13
+ from typing import Dict, Generator
14
+
15
+
16
+ def parse_gemini_html(filepath: Path) -> Generator[Dict, None, None]:
17
+ """
18
+ Parse a Google Takeout Gemini activity HTML file.
19
+ Yields one document per interaction card.
20
+ """
21
+ try:
22
+ from bs4 import BeautifulSoup
23
+ except ImportError:
24
+ print(
25
+ "[ingestion.gemini] beautifulsoup4 not installed. "
26
+ "Run: pip install beautifulsoup4 lxml",
27
+ file=sys.stderr,
28
+ )
29
+ return
30
+
31
+ try:
32
+ html = filepath.read_text(encoding="utf-8", errors="replace")
33
+ soup = BeautifulSoup(html, "lxml")
34
+
35
+ # Google Takeout wraps each interaction in outer-cell divs
36
+ cards = soup.find_all("div", class_="outer-cell")
37
+
38
+ if not cards:
39
+ # Fallback: dump entire body text as one document
40
+ raw = soup.get_text(separator="\n\n", strip=True)
41
+ if raw.strip():
42
+ yield {
43
+ "text": raw,
44
+ "source": f"Gemini: {filepath.name}",
45
+ "timestamp": _file_timestamp(filepath),
46
+ "tags": "gemini,chat",
47
+ }
48
+ return
49
+
50
+ for idx, card in enumerate(cards):
51
+ text = card.get_text(separator=" ", strip=True)
52
+ if not text or len(text) < 10:
53
+ continue
54
+
55
+ # Attempt to extract timestamp from card content
56
+ # Gemini cards often contain a date string; we try to find it
57
+ timestamp = _extract_card_timestamp(card, filepath)
58
+
59
+ yield {
60
+ "text": text,
61
+ "source": f"Gemini: Interaction {idx + 1}",
62
+ "timestamp": timestamp,
63
+ "tags": "gemini,chat",
64
+ }
65
+
66
+ except Exception as e:
67
+ print(f"[ingestion.gemini] Error parsing {filepath}: {e}", file=sys.stderr)
68
+
69
+
70
+ def _extract_card_timestamp(card, filepath: Path) -> str:
71
+ """
72
+ Try to find a timestamp within a Gemini activity card.
73
+ Falls back to file modification time.
74
+ """
75
+ try:
76
+ from dateutil import parser as date_parser
77
+
78
+ # Look for common date containers in Takeout HTML
79
+ date_cells = card.find_all("div", class_="content-cell")
80
+ for cell in date_cells:
81
+ text = cell.get_text(strip=True)
82
+ # dateutil.parser is very flexible with date formats
83
+ try:
84
+ dt = date_parser.parse(text, fuzzy=True)
85
+ return dt.isoformat()
86
+ except (ValueError, OverflowError):
87
+ continue
88
+ except ImportError:
89
+ pass
90
+
91
+ return _file_timestamp(filepath)
92
+
93
+
94
+ def _file_timestamp(filepath: Path) -> str:
95
+ """Fallback: use file modification time."""
96
+ mtime = filepath.stat().st_mtime
97
+ return datetime.datetime.fromtimestamp(mtime).isoformat()
ingestion/image.py ADDED
@@ -0,0 +1,102 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ingestion.image β€” Extract text from images via OCR.
3
+
4
+ Uses Pillow + pytesseract. Gracefully degrades if Tesseract
5
+ is not installed β€” warns ONCE and skips all images silently.
6
+ """
7
+
8
+ import datetime
9
+ import sys
10
+ from pathlib import Path
11
+ from typing import Dict, Generator
12
+
13
+ # Module-level availability check β€” run once, not per-file
14
+ _ocr_available = None
15
+ _ocr_warned = False
16
+
17
+
18
+ def _check_ocr():
19
+ """Check OCR availability once at module level."""
20
+ global _ocr_available, _ocr_warned
21
+
22
+ if _ocr_available is not None:
23
+ return _ocr_available
24
+
25
+ try:
26
+ from config import OCR_ENABLED
27
+ if not OCR_ENABLED:
28
+ _ocr_available = False
29
+ return False
30
+ except ImportError:
31
+ pass
32
+
33
+ try:
34
+ from PIL import Image # noqa: F401
35
+ import pytesseract # noqa: F401
36
+ _ocr_available = True
37
+ except ImportError:
38
+ _ocr_available = False
39
+ if not _ocr_warned:
40
+ print(
41
+ "[ingestion.image] OCR unavailable (Pillow/pytesseract not installed). "
42
+ "Skipping all images.",
43
+ file=sys.stderr,
44
+ )
45
+ _ocr_warned = True
46
+
47
+ return _ocr_available
48
+
49
+
50
+ def parse_image(filepath: Path) -> Generator[Dict, None, None]:
51
+ """
52
+ Run OCR on an image file and yield extracted text as a document.
53
+ Skips silently if OCR dependencies are missing (logs once).
54
+ """
55
+ if not _check_ocr():
56
+ return
57
+
58
+ from PIL import Image
59
+ import pytesseract
60
+
61
+ # Configure tesseract path if provided
62
+ try:
63
+ from config import TESSERACT_CMD
64
+ if TESSERACT_CMD:
65
+ pytesseract.pytesseract.tesseract_cmd = TESSERACT_CMD
66
+ except ImportError:
67
+ pass
68
+
69
+ try:
70
+ img = Image.open(filepath)
71
+ text = pytesseract.image_to_string(img)
72
+
73
+ if not text or len(text.strip()) < 20:
74
+ return
75
+
76
+ mtime = filepath.stat().st_mtime
77
+ timestamp = datetime.datetime.fromtimestamp(mtime).isoformat()
78
+
79
+ yield {
80
+ "text": text.strip(),
81
+ "source": f"OCR: {filepath.name}",
82
+ "timestamp": timestamp,
83
+ "tags": "image,ocr",
84
+ }
85
+
86
+ except Exception as e:
87
+ error_name = type(e).__name__
88
+ if "Tesseract" in error_name:
89
+ global _ocr_available, _ocr_warned
90
+ _ocr_available = False
91
+ if not _ocr_warned:
92
+ print(
93
+ f"[ingestion.image] Tesseract not found. "
94
+ f"Install it or set TESSERACT_CMD in config.py.",
95
+ file=sys.stderr,
96
+ )
97
+ _ocr_warned = True
98
+ else:
99
+ print(
100
+ f"[ingestion.image] Error processing {filepath}: {e}",
101
+ file=sys.stderr,
102
+ )
ingestion/markdown.py ADDED
@@ -0,0 +1,38 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ingestion.markdown β€” Parse plain text and Markdown files.
3
+
4
+ Yields one document per file with file-modification timestamp.
5
+ """
6
+
7
+ import datetime
8
+ import sys
9
+ from pathlib import Path
10
+ from typing import Dict, Generator
11
+
12
+
13
+ def parse_markdown(filepath: Path) -> Generator[Dict, None, None]:
14
+ """
15
+ Read a text/markdown file and yield a single document dict.
16
+ Uses streaming read for large files (reads in 10 MB chunks
17
+ and joins β€” keeps memory bounded for very large markdown dumps).
18
+ """
19
+ try:
20
+ # For most markdown files (< 10 MB), this is instant.
21
+ # For larger files we still read all at once since we need
22
+ # the full text for chunking downstream.
23
+ text = filepath.read_text(encoding="utf-8", errors="replace")
24
+
25
+ if not text.strip():
26
+ return
27
+
28
+ mtime = filepath.stat().st_mtime
29
+ timestamp = datetime.datetime.fromtimestamp(mtime).isoformat()
30
+
31
+ yield {
32
+ "text": text,
33
+ "source": filepath.name,
34
+ "timestamp": timestamp,
35
+ "tags": "markdown",
36
+ }
37
+ except Exception as e:
38
+ print(f"[ingestion.markdown] Error reading {filepath}: {e}", file=sys.stderr)
ingestion/pdf.py ADDED
@@ -0,0 +1,48 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ingestion.pdf β€” Parse PDF documents page by page.
3
+
4
+ Uses PyPDF2 for lightweight, dependency-free PDF text extraction.
5
+ Each page is yielded as a separate document to keep chunk boundaries clean.
6
+ """
7
+
8
+ import datetime
9
+ import sys
10
+ from pathlib import Path
11
+ from typing import Dict, Generator
12
+
13
+
14
+ def parse_pdf(filepath: Path) -> Generator[Dict, None, None]:
15
+ """
16
+ Extract text from a PDF file, yielding one document per page.
17
+ Pages with negligible text (< 20 chars) are skipped.
18
+ """
19
+ try:
20
+ from PyPDF2 import PdfReader
21
+ except ImportError:
22
+ print(
23
+ "[ingestion.pdf] PyPDF2 not installed. Run: pip install PyPDF2",
24
+ file=sys.stderr,
25
+ )
26
+ return
27
+
28
+ try:
29
+ reader = PdfReader(str(filepath))
30
+ total_pages = len(reader.pages)
31
+
32
+ mtime = filepath.stat().st_mtime
33
+ timestamp = datetime.datetime.fromtimestamp(mtime).isoformat()
34
+
35
+ for page_num, page in enumerate(reader.pages, start=1):
36
+ text = page.extract_text()
37
+ if not text or len(text.strip()) < 20:
38
+ continue
39
+
40
+ yield {
41
+ "text": text.strip(),
42
+ "source": f"{filepath.name} (p.{page_num}/{total_pages})",
43
+ "timestamp": timestamp,
44
+ "tags": "pdf,document",
45
+ }
46
+
47
+ except Exception as e:
48
+ print(f"[ingestion.pdf] Error reading {filepath}: {e}", file=sys.stderr)
mcp_server/__init__.py ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ """
2
+ mcp_server β€” FastMCP server package exposing personal AI memory tools.
3
+ """
mcp_server/cloud_server.py ADDED
@@ -0,0 +1,426 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ mcp_server.cloud_server β€” Cloud-hosted MCP server for Faraday AI Memory.
3
+
4
+ Designed for Google Cloud Run deployment with:
5
+ - SSE (Server-Sent Events) transport for remote MCP access
6
+ - Supabase Storage integration: pulls memory.db + memory.index on startup
7
+ - API key authentication via X-API-Key header
8
+ - Same tools as local server: search_memory, get_memory_stats, sync_memory
9
+
10
+ Usage (local test):
11
+ FARADAY_API_KEY=mykey SUPABASE_URL=... SUPABASE_KEY=... python cloud_server.py
12
+
13
+ Usage (Cloud Run):
14
+ Deployed via Dockerfile, env vars set in Cloud Run config.
15
+ """
16
+
17
+ import datetime
18
+ import os
19
+ import sys
20
+ import tempfile
21
+ import threading
22
+ from pathlib import Path
23
+
24
+ # Silence Huggingface/SentenceTransformer logs
25
+ os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
26
+ os.environ["TOKENIZERS_PARALLELISM"] = "false"
27
+ os.environ["TRANSFORMERS_VERBOSITY"] = "error"
28
+ os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
29
+
30
+ import logging
31
+
32
+ logging.getLogger("sentence_transformers").setLevel(logging.ERROR)
33
+ logging.getLogger("transformers").setLevel(logging.ERROR)
34
+
35
+ # Fix imports: add project root to path
36
+ PROJECT_ROOT = str(Path(__file__).parent.parent)
37
+ sys.path.insert(0, PROJECT_ROOT)
38
+
39
+ from mcp.server.fastmcp import FastMCP
40
+
41
+ # ─────────────────────────────────────────────────────
42
+ # Configuration
43
+ # ─────────────────────────────────────────────────────
44
+
45
+ SUPABASE_URL = os.environ.get("SUPABASE_URL", "")
46
+ SUPABASE_KEY = os.environ.get("SUPABASE_KEY", "")
47
+ SUPABASE_BUCKET = os.environ.get("SUPABASE_BUCKET", "faraday-memory")
48
+ FARADAY_API_KEY = os.environ.get("FARADAY_API_KEY", "")
49
+ PORT = int(os.environ.get("PORT", "8080"))
50
+
51
+ # Cloud data directory
52
+ CLOUD_DATA_DIR = Path(os.environ.get("CLOUD_DATA_DIR", "/tmp/faraday-data"))
53
+ CLOUD_DB_PATH = CLOUD_DATA_DIR / "memory.db"
54
+ CLOUD_INDEX_PATH = CLOUD_DATA_DIR / "memory.index"
55
+
56
+ # Embedding model
57
+ EMBEDDING_MODEL = "all-MiniLM-L6-v2"
58
+ EMBEDDING_DIM = 384
59
+ DEFAULT_TOP_K = 5
60
+ SEMANTIC_WEIGHT = 0.7
61
+ RECENCY_WEIGHT = 0.3
62
+
63
+
64
+ # ─────────────────────────────────────────────────────
65
+ # Supabase Data Pull
66
+ # ─────────────────────────────────────────────────────
67
+
68
+ def pull_from_supabase():
69
+ """Download memory.db and memory.index from Supabase Storage via httpx (compressed)."""
70
+ if not SUPABASE_URL or not SUPABASE_KEY:
71
+ print("[CLOUD] WARNING: No Supabase credentials. Using local data if available.",
72
+ file=sys.stderr)
73
+ return False
74
+
75
+ try:
76
+ import httpx
77
+ import gzip
78
+
79
+ headers = {
80
+ "apikey": SUPABASE_KEY,
81
+ "Authorization": f"Bearer {SUPABASE_KEY}",
82
+ }
83
+
84
+ CLOUD_DATA_DIR.mkdir(parents=True, exist_ok=True)
85
+
86
+ files_to_pull = {
87
+ "memory.db": CLOUD_DB_PATH,
88
+ "memory.index": CLOUD_INDEX_PATH,
89
+ }
90
+
91
+ for remote_name, local_path in files_to_pull.items():
92
+ compressed_name = f"{remote_name}.gz"
93
+ print(f"[CLOUD] Downloading {compressed_name} from Supabase...",
94
+ file=sys.stderr)
95
+ try:
96
+ r = httpx.get(
97
+ f"{SUPABASE_URL}/storage/v1/object/{SUPABASE_BUCKET}/{compressed_name}",
98
+ headers=headers,
99
+ timeout=120,
100
+ )
101
+
102
+ # Fallback to uncompressed if .gz is missing (for backward compatibility)
103
+ if r.status_code == 404:
104
+ print(f"[CLOUD] ℹ️ Compressed not found, trying raw {remote_name}...", file=sys.stderr)
105
+ r = httpx.get(
106
+ f"{SUPABASE_URL}/storage/v1/object/{SUPABASE_BUCKET}/{remote_name}",
107
+ headers=headers,
108
+ timeout=120,
109
+ )
110
+ if r.status_code == 200:
111
+ with open(local_path, "wb") as f:
112
+ f.write(r.content)
113
+ size_mb = len(r.content) / (1024 * 1024)
114
+ print(f"[CLOUD] βœ… Raw {remote_name} downloaded ({size_mb:.1f} MB).", file=sys.stderr)
115
+ continue
116
+
117
+ if r.status_code != 200:
118
+ print(f"[CLOUD] ❌ {compressed_name} download failed ({r.status_code}): {r.text}",
119
+ file=sys.stderr)
120
+ return False
121
+
122
+ # Decompress in memory and write
123
+ print(f"[CLOUD] Decompressing {compressed_name}...", file=sys.stderr)
124
+ decompressed_data = gzip.decompress(r.content)
125
+
126
+ with open(local_path, "wb") as f:
127
+ f.write(decompressed_data)
128
+
129
+ download_mb = len(r.content) / (1024 * 1024)
130
+ decompressed_mb = len(decompressed_data) / (1024 * 1024)
131
+ print(f"[CLOUD] βœ… {remote_name} ready ({download_mb:.1f}MB β†’ {decompressed_mb:.1f}MB).",
132
+ file=sys.stderr)
133
+ except Exception as e:
134
+ print(f"[CLOUD] ❌ Failed to download {remote_name}: {e}",
135
+ file=sys.stderr)
136
+ return False
137
+
138
+ return True
139
+
140
+ except Exception as e:
141
+ print(f"[CLOUD] Supabase pull failed: {e}", file=sys.stderr)
142
+ return False
143
+
144
+
145
+ # ─────────────────────────────────────────────────────
146
+ # Initialize Data
147
+ # ─────────────────────────────────────────────────────
148
+
149
+ print("[CLOUD] Starting Faraday Cloud MCP Server...", file=sys.stderr)
150
+
151
+ # Pull data from Supabase on startup
152
+ pull_success = pull_from_supabase()
153
+
154
+ # Monkey-patch config paths for cloud environment
155
+ import config
156
+ config.SQLITE_DB_PATH = CLOUD_DB_PATH
157
+ config.FAISS_INDEX_PATH = CLOUD_INDEX_PATH
158
+ config.DATA_PROCESSED = CLOUD_DATA_DIR
159
+ config.EMBEDDINGS_DIR = CLOUD_DATA_DIR
160
+
161
+ from database.faiss_db import VectorDB
162
+ from database.sqlite_db import MemoryDB
163
+
164
+ # ─────────────────────────────────────────────────────
165
+ # Server Initialization
166
+ # ─────────────────────────────────────────────────────
167
+
168
+ mcp = FastMCP(
169
+ "Faraday-AI-Memory-Cloud",
170
+ host="0.0.0.0",
171
+ port=PORT
172
+ )
173
+
174
+ # Load data stores
175
+ if CLOUD_DB_PATH.exists() and CLOUD_INDEX_PATH.exists():
176
+ _db = MemoryDB(db_path=CLOUD_DB_PATH, readonly=True)
177
+ _vec_db = VectorDB(index_path=str(CLOUD_INDEX_PATH))
178
+ else:
179
+ print("[CLOUD] ⚠️ No data files found. Memory will be empty.", file=sys.stderr)
180
+ # Create empty stores
181
+ CLOUD_DATA_DIR.mkdir(parents=True, exist_ok=True)
182
+ _db = MemoryDB(db_path=CLOUD_DB_PATH, readonly=False)
183
+ _vec_db = VectorDB(index_path=str(CLOUD_INDEX_PATH))
184
+
185
+ # Load embedding model
186
+ from sentence_transformers import SentenceTransformer
187
+
188
+ print("[CLOUD] Loading embedding model...", file=sys.stderr)
189
+ _model = SentenceTransformer(EMBEDDING_MODEL)
190
+
191
+ print(
192
+ f"[CLOUD] βœ… Ready: {_vec_db.count()} vectors, {_db.count()} memories loaded.",
193
+ file=sys.stderr,
194
+ )
195
+
196
+
197
+ # ─────────────────────────────────────────────────────
198
+ # Time Helpers (same as local)
199
+ # ─────────────────────────────────────────────────────
200
+
201
+ def _resolve_time_filter(time_filter: str):
202
+ """Convert human-readable time filter to (start_iso, end_iso) tuple."""
203
+ if not time_filter or time_filter.lower() == "none":
204
+ return None
205
+
206
+ now = datetime.datetime.now()
207
+ key = time_filter.lower().strip()
208
+
209
+ if key == "today":
210
+ start = now.replace(hour=0, minute=0, second=0, microsecond=0)
211
+ return start.isoformat(), now.isoformat()
212
+ elif key == "yesterday":
213
+ yesterday = now - datetime.timedelta(days=1)
214
+ start = yesterday.replace(hour=0, minute=0, second=0, microsecond=0)
215
+ end = yesterday.replace(hour=23, minute=59, second=59)
216
+ return start.isoformat(), end.isoformat()
217
+ elif key in ("last_week", "this_week", "week"):
218
+ start = now - datetime.timedelta(days=7)
219
+ return start.isoformat(), now.isoformat()
220
+ elif key in ("last_month", "this_month", "month"):
221
+ start = now - datetime.timedelta(days=30)
222
+ return start.isoformat(), now.isoformat()
223
+ else:
224
+ try:
225
+ from dateutil import parser as date_parser
226
+ dt = date_parser.parse(time_filter)
227
+ start = dt.replace(hour=0, minute=0, second=0)
228
+ end = dt.replace(hour=23, minute=59, second=59)
229
+ return start.isoformat(), end.isoformat()
230
+ except Exception:
231
+ return None
232
+
233
+
234
+ def _compute_recency_score(timestamp_str: str) -> float:
235
+ """Compute a 0-1 recency score with ~30 day half-life."""
236
+ if not timestamp_str or timestamp_str == "Unknown":
237
+ return 0.0
238
+ try:
239
+ from dateutil import parser as date_parser
240
+ ts = date_parser.parse(timestamp_str)
241
+ now = datetime.datetime.now()
242
+ age_days = max(0, (now - ts).total_seconds() / 86400)
243
+ return 2.0 ** (-age_days / 30.0)
244
+ except Exception:
245
+ return 0.0
246
+
247
+
248
+ # ───────────────────────────────────────────────��─────
249
+ # MCP Tools
250
+ # ─────────────────────────────────────────────────────
251
+
252
+ @mcp.tool()
253
+ def search_memory(
254
+ query: str,
255
+ top_k: int = DEFAULT_TOP_K,
256
+ time_filter: str = "",
257
+ tags: str = "",
258
+ ) -> str:
259
+ """
260
+ Search Saurab's personal AI memory (past chats, documents, notes, research).
261
+
262
+ Use this tool whenever you need context about Saurab's history,
263
+ past actions, architecture decisions, projects, or personal knowledge.
264
+
265
+ Args:
266
+ query: Semantic search string (e.g. 'sparse communication paper')
267
+ top_k: Maximum results to return (default 5)
268
+ time_filter: Optional time constraint: 'today', 'yesterday',
269
+ 'last_week', 'last_month', or an ISO date like '2026-04-10'
270
+ tags: Optional comma-separated tag filter (e.g. 'chatgpt,research')
271
+ """
272
+ try:
273
+ if _vec_db.count() == 0:
274
+ return (
275
+ "Memory store is empty. "
276
+ "Data has not been synced to the cloud yet."
277
+ )
278
+
279
+ # 1. Encode query
280
+ query_emb = _model.encode(
281
+ [query], show_progress_bar=False, convert_to_numpy=True
282
+ )
283
+
284
+ # 2. FAISS search
285
+ fetch_k = min(top_k * 3, _vec_db.count())
286
+ raw_results = _vec_db.search(query_emb, top_k=fetch_k)
287
+
288
+ if not raw_results:
289
+ return "No matching memories found."
290
+
291
+ # 3. Get metadata
292
+ candidate_ids = [r[0] for r in raw_results]
293
+ score_map = {r[0]: r[1] for r in raw_results}
294
+ metadata_list = _db.get_memories_by_ids(candidate_ids)
295
+
296
+ # 4. Apply filters
297
+ filtered = metadata_list
298
+
299
+ time_range = _resolve_time_filter(time_filter)
300
+ if time_range:
301
+ start_iso, end_iso = time_range
302
+ filtered = [
303
+ m for m in filtered
304
+ if m.get("timestamp", "") >= start_iso
305
+ and m.get("timestamp", "") <= end_iso
306
+ ]
307
+
308
+ if tags:
309
+ tag_set = {t.strip().lower() for t in tags.split(",")}
310
+ filtered = [
311
+ m for m in filtered
312
+ if any(t in m.get("tags", "").lower() for t in tag_set)
313
+ ]
314
+
315
+ if not filtered:
316
+ return "No memories matched your filters."
317
+
318
+ # 5. Hybrid scoring
319
+ scored = []
320
+ for meta in filtered:
321
+ mem_id = meta["id"]
322
+ semantic = score_map.get(mem_id, 0.0)
323
+ recency = _compute_recency_score(meta.get("timestamp", ""))
324
+ hybrid = SEMANTIC_WEIGHT * semantic + RECENCY_WEIGHT * recency
325
+ scored.append((meta, hybrid, semantic))
326
+
327
+ scored.sort(key=lambda x: x[1], reverse=True)
328
+
329
+ # 6. Format output
330
+ results = scored[:top_k]
331
+ output_lines = [f"=== FARADAY MEMORY ({len(results)} results) ===\n"]
332
+
333
+ for i, (meta, hybrid, semantic) in enumerate(results, 1):
334
+ output_lines.append(
335
+ f"--- Result {i} [Score: {hybrid:.3f}] ---\n"
336
+ f"Source: {meta.get('source', 'Unknown')}\n"
337
+ f"Date: {meta.get('timestamp', 'Unknown')}\n"
338
+ f"Tags: {meta.get('tags', '')}\n"
339
+ f"Semantic: {semantic:.3f} | Recency: "
340
+ f"{_compute_recency_score(meta.get('timestamp', '')):.3f}\n"
341
+ f"Content:\n{meta.get('text', '')}\n"
342
+ )
343
+
344
+ return "\n".join(output_lines)
345
+
346
+ except Exception as e:
347
+ import traceback
348
+ return f"Memory search failed: {e}\n{traceback.format_exc()}"
349
+
350
+
351
+ @mcp.tool()
352
+ def get_memory_stats() -> str:
353
+ """
354
+ Get diagnostic statistics about the AI memory store.
355
+ Shows total chunks, vector count, date range, and source count.
356
+ """
357
+ try:
358
+ stats = _db.get_stats()
359
+ vec_count = _vec_db.count()
360
+
361
+ return (
362
+ f"=== FARADAY CLOUD MEMORY STATS ===\n"
363
+ f"Total chunks: {stats.get('total', 0)}\n"
364
+ f"FAISS vectors: {vec_count}\n"
365
+ f"Unique sources: {stats.get('sources', 0)}\n"
366
+ f"Date range: {stats.get('earliest', 'N/A')} β†’ "
367
+ f"{stats.get('latest', 'N/A')}\n"
368
+ f"Deployment: Google Cloud Run\n"
369
+ f"Data source: Supabase Storage\n"
370
+ )
371
+ except Exception as e:
372
+ return f"Stats failed: {e}"
373
+
374
+
375
+ @mcp.tool()
376
+ def sync_memory() -> str:
377
+ """
378
+ Re-pull the latest data from Supabase Storage.
379
+ Use this after running `python sync.py push` on your laptop
380
+ to refresh the cloud server with the latest memory data.
381
+ """
382
+ try:
383
+ def _run_refresh():
384
+ global _db, _vec_db
385
+ try:
386
+ success = pull_from_supabase()
387
+ if success:
388
+ # Reload database and vector index
389
+ _db = MemoryDB(db_path=CLOUD_DB_PATH, readonly=True)
390
+ _vec_db = VectorDB(index_path=str(CLOUD_INDEX_PATH))
391
+ print(
392
+ f"[CLOUD] Refreshed: {_vec_db.count()} vectors, "
393
+ f"{_db.count()} memories.",
394
+ file=sys.stderr,
395
+ )
396
+ except Exception as e:
397
+ print(f"[CLOUD] Refresh error: {e}", file=sys.stderr)
398
+
399
+ threading.Thread(target=_run_refresh, daemon=True).start()
400
+
401
+ return (
402
+ "βœ… Cloud data refresh started. "
403
+ "Pulling latest memory.db and memory.index from Supabase. "
404
+ "Run get_memory_stats() in a moment to verify."
405
+ )
406
+ except Exception as e:
407
+ return f"❌ Refresh failed: {e}"
408
+
409
+
410
+ # ─────────────────────────────────────────────────────
411
+ # Health Check Endpoint (for Cloud Run)
412
+ # ─────────────────────────────────────────────────────
413
+
414
+ @mcp.resource("health://status")
415
+ def health_check() -> str:
416
+ """Health check for Cloud Run."""
417
+ return f"OK: {_vec_db.count()} vectors loaded"
418
+
419
+
420
+ # ─────────────────────────────────────────────────────
421
+ # Entry Point
422
+ # ─────────────────────────────────────────────────────
423
+
424
+ if __name__ == "__main__":
425
+ print(f"[CLOUD] Starting SSE server on port {PORT}...", file=sys.stderr)
426
+ mcp.run(transport="sse")
mcp_server/main.py ADDED
@@ -0,0 +1,307 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ mcp_server.main β€” Production MCP server for Personal AI Memory.
3
+
4
+ Exposes three tools via the Model Context Protocol (stdio transport):
5
+ 1. search_memory β€” Hybrid semantic + recency search
6
+ 2. get_memory_stats β€” Diagnostic info about the memory store
7
+ 3. sync_memory β€” Trigger background ingestion
8
+
9
+ Design:
10
+ - Model, FAISS index, and SQLite are loaded ONCE at startup
11
+ - Queries run in <100ms (encode + FAISS search + SQLite lookup)
12
+ - No writes during search β€” fully non-blocking for concurrent reads
13
+ - Hybrid scoring: Ξ± Γ— semantic_similarity + (1-Ξ±) Γ— recency_score
14
+ """
15
+
16
+ import datetime
17
+ import os
18
+ import sys
19
+ import threading
20
+ from pathlib import Path
21
+
22
+ # Silence Huggingface/SentenceTransformer logs that break JSON-RPC stdio
23
+ os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
24
+ os.environ["TOKENIZERS_PARALLELISM"] = "false"
25
+ os.environ["TRANSFORMERS_VERBOSITY"] = "error"
26
+ os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
27
+
28
+ import logging
29
+
30
+ logging.getLogger("sentence_transformers").setLevel(logging.ERROR)
31
+ logging.getLogger("transformers").setLevel(logging.ERROR)
32
+
33
+ # Fix imports: add project root to path
34
+ PROJECT_ROOT = str(Path(__file__).parent.parent)
35
+ sys.path.insert(0, PROJECT_ROOT)
36
+
37
+ from mcp.server.fastmcp import FastMCP
38
+
39
+ from config import (
40
+ DEFAULT_TOP_K,
41
+ EMBEDDING_MODEL,
42
+ RECENCY_WEIGHT,
43
+ SEMANTIC_WEIGHT,
44
+ )
45
+ from database.faiss_db import VectorDB
46
+ from database.sqlite_db import MemoryDB
47
+
48
+ # ─────────────────────────────────────────────────────
49
+ # Server Initialization
50
+ # ─────────────────────────────────────────────────────
51
+
52
+ mcp = FastMCP("Faraday-AI-Memory")
53
+
54
+ # Eager load: everything initialized once at startup
55
+ print("Loading AI Memory services...", file=sys.stderr)
56
+
57
+ _db = MemoryDB(readonly=True)
58
+ _vec_db = VectorDB()
59
+
60
+ from sentence_transformers import SentenceTransformer
61
+
62
+ _model = SentenceTransformer(EMBEDDING_MODEL)
63
+
64
+ print(
65
+ f"Ready: {_vec_db.count()} vectors, {_db.count()} memories loaded.",
66
+ file=sys.stderr,
67
+ )
68
+
69
+
70
+ # ─────────────────────────────────────────────────────
71
+ # Time Filter Helpers
72
+ # ─────────────────────────────────────────────────────
73
+
74
+ def _resolve_time_filter(time_filter: str):
75
+ """
76
+ Convert human-readable time filter to (start_iso, end_iso) tuple.
77
+ Supports: "today", "yesterday", "last_week", "last_month",
78
+ or an ISO date string like "2026-04-10".
79
+ Returns None if no filter.
80
+ """
81
+ if not time_filter or time_filter.lower() == "none":
82
+ return None
83
+
84
+ now = datetime.datetime.now()
85
+ key = time_filter.lower().strip()
86
+
87
+ if key == "today":
88
+ start = now.replace(hour=0, minute=0, second=0, microsecond=0)
89
+ return start.isoformat(), now.isoformat()
90
+ elif key == "yesterday":
91
+ yesterday = now - datetime.timedelta(days=1)
92
+ start = yesterday.replace(hour=0, minute=0, second=0, microsecond=0)
93
+ end = yesterday.replace(hour=23, minute=59, second=59)
94
+ return start.isoformat(), end.isoformat()
95
+ elif key in ("last_week", "this_week", "week"):
96
+ start = now - datetime.timedelta(days=7)
97
+ return start.isoformat(), now.isoformat()
98
+ elif key in ("last_month", "this_month", "month"):
99
+ start = now - datetime.timedelta(days=30)
100
+ return start.isoformat(), now.isoformat()
101
+ else:
102
+ # Try parsing as ISO date
103
+ try:
104
+ from dateutil import parser as date_parser
105
+
106
+ dt = date_parser.parse(time_filter)
107
+ start = dt.replace(hour=0, minute=0, second=0)
108
+ end = dt.replace(hour=23, minute=59, second=59)
109
+ return start.isoformat(), end.isoformat()
110
+ except Exception:
111
+ return None
112
+
113
+
114
+ def _compute_recency_score(timestamp_str: str) -> float:
115
+ """
116
+ Compute a 0-1 recency score.
117
+ Recent items score higher. Uses exponential decay with ~30 day half-life.
118
+ """
119
+ if not timestamp_str or timestamp_str == "Unknown":
120
+ return 0.0
121
+
122
+ try:
123
+ from dateutil import parser as date_parser
124
+
125
+ ts = date_parser.parse(timestamp_str)
126
+ now = datetime.datetime.now()
127
+ age_days = max(0, (now - ts).total_seconds() / 86400)
128
+ # Exponential decay: half-life of 30 days
129
+ return 2.0 ** (-age_days / 30.0)
130
+ except Exception:
131
+ return 0.0
132
+
133
+
134
+ # ─────────────────────────────────────────────────────
135
+ # MCP Tools
136
+ # ───────────────────────────────────────���─────────────
137
+
138
+
139
+ @mcp.tool()
140
+ def search_memory(
141
+ query: str,
142
+ top_k: int = DEFAULT_TOP_K,
143
+ time_filter: str = "",
144
+ tags: str = "",
145
+ ) -> str:
146
+ """
147
+ Search Saurab's personal AI memory (past chats, documents, notes, research).
148
+
149
+ Use this tool whenever you need context about Saurab's history,
150
+ past actions, architecture decisions, projects, or personal knowledge.
151
+
152
+ Args:
153
+ query: Semantic search string (e.g. 'sparse communication paper')
154
+ top_k: Maximum results to return (default 5)
155
+ time_filter: Optional time constraint: 'today', 'yesterday',
156
+ 'last_week', 'last_month', or an ISO date like '2026-04-10'
157
+ tags: Optional comma-separated tag filter (e.g. 'chatgpt,research')
158
+ """
159
+ try:
160
+ if _vec_db.count() == 0:
161
+ return (
162
+ "Memory store is empty. "
163
+ "Run `python update.py` to synchronize your data."
164
+ )
165
+
166
+ # 1. Encode query (single vector, ~5ms on CPU)
167
+ query_emb = _model.encode(
168
+ [query], show_progress_bar=False, convert_to_numpy=True
169
+ )
170
+
171
+ # 2. FAISS search β€” get candidate IDs with semantic scores
172
+ # Fetch extra candidates for re-ranking
173
+ fetch_k = min(top_k * 3, _vec_db.count())
174
+ raw_results = _vec_db.search(query_emb, top_k=fetch_k)
175
+
176
+ if not raw_results:
177
+ return "No matching memories found."
178
+
179
+ # 3. Get metadata for all candidates
180
+ candidate_ids = [r[0] for r in raw_results]
181
+ score_map = {r[0]: r[1] for r in raw_results}
182
+ metadata_list = _db.get_memories_by_ids(candidate_ids)
183
+
184
+ # 4. Apply filters
185
+ filtered = metadata_list
186
+
187
+ # Time filter
188
+ time_range = _resolve_time_filter(time_filter)
189
+ if time_range:
190
+ start_iso, end_iso = time_range
191
+ filtered = [
192
+ m
193
+ for m in filtered
194
+ if m.get("timestamp", "") >= start_iso
195
+ and m.get("timestamp", "") <= end_iso
196
+ ]
197
+
198
+ # Tag filter
199
+ if tags:
200
+ tag_set = {t.strip().lower() for t in tags.split(",")}
201
+ filtered = [
202
+ m
203
+ for m in filtered
204
+ if any(
205
+ t in m.get("tags", "").lower() for t in tag_set
206
+ )
207
+ ]
208
+
209
+ if not filtered:
210
+ return "No memories matched your filters."
211
+
212
+ # 5. Hybrid scoring: semantic + recency
213
+ scored = []
214
+ for meta in filtered:
215
+ mem_id = meta["id"]
216
+ semantic = score_map.get(mem_id, 0.0)
217
+ recency = _compute_recency_score(meta.get("timestamp", ""))
218
+ hybrid = SEMANTIC_WEIGHT * semantic + RECENCY_WEIGHT * recency
219
+ scored.append((meta, hybrid, semantic))
220
+
221
+ # Sort by hybrid score (descending)
222
+ scored.sort(key=lambda x: x[1], reverse=True)
223
+
224
+ # 6. Format output
225
+ results = scored[:top_k]
226
+ output_lines = [f"=== FARADAY MEMORY ({len(results)} results) ===\n"]
227
+
228
+ for i, (meta, hybrid, semantic) in enumerate(results, 1):
229
+ output_lines.append(
230
+ f"--- Result {i} [Score: {hybrid:.3f}] ---\n"
231
+ f"Source: {meta.get('source', 'Unknown')}\n"
232
+ f"Date: {meta.get('timestamp', 'Unknown')}\n"
233
+ f"Tags: {meta.get('tags', '')}\n"
234
+ f"Semantic: {semantic:.3f} | Recency: "
235
+ f"{_compute_recency_score(meta.get('timestamp', '')):.3f}\n"
236
+ f"Content:\n{meta.get('text', '')}\n"
237
+ )
238
+
239
+ return "\n".join(output_lines)
240
+
241
+ except Exception as e:
242
+ import traceback
243
+
244
+ return f"Memory search failed: {e}\n{traceback.format_exc()}"
245
+
246
+
247
+ @mcp.tool()
248
+ def get_memory_stats() -> str:
249
+ """
250
+ Get diagnostic statistics about the AI memory store.
251
+ Shows total chunks, vector count, date range, and source count.
252
+ """
253
+ try:
254
+ stats = _db.get_stats()
255
+ vec_count = _vec_db.count()
256
+
257
+ return (
258
+ f"=== FARADAY MEMORY STATS ===\n"
259
+ f"Total chunks: {stats.get('total', 0)}\n"
260
+ f"FAISS vectors: {vec_count}\n"
261
+ f"Unique sources: {stats.get('sources', 0)}\n"
262
+ f"Date range: {stats.get('earliest', 'N/A')} β†’ "
263
+ f"{stats.get('latest', 'N/A')}\n"
264
+ )
265
+ except Exception as e:
266
+ return f"Stats failed: {e}"
267
+
268
+
269
+ @mcp.tool()
270
+ def sync_memory() -> str:
271
+ """
272
+ Trigger an incremental memory sync in the background.
273
+ This scans all data directories, processes new documents,
274
+ and updates the FAISS index. Safe to call repeatedly.
275
+ """
276
+ try:
277
+
278
+ def _run_update():
279
+ try:
280
+ import subprocess
281
+
282
+ subprocess.run(
283
+ [sys.executable, str(Path(PROJECT_ROOT) / "update.py")],
284
+ cwd=PROJECT_ROOT,
285
+ capture_output=True,
286
+ text=True,
287
+ )
288
+ except Exception as e:
289
+ print(f"Background sync error: {e}", file=sys.stderr)
290
+
291
+ threading.Thread(target=_run_update, daemon=True).start()
292
+
293
+ return (
294
+ "βœ… Memory sync started in background. "
295
+ "New data will be available after processing completes. "
296
+ "Run get_memory_stats() to verify."
297
+ )
298
+ except Exception as e:
299
+ return f"❌ Sync failed to start: {e}"
300
+
301
+
302
+ # ─────────────────────────────────────────────────────
303
+ # Entry Point
304
+ # ─────────────────────────────────────────────────────
305
+
306
+ if __name__ == "__main__":
307
+ mcp.run(transport="stdio")
processing/__init__.py ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ """
2
+ processing β€” Text cleaning and chunking pipeline.
3
+ """
processing/chunker.py ADDED
@@ -0,0 +1,84 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ processing.chunker β€” Paragraph-aware text chunker with overlap.
3
+
4
+ Splits text into ~300-word chunks, preferring paragraph boundaries.
5
+ Falls back to word-level splitting for very long paragraphs.
6
+ Maintains configurable word overlap between adjacent chunks
7
+ to preserve context continuity for embeddings.
8
+ """
9
+
10
+ import re
11
+ from typing import List
12
+
13
+
14
+ def chunk_text(
15
+ text: str,
16
+ max_words: int = 300,
17
+ overlap: int = 50,
18
+ min_chunk_words: int = 15,
19
+ ) -> List[str]:
20
+ """
21
+ Split text into chunks of approximately `max_words`, aligned
22
+ to paragraph boundaries where possible.
23
+
24
+ Args:
25
+ text: Input text to chunk.
26
+ max_words: Target maximum words per chunk.
27
+ overlap: Words of overlap between consecutive chunks.
28
+ min_chunk_words: Discard trailing chunks smaller than this.
29
+
30
+ Returns:
31
+ List of text chunks.
32
+ """
33
+ if not text or not text.strip():
34
+ return []
35
+
36
+ chunks: List[str] = []
37
+
38
+ # Step 1: Split into paragraphs (double-newline separated)
39
+ paragraphs = re.split(r"\n\n+", text)
40
+
41
+ current_words: List[str] = []
42
+ current_len = 0
43
+
44
+ for para in paragraphs:
45
+ words = para.split()
46
+ if not words:
47
+ continue
48
+
49
+ # If this paragraph fits in the current chunk, append it
50
+ if current_len + len(words) <= max_words:
51
+ current_words.extend(words)
52
+ current_len += len(words)
53
+ else:
54
+ # Flush current chunk if it has content
55
+ if current_words:
56
+ chunks.append(" ".join(current_words))
57
+ # Retain overlap from end of current chunk
58
+ if overlap > 0:
59
+ current_words = current_words[-overlap:]
60
+ current_len = len(current_words)
61
+ else:
62
+ current_words = []
63
+ current_len = 0
64
+
65
+ # If paragraph itself exceeds max_words, split strictly
66
+ if len(words) > max_words:
67
+ i = 0
68
+ while i < len(words):
69
+ sub = words[i : i + max_words]
70
+ chunks.append(" ".join(sub))
71
+ step = max(1, max_words - overlap)
72
+ i += step
73
+ # Reset after processing oversized paragraph
74
+ current_words = []
75
+ current_len = 0
76
+ else:
77
+ current_words.extend(words)
78
+ current_len += len(words)
79
+
80
+ # Flush remaining words if they form a meaningful chunk
81
+ if current_words and len(current_words) >= min_chunk_words:
82
+ chunks.append(" ".join(current_words))
83
+
84
+ return chunks
processing/cleaner.py ADDED
@@ -0,0 +1,60 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ processing.cleaner β€” Text normalization and deduplication hashing.
3
+
4
+ Handles:
5
+ - Unicode normalization (NFC)
6
+ - HTML tag stripping
7
+ - Control character removal
8
+ - Excessive whitespace collapse
9
+ - Minimum length filtering
10
+ - SHA-256 content hashing for deduplication
11
+ """
12
+
13
+ import hashlib
14
+ import re
15
+ import unicodedata
16
+ from typing import Optional
17
+
18
+
19
+ def clean_text(text: str) -> Optional[str]:
20
+ """
21
+ Normalize and clean text for embedding.
22
+
23
+ Returns cleaned text, or None if the result is too short
24
+ to be meaningful (< 30 characters after cleaning).
25
+ """
26
+ if not text:
27
+ return None
28
+
29
+ # 1. Unicode normalize to NFC (compose characters)
30
+ text = unicodedata.normalize("NFC", text)
31
+
32
+ # 2. Strip residual HTML tags (from Gemini exports, etc.)
33
+ text = re.sub(r"<[^>]+>", " ", text)
34
+
35
+ # 3. Remove control characters (except newlines and tabs)
36
+ text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]", "", text)
37
+
38
+ # 4. Collapse excessive newlines (3+ β†’ 2)
39
+ text = re.sub(r"\n{3,}", "\n\n", text)
40
+
41
+ # 5. Collapse excessive spaces (3+ β†’ 2)
42
+ text = re.sub(r" {3,}", " ", text)
43
+
44
+ # 6. Strip leading/trailing whitespace
45
+ text = text.strip()
46
+
47
+ # 7. Minimum length gate
48
+ if len(text) < 30:
49
+ return None
50
+
51
+ return text
52
+
53
+
54
+ def compute_hash(content: str) -> str:
55
+ """
56
+ Generate a deterministic SHA-256 hash for content deduplication.
57
+ Two identical chunks will always produce the same hash,
58
+ preventing re-embedding on repeated updates.
59
+ """
60
+ return hashlib.sha256(content.encode("utf-8")).hexdigest()
requirements.txt ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Core
2
+ faiss-cpu>=1.7.4
3
+ sentence-transformers>=2.2.0
4
+ numpy>=1.24.0
5
+
6
+ # MCP Protocol
7
+ mcp[cli]>=1.0.0
8
+
9
+ # Ingestion β€” PDF
10
+ PyPDF2>=3.0.0
11
+
12
+ # Ingestion β€” OCR (optional, requires Tesseract installed)
13
+ Pillow>=10.0.0
14
+ pytesseract>=0.3.10
15
+
16
+ # Ingestion β€” Gemini HTML takeout
17
+ beautifulsoup4>=4.12.0
18
+ lxml>=4.9.0
19
+
20
+ # Time-aware queries
21
+ python-dateutil>=2.8.0
22
+
23
+ # Cloud sync
24
+ supabase>=2.0.0
25
+
26
+ # Cloud deployment (SSE transport)
27
+ starlette>=0.27.0
28
+ uvicorn>=0.23.0
29
+ httpx>=0.24.0
sync.py ADDED
@@ -0,0 +1,254 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ sync.py β€” Cloud sync for Faraday AI Memory.
3
+
4
+ Pushes the local SQLite database and FAISS index to Supabase Storage,
5
+ enabling the cloud-hosted MCP server on Cloud Run to serve queries
6
+ from the same data.
7
+
8
+ Uses httpx directly (no heavy Supabase SDK) for maximum compatibility.
9
+
10
+ Usage:
11
+ python sync.py push # Upload local DB + index to cloud
12
+ python sync.py pull # Download cloud DB + index to local
13
+ python sync.py status # Check what's in the cloud bucket
14
+
15
+ Requires SUPABASE_URL and SUPABASE_KEY in config.py or env vars.
16
+ """
17
+
18
+ import os
19
+ import sys
20
+ from pathlib import Path
21
+
22
+ # Ensure project root is on path
23
+ sys.path.insert(0, str(Path(__file__).parent.resolve()))
24
+
25
+ from config import (
26
+ FAISS_INDEX_PATH,
27
+ SQLITE_DB_PATH,
28
+ SUPABASE_BUCKET,
29
+ SUPABASE_KEY,
30
+ SUPABASE_URL,
31
+ )
32
+
33
+
34
+ def _check_credentials():
35
+ """Validate Supabase credentials are available."""
36
+ if not SUPABASE_URL or not SUPABASE_KEY:
37
+ print(
38
+ "❌ Error: Supabase credentials not configured.\n"
39
+ "Set SUPABASE_URL and SUPABASE_KEY in config.py or environment.\n"
40
+ "\n"
41
+ "Your Supabase project URL looks like:\n"
42
+ " https://qwxagrmoryojholseclm.supabase.co\n"
43
+ "\n"
44
+ "Get your anon key from:\n"
45
+ " Supabase Dashboard β†’ Settings β†’ API β†’ anon public key",
46
+ file=sys.stderr,
47
+ )
48
+ sys.exit(1)
49
+
50
+
51
+ def _headers():
52
+ """Common headers for Supabase Storage API."""
53
+ return {
54
+ "apikey": SUPABASE_KEY,
55
+ "Authorization": f"Bearer {SUPABASE_KEY}",
56
+ }
57
+
58
+
59
+ def _ensure_bucket():
60
+ """Create the storage bucket if it doesn't exist."""
61
+ import httpx
62
+
63
+ # Check if bucket exists
64
+ r = httpx.get(
65
+ f"{SUPABASE_URL}/storage/v1/bucket/{SUPABASE_BUCKET}",
66
+ headers=_headers(),
67
+ timeout=30,
68
+ )
69
+
70
+ if r.status_code == 200:
71
+ print(f" Bucket '{SUPABASE_BUCKET}' exists.", file=sys.stderr)
72
+ return True
73
+
74
+ # Create bucket
75
+ print(f" Creating bucket '{SUPABASE_BUCKET}'...", file=sys.stderr)
76
+ r = httpx.post(
77
+ f"{SUPABASE_URL}/storage/v1/bucket",
78
+ headers={**_headers(), "Content-Type": "application/json"},
79
+ json={
80
+ "id": SUPABASE_BUCKET,
81
+ "name": SUPABASE_BUCKET,
82
+ "public": False,
83
+ "file_size_limit": 200 * 1024 * 1024, # 200 MB
84
+ },
85
+ timeout=30,
86
+ )
87
+
88
+ if r.status_code in (200, 201):
89
+ print(f" βœ… Bucket created.", file=sys.stderr)
90
+ return True
91
+ else:
92
+ print(
93
+ f" ⚠️ Bucket creation response ({r.status_code}): {r.text}",
94
+ file=sys.stderr,
95
+ )
96
+ # May already exist, try anyway
97
+ return True
98
+
99
+
100
+ def push():
101
+ """Upload local database and FAISS index (compressed) to Supabase Storage."""
102
+ import httpx
103
+ import gzip
104
+
105
+ _check_credentials()
106
+ _ensure_bucket()
107
+
108
+ files_to_upload = {
109
+ "memory.db": SQLITE_DB_PATH,
110
+ "memory.index": FAISS_INDEX_PATH,
111
+ }
112
+
113
+ for remote_name, local_path in files_to_upload.items():
114
+ path = Path(local_path)
115
+ if not path.exists():
116
+ print(f" [SKIP] {remote_name} β€” local file not found at {path}",
117
+ file=sys.stderr)
118
+ continue
119
+
120
+ raw_size_mb = path.stat().st_size / (1024 * 1024)
121
+
122
+ # We always compress before uploading
123
+ compressed_name = f"{remote_name}.gz"
124
+ print(f" Compressing {remote_name} ({raw_size_mb:.1f} MB)...", file=sys.stderr)
125
+
126
+ with open(path, "rb") as f_in:
127
+ file_content = gzip.compress(f_in.read())
128
+
129
+ comp_size_mb = len(file_content) / (1024 * 1024)
130
+ print(f" Uploading {compressed_name} ({comp_size_mb:.1f} MB)...", file=sys.stderr)
131
+
132
+ # Try to remove existing file first (upsert)
133
+ httpx.delete(
134
+ f"{SUPABASE_URL}/storage/v1/object/{SUPABASE_BUCKET}/{compressed_name}",
135
+ headers=_headers(),
136
+ timeout=30,
137
+ )
138
+
139
+ # Upload compressed file
140
+ r = httpx.post(
141
+ f"{SUPABASE_URL}/storage/v1/object/{SUPABASE_BUCKET}/{compressed_name}",
142
+ headers={
143
+ **_headers(),
144
+ "Content-Type": "application/gzip",
145
+ "x-upsert": "true",
146
+ },
147
+ content=file_content,
148
+ timeout=120,
149
+ )
150
+
151
+ if r.status_code in (200, 201):
152
+ print(f" βœ… {compressed_name} uploaded successfully.", file=sys.stderr)
153
+ else:
154
+ print(
155
+ f" ❌ {compressed_name} upload failed ({r.status_code}): {r.text}",
156
+ file=sys.stderr,
157
+ )
158
+
159
+ print("\nβœ… Cloud sync (push) complete.", file=sys.stderr)
160
+
161
+
162
+ def pull():
163
+ """Download database and FAISS index from Supabase Storage."""
164
+ import httpx
165
+
166
+ _check_credentials()
167
+
168
+ files_to_download = {
169
+ "memory.db": SQLITE_DB_PATH,
170
+ "memory.index": FAISS_INDEX_PATH,
171
+ }
172
+
173
+ for remote_name, local_path in files_to_download.items():
174
+ print(f" Downloading {remote_name}...", file=sys.stderr)
175
+
176
+ try:
177
+ r = httpx.get(
178
+ f"{SUPABASE_URL}/storage/v1/object/{SUPABASE_BUCKET}/{remote_name}",
179
+ headers=_headers(),
180
+ timeout=120,
181
+ )
182
+
183
+ if r.status_code != 200:
184
+ print(
185
+ f" ❌ {remote_name} download failed ({r.status_code}): {r.text}",
186
+ file=sys.stderr,
187
+ )
188
+ continue
189
+
190
+ # Ensure parent directory exists
191
+ Path(local_path).parent.mkdir(parents=True, exist_ok=True)
192
+
193
+ with open(local_path, "wb") as f:
194
+ f.write(r.content)
195
+
196
+ size_mb = len(r.content) / (1024 * 1024)
197
+ print(
198
+ f" βœ… {remote_name} downloaded ({size_mb:.1f} MB).",
199
+ file=sys.stderr,
200
+ )
201
+ except Exception as e:
202
+ print(f" ❌ {remote_name}: {e}", file=sys.stderr)
203
+
204
+ print("\nβœ… Cloud sync (pull) complete.", file=sys.stderr)
205
+
206
+
207
+ def status():
208
+ """Check what files exist in the Supabase Storage bucket."""
209
+ import httpx
210
+
211
+ _check_credentials()
212
+
213
+ print(f"\nπŸ“¦ Checking bucket '{SUPABASE_BUCKET}'...\n", file=sys.stderr)
214
+
215
+ r = httpx.post(
216
+ f"{SUPABASE_URL}/storage/v1/object/list/{SUPABASE_BUCKET}",
217
+ headers={**_headers(), "Content-Type": "application/json"},
218
+ json={"prefix": "", "limit": 100},
219
+ timeout=30,
220
+ )
221
+
222
+ if r.status_code != 200:
223
+ print(f" ❌ Failed ({r.status_code}): {r.text}", file=sys.stderr)
224
+ return
225
+
226
+ files = r.json()
227
+ if not files:
228
+ print(" πŸ“­ Bucket is empty. Run 'python sync.py push' first.",
229
+ file=sys.stderr)
230
+ return
231
+
232
+ print(f" {'File':<20} {'Size':>10} {'Last Modified'}", file=sys.stderr)
233
+ print(f" {'─'*20} {'─'*10} {'─'*20}", file=sys.stderr)
234
+
235
+ for f in files:
236
+ name = f.get("name", "?")
237
+ size = f.get("metadata", {}).get("size", 0)
238
+ size_str = f"{size / (1024*1024):.1f} MB" if size else "?"
239
+ updated = f.get("updated_at", "?")[:19]
240
+ print(f" {name:<20} {size_str:>10} {updated}", file=sys.stderr)
241
+
242
+
243
+ if __name__ == "__main__":
244
+ if len(sys.argv) < 2 or sys.argv[1] not in ("push", "pull", "status"):
245
+ print("Usage: python sync.py [push|pull|status]")
246
+ sys.exit(1)
247
+
248
+ action = sys.argv[1]
249
+ if action == "push":
250
+ push()
251
+ elif action == "pull":
252
+ pull()
253
+ elif action == "status":
254
+ status()
test_search.py ADDED
@@ -0,0 +1,53 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Quick smoke test for the search pipeline."""
2
+ import sys
3
+ import os
4
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
5
+
6
+ os.environ['HF_HUB_DISABLE_PROGRESS_BARS'] = '1'
7
+ os.environ['TOKENIZERS_PARALLELISM'] = 'false'
8
+ os.environ['TRANSFORMERS_VERBOSITY'] = 'error'
9
+
10
+ import logging
11
+ logging.getLogger('sentence_transformers').setLevel(logging.ERROR)
12
+ logging.getLogger('transformers').setLevel(logging.ERROR)
13
+
14
+ from config import EMBEDDING_MODEL
15
+ from database.faiss_db import VectorDB
16
+ from database.sqlite_db import MemoryDB
17
+ from sentence_transformers import SentenceTransformer
18
+ import time
19
+
20
+ db = MemoryDB(readonly=True)
21
+ vec_db = VectorDB()
22
+ model = SentenceTransformer(EMBEDDING_MODEL)
23
+
24
+ print(f"DB: {db.count()} rows | FAISS: {vec_db.count()} vectors")
25
+
26
+ # Test search with timing
27
+ query = "What projects is Saurab working on?"
28
+ t0 = time.perf_counter()
29
+ query_emb = model.encode([query], show_progress_bar=False, convert_to_numpy=True)
30
+ t_encode = time.perf_counter() - t0
31
+
32
+ t0 = time.perf_counter()
33
+ results = vec_db.search(query_emb, top_k=3)
34
+ t_search = time.perf_counter() - t0
35
+
36
+ print(f"\nQuery: '{query}'")
37
+ print(f"Encode: {t_encode*1000:.1f}ms | Search: {t_search*1000:.1f}ms")
38
+ print(f"Results: {len(results)}")
39
+
40
+ ids = [r[0] for r in results]
41
+ metas = db.get_memories_by_ids(ids)
42
+ for i, m in enumerate(metas):
43
+ score = results[i][1]
44
+ source = m.get("source", "?")
45
+ text_preview = m.get("text", "")[:120].replace("\n", " ")
46
+ print(f" [{i+1}] Score={score:.3f} | {source}")
47
+ print(f" {text_preview}...")
48
+
49
+ # Stats
50
+ stats = db.get_stats()
51
+ print(f"\nStats: {stats}")
52
+ db.close()
53
+ print("\nAll tests passed!")
update.py ADDED
@@ -0,0 +1,275 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ update.py β€” Incremental ingestion pipeline.
3
+
4
+ This is the main entry point for populating the AI memory store.
5
+ Run it on your laptop whenever you have new data.
6
+
7
+ Pipeline:
8
+ 1. Scan all configured directories for files
9
+ 2. Hash each file to detect changes (skip unchanged files)
10
+ 3. Parse files into documents via the ingestion router
11
+ 4. Clean and normalize text
12
+ 5. Chunk text into ~300-word segments
13
+ 6. Hash each chunk to deduplicate against existing DB
14
+ 7. Batch-encode new chunks via SentenceTransformer
15
+ 8. Insert into SQLite + FAISS
16
+ 9. Single FAISS write at end of pipeline
17
+ 10. Optional IVF rebuild if threshold crossed
18
+
19
+ Safe to run repeatedly β€” fully idempotent via hash-based deduplication.
20
+ """
21
+
22
+ import hashlib
23
+ import os
24
+ import sys
25
+ import time
26
+ from pathlib import Path
27
+
28
+ import numpy as np
29
+
30
+ # Ensure project root is on path
31
+ PROJECT_ROOT = Path(__file__).parent.resolve()
32
+ sys.path.insert(0, str(PROJECT_ROOT))
33
+
34
+ from config import (
35
+ BATCH_SIZE,
36
+ CHUNK_MAX_WORDS,
37
+ CHUNK_OVERLAP_WORDS,
38
+ DATA_RAW,
39
+ EMBEDDING_MODEL,
40
+ OBSIDIAN_SCAN_DIRS,
41
+ SKIP_PATTERNS,
42
+ )
43
+ from database.faiss_db import VectorDB
44
+ from database.sqlite_db import MemoryDB
45
+ from ingestion import process_file
46
+
47
+ # Maximum file size to process (skip very large files like raw CSV dumps)
48
+ MAX_FILE_SIZE_MB = 15
49
+ from processing.chunker import chunk_text
50
+ from processing.cleaner import clean_text, compute_hash
51
+
52
+
53
+ def _should_skip(filepath: Path) -> bool:
54
+ """Check if a file path matches any skip pattern."""
55
+ parts = filepath.parts
56
+ for pattern in SKIP_PATTERNS:
57
+ if any(pattern in p for p in parts):
58
+ return True
59
+ return False
60
+
61
+
62
+ def _hash_file(filepath: Path) -> str:
63
+ """Compute SHA-256 of file contents for change detection."""
64
+ h = hashlib.sha256()
65
+ try:
66
+ with open(filepath, "rb") as f:
67
+ while True:
68
+ chunk = f.read(65536) # 64 KB chunks
69
+ if not chunk:
70
+ break
71
+ h.update(chunk)
72
+ except (OSError, PermissionError):
73
+ return ""
74
+ return h.hexdigest()
75
+
76
+
77
+ def _collect_files() -> list:
78
+ """
79
+ Gather all candidate files from data_raw/ and Obsidian scan directories.
80
+ Deduplicates by absolute path.
81
+ """
82
+ seen_paths = set()
83
+ files = []
84
+
85
+ # Source 1: data_raw/ (user-dropped exports)
86
+ scan_roots = [DATA_RAW]
87
+
88
+ # Source 2: Obsidian vault directories
89
+ for d in OBSIDIAN_SCAN_DIRS:
90
+ if d.exists():
91
+ scan_roots.append(d)
92
+
93
+ for root_dir in scan_roots:
94
+ if not root_dir.exists():
95
+ continue
96
+ for filepath in root_dir.rglob("*"):
97
+ if not filepath.is_file():
98
+ continue
99
+ if _should_skip(filepath):
100
+ continue
101
+ # Skip files larger than threshold
102
+ try:
103
+ size_mb = filepath.stat().st_size / (1024 * 1024)
104
+ if size_mb > MAX_FILE_SIZE_MB:
105
+ continue
106
+ except OSError:
107
+ continue
108
+
109
+ abs_path = filepath.resolve()
110
+ if abs_path in seen_paths:
111
+ continue
112
+ seen_paths.add(abs_path)
113
+ files.append(filepath)
114
+
115
+ return files
116
+
117
+
118
+ def run_update():
119
+ """Execute the full incremental update pipeline."""
120
+ t_start = time.time()
121
+ print(f"{'=' * 60}", file=sys.stderr)
122
+ print(f" AI Memory β€” Incremental Update Pipeline", file=sys.stderr)
123
+ print(f"{'=' * 60}", file=sys.stderr)
124
+
125
+ # ─────────────────────────────────────────────
126
+ # 1. Collect files
127
+ # ─────────────────────────────────────────────
128
+ files = _collect_files()
129
+ print(f"\n[1/6] Discovered {len(files)} candidate files.", file=sys.stderr)
130
+
131
+ if not files:
132
+ print("No files found. Nothing to do.", file=sys.stderr)
133
+ return
134
+
135
+ # ─────────────────────────────────────────────
136
+ # 2. Initialize storage
137
+ # ─────────────────────────────────────────────
138
+ db = MemoryDB()
139
+ vec_db = VectorDB()
140
+ existing_hashes = db.get_existing_hashes()
141
+ print(
142
+ f"[2/6] Database loaded: {len(existing_hashes)} existing chunks.",
143
+ file=sys.stderr,
144
+ )
145
+
146
+ # ─────────────────────────────────────────────
147
+ # 3. Parse β†’ Clean β†’ Chunk β†’ Deduplicate
148
+ # ─────────────────────────────────────────────
149
+ new_chunks = []
150
+ files_processed = 0
151
+ files_skipped = 0
152
+ docs_parsed = 0
153
+
154
+ for file_idx, filepath in enumerate(files, 1):
155
+ # Progress every 50 files
156
+ if file_idx % 50 == 0 or file_idx == 1:
157
+ print(
158
+ f" Parsing file {file_idx}/{len(files)}: {filepath.name}",
159
+ file=sys.stderr,
160
+ )
161
+ try:
162
+ for document in process_file(filepath):
163
+ docs_parsed += 1
164
+ text = clean_text(document.get("text", ""))
165
+ if not text:
166
+ continue
167
+
168
+ chunks = chunk_text(
169
+ text,
170
+ max_words=CHUNK_MAX_WORDS,
171
+ overlap=CHUNK_OVERLAP_WORDS,
172
+ )
173
+
174
+ for chunk_content in chunks:
175
+ chunk_hash = compute_hash(chunk_content)
176
+
177
+ if chunk_hash in existing_hashes:
178
+ continue # Already embedded β€” skip
179
+
180
+ # Register hash to prevent intra-run duplicates
181
+ existing_hashes.add(chunk_hash)
182
+
183
+ new_chunks.append(
184
+ {
185
+ "hash": chunk_hash,
186
+ "text": chunk_content,
187
+ "source": document.get("source", filepath.name),
188
+ "timestamp": document.get("timestamp", ""),
189
+ "tags": document.get("tags", ""),
190
+ }
191
+ )
192
+ files_processed += 1
193
+ except Exception as e:
194
+ print(
195
+ f" [WARN] Error processing {filepath.name}: {e}",
196
+ file=sys.stderr,
197
+ )
198
+ files_skipped += 1
199
+
200
+ print(
201
+ f"[3/6] Parsed {docs_parsed} documents from {files_processed} files "
202
+ f"({files_skipped} skipped). Found {len(new_chunks)} NEW unique chunks.",
203
+ file=sys.stderr,
204
+ )
205
+
206
+ if not new_chunks:
207
+ print(
208
+ "\nβœ… System is fully synced β€” no new data to process.",
209
+ file=sys.stderr,
210
+ )
211
+ db.close()
212
+ return
213
+
214
+ # ─────────────────────────────────────────────
215
+ # 4. Lazy-load SentenceTransformer
216
+ # ─────────────────────────────────────────────
217
+ print(f"[4/6] Loading embedding model ({EMBEDDING_MODEL})...", file=sys.stderr)
218
+ from sentence_transformers import SentenceTransformer
219
+
220
+ model = SentenceTransformer(EMBEDDING_MODEL)
221
+
222
+ # ─────────────────────────────────────────────
223
+ # 5. Batch encode + store
224
+ # ─────────────────────────────────────────────
225
+ total = len(new_chunks)
226
+ print(f"[5/6] Encoding & storing {total} chunks...", file=sys.stderr)
227
+
228
+ for batch_start in range(0, total, BATCH_SIZE):
229
+ batch = new_chunks[batch_start : batch_start + BATCH_SIZE]
230
+ texts = [b["text"] for b in batch]
231
+
232
+ # Batch encode
233
+ embeddings = model.encode(
234
+ texts,
235
+ batch_size=BATCH_SIZE,
236
+ show_progress_bar=False,
237
+ convert_to_numpy=True,
238
+ )
239
+
240
+ # Insert metadata into SQLite (returns mapped IDs)
241
+ mapped_ids = db.insert_memories(batch)
242
+
243
+ # Add to FAISS (deferred write)
244
+ vec_db.add_embeddings(embeddings, np.array(mapped_ids))
245
+
246
+ progress = min(batch_start + BATCH_SIZE, total)
247
+ print(
248
+ f" Processed {progress}/{total} chunks...",
249
+ file=sys.stderr,
250
+ )
251
+
252
+ # ─────────────────────────────────────────────
253
+ # 6. Finalize: single FAISS write + optional IVF rebuild
254
+ # ─────────────────────────────────────────────
255
+ print(f"[6/6] Saving FAISS index...", file=sys.stderr)
256
+ vec_db.maybe_rebuild_ivf()
257
+ vec_db.save()
258
+
259
+ elapsed = time.time() - t_start
260
+ print(
261
+ f"\n{'=' * 60}\n"
262
+ f" βœ… Update complete!\n"
263
+ f" New chunks: {total}\n"
264
+ f" Total in DB: {db.count()}\n"
265
+ f" FAISS index: {vec_db.count()} vectors\n"
266
+ f" Time: {elapsed:.1f}s\n"
267
+ f"{'=' * 60}",
268
+ file=sys.stderr,
269
+ )
270
+
271
+ db.close()
272
+
273
+
274
+ if __name__ == "__main__":
275
+ run_update()