Saurab Mishra commited on
Commit Β·
34dcea4
0
Parent(s):
Initial open source release
Browse files- .dockerignore +24 -0
- .gitignore +13 -0
- DATA_INGESTION.md +64 -0
- Dockerfile +49 -0
- README.md +76 -0
- config.py +110 -0
- database/__init__.py +3 -0
- database/faiss_db.py +203 -0
- database/sqlite_db.py +246 -0
- deploy.ps1 +135 -0
- ingestion/__init__.py +54 -0
- ingestion/chatgpt.py +88 -0
- ingestion/gemini.py +97 -0
- ingestion/image.py +102 -0
- ingestion/markdown.py +38 -0
- ingestion/pdf.py +48 -0
- mcp_server/__init__.py +3 -0
- mcp_server/cloud_server.py +426 -0
- mcp_server/main.py +307 -0
- processing/__init__.py +3 -0
- processing/chunker.py +84 -0
- processing/cleaner.py +60 -0
- requirements.txt +29 -0
- sync.py +254 -0
- test_search.py +53 -0
- update.py +275 -0
.dockerignore
ADDED
|
@@ -0,0 +1,24 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
.venv/
|
| 2 |
+
__pycache__/
|
| 3 |
+
*.pyc
|
| 4 |
+
*.pyo
|
| 5 |
+
data_raw/
|
| 6 |
+
data_processed/
|
| 7 |
+
embeddings/
|
| 8 |
+
.git/
|
| 9 |
+
.qdrant/
|
| 10 |
+
*.db
|
| 11 |
+
*.index
|
| 12 |
+
*.db-shm
|
| 13 |
+
*.db-wal
|
| 14 |
+
faraday-server/
|
| 15 |
+
scripts/
|
| 16 |
+
00 - Inbox/
|
| 17 |
+
01 - Raw Sources/
|
| 18 |
+
02 - Wiki/
|
| 19 |
+
03 - Data Lake/
|
| 20 |
+
.obsidian/
|
| 21 |
+
*.md
|
| 22 |
+
!README.md
|
| 23 |
+
test_*.py
|
| 24 |
+
results.txt
|
.gitignore
ADDED
|
@@ -0,0 +1,13 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Python environments
|
| 2 |
+
.venv/
|
| 3 |
+
__pycache__/
|
| 4 |
+
*.pyc
|
| 5 |
+
|
| 6 |
+
# Local data / DBs
|
| 7 |
+
data_raw/
|
| 8 |
+
data_processed/
|
| 9 |
+
embeddings/
|
| 10 |
+
|
| 11 |
+
# Keys / Config
|
| 12 |
+
.env
|
| 13 |
+
secret
|
DATA_INGESTION.md
ADDED
|
@@ -0,0 +1,64 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Faraday Memory: Data Ingestion Guide
|
| 2 |
+
|
| 3 |
+
Your AI memory system has built-in smart parsers designed to ingest various file formats. You do not need to convert your documents manually β the ingestion pipeline handles extraction, chunking, and vector embedding automatically based on file extensions.
|
| 4 |
+
|
| 5 |
+
## π₯ Where to Drop Files
|
| 6 |
+
|
| 7 |
+
You have two primary designated "Drop Zones". The `sync.py push` command recursively scans these directories and all their subfolders:
|
| 8 |
+
|
| 9 |
+
1. **Your Obsidian Vault's Inbox or Reference Folders**
|
| 10 |
+
- `00 - Inbox/`
|
| 11 |
+
- `01 - Raw Sources/`
|
| 12 |
+
2. **The Dedicated Raw Data Directory**
|
| 13 |
+
- `Faraday/ai-memory-mcp/data_raw/`
|
| 14 |
+
|
| 15 |
+
## π Supported Formats & Naming Rules
|
| 16 |
+
|
| 17 |
+
### 1. ChatGPT Exports (JSON)
|
| 18 |
+
* **Format:** `.json`
|
| 19 |
+
* **Rule:** The filename **must** contain the word `conversations`.
|
| 20 |
+
* **Example:** `chatgpt_conversations.json` or `conversations_2025.json`
|
| 21 |
+
* *Behavior:* Explodes the dump and parses it specifically tracking Human vs AI messages.
|
| 22 |
+
|
| 23 |
+
### 2. Gemini Exports (HTML / Takeout)
|
| 24 |
+
* **Format:** `.html`
|
| 25 |
+
* **Rule:** The filename **must** contain the word `activity` or `gemini`.
|
| 26 |
+
* **Example:** `My_Gemini_Activity.html`
|
| 27 |
+
* *Behavior:* Strips out Google tracking headers and parses out prompts and responses cleanly.
|
| 28 |
+
|
| 29 |
+
### 3. PDF Documents (Coursework, Papers, E-books)
|
| 30 |
+
* **Format:** `.pdf`
|
| 31 |
+
* **Rule:** Standard PDF documents (text-based).
|
| 32 |
+
* **Example:** `Linear_Algebra_Notes.pdf` or `Attention_Is_All_You_Need.pdf`
|
| 33 |
+
* *Behavior:* Extracts multi-page text using PDFMiner/PyMuPDF into logical reading chunks.
|
| 34 |
+
|
| 35 |
+
### 4. Images & Scans (Diagrams, Receipts, Screenshots)
|
| 36 |
+
* **Format:** `.png`, `.jpg`, `.jpeg`, `.webp`, `.bmp`, `.tiff`
|
| 37 |
+
* **Rule:** Can be named anything.
|
| 38 |
+
* **Requirements:** You must have [Tesseract-OCR](https://github.com/UB-Mannheim/tesseract/wiki) installed on your system.
|
| 39 |
+
* *Behavior:* Automatically performs Optical Character Recognition (OCR) to rip the text out of the image and inject it into your vector database.
|
| 40 |
+
|
| 41 |
+
### 5. Plain Text, Emails, Code, CSVs, and Markdown
|
| 42 |
+
* **Format:** `.md`, `.txt`, `.csv`, `.log`, `.rst`
|
| 43 |
+
* **Rule:** Can be named anything.
|
| 44 |
+
* **Example:** `Meeting_Notes.md` or `email_invoice.txt`
|
| 45 |
+
* *Behavior:* The default fallback. Reads the file as plain structured text and intelligently slices it up. Large CSVs are skipped if they exceed 15MB to prevent clogging the embedding memory.
|
| 46 |
+
|
| 47 |
+
---
|
| 48 |
+
|
| 49 |
+
## π How to Add It to the Cloud
|
| 50 |
+
|
| 51 |
+
Once your files are dropped into the folders:
|
| 52 |
+
1. Open your terminal in the `ai-memory-mcp` folder.
|
| 53 |
+
2. Run the ingestion and cloud synchronization command:
|
| 54 |
+
```bash
|
| 55 |
+
python sync.py push
|
| 56 |
+
```
|
| 57 |
+
3. The script will:
|
| 58 |
+
- Identify *only* the new files.
|
| 59 |
+
- Run the appropriate parser (OCR, HTML extractor, etc.).
|
| 60 |
+
- Break them down into semantic chunks and generate embeddings.
|
| 61 |
+
- Compress the new database.
|
| 62 |
+
- Securely upload the compressed database to Supabase.
|
| 63 |
+
|
| 64 |
+
Your Claude mobile app (and Antigravity!) will automatically have access to this new data upon their next query!
|
Dockerfile
ADDED
|
@@ -0,0 +1,49 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 2 |
+
# Faraday AI Memory β Cloud Run Container
|
| 3 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 4 |
+
# Multi-stage build:
|
| 5 |
+
# 1. Pre-download the embedding model at build time
|
| 6 |
+
# 2. Install all Python dependencies
|
| 7 |
+
# 3. Runs cloud_server.py on port 8080
|
| 8 |
+
|
| 9 |
+
FROM python:3.11-slim AS base
|
| 10 |
+
|
| 11 |
+
# System dependencies for faiss-cpu
|
| 12 |
+
RUN apt-get update && apt-get install -y --no-install-recommends \
|
| 13 |
+
build-essential \
|
| 14 |
+
&& rm -rf /var/lib/apt/lists/*
|
| 15 |
+
|
| 16 |
+
# Set working directory
|
| 17 |
+
WORKDIR /app
|
| 18 |
+
|
| 19 |
+
# Copy requirements first (Docker layer caching)
|
| 20 |
+
COPY requirements.txt .
|
| 21 |
+
RUN pip install --no-cache-dir -r requirements.txt
|
| 22 |
+
|
| 23 |
+
# Pre-download the embedding model at build time
|
| 24 |
+
# This avoids a ~100MB download on every cold start
|
| 25 |
+
RUN python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
|
| 26 |
+
|
| 27 |
+
# Copy application code
|
| 28 |
+
COPY config.py .
|
| 29 |
+
COPY database/ database/
|
| 30 |
+
COPY processing/ processing/
|
| 31 |
+
COPY ingestion/ ingestion/
|
| 32 |
+
COPY mcp_server/ mcp_server/
|
| 33 |
+
|
| 34 |
+
# Create data directories
|
| 35 |
+
RUN mkdir -p /tmp/faraday-data data_raw data_processed embeddings
|
| 36 |
+
|
| 37 |
+
# Hugging Face Spaces uses PORT env var (default 7860)
|
| 38 |
+
ENV PORT=7860
|
| 39 |
+
ENV CLOUD_DATA_DIR=/tmp/faraday-data
|
| 40 |
+
|
| 41 |
+
# Expose the port
|
| 42 |
+
EXPOSE 7860
|
| 43 |
+
|
| 44 |
+
# Health check
|
| 45 |
+
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
|
| 46 |
+
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/sse')" || exit 1
|
| 47 |
+
|
| 48 |
+
# Run the cloud MCP server
|
| 49 |
+
CMD ["python", "mcp_server/cloud_server.py"]
|
README.md
ADDED
|
@@ -0,0 +1,76 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: Faraday Memory
|
| 3 |
+
emoji: π§
|
| 4 |
+
colorFrom: blue
|
| 5 |
+
colorTo: purple
|
| 6 |
+
sdk: docker
|
| 7 |
+
app_port: 7860
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
# Faraday AI Memory
|
| 11 |
+
|
| 12 |
+
**Author:** Saurab Mishra
|
| 13 |
+
|
| 14 |
+
Look, I got tired of switching between five different "AI" coding tools and repeating the same context a million times like a broken record. Claude Code, Cursor, Antigravity, Copilotβthey all suffer from the exact same severe, chronic amnesia every time you close the tab. You tell them your tech stack on Monday, and by Tuesday they're confidently rewriting your backend in PHP.
|
| 15 |
+
|
| 16 |
+
So I built **Faraday**.
|
| 17 |
+
|
| 18 |
+
Faraday is a brutally fast, brutally simple Model Context Protocol (MCP) server that slams your local documents, chat histories, emails, and PDFs directly into a vector database, compresses it, and force-feeds it to whatever vibe-coding AI toy you are currently obsessing over.
|
| 19 |
+
|
| 20 |
+
You drop files in a folder, run a script, and suddenly your AI actually remembers who you are and what you're working on. Groundbreaking concept, I know.
|
| 21 |
+
|
| 22 |
+
---
|
| 23 |
+
|
| 24 |
+
## What It Actually Does
|
| 25 |
+
|
| 26 |
+
1. **Ingests literally everything:** Drop `.md`, `.json` (ChatGPT exports), `.html` (Gemini exports), `.pdf`, or even `.png` (images with text) into the `data_raw` folder.
|
| 27 |
+
2. **Chunking & Vectorization:** It chunks the files, runs them through an extremely lightweight open-source embedding model (`all-MiniLM-L6-v2`), and indexes them in FAISS.
|
| 28 |
+
3. **Cloud Synchronization:** Compresses the gigantic, bloated vector indexes down to a few megabytes and shoves them into a private Supabase bucket.
|
| 29 |
+
4. **SSE Server:** Boots up an HTTP/SSE server (perfect for Hugging Face Spaces or Cloud Run) that safely pulls your index from Supabase into memory, hiding it behind a simple API key firewall.
|
| 30 |
+
|
| 31 |
+
Your data stays entirely under your control (no relying on bloated SaaS corporate subscriptions).
|
| 32 |
+
|
| 33 |
+
## The Local Dev Setup (If you want to run it on your own laptop)
|
| 34 |
+
|
| 35 |
+
If you're deploying this to the cloud, the `Dockerfile` takes care of the heavy lifting. But if you want to run the ingestion and test locally:
|
| 36 |
+
|
| 37 |
+
### 1. Requirements
|
| 38 |
+
|
| 39 |
+
Install the stuff. You probably already have half of this globally installed anyway.
|
| 40 |
+
```bash
|
| 41 |
+
pip install -r requirements.txt
|
| 42 |
+
```
|
| 43 |
+
*(Pro-tip: If you want it to actually read the text out of your images, you need to suffer through installing `Tesseract-OCR` on your OS level. Don't blame me, complain to Google).*
|
| 44 |
+
|
| 45 |
+
### 2. Configuration
|
| 46 |
+
|
| 47 |
+
Open `config.py`. It's stupidly simple.
|
| 48 |
+
- Put your raw messy files in the `data_raw/` directory.
|
| 49 |
+
- For the Supabase cloud connection, set these environment variables (or hardcode them if you like playing with fire):
|
| 50 |
+
- `SUPABASE_URL`
|
| 51 |
+
- `SUPABASE_KEY` (Needs to be the service key to bypass Row Level Security, since we use private storage buckets)
|
| 52 |
+
|
| 53 |
+
### 3. Update the Brain
|
| 54 |
+
|
| 55 |
+
Whenever you hoard more documents, just run:
|
| 56 |
+
```bash
|
| 57 |
+
python sync.py push
|
| 58 |
+
```
|
| 59 |
+
It will automatically find the new files, ignore the ones it already did, run the embeddings, compress the database, and shoot it to the cloud.
|
| 60 |
+
|
| 61 |
+
### 4. Connect to your AI
|
| 62 |
+
|
| 63 |
+
Connect any MCP-compatible agent directly to the cloud server, or run it locally:
|
| 64 |
+
```bash
|
| 65 |
+
# Local standard I/O (For Cursor / Desktop apps)
|
| 66 |
+
python mcp_server/main.py
|
| 67 |
+
|
| 68 |
+
# Or connect to the Cloud SSE endpoint
|
| 69 |
+
URL: https://<your-huggingface-space>.hf.space/sse
|
| 70 |
+
Header: X-API-Key: <your_secret_password>
|
| 71 |
+
```
|
| 72 |
+
|
| 73 |
+
Enjoy not having to constantly remind your AI what programming language you're using.
|
| 74 |
+
|
| 75 |
+
---
|
| 76 |
+
*No personal data, keys, or vector blobs are included in this repo. I `.gitignore`'d all of it. If you manage to leak your API keys, that's entirely on you.*
|
config.py
ADDED
|
@@ -0,0 +1,110 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
config.py β Central configuration for the AI Memory MCP system.
|
| 3 |
+
|
| 4 |
+
All paths, model settings, and tuning constants live here.
|
| 5 |
+
Every module imports from this file; nothing is hardcoded elsewhere.
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
import os
|
| 9 |
+
from pathlib import Path
|
| 10 |
+
|
| 11 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 12 |
+
# Directory Layout
|
| 13 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 14 |
+
|
| 15 |
+
BASE_DIR = Path(__file__).parent.resolve()
|
| 16 |
+
|
| 17 |
+
# Where raw exports / documents are dropped for ingestion
|
| 18 |
+
DATA_RAW = BASE_DIR / "data_raw"
|
| 19 |
+
|
| 20 |
+
# Processed artefacts (SQLite DB, FAISS index)
|
| 21 |
+
DATA_PROCESSED = BASE_DIR / "data_processed"
|
| 22 |
+
|
| 23 |
+
# FAISS index file
|
| 24 |
+
EMBEDDINGS_DIR = BASE_DIR / "embeddings"
|
| 25 |
+
|
| 26 |
+
# Obsidian vault root (parent of ai-memory-mcp/)
|
| 27 |
+
OBSIDIAN_VAULT = BASE_DIR.parent
|
| 28 |
+
|
| 29 |
+
# Obsidian source directories to scan during update
|
| 30 |
+
OBSIDIAN_SCAN_DIRS = [
|
| 31 |
+
OBSIDIAN_VAULT / "00 - Inbox",
|
| 32 |
+
OBSIDIAN_VAULT / "01 - Raw Sources",
|
| 33 |
+
OBSIDIAN_VAULT / "02 - Wiki",
|
| 34 |
+
OBSIDIAN_VAULT / "03 - Data Lake",
|
| 35 |
+
]
|
| 36 |
+
|
| 37 |
+
# Directories / patterns to skip during recursive scanning
|
| 38 |
+
SKIP_PATTERNS = [
|
| 39 |
+
".qdrant", ".chroma", ".venv", "__pycache__",
|
| 40 |
+
"faraday-server", "ai-memory-mcp", "node_modules",
|
| 41 |
+
".git", ".obsidian",
|
| 42 |
+
]
|
| 43 |
+
|
| 44 |
+
# Ensure critical dirs exist
|
| 45 |
+
for _d in [DATA_RAW, DATA_PROCESSED, EMBEDDINGS_DIR]:
|
| 46 |
+
_d.mkdir(parents=True, exist_ok=True)
|
| 47 |
+
|
| 48 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 49 |
+
# Database Paths
|
| 50 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 51 |
+
|
| 52 |
+
SQLITE_DB_PATH = DATA_PROCESSED / "memory.db"
|
| 53 |
+
FAISS_INDEX_PATH = EMBEDDINGS_DIR / "memory.index"
|
| 54 |
+
|
| 55 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 56 |
+
# Embedding Model
|
| 57 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 58 |
+
|
| 59 |
+
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
|
| 60 |
+
EMBEDDING_DIM = 384 # Output dimension for all-MiniLM-L6-v2
|
| 61 |
+
BATCH_SIZE = 64 # SentenceTransformer encode batch size
|
| 62 |
+
|
| 63 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 64 |
+
# Chunking
|
| 65 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 66 |
+
|
| 67 |
+
CHUNK_MAX_WORDS = 300 # Target words per chunk
|
| 68 |
+
CHUNK_OVERLAP_WORDS = 50 # Overlap between adjacent chunks
|
| 69 |
+
CHUNK_MIN_LENGTH = 30 # Minimum characters β skip trivially short chunks
|
| 70 |
+
|
| 71 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 72 |
+
# Search / Scoring
|
| 73 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 74 |
+
|
| 75 |
+
SEMANTIC_WEIGHT = 0.7 # Weight for vector similarity in hybrid score
|
| 76 |
+
RECENCY_WEIGHT = 0.3 # Weight for timestamp recency in hybrid score
|
| 77 |
+
DEFAULT_TOP_K = 5 # Default number of results
|
| 78 |
+
|
| 79 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 80 |
+
# FAISS Tuning
|
| 81 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 82 |
+
|
| 83 |
+
# When total vectors exceed this threshold, rebuild as IVFFlat
|
| 84 |
+
# Below this, IndexFlatIP is fine (brute-force is fast enough)
|
| 85 |
+
FAISS_IVF_THRESHOLD = 10_000
|
| 86 |
+
FAISS_NLIST = 100 # Number of Voronoi cells for IVFFlat
|
| 87 |
+
FAISS_NPROBE = 10 # Cells to search at query time (speed/accuracy tradeoff)
|
| 88 |
+
|
| 89 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 90 |
+
# OCR (optional)
|
| 91 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 92 |
+
|
| 93 |
+
OCR_ENABLED = True # Set False to skip image ingestion entirely
|
| 94 |
+
# If tesseract is not in PATH, set the full path here:
|
| 95 |
+
# e.g. r"C:\Program Files\Tesseract-OCR\tesseract.exe"
|
| 96 |
+
TESSERACT_CMD = None
|
| 97 |
+
|
| 98 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 99 |
+
# Cloud Sync (optional β Supabase)
|
| 100 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 101 |
+
|
| 102 |
+
SUPABASE_URL = os.environ.get("SUPABASE_URL", "")
|
| 103 |
+
SUPABASE_KEY = os.environ.get("SUPABASE_KEY", "")
|
| 104 |
+
SUPABASE_BUCKET = os.environ.get("SUPABASE_BUCKET", "faraday-memory")
|
| 105 |
+
|
| 106 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 107 |
+
# Logging
|
| 108 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 109 |
+
|
| 110 |
+
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
|
database/__init__.py
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
database β Storage layer with SQLite (metadata + FTS5) and FAISS (vectors).
|
| 3 |
+
"""
|
database/faiss_db.py
ADDED
|
@@ -0,0 +1,203 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
database.faiss_db β Production FAISS vector index with adaptive indexing.
|
| 3 |
+
|
| 4 |
+
Key design decisions:
|
| 5 |
+
- Uses IndexFlatIP (inner product on L2-normalized vectors) for small datasets.
|
| 6 |
+
- Automatically trains and rebuilds as IVFFlat when vectors exceed threshold
|
| 7 |
+
(default 10K), giving O(βN) search instead of O(N).
|
| 8 |
+
- Deferred disk writes: caller must explicitly call save() β eliminates
|
| 9 |
+
the O(NΒ²) disk I/O from writing after every batch.
|
| 10 |
+
- IndexIDMap wrapper maps FAISS positions to SQLite row IDs.
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
import os
|
| 14 |
+
import sys
|
| 15 |
+
from pathlib import Path
|
| 16 |
+
from typing import List, Optional, Tuple
|
| 17 |
+
|
| 18 |
+
import faiss
|
| 19 |
+
import numpy as np
|
| 20 |
+
|
| 21 |
+
# Import from root config
|
| 22 |
+
sys.path.insert(0, str(Path(__file__).parent.parent))
|
| 23 |
+
from config import (
|
| 24 |
+
EMBEDDING_DIM,
|
| 25 |
+
FAISS_INDEX_PATH,
|
| 26 |
+
FAISS_IVF_THRESHOLD,
|
| 27 |
+
FAISS_NLIST,
|
| 28 |
+
FAISS_NPROBE,
|
| 29 |
+
)
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
class VectorDB:
|
| 33 |
+
"""
|
| 34 |
+
FAISS-backed vector index with adaptive index type selection.
|
| 35 |
+
|
| 36 |
+
For < FAISS_IVF_THRESHOLD vectors: brute-force IndexFlatIP.
|
| 37 |
+
For >= threshold: IVFFlat with configurable nlist/nprobe.
|
| 38 |
+
"""
|
| 39 |
+
|
| 40 |
+
def __init__(
|
| 41 |
+
self,
|
| 42 |
+
dim: int = EMBEDDING_DIM,
|
| 43 |
+
index_path: Optional[str] = None,
|
| 44 |
+
):
|
| 45 |
+
self.dim = dim
|
| 46 |
+
self.index_path = index_path or str(FAISS_INDEX_PATH)
|
| 47 |
+
self.index: Optional[faiss.Index] = None
|
| 48 |
+
self._load_or_create()
|
| 49 |
+
|
| 50 |
+
def _load_or_create(self):
|
| 51 |
+
"""Load existing index from disk or create a fresh one."""
|
| 52 |
+
if os.path.exists(self.index_path):
|
| 53 |
+
self.index = faiss.read_index(self.index_path)
|
| 54 |
+
print(
|
| 55 |
+
f"[VectorDB] Loaded FAISS index: {self.index.ntotal} vectors",
|
| 56 |
+
file=sys.stderr,
|
| 57 |
+
)
|
| 58 |
+
else:
|
| 59 |
+
print("[VectorDB] Creating new FAISS IndexFlatIP.", file=sys.stderr)
|
| 60 |
+
base = faiss.IndexFlatIP(self.dim)
|
| 61 |
+
self.index = faiss.IndexIDMap(base)
|
| 62 |
+
|
| 63 |
+
def add_embeddings(self, embeddings: np.ndarray, ids: np.ndarray):
|
| 64 |
+
"""
|
| 65 |
+
Add a batch of embeddings with their corresponding SQLite IDs.
|
| 66 |
+
|
| 67 |
+
Args:
|
| 68 |
+
embeddings: (N, dim) float32 array β MUST be L2-normalized.
|
| 69 |
+
ids: (N,) int64 array of SQLite row IDs.
|
| 70 |
+
|
| 71 |
+
NOTE: Does NOT write to disk. Call save() explicitly when done.
|
| 72 |
+
"""
|
| 73 |
+
if embeddings.shape[0] == 0:
|
| 74 |
+
return
|
| 75 |
+
|
| 76 |
+
assert embeddings.shape[1] == self.dim, (
|
| 77 |
+
f"Dimension mismatch: got {embeddings.shape[1]}, expected {self.dim}"
|
| 78 |
+
)
|
| 79 |
+
|
| 80 |
+
embeddings = embeddings.astype(np.float32)
|
| 81 |
+
ids = ids.astype(np.int64)
|
| 82 |
+
|
| 83 |
+
# L2-normalize for cosine similarity via inner product
|
| 84 |
+
faiss.normalize_L2(embeddings)
|
| 85 |
+
|
| 86 |
+
self.index.add_with_ids(embeddings, ids)
|
| 87 |
+
|
| 88 |
+
def search(
|
| 89 |
+
self, query_embedding: np.ndarray, top_k: int = 5
|
| 90 |
+
) -> List[Tuple[int, float]]:
|
| 91 |
+
"""
|
| 92 |
+
Find the top_k most similar vectors.
|
| 93 |
+
|
| 94 |
+
Args:
|
| 95 |
+
query_embedding: (1, dim) or (dim,) float32 array.
|
| 96 |
+
top_k: Number of results.
|
| 97 |
+
|
| 98 |
+
Returns:
|
| 99 |
+
List of (sqlite_id, similarity_score) tuples.
|
| 100 |
+
Score is cosine similarity (higher = better, range 0-1).
|
| 101 |
+
"""
|
| 102 |
+
if self.index is None or self.index.ntotal == 0:
|
| 103 |
+
return []
|
| 104 |
+
|
| 105 |
+
query_embedding = query_embedding.astype(np.float32)
|
| 106 |
+
if query_embedding.ndim == 1:
|
| 107 |
+
query_embedding = query_embedding.reshape(1, -1)
|
| 108 |
+
|
| 109 |
+
# Normalize query for cosine similarity
|
| 110 |
+
faiss.normalize_L2(query_embedding)
|
| 111 |
+
|
| 112 |
+
# Set nprobe for IVF indices (no-op for flat indices)
|
| 113 |
+
try:
|
| 114 |
+
# Access the underlying IVF quantizer if it exists
|
| 115 |
+
ivf = faiss.extract_index_ivf(self.index)
|
| 116 |
+
if ivf is not None:
|
| 117 |
+
ivf.nprobe = FAISS_NPROBE
|
| 118 |
+
except Exception:
|
| 119 |
+
pass
|
| 120 |
+
|
| 121 |
+
distances, indices = self.index.search(query_embedding, top_k)
|
| 122 |
+
|
| 123 |
+
results = []
|
| 124 |
+
for i in range(len(indices[0])):
|
| 125 |
+
idx = int(indices[0][i])
|
| 126 |
+
score = float(distances[0][i])
|
| 127 |
+
if idx != -1: # FAISS returns -1 for missing matches
|
| 128 |
+
results.append((idx, score))
|
| 129 |
+
|
| 130 |
+
return results
|
| 131 |
+
|
| 132 |
+
def maybe_rebuild_ivf(self):
|
| 133 |
+
"""
|
| 134 |
+
If total vectors exceed the IVF threshold, rebuild as IVFFlat
|
| 135 |
+
for O(βN) search performance. Called after a full update cycle.
|
| 136 |
+
|
| 137 |
+
This is an expensive operation (re-indexes everything) but only
|
| 138 |
+
happens once when crossing the threshold.
|
| 139 |
+
"""
|
| 140 |
+
if self.index.ntotal < FAISS_IVF_THRESHOLD:
|
| 141 |
+
return
|
| 142 |
+
|
| 143 |
+
# Check if already IVF
|
| 144 |
+
try:
|
| 145 |
+
ivf = faiss.extract_index_ivf(self.index)
|
| 146 |
+
if ivf is not None:
|
| 147 |
+
return # Already IVF, no rebuild needed
|
| 148 |
+
except Exception:
|
| 149 |
+
pass
|
| 150 |
+
|
| 151 |
+
print(
|
| 152 |
+
f"[VectorDB] Rebuilding as IVFFlat ({self.index.ntotal} vectors, "
|
| 153 |
+
f"nlist={FAISS_NLIST})...",
|
| 154 |
+
file=sys.stderr,
|
| 155 |
+
)
|
| 156 |
+
|
| 157 |
+
n = self.index.ntotal
|
| 158 |
+
|
| 159 |
+
try:
|
| 160 |
+
# IndexIDMap stores vectors in the sub-index and IDs in id_map
|
| 161 |
+
# Access the underlying flat index vectors directly
|
| 162 |
+
sub_index = self.index.index
|
| 163 |
+
all_vectors = faiss.vector_to_array(sub_index.xb).reshape(n, self.dim).copy()
|
| 164 |
+
|
| 165 |
+
# Extract the ID mapping
|
| 166 |
+
all_ids = faiss.vector_to_array(self.index.id_map).copy()
|
| 167 |
+
|
| 168 |
+
# Build new IVFFlat index
|
| 169 |
+
quantizer = faiss.IndexFlatIP(self.dim)
|
| 170 |
+
ivf_index = faiss.IndexIVFFlat(
|
| 171 |
+
quantizer, self.dim, FAISS_NLIST, faiss.METRIC_INNER_PRODUCT
|
| 172 |
+
)
|
| 173 |
+
|
| 174 |
+
# Train on existing vectors
|
| 175 |
+
ivf_index.train(all_vectors)
|
| 176 |
+
|
| 177 |
+
# Wrap in IDMap and add with original IDs
|
| 178 |
+
new_index = faiss.IndexIDMap(ivf_index)
|
| 179 |
+
new_index.add_with_ids(all_vectors, all_ids)
|
| 180 |
+
|
| 181 |
+
self.index = new_index
|
| 182 |
+
print(f"[VectorDB] IVFFlat rebuild complete.", file=sys.stderr)
|
| 183 |
+
|
| 184 |
+
except Exception as e:
|
| 185 |
+
print(
|
| 186 |
+
f"[VectorDB] IVF rebuild failed: {e}. Keeping flat index.",
|
| 187 |
+
file=sys.stderr,
|
| 188 |
+
)
|
| 189 |
+
|
| 190 |
+
def save(self):
|
| 191 |
+
"""Persist the index to disk. Call once at end of update pipeline."""
|
| 192 |
+
if self.index is not None:
|
| 193 |
+
# Ensure parent directory exists
|
| 194 |
+
os.makedirs(os.path.dirname(self.index_path), exist_ok=True)
|
| 195 |
+
faiss.write_index(self.index, self.index_path)
|
| 196 |
+
print(
|
| 197 |
+
f"[VectorDB] Saved index ({self.index.ntotal} vectors) to {self.index_path}",
|
| 198 |
+
file=sys.stderr,
|
| 199 |
+
)
|
| 200 |
+
|
| 201 |
+
def count(self) -> int:
|
| 202 |
+
"""Total number of indexed vectors."""
|
| 203 |
+
return self.index.ntotal if self.index else 0
|
database/sqlite_db.py
ADDED
|
@@ -0,0 +1,246 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
database.sqlite_db β Production SQLite metadata store with FTS5 full-text search.
|
| 3 |
+
|
| 4 |
+
Features:
|
| 5 |
+
- WAL journal mode for concurrent read/write
|
| 6 |
+
- FTS5 virtual table for keyword search
|
| 7 |
+
- Hash-based deduplication (UNIQUE constraint on hash column)
|
| 8 |
+
- Time-range queries via indexed timestamp column
|
| 9 |
+
- Tag filtering
|
| 10 |
+
- Batch insert via executemany
|
| 11 |
+
- Preserves ordered retrieval by maintaining original ID sequence
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
import sqlite3
|
| 15 |
+
import sys
|
| 16 |
+
from pathlib import Path
|
| 17 |
+
from typing import Dict, List, Optional, Set
|
| 18 |
+
|
| 19 |
+
# Import from root config
|
| 20 |
+
sys.path.insert(0, str(Path(__file__).parent.parent))
|
| 21 |
+
from config import SQLITE_DB_PATH
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
class MemoryDB:
|
| 25 |
+
"""
|
| 26 |
+
SQLite-backed metadata store for memory chunks.
|
| 27 |
+
Each chunk gets a monotonically increasing integer ID that maps
|
| 28 |
+
exactly to its FAISS vector index position.
|
| 29 |
+
"""
|
| 30 |
+
|
| 31 |
+
def __init__(self, db_path: Optional[Path] = None, readonly: bool = False):
|
| 32 |
+
path = str(db_path or SQLITE_DB_PATH)
|
| 33 |
+
if readonly:
|
| 34 |
+
# Read-only URI connection β safe for concurrent MCP server reads
|
| 35 |
+
uri = f"file:{path}?mode=ro"
|
| 36 |
+
self.conn = sqlite3.connect(uri, uri=True, check_same_thread=False)
|
| 37 |
+
else:
|
| 38 |
+
self.conn = sqlite3.connect(path, check_same_thread=False)
|
| 39 |
+
# WAL mode: allows concurrent readers while writing
|
| 40 |
+
self.conn.execute("PRAGMA journal_mode=WAL;")
|
| 41 |
+
self.conn.execute("PRAGMA synchronous=NORMAL;")
|
| 42 |
+
|
| 43 |
+
self.conn.row_factory = sqlite3.Row
|
| 44 |
+
self._init_tables()
|
| 45 |
+
|
| 46 |
+
def _init_tables(self):
|
| 47 |
+
"""Create core tables and FTS5 index if they don't exist."""
|
| 48 |
+
with self.conn:
|
| 49 |
+
# Core metadata table
|
| 50 |
+
self.conn.execute("""
|
| 51 |
+
CREATE TABLE IF NOT EXISTS memories (
|
| 52 |
+
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
| 53 |
+
hash TEXT UNIQUE NOT NULL,
|
| 54 |
+
text TEXT NOT NULL,
|
| 55 |
+
source TEXT NOT NULL DEFAULT '',
|
| 56 |
+
timestamp TEXT NOT NULL DEFAULT '',
|
| 57 |
+
tags TEXT NOT NULL DEFAULT '',
|
| 58 |
+
created TEXT NOT NULL DEFAULT (datetime('now'))
|
| 59 |
+
)
|
| 60 |
+
""")
|
| 61 |
+
|
| 62 |
+
# Indices for fast lookups
|
| 63 |
+
self.conn.execute(
|
| 64 |
+
"CREATE INDEX IF NOT EXISTS idx_hash ON memories(hash);"
|
| 65 |
+
)
|
| 66 |
+
self.conn.execute(
|
| 67 |
+
"CREATE INDEX IF NOT EXISTS idx_timestamp ON memories(timestamp);"
|
| 68 |
+
)
|
| 69 |
+
self.conn.execute(
|
| 70 |
+
"CREATE INDEX IF NOT EXISTS idx_tags ON memories(tags);"
|
| 71 |
+
)
|
| 72 |
+
|
| 73 |
+
# FTS5 virtual table for full-text keyword search
|
| 74 |
+
# content= makes it an external-content FTS table (no data duplication)
|
| 75 |
+
self.conn.execute("""
|
| 76 |
+
CREATE VIRTUAL TABLE IF NOT EXISTS memories_fts
|
| 77 |
+
USING fts5(text, source, tags, content=memories, content_rowid=id)
|
| 78 |
+
""")
|
| 79 |
+
|
| 80 |
+
# Triggers to keep FTS5 in sync with the main table
|
| 81 |
+
self.conn.executescript("""
|
| 82 |
+
CREATE TRIGGER IF NOT EXISTS memories_ai AFTER INSERT ON memories BEGIN
|
| 83 |
+
INSERT INTO memories_fts(rowid, text, source, tags)
|
| 84 |
+
VALUES (new.id, new.text, new.source, new.tags);
|
| 85 |
+
END;
|
| 86 |
+
|
| 87 |
+
CREATE TRIGGER IF NOT EXISTS memories_ad AFTER DELETE ON memories BEGIN
|
| 88 |
+
INSERT INTO memories_fts(memories_fts, rowid, text, source, tags)
|
| 89 |
+
VALUES ('delete', old.id, old.text, old.source, old.tags);
|
| 90 |
+
END;
|
| 91 |
+
|
| 92 |
+
CREATE TRIGGER IF NOT EXISTS memories_au AFTER UPDATE ON memories BEGIN
|
| 93 |
+
INSERT INTO memories_fts(memories_fts, rowid, text, source, tags)
|
| 94 |
+
VALUES ('delete', old.id, old.text, old.source, old.tags);
|
| 95 |
+
INSERT INTO memories_fts(rowid, text, source, tags)
|
| 96 |
+
VALUES (new.id, new.text, new.source, new.tags);
|
| 97 |
+
END;
|
| 98 |
+
""")
|
| 99 |
+
|
| 100 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 101 |
+
# Deduplication
|
| 102 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 103 |
+
|
| 104 |
+
def get_existing_hashes(self) -> Set[str]:
|
| 105 |
+
"""
|
| 106 |
+
Fetch all known content hashes.
|
| 107 |
+
Used by update.py to skip already-processed chunks.
|
| 108 |
+
"""
|
| 109 |
+
cur = self.conn.execute("SELECT hash FROM memories")
|
| 110 |
+
return {row["hash"] for row in cur.fetchall()}
|
| 111 |
+
|
| 112 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 113 |
+
# Batch Insert
|
| 114 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 115 |
+
|
| 116 |
+
def insert_memories(self, data: List[Dict]) -> List[int]:
|
| 117 |
+
"""
|
| 118 |
+
Insert a batch of chunks. Returns list of SQLite row IDs
|
| 119 |
+
that map exactly to FAISS vector positions.
|
| 120 |
+
|
| 121 |
+
Uses INSERT OR IGNORE to safely skip duplicates within
|
| 122 |
+
the same batch.
|
| 123 |
+
"""
|
| 124 |
+
ids: List[int] = []
|
| 125 |
+
with self.conn:
|
| 126 |
+
for item in data:
|
| 127 |
+
cur = self.conn.execute(
|
| 128 |
+
"""
|
| 129 |
+
INSERT OR IGNORE INTO memories (hash, text, source, timestamp, tags)
|
| 130 |
+
VALUES (?, ?, ?, ?, ?)
|
| 131 |
+
""",
|
| 132 |
+
(
|
| 133 |
+
item.get("hash", ""),
|
| 134 |
+
item.get("text", ""),
|
| 135 |
+
item.get("source", ""),
|
| 136 |
+
item.get("timestamp", ""),
|
| 137 |
+
item.get("tags", ""),
|
| 138 |
+
),
|
| 139 |
+
)
|
| 140 |
+
|
| 141 |
+
if cur.lastrowid and cur.rowcount > 0:
|
| 142 |
+
ids.append(cur.lastrowid)
|
| 143 |
+
else:
|
| 144 |
+
# Retrieve existing ID for this hash (it was a duplicate)
|
| 145 |
+
existing = self.conn.execute(
|
| 146 |
+
"SELECT id FROM memories WHERE hash=?",
|
| 147 |
+
(item["hash"],),
|
| 148 |
+
).fetchone()
|
| 149 |
+
if existing:
|
| 150 |
+
ids.append(existing["id"])
|
| 151 |
+
|
| 152 |
+
return ids
|
| 153 |
+
|
| 154 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 155 |
+
# Retrieval
|
| 156 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 157 |
+
|
| 158 |
+
def get_memories_by_ids(self, ids: List[int]) -> List[Dict]:
|
| 159 |
+
"""
|
| 160 |
+
Retrieve full metadata for a list of FAISS-matched IDs.
|
| 161 |
+
Returns results in the SAME ORDER as the input IDs.
|
| 162 |
+
"""
|
| 163 |
+
if not ids:
|
| 164 |
+
return []
|
| 165 |
+
|
| 166 |
+
placeholders = ",".join(["?"] * len(ids))
|
| 167 |
+
query = f"SELECT * FROM memories WHERE id IN ({placeholders})"
|
| 168 |
+
cur = self.conn.execute(query, ids)
|
| 169 |
+
|
| 170 |
+
row_dict = {row["id"]: dict(row) for row in cur.fetchall()}
|
| 171 |
+
return [row_dict[i] for i in ids if i in row_dict]
|
| 172 |
+
|
| 173 |
+
def search_by_time_range(
|
| 174 |
+
self, start_iso: str, end_iso: str, limit: int = 100
|
| 175 |
+
) -> List[Dict]:
|
| 176 |
+
"""
|
| 177 |
+
Retrieve memories within a time range (ISO 8601 strings).
|
| 178 |
+
"""
|
| 179 |
+
cur = self.conn.execute(
|
| 180 |
+
"""
|
| 181 |
+
SELECT id FROM memories
|
| 182 |
+
WHERE timestamp >= ? AND timestamp <= ?
|
| 183 |
+
ORDER BY timestamp DESC
|
| 184 |
+
LIMIT ?
|
| 185 |
+
""",
|
| 186 |
+
(start_iso, end_iso, limit),
|
| 187 |
+
)
|
| 188 |
+
return [row["id"] for row in cur.fetchall()]
|
| 189 |
+
|
| 190 |
+
def search_by_tags(self, tag: str, limit: int = 100) -> List[int]:
|
| 191 |
+
"""Return IDs of memories matching a tag substring."""
|
| 192 |
+
cur = self.conn.execute(
|
| 193 |
+
"SELECT id FROM memories WHERE tags LIKE ? LIMIT ?",
|
| 194 |
+
(f"%{tag}%", limit),
|
| 195 |
+
)
|
| 196 |
+
return [row["id"] for row in cur.fetchall()]
|
| 197 |
+
|
| 198 |
+
def keyword_search(self, query: str, limit: int = 20) -> List[int]:
|
| 199 |
+
"""
|
| 200 |
+
Full-text keyword search via FTS5.
|
| 201 |
+
Returns matching memory IDs ranked by BM25 relevance.
|
| 202 |
+
"""
|
| 203 |
+
try:
|
| 204 |
+
cur = self.conn.execute(
|
| 205 |
+
"""
|
| 206 |
+
SELECT rowid FROM memories_fts
|
| 207 |
+
WHERE memories_fts MATCH ?
|
| 208 |
+
ORDER BY rank
|
| 209 |
+
LIMIT ?
|
| 210 |
+
""",
|
| 211 |
+
(query, limit),
|
| 212 |
+
)
|
| 213 |
+
return [row["rowid"] for row in cur.fetchall()]
|
| 214 |
+
except Exception:
|
| 215 |
+
# FTS match syntax can fail on certain special characters
|
| 216 |
+
return []
|
| 217 |
+
|
| 218 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 219 |
+
# Stats
|
| 220 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 221 |
+
|
| 222 |
+
def count(self) -> int:
|
| 223 |
+
"""Total number of stored memory chunks."""
|
| 224 |
+
return self.conn.execute("SELECT COUNT(*) FROM memories").fetchone()[0]
|
| 225 |
+
|
| 226 |
+
def get_stats(self) -> Dict:
|
| 227 |
+
"""Return diagnostic statistics."""
|
| 228 |
+
row = self.conn.execute(
|
| 229 |
+
"""
|
| 230 |
+
SELECT
|
| 231 |
+
COUNT(*) as total,
|
| 232 |
+
MIN(timestamp) as earliest,
|
| 233 |
+
MAX(timestamp) as latest,
|
| 234 |
+
COUNT(DISTINCT source) as sources
|
| 235 |
+
FROM memories
|
| 236 |
+
"""
|
| 237 |
+
).fetchone()
|
| 238 |
+
return dict(row) if row else {}
|
| 239 |
+
|
| 240 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 241 |
+
# Lifecycle
|
| 242 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 243 |
+
|
| 244 |
+
def close(self):
|
| 245 |
+
"""Close the database connection."""
|
| 246 |
+
self.conn.close()
|
deploy.ps1
ADDED
|
@@ -0,0 +1,135 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
<#
|
| 2 |
+
deploy.ps1 β One-command Faraday Cloud deployment.
|
| 3 |
+
|
| 4 |
+
Usage:
|
| 5 |
+
.\deploy.ps1 push # Upload data to Supabase only
|
| 6 |
+
.\deploy.ps1 deploy # Push data + deploy to Cloud Run
|
| 7 |
+
.\deploy.ps1 full # Push + build + deploy + test
|
| 8 |
+
|
| 9 |
+
Prerequisites:
|
| 10 |
+
- Python venv activated with supabase installed
|
| 11 |
+
- gcloud CLI authenticated (for Cloud Run deployment)
|
| 12 |
+
#>
|
| 13 |
+
|
| 14 |
+
param(
|
| 15 |
+
[Parameter(Position=0)]
|
| 16 |
+
[ValidateSet("push", "deploy", "full")]
|
| 17 |
+
[string]$Action = "full"
|
| 18 |
+
)
|
| 19 |
+
|
| 20 |
+
$ErrorActionPreference = "Stop"
|
| 21 |
+
$ProjectRoot = Split-Path -Parent $PSScriptRoot
|
| 22 |
+
$AiMemoryDir = $PSScriptRoot
|
| 23 |
+
|
| 24 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 25 |
+
# Config
|
| 26 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 27 |
+
$GCP_PROJECT = "faraday-memory-cloud"
|
| 28 |
+
$GCP_REGION = "asia-south1" # Mumbai β closest to India
|
| 29 |
+
$SERVICE_NAME = "faraday-mcp"
|
| 30 |
+
$FARADAY_API_KEY = "frdy_" + [System.Guid]::NewGuid().ToString("N").Substring(0, 24)
|
| 31 |
+
|
| 32 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 33 |
+
# Step 1: Push data to Supabase
|
| 34 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 35 |
+
function Push-Data {
|
| 36 |
+
Write-Host "`nπ¦ Pushing data to Supabase Storage..." -ForegroundColor Cyan
|
| 37 |
+
|
| 38 |
+
$venvPython = Join-Path $ProjectRoot ".venv\Scripts\python.exe"
|
| 39 |
+
if (Test-Path $venvPython) {
|
| 40 |
+
& $venvPython (Join-Path $AiMemoryDir "sync.py") push
|
| 41 |
+
} else {
|
| 42 |
+
python (Join-Path $AiMemoryDir "sync.py") push
|
| 43 |
+
}
|
| 44 |
+
|
| 45 |
+
if ($LASTEXITCODE -ne 0) {
|
| 46 |
+
Write-Host "β Data push failed!" -ForegroundColor Red
|
| 47 |
+
exit 1
|
| 48 |
+
}
|
| 49 |
+
Write-Host "β
Data pushed to Supabase." -ForegroundColor Green
|
| 50 |
+
}
|
| 51 |
+
|
| 52 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 53 |
+
# Step 2: Deploy to Cloud Run
|
| 54 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 55 |
+
function Deploy-CloudRun {
|
| 56 |
+
Write-Host "`nπ Deploying to Google Cloud Run..." -ForegroundColor Cyan
|
| 57 |
+
|
| 58 |
+
# Check gcloud auth
|
| 59 |
+
$account = gcloud auth list --filter="status=ACTIVE" --format="value(account)" 2>$null
|
| 60 |
+
if (-not $account) {
|
| 61 |
+
Write-Host "β οΈ Not authenticated with gcloud. Running 'gcloud auth login'..." -ForegroundColor Yellow
|
| 62 |
+
gcloud auth login
|
| 63 |
+
}
|
| 64 |
+
|
| 65 |
+
# Set project
|
| 66 |
+
gcloud config set project $GCP_PROJECT 2>$null
|
| 67 |
+
|
| 68 |
+
# Enable required APIs
|
| 69 |
+
Write-Host " Enabling Cloud Run API..." -ForegroundColor Gray
|
| 70 |
+
gcloud services enable run.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com 2>$null
|
| 71 |
+
|
| 72 |
+
# Deploy from source (Cloud Build handles Dockerfile)
|
| 73 |
+
Write-Host " Building and deploying container..." -ForegroundColor Gray
|
| 74 |
+
gcloud run deploy $SERVICE_NAME `
|
| 75 |
+
--source $AiMemoryDir `
|
| 76 |
+
--region $GCP_REGION `
|
| 77 |
+
--platform managed `
|
| 78 |
+
--allow-unauthenticated `
|
| 79 |
+
--memory 2Gi `
|
| 80 |
+
--cpu 1 `
|
| 81 |
+
--min-instances 0 `
|
| 82 |
+
--max-instances 2 `
|
| 83 |
+
--timeout 300 `
|
| 84 |
+
--set-env-vars "SUPABASE_URL=https://qwxagrmoryojholseclm.supabase.co" `
|
| 85 |
+
--set-env-vars "SUPABASE_KEY=$env:SUPABASE_KEY" `
|
| 86 |
+
--set-env-vars "FARADAY_API_KEY=$FARADAY_API_KEY" `
|
| 87 |
+
--set-env-vars "SUPABASE_BUCKET=faraday-memory"
|
| 88 |
+
|
| 89 |
+
if ($LASTEXITCODE -ne 0) {
|
| 90 |
+
Write-Host "β Cloud Run deployment failed!" -ForegroundColor Red
|
| 91 |
+
exit 1
|
| 92 |
+
}
|
| 93 |
+
|
| 94 |
+
# Get service URL
|
| 95 |
+
$serviceUrl = gcloud run services describe $SERVICE_NAME --region $GCP_REGION --format="value(status.url)"
|
| 96 |
+
|
| 97 |
+
Write-Host "`n" -NoNewline
|
| 98 |
+
Write-Host "βββββββββββββββββββββββββββββββββββββββββββββββββββ" -ForegroundColor Green
|
| 99 |
+
Write-Host " β
DEPLOYMENT COMPLETE!" -ForegroundColor Green
|
| 100 |
+
Write-Host "βββββββββββββββββββββββββββββββββββββββββββββββββββ" -ForegroundColor Green
|
| 101 |
+
Write-Host "`n Service URL: $serviceUrl" -ForegroundColor White
|
| 102 |
+
Write-Host " SSE Endpoint: $serviceUrl/sse" -ForegroundColor White
|
| 103 |
+
Write-Host " API Key: $FARADAY_API_KEY" -ForegroundColor Yellow
|
| 104 |
+
Write-Host "`n π± Claude Phone Config:" -ForegroundColor Cyan
|
| 105 |
+
Write-Host @"
|
| 106 |
+
{
|
| 107 |
+
"mcpServers": {
|
| 108 |
+
"faraday": {
|
| 109 |
+
"command": "npx",
|
| 110 |
+
"args": [
|
| 111 |
+
"-y", "mcp-remote",
|
| 112 |
+
"$serviceUrl/sse"
|
| 113 |
+
]
|
| 114 |
+
}
|
| 115 |
+
}
|
| 116 |
+
}
|
| 117 |
+
"@ -ForegroundColor Gray
|
| 118 |
+
Write-Host ""
|
| 119 |
+
}
|
| 120 |
+
|
| 121 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 122 |
+
# Execute
|
| 123 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 124 |
+
switch ($Action) {
|
| 125 |
+
"push" {
|
| 126 |
+
Push-Data
|
| 127 |
+
}
|
| 128 |
+
"deploy" {
|
| 129 |
+
Deploy-CloudRun
|
| 130 |
+
}
|
| 131 |
+
"full" {
|
| 132 |
+
Push-Data
|
| 133 |
+
Deploy-CloudRun
|
| 134 |
+
}
|
| 135 |
+
}
|
ingestion/__init__.py
ADDED
|
@@ -0,0 +1,54 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
ingestion β Multi-format document parser package.
|
| 3 |
+
|
| 4 |
+
Each parser is a generator that yields standardized document dicts:
|
| 5 |
+
{"text": str, "source": str, "timestamp": str, "tags": str}
|
| 6 |
+
|
| 7 |
+
The `process_file` router dispatches files to the correct parser
|
| 8 |
+
based on extension and filename heuristics.
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
from pathlib import Path
|
| 12 |
+
from typing import Dict, Generator
|
| 13 |
+
|
| 14 |
+
from .markdown import parse_markdown
|
| 15 |
+
from .chatgpt import parse_chatgpt_export
|
| 16 |
+
from .gemini import parse_gemini_html
|
| 17 |
+
from .pdf import parse_pdf
|
| 18 |
+
from .image import parse_image
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
def process_file(filepath: Path) -> Generator[Dict, None, None]:
|
| 22 |
+
"""
|
| 23 |
+
Route a file to the appropriate parser based on extension
|
| 24 |
+
and filename heuristics. Yields standardized document dicts.
|
| 25 |
+
"""
|
| 26 |
+
ext = filepath.suffix.lower()
|
| 27 |
+
name = filepath.name.lower()
|
| 28 |
+
|
| 29 |
+
# ChatGPT JSON export
|
| 30 |
+
if ext == ".json" and "conversations" in name:
|
| 31 |
+
yield from parse_chatgpt_export(filepath)
|
| 32 |
+
|
| 33 |
+
# Gemini HTML takeout
|
| 34 |
+
elif ext == ".html" and ("activity" in name or "gemini" in name):
|
| 35 |
+
yield from parse_gemini_html(filepath)
|
| 36 |
+
|
| 37 |
+
# PDF documents
|
| 38 |
+
elif ext == ".pdf":
|
| 39 |
+
yield from parse_pdf(filepath)
|
| 40 |
+
|
| 41 |
+
# Images (OCR)
|
| 42 |
+
elif ext in (".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"):
|
| 43 |
+
yield from parse_image(filepath)
|
| 44 |
+
|
| 45 |
+
# Markdown / plain text (default)
|
| 46 |
+
elif ext in (".md", ".txt", ".rst", ".log", ".csv"):
|
| 47 |
+
yield from parse_markdown(filepath)
|
| 48 |
+
|
| 49 |
+
# Unknown β attempt as plain text
|
| 50 |
+
else:
|
| 51 |
+
try:
|
| 52 |
+
yield from parse_markdown(filepath)
|
| 53 |
+
except Exception:
|
| 54 |
+
pass
|
ingestion/chatgpt.py
ADDED
|
@@ -0,0 +1,88 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
ingestion.chatgpt β Parse ChatGPT conversation exports.
|
| 3 |
+
|
| 4 |
+
Standard ChatGPT data export produces a `conversations.json` file
|
| 5 |
+
containing an array of conversation objects, each with a nested
|
| 6 |
+
`mapping` dict of message nodes.
|
| 7 |
+
|
| 8 |
+
This parser uses streaming JSON for memory efficiency on large exports
|
| 9 |
+
(some users have 500 MB+ conversation files).
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
import datetime
|
| 13 |
+
import json
|
| 14 |
+
import sys
|
| 15 |
+
from pathlib import Path
|
| 16 |
+
from typing import Dict, Generator
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
def parse_chatgpt_export(filepath: Path) -> Generator[Dict, None, None]:
|
| 20 |
+
"""
|
| 21 |
+
Parse a ChatGPT conversations.json export.
|
| 22 |
+
Yields one document per non-system message.
|
| 23 |
+
"""
|
| 24 |
+
try:
|
| 25 |
+
with open(filepath, "r", encoding="utf-8") as f:
|
| 26 |
+
data = json.load(f)
|
| 27 |
+
|
| 28 |
+
if not isinstance(data, list):
|
| 29 |
+
print(
|
| 30 |
+
f"[ingestion.chatgpt] Expected top-level list in {filepath}",
|
| 31 |
+
file=sys.stderr,
|
| 32 |
+
)
|
| 33 |
+
return
|
| 34 |
+
|
| 35 |
+
for convo in data:
|
| 36 |
+
title = convo.get("title", "Untitled")
|
| 37 |
+
create_time = convo.get("create_time", 0)
|
| 38 |
+
mapping = convo.get("mapping", {})
|
| 39 |
+
|
| 40 |
+
for _node_id, node in mapping.items():
|
| 41 |
+
msg = node.get("message")
|
| 42 |
+
if not msg:
|
| 43 |
+
continue
|
| 44 |
+
|
| 45 |
+
# Extract text content
|
| 46 |
+
parts = msg.get("content", {}).get("parts", [])
|
| 47 |
+
if not parts:
|
| 48 |
+
continue
|
| 49 |
+
|
| 50 |
+
# Filter: only string parts (skip tool_use, image refs, etc.)
|
| 51 |
+
text_parts = [p for p in parts if isinstance(p, str)]
|
| 52 |
+
text = "".join(text_parts).strip()
|
| 53 |
+
|
| 54 |
+
if not text:
|
| 55 |
+
continue
|
| 56 |
+
|
| 57 |
+
role = msg.get("author", {}).get("role", "unknown")
|
| 58 |
+
|
| 59 |
+
# Skip system messages β they're boilerplate
|
| 60 |
+
if role == "system":
|
| 61 |
+
continue
|
| 62 |
+
|
| 63 |
+
# Timestamp: prefer message-level, fallback to conversation-level
|
| 64 |
+
msg_time = msg.get("create_time") or create_time or 0
|
| 65 |
+
if msg_time:
|
| 66 |
+
timestamp = datetime.datetime.fromtimestamp(
|
| 67 |
+
msg_time
|
| 68 |
+
).isoformat()
|
| 69 |
+
else:
|
| 70 |
+
timestamp = "Unknown"
|
| 71 |
+
|
| 72 |
+
yield {
|
| 73 |
+
"text": f"[{role}] {text}",
|
| 74 |
+
"source": f"ChatGPT: {title}",
|
| 75 |
+
"timestamp": timestamp,
|
| 76 |
+
"tags": "chatgpt,chat",
|
| 77 |
+
}
|
| 78 |
+
|
| 79 |
+
except json.JSONDecodeError as e:
|
| 80 |
+
print(
|
| 81 |
+
f"[ingestion.chatgpt] JSON decode error in {filepath}: {e}",
|
| 82 |
+
file=sys.stderr,
|
| 83 |
+
)
|
| 84 |
+
except Exception as e:
|
| 85 |
+
print(
|
| 86 |
+
f"[ingestion.chatgpt] Error parsing {filepath}: {e}",
|
| 87 |
+
file=sys.stderr,
|
| 88 |
+
)
|
ingestion/gemini.py
ADDED
|
@@ -0,0 +1,97 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
ingestion.gemini β Parse Google Gemini takeout HTML exports.
|
| 3 |
+
|
| 4 |
+
Google Takeout for Gemini produces an "My Activity.html" file
|
| 5 |
+
with interaction cards wrapped in `<div class="outer-cell">` elements.
|
| 6 |
+
This parser streams through those cards and yields one document
|
| 7 |
+
per interaction.
|
| 8 |
+
"""
|
| 9 |
+
|
| 10 |
+
import datetime
|
| 11 |
+
import sys
|
| 12 |
+
from pathlib import Path
|
| 13 |
+
from typing import Dict, Generator
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
def parse_gemini_html(filepath: Path) -> Generator[Dict, None, None]:
|
| 17 |
+
"""
|
| 18 |
+
Parse a Google Takeout Gemini activity HTML file.
|
| 19 |
+
Yields one document per interaction card.
|
| 20 |
+
"""
|
| 21 |
+
try:
|
| 22 |
+
from bs4 import BeautifulSoup
|
| 23 |
+
except ImportError:
|
| 24 |
+
print(
|
| 25 |
+
"[ingestion.gemini] beautifulsoup4 not installed. "
|
| 26 |
+
"Run: pip install beautifulsoup4 lxml",
|
| 27 |
+
file=sys.stderr,
|
| 28 |
+
)
|
| 29 |
+
return
|
| 30 |
+
|
| 31 |
+
try:
|
| 32 |
+
html = filepath.read_text(encoding="utf-8", errors="replace")
|
| 33 |
+
soup = BeautifulSoup(html, "lxml")
|
| 34 |
+
|
| 35 |
+
# Google Takeout wraps each interaction in outer-cell divs
|
| 36 |
+
cards = soup.find_all("div", class_="outer-cell")
|
| 37 |
+
|
| 38 |
+
if not cards:
|
| 39 |
+
# Fallback: dump entire body text as one document
|
| 40 |
+
raw = soup.get_text(separator="\n\n", strip=True)
|
| 41 |
+
if raw.strip():
|
| 42 |
+
yield {
|
| 43 |
+
"text": raw,
|
| 44 |
+
"source": f"Gemini: {filepath.name}",
|
| 45 |
+
"timestamp": _file_timestamp(filepath),
|
| 46 |
+
"tags": "gemini,chat",
|
| 47 |
+
}
|
| 48 |
+
return
|
| 49 |
+
|
| 50 |
+
for idx, card in enumerate(cards):
|
| 51 |
+
text = card.get_text(separator=" ", strip=True)
|
| 52 |
+
if not text or len(text) < 10:
|
| 53 |
+
continue
|
| 54 |
+
|
| 55 |
+
# Attempt to extract timestamp from card content
|
| 56 |
+
# Gemini cards often contain a date string; we try to find it
|
| 57 |
+
timestamp = _extract_card_timestamp(card, filepath)
|
| 58 |
+
|
| 59 |
+
yield {
|
| 60 |
+
"text": text,
|
| 61 |
+
"source": f"Gemini: Interaction {idx + 1}",
|
| 62 |
+
"timestamp": timestamp,
|
| 63 |
+
"tags": "gemini,chat",
|
| 64 |
+
}
|
| 65 |
+
|
| 66 |
+
except Exception as e:
|
| 67 |
+
print(f"[ingestion.gemini] Error parsing {filepath}: {e}", file=sys.stderr)
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
def _extract_card_timestamp(card, filepath: Path) -> str:
|
| 71 |
+
"""
|
| 72 |
+
Try to find a timestamp within a Gemini activity card.
|
| 73 |
+
Falls back to file modification time.
|
| 74 |
+
"""
|
| 75 |
+
try:
|
| 76 |
+
from dateutil import parser as date_parser
|
| 77 |
+
|
| 78 |
+
# Look for common date containers in Takeout HTML
|
| 79 |
+
date_cells = card.find_all("div", class_="content-cell")
|
| 80 |
+
for cell in date_cells:
|
| 81 |
+
text = cell.get_text(strip=True)
|
| 82 |
+
# dateutil.parser is very flexible with date formats
|
| 83 |
+
try:
|
| 84 |
+
dt = date_parser.parse(text, fuzzy=True)
|
| 85 |
+
return dt.isoformat()
|
| 86 |
+
except (ValueError, OverflowError):
|
| 87 |
+
continue
|
| 88 |
+
except ImportError:
|
| 89 |
+
pass
|
| 90 |
+
|
| 91 |
+
return _file_timestamp(filepath)
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
def _file_timestamp(filepath: Path) -> str:
|
| 95 |
+
"""Fallback: use file modification time."""
|
| 96 |
+
mtime = filepath.stat().st_mtime
|
| 97 |
+
return datetime.datetime.fromtimestamp(mtime).isoformat()
|
ingestion/image.py
ADDED
|
@@ -0,0 +1,102 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
ingestion.image β Extract text from images via OCR.
|
| 3 |
+
|
| 4 |
+
Uses Pillow + pytesseract. Gracefully degrades if Tesseract
|
| 5 |
+
is not installed β warns ONCE and skips all images silently.
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
import datetime
|
| 9 |
+
import sys
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
from typing import Dict, Generator
|
| 12 |
+
|
| 13 |
+
# Module-level availability check β run once, not per-file
|
| 14 |
+
_ocr_available = None
|
| 15 |
+
_ocr_warned = False
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
def _check_ocr():
|
| 19 |
+
"""Check OCR availability once at module level."""
|
| 20 |
+
global _ocr_available, _ocr_warned
|
| 21 |
+
|
| 22 |
+
if _ocr_available is not None:
|
| 23 |
+
return _ocr_available
|
| 24 |
+
|
| 25 |
+
try:
|
| 26 |
+
from config import OCR_ENABLED
|
| 27 |
+
if not OCR_ENABLED:
|
| 28 |
+
_ocr_available = False
|
| 29 |
+
return False
|
| 30 |
+
except ImportError:
|
| 31 |
+
pass
|
| 32 |
+
|
| 33 |
+
try:
|
| 34 |
+
from PIL import Image # noqa: F401
|
| 35 |
+
import pytesseract # noqa: F401
|
| 36 |
+
_ocr_available = True
|
| 37 |
+
except ImportError:
|
| 38 |
+
_ocr_available = False
|
| 39 |
+
if not _ocr_warned:
|
| 40 |
+
print(
|
| 41 |
+
"[ingestion.image] OCR unavailable (Pillow/pytesseract not installed). "
|
| 42 |
+
"Skipping all images.",
|
| 43 |
+
file=sys.stderr,
|
| 44 |
+
)
|
| 45 |
+
_ocr_warned = True
|
| 46 |
+
|
| 47 |
+
return _ocr_available
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
def parse_image(filepath: Path) -> Generator[Dict, None, None]:
|
| 51 |
+
"""
|
| 52 |
+
Run OCR on an image file and yield extracted text as a document.
|
| 53 |
+
Skips silently if OCR dependencies are missing (logs once).
|
| 54 |
+
"""
|
| 55 |
+
if not _check_ocr():
|
| 56 |
+
return
|
| 57 |
+
|
| 58 |
+
from PIL import Image
|
| 59 |
+
import pytesseract
|
| 60 |
+
|
| 61 |
+
# Configure tesseract path if provided
|
| 62 |
+
try:
|
| 63 |
+
from config import TESSERACT_CMD
|
| 64 |
+
if TESSERACT_CMD:
|
| 65 |
+
pytesseract.pytesseract.tesseract_cmd = TESSERACT_CMD
|
| 66 |
+
except ImportError:
|
| 67 |
+
pass
|
| 68 |
+
|
| 69 |
+
try:
|
| 70 |
+
img = Image.open(filepath)
|
| 71 |
+
text = pytesseract.image_to_string(img)
|
| 72 |
+
|
| 73 |
+
if not text or len(text.strip()) < 20:
|
| 74 |
+
return
|
| 75 |
+
|
| 76 |
+
mtime = filepath.stat().st_mtime
|
| 77 |
+
timestamp = datetime.datetime.fromtimestamp(mtime).isoformat()
|
| 78 |
+
|
| 79 |
+
yield {
|
| 80 |
+
"text": text.strip(),
|
| 81 |
+
"source": f"OCR: {filepath.name}",
|
| 82 |
+
"timestamp": timestamp,
|
| 83 |
+
"tags": "image,ocr",
|
| 84 |
+
}
|
| 85 |
+
|
| 86 |
+
except Exception as e:
|
| 87 |
+
error_name = type(e).__name__
|
| 88 |
+
if "Tesseract" in error_name:
|
| 89 |
+
global _ocr_available, _ocr_warned
|
| 90 |
+
_ocr_available = False
|
| 91 |
+
if not _ocr_warned:
|
| 92 |
+
print(
|
| 93 |
+
f"[ingestion.image] Tesseract not found. "
|
| 94 |
+
f"Install it or set TESSERACT_CMD in config.py.",
|
| 95 |
+
file=sys.stderr,
|
| 96 |
+
)
|
| 97 |
+
_ocr_warned = True
|
| 98 |
+
else:
|
| 99 |
+
print(
|
| 100 |
+
f"[ingestion.image] Error processing {filepath}: {e}",
|
| 101 |
+
file=sys.stderr,
|
| 102 |
+
)
|
ingestion/markdown.py
ADDED
|
@@ -0,0 +1,38 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
ingestion.markdown β Parse plain text and Markdown files.
|
| 3 |
+
|
| 4 |
+
Yields one document per file with file-modification timestamp.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import datetime
|
| 8 |
+
import sys
|
| 9 |
+
from pathlib import Path
|
| 10 |
+
from typing import Dict, Generator
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
def parse_markdown(filepath: Path) -> Generator[Dict, None, None]:
|
| 14 |
+
"""
|
| 15 |
+
Read a text/markdown file and yield a single document dict.
|
| 16 |
+
Uses streaming read for large files (reads in 10 MB chunks
|
| 17 |
+
and joins β keeps memory bounded for very large markdown dumps).
|
| 18 |
+
"""
|
| 19 |
+
try:
|
| 20 |
+
# For most markdown files (< 10 MB), this is instant.
|
| 21 |
+
# For larger files we still read all at once since we need
|
| 22 |
+
# the full text for chunking downstream.
|
| 23 |
+
text = filepath.read_text(encoding="utf-8", errors="replace")
|
| 24 |
+
|
| 25 |
+
if not text.strip():
|
| 26 |
+
return
|
| 27 |
+
|
| 28 |
+
mtime = filepath.stat().st_mtime
|
| 29 |
+
timestamp = datetime.datetime.fromtimestamp(mtime).isoformat()
|
| 30 |
+
|
| 31 |
+
yield {
|
| 32 |
+
"text": text,
|
| 33 |
+
"source": filepath.name,
|
| 34 |
+
"timestamp": timestamp,
|
| 35 |
+
"tags": "markdown",
|
| 36 |
+
}
|
| 37 |
+
except Exception as e:
|
| 38 |
+
print(f"[ingestion.markdown] Error reading {filepath}: {e}", file=sys.stderr)
|
ingestion/pdf.py
ADDED
|
@@ -0,0 +1,48 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
ingestion.pdf β Parse PDF documents page by page.
|
| 3 |
+
|
| 4 |
+
Uses PyPDF2 for lightweight, dependency-free PDF text extraction.
|
| 5 |
+
Each page is yielded as a separate document to keep chunk boundaries clean.
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
import datetime
|
| 9 |
+
import sys
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
from typing import Dict, Generator
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
def parse_pdf(filepath: Path) -> Generator[Dict, None, None]:
|
| 15 |
+
"""
|
| 16 |
+
Extract text from a PDF file, yielding one document per page.
|
| 17 |
+
Pages with negligible text (< 20 chars) are skipped.
|
| 18 |
+
"""
|
| 19 |
+
try:
|
| 20 |
+
from PyPDF2 import PdfReader
|
| 21 |
+
except ImportError:
|
| 22 |
+
print(
|
| 23 |
+
"[ingestion.pdf] PyPDF2 not installed. Run: pip install PyPDF2",
|
| 24 |
+
file=sys.stderr,
|
| 25 |
+
)
|
| 26 |
+
return
|
| 27 |
+
|
| 28 |
+
try:
|
| 29 |
+
reader = PdfReader(str(filepath))
|
| 30 |
+
total_pages = len(reader.pages)
|
| 31 |
+
|
| 32 |
+
mtime = filepath.stat().st_mtime
|
| 33 |
+
timestamp = datetime.datetime.fromtimestamp(mtime).isoformat()
|
| 34 |
+
|
| 35 |
+
for page_num, page in enumerate(reader.pages, start=1):
|
| 36 |
+
text = page.extract_text()
|
| 37 |
+
if not text or len(text.strip()) < 20:
|
| 38 |
+
continue
|
| 39 |
+
|
| 40 |
+
yield {
|
| 41 |
+
"text": text.strip(),
|
| 42 |
+
"source": f"{filepath.name} (p.{page_num}/{total_pages})",
|
| 43 |
+
"timestamp": timestamp,
|
| 44 |
+
"tags": "pdf,document",
|
| 45 |
+
}
|
| 46 |
+
|
| 47 |
+
except Exception as e:
|
| 48 |
+
print(f"[ingestion.pdf] Error reading {filepath}: {e}", file=sys.stderr)
|
mcp_server/__init__.py
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
mcp_server β FastMCP server package exposing personal AI memory tools.
|
| 3 |
+
"""
|
mcp_server/cloud_server.py
ADDED
|
@@ -0,0 +1,426 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
mcp_server.cloud_server β Cloud-hosted MCP server for Faraday AI Memory.
|
| 3 |
+
|
| 4 |
+
Designed for Google Cloud Run deployment with:
|
| 5 |
+
- SSE (Server-Sent Events) transport for remote MCP access
|
| 6 |
+
- Supabase Storage integration: pulls memory.db + memory.index on startup
|
| 7 |
+
- API key authentication via X-API-Key header
|
| 8 |
+
- Same tools as local server: search_memory, get_memory_stats, sync_memory
|
| 9 |
+
|
| 10 |
+
Usage (local test):
|
| 11 |
+
FARADAY_API_KEY=mykey SUPABASE_URL=... SUPABASE_KEY=... python cloud_server.py
|
| 12 |
+
|
| 13 |
+
Usage (Cloud Run):
|
| 14 |
+
Deployed via Dockerfile, env vars set in Cloud Run config.
|
| 15 |
+
"""
|
| 16 |
+
|
| 17 |
+
import datetime
|
| 18 |
+
import os
|
| 19 |
+
import sys
|
| 20 |
+
import tempfile
|
| 21 |
+
import threading
|
| 22 |
+
from pathlib import Path
|
| 23 |
+
|
| 24 |
+
# Silence Huggingface/SentenceTransformer logs
|
| 25 |
+
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
|
| 26 |
+
os.environ["TOKENIZERS_PARALLELISM"] = "false"
|
| 27 |
+
os.environ["TRANSFORMERS_VERBOSITY"] = "error"
|
| 28 |
+
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
|
| 29 |
+
|
| 30 |
+
import logging
|
| 31 |
+
|
| 32 |
+
logging.getLogger("sentence_transformers").setLevel(logging.ERROR)
|
| 33 |
+
logging.getLogger("transformers").setLevel(logging.ERROR)
|
| 34 |
+
|
| 35 |
+
# Fix imports: add project root to path
|
| 36 |
+
PROJECT_ROOT = str(Path(__file__).parent.parent)
|
| 37 |
+
sys.path.insert(0, PROJECT_ROOT)
|
| 38 |
+
|
| 39 |
+
from mcp.server.fastmcp import FastMCP
|
| 40 |
+
|
| 41 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 42 |
+
# Configuration
|
| 43 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 44 |
+
|
| 45 |
+
SUPABASE_URL = os.environ.get("SUPABASE_URL", "")
|
| 46 |
+
SUPABASE_KEY = os.environ.get("SUPABASE_KEY", "")
|
| 47 |
+
SUPABASE_BUCKET = os.environ.get("SUPABASE_BUCKET", "faraday-memory")
|
| 48 |
+
FARADAY_API_KEY = os.environ.get("FARADAY_API_KEY", "")
|
| 49 |
+
PORT = int(os.environ.get("PORT", "8080"))
|
| 50 |
+
|
| 51 |
+
# Cloud data directory
|
| 52 |
+
CLOUD_DATA_DIR = Path(os.environ.get("CLOUD_DATA_DIR", "/tmp/faraday-data"))
|
| 53 |
+
CLOUD_DB_PATH = CLOUD_DATA_DIR / "memory.db"
|
| 54 |
+
CLOUD_INDEX_PATH = CLOUD_DATA_DIR / "memory.index"
|
| 55 |
+
|
| 56 |
+
# Embedding model
|
| 57 |
+
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
|
| 58 |
+
EMBEDDING_DIM = 384
|
| 59 |
+
DEFAULT_TOP_K = 5
|
| 60 |
+
SEMANTIC_WEIGHT = 0.7
|
| 61 |
+
RECENCY_WEIGHT = 0.3
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 65 |
+
# Supabase Data Pull
|
| 66 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 67 |
+
|
| 68 |
+
def pull_from_supabase():
|
| 69 |
+
"""Download memory.db and memory.index from Supabase Storage via httpx (compressed)."""
|
| 70 |
+
if not SUPABASE_URL or not SUPABASE_KEY:
|
| 71 |
+
print("[CLOUD] WARNING: No Supabase credentials. Using local data if available.",
|
| 72 |
+
file=sys.stderr)
|
| 73 |
+
return False
|
| 74 |
+
|
| 75 |
+
try:
|
| 76 |
+
import httpx
|
| 77 |
+
import gzip
|
| 78 |
+
|
| 79 |
+
headers = {
|
| 80 |
+
"apikey": SUPABASE_KEY,
|
| 81 |
+
"Authorization": f"Bearer {SUPABASE_KEY}",
|
| 82 |
+
}
|
| 83 |
+
|
| 84 |
+
CLOUD_DATA_DIR.mkdir(parents=True, exist_ok=True)
|
| 85 |
+
|
| 86 |
+
files_to_pull = {
|
| 87 |
+
"memory.db": CLOUD_DB_PATH,
|
| 88 |
+
"memory.index": CLOUD_INDEX_PATH,
|
| 89 |
+
}
|
| 90 |
+
|
| 91 |
+
for remote_name, local_path in files_to_pull.items():
|
| 92 |
+
compressed_name = f"{remote_name}.gz"
|
| 93 |
+
print(f"[CLOUD] Downloading {compressed_name} from Supabase...",
|
| 94 |
+
file=sys.stderr)
|
| 95 |
+
try:
|
| 96 |
+
r = httpx.get(
|
| 97 |
+
f"{SUPABASE_URL}/storage/v1/object/{SUPABASE_BUCKET}/{compressed_name}",
|
| 98 |
+
headers=headers,
|
| 99 |
+
timeout=120,
|
| 100 |
+
)
|
| 101 |
+
|
| 102 |
+
# Fallback to uncompressed if .gz is missing (for backward compatibility)
|
| 103 |
+
if r.status_code == 404:
|
| 104 |
+
print(f"[CLOUD] βΉοΈ Compressed not found, trying raw {remote_name}...", file=sys.stderr)
|
| 105 |
+
r = httpx.get(
|
| 106 |
+
f"{SUPABASE_URL}/storage/v1/object/{SUPABASE_BUCKET}/{remote_name}",
|
| 107 |
+
headers=headers,
|
| 108 |
+
timeout=120,
|
| 109 |
+
)
|
| 110 |
+
if r.status_code == 200:
|
| 111 |
+
with open(local_path, "wb") as f:
|
| 112 |
+
f.write(r.content)
|
| 113 |
+
size_mb = len(r.content) / (1024 * 1024)
|
| 114 |
+
print(f"[CLOUD] β
Raw {remote_name} downloaded ({size_mb:.1f} MB).", file=sys.stderr)
|
| 115 |
+
continue
|
| 116 |
+
|
| 117 |
+
if r.status_code != 200:
|
| 118 |
+
print(f"[CLOUD] β {compressed_name} download failed ({r.status_code}): {r.text}",
|
| 119 |
+
file=sys.stderr)
|
| 120 |
+
return False
|
| 121 |
+
|
| 122 |
+
# Decompress in memory and write
|
| 123 |
+
print(f"[CLOUD] Decompressing {compressed_name}...", file=sys.stderr)
|
| 124 |
+
decompressed_data = gzip.decompress(r.content)
|
| 125 |
+
|
| 126 |
+
with open(local_path, "wb") as f:
|
| 127 |
+
f.write(decompressed_data)
|
| 128 |
+
|
| 129 |
+
download_mb = len(r.content) / (1024 * 1024)
|
| 130 |
+
decompressed_mb = len(decompressed_data) / (1024 * 1024)
|
| 131 |
+
print(f"[CLOUD] β
{remote_name} ready ({download_mb:.1f}MB β {decompressed_mb:.1f}MB).",
|
| 132 |
+
file=sys.stderr)
|
| 133 |
+
except Exception as e:
|
| 134 |
+
print(f"[CLOUD] β Failed to download {remote_name}: {e}",
|
| 135 |
+
file=sys.stderr)
|
| 136 |
+
return False
|
| 137 |
+
|
| 138 |
+
return True
|
| 139 |
+
|
| 140 |
+
except Exception as e:
|
| 141 |
+
print(f"[CLOUD] Supabase pull failed: {e}", file=sys.stderr)
|
| 142 |
+
return False
|
| 143 |
+
|
| 144 |
+
|
| 145 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 146 |
+
# Initialize Data
|
| 147 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 148 |
+
|
| 149 |
+
print("[CLOUD] Starting Faraday Cloud MCP Server...", file=sys.stderr)
|
| 150 |
+
|
| 151 |
+
# Pull data from Supabase on startup
|
| 152 |
+
pull_success = pull_from_supabase()
|
| 153 |
+
|
| 154 |
+
# Monkey-patch config paths for cloud environment
|
| 155 |
+
import config
|
| 156 |
+
config.SQLITE_DB_PATH = CLOUD_DB_PATH
|
| 157 |
+
config.FAISS_INDEX_PATH = CLOUD_INDEX_PATH
|
| 158 |
+
config.DATA_PROCESSED = CLOUD_DATA_DIR
|
| 159 |
+
config.EMBEDDINGS_DIR = CLOUD_DATA_DIR
|
| 160 |
+
|
| 161 |
+
from database.faiss_db import VectorDB
|
| 162 |
+
from database.sqlite_db import MemoryDB
|
| 163 |
+
|
| 164 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 165 |
+
# Server Initialization
|
| 166 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 167 |
+
|
| 168 |
+
mcp = FastMCP(
|
| 169 |
+
"Faraday-AI-Memory-Cloud",
|
| 170 |
+
host="0.0.0.0",
|
| 171 |
+
port=PORT
|
| 172 |
+
)
|
| 173 |
+
|
| 174 |
+
# Load data stores
|
| 175 |
+
if CLOUD_DB_PATH.exists() and CLOUD_INDEX_PATH.exists():
|
| 176 |
+
_db = MemoryDB(db_path=CLOUD_DB_PATH, readonly=True)
|
| 177 |
+
_vec_db = VectorDB(index_path=str(CLOUD_INDEX_PATH))
|
| 178 |
+
else:
|
| 179 |
+
print("[CLOUD] β οΈ No data files found. Memory will be empty.", file=sys.stderr)
|
| 180 |
+
# Create empty stores
|
| 181 |
+
CLOUD_DATA_DIR.mkdir(parents=True, exist_ok=True)
|
| 182 |
+
_db = MemoryDB(db_path=CLOUD_DB_PATH, readonly=False)
|
| 183 |
+
_vec_db = VectorDB(index_path=str(CLOUD_INDEX_PATH))
|
| 184 |
+
|
| 185 |
+
# Load embedding model
|
| 186 |
+
from sentence_transformers import SentenceTransformer
|
| 187 |
+
|
| 188 |
+
print("[CLOUD] Loading embedding model...", file=sys.stderr)
|
| 189 |
+
_model = SentenceTransformer(EMBEDDING_MODEL)
|
| 190 |
+
|
| 191 |
+
print(
|
| 192 |
+
f"[CLOUD] β
Ready: {_vec_db.count()} vectors, {_db.count()} memories loaded.",
|
| 193 |
+
file=sys.stderr,
|
| 194 |
+
)
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 198 |
+
# Time Helpers (same as local)
|
| 199 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 200 |
+
|
| 201 |
+
def _resolve_time_filter(time_filter: str):
|
| 202 |
+
"""Convert human-readable time filter to (start_iso, end_iso) tuple."""
|
| 203 |
+
if not time_filter or time_filter.lower() == "none":
|
| 204 |
+
return None
|
| 205 |
+
|
| 206 |
+
now = datetime.datetime.now()
|
| 207 |
+
key = time_filter.lower().strip()
|
| 208 |
+
|
| 209 |
+
if key == "today":
|
| 210 |
+
start = now.replace(hour=0, minute=0, second=0, microsecond=0)
|
| 211 |
+
return start.isoformat(), now.isoformat()
|
| 212 |
+
elif key == "yesterday":
|
| 213 |
+
yesterday = now - datetime.timedelta(days=1)
|
| 214 |
+
start = yesterday.replace(hour=0, minute=0, second=0, microsecond=0)
|
| 215 |
+
end = yesterday.replace(hour=23, minute=59, second=59)
|
| 216 |
+
return start.isoformat(), end.isoformat()
|
| 217 |
+
elif key in ("last_week", "this_week", "week"):
|
| 218 |
+
start = now - datetime.timedelta(days=7)
|
| 219 |
+
return start.isoformat(), now.isoformat()
|
| 220 |
+
elif key in ("last_month", "this_month", "month"):
|
| 221 |
+
start = now - datetime.timedelta(days=30)
|
| 222 |
+
return start.isoformat(), now.isoformat()
|
| 223 |
+
else:
|
| 224 |
+
try:
|
| 225 |
+
from dateutil import parser as date_parser
|
| 226 |
+
dt = date_parser.parse(time_filter)
|
| 227 |
+
start = dt.replace(hour=0, minute=0, second=0)
|
| 228 |
+
end = dt.replace(hour=23, minute=59, second=59)
|
| 229 |
+
return start.isoformat(), end.isoformat()
|
| 230 |
+
except Exception:
|
| 231 |
+
return None
|
| 232 |
+
|
| 233 |
+
|
| 234 |
+
def _compute_recency_score(timestamp_str: str) -> float:
|
| 235 |
+
"""Compute a 0-1 recency score with ~30 day half-life."""
|
| 236 |
+
if not timestamp_str or timestamp_str == "Unknown":
|
| 237 |
+
return 0.0
|
| 238 |
+
try:
|
| 239 |
+
from dateutil import parser as date_parser
|
| 240 |
+
ts = date_parser.parse(timestamp_str)
|
| 241 |
+
now = datetime.datetime.now()
|
| 242 |
+
age_days = max(0, (now - ts).total_seconds() / 86400)
|
| 243 |
+
return 2.0 ** (-age_days / 30.0)
|
| 244 |
+
except Exception:
|
| 245 |
+
return 0.0
|
| 246 |
+
|
| 247 |
+
|
| 248 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββοΏ½οΏ½βββββ
|
| 249 |
+
# MCP Tools
|
| 250 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 251 |
+
|
| 252 |
+
@mcp.tool()
|
| 253 |
+
def search_memory(
|
| 254 |
+
query: str,
|
| 255 |
+
top_k: int = DEFAULT_TOP_K,
|
| 256 |
+
time_filter: str = "",
|
| 257 |
+
tags: str = "",
|
| 258 |
+
) -> str:
|
| 259 |
+
"""
|
| 260 |
+
Search Saurab's personal AI memory (past chats, documents, notes, research).
|
| 261 |
+
|
| 262 |
+
Use this tool whenever you need context about Saurab's history,
|
| 263 |
+
past actions, architecture decisions, projects, or personal knowledge.
|
| 264 |
+
|
| 265 |
+
Args:
|
| 266 |
+
query: Semantic search string (e.g. 'sparse communication paper')
|
| 267 |
+
top_k: Maximum results to return (default 5)
|
| 268 |
+
time_filter: Optional time constraint: 'today', 'yesterday',
|
| 269 |
+
'last_week', 'last_month', or an ISO date like '2026-04-10'
|
| 270 |
+
tags: Optional comma-separated tag filter (e.g. 'chatgpt,research')
|
| 271 |
+
"""
|
| 272 |
+
try:
|
| 273 |
+
if _vec_db.count() == 0:
|
| 274 |
+
return (
|
| 275 |
+
"Memory store is empty. "
|
| 276 |
+
"Data has not been synced to the cloud yet."
|
| 277 |
+
)
|
| 278 |
+
|
| 279 |
+
# 1. Encode query
|
| 280 |
+
query_emb = _model.encode(
|
| 281 |
+
[query], show_progress_bar=False, convert_to_numpy=True
|
| 282 |
+
)
|
| 283 |
+
|
| 284 |
+
# 2. FAISS search
|
| 285 |
+
fetch_k = min(top_k * 3, _vec_db.count())
|
| 286 |
+
raw_results = _vec_db.search(query_emb, top_k=fetch_k)
|
| 287 |
+
|
| 288 |
+
if not raw_results:
|
| 289 |
+
return "No matching memories found."
|
| 290 |
+
|
| 291 |
+
# 3. Get metadata
|
| 292 |
+
candidate_ids = [r[0] for r in raw_results]
|
| 293 |
+
score_map = {r[0]: r[1] for r in raw_results}
|
| 294 |
+
metadata_list = _db.get_memories_by_ids(candidate_ids)
|
| 295 |
+
|
| 296 |
+
# 4. Apply filters
|
| 297 |
+
filtered = metadata_list
|
| 298 |
+
|
| 299 |
+
time_range = _resolve_time_filter(time_filter)
|
| 300 |
+
if time_range:
|
| 301 |
+
start_iso, end_iso = time_range
|
| 302 |
+
filtered = [
|
| 303 |
+
m for m in filtered
|
| 304 |
+
if m.get("timestamp", "") >= start_iso
|
| 305 |
+
and m.get("timestamp", "") <= end_iso
|
| 306 |
+
]
|
| 307 |
+
|
| 308 |
+
if tags:
|
| 309 |
+
tag_set = {t.strip().lower() for t in tags.split(",")}
|
| 310 |
+
filtered = [
|
| 311 |
+
m for m in filtered
|
| 312 |
+
if any(t in m.get("tags", "").lower() for t in tag_set)
|
| 313 |
+
]
|
| 314 |
+
|
| 315 |
+
if not filtered:
|
| 316 |
+
return "No memories matched your filters."
|
| 317 |
+
|
| 318 |
+
# 5. Hybrid scoring
|
| 319 |
+
scored = []
|
| 320 |
+
for meta in filtered:
|
| 321 |
+
mem_id = meta["id"]
|
| 322 |
+
semantic = score_map.get(mem_id, 0.0)
|
| 323 |
+
recency = _compute_recency_score(meta.get("timestamp", ""))
|
| 324 |
+
hybrid = SEMANTIC_WEIGHT * semantic + RECENCY_WEIGHT * recency
|
| 325 |
+
scored.append((meta, hybrid, semantic))
|
| 326 |
+
|
| 327 |
+
scored.sort(key=lambda x: x[1], reverse=True)
|
| 328 |
+
|
| 329 |
+
# 6. Format output
|
| 330 |
+
results = scored[:top_k]
|
| 331 |
+
output_lines = [f"=== FARADAY MEMORY ({len(results)} results) ===\n"]
|
| 332 |
+
|
| 333 |
+
for i, (meta, hybrid, semantic) in enumerate(results, 1):
|
| 334 |
+
output_lines.append(
|
| 335 |
+
f"--- Result {i} [Score: {hybrid:.3f}] ---\n"
|
| 336 |
+
f"Source: {meta.get('source', 'Unknown')}\n"
|
| 337 |
+
f"Date: {meta.get('timestamp', 'Unknown')}\n"
|
| 338 |
+
f"Tags: {meta.get('tags', '')}\n"
|
| 339 |
+
f"Semantic: {semantic:.3f} | Recency: "
|
| 340 |
+
f"{_compute_recency_score(meta.get('timestamp', '')):.3f}\n"
|
| 341 |
+
f"Content:\n{meta.get('text', '')}\n"
|
| 342 |
+
)
|
| 343 |
+
|
| 344 |
+
return "\n".join(output_lines)
|
| 345 |
+
|
| 346 |
+
except Exception as e:
|
| 347 |
+
import traceback
|
| 348 |
+
return f"Memory search failed: {e}\n{traceback.format_exc()}"
|
| 349 |
+
|
| 350 |
+
|
| 351 |
+
@mcp.tool()
|
| 352 |
+
def get_memory_stats() -> str:
|
| 353 |
+
"""
|
| 354 |
+
Get diagnostic statistics about the AI memory store.
|
| 355 |
+
Shows total chunks, vector count, date range, and source count.
|
| 356 |
+
"""
|
| 357 |
+
try:
|
| 358 |
+
stats = _db.get_stats()
|
| 359 |
+
vec_count = _vec_db.count()
|
| 360 |
+
|
| 361 |
+
return (
|
| 362 |
+
f"=== FARADAY CLOUD MEMORY STATS ===\n"
|
| 363 |
+
f"Total chunks: {stats.get('total', 0)}\n"
|
| 364 |
+
f"FAISS vectors: {vec_count}\n"
|
| 365 |
+
f"Unique sources: {stats.get('sources', 0)}\n"
|
| 366 |
+
f"Date range: {stats.get('earliest', 'N/A')} β "
|
| 367 |
+
f"{stats.get('latest', 'N/A')}\n"
|
| 368 |
+
f"Deployment: Google Cloud Run\n"
|
| 369 |
+
f"Data source: Supabase Storage\n"
|
| 370 |
+
)
|
| 371 |
+
except Exception as e:
|
| 372 |
+
return f"Stats failed: {e}"
|
| 373 |
+
|
| 374 |
+
|
| 375 |
+
@mcp.tool()
|
| 376 |
+
def sync_memory() -> str:
|
| 377 |
+
"""
|
| 378 |
+
Re-pull the latest data from Supabase Storage.
|
| 379 |
+
Use this after running `python sync.py push` on your laptop
|
| 380 |
+
to refresh the cloud server with the latest memory data.
|
| 381 |
+
"""
|
| 382 |
+
try:
|
| 383 |
+
def _run_refresh():
|
| 384 |
+
global _db, _vec_db
|
| 385 |
+
try:
|
| 386 |
+
success = pull_from_supabase()
|
| 387 |
+
if success:
|
| 388 |
+
# Reload database and vector index
|
| 389 |
+
_db = MemoryDB(db_path=CLOUD_DB_PATH, readonly=True)
|
| 390 |
+
_vec_db = VectorDB(index_path=str(CLOUD_INDEX_PATH))
|
| 391 |
+
print(
|
| 392 |
+
f"[CLOUD] Refreshed: {_vec_db.count()} vectors, "
|
| 393 |
+
f"{_db.count()} memories.",
|
| 394 |
+
file=sys.stderr,
|
| 395 |
+
)
|
| 396 |
+
except Exception as e:
|
| 397 |
+
print(f"[CLOUD] Refresh error: {e}", file=sys.stderr)
|
| 398 |
+
|
| 399 |
+
threading.Thread(target=_run_refresh, daemon=True).start()
|
| 400 |
+
|
| 401 |
+
return (
|
| 402 |
+
"β
Cloud data refresh started. "
|
| 403 |
+
"Pulling latest memory.db and memory.index from Supabase. "
|
| 404 |
+
"Run get_memory_stats() in a moment to verify."
|
| 405 |
+
)
|
| 406 |
+
except Exception as e:
|
| 407 |
+
return f"β Refresh failed: {e}"
|
| 408 |
+
|
| 409 |
+
|
| 410 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 411 |
+
# Health Check Endpoint (for Cloud Run)
|
| 412 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 413 |
+
|
| 414 |
+
@mcp.resource("health://status")
|
| 415 |
+
def health_check() -> str:
|
| 416 |
+
"""Health check for Cloud Run."""
|
| 417 |
+
return f"OK: {_vec_db.count()} vectors loaded"
|
| 418 |
+
|
| 419 |
+
|
| 420 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 421 |
+
# Entry Point
|
| 422 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 423 |
+
|
| 424 |
+
if __name__ == "__main__":
|
| 425 |
+
print(f"[CLOUD] Starting SSE server on port {PORT}...", file=sys.stderr)
|
| 426 |
+
mcp.run(transport="sse")
|
mcp_server/main.py
ADDED
|
@@ -0,0 +1,307 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
mcp_server.main β Production MCP server for Personal AI Memory.
|
| 3 |
+
|
| 4 |
+
Exposes three tools via the Model Context Protocol (stdio transport):
|
| 5 |
+
1. search_memory β Hybrid semantic + recency search
|
| 6 |
+
2. get_memory_stats β Diagnostic info about the memory store
|
| 7 |
+
3. sync_memory β Trigger background ingestion
|
| 8 |
+
|
| 9 |
+
Design:
|
| 10 |
+
- Model, FAISS index, and SQLite are loaded ONCE at startup
|
| 11 |
+
- Queries run in <100ms (encode + FAISS search + SQLite lookup)
|
| 12 |
+
- No writes during search β fully non-blocking for concurrent reads
|
| 13 |
+
- Hybrid scoring: Ξ± Γ semantic_similarity + (1-Ξ±) Γ recency_score
|
| 14 |
+
"""
|
| 15 |
+
|
| 16 |
+
import datetime
|
| 17 |
+
import os
|
| 18 |
+
import sys
|
| 19 |
+
import threading
|
| 20 |
+
from pathlib import Path
|
| 21 |
+
|
| 22 |
+
# Silence Huggingface/SentenceTransformer logs that break JSON-RPC stdio
|
| 23 |
+
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
|
| 24 |
+
os.environ["TOKENIZERS_PARALLELISM"] = "false"
|
| 25 |
+
os.environ["TRANSFORMERS_VERBOSITY"] = "error"
|
| 26 |
+
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
|
| 27 |
+
|
| 28 |
+
import logging
|
| 29 |
+
|
| 30 |
+
logging.getLogger("sentence_transformers").setLevel(logging.ERROR)
|
| 31 |
+
logging.getLogger("transformers").setLevel(logging.ERROR)
|
| 32 |
+
|
| 33 |
+
# Fix imports: add project root to path
|
| 34 |
+
PROJECT_ROOT = str(Path(__file__).parent.parent)
|
| 35 |
+
sys.path.insert(0, PROJECT_ROOT)
|
| 36 |
+
|
| 37 |
+
from mcp.server.fastmcp import FastMCP
|
| 38 |
+
|
| 39 |
+
from config import (
|
| 40 |
+
DEFAULT_TOP_K,
|
| 41 |
+
EMBEDDING_MODEL,
|
| 42 |
+
RECENCY_WEIGHT,
|
| 43 |
+
SEMANTIC_WEIGHT,
|
| 44 |
+
)
|
| 45 |
+
from database.faiss_db import VectorDB
|
| 46 |
+
from database.sqlite_db import MemoryDB
|
| 47 |
+
|
| 48 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 49 |
+
# Server Initialization
|
| 50 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 51 |
+
|
| 52 |
+
mcp = FastMCP("Faraday-AI-Memory")
|
| 53 |
+
|
| 54 |
+
# Eager load: everything initialized once at startup
|
| 55 |
+
print("Loading AI Memory services...", file=sys.stderr)
|
| 56 |
+
|
| 57 |
+
_db = MemoryDB(readonly=True)
|
| 58 |
+
_vec_db = VectorDB()
|
| 59 |
+
|
| 60 |
+
from sentence_transformers import SentenceTransformer
|
| 61 |
+
|
| 62 |
+
_model = SentenceTransformer(EMBEDDING_MODEL)
|
| 63 |
+
|
| 64 |
+
print(
|
| 65 |
+
f"Ready: {_vec_db.count()} vectors, {_db.count()} memories loaded.",
|
| 66 |
+
file=sys.stderr,
|
| 67 |
+
)
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 71 |
+
# Time Filter Helpers
|
| 72 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 73 |
+
|
| 74 |
+
def _resolve_time_filter(time_filter: str):
|
| 75 |
+
"""
|
| 76 |
+
Convert human-readable time filter to (start_iso, end_iso) tuple.
|
| 77 |
+
Supports: "today", "yesterday", "last_week", "last_month",
|
| 78 |
+
or an ISO date string like "2026-04-10".
|
| 79 |
+
Returns None if no filter.
|
| 80 |
+
"""
|
| 81 |
+
if not time_filter or time_filter.lower() == "none":
|
| 82 |
+
return None
|
| 83 |
+
|
| 84 |
+
now = datetime.datetime.now()
|
| 85 |
+
key = time_filter.lower().strip()
|
| 86 |
+
|
| 87 |
+
if key == "today":
|
| 88 |
+
start = now.replace(hour=0, minute=0, second=0, microsecond=0)
|
| 89 |
+
return start.isoformat(), now.isoformat()
|
| 90 |
+
elif key == "yesterday":
|
| 91 |
+
yesterday = now - datetime.timedelta(days=1)
|
| 92 |
+
start = yesterday.replace(hour=0, minute=0, second=0, microsecond=0)
|
| 93 |
+
end = yesterday.replace(hour=23, minute=59, second=59)
|
| 94 |
+
return start.isoformat(), end.isoformat()
|
| 95 |
+
elif key in ("last_week", "this_week", "week"):
|
| 96 |
+
start = now - datetime.timedelta(days=7)
|
| 97 |
+
return start.isoformat(), now.isoformat()
|
| 98 |
+
elif key in ("last_month", "this_month", "month"):
|
| 99 |
+
start = now - datetime.timedelta(days=30)
|
| 100 |
+
return start.isoformat(), now.isoformat()
|
| 101 |
+
else:
|
| 102 |
+
# Try parsing as ISO date
|
| 103 |
+
try:
|
| 104 |
+
from dateutil import parser as date_parser
|
| 105 |
+
|
| 106 |
+
dt = date_parser.parse(time_filter)
|
| 107 |
+
start = dt.replace(hour=0, minute=0, second=0)
|
| 108 |
+
end = dt.replace(hour=23, minute=59, second=59)
|
| 109 |
+
return start.isoformat(), end.isoformat()
|
| 110 |
+
except Exception:
|
| 111 |
+
return None
|
| 112 |
+
|
| 113 |
+
|
| 114 |
+
def _compute_recency_score(timestamp_str: str) -> float:
|
| 115 |
+
"""
|
| 116 |
+
Compute a 0-1 recency score.
|
| 117 |
+
Recent items score higher. Uses exponential decay with ~30 day half-life.
|
| 118 |
+
"""
|
| 119 |
+
if not timestamp_str or timestamp_str == "Unknown":
|
| 120 |
+
return 0.0
|
| 121 |
+
|
| 122 |
+
try:
|
| 123 |
+
from dateutil import parser as date_parser
|
| 124 |
+
|
| 125 |
+
ts = date_parser.parse(timestamp_str)
|
| 126 |
+
now = datetime.datetime.now()
|
| 127 |
+
age_days = max(0, (now - ts).total_seconds() / 86400)
|
| 128 |
+
# Exponential decay: half-life of 30 days
|
| 129 |
+
return 2.0 ** (-age_days / 30.0)
|
| 130 |
+
except Exception:
|
| 131 |
+
return 0.0
|
| 132 |
+
|
| 133 |
+
|
| 134 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 135 |
+
# MCP Tools
|
| 136 |
+
# βββββββββββββββββββββββββββββββββββββββοΏ½οΏ½οΏ½βββββββββββββ
|
| 137 |
+
|
| 138 |
+
|
| 139 |
+
@mcp.tool()
|
| 140 |
+
def search_memory(
|
| 141 |
+
query: str,
|
| 142 |
+
top_k: int = DEFAULT_TOP_K,
|
| 143 |
+
time_filter: str = "",
|
| 144 |
+
tags: str = "",
|
| 145 |
+
) -> str:
|
| 146 |
+
"""
|
| 147 |
+
Search Saurab's personal AI memory (past chats, documents, notes, research).
|
| 148 |
+
|
| 149 |
+
Use this tool whenever you need context about Saurab's history,
|
| 150 |
+
past actions, architecture decisions, projects, or personal knowledge.
|
| 151 |
+
|
| 152 |
+
Args:
|
| 153 |
+
query: Semantic search string (e.g. 'sparse communication paper')
|
| 154 |
+
top_k: Maximum results to return (default 5)
|
| 155 |
+
time_filter: Optional time constraint: 'today', 'yesterday',
|
| 156 |
+
'last_week', 'last_month', or an ISO date like '2026-04-10'
|
| 157 |
+
tags: Optional comma-separated tag filter (e.g. 'chatgpt,research')
|
| 158 |
+
"""
|
| 159 |
+
try:
|
| 160 |
+
if _vec_db.count() == 0:
|
| 161 |
+
return (
|
| 162 |
+
"Memory store is empty. "
|
| 163 |
+
"Run `python update.py` to synchronize your data."
|
| 164 |
+
)
|
| 165 |
+
|
| 166 |
+
# 1. Encode query (single vector, ~5ms on CPU)
|
| 167 |
+
query_emb = _model.encode(
|
| 168 |
+
[query], show_progress_bar=False, convert_to_numpy=True
|
| 169 |
+
)
|
| 170 |
+
|
| 171 |
+
# 2. FAISS search β get candidate IDs with semantic scores
|
| 172 |
+
# Fetch extra candidates for re-ranking
|
| 173 |
+
fetch_k = min(top_k * 3, _vec_db.count())
|
| 174 |
+
raw_results = _vec_db.search(query_emb, top_k=fetch_k)
|
| 175 |
+
|
| 176 |
+
if not raw_results:
|
| 177 |
+
return "No matching memories found."
|
| 178 |
+
|
| 179 |
+
# 3. Get metadata for all candidates
|
| 180 |
+
candidate_ids = [r[0] for r in raw_results]
|
| 181 |
+
score_map = {r[0]: r[1] for r in raw_results}
|
| 182 |
+
metadata_list = _db.get_memories_by_ids(candidate_ids)
|
| 183 |
+
|
| 184 |
+
# 4. Apply filters
|
| 185 |
+
filtered = metadata_list
|
| 186 |
+
|
| 187 |
+
# Time filter
|
| 188 |
+
time_range = _resolve_time_filter(time_filter)
|
| 189 |
+
if time_range:
|
| 190 |
+
start_iso, end_iso = time_range
|
| 191 |
+
filtered = [
|
| 192 |
+
m
|
| 193 |
+
for m in filtered
|
| 194 |
+
if m.get("timestamp", "") >= start_iso
|
| 195 |
+
and m.get("timestamp", "") <= end_iso
|
| 196 |
+
]
|
| 197 |
+
|
| 198 |
+
# Tag filter
|
| 199 |
+
if tags:
|
| 200 |
+
tag_set = {t.strip().lower() for t in tags.split(",")}
|
| 201 |
+
filtered = [
|
| 202 |
+
m
|
| 203 |
+
for m in filtered
|
| 204 |
+
if any(
|
| 205 |
+
t in m.get("tags", "").lower() for t in tag_set
|
| 206 |
+
)
|
| 207 |
+
]
|
| 208 |
+
|
| 209 |
+
if not filtered:
|
| 210 |
+
return "No memories matched your filters."
|
| 211 |
+
|
| 212 |
+
# 5. Hybrid scoring: semantic + recency
|
| 213 |
+
scored = []
|
| 214 |
+
for meta in filtered:
|
| 215 |
+
mem_id = meta["id"]
|
| 216 |
+
semantic = score_map.get(mem_id, 0.0)
|
| 217 |
+
recency = _compute_recency_score(meta.get("timestamp", ""))
|
| 218 |
+
hybrid = SEMANTIC_WEIGHT * semantic + RECENCY_WEIGHT * recency
|
| 219 |
+
scored.append((meta, hybrid, semantic))
|
| 220 |
+
|
| 221 |
+
# Sort by hybrid score (descending)
|
| 222 |
+
scored.sort(key=lambda x: x[1], reverse=True)
|
| 223 |
+
|
| 224 |
+
# 6. Format output
|
| 225 |
+
results = scored[:top_k]
|
| 226 |
+
output_lines = [f"=== FARADAY MEMORY ({len(results)} results) ===\n"]
|
| 227 |
+
|
| 228 |
+
for i, (meta, hybrid, semantic) in enumerate(results, 1):
|
| 229 |
+
output_lines.append(
|
| 230 |
+
f"--- Result {i} [Score: {hybrid:.3f}] ---\n"
|
| 231 |
+
f"Source: {meta.get('source', 'Unknown')}\n"
|
| 232 |
+
f"Date: {meta.get('timestamp', 'Unknown')}\n"
|
| 233 |
+
f"Tags: {meta.get('tags', '')}\n"
|
| 234 |
+
f"Semantic: {semantic:.3f} | Recency: "
|
| 235 |
+
f"{_compute_recency_score(meta.get('timestamp', '')):.3f}\n"
|
| 236 |
+
f"Content:\n{meta.get('text', '')}\n"
|
| 237 |
+
)
|
| 238 |
+
|
| 239 |
+
return "\n".join(output_lines)
|
| 240 |
+
|
| 241 |
+
except Exception as e:
|
| 242 |
+
import traceback
|
| 243 |
+
|
| 244 |
+
return f"Memory search failed: {e}\n{traceback.format_exc()}"
|
| 245 |
+
|
| 246 |
+
|
| 247 |
+
@mcp.tool()
|
| 248 |
+
def get_memory_stats() -> str:
|
| 249 |
+
"""
|
| 250 |
+
Get diagnostic statistics about the AI memory store.
|
| 251 |
+
Shows total chunks, vector count, date range, and source count.
|
| 252 |
+
"""
|
| 253 |
+
try:
|
| 254 |
+
stats = _db.get_stats()
|
| 255 |
+
vec_count = _vec_db.count()
|
| 256 |
+
|
| 257 |
+
return (
|
| 258 |
+
f"=== FARADAY MEMORY STATS ===\n"
|
| 259 |
+
f"Total chunks: {stats.get('total', 0)}\n"
|
| 260 |
+
f"FAISS vectors: {vec_count}\n"
|
| 261 |
+
f"Unique sources: {stats.get('sources', 0)}\n"
|
| 262 |
+
f"Date range: {stats.get('earliest', 'N/A')} β "
|
| 263 |
+
f"{stats.get('latest', 'N/A')}\n"
|
| 264 |
+
)
|
| 265 |
+
except Exception as e:
|
| 266 |
+
return f"Stats failed: {e}"
|
| 267 |
+
|
| 268 |
+
|
| 269 |
+
@mcp.tool()
|
| 270 |
+
def sync_memory() -> str:
|
| 271 |
+
"""
|
| 272 |
+
Trigger an incremental memory sync in the background.
|
| 273 |
+
This scans all data directories, processes new documents,
|
| 274 |
+
and updates the FAISS index. Safe to call repeatedly.
|
| 275 |
+
"""
|
| 276 |
+
try:
|
| 277 |
+
|
| 278 |
+
def _run_update():
|
| 279 |
+
try:
|
| 280 |
+
import subprocess
|
| 281 |
+
|
| 282 |
+
subprocess.run(
|
| 283 |
+
[sys.executable, str(Path(PROJECT_ROOT) / "update.py")],
|
| 284 |
+
cwd=PROJECT_ROOT,
|
| 285 |
+
capture_output=True,
|
| 286 |
+
text=True,
|
| 287 |
+
)
|
| 288 |
+
except Exception as e:
|
| 289 |
+
print(f"Background sync error: {e}", file=sys.stderr)
|
| 290 |
+
|
| 291 |
+
threading.Thread(target=_run_update, daemon=True).start()
|
| 292 |
+
|
| 293 |
+
return (
|
| 294 |
+
"β
Memory sync started in background. "
|
| 295 |
+
"New data will be available after processing completes. "
|
| 296 |
+
"Run get_memory_stats() to verify."
|
| 297 |
+
)
|
| 298 |
+
except Exception as e:
|
| 299 |
+
return f"β Sync failed to start: {e}"
|
| 300 |
+
|
| 301 |
+
|
| 302 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 303 |
+
# Entry Point
|
| 304 |
+
# βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 305 |
+
|
| 306 |
+
if __name__ == "__main__":
|
| 307 |
+
mcp.run(transport="stdio")
|
processing/__init__.py
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
processing β Text cleaning and chunking pipeline.
|
| 3 |
+
"""
|
processing/chunker.py
ADDED
|
@@ -0,0 +1,84 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
processing.chunker β Paragraph-aware text chunker with overlap.
|
| 3 |
+
|
| 4 |
+
Splits text into ~300-word chunks, preferring paragraph boundaries.
|
| 5 |
+
Falls back to word-level splitting for very long paragraphs.
|
| 6 |
+
Maintains configurable word overlap between adjacent chunks
|
| 7 |
+
to preserve context continuity for embeddings.
|
| 8 |
+
"""
|
| 9 |
+
|
| 10 |
+
import re
|
| 11 |
+
from typing import List
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
def chunk_text(
|
| 15 |
+
text: str,
|
| 16 |
+
max_words: int = 300,
|
| 17 |
+
overlap: int = 50,
|
| 18 |
+
min_chunk_words: int = 15,
|
| 19 |
+
) -> List[str]:
|
| 20 |
+
"""
|
| 21 |
+
Split text into chunks of approximately `max_words`, aligned
|
| 22 |
+
to paragraph boundaries where possible.
|
| 23 |
+
|
| 24 |
+
Args:
|
| 25 |
+
text: Input text to chunk.
|
| 26 |
+
max_words: Target maximum words per chunk.
|
| 27 |
+
overlap: Words of overlap between consecutive chunks.
|
| 28 |
+
min_chunk_words: Discard trailing chunks smaller than this.
|
| 29 |
+
|
| 30 |
+
Returns:
|
| 31 |
+
List of text chunks.
|
| 32 |
+
"""
|
| 33 |
+
if not text or not text.strip():
|
| 34 |
+
return []
|
| 35 |
+
|
| 36 |
+
chunks: List[str] = []
|
| 37 |
+
|
| 38 |
+
# Step 1: Split into paragraphs (double-newline separated)
|
| 39 |
+
paragraphs = re.split(r"\n\n+", text)
|
| 40 |
+
|
| 41 |
+
current_words: List[str] = []
|
| 42 |
+
current_len = 0
|
| 43 |
+
|
| 44 |
+
for para in paragraphs:
|
| 45 |
+
words = para.split()
|
| 46 |
+
if not words:
|
| 47 |
+
continue
|
| 48 |
+
|
| 49 |
+
# If this paragraph fits in the current chunk, append it
|
| 50 |
+
if current_len + len(words) <= max_words:
|
| 51 |
+
current_words.extend(words)
|
| 52 |
+
current_len += len(words)
|
| 53 |
+
else:
|
| 54 |
+
# Flush current chunk if it has content
|
| 55 |
+
if current_words:
|
| 56 |
+
chunks.append(" ".join(current_words))
|
| 57 |
+
# Retain overlap from end of current chunk
|
| 58 |
+
if overlap > 0:
|
| 59 |
+
current_words = current_words[-overlap:]
|
| 60 |
+
current_len = len(current_words)
|
| 61 |
+
else:
|
| 62 |
+
current_words = []
|
| 63 |
+
current_len = 0
|
| 64 |
+
|
| 65 |
+
# If paragraph itself exceeds max_words, split strictly
|
| 66 |
+
if len(words) > max_words:
|
| 67 |
+
i = 0
|
| 68 |
+
while i < len(words):
|
| 69 |
+
sub = words[i : i + max_words]
|
| 70 |
+
chunks.append(" ".join(sub))
|
| 71 |
+
step = max(1, max_words - overlap)
|
| 72 |
+
i += step
|
| 73 |
+
# Reset after processing oversized paragraph
|
| 74 |
+
current_words = []
|
| 75 |
+
current_len = 0
|
| 76 |
+
else:
|
| 77 |
+
current_words.extend(words)
|
| 78 |
+
current_len += len(words)
|
| 79 |
+
|
| 80 |
+
# Flush remaining words if they form a meaningful chunk
|
| 81 |
+
if current_words and len(current_words) >= min_chunk_words:
|
| 82 |
+
chunks.append(" ".join(current_words))
|
| 83 |
+
|
| 84 |
+
return chunks
|
processing/cleaner.py
ADDED
|
@@ -0,0 +1,60 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
processing.cleaner β Text normalization and deduplication hashing.
|
| 3 |
+
|
| 4 |
+
Handles:
|
| 5 |
+
- Unicode normalization (NFC)
|
| 6 |
+
- HTML tag stripping
|
| 7 |
+
- Control character removal
|
| 8 |
+
- Excessive whitespace collapse
|
| 9 |
+
- Minimum length filtering
|
| 10 |
+
- SHA-256 content hashing for deduplication
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
import hashlib
|
| 14 |
+
import re
|
| 15 |
+
import unicodedata
|
| 16 |
+
from typing import Optional
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
def clean_text(text: str) -> Optional[str]:
|
| 20 |
+
"""
|
| 21 |
+
Normalize and clean text for embedding.
|
| 22 |
+
|
| 23 |
+
Returns cleaned text, or None if the result is too short
|
| 24 |
+
to be meaningful (< 30 characters after cleaning).
|
| 25 |
+
"""
|
| 26 |
+
if not text:
|
| 27 |
+
return None
|
| 28 |
+
|
| 29 |
+
# 1. Unicode normalize to NFC (compose characters)
|
| 30 |
+
text = unicodedata.normalize("NFC", text)
|
| 31 |
+
|
| 32 |
+
# 2. Strip residual HTML tags (from Gemini exports, etc.)
|
| 33 |
+
text = re.sub(r"<[^>]+>", " ", text)
|
| 34 |
+
|
| 35 |
+
# 3. Remove control characters (except newlines and tabs)
|
| 36 |
+
text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]", "", text)
|
| 37 |
+
|
| 38 |
+
# 4. Collapse excessive newlines (3+ β 2)
|
| 39 |
+
text = re.sub(r"\n{3,}", "\n\n", text)
|
| 40 |
+
|
| 41 |
+
# 5. Collapse excessive spaces (3+ β 2)
|
| 42 |
+
text = re.sub(r" {3,}", " ", text)
|
| 43 |
+
|
| 44 |
+
# 6. Strip leading/trailing whitespace
|
| 45 |
+
text = text.strip()
|
| 46 |
+
|
| 47 |
+
# 7. Minimum length gate
|
| 48 |
+
if len(text) < 30:
|
| 49 |
+
return None
|
| 50 |
+
|
| 51 |
+
return text
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
def compute_hash(content: str) -> str:
|
| 55 |
+
"""
|
| 56 |
+
Generate a deterministic SHA-256 hash for content deduplication.
|
| 57 |
+
Two identical chunks will always produce the same hash,
|
| 58 |
+
preventing re-embedding on repeated updates.
|
| 59 |
+
"""
|
| 60 |
+
return hashlib.sha256(content.encode("utf-8")).hexdigest()
|
requirements.txt
ADDED
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Core
|
| 2 |
+
faiss-cpu>=1.7.4
|
| 3 |
+
sentence-transformers>=2.2.0
|
| 4 |
+
numpy>=1.24.0
|
| 5 |
+
|
| 6 |
+
# MCP Protocol
|
| 7 |
+
mcp[cli]>=1.0.0
|
| 8 |
+
|
| 9 |
+
# Ingestion β PDF
|
| 10 |
+
PyPDF2>=3.0.0
|
| 11 |
+
|
| 12 |
+
# Ingestion β OCR (optional, requires Tesseract installed)
|
| 13 |
+
Pillow>=10.0.0
|
| 14 |
+
pytesseract>=0.3.10
|
| 15 |
+
|
| 16 |
+
# Ingestion β Gemini HTML takeout
|
| 17 |
+
beautifulsoup4>=4.12.0
|
| 18 |
+
lxml>=4.9.0
|
| 19 |
+
|
| 20 |
+
# Time-aware queries
|
| 21 |
+
python-dateutil>=2.8.0
|
| 22 |
+
|
| 23 |
+
# Cloud sync
|
| 24 |
+
supabase>=2.0.0
|
| 25 |
+
|
| 26 |
+
# Cloud deployment (SSE transport)
|
| 27 |
+
starlette>=0.27.0
|
| 28 |
+
uvicorn>=0.23.0
|
| 29 |
+
httpx>=0.24.0
|
sync.py
ADDED
|
@@ -0,0 +1,254 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
sync.py β Cloud sync for Faraday AI Memory.
|
| 3 |
+
|
| 4 |
+
Pushes the local SQLite database and FAISS index to Supabase Storage,
|
| 5 |
+
enabling the cloud-hosted MCP server on Cloud Run to serve queries
|
| 6 |
+
from the same data.
|
| 7 |
+
|
| 8 |
+
Uses httpx directly (no heavy Supabase SDK) for maximum compatibility.
|
| 9 |
+
|
| 10 |
+
Usage:
|
| 11 |
+
python sync.py push # Upload local DB + index to cloud
|
| 12 |
+
python sync.py pull # Download cloud DB + index to local
|
| 13 |
+
python sync.py status # Check what's in the cloud bucket
|
| 14 |
+
|
| 15 |
+
Requires SUPABASE_URL and SUPABASE_KEY in config.py or env vars.
|
| 16 |
+
"""
|
| 17 |
+
|
| 18 |
+
import os
|
| 19 |
+
import sys
|
| 20 |
+
from pathlib import Path
|
| 21 |
+
|
| 22 |
+
# Ensure project root is on path
|
| 23 |
+
sys.path.insert(0, str(Path(__file__).parent.resolve()))
|
| 24 |
+
|
| 25 |
+
from config import (
|
| 26 |
+
FAISS_INDEX_PATH,
|
| 27 |
+
SQLITE_DB_PATH,
|
| 28 |
+
SUPABASE_BUCKET,
|
| 29 |
+
SUPABASE_KEY,
|
| 30 |
+
SUPABASE_URL,
|
| 31 |
+
)
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
def _check_credentials():
|
| 35 |
+
"""Validate Supabase credentials are available."""
|
| 36 |
+
if not SUPABASE_URL or not SUPABASE_KEY:
|
| 37 |
+
print(
|
| 38 |
+
"β Error: Supabase credentials not configured.\n"
|
| 39 |
+
"Set SUPABASE_URL and SUPABASE_KEY in config.py or environment.\n"
|
| 40 |
+
"\n"
|
| 41 |
+
"Your Supabase project URL looks like:\n"
|
| 42 |
+
" https://qwxagrmoryojholseclm.supabase.co\n"
|
| 43 |
+
"\n"
|
| 44 |
+
"Get your anon key from:\n"
|
| 45 |
+
" Supabase Dashboard β Settings β API β anon public key",
|
| 46 |
+
file=sys.stderr,
|
| 47 |
+
)
|
| 48 |
+
sys.exit(1)
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
def _headers():
|
| 52 |
+
"""Common headers for Supabase Storage API."""
|
| 53 |
+
return {
|
| 54 |
+
"apikey": SUPABASE_KEY,
|
| 55 |
+
"Authorization": f"Bearer {SUPABASE_KEY}",
|
| 56 |
+
}
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
def _ensure_bucket():
|
| 60 |
+
"""Create the storage bucket if it doesn't exist."""
|
| 61 |
+
import httpx
|
| 62 |
+
|
| 63 |
+
# Check if bucket exists
|
| 64 |
+
r = httpx.get(
|
| 65 |
+
f"{SUPABASE_URL}/storage/v1/bucket/{SUPABASE_BUCKET}",
|
| 66 |
+
headers=_headers(),
|
| 67 |
+
timeout=30,
|
| 68 |
+
)
|
| 69 |
+
|
| 70 |
+
if r.status_code == 200:
|
| 71 |
+
print(f" Bucket '{SUPABASE_BUCKET}' exists.", file=sys.stderr)
|
| 72 |
+
return True
|
| 73 |
+
|
| 74 |
+
# Create bucket
|
| 75 |
+
print(f" Creating bucket '{SUPABASE_BUCKET}'...", file=sys.stderr)
|
| 76 |
+
r = httpx.post(
|
| 77 |
+
f"{SUPABASE_URL}/storage/v1/bucket",
|
| 78 |
+
headers={**_headers(), "Content-Type": "application/json"},
|
| 79 |
+
json={
|
| 80 |
+
"id": SUPABASE_BUCKET,
|
| 81 |
+
"name": SUPABASE_BUCKET,
|
| 82 |
+
"public": False,
|
| 83 |
+
"file_size_limit": 200 * 1024 * 1024, # 200 MB
|
| 84 |
+
},
|
| 85 |
+
timeout=30,
|
| 86 |
+
)
|
| 87 |
+
|
| 88 |
+
if r.status_code in (200, 201):
|
| 89 |
+
print(f" β
Bucket created.", file=sys.stderr)
|
| 90 |
+
return True
|
| 91 |
+
else:
|
| 92 |
+
print(
|
| 93 |
+
f" β οΈ Bucket creation response ({r.status_code}): {r.text}",
|
| 94 |
+
file=sys.stderr,
|
| 95 |
+
)
|
| 96 |
+
# May already exist, try anyway
|
| 97 |
+
return True
|
| 98 |
+
|
| 99 |
+
|
| 100 |
+
def push():
|
| 101 |
+
"""Upload local database and FAISS index (compressed) to Supabase Storage."""
|
| 102 |
+
import httpx
|
| 103 |
+
import gzip
|
| 104 |
+
|
| 105 |
+
_check_credentials()
|
| 106 |
+
_ensure_bucket()
|
| 107 |
+
|
| 108 |
+
files_to_upload = {
|
| 109 |
+
"memory.db": SQLITE_DB_PATH,
|
| 110 |
+
"memory.index": FAISS_INDEX_PATH,
|
| 111 |
+
}
|
| 112 |
+
|
| 113 |
+
for remote_name, local_path in files_to_upload.items():
|
| 114 |
+
path = Path(local_path)
|
| 115 |
+
if not path.exists():
|
| 116 |
+
print(f" [SKIP] {remote_name} β local file not found at {path}",
|
| 117 |
+
file=sys.stderr)
|
| 118 |
+
continue
|
| 119 |
+
|
| 120 |
+
raw_size_mb = path.stat().st_size / (1024 * 1024)
|
| 121 |
+
|
| 122 |
+
# We always compress before uploading
|
| 123 |
+
compressed_name = f"{remote_name}.gz"
|
| 124 |
+
print(f" Compressing {remote_name} ({raw_size_mb:.1f} MB)...", file=sys.stderr)
|
| 125 |
+
|
| 126 |
+
with open(path, "rb") as f_in:
|
| 127 |
+
file_content = gzip.compress(f_in.read())
|
| 128 |
+
|
| 129 |
+
comp_size_mb = len(file_content) / (1024 * 1024)
|
| 130 |
+
print(f" Uploading {compressed_name} ({comp_size_mb:.1f} MB)...", file=sys.stderr)
|
| 131 |
+
|
| 132 |
+
# Try to remove existing file first (upsert)
|
| 133 |
+
httpx.delete(
|
| 134 |
+
f"{SUPABASE_URL}/storage/v1/object/{SUPABASE_BUCKET}/{compressed_name}",
|
| 135 |
+
headers=_headers(),
|
| 136 |
+
timeout=30,
|
| 137 |
+
)
|
| 138 |
+
|
| 139 |
+
# Upload compressed file
|
| 140 |
+
r = httpx.post(
|
| 141 |
+
f"{SUPABASE_URL}/storage/v1/object/{SUPABASE_BUCKET}/{compressed_name}",
|
| 142 |
+
headers={
|
| 143 |
+
**_headers(),
|
| 144 |
+
"Content-Type": "application/gzip",
|
| 145 |
+
"x-upsert": "true",
|
| 146 |
+
},
|
| 147 |
+
content=file_content,
|
| 148 |
+
timeout=120,
|
| 149 |
+
)
|
| 150 |
+
|
| 151 |
+
if r.status_code in (200, 201):
|
| 152 |
+
print(f" β
{compressed_name} uploaded successfully.", file=sys.stderr)
|
| 153 |
+
else:
|
| 154 |
+
print(
|
| 155 |
+
f" β {compressed_name} upload failed ({r.status_code}): {r.text}",
|
| 156 |
+
file=sys.stderr,
|
| 157 |
+
)
|
| 158 |
+
|
| 159 |
+
print("\nβ
Cloud sync (push) complete.", file=sys.stderr)
|
| 160 |
+
|
| 161 |
+
|
| 162 |
+
def pull():
|
| 163 |
+
"""Download database and FAISS index from Supabase Storage."""
|
| 164 |
+
import httpx
|
| 165 |
+
|
| 166 |
+
_check_credentials()
|
| 167 |
+
|
| 168 |
+
files_to_download = {
|
| 169 |
+
"memory.db": SQLITE_DB_PATH,
|
| 170 |
+
"memory.index": FAISS_INDEX_PATH,
|
| 171 |
+
}
|
| 172 |
+
|
| 173 |
+
for remote_name, local_path in files_to_download.items():
|
| 174 |
+
print(f" Downloading {remote_name}...", file=sys.stderr)
|
| 175 |
+
|
| 176 |
+
try:
|
| 177 |
+
r = httpx.get(
|
| 178 |
+
f"{SUPABASE_URL}/storage/v1/object/{SUPABASE_BUCKET}/{remote_name}",
|
| 179 |
+
headers=_headers(),
|
| 180 |
+
timeout=120,
|
| 181 |
+
)
|
| 182 |
+
|
| 183 |
+
if r.status_code != 200:
|
| 184 |
+
print(
|
| 185 |
+
f" β {remote_name} download failed ({r.status_code}): {r.text}",
|
| 186 |
+
file=sys.stderr,
|
| 187 |
+
)
|
| 188 |
+
continue
|
| 189 |
+
|
| 190 |
+
# Ensure parent directory exists
|
| 191 |
+
Path(local_path).parent.mkdir(parents=True, exist_ok=True)
|
| 192 |
+
|
| 193 |
+
with open(local_path, "wb") as f:
|
| 194 |
+
f.write(r.content)
|
| 195 |
+
|
| 196 |
+
size_mb = len(r.content) / (1024 * 1024)
|
| 197 |
+
print(
|
| 198 |
+
f" β
{remote_name} downloaded ({size_mb:.1f} MB).",
|
| 199 |
+
file=sys.stderr,
|
| 200 |
+
)
|
| 201 |
+
except Exception as e:
|
| 202 |
+
print(f" β {remote_name}: {e}", file=sys.stderr)
|
| 203 |
+
|
| 204 |
+
print("\nβ
Cloud sync (pull) complete.", file=sys.stderr)
|
| 205 |
+
|
| 206 |
+
|
| 207 |
+
def status():
|
| 208 |
+
"""Check what files exist in the Supabase Storage bucket."""
|
| 209 |
+
import httpx
|
| 210 |
+
|
| 211 |
+
_check_credentials()
|
| 212 |
+
|
| 213 |
+
print(f"\nπ¦ Checking bucket '{SUPABASE_BUCKET}'...\n", file=sys.stderr)
|
| 214 |
+
|
| 215 |
+
r = httpx.post(
|
| 216 |
+
f"{SUPABASE_URL}/storage/v1/object/list/{SUPABASE_BUCKET}",
|
| 217 |
+
headers={**_headers(), "Content-Type": "application/json"},
|
| 218 |
+
json={"prefix": "", "limit": 100},
|
| 219 |
+
timeout=30,
|
| 220 |
+
)
|
| 221 |
+
|
| 222 |
+
if r.status_code != 200:
|
| 223 |
+
print(f" β Failed ({r.status_code}): {r.text}", file=sys.stderr)
|
| 224 |
+
return
|
| 225 |
+
|
| 226 |
+
files = r.json()
|
| 227 |
+
if not files:
|
| 228 |
+
print(" π Bucket is empty. Run 'python sync.py push' first.",
|
| 229 |
+
file=sys.stderr)
|
| 230 |
+
return
|
| 231 |
+
|
| 232 |
+
print(f" {'File':<20} {'Size':>10} {'Last Modified'}", file=sys.stderr)
|
| 233 |
+
print(f" {'β'*20} {'β'*10} {'β'*20}", file=sys.stderr)
|
| 234 |
+
|
| 235 |
+
for f in files:
|
| 236 |
+
name = f.get("name", "?")
|
| 237 |
+
size = f.get("metadata", {}).get("size", 0)
|
| 238 |
+
size_str = f"{size / (1024*1024):.1f} MB" if size else "?"
|
| 239 |
+
updated = f.get("updated_at", "?")[:19]
|
| 240 |
+
print(f" {name:<20} {size_str:>10} {updated}", file=sys.stderr)
|
| 241 |
+
|
| 242 |
+
|
| 243 |
+
if __name__ == "__main__":
|
| 244 |
+
if len(sys.argv) < 2 or sys.argv[1] not in ("push", "pull", "status"):
|
| 245 |
+
print("Usage: python sync.py [push|pull|status]")
|
| 246 |
+
sys.exit(1)
|
| 247 |
+
|
| 248 |
+
action = sys.argv[1]
|
| 249 |
+
if action == "push":
|
| 250 |
+
push()
|
| 251 |
+
elif action == "pull":
|
| 252 |
+
pull()
|
| 253 |
+
elif action == "status":
|
| 254 |
+
status()
|
test_search.py
ADDED
|
@@ -0,0 +1,53 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Quick smoke test for the search pipeline."""
|
| 2 |
+
import sys
|
| 3 |
+
import os
|
| 4 |
+
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
|
| 5 |
+
|
| 6 |
+
os.environ['HF_HUB_DISABLE_PROGRESS_BARS'] = '1'
|
| 7 |
+
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
|
| 8 |
+
os.environ['TRANSFORMERS_VERBOSITY'] = 'error'
|
| 9 |
+
|
| 10 |
+
import logging
|
| 11 |
+
logging.getLogger('sentence_transformers').setLevel(logging.ERROR)
|
| 12 |
+
logging.getLogger('transformers').setLevel(logging.ERROR)
|
| 13 |
+
|
| 14 |
+
from config import EMBEDDING_MODEL
|
| 15 |
+
from database.faiss_db import VectorDB
|
| 16 |
+
from database.sqlite_db import MemoryDB
|
| 17 |
+
from sentence_transformers import SentenceTransformer
|
| 18 |
+
import time
|
| 19 |
+
|
| 20 |
+
db = MemoryDB(readonly=True)
|
| 21 |
+
vec_db = VectorDB()
|
| 22 |
+
model = SentenceTransformer(EMBEDDING_MODEL)
|
| 23 |
+
|
| 24 |
+
print(f"DB: {db.count()} rows | FAISS: {vec_db.count()} vectors")
|
| 25 |
+
|
| 26 |
+
# Test search with timing
|
| 27 |
+
query = "What projects is Saurab working on?"
|
| 28 |
+
t0 = time.perf_counter()
|
| 29 |
+
query_emb = model.encode([query], show_progress_bar=False, convert_to_numpy=True)
|
| 30 |
+
t_encode = time.perf_counter() - t0
|
| 31 |
+
|
| 32 |
+
t0 = time.perf_counter()
|
| 33 |
+
results = vec_db.search(query_emb, top_k=3)
|
| 34 |
+
t_search = time.perf_counter() - t0
|
| 35 |
+
|
| 36 |
+
print(f"\nQuery: '{query}'")
|
| 37 |
+
print(f"Encode: {t_encode*1000:.1f}ms | Search: {t_search*1000:.1f}ms")
|
| 38 |
+
print(f"Results: {len(results)}")
|
| 39 |
+
|
| 40 |
+
ids = [r[0] for r in results]
|
| 41 |
+
metas = db.get_memories_by_ids(ids)
|
| 42 |
+
for i, m in enumerate(metas):
|
| 43 |
+
score = results[i][1]
|
| 44 |
+
source = m.get("source", "?")
|
| 45 |
+
text_preview = m.get("text", "")[:120].replace("\n", " ")
|
| 46 |
+
print(f" [{i+1}] Score={score:.3f} | {source}")
|
| 47 |
+
print(f" {text_preview}...")
|
| 48 |
+
|
| 49 |
+
# Stats
|
| 50 |
+
stats = db.get_stats()
|
| 51 |
+
print(f"\nStats: {stats}")
|
| 52 |
+
db.close()
|
| 53 |
+
print("\nAll tests passed!")
|
update.py
ADDED
|
@@ -0,0 +1,275 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
update.py β Incremental ingestion pipeline.
|
| 3 |
+
|
| 4 |
+
This is the main entry point for populating the AI memory store.
|
| 5 |
+
Run it on your laptop whenever you have new data.
|
| 6 |
+
|
| 7 |
+
Pipeline:
|
| 8 |
+
1. Scan all configured directories for files
|
| 9 |
+
2. Hash each file to detect changes (skip unchanged files)
|
| 10 |
+
3. Parse files into documents via the ingestion router
|
| 11 |
+
4. Clean and normalize text
|
| 12 |
+
5. Chunk text into ~300-word segments
|
| 13 |
+
6. Hash each chunk to deduplicate against existing DB
|
| 14 |
+
7. Batch-encode new chunks via SentenceTransformer
|
| 15 |
+
8. Insert into SQLite + FAISS
|
| 16 |
+
9. Single FAISS write at end of pipeline
|
| 17 |
+
10. Optional IVF rebuild if threshold crossed
|
| 18 |
+
|
| 19 |
+
Safe to run repeatedly β fully idempotent via hash-based deduplication.
|
| 20 |
+
"""
|
| 21 |
+
|
| 22 |
+
import hashlib
|
| 23 |
+
import os
|
| 24 |
+
import sys
|
| 25 |
+
import time
|
| 26 |
+
from pathlib import Path
|
| 27 |
+
|
| 28 |
+
import numpy as np
|
| 29 |
+
|
| 30 |
+
# Ensure project root is on path
|
| 31 |
+
PROJECT_ROOT = Path(__file__).parent.resolve()
|
| 32 |
+
sys.path.insert(0, str(PROJECT_ROOT))
|
| 33 |
+
|
| 34 |
+
from config import (
|
| 35 |
+
BATCH_SIZE,
|
| 36 |
+
CHUNK_MAX_WORDS,
|
| 37 |
+
CHUNK_OVERLAP_WORDS,
|
| 38 |
+
DATA_RAW,
|
| 39 |
+
EMBEDDING_MODEL,
|
| 40 |
+
OBSIDIAN_SCAN_DIRS,
|
| 41 |
+
SKIP_PATTERNS,
|
| 42 |
+
)
|
| 43 |
+
from database.faiss_db import VectorDB
|
| 44 |
+
from database.sqlite_db import MemoryDB
|
| 45 |
+
from ingestion import process_file
|
| 46 |
+
|
| 47 |
+
# Maximum file size to process (skip very large files like raw CSV dumps)
|
| 48 |
+
MAX_FILE_SIZE_MB = 15
|
| 49 |
+
from processing.chunker import chunk_text
|
| 50 |
+
from processing.cleaner import clean_text, compute_hash
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
def _should_skip(filepath: Path) -> bool:
|
| 54 |
+
"""Check if a file path matches any skip pattern."""
|
| 55 |
+
parts = filepath.parts
|
| 56 |
+
for pattern in SKIP_PATTERNS:
|
| 57 |
+
if any(pattern in p for p in parts):
|
| 58 |
+
return True
|
| 59 |
+
return False
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
def _hash_file(filepath: Path) -> str:
|
| 63 |
+
"""Compute SHA-256 of file contents for change detection."""
|
| 64 |
+
h = hashlib.sha256()
|
| 65 |
+
try:
|
| 66 |
+
with open(filepath, "rb") as f:
|
| 67 |
+
while True:
|
| 68 |
+
chunk = f.read(65536) # 64 KB chunks
|
| 69 |
+
if not chunk:
|
| 70 |
+
break
|
| 71 |
+
h.update(chunk)
|
| 72 |
+
except (OSError, PermissionError):
|
| 73 |
+
return ""
|
| 74 |
+
return h.hexdigest()
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
def _collect_files() -> list:
|
| 78 |
+
"""
|
| 79 |
+
Gather all candidate files from data_raw/ and Obsidian scan directories.
|
| 80 |
+
Deduplicates by absolute path.
|
| 81 |
+
"""
|
| 82 |
+
seen_paths = set()
|
| 83 |
+
files = []
|
| 84 |
+
|
| 85 |
+
# Source 1: data_raw/ (user-dropped exports)
|
| 86 |
+
scan_roots = [DATA_RAW]
|
| 87 |
+
|
| 88 |
+
# Source 2: Obsidian vault directories
|
| 89 |
+
for d in OBSIDIAN_SCAN_DIRS:
|
| 90 |
+
if d.exists():
|
| 91 |
+
scan_roots.append(d)
|
| 92 |
+
|
| 93 |
+
for root_dir in scan_roots:
|
| 94 |
+
if not root_dir.exists():
|
| 95 |
+
continue
|
| 96 |
+
for filepath in root_dir.rglob("*"):
|
| 97 |
+
if not filepath.is_file():
|
| 98 |
+
continue
|
| 99 |
+
if _should_skip(filepath):
|
| 100 |
+
continue
|
| 101 |
+
# Skip files larger than threshold
|
| 102 |
+
try:
|
| 103 |
+
size_mb = filepath.stat().st_size / (1024 * 1024)
|
| 104 |
+
if size_mb > MAX_FILE_SIZE_MB:
|
| 105 |
+
continue
|
| 106 |
+
except OSError:
|
| 107 |
+
continue
|
| 108 |
+
|
| 109 |
+
abs_path = filepath.resolve()
|
| 110 |
+
if abs_path in seen_paths:
|
| 111 |
+
continue
|
| 112 |
+
seen_paths.add(abs_path)
|
| 113 |
+
files.append(filepath)
|
| 114 |
+
|
| 115 |
+
return files
|
| 116 |
+
|
| 117 |
+
|
| 118 |
+
def run_update():
|
| 119 |
+
"""Execute the full incremental update pipeline."""
|
| 120 |
+
t_start = time.time()
|
| 121 |
+
print(f"{'=' * 60}", file=sys.stderr)
|
| 122 |
+
print(f" AI Memory β Incremental Update Pipeline", file=sys.stderr)
|
| 123 |
+
print(f"{'=' * 60}", file=sys.stderr)
|
| 124 |
+
|
| 125 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 126 |
+
# 1. Collect files
|
| 127 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 128 |
+
files = _collect_files()
|
| 129 |
+
print(f"\n[1/6] Discovered {len(files)} candidate files.", file=sys.stderr)
|
| 130 |
+
|
| 131 |
+
if not files:
|
| 132 |
+
print("No files found. Nothing to do.", file=sys.stderr)
|
| 133 |
+
return
|
| 134 |
+
|
| 135 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 136 |
+
# 2. Initialize storage
|
| 137 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 138 |
+
db = MemoryDB()
|
| 139 |
+
vec_db = VectorDB()
|
| 140 |
+
existing_hashes = db.get_existing_hashes()
|
| 141 |
+
print(
|
| 142 |
+
f"[2/6] Database loaded: {len(existing_hashes)} existing chunks.",
|
| 143 |
+
file=sys.stderr,
|
| 144 |
+
)
|
| 145 |
+
|
| 146 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 147 |
+
# 3. Parse β Clean β Chunk β Deduplicate
|
| 148 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 149 |
+
new_chunks = []
|
| 150 |
+
files_processed = 0
|
| 151 |
+
files_skipped = 0
|
| 152 |
+
docs_parsed = 0
|
| 153 |
+
|
| 154 |
+
for file_idx, filepath in enumerate(files, 1):
|
| 155 |
+
# Progress every 50 files
|
| 156 |
+
if file_idx % 50 == 0 or file_idx == 1:
|
| 157 |
+
print(
|
| 158 |
+
f" Parsing file {file_idx}/{len(files)}: {filepath.name}",
|
| 159 |
+
file=sys.stderr,
|
| 160 |
+
)
|
| 161 |
+
try:
|
| 162 |
+
for document in process_file(filepath):
|
| 163 |
+
docs_parsed += 1
|
| 164 |
+
text = clean_text(document.get("text", ""))
|
| 165 |
+
if not text:
|
| 166 |
+
continue
|
| 167 |
+
|
| 168 |
+
chunks = chunk_text(
|
| 169 |
+
text,
|
| 170 |
+
max_words=CHUNK_MAX_WORDS,
|
| 171 |
+
overlap=CHUNK_OVERLAP_WORDS,
|
| 172 |
+
)
|
| 173 |
+
|
| 174 |
+
for chunk_content in chunks:
|
| 175 |
+
chunk_hash = compute_hash(chunk_content)
|
| 176 |
+
|
| 177 |
+
if chunk_hash in existing_hashes:
|
| 178 |
+
continue # Already embedded β skip
|
| 179 |
+
|
| 180 |
+
# Register hash to prevent intra-run duplicates
|
| 181 |
+
existing_hashes.add(chunk_hash)
|
| 182 |
+
|
| 183 |
+
new_chunks.append(
|
| 184 |
+
{
|
| 185 |
+
"hash": chunk_hash,
|
| 186 |
+
"text": chunk_content,
|
| 187 |
+
"source": document.get("source", filepath.name),
|
| 188 |
+
"timestamp": document.get("timestamp", ""),
|
| 189 |
+
"tags": document.get("tags", ""),
|
| 190 |
+
}
|
| 191 |
+
)
|
| 192 |
+
files_processed += 1
|
| 193 |
+
except Exception as e:
|
| 194 |
+
print(
|
| 195 |
+
f" [WARN] Error processing {filepath.name}: {e}",
|
| 196 |
+
file=sys.stderr,
|
| 197 |
+
)
|
| 198 |
+
files_skipped += 1
|
| 199 |
+
|
| 200 |
+
print(
|
| 201 |
+
f"[3/6] Parsed {docs_parsed} documents from {files_processed} files "
|
| 202 |
+
f"({files_skipped} skipped). Found {len(new_chunks)} NEW unique chunks.",
|
| 203 |
+
file=sys.stderr,
|
| 204 |
+
)
|
| 205 |
+
|
| 206 |
+
if not new_chunks:
|
| 207 |
+
print(
|
| 208 |
+
"\nβ
System is fully synced β no new data to process.",
|
| 209 |
+
file=sys.stderr,
|
| 210 |
+
)
|
| 211 |
+
db.close()
|
| 212 |
+
return
|
| 213 |
+
|
| 214 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 215 |
+
# 4. Lazy-load SentenceTransformer
|
| 216 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 217 |
+
print(f"[4/6] Loading embedding model ({EMBEDDING_MODEL})...", file=sys.stderr)
|
| 218 |
+
from sentence_transformers import SentenceTransformer
|
| 219 |
+
|
| 220 |
+
model = SentenceTransformer(EMBEDDING_MODEL)
|
| 221 |
+
|
| 222 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 223 |
+
# 5. Batch encode + store
|
| 224 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 225 |
+
total = len(new_chunks)
|
| 226 |
+
print(f"[5/6] Encoding & storing {total} chunks...", file=sys.stderr)
|
| 227 |
+
|
| 228 |
+
for batch_start in range(0, total, BATCH_SIZE):
|
| 229 |
+
batch = new_chunks[batch_start : batch_start + BATCH_SIZE]
|
| 230 |
+
texts = [b["text"] for b in batch]
|
| 231 |
+
|
| 232 |
+
# Batch encode
|
| 233 |
+
embeddings = model.encode(
|
| 234 |
+
texts,
|
| 235 |
+
batch_size=BATCH_SIZE,
|
| 236 |
+
show_progress_bar=False,
|
| 237 |
+
convert_to_numpy=True,
|
| 238 |
+
)
|
| 239 |
+
|
| 240 |
+
# Insert metadata into SQLite (returns mapped IDs)
|
| 241 |
+
mapped_ids = db.insert_memories(batch)
|
| 242 |
+
|
| 243 |
+
# Add to FAISS (deferred write)
|
| 244 |
+
vec_db.add_embeddings(embeddings, np.array(mapped_ids))
|
| 245 |
+
|
| 246 |
+
progress = min(batch_start + BATCH_SIZE, total)
|
| 247 |
+
print(
|
| 248 |
+
f" Processed {progress}/{total} chunks...",
|
| 249 |
+
file=sys.stderr,
|
| 250 |
+
)
|
| 251 |
+
|
| 252 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 253 |
+
# 6. Finalize: single FAISS write + optional IVF rebuild
|
| 254 |
+
# βββββββββββββββββββββββββββββββββββββββββββββ
|
| 255 |
+
print(f"[6/6] Saving FAISS index...", file=sys.stderr)
|
| 256 |
+
vec_db.maybe_rebuild_ivf()
|
| 257 |
+
vec_db.save()
|
| 258 |
+
|
| 259 |
+
elapsed = time.time() - t_start
|
| 260 |
+
print(
|
| 261 |
+
f"\n{'=' * 60}\n"
|
| 262 |
+
f" β
Update complete!\n"
|
| 263 |
+
f" New chunks: {total}\n"
|
| 264 |
+
f" Total in DB: {db.count()}\n"
|
| 265 |
+
f" FAISS index: {vec_db.count()} vectors\n"
|
| 266 |
+
f" Time: {elapsed:.1f}s\n"
|
| 267 |
+
f"{'=' * 60}",
|
| 268 |
+
file=sys.stderr,
|
| 269 |
+
)
|
| 270 |
+
|
| 271 |
+
db.close()
|
| 272 |
+
|
| 273 |
+
|
| 274 |
+
if __name__ == "__main__":
|
| 275 |
+
run_update()
|