Spaces: MnemoSenseLab / MnemoSense

Vineetha00 committed on Nov 16, 2025

Commit 6df4ebe (verified) · 1 Parent(s): 031cae4

Upload 8 files

Files changed (8)

README.md +138 -12
app.py +61 -0
embedder.py +31 -0
llm.py +39 -0
rag_store.py +52 -0
requirements.txt +7 -0
stt.py +27 -0
summarize.py +30 -0

README.md CHANGED

@@ -1,12 +1,138 @@
----
-title: MnemoSense
-emoji: 🏃
-colorFrom: gray
-colorTo: red
-sdk: gradio
-sdk_version: 5.49.1
-app_file: app.py
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# MnemoSense: An Artificial Hippocampus for Dementia Patients
+“Helping people remember, stay safe, and live with dignity.”
+## Overview
+MnemoSense is a cognitive-assistive AI system designed to support individuals with dementia, Alzheimer’s, or memory loss. Inspired by the hippocampus — the brain’s memory center — MnemoSense acts as an external memory companion that continuously observes, understands, and remembers daily life.
+A wearable device captures short segments of video and audio, analyzes the surroundings, and transcribes only the meaningful content — not the raw footage. It then creates rich contextual summaries that include what happened, who was involved, and what was discussed.
+## When the user speaks to it, MnemoSense can:
+- Recall what happened, who they interacted with, and what they talked about
+- Provide spoken reminders for medication, meals, and safety
+- Offer situational awareness (where they are, what’s around them)
+- Respond verbally, acting like a kind, always-present companion
+By merging LLMs, speech processing, and situational AI, MnemoSense functions as an artificial hippocampus — helping memory-impaired users remain oriented, autonomous, and safe.
+## Core Idea
+**“Instead of recording your life, it remembers the meaning of it.”**
+Unlike surveillance-based systems that store raw footage, MnemoSense captures 2-minute multimodal (audio + video) windows, transcribes the dialogue, detects context and participants, and stores a semantic summary instead of the full data.
+Each memory entry contains:
+- Who was present (faces or voices recognized)
+- Where the user was (room, indoor/outdoor context)
+- What was discussed (topic-level conversational summary)
+- What actions occurred (activities, reminders, or events)
+This turns the device into a privacy-preserving personal historian — capable of telling users what they did, who they met, and what they talked about, anytime they ask.
+## Technical Architecture
+### System Flow
+**Continuous Multimodal Capture**
+- Captures short synchronized video + audio segments every 120 seconds via webcam or wearable sensors.
+- Performs lightweight situational awareness (scene type, people nearby, ambient conditions).
+**Transcription + Conversation Understanding**
+- Processes speech using OpenAI Whisper (STT).
+- Extracts key topics and conversational intent, summarizing what was said and by whom.
+- Merges conversation and scene information into a single context-rich summary.
+**Semantic Embedding + Vector Storage**
+- Converts summaries into embeddings using Sentence-Transformers.
+- Stores these in a FAISS vector database, forming a searchable “memory space.”
+- Raw video/audio is deleted — only meaning remains.
+**Query → Recall → Response Loop**
+- The user asks, “Who did I talk to today?” or “What did I discuss with my doctor?”
+- The query is embedded and compared against the vector database to retrieve the most relevant “memories.”
+- The top results are passed to GPT-4o-mini, which composes a natural, coherent answer.
+- The answer is spoken back using TTS, enabling full voice-in → voice-out recall.
+## Tech Stack
+- **Frontend / UI** — Flask + Vanilla JS (Voice recording & playback)
+- **Video / Audio Capture** — OpenCV · SoundDevice · ffmpeg-python
+- **Speech Recognition (STT)** — OpenAI Whisper
+- **Conversation Summarization** — MMR-based text selection + LLM-assisted dialogue abstraction
+- **Situational Awareness** — OpenCV (scene detection / face cues / motion context)
+- **Embeddings & Retrieval** — Sentence-Transformers · FAISS Vector DB
+- **LLM Reasoning** — OpenAI GPT-4o-mini
+- **Voice Output (TTS)** — macOS `say` / pyttsx3
+- **Backend Orchestration** — Python (continuous threaded ingestion + Flask UI)
+- **Data Handling** — YAML configs · JSONL transcripts · NumPy vector storage
+## Example Interactions
+### Memory Recall
+**User:** “Who did I talk to today?”
+**MnemoSense:** “You spoke with your friend Arjun in the afternoon about your doctor’s visit and evening plans.”
+### Situational Awareness
+**User:** “Where am I right now?”
+**MnemoSense:** “You’re in the living room near the window. The TV is on, and someone is talking to you from the kitchen.”
+### Smart Reminder
+**MnemoSense:** “It’s 8 PM — time for your evening medicine.”
+## Privacy by Design
+- No raw media stored — only text summaries and encrypted embeddings.
+- All processing runs locally on the device (edge-first).
+- User-controlled deletion and retention policies.
+## How to Run
+```bash
+# Clone repository
+git clone https://github.com/K-RAMYA05/MnemoSense.git
+cd MnemoSense
+# Create and activate virtual environment
+python -m venv .venv
+source .venv/bin/activate
+# Install dependencies
+pip install -r requirements.txt
+pip install faiss-cpu sentence-transformers opencv-python ffmpeg-python
+# Configure OpenAI
+export OPENAI_API_KEY=sk-...
+export OPENAI_MODEL=gpt-4o-mini
+# Start continuous memory ingestion
+python -m src.continuous_ingest
+# Launch interactive web interface
+python -m src.web_ui
+```
+## Future Work
+- Jetson-based upgrade: Migrating MnemoSense to an NVIDIA Jetson (e.g., Nano or Orin Nano) would unlock CUDA-accelerated execution for ASR, vision, and LLM components, enabling smoother real-time capture and recall.
+- TensorRT optimization: Converting the Whisper, CLIP/BLIP, and encoder models into TensorRT engines could provide an estimated 2–4× faster inference and lower latency, making continuous multimodal processing more feasible on-device.
+- NVIDIA Riva for speech: Replacing or complementing Whisper with NVIDIA Riva’s streaming ASR and TTS would give MnemoSense a production-grade, low-latency speech interface tuned for edge deployment.
+- NVIDIA NeMo for LLMs: Using NVIDIA NeMo to fine-tune compact LLMs on user-specific memory capsules would enable personalized, privacy-preserving summarization and retrieval logic.
+End result: By leveraging Jetson + CUDA, TensorRT, Riva, and NeMo, MnemoSense can evolve from a CPU-only prototype into a GPU-accelerated, fully on-device “external memory” assistant with richer multimodal understanding, lower latency, and better power efficiency.
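
The README's "Each memory entry contains" list (who, where, what was discussed, what actions occurred) can be pictured as a small record type. The sketch below is only an illustration under stated assumptions: the class name `MemoryEntry` and all of its field names are hypothetical and do not appear in this commit, which stores a plain dict with `id`, `ts`, `tags`, and `text`.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List


@dataclass
class MemoryEntry:
    """One summarized memory. Every field name here is hypothetical."""
    summary: str                                       # what was discussed
    people: List[str] = field(default_factory=list)    # who was present
    place: str = ""                                    # where the user was
    actions: List[str] = field(default_factory=list)   # what actions occurred
    ts: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_jsonl(self) -> str:
        """Serialize to one JSON line; only text is kept, no raw media."""
        return json.dumps(asdict(self), ensure_ascii=False)


entry = MemoryEntry(
    summary="Talked about the doctor's visit and evening plans.",
    people=["Arjun"],
    place="living room",
    actions=["medication reminder"],
)
line = entry.to_jsonl()
```

Each `line` could be appended to a JSONL file in the same way `rag_store.py` appends its metadata rows.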

app.py ADDED

	@@ -0,0 +1,61 @@

+import gradio as gr
+import time, os, uuid, datetime as dt
+from stt import transcribe_file
+from summarize import mmr_summarize
+from rag_store import add_text, search
+from llm import answer
+def ingest(audio_path: str, notes: str):
+    if not audio_path:
+        return "No audio provided.", ""
+    t0 = time.time()
+    text = transcribe_file(audio_path) or ""
+    if not text.strip():
+        return "Couldn't transcribe. Try speaking closer to the mic.", ""
+    summary = mmr_summarize(text, max_sentences=4)
+    meta = {
+        "id": str(uuid.uuid4()),
+        "ts": dt.datetime.now(dt.timezone.utc).isoformat(),
+        "tags": [t.strip() for t in (notes or "").split(",") if t.strip()]
+    }
+    add_text(summary, meta)
+    dt_ms = int((time.time()-t0)*1000)
+    return f"Indexed summary in {dt_ms} ms (text-only).", summary
+def ask(q: str, audio_q: str):
+    query = (q or "").strip()
+    if (not query) and audio_q:
+        query = transcribe_file(audio_q)
+    if not query.strip():
+        return "", "", "Please provide a question (text or audio)."
+    hits = search(query, k=5)
+    ctxs = [h.get("text","") for h in hits]
+    ans = answer(query, ctxs)
+    refs = "\n\n".join([f"- {h.get('text','')[:160]}…" for h in hits])
+    return query, ans, refs if refs else "(no references yet)"
+with gr.Blocks(title="MnemoSense — Spaces Demo") as demo:
+    gr.Markdown("# MnemoSense — Text-only Memory (HF Spaces)")
+    gr.Markdown("**Privacy-first**: Only summaries are stored. Try the **Ingest** tab, then ask questions.")
+    with gr.Tab("Ingest"):
+        with gr.Row():
+            mic = gr.Audio(sources=["microphone","upload"], type="filepath", label="Record or Upload (<= 60s)")
+            notes = gr.Textbox(label="Optional tags (comma-separated)", placeholder="demo, meeting, idea")
+        btn_ingest = gr.Button("Transcribe → Summarize → Index")
+        status = gr.Textbox(label="Status", interactive=False)
+        summary = gr.Textbox(label="Summary stored", lines=4, interactive=False)
+        btn_ingest.click(ingest, inputs=[mic, notes], outputs=[status, summary])
+    with gr.Tab("Ask"):
+        with gr.Row():
+            q = gr.Textbox(label="Question", placeholder="What did we say about the mission?")
+            q_audio = gr.Audio(sources=["microphone","upload"], type="filepath", label="Or ask by voice")
+        btn_ask = gr.Button("Retrieve → Answer")
+        out_q = gr.Textbox(label="You asked", interactive=False)
+        out_ans = gr.Textbox(label="Answer", lines=6, interactive=False)
+        out_refs = gr.Textbox(label="References (summaries)", lines=6, interactive=False)
+        btn_ask.click(ask, inputs=[q, q_audio], outputs=[out_q, out_ans, out_refs])
+if __name__ == "__main__":
+    demo.launch()

embedder.py ADDED

	@@ -0,0 +1,31 @@

+from typing import List
+_model = None
+def _ensure_model():
+    global _model
+    if _model is not None:
+        return
+    try:
+        from sentence_transformers import SentenceTransformer
+        _model = SentenceTransformer("all-MiniLM-L6-v2")
+    except Exception:
+        # ultra-light fallback if sentence-transformers can't load
+        import hashlib, numpy as np
+        class _HashEmb:
+            def encode(self, texts, normalize_embeddings=True):
+                out = []
+                for t in texts:
+                    # SHA-256 yields 256 stable bits; built-in hash() is salted per process
+                    digest = hashlib.sha256(t.encode("utf-8")).digest()
+                    vec = np.unpackbits(np.frombuffer(digest, dtype=np.uint8)).astype(float)
+                    if normalize_embeddings:
+                        vec = vec / (np.linalg.norm(vec) + 1e-8)
+                    out.append(vec)
+                return out
+        _model = _HashEmb()
+def embed_texts(texts: List[str]):
+    if isinstance(texts, str):
+        texts = [texts]
+    _ensure_model()
+    return _model.encode(texts, normalize_embeddings=True)
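
A hashed fallback embedding is only useful for retrieval if the same text maps to the same vector across restarts, because `rag_store.py` persists vectors to `vec.npy`. Python salts the built-in `hash()` for strings per process unless `PYTHONHASHSEED` is fixed, so a digest-based mapping is the safer choice. The following is a dependency-free sketch of that idea; `hash_embed` is a hypothetical name, and such a vector carries no semantic similarity, only stability.

```python
import hashlib
import math
from typing import List


def hash_embed(text: str) -> List[float]:
    """Map text to a unit-length 256-dim vector built from SHA-256 bits."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # 32 bytes
    # Expand each byte into its 8 bits, giving 256 values of 0.0 or 1.0.
    bits = [float((byte >> i) & 1) for byte in digest for i in range(8)]
    norm = math.sqrt(sum(b * b for b in bits)) or 1.0
    return [b / norm for b in bits]


v1 = hash_embed("took evening medicine")
v2 = hash_embed("took evening medicine")
```

Because the digest depends only on the UTF-8 bytes of the text, `v1` and `v2` are identical in any process.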

llm.py ADDED

	@@ -0,0 +1,39 @@

+import os
+from typing import List
+from summarize import mmr_summarize
+def _join(ctxs: List[str], max_chars=4000) -> str:
+    out, used = [], 0
+    for c in (ctxs or []):
+        c = (c or "").strip()
+        if not c: continue
+        if used + len(c) > max_chars: break
+        out.append(c); used += len(c)
+    return "\n\n".join(out) if out else "(no context)"
+def local_answer(question: str, contexts: List[str]) -> str:
+    ctx = _join(contexts, 3000)
+    if not ctx or ctx == "(no context)":
+        return "I don't have enough information yet."
+    return mmr_summarize(ctx, max_sentences=4)
+def openai_answer(question: str, contexts: List[str]) -> str:
+    try:
+        from openai import OpenAI
+        client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+        model = os.getenv("OPENAI_MODEL","gpt-4o-mini")
+        system = "You are MnemoSense. Answer using ONLY the provided context. If insufficient, say you don't know."
+        user = f"Context:\n{_join(contexts)}\n\nQuestion: {question}"
+        resp = client.chat.completions.create(
+            model=model,
+            messages=[{"role":"system","content":system},{"role":"user","content":user}],
+            temperature=0.2,
+        )
+        return (resp.choices[0].message.content or "").strip()
+    except Exception:
+        return "(local) " + local_answer(question, contexts)
+def answer(question: str, contexts: List[str]) -> str:
+    if os.getenv("OPENAI_API_KEY"):
+        return openai_answer(question, contexts)
+    return local_answer(question, contexts)
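
The `_join` helper packs retrieved contexts greedily under a character budget and stops at the first chunk that would overflow. The standalone sketch below mirrors that logic to make the consequence visible; `pack_contexts` is a hypothetical name used only for illustration, and, as in `_join`, the separators added when joining are not counted against the budget.

```python
from typing import Iterable, List


def pack_contexts(chunks: Iterable[str], max_chars: int) -> List[str]:
    """Keep chunks in order until the next one would exceed the budget."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        chunk = (chunk or "").strip()
        if not chunk:
            continue  # skip empty or whitespace-only chunks
        if used + len(chunk) > max_chars:
            break  # stop at the first overflow; later chunks are not considered
        kept.append(chunk)
        used += len(chunk)
    return kept


kept = pack_contexts(["a" * 5, "b" * 10, "c" * 2], max_chars=8)
```

Here the third chunk would have fit, but it is dropped because packing stops at the second. Replacing `break` with `continue` would be one way to keep smaller later chunks, at the cost of no longer preserving a strict prefix of the ranked results.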

rag_store.py ADDED

	@@ -0,0 +1,52 @@

+import os, json
+from typing import List, Dict, Any
+import numpy as np
+from embedder import embed_texts
+BASE = os.path.dirname(__file__)
+DB_DIR = os.path.join(BASE, "data")
+META = os.path.join(DB_DIR, "transcripts.jsonl")
+VEC = os.path.join(DB_DIR, "vec.npy")
+os.makedirs(DB_DIR, exist_ok=True)
+def _load_meta() -> List[Dict[str, Any]]:
+    if not os.path.exists(META): return []
+    with open(META, "r", encoding="utf-8") as f:
+        return [json.loads(line) for line in f if line.strip()]
+def _append_meta(row: Dict[str, Any]):
+    with open(META, "a", encoding="utf-8") as f:
+        f.write(json.dumps(row, ensure_ascii=False)+"\n")
+def _load_vecs() -> np.ndarray:
+    if not os.path.exists(VEC): return np.zeros((0,256), dtype=np.float32)
+    return np.load(VEC)
+def _save_vecs(X: np.ndarray):
+    np.save(VEC, X)
+def add_text(text: str, meta: Dict[str, Any]):
+    X = _load_vecs()
+    emb = embed_texts([text])[0]
+    emb = np.array(emb, dtype=np.float32)
+    emb = emb / (np.linalg.norm(emb)+1e-8)
+    X_new = emb[None, :] if X.size==0 else np.vstack([X, emb])
+    _save_vecs(X_new)
+    _append_meta(meta | {"text": text})
+def search(query: str, k: int = 5) -> List[Dict[str, Any]]:
+    rows = _load_meta()
+    if not rows: return []
+    X = _load_vecs()
+    qv = np.array(embed_texts([query])[0], dtype=np.float32)
+    qv = qv / (np.linalg.norm(qv)+1e-8)
+    sims = (X @ qv).tolist() if X.size else []
+    idx = np.argsort(sims)[::-1][:k] if sims else []
+    hits = []
+    for i in idx:
+        if i < len(rows):
+            r = dict(rows[i])
+            r["score"] = float(sims[i])
+            hits.append(r)
+    return hits
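
The `search` function ranks stored summaries by the dot product of unit-normalized vectors, which is cosine similarity. A dependency-free sketch of the same ranking on toy 2-D vectors is shown below; the helper names `unit` and `top_k` and the toy data are assumptions made for illustration, not part of this commit.

```python
import math
from typing import List, Sequence, Tuple


def unit(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def top_k(query: Sequence[float], rows: List[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (row index, cosine similarity) pairs, best match first."""
    q = unit(query)
    scored = [(i, sum(a * b for a, b in zip(unit(r), q)))
              for i, r in enumerate(rows)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


memories = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
hits = top_k([1.0, 0.1], memories, k=2)
```

The returned indices play the role of row positions in `transcripts.jsonl`, so the ranking only stays meaningful while the vector file and the metadata file have the same number of rows in the same order.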

requirements.txt ADDED

	@@ -0,0 +1,7 @@

+gradio>=4.44.0
+openai>=1.40.0
+openai-whisper
+ffmpeg-python
+numpy
+sentence-transformers
+scipy

stt.py ADDED

	@@ -0,0 +1,27 @@

+import os, subprocess
+def _to_wav16k(path: str) -> str:
+    if path.endswith(".wav"): return path
+    wav = path + ".wav"
+    cmd = ["ffmpeg", "-y", "-i", path, "-ac", "1", "-ar", "16000", wav]
+    subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False)
+    return wav if os.path.exists(wav) and os.path.getsize(wav) > 0 else path
+def transcribe_file(path: str, model_size="base") -> str:
+    # Prefer OpenAI Whisper API (fast on Spaces CPU)
+    if os.getenv("OPENAI_API_KEY"):
+        try:
+            from openai import OpenAI
+            client = OpenAI()
+            wav = _to_wav16k(path)
+            with open(wav, "rb") as f:
+                tr = client.audio.transcriptions.create(model="whisper-1", file=f)
+            return (tr.text or "").strip()
+        except Exception:
+            pass
+    # Fallback: local whisper (may be slow on CPU)
+    import whisper
+    wav = _to_wav16k(path)
+    model = whisper.load_model(model_size)
+    out = model.transcribe(wav, fp16=False)
+    return (out or {}).get("text","").strip()

summarize.py ADDED

	@@ -0,0 +1,30 @@

+import re, numpy as np
+from typing import List
+from embedder import embed_texts
+def split_sentences(text: str) -> List[str]:
+    sents = re.split(r'(?<=[\.\!\?])\s+', text.strip())
+    return [s.strip() for s in sents if s.strip()]
+def mmr_summarize(text: str, max_sentences: int = 4, diversity: float = 0.6) -> str:
+    sents = split_sentences(text)
+    if not sents: return text.strip()
+    if len(sents) <= max_sentences: return " ".join(sents)
+    embs = embed_texts(sents)
+    embs = np.array(embs)
+    centroid = embs.mean(axis=0)
+    centroid = centroid / (np.linalg.norm(centroid) + 1e-8)
+    selected = [int(np.argmax(embs @ centroid))]
+    while len(selected) < max_sentences:
+        best, idx = -1e9, None
+        for i in range(len(sents)):
+            if i in selected: continue
+            rel = float(embs[i] @ centroid)
+            red = max(float(embs[i] @ embs[j]) for j in selected) if selected else 0.0
+            score = diversity*rel - (1-diversity)*red  # `diversity` weights relevance (MMR lambda); lower values penalize redundancy more
+            if score > best:
+                best, idx = score, i
+        if idx is None: break
+        selected.append(idx)
+    selected.sort()
+    return " ".join(sents[i] for i in selected)
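
To see how the greedy score `diversity*rel - (1-diversity)*red` trades relevance against redundancy, here is a small worked sketch on toy data. The function name `mmr_select`, the parameter name `lam` (standing in for `diversity`), and the vectors and relevance values are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mmr_select(vecs: List[Sequence[float]], relevance: List[float],
               k: int, lam: float = 0.6) -> List[int]:
    """Greedy MMR: start from the most relevant item, then repeatedly add
    the item maximizing lam*relevance - (1-lam)*max similarity to picks."""
    selected = [max(range(len(vecs)), key=lambda i: relevance[i])]
    while len(selected) < min(k, len(vecs)):
        best_i, best_score = -1, float("-inf")
        for i in range(len(vecs)):
            if i in selected:
                continue
            redundancy = max(dot(vecs[i], vecs[j]) for j in selected)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return sorted(selected)  # restore original sentence order


# Item 1 is nearly a duplicate of item 0; item 2 is unrelated but less relevant.
vecs = [[1.0, 0.0], [0.98, 0.199], [0.0, 1.0]]
relevance = [0.9, 0.85, 0.5]
balanced = mmr_select(vecs, relevance, k=2, lam=0.6)
relevance_only = mmr_select(vecs, relevance, k=2, lam=1.0)
```

With `lam=0.6` the near-duplicate scores 0.6*0.85 - 0.4*0.98 = 0.118 against 0.30 for the unrelated item, so the unrelated one is chosen; with `lam=1.0` the redundancy term vanishes and the near-duplicate wins.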