Vineetha00 committed
Commit 6df4ebe · verified · 1 Parent(s): 031cae4

Upload 8 files

Files changed (8)
  1. README.md +138 -12
  2. app.py +61 -0
  3. embedder.py +31 -0
  4. llm.py +39 -0
  5. rag_store.py +52 -0
  6. requirements.txt +7 -0
  7. stt.py +27 -0
  8. summarize.py +30 -0
README.md CHANGED
@@ -1,12 +1,138 @@
- ---
- title: MnemoSense
- emoji: 🏃
- colorFrom: gray
- colorTo: red
- sdk: gradio
- sdk_version: 5.49.1
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # MnemoSense: An Artificial Hippocampus for Dementia Patients
+ “Helping people remember, stay safe, and live with dignity.”
+
+
+
+ ## Overview
+
+ MnemoSense is a cognitive-assistive AI system designed to support individuals with dementia, Alzheimer’s, or memory loss. Inspired by the hippocampus — the brain’s memory center — MnemoSense acts as an external memory companion that continuously observes, understands, and remembers daily life.
+
+ A wearable device captures short segments of video and audio, analyzes the surroundings, and transcribes only the meaningful content — not the raw footage. It then creates rich contextual summaries that include what happened, who was involved, and what was discussed.
+
+
+ ## When the user speaks to it, MnemoSense can:
+
+ - *Recall what happened, who they interacted with, and what they talked about*
+ - *Provide spoken reminders for medication, meals, and safety*
+ - Offer situational awareness (where they are, what’s around them)
+ - Respond verbally, acting like a kind, always-present companion
+
+ By merging LLMs, speech processing, and situational AI, MnemoSense functions as an artificial hippocampus — helping memory-impaired users remain oriented, autonomous, and safe.
+
+
+ ## Core Idea
+
+ **“Instead of recording your life, it remembers the meaning of it.”**
+
+ Unlike surveillance-based systems that store raw footage, MnemoSense captures 2-minute multimodal (audio + video) windows, transcribes the dialogue, detects context and participants, and stores a semantic summary instead of the full data.
+
+ Each memory entry contains:
+
+ - Who was present (faces or voices recognized)
+ - Where the user was (room, indoor/outdoor context)
+ - What was discussed (topic-level conversational summary)
+ - What actions occurred (activities, reminders, or events)
+
+ This turns the device into a privacy-preserving personal historian — capable of telling users what they did, who they met, and what they talked about, anytime they ask.
+
+
+ ## Technical Architecture
+
+ ### System Flow
+
+ **Continuous Multimodal Capture**
+ - Captures short synchronized video + audio segments every 120 seconds via webcam or wearable sensors.
+ - Performs lightweight situational awareness (scene type, people nearby, ambient conditions).
+
+ **Transcription + Conversation Understanding**
+ - Processes speech using OpenAI Whisper (STT).
+ - Extracts key topics and conversational intent, summarizing what was said and by whom.
+ - Merges conversation and scene information into a single context-rich summary.
+
+ **Semantic Embedding + Vector Storage**
+ - Converts summaries into embeddings using Sentence-Transformers.
+ - Stores these in a FAISS vector database, forming a searchable “memory space.”
+ - Raw video/audio is deleted — only meaning remains.
+
+ **Query → Recall → Response Loop**
+ - The user asks, “Who did I talk to today?” or “What did I discuss with my doctor?”
+ - The query is embedded and compared against the vector database to retrieve the most relevant “memories.”
+ - The top results are passed to GPT-4o-mini, which composes a natural, coherent answer.
+ - The answer is spoken back using TTS, enabling full voice-in → voice-out recall.
+
+
+ ## Tech Stack
+
+ - **Frontend / UI** — Flask + Vanilla JS (Voice recording & playback)
+ - **Video / Audio Capture** — OpenCV · SoundDevice · ffmpeg-python
+ - **Speech Recognition (STT)** — OpenAI Whisper
+ - **Conversation Summarization** — MMR-based text selection + LLM-assisted dialogue abstraction
+ - **Situational Awareness** — OpenCV (scene detection / face cues / motion context)
+ - **Embeddings & Retrieval** — Sentence-Transformers · FAISS Vector DB
+ - **LLM Reasoning** — OpenAI GPT-4o-mini
+ - **Voice Output (TTS)** — macOS `say` / pyttsx3
+ - **Backend Orchestration** — Python (continuous threaded ingestion + Flask UI)
+ - **Data Handling** — YAML configs · JSONL transcripts · NumPy vector storage
+
+
+
+ ## Example Interactions
+
+ ### Memory Recall
+ **User:** “Who did I talk to today?”
+ **MnemoSense:** “You spoke with your friend Arjun in the afternoon about your doctor’s visit and evening plans.”
+
+ ### Situational Awareness
+ **User:** “Where am I right now?”
+ **MnemoSense:** “You’re in the living room near the window. The TV is on, and someone is talking to you from the kitchen.”
+
+ ### Smart Reminder
+ **MnemoSense:** “It’s 8 PM — time for your evening medicine.”
+
+
+ ## Privacy by Design
+
+ - No raw media stored — only text summaries and encrypted embeddings.
+ - All processing runs locally on the device (edge-first).
+ - User-controlled deletion and retention policies.
+
+
+
+ ## How to Run
+
+ ```bash
+ # Clone repository
+ git clone https://github.com/K-RAMYA05/MnemoSense.git
+ cd MnemoSense
+
+ # Create and activate virtual environment
+ python -m venv .venv
+ source .venv/bin/activate
+
+ # Install dependencies
+ pip install -r requirements.txt
+ pip install faiss-cpu sentence-transformers opencv-python ffmpeg-python
+
+ # Configure OpenAI
+ export OPENAI_API_KEY=sk-...
+ export OPENAI_MODEL=gpt-4o-mini
+
+ # Start continuous memory ingestion
+ python -m src.continuous_ingest
+
+ # Launch interactive web interface
+ python -m src.web_ui
+ ```
+ ## Future Work
+
+ - Jetson-based upgrade: Migrating MnemoSense to an NVIDIA Jetson (e.g., Nano or Orin Nano) would unlock CUDA-accelerated execution for ASR, vision, and LLM components, enabling smoother real-time capture and recall.
+
+ - TensorRT optimization: Converting Whisper-, CLIP/BLIP-, and encoder models into TensorRT engines would provide 2–4× faster inference and lower latency, making continuous multimodal processing feasible on-device.
+
+ - NVIDIA Riva for speech: Replacing or complementing Whisper with NVIDIA Riva’s streaming ASR and TTS would give MnemoSense a production-grade, low-latency speech interface tuned for edge deployment.
+
+ - NVIDIA NeMo for LLMs: Using NVIDIA NeMo to fine-tune compact LLMs on user-specific memory capsules would enable personalized, privacy-preserving summarization and retrieval logic.
+
+ End result: By leveraging Jetson + CUDA, TensorRT, Riva, and NeMo, MnemoSense can evolve from a CPU-only prototype into a GPU-accelerated, fully on-device “external memory” assistant with richer multimodal understanding, lower latency, and better power efficiency.
+
+
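Reviewer sketch: the README's "Each memory entry contains" list (who, where, what was discussed, what actions occurred) could map onto one JSONL record per summary. The class name `MemoryEntry` and every field name below are hypothetical; the code in this commit stores only `id`, `ts`, `tags`, and `text`.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class MemoryEntry:
    """Hypothetical record shape for one summarized capture window."""
    ts: str                                                 # ISO-8601 timestamp of the window
    participants: List[str] = field(default_factory=list)   # who was present
    location: str = ""                                      # where the user was
    topics: str = ""                                        # what was discussed
    actions: List[str] = field(default_factory=list)        # what actions occurred

    def to_jsonl(self) -> str:
        # One JSON object per line, non-ASCII names kept readable.
        return json.dumps(asdict(self), ensure_ascii=False)

# Illustrative values only.
entry = MemoryEntry(
    ts="2024-01-01T15:00:00+00:00",
    participants=["Arjun"],
    location="living room",
    topics="doctor's visit and evening plans",
    actions=["medication reminder"],
)
line = entry.to_jsonl()
restored = MemoryEntry(**json.loads(line))
```

Keeping the four README fields separate, instead of folding them into a single `text` string, would let a later query filter by person or place before any embedding search.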
app.py ADDED
@@ -0,0 +1,61 @@
+ import gradio as gr
+ import time, os, uuid, datetime as dt
+ from stt import transcribe_file
+ from summarize import mmr_summarize
+ from rag_store import add_text, search
+ from llm import answer
+
+ def ingest(audio_path: str, notes: str):
+     if not audio_path:
+         return "No audio provided.", ""
+     t0 = time.time()
+     text = transcribe_file(audio_path) or ""
+     if not text.strip():
+         return "Couldn't transcribe. Try speaking closer to the mic.", ""
+     summary = mmr_summarize(text, max_sentences=4)
+     meta = {
+         "id": str(uuid.uuid4()),
+         "ts": dt.datetime.utcnow().isoformat(),
+         "tags": [t.strip() for t in (notes or "").split(",") if t.strip()]
+     }
+     add_text(summary, meta)
+     dt_ms = int((time.time()-t0)*1000)
+     return f"Indexed summary in {dt_ms} ms (text-only).", summary
+
+ def ask(q: str, audio_q: str):
+     query = (q or "").strip()
+     if (not query) and audio_q:
+         query = transcribe_file(audio_q)
+     if not query.strip():
+         return "", "", "Please provide a question (text or audio)."
+     hits = search(query, k=5)
+     ctxs = [h.get("text","") for h in hits]
+     ans = answer(query, ctxs)
+     refs = "\n\n".join([f"- {h.get('text','')[:160]}…" for h in hits])
+     return query, ans, refs if refs else "(no references yet)"
+
+ with gr.Blocks(title="MnemoSense — Spaces Demo") as demo:
+     gr.Markdown("# MnemoSense — Text-only Memory (HF Spaces)")
+     gr.Markdown("**Privacy-first**: Only summaries are stored. Try the **Ingest** tab, then ask questions.")
+
+     with gr.Tab("Ingest"):
+         with gr.Row():
+             mic = gr.Audio(sources=["microphone","upload"], type="filepath", label="Record or Upload (<= 60s)")
+             notes = gr.Textbox(label="Optional tags (comma-separated)", placeholder="demo, meeting, idea")
+         btn_ingest = gr.Button("Transcribe → Summarize → Index")
+         status = gr.Textbox(label="Status", interactive=False)
+         summary = gr.Textbox(label="Summary stored", lines=4, interactive=False)
+         btn_ingest.click(ingest, inputs=[mic, notes], outputs=[status, summary])
+
+     with gr.Tab("Ask"):
+         with gr.Row():
+             q = gr.Textbox(label="Question", placeholder="What did we say about the mission?")
+             q_audio = gr.Audio(sources=["microphone","upload"], type="filepath", label="Or ask by voice")
+         btn_ask = gr.Button("Retrieve → Answer")
+         out_q = gr.Textbox(label="You asked", interactive=False)
+         out_ans = gr.Textbox(label="Answer", lines=6, interactive=False)
+         out_refs = gr.Textbox(label="References (summaries)", lines=6, interactive=False)
+         btn_ask.click(ask, inputs=[q, q_audio], outputs=[out_q, out_ans, out_refs])
+
+ if __name__ == "__main__":
+     demo.launch()
embedder.py ADDED
@@ -0,0 +1,31 @@
+ from typing import List
+ _model = None
+
+ def _ensure_model():
+     global _model
+     if _model is not None:
+         return
+     try:
+         from sentence_transformers import SentenceTransformer
+         _model = SentenceTransformer("all-MiniLM-L6-v2")
+     except Exception:
+         # ultra-light fallback if sentence-transformers can't load
+         import hashlib, numpy as np
+         class _HashEmb:
+             def encode(self, texts, normalize_embeddings=True):
+                 out = []
+                 for t in texts:
+                     h = int.from_bytes(hashlib.sha256(t.encode("utf-8")).digest(), "big")  # 256 bits, stable across runs (built-in hash() is salted per process)
+                     vec = np.array([(h >> i) & 1 for i in range(256)], dtype=float)
+                     if normalize_embeddings:
+                         n = np.linalg.norm(vec) + 1e-8
+                         vec = vec / n
+                     out.append(vec)
+                 return out
+         _model = _HashEmb()
+
+ def embed_texts(texts: List[str]):
+     if isinstance(texts, str):
+         texts = [texts]
+     _ensure_model()
+     return _model.encode(texts, normalize_embeddings=True)
llm.py ADDED
@@ -0,0 +1,39 @@
+ import os
+ from typing import List
+ from summarize import mmr_summarize
+
+ def _join(ctxs: List[str], max_chars=4000) -> str:
+     out, used = [], 0
+     for c in (ctxs or []):
+         c = (c or "").strip()
+         if not c: continue
+         if used + len(c) > max_chars: break
+         out.append(c); used += len(c)
+     return "\n\n".join(out) if out else "(no context)"
+
+ def local_answer(question: str, contexts: List[str]) -> str:
+     ctx = _join(contexts, 3000)
+     if not ctx or ctx == "(no context)":
+         return "I don't have enough information yet."
+     return mmr_summarize(ctx, max_sentences=4)
+
+ def openai_answer(question: str, contexts: List[str]) -> str:
+     try:
+         from openai import OpenAI
+         client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+         model = os.getenv("OPENAI_MODEL","gpt-4o-mini")
+         system = "You are MnemoSense. Answer using ONLY the provided context. If insufficient, say you don't know."
+         user = f"Context:\n{_join(contexts)}\n\nQuestion: {question}"
+         resp = client.chat.completions.create(
+             model=model,
+             messages=[{"role":"system","content":system},{"role":"user","content":user}],
+             temperature=0.2,
+         )
+         return resp.choices[0].message.content.strip()
+     except Exception:
+         return "(local) " + local_answer(question, contexts)
+
+ def answer(question: str, contexts: List[str]) -> str:
+     if os.getenv("OPENAI_API_KEY"):
+         return openai_answer(question, contexts)
+     return local_answer(question, contexts)
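Reviewer sketch: `_join` in llm.py stops at the first context that would exceed the budget, so a single first summary longer than `max_chars` produces "(no context)". One possible alternative, shown here as an assumption and not as the commit's behaviour, is to truncate the overflowing piece so at least part of it reaches the prompt. The name `pack_contexts` is hypothetical.

```python
from typing import List

def pack_contexts(ctxs: List[str], max_chars: int = 4000) -> str:
    """Concatenate contexts up to max_chars, truncating the last piece that fits."""
    out, used = [], 0
    for c in ctxs or []:
        c = (c or "").strip()
        if not c:
            continue
        room = max_chars - used
        if room <= 0:
            break
        piece = c[:room]          # keep as much of this context as the budget allows
        out.append(piece)
        used += len(piece)
    return "\n\n".join(out) if out else "(no context)"

# Illustrative call: the second context is cut to the remaining 5 characters.
packed = pack_contexts(["a" * 10, "b" * 10], max_chars=15)
```

The separator characters are not counted against the budget, which matches how `_join` counts; cutting mid-sentence is the trade-off for never returning an empty context when summaries exist.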
rag_store.py ADDED
@@ -0,0 +1,52 @@
+ import os, json
+ from typing import List, Dict, Any
+ import numpy as np
+ from embedder import embed_texts
+
+ BASE = os.path.dirname(__file__)
+ DB_DIR = os.path.join(BASE, "data")
+ META = os.path.join(DB_DIR, "transcripts.jsonl")
+ VEC = os.path.join(DB_DIR, "vec.npy")
+
+ os.makedirs(DB_DIR, exist_ok=True)
+
+ def _load_meta() -> List[Dict[str, Any]]:
+     if not os.path.exists(META): return []
+     with open(META, "r", encoding="utf-8") as f:
+         return [json.loads(line) for line in f if line.strip()]
+
+ def _append_meta(row: Dict[str, Any]):
+     with open(META, "a", encoding="utf-8") as f:  # explicit UTF-8: rows are written with ensure_ascii=False
+         f.write(json.dumps(row, ensure_ascii=False)+"\n")
+
+ def _load_vecs() -> np.ndarray:
+     if not os.path.exists(VEC): return np.zeros((0,256), dtype=np.float32)
+     return np.load(VEC)
+
+ def _save_vecs(X: np.ndarray):
+     np.save(VEC, X)
+
+ def add_text(text: str, meta: Dict[str, Any]):
+     X = _load_vecs()
+     emb = embed_texts([text])[0]
+     emb = np.array(emb, dtype=np.float32)
+     emb = emb / (np.linalg.norm(emb)+1e-8)
+     X_new = emb[None, :] if X.size==0 else np.vstack([X, emb])
+     _save_vecs(X_new)
+     _append_meta(meta | {"text": text})
+
+ def search(query: str, k: int = 5) -> List[Dict[str, Any]]:
+     rows = _load_meta()
+     if not rows: return []
+     X = _load_vecs()
+     qv = np.array(embed_texts([query])[0], dtype=np.float32)
+     qv = qv / (np.linalg.norm(qv)+1e-8)
+     sims = (X @ qv).tolist() if X.size else []
+     idx = np.argsort(sims)[::-1][:k] if sims else []
+     hits = []
+     for i in idx:
+         if i < len(rows):
+             r = dict(rows[i])
+             r["score"] = float(sims[i])
+             hits.append(r)
+     return hits
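Reviewer sketch: `search` relies on every stored row and the query being L2-normalized, so the matrix-vector product `X @ qv` is a vector of cosine similarities. A minimal standalone illustration with made-up 3-dimensional vectors follows; the helper name `top_k` and the toy data are hypothetical and not part of the commit.

```python
import numpy as np

def top_k(X: np.ndarray, q: np.ndarray, k: int = 5):
    """Return (row index, cosine score) pairs for the k rows of X closest to q."""
    # Normalize rows and query so the dot product equals cosine similarity.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    qn = q / (np.linalg.norm(q) + 1e-8)
    sims = Xn @ qn
    order = np.argsort(sims)[::-1][:k]   # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

# Toy "memory" vectors standing in for stored summary embeddings.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.7, 0.7, 0.0]], dtype=np.float32)
q = np.array([1.0, 0.1, 0.0], dtype=np.float32)
hits = top_k(X, q, k=2)
```

Here row 0 ranks first and row 2 second, since the query points almost entirely along the first axis. A brute-force product like this scans every row on each query, which is fine for a small demo store.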
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ gradio>=4.44.0
+ openai>=1.40.0
+ openai-whisper
+ ffmpeg-python
+ numpy
+ sentence-transformers
+ scipy
stt.py ADDED
@@ -0,0 +1,27 @@
+ import os, subprocess
+
+ def _to_wav16k(path: str) -> str:
+     if path.endswith(".wav"): return path
+     wav = path + ".wav"
+     cmd = ["ffmpeg", "-y", "-i", path, "-ac", "1", "-ar", "16000", wav]
+     subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False)
+     return wav if os.path.exists(wav) and os.path.getsize(wav) > 0 else path
+
+ def transcribe_file(path: str, model_size="base") -> str:
+     # Prefer OpenAI Whisper API (fast on Spaces CPU)
+     if os.getenv("OPENAI_API_KEY"):
+         try:
+             from openai import OpenAI
+             client = OpenAI()
+             wav = _to_wav16k(path)
+             with open(wav, "rb") as f:
+                 tr = client.audio.transcriptions.create(model="whisper-1", file=f)
+             return (tr.text or "").strip()
+         except Exception:
+             pass
+     # Fallback: local whisper (may be slow on CPU)
+     import whisper
+     wav = _to_wav16k(path)
+     model = whisper.load_model(model_size)
+     out = model.transcribe(wav, fp16=False)
+     return (out or {}).get("text","").strip()
summarize.py ADDED
@@ -0,0 +1,30 @@
+ import re, numpy as np
+ from typing import List
+ from embedder import embed_texts
+
+ def split_sentences(text: str) -> List[str]:
+     sents = re.split(r'(?<=[\.\!\?])\s+', text.strip())
+     return [s.strip() for s in sents if s.strip()]
+
+ def mmr_summarize(text: str, max_sentences: int = 4, diversity: float = 0.6) -> str:
+     sents = split_sentences(text)
+     if not sents: return text.strip()
+     if len(sents) <= max_sentences: return " ".join(sents)
+     embs = embed_texts(sents)
+     embs = np.array(embs)
+     centroid = embs.mean(axis=0)
+     centroid = centroid / (np.linalg.norm(centroid) + 1e-8)
+     selected = [int(np.argmax(embs @ centroid))]
+     while len(selected) < max_sentences:
+         best, idx = -1e9, None
+         for i in range(len(sents)):
+             if i in selected: continue
+             rel = float(embs[i] @ centroid)
+             red = max(float(embs[i] @ embs[j]) for j in selected) if selected else 0.0
+             score = diversity*rel - (1-diversity)*red
+             if score > best:
+                 best, idx = score, i
+         if idx is None: break
+         selected.append(idx)
+     selected.sort()
+     return " ".join(sents[i] for i in selected)
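Reviewer sketch: the scoring line in `mmr_summarize` is the usual maximal-marginal-relevance trade-off, where the `diversity` argument weights relevance to the centroid and `1 - diversity` penalizes similarity to sentences already chosen. The toy example below uses hand-made 2-dimensional unit vectors in place of sentence embeddings; `mmr_select` and the data are hypothetical and only mirror the scoring rule above.

```python
import numpy as np

def mmr_select(embs: np.ndarray, k: int, lam: float = 0.6):
    """Greedy MMR over unit-norm rows; returns the chosen indices in original order."""
    centroid = embs.mean(axis=0)
    centroid = centroid / (np.linalg.norm(centroid) + 1e-8)
    rel = embs @ centroid                      # relevance of each row to the centroid
    selected = [int(np.argmax(rel))]           # start with the most central row
    while len(selected) < min(k, len(embs)):
        best, idx = -np.inf, None
        for i in range(len(embs)):
            if i in selected:
                continue
            # Redundancy: highest similarity to anything already selected.
            red = max(float(embs[i] @ embs[j]) for j in selected)
            score = lam * float(rel[i]) - (1 - lam) * red
            if score > best:
                best, idx = score, i
        selected.append(idx)
    return sorted(selected)

# Rows 0 and 1 are exact duplicates; row 2 points in a different direction.
embs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
balanced = mmr_select(embs, k=2, lam=0.6)        # penalizes the duplicate
relevance_only = mmr_select(embs, k=2, lam=1.0)  # ignores redundancy
```

With `lam=0.6` the duplicate row scores about 0.14 against about 0.27 for the distinct row, so the selection skips the duplicate; with `lam=1.0` the redundancy term vanishes and the duplicate is picked.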