File size: 6,441 Bytes
d8ec0c1
 
 
 
 
 
 
 
 
 
6df4ebe
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
---
title: MnemoSense
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
---
# MnemoSense: An Artificial Hippocampus for Dementia Patients
“Helping people remember, stay safe, and live with dignity.”



## Overview

MnemoSense is a cognitive-assistive AI system designed to support individuals with dementia, Alzheimer’s, or memory loss. Inspired by the hippocampus — the brain’s memory center — MnemoSense acts as an external memory companion that continuously observes, understands, and remembers daily life.

A wearable device captures short segments of video and audio, analyzes the surroundings, and transcribes only the meaningful content — not the raw footage. It then creates rich contextual summaries that include what happened, who was involved, and what was discussed.


## When the user speaks to it, MnemoSense can:

- *Recall what happened, who they interacted with, and what they talked about*
- *Provide spoken reminders for medication, meals, and safety*
- Offer situational awareness (where they are, what’s around them)
- Respond verbally, acting like a kind, always-present companion

By merging LLMs, speech processing, and situational AI, MnemoSense functions as an artificial hippocampus — helping memory-impaired users remain oriented, autonomous, and safe.


## Core Idea

**“Instead of recording your life, it remembers the meaning of it.”**

Unlike surveillance-based systems that store raw footage, MnemoSense captures 2-minute multimodal (audio + video) windows, transcribes the dialogue, detects context and participants, and stores a semantic summary instead of the full data.

Each memory entry contains:

- Who was present (faces or voices recognized)  
- Where the user was (room, indoor/outdoor context)  
- What was discussed (topic-level conversational summary)  
- What actions occurred (activities, reminders, or events)

This turns the device into a privacy-preserving personal historian — capable of telling users what they did, who they met, and what they talked about, anytime they ask.


## Technical Architecture

### System Flow

**Continuous Multimodal Capture**  
- Captures short synchronized video + audio segments every 120 seconds via webcam or wearable sensors.  
- Performs lightweight situational awareness (scene type, people nearby, ambient conditions).

**Transcription + Conversation Understanding**  
- Processes speech using OpenAI Whisper (STT).  
- Extracts key topics and conversational intent, summarizing what was said and by whom.  
- Merges conversation and scene information into a single context-rich summary.

**Semantic Embedding + Vector Storage**  
- Converts summaries into embeddings using Sentence-Transformers.  
- Stores these in a FAISS vector database, forming a searchable “memory space.”  
- Raw video/audio is deleted — only meaning remains.

**Query → Recall → Response Loop**  
- The user asks, “Who did I talk to today?” or “What did I discuss with my doctor?”  
- The query is embedded and compared against the vector database to retrieve the most relevant “memories.”  
- The top results are passed to GPT-4o-mini, which composes a natural, coherent answer.  
- The answer is spoken back using TTS, enabling full voice-in → voice-out recall.


## Tech Stack

- **Frontend / UI** — Flask + Vanilla JS (Voice recording & playback)  
- **Video / Audio Capture** — OpenCV · SoundDevice · ffmpeg-python  
- **Speech Recognition (STT)** — OpenAI Whisper  
- **Conversation Summarization** — MMR-based text selection + LLM-assisted dialogue abstraction  
- **Situational Awareness** — OpenCV (scene detection / face cues / motion context)  
- **Embeddings & Retrieval** — Sentence-Transformers · FAISS Vector DB  
- **LLM Reasoning** — OpenAI GPT-4o-mini  
- **Voice Output (TTS)** — macOS `say` / pyttsx3  
- **Backend Orchestration** — Python (continuous threaded ingestion + Flask UI)  
- **Data Handling** — YAML configs · JSONL transcripts · NumPy vector storage  



## Example Interactions

### Memory Recall
**User:** “Who did I talk to today?”  
**MnemoSense:** “You spoke with your friend Arjun in the afternoon about your doctor’s visit and evening plans.”

### Situational Awareness
**User:** “Where am I right now?”  
**MnemoSense:** “You’re in the living room near the window. The TV is on, and someone is talking to you from the kitchen.”

### Smart Reminder
**MnemoSense:** “It’s 8 PM — time for your evening medicine.”


## Privacy by Design

- No raw media stored — only text summaries and encrypted embeddings.  
- All processing runs locally on the device (edge-first).  
- User-controlled deletion and retention policies.



## How to Run

```bash
# Clone repository
git clone https://github.com/K-RAMYA05/MnemoSense.git
cd MnemoSense-main

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install faiss-cpu sentence-transformers opencv-python ffmpeg-python

# Configure OpenAI
export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-4o-mini

# Start continuous memory ingestion
python -m src.continuous_ingest

# Launch interactive web interface
python -m src.web_ui
```
## Future Work

- Jetson-based upgrade: Migrating MnemoSense to an NVIDIA Jetson (e.g., Nano or Orin Nano) would unlock CUDA-accelerated execution for ASR, vision, and LLM components, enabling smoother real-time capture and recall.

- TensorRT optimization: Converting Whisper-, CLIP/BLIP-, and encoder models into TensorRT engines would provide 2–4× faster inference and lower latency, making continuous multimodal processing feasible on-device.

- NVIDIA Riva for speech: Replacing or complementing Whisper with NVIDIA Riva’s streaming ASR and TTS would give MnemoSense a production-grade, low-latency speech interface tuned for edge deployment.

- NVIDIA NeMo for LLMs: Using NVIDIA NeMo to fine-tune compact LLMs on user-specific memory capsules would enable personalized, privacy-preserving summarization and retrieval logic.

End result: By leveraging Jetson + CUDA, TensorRT, Riva, and NeMo, MnemoSense can evolve from a CPU-only prototype into a GPU-accelerated, fully on-device “external memory” assistant with richer multimodal understanding, lower latency, and better power efficiency.