metadata

title: MnemoSense
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false

MnemoSense: An Artificial Hippocampus for Dementia Patients

“Helping people remember, stay safe, and live with dignity.”

Overview

MnemoSense is a cognitive-assistive AI system designed to support individuals with dementia, Alzheimer’s, or memory loss. Inspired by the hippocampus — the brain’s memory center — MnemoSense acts as an external memory companion that continuously observes, understands, and remembers daily life.

A wearable device captures short segments of video and audio, analyzes the surroundings, and transcribes the speech, retaining only the meaningful content rather than the raw footage. It then creates rich contextual summaries that include what happened, who was involved, and what was discussed.

When the user speaks to it, MnemoSense can:

Recall what happened, who they interacted with, and what they talked about
Provide spoken reminders for medication, meals, and safety
Offer situational awareness (where they are, what’s around them)
Respond verbally, acting like a kind, always-present companion

By merging LLMs, speech processing, and situational AI, MnemoSense functions as an artificial hippocampus — helping memory-impaired users remain oriented, autonomous, and safe.

Core Idea

“Instead of recording your life, it remembers the meaning of it.”

Unlike surveillance-based systems that store raw footage, MnemoSense captures 2-minute multimodal (audio + video) windows, transcribes the dialogue, detects context and participants, and stores a semantic summary instead of the full data.

Each memory entry contains:

Who was present (faces or voices recognized)
Where the user was (room, indoor/outdoor context)
What was discussed (topic-level conversational summary)
What actions occurred (activities, reminders, or events)

This turns the device into a privacy-preserving personal historian — capable of telling users what they did, who they met, and what they talked about, anytime they ask.
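
As an illustration of the memory entry described above, here is a minimal sketch of how one record might be represented and written as a JSONL line. The class name, field names, and sample timestamp are assumptions made for this example and are not taken from the MnemoSense source.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class MemoryEntry:
    """One semantic memory record (hypothetical schema, for illustration only)."""
    timestamp: str                                          # ISO 8601 start of the capture window
    participants: List[str] = field(default_factory=list)   # who was present
    location: str = ""                                      # where the user was
    topics: List[str] = field(default_factory=list)         # what was discussed
    actions: List[str] = field(default_factory=list)        # what actions occurred
    summary: str = ""                                       # text that would later be embedded

    def to_jsonl(self) -> str:
        """Serialize to a single JSON line; no raw audio or video is included."""
        return json.dumps(asdict(self), ensure_ascii=False)


entry = MemoryEntry(
    timestamp="2025-01-01T15:00:00",  # made-up value
    participants=["Arjun"],
    location="living room",
    topics=["doctor's visit", "evening plans"],
    actions=["conversation"],
    summary="Talked with Arjun about the doctor's visit and evening plans.",
)
line = entry.to_jsonl()
restored = MemoryEntry(**json.loads(line))  # round-trip back into the dataclass
```

Keeping each entry to plain text fields like these is what would allow the raw media to be discarded once the summary exists.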

Technical Architecture

System Flow

Continuous Multimodal Capture

Captures short synchronized video + audio segments every 120 seconds via webcam or wearable sensors.
Performs lightweight situational awareness (scene type, people nearby, ambient conditions).
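
The capture cadence above can be sketched as a simple scheduling loop, assuming a new segment starts every 120 seconds. The `capture_segment` and `handle_segment` callables are hypothetical placeholders for the real recorder and the downstream pipeline; the clock and sleep functions are injectable only so the example can run instantly with a fake clock.

```python
import time
from typing import Callable, List

INTERVAL_SECONDS = 120  # spacing between segment starts, as stated in the README


def run_capture_loop(
    capture_segment: Callable[[], object],      # hypothetical recorder call
    handle_segment: Callable[[object], None],   # e.g. transcribe and summarize
    max_windows: int,
    interval_seconds: float = INTERVAL_SECONDS,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Capture `max_windows` segments, starting one every `interval_seconds`."""
    next_start = clock()
    for _ in range(max_windows):
        handle_segment(capture_segment())
        next_start += interval_seconds
        delay = next_start - clock()
        if delay > 0:  # wait only if capture and processing finished early
            sleep(delay)


class FakeClock:
    """Simulated clock so the demo does not actually wait."""
    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
starts: List[float] = []
handled: List[object] = []


def fake_capture() -> str:
    starts.append(fake.now)
    fake.now += 10.0  # pretend each capture takes 10 seconds
    return f"segment@{starts[-1]:.0f}s"


run_capture_loop(fake_capture, handled.append, max_windows=3,
                 clock=fake, sleep=fake.sleep)
```

Scheduling against a running `next_start` rather than sleeping a fixed 120 seconds after each segment keeps the start times from drifting when processing time varies.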

Transcription + Conversation Understanding

Processes speech using OpenAI Whisper (STT).
Extracts key topics and conversational intent, summarizing what was said and by whom.
Merges conversation and scene information into a single context-rich summary.

Semantic Embedding + Vector Storage

Converts summaries into embeddings using Sentence-Transformers.
Stores these in a FAISS vector database, forming a searchable “memory space.”
Raw video/audio is deleted — only meaning remains.
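
To make the storage step concrete, the following is a minimal sketch of a searchable memory space. It deliberately swaps in stand-ins: a brute-force cosine-similarity scan in place of a FAISS index, and a toy bag-of-words function in place of a Sentence-Transformers model. `MemoryStore`, `toy_embed`, and `VOCAB` are hypothetical names for this example.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def normalize(vec: Vector) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class MemoryStore:
    """Brute-force similarity search over summaries (stand-in for a FAISS index)."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self.embed = embed
        self.vectors: List[List[float]] = []
        self.summaries: List[str] = []

    def add(self, summary: str) -> None:
        self.vectors.append(normalize(self.embed(summary)))
        self.summaries.append(summary)

    def search(self, query: str, k: int = 3) -> List[Tuple[float, str]]:
        q = normalize(self.embed(query))
        scored = [
            (sum(a * b for a, b in zip(q, v)), s)
            for v, s in zip(self.vectors, self.summaries)
        ]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


VOCAB = ["doctor", "arjun", "medicine", "lunch", "garden"]


def toy_embed(text: str) -> List[float]:
    """Toy bag-of-words vector; NOT a real sentence embedding."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(w)) for w in VOCAB]


store = MemoryStore(toy_embed)
store.add("Talked with Arjun about the doctor visit")
store.add("Ate lunch in the garden")
store.add("Took evening medicine")
results = store.search("What did the doctor say?", k=1)
```

In a real deployment, `embed` would presumably wrap a Sentence-Transformers model and the list scan would be replaced by a FAISS index; the interface of adding summaries and searching by query text would stay the same.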

Query → Recall → Response Loop

The user asks, “Who did I talk to today?” or “What did I discuss with my doctor?”
The query is embedded and compared against the vector database to retrieve the most relevant “memories.”
The top results are passed to GPT-4o-mini, which composes a natural, coherent answer.
The answer is spoken back using TTS, enabling full voice-in → voice-out recall.
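
The loop above can be sketched as a small orchestration function. This is only an illustration under stated assumptions: `retrieve`, `generate`, and `speak` are hypothetical callables standing in for the vector search, the GPT-4o-mini call, and the TTS engine, and the prompt wording is invented for this example.

```python
from typing import Callable, List


def build_recall_prompt(question: str, memories: List[str]) -> str:
    """Combine retrieved memory summaries and the user's question into one prompt."""
    context = "\n".join(f"- {m}" for m in memories) or "- (no matching memories)"
    return (
        "You are a gentle memory assistant. Answer using only the memories below.\n"
        f"Memories:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer_question(
    question: str,
    retrieve: Callable[[str, int], List[str]],  # hypothetical vector search
    generate: Callable[[str], str],             # hypothetical LLM call
    speak: Callable[[str], None],               # hypothetical TTS call
    k: int = 3,
) -> str:
    """Query -> recall -> response: retrieve memories, compose a reply, speak it."""
    memories = retrieve(question, k)
    reply = generate(build_recall_prompt(question, memories))
    speak(reply)
    return reply


# Stubs so the sketch runs without a vector store, an API key, or audio output.
def fake_retrieve(question: str, k: int) -> List[str]:
    return ["Spoke with Arjun in the afternoon about the doctor's visit."][:k]


def fake_generate(prompt: str) -> str:
    return "You spoke with Arjun this afternoon."


spoken: List[str] = []
reply = answer_question("Who did I talk to today?", fake_retrieve,
                        fake_generate, spoken.append)
prompt = build_recall_prompt("Who did I talk to today?", fake_retrieve("", 1))
```

Passing the three stages in as callables keeps the loop testable offline and makes it straightforward to swap the speech or language components later.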

Tech Stack

Frontend / UI — Flask + Vanilla JS (Voice recording & playback)
Video / Audio Capture — OpenCV · SoundDevice · ffmpeg-python
Speech Recognition (STT) — OpenAI Whisper
Conversation Summarization — MMR-based text selection + LLM-assisted dialogue abstraction
Situational Awareness — OpenCV (scene detection / face cues / motion context)
Embeddings & Retrieval — Sentence-Transformers · FAISS Vector DB
LLM Reasoning — OpenAI GPT-4o-mini
Voice Output (TTS) — macOS say / pyttsx3
Backend Orchestration — Python (continuous threaded ingestion + Flask UI)
Data Handling — YAML configs · JSONL transcripts · NumPy vector storage
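
The MMR-based text selection listed above refers to maximal marginal relevance: sentences are picked one at a time, rewarding relevance to a query and penalizing similarity to sentences already chosen. The sketch below is a generic illustration, not the project's implementation. It uses Jaccard word overlap as a stand-in similarity measure, and the function names and the `lam` weight are assumptions.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word set of a sentence."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity in [0, 1] (stand-in for embedding cosine similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(query: str, sentences: List[str], k: int = 2, lam: float = 0.7) -> List[str]:
    """Greedy MMR: maximize lam * relevance - (1 - lam) * redundancy at each step."""
    q = tokens(query)
    sets = [tokens(s) for s in sentences]
    relevance = [jaccard(q, s) for s in sets]
    selected: List[int] = []
    candidates = list(range(len(sentences)))

    def score(i: int) -> float:
        redundancy = max((jaccard(sets[i], sets[j]) for j in selected), default=0.0)
        return lam * relevance[i] - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


transcript = [
    "The doctor appointment is on Friday",
    "The doctor appointment is on Friday morning",
    "Arjun will drive to the doctor",
]
diverse = mmr_select("doctor appointment", transcript, k=2, lam=0.7)
relevance_only = mmr_select("doctor appointment", transcript, k=2, lam=1.0)
```

With `lam=0.7` the near-duplicate second sentence is skipped in favor of the one about Arjun, while `lam=1.0` reduces to plain relevance ranking and keeps both near-duplicates.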

Example Interactions

Memory Recall

User: “Who did I talk to today?”
MnemoSense: “You spoke with your friend Arjun in the afternoon about your doctor’s visit and evening plans.”

Situational Awareness

User: “Where am I right now?”
MnemoSense: “You’re in the living room near the window. The TV is on, and someone is talking to you from the kitchen.”

Smart Reminder

MnemoSense: “It’s 8 PM — time for your evening medicine.”
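
A reminder like the one above could be triggered by a periodic check against a daily schedule. The following is a minimal sketch; the schedule contents, the function name, and the assumption that both check times fall on the same calendar day are all illustrative.

```python
from datetime import datetime, time
from typing import List, Tuple

Reminder = Tuple[time, str]  # (time of day, spoken message)


def due_reminders(schedule: List[Reminder], last_check: datetime, now: datetime) -> List[str]:
    """Return messages whose time of day falls in (last_check, now].

    Assumes last_check and now are on the same calendar day.
    """
    return [
        message
        for at, message in schedule
        if last_check.time() < at <= now.time()
    ]


schedule: List[Reminder] = [
    (time(8, 0), "Good morning, time for breakfast."),
    (time(20, 0), "It's 8 PM, time for your evening medicine."),
]
due = due_reminders(
    schedule,
    last_check=datetime(2025, 1, 1, 19, 59),  # made-up check times
    now=datetime(2025, 1, 1, 20, 0),
)
```

Using a half-open interval between checks means each reminder fires once, even if the check does not land exactly on the scheduled minute.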

Privacy by Design

No raw media stored — only text summaries and encrypted embeddings.
Processing is designed to run locally on the device (edge-first); the exception is the GPT-4o-mini reasoning step, which calls the OpenAI API.
User-controlled deletion and retention policies.
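
A retention policy of the kind mentioned above could be as simple as dropping entries older than a user-chosen age. This sketch assumes each stored entry is a dictionary with an ISO 8601 `timestamp` field; that field name, the function name, and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def apply_retention(entries: List[Dict], now: datetime, max_age_days: int) -> List[Dict]:
    """Keep only entries whose timestamp is within the last `max_age_days` days."""
    cutoff = now - timedelta(days=max_age_days)
    return [e for e in entries if datetime.fromisoformat(e["timestamp"]) >= cutoff]


entries = [
    {"timestamp": "2025-01-01T09:00:00", "summary": "Morning walk in the garden."},
    {"timestamp": "2025-03-01T09:00:00", "summary": "Talked with Arjun about lunch."},
]
kept = apply_retention(entries, now=datetime(2025, 3, 10), max_age_days=30)
```

Any vectors stored alongside the summaries would need to be pruned in the same pass so that deleted memories cannot still be retrieved.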

How to Run

# Clone repository
git clone https://github.com/K-RAMYA05/MnemoSense.git
cd MnemoSense

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install faiss-cpu sentence-transformers opencv-python ffmpeg-python

# Configure OpenAI
export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-4o-mini

# Start continuous memory ingestion
python -m src.continuous_ingest

# Launch interactive web interface
python -m src.web_ui
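
For reference, the two environment variables exported above might be read on the Python side roughly as follows. This is a sketch rather than the project's actual code: the function name is hypothetical, and falling back to `gpt-4o-mini` when `OPENAI_MODEL` is unset is an assumption based on the value shown in the steps.

```python
import os
from typing import Mapping, Tuple


def load_openai_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (api_key, model) from the environment, failing early if the key is missing."""
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before starting.")
    model = env.get("OPENAI_MODEL", "gpt-4o-mini")  # assumed default
    return api_key, model


# Demo with an explicit mapping instead of the real environment.
demo_key, demo_model = load_openai_settings({"OPENAI_API_KEY": "placeholder-key"})

try:
    load_openai_settings({})
    missing_key_detected = False
except RuntimeError:
    missing_key_detected = True
```

Failing at startup when the key is absent avoids a confusing error later, when the first recall query reaches the LLM step.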

Future Work

Jetson-based upgrade: Migrating MnemoSense to an NVIDIA Jetson (e.g., Nano or Orin Nano) would unlock CUDA-accelerated execution for ASR, vision, and LLM components, enabling smoother real-time capture and recall.
TensorRT optimization: Converting the Whisper, CLIP/BLIP, and encoder models into TensorRT engines could provide faster inference (an estimated 2–4×) and lower latency, making continuous multimodal processing more feasible on-device.
NVIDIA Riva for speech: Replacing or complementing Whisper with NVIDIA Riva’s streaming ASR and TTS would give MnemoSense a production-grade, low-latency speech interface tuned for edge deployment.
NVIDIA NeMo for LLMs: Using NVIDIA NeMo to fine-tune compact LLMs on user-specific memory capsules would enable personalized, privacy-preserving summarization and retrieval logic.

End result: By leveraging Jetson + CUDA, TensorRT, Riva, and NeMo, MnemoSense can evolve from a CPU-only prototype into a GPU-accelerated, fully on-device “external memory” assistant with richer multimodal understanding, lower latency, and better power efficiency.