Upload 8 files

- README.md +138 -12
- app.py +61 -0
- embedder.py +31 -0
- llm.py +39 -0
- rag_store.py +52 -0
- requirements.txt +7 -0
- stt.py +27 -0
- summarize.py +30 -0

README.md (changed)

# MnemoSense: An Artificial Hippocampus for Dementia Patients

“Helping people remember, stay safe, and live with dignity.”

## Overview

MnemoSense is a cognitive-assistive AI system designed to support individuals with dementia, Alzheimer’s disease, or memory loss. Inspired by the hippocampus, the brain’s memory center, MnemoSense acts as an external memory companion that continuously observes, understands, and remembers daily life.

A wearable device captures short segments of video and audio, analyzes the surroundings, and transcribes only the meaningful content rather than keeping the raw footage. It then creates rich contextual summaries of what happened, who was involved, and what was discussed.

When the user speaks to it, MnemoSense can:

- Recall what happened, who they interacted with, and what they talked about
- Provide spoken reminders for medication, meals, and safety
- Offer situational awareness (where they are, what’s around them)
- Respond verbally, acting like a kind, always-present companion

By combining LLMs, speech processing, and situational AI, MnemoSense functions as an artificial hippocampus, helping memory-impaired users remain oriented, autonomous, and safe.

## Core Idea

**“Instead of recording your life, it remembers the meaning of it.”**

Unlike surveillance-based systems that store raw footage, MnemoSense captures 2-minute multimodal (audio + video) windows, transcribes the dialogue, detects context and participants, and stores a semantic summary instead of the full data.

Each memory entry contains:

- Who was present (faces or voices recognized)
- Where the user was (room, indoor/outdoor context)
- What was discussed (topic-level conversational summary)
- What actions occurred (activities, reminders, or events)

This turns the device into a privacy-preserving personal historian, capable of telling users what they did, who they met, and what they talked about whenever they ask.
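
The Space's `app.py` stores flatter rows (`id`, `ts`, `tags`, `text`), so the four-part entry listed above is not spelled out in this demo's code. A minimal sketch of such a record, assuming a Python dataclass; the class name `MemoryEntry`, its field names, and the sentence produced by `to_text` are hypothetical. `to_text` flattens the entry into one string that an embedder could index, and `to_jsonl` emits one JSON line in the style of `transcripts.jsonl`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class MemoryEntry:
    """Hypothetical record for one capture window (illustrative schema only)."""
    ts: str                                           # timestamp of the window
    people: List[str] = field(default_factory=list)   # who was present
    place: str = ""                                   # where the user was
    topics: str = ""                                  # what was discussed
    actions: List[str] = field(default_factory=list)  # what actions occurred

    def to_text(self) -> str:
        """Flatten the entry into a single string suitable for embedding."""
        who = ", ".join(self.people) or "no one identified"
        did = "; ".join(self.actions) or "no notable actions"
        where = self.place or "unknown place"
        return f"At {self.ts} in the {where} with {who}: {self.topics} Actions: {did}."

    def to_jsonl(self) -> str:
        """Serialize as one JSON line, keeping the flattened text alongside the fields."""
        return json.dumps({**asdict(self), "text": self.to_text()}, ensure_ascii=False)


if __name__ == "__main__":
    entry = MemoryEntry(ts="15:10", people=["Arjun"], place="living room",
                        topics="Talked about the doctor's visit.", actions=["took medicine"])
    print(entry.to_text())
    # At 15:10 in the living room with Arjun: Talked about the doctor's visit. Actions: took medicine.
```

Keeping the structured fields next to the flattened `text` would let a UI filter by person or place while retrieval still works on a single embedding per entry.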

## Technical Architecture

### System Flow

**Continuous Multimodal Capture**

- Captures short synchronized video + audio segments every 120 seconds via a webcam or wearable sensors.
- Performs lightweight situational awareness (scene type, people nearby, ambient conditions).

**Transcription + Conversation Understanding**

- Transcribes speech with OpenAI Whisper (STT).
- Extracts key topics and conversational intent, summarizing what was said and by whom.
- Merges conversation and scene information into a single context-rich summary.

**Semantic Embedding + Vector Storage**

- Converts summaries into embeddings using Sentence-Transformers.
- Stores these in a FAISS vector database, forming a searchable “memory space.”
- Deletes the raw video/audio, so only the meaning remains.

**Query → Recall → Response Loop**

- The user asks, “Who did I talk to today?” or “What did I discuss with my doctor?”
- The query is embedded and compared against the vector database to retrieve the most relevant “memories.”
- The top results are passed to GPT-4o-mini, which composes a natural, coherent answer.
- The answer is spoken back using TTS, enabling full voice-in → voice-out recall.
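
The recall step can be sketched with plain NumPy. This is a minimal sketch, not the project's code: it assumes the stored memory vectors and the query vector are already L2-normalized (as `embed_texts` requests), scores them by brute-force dot product like `rag_store.search` does instead of using FAISS, and the function names `top_k_memories` and `build_prompt` are hypothetical. The prompt layout copies the `Context:` / `Question:` format used in `llm.py`.

```python
import numpy as np
from typing import List, Tuple


def top_k_memories(memory_vecs: np.ndarray, query_vec: np.ndarray, k: int = 5) -> List[Tuple[int, float]]:
    """Return (row index, cosine score) pairs for the k most similar memories.

    Rows of `memory_vecs` and `query_vec` are assumed L2-normalized, so the
    dot product equals cosine similarity.
    """
    if memory_vecs.shape[0] == 0:
        return []
    sims = memory_vecs @ query_vec          # one similarity score per stored memory
    k = min(k, sims.shape[0])
    order = np.argsort(-sims)[:k]           # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]


def build_prompt(question: str, memories: List[str]) -> str:
    """Pack retrieved summaries and the question into one grounded prompt string."""
    context = "\n\n".join(memories) if memories else "(no context)"
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    X = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])  # three toy unit-norm "memories"
    q = np.array([1.0, 0.0])                            # toy query embedding
    print(top_k_memories(X, q, k=2))                    # [(0, 1.0), (2, 0.6)]
```

Brute-force scoring is linear in the number of memories, which is likely fine at the scale of one user's day; an approximate index such as FAISS mainly pays off once the store grows much larger.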

## Tech Stack

- **Frontend / UI**: Flask + Vanilla JS (voice recording and playback)
- **Video / Audio Capture**: OpenCV · SoundDevice · ffmpeg-python
- **Speech Recognition (STT)**: OpenAI Whisper
- **Conversation Summarization**: MMR-based sentence selection + LLM-assisted dialogue abstraction
- **Situational Awareness**: OpenCV (scene detection, face cues, motion context)
- **Embeddings & Retrieval**: Sentence-Transformers · FAISS vector DB
- **LLM Reasoning**: OpenAI GPT-4o-mini
- **Voice Output (TTS)**: macOS `say` / pyttsx3
- **Backend Orchestration**: Python (continuous threaded ingestion + Flask UI)
- **Data Handling**: YAML configs · JSONL transcripts · NumPy vector storage

## Example Interactions

### Memory Recall

**User:** “Who did I talk to today?”

**MnemoSense:** “You spoke with your friend Arjun in the afternoon about your doctor’s visit and evening plans.”

### Situational Awareness

**User:** “Where am I right now?”

**MnemoSense:** “You’re in the living room near the window. The TV is on, and someone is talking to you from the kitchen.”

### Smart Reminder

**MnemoSense:** “It’s 8 PM. Time for your evening medicine.”

## Privacy by Design

- No raw media is stored, only text summaries and encrypted embeddings.
- Processing is designed to run locally on the device (edge-first). In this Space, transcription and answer generation are sent to the OpenAI API when `OPENAI_API_KEY` is set; otherwise local Whisper and an extractive fallback are used.
- Users control deletion and retention policies.
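
In the Space's `rag_store.py`, row *i* of `data/vec.npy` and line *i* of `data/transcripts.jsonl` describe the same memory, so any deletion or retention policy has to drop both together. A minimal sketch of an age-based retention pass, assuming the ISO-8601 `ts` field that `app.py` writes; the function names `prune_older_than` and `_parse_ts` and the 30-day default are hypothetical, and timestamps without a timezone are assumed to be UTC.

```python
import datetime as dt
from typing import Any, Dict, List, Tuple

import numpy as np


def _parse_ts(ts: str) -> dt.datetime:
    """Parse an ISO-8601 timestamp; naive values are assumed to be UTC."""
    t = dt.datetime.fromisoformat(ts)
    return t if t.tzinfo else t.replace(tzinfo=dt.timezone.utc)


def prune_older_than(
    rows: List[Dict[str, Any]],
    vecs: np.ndarray,
    now: dt.datetime,
    max_age_days: int = 30,
) -> Tuple[List[Dict[str, Any]], np.ndarray]:
    """Drop memories older than `max_age_days`, keeping rows and vectors aligned."""
    if len(rows) != vecs.shape[0]:
        raise ValueError("metadata rows and vector rows are out of sync")
    if now.tzinfo is None:
        now = now.replace(tzinfo=dt.timezone.utc)
    cutoff = now - dt.timedelta(days=max_age_days)
    # One boolean mask filters both stores, so row i still matches vector i.
    keep = np.array([_parse_ts(r["ts"]) >= cutoff for r in rows], dtype=bool)
    kept_rows = [r for r, k in zip(rows, keep) if k]
    return kept_rows, vecs[keep]
```

After pruning, both files would need to be rewritten together (`np.save` for the matrix, one `json.dumps` line per kept row), ideally by writing temporary files and swapping them in with `os.replace` so a crash cannot leave the two stores out of step.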

## How to Run

```bash
# Clone the repository
git clone https://github.com/K-RAMYA05/MnemoSense.git
cd MnemoSense

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install faiss-cpu sentence-transformers opencv-python ffmpeg-python

# Configure OpenAI
export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-4o-mini

# Start continuous memory ingestion
python -m src.continuous_ingest

# Launch the interactive web interface
python -m src.web_ui
```

## Future Work

- Jetson-based upgrade: Migrating MnemoSense to an NVIDIA Jetson (e.g., Nano or Orin Nano) would unlock CUDA-accelerated execution for the ASR, vision, and LLM components, enabling smoother real-time capture and recall.
- TensorRT optimization: Converting the Whisper, CLIP/BLIP, and encoder models into TensorRT engines could reduce inference latency (an estimated 2–4× speedup), making continuous multimodal processing feasible on-device.
- NVIDIA Riva for speech: Replacing or complementing Whisper with NVIDIA Riva’s streaming ASR and TTS would give MnemoSense a production-grade, low-latency speech interface tuned for edge deployment.
- NVIDIA NeMo for LLMs: Using NVIDIA NeMo to fine-tune compact LLMs on user-specific memory capsules would enable personalized, privacy-preserving summarization and retrieval logic.

End result: By leveraging Jetson + CUDA, TensorRT, Riva, and NeMo, MnemoSense can evolve from a CPU-only prototype into a GPU-accelerated, fully on-device “external memory” assistant with richer multimodal understanding, lower latency, and better power efficiency.

app.py (added)

import gradio as gr
import time, os, uuid, datetime as dt
from stt import transcribe_file
from summarize import mmr_summarize
from rag_store import add_text, search
from llm import answer

def ingest(audio_path: str, notes: str):
    """Transcribe a clip, summarize it, and index only the summary text."""
    if not audio_path:
        return "No audio provided.", ""
    t0 = time.time()
    text = transcribe_file(audio_path) or ""
    if not text.strip():
        return "Couldn't transcribe. Try speaking closer to the mic.", ""
    summary = mmr_summarize(text, max_sentences=4)
    meta = {
        "id": str(uuid.uuid4()),
        # timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        "ts": dt.datetime.now(dt.timezone.utc).isoformat(),
        # comma-separated tags from the textbox, blanks dropped
        "tags": [t.strip() for t in (notes or "").split(",") if t.strip()],
    }
    add_text(summary, meta)
    dt_ms = int((time.time() - t0) * 1000)
    return f"Indexed summary in {dt_ms} ms (text-only).", summary

def ask(q: str, audio_q: str):
    """Answer a typed (or, if empty, spoken) question from the stored summaries."""
    query = (q or "").strip()
    if (not query) and audio_q:
        query = transcribe_file(audio_q) or ""
    if not query.strip():
        # shown in the Answer box rather than the References box
        return "", "Please provide a question (text or audio).", ""
    hits = search(query, k=5)
    ctxs = [h.get("text", "") for h in hits]
    ans = answer(query, ctxs)
    refs = "\n\n".join([f"- {h.get('text', '')[:160]}…" for h in hits])
    return query, ans, refs if refs else "(no references yet)"

with gr.Blocks(title="MnemoSense: Spaces Demo") as demo:
    gr.Markdown("# MnemoSense: Text-only Memory (HF Spaces)")
    gr.Markdown("**Privacy-first**: Only summaries are stored. Try the **Ingest** tab, then ask questions.")

    with gr.Tab("Ingest"):
        with gr.Row():
            mic = gr.Audio(sources=["microphone", "upload"], type="filepath", label="Record or Upload (<= 60s)")
            notes = gr.Textbox(label="Optional tags (comma-separated)", placeholder="demo, meeting, idea")
        btn_ingest = gr.Button("Transcribe → Summarize → Index")
        status = gr.Textbox(label="Status", interactive=False)
        summary = gr.Textbox(label="Summary stored", lines=4, interactive=False)
        btn_ingest.click(ingest, inputs=[mic, notes], outputs=[status, summary])

    with gr.Tab("Ask"):
        with gr.Row():
            q = gr.Textbox(label="Question", placeholder="What did we say about the mission?")
            q_audio = gr.Audio(sources=["microphone", "upload"], type="filepath", label="Or ask by voice")
        btn_ask = gr.Button("Retrieve → Answer")
        out_q = gr.Textbox(label="You asked", interactive=False)
        out_ans = gr.Textbox(label="Answer", lines=6, interactive=False)
        out_refs = gr.Textbox(label="References (summaries)", lines=6, interactive=False)
        btn_ask.click(ask, inputs=[q, q_audio], outputs=[out_q, out_ans, out_refs])

if __name__ == "__main__":
    demo.launch()

embedder.py (added)

from typing import List
_model = None

def _ensure_model():
    """Lazily load the sentence encoder, falling back to a hash embedding."""
    global _model
    if _model is not None:
        return
    try:
        from sentence_transformers import SentenceTransformer
        _model = SentenceTransformer("all-MiniLM-L6-v2")
    except Exception:
        # ultra-light fallback if sentence-transformers can't load;
        # only identical texts match, there is no semantic similarity
        import hashlib
        import numpy as np
        class _HashEmb:
            def encode(self, texts, normalize_embeddings=True):
                out = []
                for t in texts:
                    # SHA-256 is stable across processes (built-in hash() is salted
                    # per run) and its 32 bytes unpack to exactly 256 bits
                    digest = hashlib.sha256(t.encode("utf-8")).digest()
                    vec = np.unpackbits(np.frombuffer(digest, dtype=np.uint8)).astype(float)
                    if normalize_embeddings:
                        n = np.linalg.norm(vec) + 1e-8
                        vec = vec / n
                    out.append(vec)
                return out
        _model = _HashEmb()

def embed_texts(texts: List[str]):
    if isinstance(texts, str):
        texts = [texts]
    _ensure_model()
    return _model.encode(texts, normalize_embeddings=True)

llm.py (added)

import os
from typing import List
from summarize import mmr_summarize

def _join(ctxs: List[str], max_chars=4000) -> str:
    """Concatenate retrieved contexts in rank order until the character budget is used."""
    out, used = [], 0
    for c in (ctxs or []):
        c = (c or "").strip()
        if not c:
            continue
        if used + len(c) > max_chars:
            break
        out.append(c)
        used += len(c)
    return "\n\n".join(out) if out else "(no context)"

def local_answer(question: str, contexts: List[str]) -> str:
    """Offline fallback: extractive MMR summary of the retrieved context (the question is not used)."""
    ctx = _join(contexts, 3000)
    if not ctx or ctx == "(no context)":
        return "I don't have enough information yet."
    return mmr_summarize(ctx, max_sentences=4)

def openai_answer(question: str, contexts: List[str]) -> str:
    try:
        from openai import OpenAI
        client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        model = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
        system = "You are MnemoSense. Answer using ONLY the provided context. If insufficient, say you don't know."
        user = f"Context:\n{_join(contexts)}\n\nQuestion: {question}"
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
            temperature=0.2,
        )
        return (resp.choices[0].message.content or "").strip()
    except Exception:
        # any API or network failure degrades to the offline extractive answer
        return "(local) " + local_answer(question, contexts)

def answer(question: str, contexts: List[str]) -> str:
    if os.getenv("OPENAI_API_KEY"):
        return openai_answer(question, contexts)
    return local_answer(question, contexts)

rag_store.py (added)

import os, json
from typing import List, Dict, Any
import numpy as np
from embedder import embed_texts

BASE = os.path.dirname(__file__)
DB_DIR = os.path.join(BASE, "data")
META = os.path.join(DB_DIR, "transcripts.jsonl")  # one JSON row per memory
VEC = os.path.join(DB_DIR, "vec.npy")             # row i embeds line i of META

os.makedirs(DB_DIR, exist_ok=True)

def _load_meta() -> List[Dict[str, Any]]:
    if not os.path.exists(META):
        return []
    with open(META, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def _append_meta(row: Dict[str, Any]):
    # ensure_ascii=False writes raw non-ASCII text, so pin the file encoding
    with open(META, "a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

def _load_vecs() -> np.ndarray:
    if not os.path.exists(VEC):
        return np.zeros((0, 256), dtype=np.float32)
    return np.load(VEC)

def _save_vecs(X: np.ndarray):
    np.save(VEC, X)

def add_text(text: str, meta: Dict[str, Any]):
    """Embed `text`, append its unit vector to VEC and its metadata row to META."""
    X = _load_vecs()
    emb = embed_texts([text])[0]
    emb = np.array(emb, dtype=np.float32)
    emb = emb / (np.linalg.norm(emb) + 1e-8)
    X_new = emb[None, :] if X.size == 0 else np.vstack([X, emb])
    _save_vecs(X_new)
    _append_meta({**meta, "text": text})

def search(query: str, k: int = 5) -> List[Dict[str, Any]]:
    """Brute-force cosine search: stored vectors are unit-norm, so X @ qv gives cosine scores."""
    rows = _load_meta()
    if not rows:
        return []
    X = _load_vecs()
    qv = np.array(embed_texts([query])[0], dtype=np.float32)
    qv = qv / (np.linalg.norm(qv) + 1e-8)
    sims = (X @ qv).tolist() if X.size else []
    idx = np.argsort(sims)[::-1][:k] if sims else []
    hits = []
    for i in idx:
        if i < len(rows):  # guard against META and VEC drifting out of sync
            r = dict(rows[i])
            r["score"] = float(sims[i])
            hits.append(r)
    return hits

requirements.txt (added)

gradio>=4.44.0
openai>=1.40.0
openai-whisper
ffmpeg-python
numpy
sentence-transformers
scipy

stt.py (added)

import os, subprocess

_local_models = {}  # cache of loaded local Whisper models, keyed by size

def _to_wav16k(path: str) -> str:
    """Convert to 16 kHz mono WAV with the ffmpeg CLI; return the input path on failure."""
    if path.endswith(".wav"):
        return path
    wav = path + ".wav"
    cmd = ["ffmpeg", "-y", "-i", path, "-ac", "1", "-ar", "16000", wav]
    try:
        subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False)
    except FileNotFoundError:
        # ffmpeg binary is not installed; use the original file unchanged
        return path
    return wav if os.path.exists(wav) and os.path.getsize(wav) > 0 else path

def transcribe_file(path: str, model_size="base") -> str:
    # Prefer the OpenAI Whisper API (fast on Spaces CPU)
    if os.getenv("OPENAI_API_KEY"):
        try:
            from openai import OpenAI
            client = OpenAI()
            wav = _to_wav16k(path)
            with open(wav, "rb") as f:
                tr = client.audio.transcriptions.create(model="whisper-1", file=f)
            return (tr.text or "").strip()
        except Exception:
            pass
    # Fallback: local whisper (may be slow on CPU); load each model size only once
    import whisper
    wav = _to_wav16k(path)
    if model_size not in _local_models:
        _local_models[model_size] = whisper.load_model(model_size)
    out = _local_models[model_size].transcribe(wav, fp16=False)
    return (out or {}).get("text", "").strip()

summarize.py (added)

import re, numpy as np
from typing import List
from embedder import embed_texts

def split_sentences(text: str) -> List[str]:
    sents = re.split(r'(?<=[\.\!\?])\s+', text.strip())
    return [s.strip() for s in sents if s.strip()]

def mmr_summarize(text: str, max_sentences: int = 4, diversity: float = 0.6) -> str:
    """Extractive summary via Maximal Marginal Relevance over sentence embeddings.

    `diversity` acts as the relevance weight (lambda in MMR): higher values favor
    sentences close to the centroid, lower values penalize redundancy more.
    """
    sents = split_sentences(text)
    if not sents:
        return text.strip()
    if len(sents) <= max_sentences:
        return " ".join(sents)
    embs = np.array(embed_texts(sents))
    # the normalized centroid of all sentence embeddings stands in for the text's topic
    centroid = embs.mean(axis=0)
    centroid = centroid / (np.linalg.norm(centroid) + 1e-8)
    selected = [int(np.argmax(embs @ centroid))]
    while len(selected) < max_sentences:
        best, idx = -1e9, None
        for i in range(len(sents)):
            if i in selected:
                continue
            rel = float(embs[i] @ centroid)
            red = max(float(embs[i] @ embs[j]) for j in selected)
            score = diversity * rel - (1 - diversity) * red
            if score > best:
                best, idx = score, i
        if idx is None:
            break
        selected.append(idx)
    selected.sort()  # restore original sentence order for readability
    return " ".join(sents[i] for i in selected)
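
The selection rule in `mmr_summarize` follows the Maximal Marginal Relevance criterion, with the normalized embedding centroid playing the role of the query. Writing $S$ for the sentences already selected, $e_i$ for the unit-norm embedding of sentence $i$, $c$ for the normalized centroid, and $\lambda$ for the `diversity` argument (0.6 by default), the first pick is $\arg\max_i e_i^{\top} c$ and each later step picks:

```latex
i^{*} = \arg\max_{i \notin S} \Big[ \lambda \, e_i^{\top} c \;-\; (1 - \lambda) \max_{j \in S} e_i^{\top} e_j \Big]
```

For example, with $\lambda = 0.6$, a sentence with relevance $e_i^{\top} c = 0.8$ whose closest already-selected sentence has similarity $0.9$ scores $0.6 \cdot 0.8 - 0.4 \cdot 0.9 = 0.48 - 0.36 = 0.12$, so a slightly less relevant but less redundant sentence can overtake it.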