Space: AsadIsmail/prism-memory

AsadIsmail committed about 1 month ago

Commit 2801c36 (verified) · 1 parent: 607bb36

Add PRISM-Memory Space demo


Files changed (6)

MEMORY_EXTRACTION_SKILL.md +113 -0
README.md +31 -5
app.py +178 -0
requirements.txt +2 -0
results/confirmed_exp15_summary.json +53 -0
results/scenario_comparisons.json +140 -0

MEMORY_EXTRACTION_SKILL.md ADDED

	@@ -0,0 +1,113 @@

+# PRISM-Memory Extraction Skill
+**Hook:** Turn conversations into durable, searchable memory.
+This is the single extraction skill to keep from the `better_memory` work.
+Public release should point to one checkpoint and one extraction behavior:
+- **Model:** `exp15_sft_qwen7b_4ep`
+- **Base model:** `Qwen/Qwen2.5-7B-Instruct`
+- **Role:** proposition extraction for long-term conversational memory
+- **Why this one:** best confirmed total profile, best adversarial behavior, and
+  best LongMemEval score
+## Skill Definition
+The extractor operates turn by turn and emits `0-5` atomic propositions per
+turn. Each proposition should be a standalone fact about a person, event,
+preference, or property, with dates carried into the fact when available.
+Canonical prompt:
+```text
+You are a memory extraction assistant. Given a conversation turn, extract 0-5 atomic, standalone facts. Each fact must be a complete sentence about a specific person, event, preference, or property. Include dates/times when mentioned. Skip greetings, filler, and questions. Output ONLY a JSON array of strings, e.g. ["fact1", "fact2"] or [].
+```
+This prompt comes from
+`/home/ec2-user/SageMaker/better_memory/experiment15_learned_extraction.py`.
+## Inference Contract
+1. Format the turn with speaker and session date.
+2. Extract `0-5` propositions as a JSON array.
+3. Clean speaker references so generic labels become real names.
+4. Resolve relative temporal expressions against the session date.
+5. Prefix each proposition with the normalized session date before indexing.
+6. Retrieve with the PRISM hybrid stack, not with the extractor alone.
+## Retrieval Setup To Keep
+- **Retriever:** `PRISMv3Rerank`
+- **Sparse retrieval:** BM25
+- **Dense retrieval:** `all-MiniLM-L6-v2`
+- **Reranker:** `cross-encoder/ms-marco-MiniLM-L-6-v2`
+Best confirmed retrieval settings:
+- **LoCoMo:** adversarial `k=5`, multi-hop `k=10`, all other categories `k=8`
+- **LongMemEval:** multi-session `k=20`, all other categories `k=8` except
+  single-session-user `k=5`
+## What Worked
+1. **The original 20k base mattered.**
+   `sft4` came from the exact `train_sft_clean_merged.jsonl` base distribution.
+   Runs that changed the base subset regressed.
+2. **Four epochs was the sweet spot.**
+   `sft4` is the local optimum the repo could actually reproduce.
+3. **Absolute date anchoring helped.**
+   Temporal repairs worked when the model saw explicit, normalized dates rather
+   than benchmark-specific relative phrasing.
+4. **Post-processing mattered.**
+   Speaker cleanup plus relative-date resolution was necessary to turn raw
+   outputs into stable memory records.
+5. **Hybrid retrieval beat simpler retrieval.**
+   BM25 + dense + reranking consistently outperformed BM25-only or dense-only
+   approaches.
+6. **Turn-local extraction was enough.**
+   The model performed better without feeding long recent-context windows into
+   the extractor.
+7. **Multi-hop supervision preserved inferential behavior.**
+   When temporal data was added, multi-hop QA was the only extra signal that
+   reliably helped preserve inferential performance.
+## What Did Not Work
+1. **Relative-date training.**
+   Training the extractor to emit benchmark-style relative dates hurt temporal
+   performance instead of helping it.
+2. **LoCoMo-domain SFT data.**
+   Adding LoCoMo training conversations consistently regressed the model.
+3. **More than 20k original LongMemEval (LME) examples.**
+   Scaling the original noisy temporal labels to 50k amplified anchor loss and
+   caused major regression.
+4. **Small clean bases.**
+   5k-base follow-on runs forgot too much and collapsed inferential behavior.
+5. **Heavy QA multipliers.**
+   High temporal or QA multipliers damaged adversarial precision and LongMemEval.
+6. **High learning rates on follow-on QA runs.**
+   Aggressive fine-tuning degraded the traits that made `sft4` good.
+7. **Trying to push past the local optimum.**
+   Most post-`sft4` training traded away adversarial performance for narrower
+   gains.
+## Release Rule
+Release only this extraction skill and only this checkpoint publicly:
+- `exp15_sft_qwen7b_4ep`
+Treat all other checkpoints as internal ablations and learning artifacts, not as
+parallel public releases.
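
Steps 2 to 5 of the Inference Contract in this file can be sketched as a small post-processing function. This is a minimal illustration, not the code from `experiment15_learned_extraction.py`, which is not part of this commit. The names `parse_propositions`, `postprocess`, and `RELATIVE_OFFSETS`, the generic-label pattern, and the three-word relative-date table are all assumptions; only the `Speaker: [D Month YYYY] fact` record shape is taken from the retrieval records in `results/scenario_comparisons.json`.

```python
import json
import re
from datetime import date, timedelta

MAX_PROPS = 5  # the skill emits 0-5 propositions per turn

# Hypothetical relative-date table; the real resolver is not shown in this commit.
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def format_date(d: date) -> str:
    """Render dates like '18 May 2023', the style used in the indexed records."""
    return f"{d.day} {d.strftime('%B %Y')}"


def parse_propositions(raw: str) -> list[str]:
    """Step 2: parse the model's JSON array; return [] on malformed output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [p.strip() for p in data if isinstance(p, str) and p.strip()][:MAX_PROPS]


def postprocess(raw: str, speaker: str, session_date: date) -> list[str]:
    records = []
    for prop in parse_propositions(raw):
        # Step 3: replace a generic label with the real speaker name (assumed pattern).
        prop = re.sub(r"\b[Tt]he (?:speaker|user)\b", speaker, prop)
        # Step 4: resolve simple relative expressions against the session date.
        for word, offset in RELATIVE_OFFSETS.items():
            resolved = format_date(session_date + timedelta(days=offset))
            prop = re.sub(rf"\b{word}\b", f"on {resolved}", prop, flags=re.IGNORECASE)
        # Step 5: prefix with the normalized session date before indexing.
        records.append(f"{speaker}: [{format_date(session_date)}] {prop}")
    return records
```

Calling `postprocess('["The speaker attended a cooking class yesterday."]', "Sam", date(2023, 8, 16))` yields one record that carries both the session date prefix and the resolved absolute date. Malformed model output degrades to an empty list instead of raising.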

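The per-category retrieval depths listed under "Best confirmed retrieval settings" in `MEMORY_EXTRACTION_SKILL.md` amount to a default with a few overrides. A minimal sketch of that lookup follows; `DEFAULT_K`, `TOP_K_OVERRIDES`, and `top_k` are hypothetical names, and the `PRISMv3Rerank` retriever itself is not included in this commit.

```python
DEFAULT_K = 8  # "all other categories k=8"

# Overrides copied from the skill document; the key layout is an assumption.
TOP_K_OVERRIDES = {
    ("locomo", "adversarial"): 5,
    ("locomo", "multi-hop"): 10,
    ("lme", "multi-session"): 20,
    ("lme", "single-session-user"): 5,
}


def top_k(benchmark: str, category: str) -> int:
    """Return the retrieval depth for a benchmark category, defaulting to 8."""
    return TOP_K_OVERRIDES.get((benchmark.lower(), category), DEFAULT_K)


print(top_k("locomo", "multi-hop"))      # depth for LoCoMo multi-hop questions
print(top_k("lme", "knowledge-update"))  # falls back to the default depth
```

Keeping the overrides in one table makes it easy to compare them against the `args` block in `results/confirmed_exp15_summary.json`.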
README.md CHANGED

@@ -1,12 +1,38 @@
 ---
-title: Prism Memory
-emoji: 🏢
-colorFrom: purple
-colorTo: blue
 sdk: gradio
 sdk_version: 6.12.0
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: PRISM-Memory
+colorFrom: blue
+colorTo: green
 sdk: gradio
 sdk_version: 6.12.0
 app_file: app.py
 pinned: false
 ---
+# PRISM-Memory Space
+**Hook:** Turn conversations into durable, searchable memory.
+This Space is the lightweight public demo for the single released
+`PRISM-Memory` extraction skill. It shows the best checkpoint only.
+## Inputs
+The app reads (from the Space root, or from the parent directory in the source repo):
+- `results/confirmed_exp15_summary.json`
+- `results/scenario_comparisons.json`
+- `MEMORY_EXTRACTION_SKILL.md`
+## What It Shows
+1. The confirmed metrics for the released checkpoint
+2. Selected benchmark cases showing strengths and failure modes
+3. The single canonical memory extraction skill to keep
+## Local Run
+```bash
+cd /home/ec2-user/SageMaker/autoresearch_memory/hf_space_demo
+python -m pip install -r requirements.txt
+python app.py
+```

app.py ADDED

	@@ -0,0 +1,178 @@

+from __future__ import annotations
+import json
+from pathlib import Path
+import gradio as gr
+import pandas as pd
+APP_DIR = Path(__file__).resolve().parent
+if (APP_DIR / "MEMORY_EXTRACTION_SKILL.md").exists() or (APP_DIR / "results").exists():
+    ROOT = APP_DIR
+else:
+    ROOT = APP_DIR.parent
+RESULTS_DIR = ROOT / "results"
+SUMMARY_PATH = RESULTS_DIR / "confirmed_exp15_summary.json"
+SCENARIO_PATH = RESULTS_DIR / "scenario_comparisons.json"
+SKILL_PATH = ROOT / "MEMORY_EXTRACTION_SKILL.md"
+LOCOMO_CATEGORY_NAMES = {
+    "1": "factual",
+    "2": "temporal",
+    "3": "inferential",
+    "4": "multi-hop",
+    "5": "adversarial",
+}
+LME_CATEGORY_ORDER = [
+    "knowledge-update",
+    "multi-session",
+    "single-session-assistant",
+    "single-session-preference",
+    "single-session-user",
+    "temporal-reasoning",
+]
+def _load_json(path: Path, default):
+    if not path.exists():
+        return default
+    return json.loads(path.read_text())
+def _load_summary() -> dict:
+    return _load_json(SUMMARY_PATH, {"results": [], "failures": []})
+def _load_scenarios() -> dict:
+    return _load_json(SCENARIO_PATH, {"scenarios": []})
+def _load_skill() -> str:
+    if not SKILL_PATH.exists():
+        return "Skill document not found."
+    return SKILL_PATH.read_text()
+def _best_result() -> dict | None:
+    results = _load_summary().get("results", [])
+    return results[0] if results else None
+def release_markdown() -> str:
+    item = _best_result()
+    if not item:
+        return "## No confirmed release result yet"
+    checkpoint = Path(item["checkpoint"]).name
+    return "\n\n".join(
+        [
+            "# PRISM-Memory",
+            "**Turn conversations into durable, searchable memory.**",
+            f"Released checkpoint: `{checkpoint}`",
+            f"Confirmed LoCoMo: `{item['locomo']['mean']:.3f}`",
+            f"Confirmed LongMemEval: `{item['lme']['mean']:.3f}`",
+        ]
+    )
+def summary_df() -> pd.DataFrame:
+    item = _best_result()
+    if not item:
+        return pd.DataFrame(columns=["checkpoint", "locomo_mean", "lme_mean", "cache_hits", "cache_misses", "eval_minutes"])
+    return pd.DataFrame(
+        [
+            {
+                "checkpoint": Path(item["checkpoint"]).name,
+                "locomo_mean": round(item["locomo"]["mean"], 3),
+                "lme_mean": round(item["lme"]["mean"], 3),
+                "cache_hits": item["qa_cache"]["hits"],
+                "cache_misses": item["qa_cache"]["misses"],
+                "eval_minutes": item["elapsed_min"],
+            }
+        ]
+    )
+def category_df() -> pd.DataFrame:
+    item = _best_result()
+    if not item:
+        return pd.DataFrame(columns=["benchmark", "category", "score"])
+    rows = []
+    for category in sorted(item["locomo"]["categories"], key=int):
+        score = item["locomo"]["categories"][category]
+        rows.append(
+            {
+                "benchmark": "LoCoMo",
+                "category": LOCOMO_CATEGORY_NAMES.get(category, category),
+                "score": round(score, 3),
+            }
+        )
+    for category in LME_CATEGORY_ORDER:
+        if category not in item["lme"]["categories"]:
+            continue
+        score = item["lme"]["categories"][category]
+        rows.append({"benchmark": "LongMemEval", "category": category, "score": round(score, 3)})
+    return pd.DataFrame(rows)
+def _scenario_label(item: dict) -> str:
+    return f"{item['title']}: {item['question']}"
+def scenario_choices() -> list[str]:
+    scenarios = _load_scenarios().get("scenarios", [])
+    return [_scenario_label(s) for s in scenarios]
+def render_scenario(choice: str):
+    scenarios = _load_scenarios().get("scenarios", [])
+    if not scenarios:
+        return "No scenario data yet.", pd.DataFrame(columns=["prediction", "top_retrieval"])
+    item = next(
+        (scenario for scenario in scenarios if _scenario_label(scenario) == choice or scenario["id"] == choice),
+        scenarios[0],
+    )
+    system = next((entry for entry in item.get("systems", []) if entry.get("name") == "sft4"), item["systems"][0])
+    header = [
+        f"### {item['title']}",
+        "",
+        f"**Question:** {item['question']}",
+        f"**Gold answer:** {item['gold_answer']}",
+        f"**What this case shows:** {item.get('note', 'Selected benchmark case.')}",
+        f"**Case type:** {item.get('kind', 'n/a')}",
+    ]
+    table = pd.DataFrame(
+        [
+            {
+                "prediction": system.get("prediction", ""),
+                "top_retrieval": "\n".join(system.get("top_retrieval", [])[:3]),
+            }
+        ]
+    )
+    return "\n".join(header), table
+with gr.Blocks(title="PRISM-Memory Demo") as demo:
+    gr.Markdown(release_markdown())
+    with gr.Tab("Metrics"):
+        gr.Markdown("## Released Checkpoint")
+        metrics = gr.Dataframe(value=summary_df(), interactive=False, wrap=True)
+        gr.Markdown("## Category Breakdown")
+        categories = gr.Dataframe(value=category_df(), interactive=False, wrap=True)
+        refresh = gr.Button("Refresh Data")
+        refresh.click(fn=lambda: (summary_df(), category_df()), outputs=[metrics, categories])
+    with gr.Tab("Memory Cases"):
+        choices = scenario_choices() or ["pending"]
+        picker = gr.Dropdown(choices=choices, value=choices[0], label="Benchmark Case")
+        scenario_md = gr.Markdown()
+        scenario_table = gr.Dataframe(interactive=False, wrap=True)
+        picker.change(render_scenario, inputs=picker, outputs=[scenario_md, scenario_table])
+        demo.load(fn=lambda: render_scenario(choices[0]), outputs=[scenario_md, scenario_table])
+    with gr.Tab("Skill"):
+        gr.Markdown(_load_skill())
+if __name__ == "__main__":
+    demo.launch()
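
The `ROOT` fallback at the top of `app.py` decides whether the data files sit next to the app (the Space layout) or one directory up (the source-repo layout). The sketch below restates that check as a function and exercises both layouts in a temporary directory. `resolve_root` is a hypothetical helper introduced for illustration; the directory name `hf_space_demo` is borrowed from the README's local-run command.

```python
import tempfile
from pathlib import Path


def resolve_root(app_dir: Path) -> Path:
    """Mirror app.py: use app_dir if it holds the data, otherwise its parent."""
    if (app_dir / "MEMORY_EXTRACTION_SKILL.md").exists() or (app_dir / "results").exists():
        return app_dir
    return app_dir.parent


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp).resolve()
    space = repo / "hf_space_demo"
    space.mkdir()

    # Source-repo layout: results/ lives one level above the app directory.
    (repo / "results").mkdir()
    source_layout_root = resolve_root(space)

    # Space layout: results/ sits next to app.py.
    (space / "results").mkdir()
    space_layout_root = resolve_root(space)

print(source_layout_root.name != "hf_space_demo")  # parent directory was chosen
print(space_layout_root.name == "hf_space_demo")   # app directory was chosen
```

One consequence of this design: when neither location contains the files, the loaders fall back to empty defaults and the app renders its "No confirmed release result yet" placeholders instead of failing.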

requirements.txt ADDED

	@@ -0,0 +1,2 @@


+gradio>=5.0.0
+pandas>=2.0.0

results/confirmed_exp15_summary.json ADDED

	@@ -0,0 +1,53 @@

+{
+  "results": [
+    {
+      "alias": "sft4",
+      "checkpoint": "/home/ec2-user/SageMaker/better_memory/exp15_sft_qwen7b_4ep",
+      "elapsed_min": 28.93,
+      "args": {
+        "n_lme": 10,
+        "context_window": 0,
+        "locomo_temporal_k": 8,
+        "locomo_adversarial_k": 5,
+        "lme_multisess_k": 20,
+        "use_temporal_prompt": false,
+        "strict_cache": true
+      },
+      "qa_cache": {
+        "model": "gpt-4.1",
+        "cache_size": 16969,
+        "hits": 460,
+        "misses": 0,
+        "missing_examples": []
+      },
+      "locomo": {
+        "categories": {
+          "1": 0.3339551926061944,
+          "2": 0.4978785869736096,
+          "3": 0.26059974747474746,
+          "4": 0.514447774438597,
+          "5": 0.8837209302325582
+        },
+        "mean": 0.49812044634514124
+      },
+      "lme": {
+        "categories": {
+          "knowledge-update": 0.558840579710145,
+          "multi-session": 0.13909774436090225,
+          "single-session-assistant": 0.765639589169001,
+          "single-session-preference": 0.05196674560130369,
+          "single-session-user": 0.9133333333333333,
+          "temporal-reasoning": 0.43166666666666664
+        },
+        "mean": 0.47675744314022533
+      },
+      "logged_comparison": {
+        "logged_locomo_mean": 0.498,
+        "logged_lme_mean": 0.477,
+        "locomo_delta": 0.00012044634514124519,
+        "lme_delta": -0.00024255685977464525
+      }
+    }
+  ],
+  "failures": []
+}
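
The stored `mean` values in this file appear to be unweighted averages over the category scores, and the `logged_comparison` deltas appear to be the stored mean minus the rounded logged value. A short check under that assumption, with the numbers copied from the JSON above (`macro_and_delta` is a hypothetical helper):

```python
import math
from statistics import fmean

# Values copied from results/confirmed_exp15_summary.json.
summary = {
    "locomo": {
        "categories": [
            0.3339551926061944,
            0.4978785869736096,
            0.26059974747474746,
            0.514447774438597,
            0.8837209302325582,
        ],
        "mean": 0.49812044634514124,
        "logged": 0.498,
        "delta": 0.00012044634514124519,
    },
    "lme": {
        "categories": [
            0.558840579710145,
            0.13909774436090225,
            0.765639589169001,
            0.05196674560130369,
            0.9133333333333333,
            0.43166666666666664,
        ],
        "mean": 0.47675744314022533,
        "logged": 0.477,
        "delta": -0.00024255685977464525,
    },
}


def macro_and_delta(block: dict) -> tuple[float, float]:
    """Recompute the unweighted category mean and the mean-minus-logged delta."""
    return fmean(block["categories"]), block["mean"] - block["logged"]


for name, block in summary.items():
    macro, delta = macro_and_delta(block)
    mean_ok = math.isclose(macro, block["mean"], rel_tol=1e-9)
    delta_ok = math.isclose(delta, block["delta"], abs_tol=1e-12)
    print(f"{name}: mean matches={mean_ok}, delta matches={delta_ok}")
```

Because the means are unweighted, a low-scoring category such as `single-session-preference` pulls the LongMemEval mean down as much as any other category, regardless of how many questions it contains.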

results/scenario_comparisons.json ADDED

	@@ -0,0 +1,140 @@

+{
+  "qa_cache": {
+    "model": "gpt-4.1",
+    "cache_size": 16969,
+    "hits": 5,
+    "misses": 0,
+    "missing_examples": []
+  },
+  "scenarios": [
+    {
+      "id": "temporal_anchor_hobby",
+      "title": "Temporal Anchor",
+      "source_id": "conv-49",
+      "category": 2,
+      "question": "Which hobby did Sam take up in May 2023?",
+      "gold_answer": "painting",
+      "kind": "strength",
+      "note": "The released model keeps the dated hobby proposition and answers correctly.",
+      "systems": [
+        {
+          "name": "sft4",
+          "prediction": "painting",
+          "top_retrieval": [
+            "Sam: [18 May 2023] Sam is considering trying painting as a new hobby.",
+            "Sam: [24 May 2023] Sam has been considering trying painting as a new hobby.",
+            "Sam: [6 October 2023] Sam asked Evan if he has explored any fun indoor activities or hobbies.",
+            "Sam: [18 May 2023] Sam is excited to try new things.",
+            "Sam: [24 May 2023] Sam is trying to break old habits.",
+            "Sam: [15 August 2023] Sam attended a cooking class.",
+            "Sam: [17 December 2023] Sam used to love hiking.",
+            "[1:47 pm on 18 May, 2023] Sam: We hiked a good distance - quite a feat for me back then. It's definitely a great memory."
+          ]
+        }
+      ]
+    },
+    {
+      "id": "adversarial_refusal_guitar",
+      "title": "Adversarial Refusal",
+      "source_id": "conv-50",
+      "category": 5,
+      "question": "Why did Dave get his guitar customized with a shiny finish?",
+      "gold_answer": "",
+      "kind": "strength",
+      "note": "This tests whether the system refuses to invent an answer when the premise is unsupported.",
+      "systems": [
+        {
+          "name": "sft4",
+          "prediction": "None",
+          "top_retrieval": [
+            "[2:55 pm on 31 August, 2023] Dave: That guitar has a gorgeous purple hue. Why did you make it so shiny?",
+            "Dave: [2 November 2023] The guitar was in bad condition when Dave found it.",
+            "[2:55 pm on 31 August, 2023] Dave: Good pick! The customized purple glow gives it a unique look that really stands out.",
+            "[2:55 pm on 31 August, 2023] Dave: That's a great guitar, Calvin! Love the design, it's so unique and special.",
+            "Dave: [16 May 2023] Calvin lost his guitar and amp but managed to save his music gear and microphone."
+          ]
+        }
+      ]
+    },
+    {
+      "id": "diagnosis_specificity",
+      "title": "Diagnosis Specificity",
+      "source_id": "conv-49",
+      "category": 1,
+      "question": "Which ailment does Sam have to face due to his weight?",
+      "gold_answer": "gastritis",
+      "kind": "failure",
+      "note": "A representative factual miss: the model retrieves the health-risk frame but not the specific diagnosis.",
+      "systems": [
+        {
+          "name": "sft4",
+          "prediction": "serious health risk",
+          "top_retrieval": [
+            "Sam: [8 October 2023] The doctor told Sam that his weight is a serious health risk.",
+            "Sam: [24 May 2023] The doctor's check-up revealed that Sam's weight was not good.",
+            "Sam: [13 August 2023] Sam is currently experiencing challenges affecting his health.",
+            "[6:48 pm on 17 December, 2023] Sam: Yeah, I'm struggling with my weight and it's affecting my confidence. I feel like I can't overcome all the challenges with my weight, I keep lacking motivation.",
+            "Sam: [21 November 2023] Sam has been trying to make dietary changes to address his discomfort.",
+            "Sam: [9 November 2023] Sam is a Weight Watchers coach in his group.",
+            "Sam: [7 August 2023] Sam has been prioritizing his health for some time.",
+            "Sam: [15 August 2023] Sam is concerned about his health."
+          ]
+        }
+      ]
+    },
+    {
+      "id": "location_inference",
+      "title": "Location Inference",
+      "source_id": "conv-49",
+      "category": 3,
+      "question": "Does Evan live close to a beach or mountains?",
+      "gold_answer": "beach",
+      "kind": "failure",
+      "note": "A representative inferential miss: retrieval includes both clues, but the model overcommits to the mountain mention.",
+      "systems": [
+        {
+          "name": "sft4",
+          "prediction": "mountains",
+          "top_retrieval": [
+            "Evan: [27 August 2023] Evan also shared his recent road trip to the Rocky Mountains and love for hiking.",
+            "Evan: [27 August 2023] Evan lives within a two-hour drive of a place with incredible views and a peaceful atmosphere.",
+            "Evan: [9 November 2023] They also discussed enjoying a sunset together at Evan's favorite spot by the beach, planning to visit it soon to de-stress.",
+            "Evan: [10 January 2024] Evan enjoys going on beach sunsets as a low-impact exercise.",
+            "[7:11 pm on 24 May, 2023] Evan: Hey Sam, thanks for asking! It was great - fresh air, peacefulness and a cozy cabin surrounded by mountains and forests made it feel like a real retreat.",
+            "Evan: [31 December 2023] Sam shared about a recent hiking trip, while Evan mentioned a mountain drive that ended in a minor accident.",
+            "Evan: [27 August 2023] Evan recommended a nearby lake for hiking and nature exploration.",
+            "Evan: [27 August 2023] Evan enjoys road trips and exploring nature."
+          ]
+        }
+      ]
+    },
+    {
+      "id": "reading_detail",
+      "title": "Reading Detail",
+      "source_id": "conv-49",
+      "category": 4,
+      "question": "What novel is Evan reading that he finds gripping?",
+      "gold_answer": "The Great Gatsby",
+      "kind": "failure",
+      "note": "A representative multi-hop miss: the model retains the coarse book description but misses the specific title.",
+      "systems": [
+        {
+          "name": "sft4",
+          "prediction": "a new mystery novel",
+          "top_retrieval": [
+            "Evan: [27 August 2023] Evan is reading a book that he finds increasingly compelling.",
+            "Evan: [27 July 2023] Evan is currently reading a new mystery novel.",
+            "Evan: [27 July 2023] Evan is reading 'The Great Gatsby'.",
+            "Evan: [26 December 2023] Evan finds that art helps him recognize and handle his own feelings.",
+            "Evan: [27 August 2023] Evan expressed interest in a book and discussed potential physical therapy for his knee.",
+            "Evan: [10 January 2024] Evan concluded that he needs to be more careful next time.",
+            "Evan: [6 October 2023] Evan thinks writing is a great way to express oneself.",
+            "Evan: [13 August 2023] Evan suggested checking out a dream interpretation book to help interpret Sam's dream.",
+            "Evan: [6 October 2023] Evan believes that writing can be super therapeutic.",
+            "Evan: [6 October 2023] Evan usually paints what is on his mind or something he is feeling."
+          ]
+        }
+      ]
+    }
+  ]
+}
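
The failure cases above differ in where they fail: in `reading_detail` the gold title is present in the retrieved records but the answer misses it, while in `diagnosis_specificity` the gold term never appears in the retrieved records at all. A minimal sketch for telling the two apart by substring match follows. `gold_rank` is a hypothetical helper and not part of `app.py`, and the lists are excerpts (the first three records of each scenario).

```python
from __future__ import annotations


def gold_rank(gold: str, retrieved: list[str]) -> int | None:
    """Return the 1-based rank of the first record containing the gold answer."""
    needle = gold.strip().lower()
    if not needle:
        return None  # adversarial cases have an empty gold answer
    for rank, text in enumerate(retrieved, start=1):
        if needle in text.lower():
            return rank
    return None


# First three retrieved records from two scenarios in scenario_comparisons.json.
reading_detail = [
    "Evan: [27 August 2023] Evan is reading a book that he finds increasingly compelling.",
    "Evan: [27 July 2023] Evan is currently reading a new mystery novel.",
    "Evan: [27 July 2023] Evan is reading 'The Great Gatsby'.",
]
diagnosis_specificity = [
    "Sam: [8 October 2023] The doctor told Sam that his weight is a serious health risk.",
    "Sam: [24 May 2023] The doctor's check-up revealed that Sam's weight was not good.",
    "Sam: [13 August 2023] Sam is currently experiencing challenges affecting his health.",
]

print(gold_rank("The Great Gatsby", reading_detail))   # retrieval hit, answer miss
print(gold_rank("gastritis", diagnosis_specificity))   # retrieval miss
```

Substring matching is a rough proxy: it would not catch a paraphrased gold answer, so a `None` result suggests a retrieval or extraction gap but does not prove one.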