Add PRISM-Memory Space demo
- MEMORY_EXTRACTION_SKILL.md +113 -0
- README.md +31 -5
- app.py +178 -0
- requirements.txt +2 -0
- results/confirmed_exp15_summary.json +53 -0
- results/scenario_comparisons.json +140 -0
MEMORY_EXTRACTION_SKILL.md
ADDED
@@ -0,0 +1,113 @@
+# PRISM-Memory Extraction Skill
+
+**Hook:** Turn conversations into durable, searchable memory.
+
+This is the single extraction skill to keep from the `better_memory` work.
+Public release should point to one checkpoint and one extraction behavior:
+
+- **Model:** `exp15_sft_qwen7b_4ep`
+- **Base model:** `Qwen/Qwen2.5-7B-Instruct`
+- **Role:** proposition extraction for long-term conversational memory
+- **Why this one:** best confirmed total profile, best adversarial behavior, and
+  best LongMemEval score
+
+## Skill Definition
+
+The extractor operates turn by turn and emits `0-5` atomic propositions per
+turn. Each proposition should be a standalone fact about a person, event,
+preference, or property, with dates carried into the fact when available.
+
+Canonical prompt:
+
+```text
+You are a memory extraction assistant. Given a conversation turn, extract 0-5 atomic, standalone facts. Each fact must be a complete sentence about a specific person, event, preference, or property. Include dates/times when mentioned. Skip greetings, filler, and questions. Output ONLY a JSON array of strings, e.g. ["fact1", "fact2"] or [].
+```
+
+This prompt comes from
+`/home/ec2-user/SageMaker/better_memory/experiment15_learned_extraction.py`.
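The prompt's output contract (a JSON array of 0-5 strings) can be checked before anything is indexed. The sketch below is one possible validator, not code from the repository: `parse_extraction` and `MAX_FACTS` are hypothetical names, and treating malformed output as "no facts" is an assumption about how a caller might want to fail.

```python
import json

MAX_FACTS = 5  # the canonical prompt asks for 0-5 facts per turn


def parse_extraction(raw: str) -> list[str]:
    """Parse raw extractor output into at most MAX_FACTS non-empty strings.

    Hypothetical helper. Anything that is not a JSON array of strings is
    treated as "no facts" instead of raising an exception.
    """
    # Tolerate stray text around the array by slicing from the first "["
    # to the last "]".
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only non-empty strings, trimmed, and cap the count.
    facts = [item.strip() for item in data if isinstance(item, str) and item.strip()]
    return facts[:MAX_FACTS]
```

Returning an empty list on malformed output keeps a single bad generation from stopping a long indexing run; a stricter pipeline might prefer to log or retry instead.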
+
+## Inference Contract
+
+1. Format the turn with speaker and session date.
+2. Extract `0-5` propositions as a JSON array.
+3. Clean speaker references so generic labels become real names.
+4. Resolve relative temporal expressions against the session date.
+5. Prefix each proposition with the normalized session date before indexing.
+6. Retrieve with the PRISM hybrid stack, not with the extractor alone.
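Steps 1 and 3-5 of the contract can be sketched in a few lines of Python. Everything here is illustrative: the function names, the `Speaker 1` label format, and the three-phrase relative-date table are assumptions, and the `[18 May 2023]` date style is modeled on the retrieved records shown in `results/scenario_comparisons.json`. The repository's actual post-processing may differ.

```python
import re
from datetime import date, timedelta

# Hypothetical relative phrases and their day offsets; a real resolver
# would need to cover far more expressions ("last week", "two days ago", ...).
RELATIVE_OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1}


def format_date(d: date) -> str:
    """Render a date like '18 May 2023' (day without zero padding)."""
    return f"{d.day} {d.strftime('%B %Y')}"


def format_turn(speaker: str, text: str, session_date: date) -> str:
    """Step 1: put the session date and speaker in front of the turn text."""
    return f"[{format_date(session_date)}] {speaker}: {text}"


def clean_speaker(fact: str, label_to_name: dict[str, str]) -> str:
    """Step 3: replace generic labels such as 'Speaker 1' with real names."""
    for label, name in label_to_name.items():
        fact = re.sub(rf"\b{re.escape(label)}\b", name, fact)
    return fact


def resolve_relative_dates(fact: str, session_date: date) -> str:
    """Step 4: rewrite known relative phrases as absolute dates."""

    def repl(match: re.Match) -> str:
        offset = RELATIVE_OFFSETS[match.group(0).lower()]
        return "on " + format_date(session_date + timedelta(days=offset))

    pattern = r"\b(" + "|".join(RELATIVE_OFFSETS) + r")\b"
    return re.sub(pattern, repl, fact, flags=re.IGNORECASE)


def to_memory_record(fact: str, session_date: date, label_to_name: dict[str, str]) -> str:
    """Steps 3-5: clean speakers, resolve dates, then prefix the session date."""
    fact = clean_speaker(fact, label_to_name)
    fact = resolve_relative_dates(fact, session_date)
    return f"[{format_date(session_date)}] {fact}"
```

For example, `to_memory_record("Speaker 1 went hiking yesterday.", date(2023, 5, 18), {"Speaker 1": "Sam"})` returns `"[18 May 2023] Sam went hiking on 17 May 2023."`, so the stored proposition no longer depends on when it is read.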
+
+## Retrieval Setup To Keep
+
+- **Retriever:** `PRISMv3Rerank`
+- **Sparse retrieval:** BM25
+- **Dense retrieval:** `all-MiniLM-L6-v2`
+- **Reranker:** `cross-encoder/ms-marco-MiniLM-L-6-v2`
+
+Best confirmed retrieval settings:
+
+- **LoCoMo:** adversarial `k=5`, multi-hop `k=10`, all other categories `k=8`
+- **LongMemEval:** multi-session `k=20`, all other categories `k=8` except
+  single-session-user `k=5`
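The per-category retrieval depths listed above amount to a default plus a few overrides. A minimal lookup sketch follows; `retrieval_k`, `K_OVERRIDES`, and the lowercase benchmark keys are hypothetical names chosen for illustration, not identifiers from the PRISM code.

```python
DEFAULT_K = 8  # "all other categories" in both benchmarks

# Per-category overrides copied from the settings listed above.
K_OVERRIDES = {
    "locomo": {"adversarial": 5, "multi-hop": 10},
    "longmemeval": {"multi-session": 20, "single-session-user": 5},
}


def retrieval_k(benchmark: str, category: str) -> int:
    """Return the retrieval depth for a benchmark category.

    Hypothetical helper: unknown benchmarks or categories fall back to
    DEFAULT_K rather than raising.
    """
    return K_OVERRIDES.get(benchmark.lower(), {}).get(category, DEFAULT_K)
```

Keeping the overrides in one table makes it easy to see that only four categories deviate from `k=8`.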
+
+## What Worked
+
+1. **The original 20k base mattered.**
+   `sft4` came from the exact `train_sft_clean_merged.jsonl` base distribution.
+   Runs that changed the base subset regressed.
+
+2. **Four epochs was the sweet spot.**
+   `sft4` is the local optimum the repo could actually reproduce.
+
+3. **Absolute date anchoring helped.**
+   Temporal repairs worked when the model saw explicit, normalized dates rather
+   than benchmark-specific relative phrasing.
+
+4. **Post-processing mattered.**
+   Speaker cleanup plus relative-date resolution was necessary to turn raw
+   outputs into stable memory records.
+
+5. **Hybrid retrieval beat simpler retrieval.**
+   BM25 + dense + reranking consistently outperformed BM25-only or dense-only
+   approaches.
+
+6. **Turn-local extraction was enough.**
+   The model performed better without feeding long recent-context windows into
+   the extractor.
+
+7. **Multihop supervision preserved inferential behavior.**
+   When temporal data was added, multihop QA was the only extra signal that
+   reliably helped preserve inferential performance.
+
+## What Did Not Work
+
+1. **Relative-date training.**
+   Training the extractor to emit benchmark-style relative dates hurt temporal
+   performance instead of helping it.
+
+2. **LoCoMo-domain SFT data.**
+   Adding LoCoMo training conversations consistently regressed the model.
+
+3. **More than 20k original LME examples.**
+   Scaling the original noisy temporal labels to 50k amplified anchor loss and
+   caused major regression.
+
+4. **Small clean bases.**
+   5k-base follow-on runs forgot too much and collapsed inferential behavior.
+
+5. **Heavy QA multipliers.**
+   High temporal or QA multipliers damaged adversarial precision and LongMemEval.
+
+6. **High learning rates on follow-on QA runs.**
+   Aggressive fine-tuning degraded the traits that made `sft4` good.
+
+7. **Trying to push past the local optimum.**
+   Most post-`sft4` training traded away adversarial performance for narrower
+   gains.
+
+## Release Rule
+
+Release only this extraction skill and only this checkpoint publicly:
+
+- `exp15_sft_qwen7b_4ep`
+
+Treat all other checkpoints as internal ablations and learning artifacts, not as
+parallel public releases.
README.md
CHANGED
@@ -1,12 +1,38 @@
 ---
-title:
-
-
-colorTo: blue
+title: PRISM-Memory
+colorFrom: blue
+colorTo: green
 sdk: gradio
 sdk_version: 6.12.0
 app_file: app.py
 pinned: false
 ---
 
-
+# PRISM-Memory Space
+
+**Hook:** Turn conversations into durable, searchable memory.
+
+This Space is the lightweight public demo for the single released
+`PRISM-Memory` extraction skill. It shows the best checkpoint only.
+
+## Inputs
+
+The app reads:
+
+- `../results/confirmed_exp15_summary.json`
+- `../results/scenario_comparisons.json`
+- `../MEMORY_EXTRACTION_SKILL.md`
+
+## What It Shows
+
+1. The confirmed metrics for the released checkpoint
+2. Selected benchmark cases showing strengths and failure modes
+3. The single canonical memory extraction skill to keep
+
+## Local Run
+
+```bash
+cd /home/ec2-user/SageMaker/autoresearch_memory/hf_space_demo
+python -m pip install -r requirements.txt
+python app.py
+```
app.py
ADDED
@@ -0,0 +1,178 @@
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import gradio as gr
+import pandas as pd
+
+APP_DIR = Path(__file__).resolve().parent
+if (APP_DIR / "MEMORY_EXTRACTION_SKILL.md").exists() or (APP_DIR / "results").exists():
+    ROOT = APP_DIR
+else:
+    ROOT = APP_DIR.parent
+RESULTS_DIR = ROOT / "results"
+SUMMARY_PATH = RESULTS_DIR / "confirmed_exp15_summary.json"
+SCENARIO_PATH = RESULTS_DIR / "scenario_comparisons.json"
+SKILL_PATH = ROOT / "MEMORY_EXTRACTION_SKILL.md"
+LOCOMO_CATEGORY_NAMES = {
+    "1": "factual",
+    "2": "temporal",
+    "3": "inferential",
+    "4": "multi-hop",
+    "5": "adversarial",
+}
+LME_CATEGORY_ORDER = [
+    "knowledge-update",
+    "multi-session",
+    "single-session-assistant",
+    "single-session-preference",
+    "single-session-user",
+    "temporal-reasoning",
+]
+
+
+def _load_json(path: Path, default):
+    if not path.exists():
+        return default
+    return json.loads(path.read_text())
+
+
+def _load_summary() -> dict:
+    return _load_json(SUMMARY_PATH, {"results": [], "failures": []})
+
+
+def _load_scenarios() -> dict:
+    return _load_json(SCENARIO_PATH, {"scenarios": []})
+
+
+def _load_skill() -> str:
+    if not SKILL_PATH.exists():
+        return "Skill document not found."
+    return SKILL_PATH.read_text()
+
+
+def _best_result() -> dict | None:
+    results = _load_summary().get("results", [])
+    return results[0] if results else None
+
+
+def release_markdown() -> str:
+    item = _best_result()
+    if not item:
+        return "## No confirmed release result yet"
+    checkpoint = Path(item["checkpoint"]).name
+    return "\n\n".join(
+        [
+            "# PRISM-Memory",
+            "**Turn conversations into durable, searchable memory.**",
+            f"Released checkpoint: `{checkpoint}`",
+            f"Confirmed LoCoMo: `{item['locomo']['mean']:.3f}`",
+            f"Confirmed LongMemEval: `{item['lme']['mean']:.3f}`",
+        ]
+    )
+
+
+def summary_df() -> pd.DataFrame:
+    item = _best_result()
+    if not item:
+        return pd.DataFrame(columns=["checkpoint", "locomo_mean", "lme_mean", "cache_hits", "cache_misses", "eval_minutes"])
+    return pd.DataFrame(
+        [
+            {
+                "checkpoint": Path(item["checkpoint"]).name,
+                "locomo_mean": round(item["locomo"]["mean"], 3),
+                "lme_mean": round(item["lme"]["mean"], 3),
+                "cache_hits": item["qa_cache"]["hits"],
+                "cache_misses": item["qa_cache"]["misses"],
+                "eval_minutes": item["elapsed_min"],
+            }
+        ]
+    )
+
+
+def category_df() -> pd.DataFrame:
+    item = _best_result()
+    if not item:
+        return pd.DataFrame(columns=["benchmark", "category", "score"])
+    rows = []
+    for category in sorted(item["locomo"]["categories"], key=int):
+        score = item["locomo"]["categories"][category]
+        rows.append(
+            {
+                "benchmark": "LoCoMo",
+                "category": LOCOMO_CATEGORY_NAMES.get(category, category),
+                "score": round(score, 3),
+            }
+        )
+    for category in LME_CATEGORY_ORDER:
+        if category not in item["lme"]["categories"]:
+            continue
+        score = item["lme"]["categories"][category]
+        rows.append({"benchmark": "LongMemEval", "category": category, "score": round(score, 3)})
+    return pd.DataFrame(rows)
+
+
+def _scenario_label(item: dict) -> str:
+    return f"{item['title']}: {item['question']}"
+
+
+def scenario_choices() -> list[str]:
+    scenarios = _load_scenarios().get("scenarios", [])
+    return [_scenario_label(s) for s in scenarios]
+
+
+def render_scenario(choice: str):
+    scenarios = _load_scenarios().get("scenarios", [])
+    if not scenarios:
+        return "No scenario data yet.", pd.DataFrame(columns=["prediction", "top_retrieval"])
+
+    item = next(
+        (scenario for scenario in scenarios if _scenario_label(scenario) == choice or scenario["id"] == choice),
+        scenarios[0],
+    )
+    system = next((entry for entry in item.get("systems", []) if entry.get("name") == "sft4"), item["systems"][0])
+    header = [
+        f"### {item['title']}",
+        "",
+        f"**Question:** {item['question']}",
+        f"**Gold answer:** {item['gold_answer']}",
+        f"**What this case shows:** {item.get('note', 'Selected benchmark case.')}",
+        f"**Case type:** {item.get('kind', 'n/a')}",
+    ]
+    table = pd.DataFrame(
+        [
+            {
+                "prediction": system.get("prediction", ""),
+                "top_retrieval": "\n".join(system.get("top_retrieval", [])[:3]),
+            }
+        ]
+    )
+    return "\n".join(header), table
+
+
+with gr.Blocks(title="PRISM-Memory Demo") as demo:
+    gr.Markdown(release_markdown())
+
+    with gr.Tab("Metrics"):
+        gr.Markdown("## Released Checkpoint")
+        metrics = gr.Dataframe(value=summary_df(), interactive=False, wrap=True)
+        gr.Markdown("## Category Breakdown")
+        categories = gr.Dataframe(value=category_df(), interactive=False, wrap=True)
+        refresh = gr.Button("Refresh Data")
+        refresh.click(fn=lambda: (summary_df(), category_df()), outputs=[metrics, categories])
+
+    with gr.Tab("Memory Cases"):
+        choices = scenario_choices() or ["pending"]
+        picker = gr.Dropdown(choices=choices, value=choices[0], label="Benchmark Case")
+        scenario_md = gr.Markdown()
+        scenario_table = gr.Dataframe(interactive=False, wrap=True)
+        picker.change(render_scenario, inputs=picker, outputs=[scenario_md, scenario_table])
+        demo.load(fn=lambda: render_scenario(choices[0]), outputs=[scenario_md, scenario_table])
+
+    with gr.Tab("Skill"):
+        gr.Markdown(_load_skill())
+
+
+if __name__ == "__main__":
+    demo.launch()
requirements.txt
ADDED
@@ -0,0 +1,2 @@
+gradio>=5.0.0
+pandas>=2.0.0
results/confirmed_exp15_summary.json
ADDED
@@ -0,0 +1,53 @@
+{
+  "results": [
+    {
+      "alias": "sft4",
+      "checkpoint": "/home/ec2-user/SageMaker/better_memory/exp15_sft_qwen7b_4ep",
+      "elapsed_min": 28.93,
+      "args": {
+        "n_lme": 10,
+        "context_window": 0,
+        "locomo_temporal_k": 8,
+        "locomo_adversarial_k": 5,
+        "lme_multisess_k": 20,
+        "use_temporal_prompt": false,
+        "strict_cache": true
+      },
+      "qa_cache": {
+        "model": "gpt-4.1",
+        "cache_size": 16969,
+        "hits": 460,
+        "misses": 0,
+        "missing_examples": []
+      },
+      "locomo": {
+        "categories": {
+          "1": 0.3339551926061944,
+          "2": 0.4978785869736096,
+          "3": 0.26059974747474746,
+          "4": 0.514447774438597,
+          "5": 0.8837209302325582
+        },
+        "mean": 0.49812044634514124
+      },
+      "lme": {
+        "categories": {
+          "knowledge-update": 0.558840579710145,
+          "multi-session": 0.13909774436090225,
+          "single-session-assistant": 0.765639589169001,
+          "single-session-preference": 0.05196674560130369,
+          "single-session-user": 0.9133333333333333,
+          "temporal-reasoning": 0.43166666666666664
+        },
+        "mean": 0.47675744314022533
+      },
+      "logged_comparison": {
+        "logged_locomo_mean": 0.498,
+        "logged_lme_mean": 0.477,
+        "locomo_delta": 0.00012044634514124519,
+        "lme_delta": -0.00024255685977464525
+      }
+    }
+  ],
+  "failures": []
+}
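The `mean` fields in this file appear to be unweighted averages over the per-category scores, and the `*_delta` fields the difference between that mean and the logged three-decimal value. The check below recomputes both from the numbers in the file; `macro_mean` is a hypothetical helper name, and the unweighted-average reading is an inference from the data, not something the summary states.

```python
LOCOMO = {
    "1": 0.3339551926061944,
    "2": 0.4978785869736096,
    "3": 0.26059974747474746,
    "4": 0.514447774438597,
    "5": 0.8837209302325582,
}
LME = {
    "knowledge-update": 0.558840579710145,
    "multi-session": 0.13909774436090225,
    "single-session-assistant": 0.765639589169001,
    "single-session-preference": 0.05196674560130369,
    "single-session-user": 0.9133333333333333,
    "temporal-reasoning": 0.43166666666666664,
}


def macro_mean(categories: dict[str, float]) -> float:
    """Unweighted mean over category scores (every category counts equally)."""
    return sum(categories.values()) / len(categories)


locomo_mean = macro_mean(LOCOMO)
lme_mean = macro_mean(LME)

# Deltas against the logged, rounded means recorded in "logged_comparison".
locomo_delta = locomo_mean - 0.498
lme_delta = lme_mean - 0.477

print(f"LoCoMo mean {locomo_mean:.3f}, delta {locomo_delta:+.6f}")
print(f"LongMemEval mean {lme_mean:.3f}, delta {lme_delta:+.6f}")
```

Because every category carries equal weight, the strong adversarial score (0.884) lifts the LoCoMo mean well above the factual and inferential scores, and the weak single-session-preference score (0.052) pulls the LongMemEval mean down.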
results/scenario_comparisons.json
ADDED
@@ -0,0 +1,140 @@
+{
+  "qa_cache": {
+    "model": "gpt-4.1",
+    "cache_size": 16969,
+    "hits": 5,
+    "misses": 0,
+    "missing_examples": []
+  },
+  "scenarios": [
+    {
+      "id": "temporal_anchor_hobby",
+      "title": "Temporal Anchor",
+      "source_id": "conv-49",
+      "category": 2,
+      "question": "Which hobby did Sam take up in May 2023?",
+      "gold_answer": "painting",
+      "kind": "strength",
+      "note": "The released model keeps the dated hobby proposition and answers correctly.",
+      "systems": [
+        {
+          "name": "sft4",
+          "prediction": "painting",
+          "top_retrieval": [
+            "Sam: [18 May 2023] Sam is considering trying painting as a new hobby.",
+            "Sam: [24 May 2023] Sam has been considering trying painting as a new hobby.",
+            "Sam: [6 October 2023] Sam asked Evan if he has explored any fun indoor activities or hobbies.",
+            "Sam: [18 May 2023] Sam is excited to try new things.",
+            "Sam: [24 May 2023] Sam is trying to break old habits.",
+            "Sam: [15 August 2023] Sam attended a cooking class.",
+            "Sam: [17 December 2023] Sam used to love hiking.",
+            "[1:47 pm on 18 May, 2023] Sam: We hiked a good distance - quite a feat for me back then. It's definitely a great memory."
+          ]
+        }
+      ]
+    },
+    {
+      "id": "adversarial_refusal_guitar",
+      "title": "Adversarial Refusal",
+      "source_id": "conv-50",
+      "category": 5,
+      "question": "Why did Dave get his guitar customized with a shiny finish?",
+      "gold_answer": "",
+      "kind": "strength",
+      "note": "This tests whether the system refuses to invent an answer when the premise is unsupported.",
+      "systems": [
+        {
+          "name": "sft4",
+          "prediction": "None",
+          "top_retrieval": [
+            "[2:55 pm on 31 August, 2023] Dave: That guitar has a gorgeous purple hue. Why did you make it so shiny?",
+            "Dave: [2 November 2023] The guitar was in bad condition when Dave found it.",
+            "[2:55 pm on 31 August, 2023] Dave: Good pick! The customized purple glow gives it a unique look that really stands out.",
+            "[2:55 pm on 31 August, 2023] Dave: That's a great guitar, Calvin! Love the design, it's so unique and special.",
+            "Dave: [16 May 2023] Calvin lost his guitar and amp but managed to save his music gear and microphone."
+          ]
+        }
+      ]
+    },
+    {
+      "id": "diagnosis_specificity",
+      "title": "Diagnosis Specificity",
+      "source_id": "conv-49",
+      "category": 1,
+      "question": "Which ailment does Sam have to face due to his weight?",
+      "gold_answer": "gastritis",
+      "kind": "failure",
+      "note": "A representative factual miss: the model retrieves the health-risk frame but not the specific diagnosis.",
+      "systems": [
+        {
+          "name": "sft4",
+          "prediction": "serious health risk",
+          "top_retrieval": [
+            "Sam: [8 October 2023] The doctor told Sam that his weight is a serious health risk.",
+            "Sam: [24 May 2023] The doctor's check-up revealed that Sam's weight was not good.",
+            "Sam: [13 August 2023] Sam is currently experiencing challenges affecting his health.",
+            "[6:48 pm on 17 December, 2023] Sam: Yeah, I'm struggling with my weight and it's affecting my confidence. I feel like I can't overcome all the challenges with my weight, I keep lacking motivation.",
+            "Sam: [21 November 2023] Sam has been trying to make dietary changes to address his discomfort.",
+            "Sam: [9 November 2023] Sam is a Weight Watchers coach in his group.",
+            "Sam: [7 August 2023] Sam has been prioritizing his health for some time.",
+            "Sam: [15 August 2023] Sam is concerned about his health."
+          ]
+        }
+      ]
+    },
+    {
+      "id": "location_inference",
+      "title": "Location Inference",
+      "source_id": "conv-49",
+      "category": 3,
+      "question": "Does Evan live close to a beach or mountains?",
+      "gold_answer": "beach",
+      "kind": "failure",
+      "note": "A representative inferential miss: retrieval includes both clues, but the model overcommits to the mountain mention.",
+      "systems": [
+        {
+          "name": "sft4",
+          "prediction": "mountains",
+          "top_retrieval": [
+            "Evan: [27 August 2023] Evan also shared his recent road trip to the Rocky Mountains and love for hiking.",
+            "Evan: [27 August 2023] Evan lives within a two-hour drive of a place with incredible views and a peaceful atmosphere.",
+            "Evan: [9 November 2023] They also discussed enjoying a sunset together at Evan's favorite spot by the beach, planning to visit it soon to de-stress.",
+            "Evan: [10 January 2024] Evan enjoys going on beach sunsets as a low-impact exercise.",
+            "[7:11 pm on 24 May, 2023] Evan: Hey Sam, thanks for asking! It was great - fresh air, peacefulness and a cozy cabin surrounded by mountains and forests made it feel like a real retreat.",
+            "Evan: [31 December 2023] Sam shared about a recent hiking trip, while Evan mentioned a mountain drive that ended in a minor accident.",
+            "Evan: [27 August 2023] Evan recommended a nearby lake for hiking and nature exploration.",
+            "Evan: [27 August 2023] Evan enjoys road trips and exploring nature."
+          ]
+        }
+      ]
+    },
+    {
+      "id": "reading_detail",
+      "title": "Reading Detail",
+      "source_id": "conv-49",
+      "category": 4,
+      "question": "What novel is Evan reading that he finds gripping?",
+      "gold_answer": "The Great Gatsby",
+      "kind": "failure",
+      "note": "A representative multi-hop miss: the model retains the coarse book description but misses the specific title.",
+      "systems": [
+        {
+          "name": "sft4",
+          "prediction": "a new mystery novel",
+          "top_retrieval": [
+            "Evan: [27 August 2023] Evan is reading a book that he finds increasingly compelling.",
+            "Evan: [27 July 2023] Evan is currently reading a new mystery novel.",
+            "Evan: [27 July 2023] Evan is reading 'The Great Gatsby'.",
+            "Evan: [26 December 2023] Evan finds that art helps him recognize and handle his own feelings.",
+            "Evan: [27 August 2023] Evan expressed interest in a book and discussed potential physical therapy for his knee.",
+            "Evan: [10 January 2024] Evan concluded that he needs to be more careful next time.",
+            "Evan: [6 October 2023] Evan thinks writing is a great way to express oneself.",
+            "Evan: [13 August 2023] Evan suggested checking out a dream interpretation book to help interpret Sam's dream.",
+            "Evan: [6 October 2023] Evan believes that writing can be super therapeutic.",
+            "Evan: [6 October 2023] Evan usually paints what is on his mind or something he is feeling."
+          ]
+        }
+      ]
+    }
+  ]
+}