AsadIsmail committed on
Commit 2801c36 · verified · 1 Parent(s): 607bb36

Add PRISM-Memory Space demo

MEMORY_EXTRACTION_SKILL.md ADDED
@@ -0,0 +1,113 @@
+ # PRISM-Memory Extraction Skill
+
+ **Hook:** Turn conversations into durable, searchable memory.
+
+ This is the single extraction skill to keep from the `better_memory` work.
+ Public release should point to one checkpoint and one extraction behavior:
+
+ - **Model:** `exp15_sft_qwen7b_4ep`
+ - **Base model:** `Qwen/Qwen2.5-7B-Instruct`
+ - **Role:** proposition extraction for long-term conversational memory
+ - **Why this one:** best confirmed total profile, best adversarial behavior, and
+   best LongMemEval score
+
+ ## Skill Definition
+
+ The extractor operates turn by turn and emits `0-5` atomic propositions per
+ turn. Each proposition should be a standalone fact about a person, event,
+ preference, or property, with dates carried into the fact when available.
+
+ Canonical prompt:
+
+ ```text
+ You are a memory extraction assistant. Given a conversation turn, extract 0-5 atomic, standalone facts. Each fact must be a complete sentence about a specific person, event, preference, or property. Include dates/times when mentioned. Skip greetings, filler, and questions. Output ONLY a JSON array of strings, e.g. ["fact1", "fact2"] or [].
+ ```
+
+ This prompt comes from
+ `/home/ec2-user/SageMaker/better_memory/experiment15_learned_extraction.py`.
+
+ ## Inference Contract
+
+ 1. Format the turn with speaker and session date.
+ 2. Extract `0-5` propositions as a JSON array.
+ 3. Clean speaker references so generic labels become real names.
+ 4. Resolve relative temporal expressions against the session date.
+ 5. Prefix each proposition with the normalized session date before indexing.
+ 6. Retrieve with the PRISM hybrid stack, not with the extractor alone.
+
+ ## Retrieval Setup To Keep
+
+ - **Retriever:** `PRISMv3Rerank`
+ - **Sparse retrieval:** BM25
+ - **Dense retrieval:** `all-MiniLM-L6-v2`
+ - **Reranker:** `cross-encoder/ms-marco-MiniLM-L-6-v2`
+
+ Best confirmed retrieval settings:
+
+ - **LoCoMo:** adversarial `k=5`, multi-hop `k=10`, all other categories `k=8`
+ - **LongMemEval:** multi-session `k=20`, all other categories `k=8` except
+   single-session-user `k=5`
+
+ ## What Worked
+
+ 1. **The original 20k base mattered.**
+    `sft4` came from the exact `train_sft_clean_merged.jsonl` base distribution.
+    Runs that changed the base subset regressed.
+
+ 2. **Four epochs was the sweet spot.**
+    `sft4` is the local optimum the repo could actually reproduce.
+
+ 3. **Absolute date anchoring helped.**
+    Temporal repairs worked when the model saw explicit, normalized dates rather
+    than benchmark-specific relative phrasing.
+
+ 4. **Post-processing mattered.**
+    Speaker cleanup plus relative-date resolution was necessary to turn raw
+    outputs into stable memory records.
+
+ 5. **Hybrid retrieval beat simpler retrieval.**
+    BM25 + dense + reranking consistently outperformed BM25-only or dense-only
+    approaches.
+
+ 6. **Turn-local extraction was enough.**
+    The model performed better without feeding long recent-context windows into
+    the extractor.
+
+ 7. **Multihop supervision preserved inferential behavior.**
+    When temporal data was added, multihop QA was the only extra signal that
+    reliably helped preserve inferential performance.
+
+ ## What Did Not Work
+
+ 1. **Relative-date training.**
+    Training the extractor to emit benchmark-style relative dates hurt temporal
+    performance instead of helping it.
+
+ 2. **LoCoMo-domain SFT data.**
+    Adding LoCoMo training conversations consistently regressed the model.
+
+ 3. **More than 20k original LME examples.**
+    Scaling the original noisy temporal labels to 50k amplified anchor loss and
+    caused major regression.
+
+ 4. **Small clean bases.**
+    5k-base follow-on runs forgot too much and collapsed inferential behavior.
+
+ 5. **Heavy QA multipliers.**
+    High temporal or QA multipliers damaged adversarial precision and LongMemEval.
+
+ 6. **High learning rates on follow-on QA runs.**
+    Aggressive fine-tuning degraded the traits that made `sft4` good.
+
+ 7. **Trying to push past the local optimum.**
+    Most post-`sft4` training traded away adversarial performance for narrower
+    gains.
+
+ ## Release Rule
+
+ Release only this extraction skill and only this checkpoint publicly:
+
+ - `exp15_sft_qwen7b_4ep`
+
+ Treat all other checkpoints as internal ablations and learning artifacts, not as
+ parallel public releases.
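Steps 2 and 5 of the inference contract in `MEMORY_EXTRACTION_SKILL.md` can be sketched roughly as below. This is an illustration under stated assumptions, not the code from `experiment15_learned_extraction.py`: the helper names `parse_propositions` and `to_memory_record` are hypothetical, the empty-list fallback on malformed output is an assumed policy, and the `Speaker: [D Month YYYY] fact` layout is inferred from the `top_retrieval` strings in `results/scenario_comparisons.json`. Speaker cleanup and relative-date resolution (steps 3 and 4) are left out.

```python
import json
from datetime import date

MAX_PROPOSITIONS = 5  # the canonical prompt asks for 0-5 facts per turn

# Spelled out so the output does not depend on the process locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def parse_propositions(raw_output: str) -> list[str]:
    """Parse raw extractor text into at most five non-empty fact strings.

    Assumed policy: anything that is not a JSON array yields [].
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only non-blank strings; drop numbers, nulls, nested values.
    facts = [item.strip() for item in parsed if isinstance(item, str) and item.strip()]
    return facts[:MAX_PROPOSITIONS]


def to_memory_record(speaker: str, session_date: date, fact: str) -> str:
    """Prefix a proposition with the speaker and a normalized session date."""
    stamp = f"{session_date.day} {MONTHS[session_date.month - 1]} {session_date.year}"
    return f"{speaker}: [{stamp}] {fact}"
```

Capping and filtering after parsing keeps the index stable even when the model occasionally emits extra or non-string items.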
README.md CHANGED
@@ -1,12 +1,38 @@
  ---
- title: Prism Memory
- emoji: 🏢
- colorFrom: purple
- colorTo: blue
+ title: PRISM-Memory
+ colorFrom: blue
+ colorTo: green
  sdk: gradio
  sdk_version: 6.12.0
  app_file: app.py
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # PRISM-Memory Space
+
+ **Hook:** Turn conversations into durable, searchable memory.
+
+ This Space is the lightweight public demo for the single released
+ `PRISM-Memory` extraction skill. It shows the best checkpoint only.
+
+ ## Inputs
+
+ The app reads:
+
+ - `../results/confirmed_exp15_summary.json`
+ - `../results/scenario_comparisons.json`
+ - `../MEMORY_EXTRACTION_SKILL.md`
+
+ ## What It Shows
+
+ 1. The confirmed metrics for the released checkpoint
+ 2. Selected benchmark cases showing strengths and failure modes
+ 3. The single canonical memory extraction skill to keep
+
+ ## Local Run
+
+ ```bash
+ cd /home/ec2-user/SageMaker/autoresearch_memory/hf_space_demo
+ python -m pip install -r requirements.txt
+ python app.py
+ ```
app.py ADDED
@@ -0,0 +1,178 @@
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+
+ import gradio as gr
+ import pandas as pd
+
+ APP_DIR = Path(__file__).resolve().parent
+ if (APP_DIR / "MEMORY_EXTRACTION_SKILL.md").exists() or (APP_DIR / "results").exists():
+     ROOT = APP_DIR
+ else:
+     ROOT = APP_DIR.parent
+ RESULTS_DIR = ROOT / "results"
+ SUMMARY_PATH = RESULTS_DIR / "confirmed_exp15_summary.json"
+ SCENARIO_PATH = RESULTS_DIR / "scenario_comparisons.json"
+ SKILL_PATH = ROOT / "MEMORY_EXTRACTION_SKILL.md"
+ LOCOMO_CATEGORY_NAMES = {
+     "1": "factual",
+     "2": "temporal",
+     "3": "inferential",
+     "4": "multi-hop",
+     "5": "adversarial",
+ }
+ LME_CATEGORY_ORDER = [
+     "knowledge-update",
+     "multi-session",
+     "single-session-assistant",
+     "single-session-preference",
+     "single-session-user",
+     "temporal-reasoning",
+ ]
+
+
+ def _load_json(path: Path, default):
+     if not path.exists():
+         return default
+     return json.loads(path.read_text())
+
+
+ def _load_summary() -> dict:
+     return _load_json(SUMMARY_PATH, {"results": [], "failures": []})
+
+
+ def _load_scenarios() -> dict:
+     return _load_json(SCENARIO_PATH, {"scenarios": []})
+
+
+ def _load_skill() -> str:
+     if not SKILL_PATH.exists():
+         return "Skill document not found."
+     return SKILL_PATH.read_text()
+
+
+ def _best_result() -> dict | None:
+     results = _load_summary().get("results", [])
+     return results[0] if results else None
+
+
+ def release_markdown() -> str:
+     item = _best_result()
+     if not item:
+         return "## No confirmed release result yet"
+     checkpoint = Path(item["checkpoint"]).name
+     return "\n\n".join(
+         [
+             "# PRISM-Memory",
+             "**Turn conversations into durable, searchable memory.**",
+             f"Released checkpoint: `{checkpoint}`",
+             f"Confirmed LoCoMo: `{item['locomo']['mean']:.3f}`",
+             f"Confirmed LongMemEval: `{item['lme']['mean']:.3f}`",
+         ]
+     )
+
+
+ def summary_df() -> pd.DataFrame:
+     item = _best_result()
+     if not item:
+         return pd.DataFrame(columns=["checkpoint", "locomo_mean", "lme_mean", "cache_hits", "cache_misses", "eval_minutes"])
+     return pd.DataFrame(
+         [
+             {
+                 "checkpoint": Path(item["checkpoint"]).name,
+                 "locomo_mean": round(item["locomo"]["mean"], 3),
+                 "lme_mean": round(item["lme"]["mean"], 3),
+                 "cache_hits": item["qa_cache"]["hits"],
+                 "cache_misses": item["qa_cache"]["misses"],
+                 "eval_minutes": item["elapsed_min"],
+             }
+         ]
+     )
+
+
+ def category_df() -> pd.DataFrame:
+     item = _best_result()
+     if not item:
+         return pd.DataFrame(columns=["benchmark", "category", "score"])
+     rows = []
+     for category in sorted(item["locomo"]["categories"], key=int):
+         score = item["locomo"]["categories"][category]
+         rows.append(
+             {
+                 "benchmark": "LoCoMo",
+                 "category": LOCOMO_CATEGORY_NAMES.get(category, category),
+                 "score": round(score, 3),
+             }
+         )
+     for category in LME_CATEGORY_ORDER:
+         if category not in item["lme"]["categories"]:
+             continue
+         score = item["lme"]["categories"][category]
+         rows.append({"benchmark": "LongMemEval", "category": category, "score": round(score, 3)})
+     return pd.DataFrame(rows)
+
+
+ def _scenario_label(item: dict) -> str:
+     return f"{item['title']}: {item['question']}"
+
+
+ def scenario_choices() -> list[str]:
+     scenarios = _load_scenarios().get("scenarios", [])
+     return [_scenario_label(s) for s in scenarios]
+
+
+ def render_scenario(choice: str):
+     scenarios = _load_scenarios().get("scenarios", [])
+     if not scenarios:
+         return "No scenario data yet.", pd.DataFrame(columns=["prediction", "top_retrieval"])
+
+     item = next(
+         (scenario for scenario in scenarios if _scenario_label(scenario) == choice or scenario["id"] == choice),
+         scenarios[0],
+     )
+     system = next((entry for entry in item.get("systems", []) if entry.get("name") == "sft4"), item["systems"][0])
+     header = [
+         f"### {item['title']}",
+         "",
+         f"**Question:** {item['question']}",
+         f"**Gold answer:** {item['gold_answer']}",
+         f"**What this case shows:** {item.get('note', 'Selected benchmark case.')}",
+         f"**Case type:** {item.get('kind', 'n/a')}",
+     ]
+     table = pd.DataFrame(
+         [
+             {
+                 "prediction": system.get("prediction", ""),
+                 "top_retrieval": "\n".join(system.get("top_retrieval", [])[:3]),
+             }
+         ]
+     )
+     return "\n".join(header), table
+
+
+ with gr.Blocks(title="PRISM-Memory Demo") as demo:
+     gr.Markdown(release_markdown())
+
+     with gr.Tab("Metrics"):
+         gr.Markdown("## Released Checkpoint")
+         metrics = gr.Dataframe(value=summary_df(), interactive=False, wrap=True)
+         gr.Markdown("## Category Breakdown")
+         categories = gr.Dataframe(value=category_df(), interactive=False, wrap=True)
+         refresh = gr.Button("Refresh Data")
+         refresh.click(fn=lambda: (summary_df(), category_df()), outputs=[metrics, categories])
+
+     with gr.Tab("Memory Cases"):
+         choices = scenario_choices() or ["pending"]
+         picker = gr.Dropdown(choices=choices, value=choices[0], label="Benchmark Case")
+         scenario_md = gr.Markdown()
+         scenario_table = gr.Dataframe(interactive=False, wrap=True)
+         picker.change(render_scenario, inputs=picker, outputs=[scenario_md, scenario_table])
+         demo.load(fn=lambda: render_scenario(choices[0]), outputs=[scenario_md, scenario_table])
+
+     with gr.Tab("Skill"):
+         gr.Markdown(_load_skill())
+
+
+ if __name__ == "__main__":
+     demo.launch()
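`app.py` above indexes several summary fields directly (for example `item["qa_cache"]["hits"]`), so a summary file missing one of them would raise `KeyError` when the tables are built. A small pre-flight check along the following lines could surface that earlier. This is only a sketch: `missing_summary_keys` and `REQUIRED_PATHS` are hypothetical names that are not part of the commit, and the path list is read off the subscripts used in `release_markdown`, `summary_df`, and `category_df`.

```python
# Key paths that app.py reads with plain subscripts (no .get fallback).
REQUIRED_PATHS = [
    ("checkpoint",),
    ("elapsed_min",),
    ("locomo", "mean"),
    ("locomo", "categories"),
    ("lme", "mean"),
    ("lme", "categories"),
    ("qa_cache", "hits"),
    ("qa_cache", "misses"),
]


def missing_summary_keys(item: dict) -> list[str]:
    """Return dotted paths from REQUIRED_PATHS that are absent in `item`."""
    missing = []
    for path in REQUIRED_PATHS:
        node = item
        for key in path:
            # Stop at the first level where the path cannot be followed.
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing
```

An empty return value means the first entry of `results` has every field the demo subscripts.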
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ gradio>=5.0.0
+ pandas>=2.0.0
results/confirmed_exp15_summary.json ADDED
@@ -0,0 +1,53 @@
+ {
+   "results": [
+     {
+       "alias": "sft4",
+       "checkpoint": "/home/ec2-user/SageMaker/better_memory/exp15_sft_qwen7b_4ep",
+       "elapsed_min": 28.93,
+       "args": {
+         "n_lme": 10,
+         "context_window": 0,
+         "locomo_temporal_k": 8,
+         "locomo_adversarial_k": 5,
+         "lme_multisess_k": 20,
+         "use_temporal_prompt": false,
+         "strict_cache": true
+       },
+       "qa_cache": {
+         "model": "gpt-4.1",
+         "cache_size": 16969,
+         "hits": 460,
+         "misses": 0,
+         "missing_examples": []
+       },
+       "locomo": {
+         "categories": {
+           "1": 0.3339551926061944,
+           "2": 0.4978785869736096,
+           "3": 0.26059974747474746,
+           "4": 0.514447774438597,
+           "5": 0.8837209302325582
+         },
+         "mean": 0.49812044634514124
+       },
+       "lme": {
+         "categories": {
+           "knowledge-update": 0.558840579710145,
+           "multi-session": 0.13909774436090225,
+           "single-session-assistant": 0.765639589169001,
+           "single-session-preference": 0.05196674560130369,
+           "single-session-user": 0.9133333333333333,
+           "temporal-reasoning": 0.43166666666666664
+         },
+         "mean": 0.47675744314022533
+       },
+       "logged_comparison": {
+         "logged_locomo_mean": 0.498,
+         "logged_lme_mean": 0.477,
+         "locomo_delta": 0.00012044634514124519,
+         "lme_delta": -0.00024255685977464525
+       }
+     }
+   ],
+   "failures": []
+ }
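The `mean` fields in `results/confirmed_exp15_summary.json` appear to be unweighted (macro) averages over the per-category scores, with each category counted once regardless of how many questions it contains. The sketch below recomputes them from the values in the file; the helper name `macro_mean` is hypothetical, and whether the evaluation script computes the mean this way is an inference from the numbers, not something the commit states.

```python
from statistics import fmean

# Per-category scores copied from results/confirmed_exp15_summary.json.
LOCOMO = {
    "1": 0.3339551926061944,
    "2": 0.4978785869736096,
    "3": 0.26059974747474746,
    "4": 0.514447774438597,
    "5": 0.8837209302325582,
}
LME = {
    "knowledge-update": 0.558840579710145,
    "multi-session": 0.13909774436090225,
    "single-session-assistant": 0.765639589169001,
    "single-session-preference": 0.05196674560130369,
    "single-session-user": 0.9133333333333333,
    "temporal-reasoning": 0.43166666666666664,
}


def macro_mean(categories: dict[str, float]) -> float:
    """Unweighted mean over categories: every category counts once."""
    return fmean(categories.values())


print(round(macro_mean(LOCOMO), 3))  # 0.498
print(round(macro_mean(LME), 3))     # 0.477
```

Under this reading, the weak `single-session-preference` and `multi-session` scores pull the LongMemEval mean down as much as the strong `single-session-user` score lifts it.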
results/scenario_comparisons.json ADDED
@@ -0,0 +1,140 @@
+ {
+   "qa_cache": {
+     "model": "gpt-4.1",
+     "cache_size": 16969,
+     "hits": 5,
+     "misses": 0,
+     "missing_examples": []
+   },
+   "scenarios": [
+     {
+       "id": "temporal_anchor_hobby",
+       "title": "Temporal Anchor",
+       "source_id": "conv-49",
+       "category": 2,
+       "question": "Which hobby did Sam take up in May 2023?",
+       "gold_answer": "painting",
+       "kind": "strength",
+       "note": "The released model keeps the dated hobby proposition and answers correctly.",
+       "systems": [
+         {
+           "name": "sft4",
+           "prediction": "painting",
+           "top_retrieval": [
+             "Sam: [18 May 2023] Sam is considering trying painting as a new hobby.",
+             "Sam: [24 May 2023] Sam has been considering trying painting as a new hobby.",
+             "Sam: [6 October 2023] Sam asked Evan if he has explored any fun indoor activities or hobbies.",
+             "Sam: [18 May 2023] Sam is excited to try new things.",
+             "Sam: [24 May 2023] Sam is trying to break old habits.",
+             "Sam: [15 August 2023] Sam attended a cooking class.",
+             "Sam: [17 December 2023] Sam used to love hiking.",
+             "[1:47 pm on 18 May, 2023] Sam: We hiked a good distance - quite a feat for me back then. It's definitely a great memory."
+           ]
+         }
+       ]
+     },
+     {
+       "id": "adversarial_refusal_guitar",
+       "title": "Adversarial Refusal",
+       "source_id": "conv-50",
+       "category": 5,
+       "question": "Why did Dave get his guitar customized with a shiny finish?",
+       "gold_answer": "",
+       "kind": "strength",
+       "note": "This tests whether the system refuses to invent an answer when the premise is unsupported.",
+       "systems": [
+         {
+           "name": "sft4",
+           "prediction": "None",
+           "top_retrieval": [
+             "[2:55 pm on 31 August, 2023] Dave: That guitar has a gorgeous purple hue. Why did you make it so shiny?",
+             "Dave: [2 November 2023] The guitar was in bad condition when Dave found it.",
+             "[2:55 pm on 31 August, 2023] Dave: Good pick! The customized purple glow gives it a unique look that really stands out.",
+             "[2:55 pm on 31 August, 2023] Dave: That's a great guitar, Calvin! Love the design, it's so unique and special.",
+             "Dave: [16 May 2023] Calvin lost his guitar and amp but managed to save his music gear and microphone."
+           ]
+         }
+       ]
+     },
+     {
+       "id": "diagnosis_specificity",
+       "title": "Diagnosis Specificity",
+       "source_id": "conv-49",
+       "category": 1,
+       "question": "Which ailment does Sam have to face due to his weight?",
+       "gold_answer": "gastritis",
+       "kind": "failure",
+       "note": "A representative factual miss: the model retrieves the health-risk frame but not the specific diagnosis.",
+       "systems": [
+         {
+           "name": "sft4",
+           "prediction": "serious health risk",
+           "top_retrieval": [
+             "Sam: [8 October 2023] The doctor told Sam that his weight is a serious health risk.",
+             "Sam: [24 May 2023] The doctor's check-up revealed that Sam's weight was not good.",
+             "Sam: [13 August 2023] Sam is currently experiencing challenges affecting his health.",
+             "[6:48 pm on 17 December, 2023] Sam: Yeah, I'm struggling with my weight and it's affecting my confidence. I feel like I can't overcome all the challenges with my weight, I keep lacking motivation.",
+             "Sam: [21 November 2023] Sam has been trying to make dietary changes to address his discomfort.",
+             "Sam: [9 November 2023] Sam is a Weight Watchers coach in his group.",
+             "Sam: [7 August 2023] Sam has been prioritizing his health for some time.",
+             "Sam: [15 August 2023] Sam is concerned about his health."
+           ]
+         }
+       ]
+     },
+     {
+       "id": "location_inference",
+       "title": "Location Inference",
+       "source_id": "conv-49",
+       "category": 3,
+       "question": "Does Evan live close to a beach or mountains?",
+       "gold_answer": "beach",
+       "kind": "failure",
+       "note": "A representative inferential miss: retrieval includes both clues, but the model overcommits to the mountain mention.",
+       "systems": [
+         {
+           "name": "sft4",
+           "prediction": "mountains",
+           "top_retrieval": [
+             "Evan: [27 August 2023] Evan also shared his recent road trip to the Rocky Mountains and love for hiking.",
+             "Evan: [27 August 2023] Evan lives within a two-hour drive of a place with incredible views and a peaceful atmosphere.",
+             "Evan: [9 November 2023] They also discussed enjoying a sunset together at Evan's favorite spot by the beach, planning to visit it soon to de-stress.",
+             "Evan: [10 January 2024] Evan enjoys going on beach sunsets as a low-impact exercise.",
+             "[7:11 pm on 24 May, 2023] Evan: Hey Sam, thanks for asking! It was great - fresh air, peacefulness and a cozy cabin surrounded by mountains and forests made it feel like a real retreat.",
+             "Evan: [31 December 2023] Sam shared about a recent hiking trip, while Evan mentioned a mountain drive that ended in a minor accident.",
+             "Evan: [27 August 2023] Evan recommended a nearby lake for hiking and nature exploration.",
+             "Evan: [27 August 2023] Evan enjoys road trips and exploring nature."
+           ]
+         }
+       ]
+     },
+     {
+       "id": "reading_detail",
+       "title": "Reading Detail",
+       "source_id": "conv-49",
+       "category": 4,
+       "question": "What novel is Evan reading that he finds gripping?",
+       "gold_answer": "The Great Gatsby",
+       "kind": "failure",
+       "note": "A representative multi-hop miss: the model retains the coarse book description but misses the specific title.",
+       "systems": [
+         {
+           "name": "sft4",
+           "prediction": "a new mystery novel",
+           "top_retrieval": [
+             "Evan: [27 August 2023] Evan is reading a book that he finds increasingly compelling.",
+             "Evan: [27 July 2023] Evan is currently reading a new mystery novel.",
+             "Evan: [27 July 2023] Evan is reading 'The Great Gatsby'.",
+             "Evan: [26 December 2023] Evan finds that art helps him recognize and handle his own feelings.",
+             "Evan: [27 August 2023] Evan expressed interest in a book and discussed potential physical therapy for his knee.",
+             "Evan: [10 January 2024] Evan concluded that he needs to be more careful next time.",
+             "Evan: [6 October 2023] Evan thinks writing is a great way to express oneself.",
+             "Evan: [13 August 2023] Evan suggested checking out a dream interpretation book to help interpret Sam's dream.",
+             "Evan: [6 October 2023] Evan believes that writing can be super therapeutic.",
+             "Evan: [6 October 2023] Evan usually paints what is on his mind or something he is feeling."
+           ]
+         }
+       ]
+     }
+   ]
+ }
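The `top_retrieval` lists in `results/scenario_comparisons.json` mix two entry shapes: extracted propositions such as `Sam: [18 May 2023] ...` and raw conversation turns such as `[1:47 pm on 18 May, 2023] Sam: ...`. When inspecting a case it can help to tell them apart. The following is a minimal sketch that assumes these two shapes are the only ones; the regular expressions and the name `classify_entry` are illustrative and are not taken from the PRISM code.

```python
import re

# "Speaker: [D Month YYYY] fact" as seen in the extracted propositions.
PROPOSITION_RE = re.compile(
    r"^(?P<speaker>[^:\[\]]+): \[(?P<date>\d{1,2} [A-Za-z]+ \d{4})\] (?P<text>.+)$"
)
# "[time on D Month, YYYY] Speaker: utterance" as seen in the raw turns.
RAW_TURN_RE = re.compile(r"^\[(?P<time>[^\]]+)\] (?P<speaker>[^:]+): (?P<text>.+)$")


def classify_entry(entry: str) -> dict:
    """Label one retrieval string as 'proposition', 'raw_turn', or 'unknown'."""
    for kind, pattern in (("proposition", PROPOSITION_RE), ("raw_turn", RAW_TURN_RE)):
        match = pattern.match(entry)
        if match:
            return {"kind": kind, **match.groupdict()}
    return {"kind": "unknown", "text": entry}
```

For example, in the "Adversarial Refusal" case three of the five retrieved entries are raw turns, which a tally over `classify_entry(...)["kind"]` makes visible at a glance.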