# PRISM-Memory Release Results

This page summarizes the confirmed public release metrics and the internal
comparison evidence that informed the release choice.

## Released Model

- Model: `PRISM-Memory 7B Adapter`
- Base model: `Qwen/Qwen2.5-7B-Instruct`
- Adapter type: LoRA
- Confirmed LoCoMo mean: `0.4981204463`
- Confirmed LongMemEval mean: `0.4767574431`
- QA cache hits during confirmation: `460`
- QA cache misses during confirmation: `0`

## Public Comparison

PRISM-Memory fine-tunes `Qwen/Qwen2.5-7B-Instruct` to perform the memory
extraction step that the PropMem reference delegates to GPT-4.1.

| Benchmark | PRISM-Memory | GPT-4.1-based PropMem reference | Outcome |
|---|---:|---:|---|
| LongMemEval | `0.4768` | `0.4650` | PRISM wins |
| LoCoMo | `0.4981` | `0.5360` | PRISM trails, but stays competitive |

The QA layer is held constant. This is an extraction-step comparison, not an
end-to-end GPT-4.1 replacement claim.
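
As a rough illustration of how large the gaps in the table are, the sketch below computes the absolute and relative differences from the rounded table values. The dictionary layout and the `score_gaps` helper are assumptions made for this example; they do not come from the release artifacts.

```python
# Scores copied from the comparison table above (rounded to four decimals).
scores = {
    "LongMemEval": {"prism": 0.4768, "reference": 0.4650},
    "LoCoMo": {"prism": 0.4981, "reference": 0.5360},
}


def score_gaps(table):
    """Return (absolute, relative) gaps per benchmark, PRISM minus reference."""
    gaps = {}
    for name, row in table.items():
        absolute = row["prism"] - row["reference"]
        relative = absolute / row["reference"]
        gaps[name] = (absolute, relative)
    return gaps


gaps = score_gaps(scores)
for name, (absolute, relative) in gaps.items():
    print(f"{name}: {absolute:+.4f} absolute, {relative:+.1%} relative")
```

With these inputs the LongMemEval gap is about `+0.012` and the LoCoMo gap is about `-0.038`. Because the QA layer is held constant, these differences reflect the extraction step only.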

## LoCoMo Breakdown

| Category | Score |
|---|---:|
| factual | `0.3339551926` |
| temporal | `0.4978785870` |
| inferential | `0.2605997475` |
| multi-hop | `0.5144477744` |
| adversarial | `0.8837209302` |

## LongMemEval Breakdown

| Category | Score |
|---|---:|
| knowledge-update | `0.5588405797` |
| multi-session | `0.1390977444` |
| single-session-assistant | `0.7656395892` |
| single-session-preference | `0.0519667456` |
| single-session-user | `0.9133333333` |
| temporal-reasoning | `0.4316666667` |
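
The per-category scores in the two breakdown tables can be cross-checked against the confirmed means listed under "Released Model". The sketch below takes the unweighted mean over categories for each benchmark. With the values on this page, that mean matches the confirmed figure to the printed precision, which suggests (but does not by itself confirm) that the headline numbers are macro-averages over categories. The `macro_average` helper and the dictionary names are assumptions made for this example.

```python
from statistics import fmean

# Per-category scores copied from the breakdown tables on this page.
locomo = {
    "factual": 0.3339551926,
    "temporal": 0.4978785870,
    "inferential": 0.2605997475,
    "multi-hop": 0.5144477744,
    "adversarial": 0.8837209302,
}
longmemeval = {
    "knowledge-update": 0.5588405797,
    "multi-session": 0.1390977444,
    "single-session-assistant": 0.7656395892,
    "single-session-preference": 0.0519667456,
    "single-session-user": 0.9133333333,
    "temporal-reasoning": 0.4316666667,
}

# Confirmed means from the "Released Model" section.
reported = {"LoCoMo": 0.4981204463, "LongMemEval": 0.4767574431}


def macro_average(per_category):
    """Unweighted mean over categories (each category counts equally)."""
    return fmean(per_category.values())


for name, breakdown in (("LoCoMo", locomo), ("LongMemEval", longmemeval)):
    macro = macro_average(breakdown)
    print(f"{name}: macro={macro:.10f} reported={reported[name]:.10f}")
```

A macro-average weights every category equally regardless of how many questions it contains, so low-scoring categories such as `single-session-preference` pull the headline number down as much as any other category.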

## Why This Model Was Released

The closest internal runner-up nearly tied the released model on overall
LoCoMo, but it lost on the broader release profile:

- lower LongMemEval score: `0.4689`
- weaker adversarial precision
- less balanced behavior across the full evaluation surface

Question-level comparison on held-out LoCoMo:

- disagreements: `152 / 400`
- questions favoring PRISM-Memory: `56`
- questions favoring the runner-up: `52`

The split is narrow enough to make the runner-up a genuine internal contender,
but it does not make a case for publishing two models.
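
To give a sense of how narrow the `56` versus `52` split is, the sketch below runs an exact two-sided sign test on the question-level counts. This is an illustrative check, not an analysis described in the release artifacts, and the `sign_test_p_value` helper is an assumption made for this example. The counts favoring either model sum to 108 of the 152 disagreements. This page does not say how the remaining disagreements were scored, so the sketch reports them separately and leaves them out of the test.

```python
from math import comb

# Counts copied from the question-level comparison above.
total_questions = 400
disagreements = 152
favor_prism = 56
favor_runner_up = 52


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test under the null of a 50/50 split."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


disagreement_rate = disagreements / total_questions
# Disagreements not attributed to either model in the summary above.
unattributed = disagreements - favor_prism - favor_runner_up
p_value = sign_test_p_value(favor_prism, favor_runner_up)

print(f"disagreement rate: {disagreement_rate:.2%}")
print(f"unattributed disagreements: {unattributed}")
print(f"sign test p-value: {p_value:.3f}")
```

A p-value well above conventional thresholds means this split is consistent with the two models being equally good on held-out LoCoMo, which fits the decision to separate them on LongMemEval and adversarial behavior instead.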

## Artifact Files

- [../../results/release_summary.json](../../results/release_summary.json)
- [../../results/release_model.json](../../results/release_model.json)
- [../../results/try_it_sessions.json](../../results/try_it_sessions.json)
- [../../results/internal_locomo_pairwise_diffs.json](../../results/internal_locomo_pairwise_diffs.json)

Related docs:

- [extraction-skill.md](extraction-skill.md)
- [extraction-examples.md](extraction-examples.md)
- [datasets.md](datasets.md)
- [model-card.md](model-card.md)