# PRISM-Memory Training Data
|
|
The PRISM-Memory release is trained on **synthetic** multi-session
conversations with **GPT-4.1-derived** memory-writing labels. No real user chat
logs are part of the public release.
|
|
## Dataset At A Glance
|
|
| Item | Count | What it means |
|---|---:|---|
| synthetic training conversations | `2,329` | multi-session conversations used to build the training label bank |
| synthetic held-out conversations | `584` | held-out conversations used for evaluation examples and reference labels |
| total generated conversations | `2,913` | train plus eval |
| supervised extraction examples | `100,427` | memory-writing examples derived from the synthetic conversations |
| released training subset | `20,000` | supervised examples used to train the public adapter |
| agent and task families | `6` | research, data analysis, QA, coding, planning, writing |
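

The headline counts are self-consistent; here is a quick check in plain
Python, with the values copied from the table above:

```python
# Sanity-check the corpus counts from the table above.
train_convs, heldout_convs = 2_329, 584
label_bank, released_subset = 100_427, 20_000

assert train_convs + heldout_convs == 2_913  # total generated conversations

# The public adapter saw roughly a fifth of the full label bank.
print(f"released fraction: {released_subset / label_bank:.1%}")  # ~19.9%
```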
|
|
The synthetic conversation generator deliberately creates long-horizon memory
pressure:


- facts introduced early and queried later
- updated plans and corrected details
- deleted or invalidated information
- multi-session continuity
- mixtures of preferences, project state, dates, and operational facts
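

As a concrete illustration of that pressure, the sketch below lays out a
hypothetical event timeline in which facts are inserted early, revised in a
later session, and eventually invalidated. The `MemoryEvent` structure, its
field names, and the revision wording are assumptions for illustration, not
the release's actual schema.

```python
from dataclasses import dataclass

@dataclass
class MemoryEvent:
    """Hypothetical event record; field names are illustrative only."""
    session: int  # which session the event occurs in
    turn: int     # turn index within that session
    op: str       # "insert", "update", or "delete"
    fact: str     # the durable fact being written or revised

# A fact introduced early, updated later, and a plan eventually invalidated.
timeline = [
    MemoryEvent(1, 2, "insert", "Primary database: PostgreSQL 13"),
    MemoryEvent(1, 5, "insert", "Latency target: API responses under 50ms"),
    MemoryEvent(2, 3, "update", "Latency target tightened (illustrative revision)"),
    MemoryEvent(2, 7, "delete", "Earlier orchestration plan dropped (illustrative)"),
]
```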
|
|
## How The Data Is Built


The training pipeline has two layers.
|
|
### 1. Synthetic conversation generation


The first layer creates multi-session conversations around realistic work and
assistant scenarios. Each conversation comes with scenario metadata, a persona,
multiple sessions, and explicit memory events such as inserts, updates, and
deletes.
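

Under that description, one generated conversation record might look like the
sketch below, filled in with details from the worked example later on this
page. The key names are assumptions, not the released file format.

```python
# Hypothetical shape of one generated conversation record; key names are
# assumptions based on the description above, not the released file format.
conversation = {
    "scenario": {
        "domain": "cloud infrastructure performance optimization",
        "shape": {"sessions": 2, "chunks": 10, "later_questions": 5},
    },
    "persona": "senior cloud systems engineer at a fintech startup",
    "sessions": [],       # each session: an ordered list of user/assistant turns
    "memory_events": [],  # explicit inserts, updates, and deletes, anchored to turns
}
```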
|
|
Across the full corpus:


- `899` conversations are short
- `1,162` are medium
- `852` are long
- `897` are insert-only
- `937` include updates
- `435` include both updates and deletes
|
|
### 2. Supervised memory-writing labels


The second layer converts those conversations into supervised extraction
examples. Each example contains:
|
|
- retrieved memories seen so far
- recent conversation context
- the current user turn
- target memory operations that should be written from that turn


The released model learns this memory-writing step.
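

Serialized, one such example might look like this sketch; the keys mirror the
four components listed above, and both the key names and the operation format
are assumptions rather than the released label schema.

```python
# Hypothetical serialization of one supervised extraction example.
# Key names and the operation format are assumptions, not the release schema.
example = {
    "retrieved_memories": [
        "Primary database: PostgreSQL 13",  # memories written in earlier turns
    ],
    "recent_context": "... the last few turns of the current session ...",
    "current_turn": "Use Redis for caching and keep API latency under 50ms.",
    "target_operations": [
        {"op": "insert", "record": "Use Redis as an in-memory caching layer"},
        {"op": "insert", "record": "Latency target: API responses under 50ms"},
    ],
}
```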
|
|
## What A Training Example Looks Like


One actual scenario from the synthetic corpus is about **cloud infrastructure
performance optimization** for a low-latency trading platform.
|
|
**Synthetic scenario**


- domain: cloud infrastructure performance optimization
- persona: senior cloud systems engineer at a fintech startup
- conversation shape: two sessions, ten chunks, five later questions
|
|
**Synthetic user turn**


> Here’s the initial architecture outline: deploy microservices on AWS Fargate, use PostgreSQL 13 as the primary database, plan Kubernetes orchestration, use Redis for caching, keep API latency under 50ms, and redesign the system with a team of five engineers.
|
|
**Target memory records**


- Deploy microservices on AWS Fargate
- Orchestrate containers on a Kubernetes cluster (planned)
- Primary database: PostgreSQL 13
- Use Redis as an in-memory caching layer
- Latency target: API responses under 50ms
|
|
Later turns in the same conversation update that memory with new load targets,
TTL settings, and rollout constraints such as zero downtime.
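

For instance, a later turn that revises the caching plan might carry targets
like the sketch below; the operation shapes are assumptions, and the revised
wording is illustrative.

```python
# Hypothetical targets for a later turn that revises earlier memory.
later_turn_targets = [
    {"op": "insert", "record": "Rollout constraint: zero-downtime deployment"},
    {
        "op": "update",
        "old": "Use Redis as an in-memory caching layer",
        "new": "Use Redis for caching with explicit TTL settings",
    },
]
```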
|
|
## What Trained The Released Model


The public adapter was trained on `20,000` supervised extraction examples
sampled from the larger `100,427`-example label bank.
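

A sketch of that subsetting step, assuming uniform random sampling (the actual
sampling strategy and seed are not stated here):

```python
import random

random.seed(0)  # reproducibility; the real seed/strategy is not public
label_bank_ids = range(100_427)           # one id per supervised example
released_ids = random.sample(label_bank_ids, k=20_000)
```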
|
|
In plain terms, the model saw many examples of this pattern:


1. a conversation turn mentions several durable facts
2. the target output keeps only the memory-worthy facts
3. those facts are written as short standalone memory records


That is why the release behaves like a memory writer rather than a chat model.
|
|
## Evaluation Surfaces


The released model is evaluated on two held-out surfaces.
|
|
| Benchmark | Held-out surface | What it tests |
|---|---|---|
| `LoCoMo` | held-out conversations `conv-49` and `conv-50` | factual, temporal, inferential, multi-hop, and adversarial recall |
| `LongMemEval` | held-out items across six categories | knowledge updates, multi-session recall, single-session recall, and temporal reasoning, among others |
|
|
Both the PRISM extractor and the GPT-4.1-based PropMem reference are scored
with the same QA layer, so the public comparison isolates the extraction step.
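

A minimal sketch of that setup: both extractors feed one fixed
answer-and-grade loop, so any score gap is attributable to extraction rather
than question answering. All function names here are hypothetical.

```python
# Minimal sketch of scoring two extractors through one shared QA layer.
# `extract`, `qa_answer`, and `grade` are hypothetical callables.
def score(extract, qa_answer, grade, items):
    """items: (conversation, question, gold_answer) triples from a held-out set."""
    correct = 0.0
    for conversation, question, gold in items:
        memories = extract(conversation)        # extraction step under test
        answer = qa_answer(memories, question)  # shared, fixed QA layer
        correct += grade(answer, gold)          # e.g. 1.0 if correct else 0.0
    return correct / len(items)

# score(prism_extract, qa_answer, grade, held_out)
# score(propmem_extract, qa_answer, grade, held_out)  # same qa_answer and grade
```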
|
|
## What Is Public Today


Public now:


- the dataset design
- corpus counts
- example training records
- held-out extraction examples
- benchmark results and category breakdowns
|
|
Not public yet:


- the full raw synthetic conversation files
- the full supervised label bank
- the auxiliary ablation corpora used for follow-on experiments
|
|
## Practical Lessons From The Data


1. The strongest release model came from the stable `20,000`-example base, not
   from benchmark-specific add-ons.
2. Explicit date anchoring helped more than benchmark-style answer formatting.
3. Adding more narrow benchmark data did not automatically improve
   generalization.
4. The supervision is most useful when it teaches durable facts, updates, and
   contradictions instead of stylistic imitation.
|
|
Related docs:


- [extraction-skill.md](extraction-skill.md)
- [release-results.md](release-results.md)
- [technical-blog.md](technical-blog.md)
|
|