# Long-Term Memory Retrieval Benchmark
Code release for the experiments described in the accompanying paper:
- **Hierarchical memory** organization (User Profile / Semantic / Episodic).
- **Plan-Act-Read agentic retrieval** that interleaves keyword, time-filter,
and embedding search.
- **Flat / dense / oracle baselines** for comparison.
## Repository layout
```
.
├── main.py                       # End-to-end QA pipeline (agent, embed, keyword modes)
├── evaluate_qa.py                # Atomic-rubric QA evaluator (strict + partial)
├── model_zoo.py                  # Model registry
├── prompts/                      # Prompt templates
│   ├── agentic_retrieval_prompt.txt
│   ├── agentic_retrieval_prompt_wo_profile.txt
│   ├── keyword_search_prompt.txt
│   └── read_and_extract_prompt.txt
├── memory/                       # Episodic + semantic memory stores
├── baselines/
│   ├── MemoChat/                 # MemoChat baseline (upstream code + our wrapper)
│   ├── raptor/                   # RAPTOR baseline (upstream code + our wrapper)
│   └── read-agent/               # ReadAgent baseline wrapper
├── scripts/
│   ├── build_retrieval_cache.py  # Pre-compute GTE-7B embeddings for the corpus
│   ├── make_v5_shards.py         # Deterministic shard split by question_id
│   ├── merge_jsonl_by_dataset_order.py
│   ├── run_oracle_qa.py          # Gold-session-only upper bound
│   ├── plot_main_results.py
│   ├── llm_judge_agreement.py
│   └── slurm/
│       ├── example_dense_retrieval.slurm
│       └── example_agentic_retrieval.slurm
└── requirements.txt
```
The benchmark dataset (`evolv_mem_v5.json`) is released separately; place it
under `dataset/` along with the supporting files referenced by `main.py`
(`all_sessions.json`, `all_session_summary.json`, etc.).
## Setup
```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```
### API keys
The pipeline calls LLMs through four optional providers; set the key for
whichever you plan to use:
| Provider | Env var | Flag |
|------------------------------------------------|----------------------|--------------|
| OpenAI-compatible inference API | `NV_API_KEY` | `--nvidia` |
| OpenAI-compatible LiteLLM proxy | `LITELLM_API_KEY` | `--tritonai` |
| Direct Anthropic API | `ANTHROPIC_API_KEY` | (default) |
| Azure OpenAI | `AZURE_OPENAI_KEY` | (default) |
Each `--<flag>` selects which client the pipeline uses; entries in
`model_zoo.py` are tagged accordingly.
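
Before launching a run, it can help to check which keys are actually set. The following is a minimal sketch and not part of the repository: the `PROVIDER_ENV_VARS` mapping and the `configured_providers` helper are hypothetical names, and only the environment-variable names and flags are taken from the table above.

```python
import os

# Env var names and flags come from the table above; the mapping itself
# and the helper below are illustrative, not code from this repository.
PROVIDER_ENV_VARS = {
    "OpenAI-compatible inference API (--nvidia)": "NV_API_KEY",
    "OpenAI-compatible LiteLLM proxy (--tritonai)": "LITELLM_API_KEY",
    "Direct Anthropic API": "ANTHROPIC_API_KEY",
    "Azure OpenAI": "AZURE_OPENAI_KEY",
}


def configured_providers(environ=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if environ is None else environ
    return [
        name
        for name, var in PROVIDER_ENV_VARS.items()
        if env.get(var, "").strip()
    ]


if __name__ == "__main__":
    found = configured_providers()
    if found:
        print("Configured providers:")
        for name in found:
            print(f"  - {name}")
    else:
        print("No provider key found; set one of:",
              ", ".join(PROVIDER_ENV_VARS.values()))
```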
## Quick start
### 1. Build the per-question retrieval cache (one-time)
```bash
python scripts/build_retrieval_cache.py \
    --dataset dataset/evolv_mem_v5.json \
    --all_sessions dataset/all_sessions.json \
    --out_dir response_cache/retrieval/
```
### 2. Shard the dataset for parallel runs
```bash
python scripts/make_v5_shards.py \
    --dataset dataset/evolv_mem_v5.json \
    --ret_cache_jsonl response_cache/retrieval/flat-gte/v5_retrievallog_turn_flat-gte \
    --out_dir output/shards/v5_run_nchunks10/ \
    --num_shards 8
```
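
`make_v5_shards.py` is described as a deterministic split by `question_id`. One common way to get such a split is to hash each ID with a stable hash and take the result modulo the number of shards; whether the script does exactly this is an assumption. The sketch below uses hypothetical helpers (`shard_index`, `split_into_shards`) and made-up question IDs.

```python
import hashlib
from collections import defaultdict


def shard_index(question_id: str, num_shards: int) -> int:
    """Map a question ID to a shard with a stable hash.

    Python's built-in hash() is salted per process for strings, so it
    would not give the same assignment across runs; SHA-256 does.
    """
    digest = hashlib.sha256(question_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def split_into_shards(question_ids, num_shards):
    """Group question IDs by shard index, keeping input order within a shard."""
    shards = defaultdict(list)
    for qid in question_ids:
        shards[shard_index(qid, num_shards)].append(qid)
    return dict(shards)


if __name__ == "__main__":
    ids = [f"q{i:04d}" for i in range(20)]  # illustrative IDs only
    for idx, members in sorted(split_into_shards(ids, 8).items()):
        print(f"shard_{idx:02d}: {members}")
```

Because the assignment depends only on the ID and the shard count, re-running the split yields the same shards, which keeps a retrieval cache and its dataset shard aligned as long as both are split with the same function.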
### 3. Run the QA pipeline
Flat dense retrieval @ top-k=20 (single shard, e.g. for smoke testing):
```bash
export ret_cache="output/shards/v5_run_nchunks10/ret_cache/shard_00.jsonl"
python main.py \
    --in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/dense_gte_topk20/part_00.jsonl \
    --model_name gpt-5.5 \
    --top_k 20 \
    --n_chunks 10 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --no_semantic \
    --mode embed
```
Agentic retrieval over hierarchical memory:
```bash
python main.py \
    --in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/agentic_hier/part_00.jsonl \
    --model_name gpt-5.5 \
    --top_k 20 \
    --n_chunks 10 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --hier_v2 --hier_union \
    --mode agent
```
To launch the full 8-shard parallel sweep on a SLURM cluster, edit and submit
`scripts/slurm/example_dense_retrieval.slurm` or
`scripts/slurm/example_agentic_retrieval.slurm`.
### 4. Merge shards and evaluate
```bash
python scripts/merge_jsonl_by_dataset_order.py \
    --dataset dataset/evolv_mem_v5.json \
    --parts_glob "output/shards/v5_run_nchunks10/dense_gte_topk20/part_*.jsonl" \
    --out_file output/v5_run_dense_gte_topk20.jsonl

python evaluate_qa.py \
    --hyp_file output/v5_run_dense_gte_topk20.jsonl \
    --ref_file dataset/evolv_mem_v5.json \
    --eval_model_name gpt-5.2 \
    --eval_mode both \
    --nvidia
```
The evaluator caches an atomic rubric per question
(`<dataset>.atomic-v1.rubric.json`), so subsequent runs reuse it.
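
The merge step restores dataset order after shards finish independently. A minimal sketch of that idea follows. It assumes that each dataset entry and each JSONL record carries a `question_id` field (inferred from the sharding script's description, not confirmed); `merge_in_dataset_order` is a hypothetical helper rather than the script's actual code, and the `hypothesis` field in the demo is illustrative.

```python
import json


def merge_in_dataset_order(dataset_entries, part_lines):
    """Return parsed records sorted by their question's position in the dataset.

    dataset_entries: list of dicts, each with a "question_id" key (assumed schema).
    part_lines: iterable of JSONL strings collected from all shard outputs.
    Records whose ID is not in the dataset are dropped; if an ID appears
    more than once, the last record wins.
    """
    position = {entry["question_id"]: i for i, entry in enumerate(dataset_entries)}
    by_id = {}
    for line in part_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between shard files
        record = json.loads(line)
        qid = record.get("question_id")
        if qid in position:
            by_id[qid] = record
    return sorted(by_id.values(), key=lambda r: position[r["question_id"]])


if __name__ == "__main__":
    dataset = [{"question_id": "a"}, {"question_id": "b"}, {"question_id": "c"}]
    parts = [
        '{"question_id": "c", "hypothesis": "third"}',
        '{"question_id": "a", "hypothesis": "first"}',
    ]
    for rec in merge_in_dataset_order(dataset, parts):
        print(json.dumps(rec))
```

Questions missing from the shard outputs (here `b`) are simply absent from the merged result, so comparing its length against the dataset is a quick way to spot failed shards.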
## Pipeline modes
`main.py --mode` selects how a question is answered:
- `embed`: top-k flat dense retrieval (GTE-7B), then a single LLM call to answer.
- `keyword`: LLM-generated keywords + lexical matching, then answer.
- `agent`: Plan-Act-Read loop. Add `--hier_v2` (semantic-summary stage) and
  `--hier_union` (union with flat top-k) for the hierarchical-memory variant.
`--no_semantic` disables the semantic-summary memory layer (flat memory).
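
To make the `embed` mode and the `--hier_union` option more concrete, here is a small pure-Python sketch of top-k retrieval by cosine similarity and of a union that deduplicates candidates from two sources. It assumes cosine scoring and an order-preserving union; the scoring function and ordering actually used by `main.py` are not documented here, the helper names `top_k` and `union_candidates` are hypothetical, and the toy two-dimensional vectors stand in for real GTE embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus, k):
    """Return the IDs of the k corpus items most similar to the query.

    corpus: dict mapping item ID -> embedding vector.
    """
    ranked = sorted(corpus, key=lambda i: cosine(query_vec, corpus[i]), reverse=True)
    return ranked[:k]


def union_candidates(primary, flat):
    """Order-preserving union: primary candidates first, then unseen flat ones."""
    seen = set()
    merged = []
    for item in list(primary) + list(flat):
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged


if __name__ == "__main__":
    corpus = {"s1": [1.0, 0.0], "s2": [0.7, 0.7], "s3": [0.0, 1.0]}
    flat_hits = top_k([1.0, 0.1], corpus, k=2)
    hier_hits = ["s3", "s1"]  # pretend output of a semantic-summary stage
    print(union_candidates(hier_hits, flat_hits))
```

Taking the union rather than replacing one candidate list with the other means a session missed by the summary stage can still reach the reader if it ranks highly in flat retrieval.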
## Baselines
The three external baselines (MemoChat, RAPTOR, ReadAgent) live under
`baselines/` together with our thin wrappers
(`run_<baseline>_baseline.py`). Each baseline's upstream LICENSE is preserved.
## License
This repository is released under the license stated in the corresponding
LICENSE file (TBD prior to release). Upstream baselines retain their original
licenses.