# Long-Term Memory Retrieval Benchmark

Code release for the experiments described in the accompanying paper:
- **Hierarchical memory** organization (User Profile / Semantic / Episodic).
- **Plan-Act-Read agentic retrieval** that interleaves keyword, time-filter,
  and embedding search.
- **Flat / dense / oracle baselines** for comparison.

## Repository layout

```
.
├── main.py                       # End-to-end QA pipeline (agent, embed, keyword modes)
├── evaluate_qa.py                # Atomic-rubric QA evaluator (strict + partial)
├── model_zoo.py                  # Model registry
├── prompts/                      # Prompt templates
│   ├── agentic_retrieval_prompt.txt
│   ├── agentic_retrieval_prompt_wo_profile.txt
│   ├── keyword_search_prompt.txt
│   └── read_and_extract_prompt.txt
├── memory/                       # Episodic + semantic memory stores
├── baselines/
│   ├── MemoChat/                 # MemoChat baseline (upstream code + our wrapper)
│   ├── raptor/                   # RAPTOR baseline (upstream code + our wrapper)
│   └── read-agent/               # ReadAgent baseline wrapper
├── scripts/
│   ├── build_retrieval_cache.py  # Pre-compute GTE-7B embeddings for the corpus
│   ├── make_v5_shards.py         # Deterministic shard split by question_id
│   ├── merge_jsonl_by_dataset_order.py
│   ├── run_oracle_qa.py          # Gold-session-only upper bound
│   ├── plot_main_results.py
│   ├── llm_judge_agreement.py
│   └── slurm/
│       ├── example_dense_retrieval.slurm
│       └── example_agentic_retrieval.slurm
└── requirements.txt
```

The benchmark dataset (`evolv_mem_v5.json`) is released separately; place it
under `dataset/` along with the supporting files referenced by `main.py`
(`all_sessions.json`, `all_session_summary.json`, etc.).

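Before running anything, it can help to confirm that the files are in place. The sketch below is a hypothetical helper, not part of the repository. It checks only the three file names mentioned above (the "etc." may cover more) and assumes `dataset/` sits at the repository root.

```python
from pathlib import Path

# File names taken from this README; the real pipeline may need more.
EXPECTED_FILES = (
    "evolv_mem_v5.json",
    "all_sessions.json",
    "all_session_summary.json",
)


def missing_dataset_files(dataset_dir="dataset", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files()
    print("all files present" if not missing else f"missing: {missing}")
```
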
## Setup

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

### API keys

The pipeline calls LLMs through four optional providers; set the key for
whichever you plan to use:

| Provider                        | Env var             | Flag         |
|---------------------------------|---------------------|--------------|
| OpenAI-compatible inference API | `NV_API_KEY`        | `--nvidia`   |
| OpenAI-compatible LiteLLM proxy | `LITELLM_API_KEY`   | `--tritonai` |
| Direct Anthropic API            | `ANTHROPIC_API_KEY` | (default)    |
| Azure OpenAI                    | `AZURE_OPENAI_KEY`  | (default)    |

Each `--<flag>` selects which client the pipeline uses; entries in
`model_zoo.py` are tagged accordingly.

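As a rough illustration of the flag-to-variable mapping in the table, the following sketch looks up the key for a given flag. It is a hypothetical helper: how `main.py` actually resolves keys is not shown in this README, and the choice to try both default-provider variables in order when no flag is given is an assumption.

```python
import os

# Flag -> environment variables, as listed in the table above.
# None stands for "no provider flag passed"; the table marks two providers
# as default, so both of their variables are candidates in that case.
KEY_VARS_BY_FLAG = {
    "--nvidia": ("NV_API_KEY",),
    "--tritonai": ("LITELLM_API_KEY",),
    None: ("ANTHROPIC_API_KEY", "AZURE_OPENAI_KEY"),
}


def find_api_key(flag=None, env=None):
    """Return (variable_name, value) for the first set variable, else raise."""
    env = os.environ if env is None else env
    candidates = KEY_VARS_BY_FLAG[flag]
    for name in candidates:
        value = env.get(name)
        if value:
            return name, value
    raise RuntimeError(f"set one of {', '.join(candidates)} before running")
```

Calling `find_api_key("--nvidia")` before a long run fails fast when `NV_API_KEY` is unset, instead of failing on the first LLM call.
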
## Quick start

### 1. Build the per-question retrieval cache (one-time)

```bash
python scripts/build_retrieval_cache.py \
    --dataset dataset/evolv_mem_v5.json \
    --all_sessions dataset/all_sessions.json \
    --out_dir response_cache/retrieval/
```

### 2. Shard the dataset for parallel runs

```bash
python scripts/make_v5_shards.py \
    --dataset dataset/evolv_mem_v5.json \
    --ret_cache_jsonl response_cache/retrieval/flat-gte/v5_retrievallog_turn_flat-gte \
    --out_dir output/shards/v5_run_nchunks10/ \
    --num_shards 8
```
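The repository layout describes `make_v5_shards.py` as a deterministic split by `question_id`, but its algorithm is not shown here. One common way to get such a split is to hash the id, sketched below. The function names and the assumption that each question is a dict with a `question_id` key are illustrative. `hashlib` is used because Python's built-in `hash()` for strings is salted per process and would not give the same shard across runs.

```python
import hashlib


def shard_index(question_id, num_shards):
    """Map a question_id to a shard in [0, num_shards) via a stable hash."""
    digest = hashlib.sha256(str(question_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def split_into_shards(questions, num_shards=8):
    """Group question dicts (each with a 'question_id' key) into shards."""
    shards = [[] for _ in range(num_shards)]
    for q in questions:
        shards[shard_index(q["question_id"], num_shards)].append(q)
    return shards
```

With this scheme a question lands in the same shard regardless of input order, so a retrieval cache split the same way stays aligned with the dataset shards.
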

### 3. Run the QA pipeline

Flat dense retrieval @ top-k=20 (single shard, e.g. for smoke testing):

```bash
export ret_cache="output/shards/v5_run_nchunks10/ret_cache/shard_00.jsonl"
python main.py \
    --in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/dense_gte_topk20/part_00.jsonl \
    --model_name gpt-5.5 \
    --top_k 20 \
    --n_chunks 10 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --no_semantic \
    --mode embed
```

Agentic retrieval over hierarchical memory:

```bash
python main.py \
    --in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/agentic_hier/part_00.jsonl \
    --model_name gpt-5.5 \
    --top_k 20 \
    --n_chunks 10 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --hier_v2 --hier_union \
    --mode agent
```

To launch the full 8-shard parallel sweep on a SLURM cluster, edit and submit
`scripts/slurm/example_dense_retrieval.slurm` or
`scripts/slurm/example_agentic_retrieval.slurm`.

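On a machine without SLURM, the per-shard commands can be generated from the single-shard example above. The sketch below only prints them. It assumes shards are numbered `00` to `07` following the `shard_00` / `part_00` pattern, and the helper names are hypothetical rather than part of the repository.

```python
import shlex

RUN_DIR = "output/shards/v5_run_nchunks10"


def dense_command(shard, run_dir=RUN_DIR, model_name="gpt-5.5"):
    """Build the argv list for one shard of the flat dense run shown above."""
    tag = f"{shard:02d}"
    return [
        "python", "main.py",
        "--in_file", f"{run_dir}/dataset/shard_{tag}.json",
        "--out_file", f"{run_dir}/dense_gte_topk20/part_{tag}.jsonl",
        "--model_name", model_name,
        "--top_k", "20",
        "--n_chunks", "10",
        "--nvidia",
        "--all_sessions_file", "dataset/all_sessions.json",
        "--no_semantic",
        "--mode", "embed",
    ]


def shard_env(shard, run_dir=RUN_DIR):
    """Environment override matching the `export ret_cache=...` line above."""
    return {"ret_cache": f"{run_dir}/ret_cache/shard_{shard:02d}.jsonl"}


# Print one shell line per shard; nothing is executed here.
for shard in range(8):
    env_prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in shard_env(shard).items())
    print(env_prefix, shlex.join(dense_command(shard)))
```
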
### 4. Merge shards and evaluate

```bash
python scripts/merge_jsonl_by_dataset_order.py \
    --dataset dataset/evolv_mem_v5.json \
    --parts_glob "output/shards/v5_run_nchunks10/dense_gte_topk20/part_*.jsonl" \
    --out_file output/v5_run_dense_gte_topk20.jsonl

python evaluate_qa.py \
    --hyp_file output/v5_run_dense_gte_topk20.jsonl \
    --ref_file dataset/evolv_mem_v5.json \
    --eval_model_name gpt-5.2 \
    --eval_mode both \
    --nvidia
```
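The merge step restores dataset order after the parallel shards finish. A minimal sketch of that idea follows. It assumes every output record carries a `question_id` field, that a later duplicate overrides an earlier one, and that a missing question is an error. `scripts/merge_jsonl_by_dataset_order.py` may handle these cases differently.

```python
import json


def merge_by_dataset_order(dataset_ids, part_lines):
    """Return one record per dataset question, in dataset order.

    dataset_ids: question_ids in the order they appear in the dataset.
    part_lines:  iterable of JSONL strings from all part files, any order.
    """
    by_id = {}
    for line in part_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        by_id[record["question_id"]] = record  # last occurrence wins
    missing = [qid for qid in dataset_ids if qid not in by_id]
    if missing:
        raise KeyError(f"{len(missing)} questions have no output, e.g. {missing[0]}")
    return [by_id[qid] for qid in dataset_ids]
```
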

The evaluator caches an atomic rubric for each question in
`<dataset>.atomic-v1.rubric.json`, so subsequent runs reuse it instead of
regenerating it.

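The caching and the two scoring modes can be illustrated with a short sketch. The exact definitions used by `evaluate_qa.py` are not given in this README. The code assumes the common reading that strict means every atomic criterion is met and partial means the fraction met. Both function names are hypothetical.

```python
import json
from pathlib import Path


def load_or_build_rubric(cache_path, build_fn):
    """Return the cached rubric if cache_path exists, else build and save it."""
    cache_path = Path(cache_path)
    if cache_path.is_file():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    rubric = build_fn()  # e.g. an LLM call that writes atomic criteria
    cache_path.write_text(json.dumps(rubric, indent=2), encoding="utf-8")
    return rubric


def score(criteria_met):
    """Score one answer from per-criterion booleans.

    strict:  1.0 only if every atomic criterion is met (assumed definition).
    partial: fraction of criteria met (assumed definition).
    """
    if not criteria_met:
        raise ValueError("rubric has no criteria")
    partial = sum(criteria_met) / len(criteria_met)
    return {"strict": float(all(criteria_met)), "partial": partial}
```

Under these assumptions, an answer that meets three of four criteria gets a strict score of 0.0 and a partial score of 0.75.
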
## Pipeline modes

`main.py --mode` selects how a question is answered:

- `embed`: top-k flat dense retrieval (GTE-7B), then a single LLM call to answer.
- `keyword`: LLM-generated keywords + lexical matching, then answer.
- `agent`: Plan-Act-Read loop. Combine it with `--hier_v2` (semantic-summary stage) and
  `--hier_union` (union with flat top-k) for the hierarchical-memory variant.

`--no_semantic` disables the semantic-summary memory layer (flat memory).

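Two of the ideas above, top-k dense retrieval and the union with the flat top-k, can be sketched in a few lines. This is an illustration under assumptions, not the pipeline's code. Embeddings are plain Python lists rather than GTE-7B vectors, similarity is cosine, and the union simply keeps first-seen order. How `--hier_union` actually orders or weights the two candidate sets is not stated here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_dense(query_vec, doc_vecs, k=20):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


def union_keep_order(primary, secondary):
    """Concatenate two id lists, dropping ids already seen."""
    seen, merged = set(), []
    for doc_id in list(primary) + list(secondary):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged


# Toy example: three sessions with 2-d embeddings.
docs = {"s1": [1.0, 0.0], "s2": [0.0, 1.0], "s3": [0.7, 0.7]}
flat = top_k_dense([1.0, 0.1], docs, k=2)   # ['s1', 's3']
print(union_keep_order(["s2"], flat))       # ['s2', 's1', 's3']
```

Here `["s2"]` stands in for candidates from a semantic-summary stage; the union adds the flat top-k hits that stage missed.
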
## Baselines

The three external baselines (MemoChat, RAPTOR, ReadAgent) live under
`baselines/` together with our thin wrappers
(`run_<baseline>_baseline.py`). Each baseline's upstream LICENSE is preserved.

## License

This repository is released under the license stated in the corresponding
LICENSE file (TBD prior to release). Upstream baselines retain their original
licenses.