# Long-Term Memory Retrieval Benchmark

Code release for the experiments described in the accompanying paper:

- **Hierarchical memory** organization (User Profile / Semantic / Episodic).
- **Plan-Act-Read agentic retrieval** that interleaves keyword, time-filter, and embedding search.
- **Flat / dense / oracle baselines** for comparison.

## Repository layout

```
.
├── main.py                          # End-to-end QA pipeline (agent, embed, keyword modes)
├── evaluate_qa.py                   # Atomic-rubric QA evaluator (strict + partial)
├── model_zoo.py                     # Model registry
├── prompts/                         # Prompt templates
│   ├── agentic_retrieval_prompt.txt
│   ├── agentic_retrieval_prompt_wo_profile.txt
│   ├── keyword_search_prompt.txt
│   └── read_and_extract_prompt.txt
├── memory/                          # Episodic + semantic memory stores
├── baselines/
│   ├── MemoChat/                    # MemoChat baseline (upstream code + our wrapper)
│   ├── raptor/                      # RAPTOR baseline (upstream code + our wrapper)
│   └── read-agent/                  # ReadAgent baseline wrapper
├── scripts/
│   ├── build_retrieval_cache.py     # Pre-compute GTE-7B embeddings for the corpus
│   ├── make_v5_shards.py            # Deterministic shard split by question_id
│   ├── merge_jsonl_by_dataset_order.py
│   ├── run_oracle_qa.py             # Gold-session-only upper bound
│   ├── plot_main_results.py
│   ├── llm_judge_agreement.py
│   └── slurm/
│       ├── example_dense_retrieval.slurm
│       └── example_agentic_retrieval.slurm
└── requirements.txt
```

The benchmark dataset (`evolv_mem_v5.json`) is released separately; place it under `dataset/` along with the supporting files referenced by `main.py` (`all_sessions.json`, `all_session_summary.json`, etc.).
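Before running anything, it can help to confirm that the dataset files are where the pipeline expects them. The snippet below is a minimal sketch, not part of the repository: the helper name `missing_dataset_files` is hypothetical, and the file list covers only the names mentioned in this README (the files hidden behind "etc." are not known here and are not checked).

```python
from pathlib import Path

# File names taken from this README. Other supporting files that main.py
# may read (the "etc." above) are not listed here and are not checked.
EXPECTED_FILES = (
    "evolv_mem_v5.json",
    "all_sessions.json",
    "all_session_summary.json",
)


def missing_dataset_files(dataset_dir="dataset", expected=EXPECTED_FILES):
    """Return the expected file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print("Missing from dataset/:", ", ".join(missing))
    else:
        print("All listed dataset files are present.")
```

Run it from the repository root; an empty result only means the three listed files exist, not that their contents are valid.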
## Setup

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

### API keys

The pipeline calls LLMs through several optional providers; set the key for whichever you plan to use:

| Provider                         | Env var             | Flag         |
|----------------------------------|---------------------|--------------|
| OpenAI-compatible inference API  | `NV_API_KEY`        | `--nvidia`   |
| OpenAI-compatible LiteLLM proxy  | `LITELLM_API_KEY`   | `--tritonai` |
| Direct Anthropic API             | `ANTHROPIC_API_KEY` | (default)    |
| Azure OpenAI                     | `AZURE_OPENAI_KEY`  | (default)    |

Each flag selects which client the pipeline uses; entries in `model_zoo.py` are tagged accordingly.

## Quick start

### 1. Build the per-question retrieval cache (one-time)

```bash
python scripts/build_retrieval_cache.py \
    --dataset dataset/evolv_mem_v5.json \
    --all_sessions dataset/all_sessions.json \
    --out_dir response_cache/retrieval/
```

### 2. Shard the dataset for parallel runs

```bash
python scripts/make_v5_shards.py \
    --dataset dataset/evolv_mem_v5.json \
    --ret_cache_jsonl response_cache/retrieval/flat-gte/v5_retrievallog_turn_flat-gte \
    --out_dir output/shards/v5_run_nchunks10/ \
    --num_shards 8
```

### 3. Run the QA pipeline

Flat dense retrieval at top-k = 20 (single shard, e.g. for smoke testing):

```bash
export ret_cache="output/shards/v5_run_nchunks10/ret_cache/shard_00.jsonl"
python main.py \
    --in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/dense_gte_topk20/part_00.jsonl \
    --model_name gpt-5.5 \
    --top_k 20 \
    --n_chunks 10 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --no_semantic \
    --mode embed
```

Agentic retrieval over hierarchical memory:

```bash
python main.py \
    --in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/agentic_hier/part_00.jsonl \
    --model_name gpt-5.5 \
    --top_k 20 \
    --n_chunks 10 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --hier_v2 --hier_union \
    --mode agent
```

To launch the full 8-shard parallel sweep on a SLURM cluster, edit and submit `scripts/slurm/example_dense_retrieval.slurm` or `scripts/slurm/example_agentic_retrieval.slurm`.

### 4. Merge shards and evaluate

```bash
python scripts/merge_jsonl_by_dataset_order.py \
    --dataset dataset/evolv_mem_v5.json \
    --parts_glob "output/shards/v5_run_nchunks10/dense_gte_topk20/part_*.jsonl" \
    --out_file output/v5_run_dense_gte_topk20.jsonl

python evaluate_qa.py \
    --hyp_file output/v5_run_dense_gte_topk20.jsonl \
    --ref_file dataset/evolv_mem_v5.json \
    --eval_model_name gpt-5.2 \
    --eval_mode both \
    --nvidia
```

The evaluator caches an atomic rubric per question (`.atomic-v1.rubric.json`), so subsequent runs reuse it instead of regenerating it.

## Pipeline modes

`main.py --mode` selects how a question is answered:

- `embed`: top-k flat dense retrieval (GTE-7B), then a single LLM call to answer.
- `keyword`: LLM-generated keywords + lexical matching, then answer.
- `agent`: Plan-Act-Read loop. Combine with `--hier_v2` (semantic-summary stage) and `--hier_union` (union with flat top-k) for the hierarchical-memory variant.

`--no_semantic` disables the semantic-summary memory layer, leaving flat memory only.
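The `--hier_union` option is described above only as a union with the flat top-k results. As a rough illustration of what such a union can look like, the sketch below merges two ranked lists of session IDs and drops duplicates. Everything in it is an assumption for illustration: the function name `union_top_k`, the session IDs, and in particular the ordering (hierarchical hits first, then unseen flat hits). How `main.py` actually orders or truncates the merged set is not stated in this README.

```python
def union_top_k(hier_ids, flat_ids, k=None):
    """Merge two ranked lists of session IDs, keeping first occurrences.

    Assumed ordering: hierarchical hits first, then flat hits that were
    not already retrieved. If k is given, the merged list is truncated.
    """
    seen = set()
    merged = []
    for sid in list(hier_ids) + list(flat_ids):
        if sid not in seen:
            seen.add(sid)
            merged.append(sid)
    return merged if k is None else merged[:k]


if __name__ == "__main__":
    hier = ["s12", "s03", "s40"]         # hypothetical hierarchical hits
    flat = ["s03", "s07", "s12", "s99"]  # hypothetical flat top-k hits
    print(union_top_k(hier, flat))       # ['s12', 's03', 's40', 's07', 's99']
    print(union_top_k(hier, flat, k=4))  # ['s12', 's03', 's40', 's07']
```

An order-preserving union keeps the result deterministic across runs, which matters when outputs from several shards are later merged and compared.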
## Baselines

The three external baselines (MemoChat, RAPTOR, ReadAgent) live under `baselines/` together with our thin wrappers (`run__baseline.py`). Each baseline's upstream LICENSE is preserved.

## License

This repository is released under the license stated in the corresponding LICENSE file (TBD prior to release). Upstream baselines retain their original licenses.