
Long-Term Memory Retrieval Benchmark

Code release for the experiments described in the accompanying paper:

  • Hierarchical memory organization (User Profile / Semantic / Episodic).
  • Plan-Act-Read agentic retrieval that interleaves keyword, time-filter, and embedding search.
  • Flat / dense / oracle baselines for comparison.

Repository layout

.
├── main.py                       # End-to-end QA pipeline (agent, embed, keyword modes)
├── evaluate_qa.py                # Atomic-rubric QA evaluator (strict + partial)
├── model_zoo.py                  # Model registry
├── prompts/                      # Prompt templates
│   ├── agentic_retrieval_prompt.txt
│   ├── agentic_retrieval_prompt_wo_profile.txt
│   ├── keyword_search_prompt.txt
│   └── read_and_extract_prompt.txt
├── memory/                       # Episodic + semantic memory stores
├── baselines/
│   ├── MemoChat/                 # MemoChat baseline (upstream code + our wrapper)
│   ├── raptor/                   # RAPTOR baseline (upstream code + our wrapper)
│   └── read-agent/               # ReadAgent baseline wrapper
├── scripts/
│   ├── build_retrieval_cache.py  # Pre-compute GTE-7B embeddings for the corpus
│   ├── make_v5_shards.py         # Deterministic shard split by question_id
│   ├── merge_jsonl_by_dataset_order.py
│   ├── run_oracle_qa.py          # Gold-session-only upper bound
│   ├── plot_main_results.py
│   ├── llm_judge_agreement.py
│   └── slurm/
│       ├── example_dense_retrieval.slurm
│       └── example_agentic_retrieval.slurm
└── requirements.txt

The benchmark dataset (evolv_mem_v5.json) is released separately; place it under dataset/ along with the supporting files referenced by main.py (all_sessions.json, all_session_summary.json, etc.).
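
Before running anything, it can help to confirm that the dataset files are where the scripts expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_dataset_files` is hypothetical, and the file list only covers the names mentioned above (main.py may read additional files).

```python
from pathlib import Path

# File names taken from this README; extend the tuple with anything else main.py reads.
REQUIRED_FILES = (
    "evolv_mem_v5.json",
    "all_sessions.json",
    "all_session_summary.json",
)


def missing_dataset_files(dataset_dir="dataset", required=REQUIRED_FILES):
    """Return the required file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_dataset_files()` from the repository root returns an empty list when all three files are in place.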

Setup

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

API keys

The pipeline can call LLMs through any of four providers; set the environment variable for whichever you plan to use:

| Provider | Env var | Flag |
|---|---|---|
| OpenAI-compatible inference API | NV_API_KEY | --nvidia |
| OpenAI-compatible LiteLLM proxy | LITELLM_API_KEY | --tritonai |
| Direct Anthropic API | ANTHROPIC_API_KEY | (default) |
| Azure OpenAI | AZURE_OPENAI_KEY | (default) |

Each --<flag> selects which client the pipeline uses; entries in model_zoo.py are tagged accordingly.
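
A quick way to see which providers are usable in the current shell is to check the environment variables from the table above. This is an illustrative sketch only; the helper `configured_providers` is hypothetical and does not reflect how main.py or model_zoo.py resolve clients.

```python
import os

# Flag -> environment variable, as listed in the table above.
# The two "(default)" rows are keyed by provider name because the README gives them no flag.
PROVIDER_ENV_VARS = {
    "--nvidia": "NV_API_KEY",
    "--tritonai": "LITELLM_API_KEY",
    "anthropic (default)": "ANTHROPIC_API_KEY",
    "azure (default)": "AZURE_OPENAI_KEY",
}


def configured_providers(environ=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if environ is None else environ
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]
```

For example, `print(configured_providers())` lists the flags or providers whose keys are exported in the current session.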

Quick start

1. Build the per-question retrieval cache (one-time)

python scripts/build_retrieval_cache.py \
    --dataset dataset/evolv_mem_v5.json \
    --all_sessions dataset/all_sessions.json \
    --out_dir response_cache/retrieval/

2. Shard the dataset for parallel runs

python scripts/make_v5_shards.py \
    --dataset dataset/evolv_mem_v5.json \
    --ret_cache_jsonl response_cache/retrieval/flat-gte/v5_retrievallog_turn_flat-gte \
    --out_dir output/shards/v5_run_nchunks10/ \
    --num_shards 8
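
The layout section describes make_v5_shards.py as a deterministic split by question_id. One way such a split can be made reproducible is to hash the id with a stable hash function, sketched below. This is an assumption for illustration: the actual script may use a different scheme (for example, sorted round-robin), and the function names here are hypothetical.

```python
import hashlib


def shard_index(question_id, num_shards=8):
    """Map a question_id to a shard in [0, num_shards) using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would therefore not be reproducible across runs.
    """
    digest = hashlib.sha256(str(question_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def split_into_shards(question_ids, num_shards=8):
    """Group question_ids into num_shards lists, preserving input order within each shard."""
    shards = [[] for _ in range(num_shards)]
    for qid in question_ids:
        shards[shard_index(qid, num_shards)].append(qid)
    return shards
```

Because the assignment depends only on the id and the shard count, re-running the split yields the same shards regardless of input order or machine.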

3. Run the QA pipeline

Flat dense retrieval @ top-k=20 (single shard, e.g. for smoke testing):

export ret_cache="output/shards/v5_run_nchunks10/ret_cache/shard_00.jsonl"
python main.py \
    --in_file  output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/dense_gte_topk20/part_00.jsonl \
    --model_name gpt-5.5 \
    --top_k 20 \
    --n_chunks 10 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --no_semantic \
    --mode embed
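
For readers unfamiliar with flat dense retrieval, the idea behind the embed mode can be sketched as ranking every chunk by cosine similarity to the query embedding and keeping the best k. The code below is a generic, dependency-free illustration under that assumption; it is not the repository's implementation, which uses pre-computed GTE-7B embeddings from the retrieval cache.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus, k=20):
    """Return the k highest-scoring (chunk_id, score) pairs.

    corpus is a dict mapping a chunk id to its embedding vector.
    """
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With a toy corpus such as `{"a": [1, 0], "b": [0, 1], "c": [1, 1]}` and query `[1, 0]`, chunk `a` ranks first and `c` second.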

Agentic retrieval over hierarchical memory:

python main.py \
    --in_file  output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/agentic_hier/part_00.jsonl \
    --model_name gpt-5.5 \
    --top_k 20 \
    --n_chunks 10 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --hier_v2 --hier_union \
    --mode agent

To launch the full 8-shard parallel sweep on a SLURM cluster, edit and submit scripts/slurm/example_dense_retrieval.slurm or scripts/slurm/example_agentic_retrieval.slurm.
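
Without a SLURM cluster, the same shards can be run one after another from a small driver script. The sketch below only reuses the flags and paths from the flat dense example above; it assumes the shard files are named shard_00 through shard_07, mirrors the `ret_cache` export from that example, and says nothing about what the provided .slurm files contain. The function names are hypothetical.

```python
import os
import subprocess

RUN_DIR = "output/shards/v5_run_nchunks10"


def dense_shard_job(shard, run_dir=RUN_DIR):
    """Return (extra_env, argv) for one shard of the flat dense run."""
    tag = f"{shard:02d}"
    # Mirrors the `export ret_cache=...` line in the single-shard example.
    extra_env = {"ret_cache": f"{run_dir}/ret_cache/shard_{tag}.jsonl"}
    argv = [
        "python", "main.py",
        "--in_file", f"{run_dir}/dataset/shard_{tag}.json",
        "--out_file", f"{run_dir}/dense_gte_topk20/part_{tag}.jsonl",
        "--model_name", "gpt-5.5",
        "--top_k", "20",
        "--n_chunks", "10",
        "--nvidia",
        "--all_sessions_file", "dataset/all_sessions.json",
        "--no_semantic",
        "--mode", "embed",
    ]
    return extra_env, argv


def run_all(num_shards=8):
    """Run the shards sequentially; check=True stops at the first failure."""
    for shard in range(num_shards):
        extra_env, argv = dense_shard_job(shard)
        subprocess.run(argv, env={**os.environ, **extra_env}, check=True)
```

Calling `run_all()` from the repository root processes all eight shards in order; the per-shard outputs can then be merged as described in the next step.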

4. Merge shards and evaluate

python scripts/merge_jsonl_by_dataset_order.py \
    --dataset dataset/evolv_mem_v5.json \
    --parts_glob "output/shards/v5_run_nchunks10/dense_gte_topk20/part_*.jsonl" \
    --out_file output/v5_run_dense_gte_topk20.jsonl
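
Conceptually, merging by dataset order means indexing every output record by its question id and then emitting the records in the order the questions appear in the dataset. The following sketch illustrates that idea; it assumes each JSONL record carries a `question_id` field, which is a guess based on the sharding description rather than a documented schema, and the function name is hypothetical.

```python
import json


def merge_in_dataset_order(dataset_ids, part_lines):
    """Reorder JSONL records to follow the dataset's question order.

    dataset_ids: question ids in dataset order.
    part_lines:  iterable of JSONL strings gathered from all part files.
    Questions with no record are skipped; a later duplicate replaces an earlier one.
    """
    by_id = {}
    for line in part_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        by_id[record["question_id"]] = record
    return [by_id[qid] for qid in dataset_ids if qid in by_id]
```

Comparing the length of the result against `len(dataset_ids)` is a cheap way to spot shards that did not finish.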

python evaluate_qa.py \
    --hyp_file output/v5_run_dense_gte_topk20.jsonl \
    --ref_file dataset/evolv_mem_v5.json \
    --eval_model_name gpt-5.2 \
    --eval_mode both \
    --nvidia

The evaluator caches the per-question atomic rubrics (<dataset>.atomic-v1.rubric.json), so subsequent runs reuse them.
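
This README does not define the strict and partial evaluation modes. One common reading of atomic-rubric scoring, given here purely as an assumption, is that strict credit requires every rubric item to be satisfied while partial credit is the fraction satisfied. The sketch below illustrates that reading; evaluate_qa.py may compute its scores differently, and `score_answer` is a hypothetical name.

```python
def score_answer(judgments):
    """Score one answer from per-item rubric judgments.

    judgments: list of booleans, one per atomic rubric item (assumed format).
    strict  = 1.0 only if every item is satisfied.
    partial = fraction of items satisfied.
    """
    if not judgments:
        return {"strict": 0.0, "partial": 0.0}
    hits = sum(1 for j in judgments if j)
    return {
        "strict": 1.0 if hits == len(judgments) else 0.0,
        "partial": hits / len(judgments),
    }
```

Under this reading, an answer that satisfies two of three rubric items gets no strict credit and two thirds partial credit.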

Pipeline modes

main.py --mode selects how a question is answered:

  • embed: top-k flat dense retrieval (GTE-7B), then a single LLM call to answer.
  • keyword: LLM-generated keywords + lexical matching, then answer.
  • agent: Plan-Act-Read loop. Combine it with --hier_v2 (semantic-summary stage) and --hier_union (union with flat top-k) for the hierarchical-memory variant.

--no_semantic disables the semantic-summary memory layer (flat memory).
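
The union step behind --hier_union can be pictured as merging two ranked candidate lists while dropping duplicates. The sketch below shows one simple way to do that, keeping hierarchical results ahead of flat ones; the ordering policy, the optional cap, and the function name are assumptions for illustration and may not match main.py.

```python
def union_ranked(hier_ids, flat_ids, limit=None):
    """Concatenate two ranked id lists, keeping only the first occurrence of each id.

    Hierarchical candidates come first, followed by flat top-k candidates
    that were not already selected. limit optionally caps the result length.
    """
    seen = set()
    merged = []
    for cid in list(hier_ids) + list(flat_ids):
        if cid not in seen:
            seen.add(cid)
            merged.append(cid)
    return merged if limit is None else merged[:limit]
```

For example, `union_ranked(["s3", "s1"], ["s1", "s2"])` keeps `s3` and `s1` from the hierarchical list and appends only `s2` from the flat list.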

Baselines

The three external baselines (MemoChat, RAPTOR, ReadAgent) live under baselines/ together with our thin wrappers (run_<baseline>_baseline.py). Each baseline's upstream LICENSE is preserved.

License

This repository is released under the license stated in the corresponding LICENSE file (TBD prior to release). Upstream baselines retain their original licenses.