	# Long-Term Memory Retrieval Benchmark

	Code release for the experiments described in the accompanying paper:
	- Hierarchical memory organization (User Profile / Semantic / Episodic).
	- Plan-Act-Read agentic retrieval that interleaves keyword, time-filter,
	  and embedding search.
	- Flat / dense / oracle baselines for comparison.

	## Repository layout

	```
	.
	├── main.py                            # End-to-end QA pipeline (agent, embed, keyword modes)
	├── evaluate_qa.py                     # Atomic-rubric QA evaluator (strict + partial)
	├── model_zoo.py                       # Model registry
	├── prompts/                           # Prompt templates
	│   ├── agentic_retrieval_prompt.txt
	│   ├── agentic_retrieval_prompt_wo_profile.txt
	│   ├── keyword_search_prompt.txt
	│   └── read_and_extract_prompt.txt
	├── memory/                            # Episodic + semantic memory stores
	├── baselines/
	│   ├── MemoChat/                      # MemoChat baseline (upstream code + our wrapper)
	│   ├── raptor/                        # RAPTOR baseline (upstream code + our wrapper)
	│   └── read-agent/                    # ReadAgent baseline wrapper
	├── scripts/
	│   ├── build_retrieval_cache.py       # Pre-compute GTE-7B embeddings for the corpus
	│   ├── make_v5_shards.py              # Deterministic shard split by question_id
	│   ├── merge_jsonl_by_dataset_order.py
	│   ├── run_oracle_qa.py               # Gold-session-only upper bound
	│   ├── plot_main_results.py
	│   ├── llm_judge_agreement.py
	│   └── slurm/
	│       ├── example_dense_retrieval.slurm
	│       └── example_agentic_retrieval.slurm
	└── requirements.txt
	```

	The benchmark dataset (`evolv_mem_v5.json`) is released separately; place it
	under `dataset/` along with the supporting files referenced by `main.py`
	(`all_sessions.json`, `all_session_summary.json`, etc.).

	## Setup

	```bash
	python -m venv .venv && source .venv/bin/activate
	pip install -r requirements.txt
	```

	### API keys

	The pipeline can call LLMs through any of the providers below; set the keys
	for whichever you plan to use:

	| Provider                        | Env var             | Flag         |
	|---------------------------------|---------------------|--------------|
	| OpenAI-compatible inference API | `NV_API_KEY`        | `--nvidia`   |
	| OpenAI-compatible LiteLLM proxy | `LITELLM_API_KEY`   | `--tritonai` |
	| Direct Anthropic API            | `ANTHROPIC_API_KEY` | (default)    |
	| Azure OpenAI                    | `AZURE_OPENAI_KEY`  | (default)    |

	The `--nvidia` and `--tritonai` flags select the corresponding client; with
	neither flag, the pipeline uses the providers marked (default). Entries in
	`model_zoo.py` are tagged accordingly.
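
As a convenience, the table above can be turned into a small pre-flight check that reports which keys are missing before a long run starts. This is a minimal sketch, not part of the repository: the helper names (`missing_keys`, `usable_flags`) are hypothetical, and treating `default` as "either of the two default keys is enough" is an assumption, since which one is needed depends on the model chosen from `model_zoo.py`.

```python
import os

# Flag -> environment variables, copied from the table above.
# "default" stands for running with neither --nvidia nor --tritonai.
PROVIDER_ENV_VARS = {
    "nvidia": ["NV_API_KEY"],
    "tritonai": ["LITELLM_API_KEY"],
    "default": ["ANTHROPIC_API_KEY", "AZURE_OPENAI_KEY"],
}


def missing_keys(flag, environ=None):
    """Return the env vars for `flag` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in PROVIDER_ENV_VARS[flag] if not environ.get(name)]


def usable_flags(environ=None):
    """Return the flags for which at least one key is set (assumed sufficient)."""
    environ = os.environ if environ is None else environ
    return [
        flag
        for flag, names in PROVIDER_ENV_VARS.items()
        if any(environ.get(name) for name in names)
    ]


if __name__ == "__main__":
    for flag in PROVIDER_ENV_VARS:
        print(flag, "missing:", missing_keys(flag))
```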

	## Quick start

	### 1. Build the per-question retrieval cache (one-time)

	```bash
	python scripts/build_retrieval_cache.py \
	    --dataset dataset/evolv_mem_v5.json \
	    --all_sessions dataset/all_sessions.json \
	    --out_dir response_cache/retrieval/
	```

	### 2. Shard the dataset for parallel runs

	```bash
	python scripts/make_v5_shards.py \
	    --dataset dataset/evolv_mem_v5.json \
	    --ret_cache_jsonl response_cache/retrieval/flat-gte/v5_retrievallog_turn_flat-gte \
	    --out_dir output/shards/v5_run_nchunks10/ \
	    --num_shards 8
	```
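
For intuition, a deterministic split keyed on `question_id` could look like the sketch below. It is an illustration only: `scripts/make_v5_shards.py` may use a different rule (for example, round-robin over sorted ids), and the hash-based assignment, the function names, and the record layout (`{"question_id": ...}`) are all assumptions.

```python
import hashlib


def shard_index(question_id, num_shards):
    """Map a question id to a shard, independent of input order and process."""
    # hashlib is used instead of built-in hash(), which is salted per process.
    digest = hashlib.sha256(str(question_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def split_into_shards(questions, num_shards):
    """Group question records into `num_shards` lists by hashed question_id."""
    shards = [[] for _ in range(num_shards)]
    for question in questions:
        shards[shard_index(question["question_id"], num_shards)].append(question)
    return shards


if __name__ == "__main__":
    toy = [{"question_id": f"q{i}"} for i in range(20)]
    for i, shard in enumerate(split_into_shards(toy, 8)):
        print(f"shard_{i:02d}", [q["question_id"] for q in shard])
```

Because the assignment depends only on the id, re-running the split on a reordered dataset places every question in the same shard, which keeps cached retrieval results and partial outputs aligned across reruns.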

	### 3. Run the QA pipeline

	Flat dense retrieval @ top-k=20 (single shard, e.g. for smoke testing):

	```bash
	export ret_cache="output/shards/v5_run_nchunks10/ret_cache/shard_00.jsonl"
	python main.py \
	    --in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
	    --out_file output/shards/v5_run_nchunks10/dense_gte_topk20/part_00.jsonl \
	    --model_name gpt-5.5 \
	    --top_k 20 \
	    --n_chunks 10 \
	    --nvidia \
	    --all_sessions_file dataset/all_sessions.json \
	    --no_semantic \
	    --mode embed
	```

	Agentic retrieval over hierarchical memory:

	```bash
	python main.py \
	    --in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
	    --out_file output/shards/v5_run_nchunks10/agentic_hier/part_00.jsonl \
	    --model_name gpt-5.5 \
	    --top_k 20 \
	    --n_chunks 10 \
	    --nvidia \
	    --all_sessions_file dataset/all_sessions.json \
	    --hier_v2 --hier_union \
	    --mode agent
	```

	To launch the full 8-shard parallel sweep on a SLURM cluster, edit and submit
	`scripts/slurm/example_dense_retrieval.slurm` or
	`scripts/slurm/example_agentic_retrieval.slurm`.
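
Without SLURM, the per-shard commands for the flat dense run can be generated with a few lines of Python. The sketch below only reuses the flags and paths from the single-shard command above; the helper names are hypothetical, and the shard file naming (`shard_00` through `shard_07`) is an assumption extrapolated from the `shard_00` example.

```python
import shlex

RUN_DIR = "output/shards/v5_run_nchunks10"  # run directory from the commands above


def shard_job(shard, variant="dense_gte_topk20"):
    """Return (env, argv) for one shard of the flat dense retrieval run."""
    tag = f"{shard:02d}"  # assumed naming: shard_00, shard_01, ...
    env = {"ret_cache": f"{RUN_DIR}/ret_cache/shard_{tag}.jsonl"}
    argv = [
        "python", "main.py",
        "--in_file", f"{RUN_DIR}/dataset/shard_{tag}.json",
        "--out_file", f"{RUN_DIR}/{variant}/part_{tag}.jsonl",
        "--model_name", "gpt-5.5",
        "--top_k", "20",
        "--n_chunks", "10",
        "--nvidia",
        "--all_sessions_file", "dataset/all_sessions.json",
        "--no_semantic",
        "--mode", "embed",
    ]
    return env, argv


def render(env, argv):
    """Format a job as a single copy-pasteable shell line."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {shlex.join(argv)}"


jobs = [shard_job(i) for i in range(8)]  # 8 matches --num_shards in step 2
for env, argv in jobs:
    print(render(env, argv))
```

Each printed line can be run in its own terminal or handed to a job runner of your choice; the output files follow the `part_*.jsonl` pattern that the merge step expects.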

	### 4. Merge shards and evaluate

	```bash
	python scripts/merge_jsonl_by_dataset_order.py \
	    --dataset dataset/evolv_mem_v5.json \
	    --parts_glob "output/shards/v5_run_nchunks10/dense_gte_topk20/part_*.jsonl" \
	    --out_file output/v5_run_dense_gte_topk20.jsonl

	python evaluate_qa.py \
	    --hyp_file output/v5_run_dense_gte_topk20.jsonl \
	    --ref_file dataset/evolv_mem_v5.json \
	    --eval_model_name gpt-5.2 \
	    --eval_mode both \
	    --nvidia
	```
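
Conceptually, the merge step in the first command reorders per-shard outputs to follow the dataset. A minimal sketch of that idea follows; it is not the code in `scripts/merge_jsonl_by_dataset_order.py`, and it assumes every output record carries a `question_id` field and that the last record wins when an id appears twice.

```python
import json


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def merge_by_dataset_order(dataset_ids, part_records):
    """Return (merged, missing): records in dataset order, plus ids with no output."""
    by_id = {}
    for record in part_records:
        by_id[record["question_id"]] = record  # assumed policy: later duplicate wins
    merged = [by_id[qid] for qid in dataset_ids if qid in by_id]
    missing = [qid for qid in dataset_ids if qid not in by_id]
    return merged, missing


if __name__ == "__main__":
    ids = ["q1", "q2", "q3"]
    parts = [{"question_id": "q3", "hyp": "c"}, {"question_id": "q1", "hyp": "a"}]
    merged, missing = merge_by_dataset_order(ids, parts)
    print([r["question_id"] for r in merged], "missing:", missing)
```

Reporting the missing ids is useful after a parallel sweep, since a failed shard otherwise shows up only as a lower question count at evaluation time.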

	The evaluator caches an atomic rubric per question
	(`<dataset>.atomic-v1.rubric.json`), so subsequent runs reuse it.
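
To make the strict and partial modes concrete, one plausible reading is sketched below: a rubric is a list of atomic checks, strict credit requires all of them, and partial credit is the fraction satisfied. This is an assumption for illustration; the authoritative definitions are in `evaluate_qa.py`, and the function name and input format here are hypothetical.

```python
def score_answer(atom_verdicts):
    """Score one answer from per-atom judge verdicts (True = atom satisfied).

    Assumed definitions:
      strict  = 1.0 only if every atom is satisfied, else 0.0
      partial = fraction of atoms satisfied
    """
    if not atom_verdicts:
        raise ValueError("rubric has no atoms")
    satisfied = sum(1 for verdict in atom_verdicts if verdict)
    total = len(atom_verdicts)
    return {
        "strict": 1.0 if satisfied == total else 0.0,
        "partial": satisfied / total,
    }


if __name__ == "__main__":
    print(score_answer([True, False, True, True]))
    print(score_answer([True, True]))
```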

	## Pipeline modes

	`main.py --mode` selects how a question is answered:

	- `embed`: top-k flat dense retrieval (GTE-7B), then a single LLM call to answer.
	- `keyword`: LLM-generated keywords + lexical matching, then answer.
	- `agent`: Plan-Act-Read loop. Combine it with `--hier_v2` (semantic-summary stage) and
	  `--hier_union` (union with flat top-k) for the hierarchical-memory variant.

	`--no_semantic` disables the semantic-summary memory layer, leaving flat memory only.
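
The two retrieval building blocks named above, flat top-k dense retrieval and a union of candidate lists, can be illustrated with a toy example. This is a sketch under stated assumptions: the 2-D vectors stand in for real embeddings, the function names are hypothetical, and the order-preserving union is only one way the `--hier_union` behavior could be realized in `main.py`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


def union_keep_order(primary, secondary):
    """Concatenate two candidate lists, dropping repeats and keeping first-seen order."""
    seen = set()
    merged = []
    for chunk_id in list(primary) + list(secondary):
        if chunk_id not in seen:
            seen.add(chunk_id)
            merged.append(chunk_id)
    return merged


if __name__ == "__main__":
    chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
    flat = top_k([1.0, 0.0], chunks, k=2)
    hierarchical = ["s1", "a"]  # hypothetical ids from a semantic-summary stage
    print(flat, union_keep_order(hierarchical, flat))
```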

	## Baselines

	The three external baselines (MemoChat, RAPTOR, ReadAgent) live under
	`baselines/` together with our thin wrappers
	(`run_<baseline>_baseline.py`). Each baseline's upstream LICENSE is preserved.

	## License

	This repository is released under the license stated in the corresponding
	LICENSE file (TBD prior to release). Upstream baselines retain their original
	licenses.