Update README.md

276bc2b verified 2 months ago

4.99 kB

	---
	license: apache-2.0
	language:
	- pyt
	base_model:
	- openai-community/gpt2
	---
	# MemoryDecoder-GPT2-Small

	## Model Description

	Memory Decoder is a pretrained, plug-and-play memory component designed for efficient domain adaptation of large language models. This checkpoint contains the GPT2-small Memory Decoder trained on WikiText-103, as described in our NeurIPS 2025 paper.

	- Paper: [Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models](https://www.arxiv.org/abs/2508.09874)
	- GitHub: [https://github.com/LUMIA-Group/MemoryDecoder](https://github.com/LUMIA-Group/MemoryDecoder/tree/main)
	- Conference: NeurIPS 2025 (Poster)
	- Model Size: 124M parameters
	- Base Architecture: GPT2-small transformer decoder

	## Overview

	Memory Decoder bridges the gap between non-parametric retrieval methods and parametric fine-tuning approaches. By pre-training a compact transformer decoder to internalize retrieval patterns, it provides:

	- Plug-and-Play Integration: Works with any GPT2 model variant without modifying original parameters
	- Efficient Inference: No retrieval overhead - just parallel forward passes
	- Domain Expertise: Captures long-tail knowledge like kNN-LM but with parametric efficiency
	- Preserved Capabilities: Original model remains unchanged

	## Quick Start

	### Step 1: Import Libraries and Initialize Models

	```python
	from memDec import MemoryDecoder
	import transformers
	from transformers import AutoModelForCausalLM
	from loguru import logger

	# Define paths to your models
	base_lm_path = "gpt2-xl" # or any GPT2 variant
	knn_generator_path = "Clover-Hill/MemoryDecoder-gpt2-small"

	# Load tokenizer and models
	tokenizer = transformers.AutoTokenizer.from_pretrained(base_lm_path)
	base_lm = AutoModelForCausalLM.from_pretrained(base_lm_path)
	knn_generator = AutoModelForCausalLM.from_pretrained(knn_generator_path)
	```

	### Step 2: Prepare Models and Create Joint Model

	```python
	# Resize embeddings and set to evaluation mode
	base_lm.eval()
	knn_generator.eval()

	# Create the joint Memory Decoder model
	joint = MemoryDecoder(base_lm, knn_generator, lmbda=0.55, knn_temp=1.0).to("cuda")
	```

	### Step 3: Generate Text and Compare Results

	```python
	# Prepare input prompt
	prompt = "As with previous Valkyira Chronicles games , Valkyria Chronicles III is"
	inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

	# Generate with Memory Decoder
	out_ids = joint.generate(**inputs, max_new_tokens=20, do_sample=False)
	logger.info(f"Memory Decoder output: {tokenizer.decode(out_ids[0], skip_special_tokens=True)}")

	# Generate with base model for comparison
	out_ids = base_lm.generate(**inputs, max_new_tokens=20, do_sample=False)
	logger.info(f"Base Model output: {tokenizer.decode(out_ids[0], skip_special_tokens=True)}")
	```

	📊 Generation Results Comparison:

	\| Model \| Generated Continuation \|
	\|-------\|------------------------\|
	\| Base Model \| "...is a turn-based strategy game. The player takes control of a squad of Valkyria soldiers..." \|
	\| +Memory Decoder \| "...is a role-playing* video game developed by Sega and published by Sega for the PlayStation 2."* \|

	> [!NOTE]
	> Memory Decoder correctly identifies Valkyria Chronicles III as a role-playing game (factually accurate), while the base model incorrectly predicts it as a strategy game.

	## Performance on WikiText-103

	\| Model Configuration \| Perplexity \| Improvement \|
	\|:-------------------\|:----------:\|:-----------:\|
	\| GPT2-small (baseline) \| 24.89 \| - \|
	\| GPT2-small + MemoryDecoder \| 13.36 \| -11.53 \|
	\| GPT2-medium (baseline) \| 18.29 \| - \|
	\| GPT2-medium + MemoryDecoder \| 12.25 \| -6.04 \|
	\| GPT2-large (baseline) \| 15.80 \| - \|
	\| GPT2-large + MemoryDecoder \| 11.53 \| -4.27 \|
	\| GPT2-xl (baseline) \| 14.39 \| - \|
	\| GPT2-xl + MemoryDecoder \| 10.93 \| -3.46 \|

	## Key Features

	- Universal Compatibility: Works with all GPT2 model sizes (small, medium, large, xl)
	- Parameter Efficient: Only 124M additional parameters enhance models up to 1.5B
	- Domain Adaptation: Trained to capture WikiText-103 domain knowledge
	- Inference Speed: Minimal overhead compared to retrieval-based methods

	## Training Details

	- Training Data: WikiText-103
	- Training Objective: Hybrid KL divergence and language modeling loss
	- Supervision Signal: kNN distributions from GPT2-xl, it is suggested to use the finetuned version of GPT2-xl [here](https://huggingface.co/Clover-Hill/gpt2-xl-finetuned-wikitext103).
	- Hyperparameters:
	- Learning rate: 1e-3
	- Beta (loss balance): 0.5
	- Training Epoch: 70

	## Citation

	```bibtex
	@article{cao2025memory,
	title={Memory decoder: A pretrained, plug-and-play memory for large language models},
	author={Cao, Jiaqi and Wang, Jiarui and Wei, Rubin and Guo, Qipeng and Chen, Kai and Zhou, Bowen and Lin, Zhouhan},
	journal={arXiv preprint arXiv:2508.09874},
	year={2025}
	}
	```

	## Contact

	For questions and support: maximus.cao@outlook.com