RecaLLM-Qwen2.5-7B
RecaLLM is a family of reasoning language models that interleave reasoning with explicit, constrained-decoding recall spans that copy evidence verbatim from the context. This addresses the lost-in-thought phenomenon, in which reasoning LLMs lose their in-context retrieval ability after chain-of-thought generation.
This model is RecaLLM trained on Qwen2.5-7B-Instruct.
Paper: arXiv:2604.09494 | Code: https://github.com/kswhitecross/RecaLLM
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"kswhitecross/RecaLLM-Qwen2.5-7B",
trust_remote_code=True,
torch_dtype="auto",
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("kswhitecross/RecaLLM-Qwen2.5-7B")
messages = [{"role": "user", "content": "Your prompt here..."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
outputs = model.generate(inputs, max_new_tokens=10240, do_sample=True, temperature=0.6, top_p=0.95)
# Keep skip_special_tokens=False so the <|start_recall|>/<|end_recall|> tokens remain visible
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=False))
Recall Spans
RecaLLM uses dedicated special tokens <|start_recall|> and <|end_recall|> (each a single token) to delimit recall spans. During a recall span, a constrained decoding mechanism (built into the model class) masks out invalid tokens at each step, allowing only tokens that continue a valid prefix match against the input context. This guarantees that every recall span is a verbatim contiguous substring of the searchable context.
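The prefix-match masking idea can be sketched in a few lines. This is an illustrative toy, not the model's actual implementation: the function name, toy token ids, and the choice to always permit the end-of-recall token are assumptions for the sketch.

```python
def allowed_next_tokens(context_ids, recalled_ids, end_recall_id):
    """Return the set of token ids that keep the recall span a verbatim
    contiguous substring of context_ids."""
    n = len(recalled_ids)
    allowed = set()
    # Find every position where the span recalled so far matches the context,
    # and allow the token that immediately follows each match.
    for i in range(len(context_ids) - n):
        if context_ids[i:i + n] == recalled_ids:
            allowed.add(context_ids[i + n])
    # Closing the span is always a legal continuation.
    allowed.add(end_recall_id)
    return allowed

# Toy context with a repeated token (id 10 appears twice):
context = [10, 20, 30, 40, 10, 50]
# After recalling [10], both continuations of that prefix are allowed.
print(allowed_next_tokens(context, [10], end_recall_id=-1))
```

At each decoding step inside a span, all other logits would be masked to -inf, so sampling can only extend a prefix that exists in the context or close the span.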
The model decides what and when to recall; constrained decoding ensures the recalled content is exact. Example output:
<think>
I need to find the population of the city mentioned in the document.
Looking at the context, I see <|start_recall|>The city of Springfield has a population of 167,376<|end_recall|>.
So the population is 167,376.
</think>
Answer: 167,376
System Prompt
This model was trained with a specific system prompt that instructs it to use recall spans. The chat template automatically includes this prompt when no system message is provided. Using a different system prompt may significantly degrade performance. See the code repository for the full system prompt.
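In practice, this means you should usually omit the system turn entirely and let the chat template inject the trained prompt. A sketch of the two message layouts (the override text below is a placeholder, not a recommended prompt):

```python
# Recommended: no system turn; the chat template supplies the trained
# system prompt automatically.
messages = [{"role": "user", "content": "Summarize the document."}]

# Possible but discouraged: an explicit system turn replaces the trained
# prompt and may significantly degrade recall behavior.
messages_override = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the document."},
]
```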
Inference with vLLM
For vLLM inference, use the VLLMTokenRecaLLMLogitsProcessor from the RecaLLM code repository. vLLM should load the model using the base Qwen2 architecture (not the custom RecaLLM transformers classes) via hf_overrides, with the logits processor applied as an overlay:
from vllm import LLM, SamplingParams
from recallm import VLLMTokenRecaLLMLogitsProcessor
llm = LLM(
model="kswhitecross/RecaLLM-Qwen2.5-7B",
trust_remote_code=True,
logits_processors=[VLLMTokenRecaLLMLogitsProcessor],
hf_overrides={
"model_type": "qwen2",
"architectures": ["Qwen2ForCausalLM"],
},
)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=10240)
outputs = llm.generate(["Your prompt here..."], sampling_params=sampling_params)
See recallm/recallm_vllm.py in the code repo for the full logits processor implementation.
Citation
@article{whitecross2026recallm,
title={RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval},
author={Whitecross, Kyle and Rahimi, Negin},
journal={arXiv preprint arXiv:2604.09494},
year={2026}
}