LTE: Learning from Trial and Error
LTE is an RLVR approach that mitigates the exploration stagnation of LMs by learning from their previously self-made mistakes, without requiring any external expert guidance. LTE raises the performance upper bound of LMs and enhances both exploitation and exploration during training.
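In RLVR, the training signal comes from an automatic verifier on tasks with checkable answers rather than from a learned reward model. A minimal sketch of such a verifiable reward, using a hypothetical extractor that compares the last number in a response against the reference answer (the function names and extraction rule are illustrative assumptions, not the paper's implementation):

```python
import re

def extract_final_answer(response: str):
    # Hypothetical extractor: take the last number-like token in the response.
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return matches[-1] if matches else None

def verifiable_reward(response: str, gold: str) -> float:
    # RLVR-style binary reward: 1.0 if the extracted answer matches the
    # reference exactly, 0.0 otherwise. No learned reward model is involved.
    return 1.0 if extract_final_answer(response) == gold else 0.0

print(verifiable_reward("9.9 is larger than 9.11, so the answer is 9.9", "9.9"))  # 1.0
print(verifiable_reward("9.11 is larger.", "9.9"))  # 0.0
```

Rollouts that fail this check are exactly the "self-made mistakes" that LTE feeds back into training.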
Here is an example of using LTE models for inference:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "JamyDohrn/LTE-Qwen3-8B-Base"
question = "Which number is larger? 9.11 or 9.9?"

# Format the question with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{"role": "user", "content": question}]
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate with vLLM.
llm = LLM(model=model_path)
params = SamplingParams(temperature=0.6, max_tokens=32768)
outputs = llm.generate([chat], params)
print(outputs[0].outputs[0].text)
LTE is built on the following repositories, and we thank their teams for their valuable contributions to the community:
If you find our work useful, feel free to cite our paper:
@misc{tang2026steprivertwicelearning,
  title={Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error},
  author={Chenming Tang and Hsiu-Yuan Huang and Weijie Liu and Clive Bai and Saiyong Yang and Yunfang Wu},
  year={2026},
  eprint={2510.26109},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.26109},
}