Papers
arxiv:2605.14323

Dynamic Latent Routing

Published on May 14 · Submitted by Fangyuan Yu on May 15
Abstract

Temporal composition of sub-policies in MDPs with time-varying rewards enables optimal policy recovery through General Dijkstra Search, which inspires a dynamic latent routing method for language-model fine-tuning that outperforms traditional supervised approaches.

AI-generated summary

We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDPs) with time-varying reward functions. We introduce General Dijkstra Search (GDS) and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning (SFT) across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.

Community

Paper submitter

Humans live in a continuous world, yet think in discrete language. LLMs live in a tokenized world, yet think in continuous representations. So is discrete language merely a communication artifact?

Our answer: no. Language is not just discrete; it is compositional. That structure lets agents both act and learn by composition. "Open the door, then enter the room" composes two policies into a new one. "A narwhal is a whale with a horn" teaches an unfamiliar concept by combining ones we already know.

We turn this intuition into an RL theorem: with General Dijkstra Search, shorter policies can be concatenated, much like words in language, to form optimal goal-reaching policies. Instead of relearning every situation from scratch, the agent can reuse learned sub-policies and search over their possible compositions as goals change.
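To make the composition idea concrete, here is a minimal sketch of a Dijkstra-style search over sub-policy concatenations. It is not the paper's GDS algorithm (which handles time-varying rewards in an MDP); it only illustrates the intuition above: treat each learned sub-policy as an edge between abstract states, then search for the cheapest sequence of sub-policies reaching a goal. The state names, costs, and the `compose_policies` helper are all hypothetical.

```python
import heapq

def compose_policies(subpolicies, start, goal):
    """Dijkstra search over sub-policy compositions (illustrative sketch).

    subpolicies: dict mapping (state, next_state) -> (cost, policy_name).
    Returns (total cost, list of sub-policy names): a longer goal-reaching
    policy built by concatenating shorter ones.
    """
    # Frontier entries: (accumulated cost, current state, sub-policy sequence).
    frontier = [(0, start, [])]
    best = {start: 0}
    while frontier:
        cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return cost, path
        if cost > best.get(state, float("inf")):
            continue  # stale frontier entry, a cheaper route was found already
        for (s, nxt), (c, name) in subpolicies.items():
            if s != state:
                continue
            new_cost = cost + c
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt, path + [name]))
    return float("inf"), None  # goal unreachable from start

# Hypothetical sub-policies: two short ones compose more cheaply
# than one monolithic policy for the same goal.
subpolicies = {
    ("hallway", "doorway"): (1, "open_door"),
    ("doorway", "room"): (1, "enter_room"),
    ("hallway", "room"): (5, "climb_window"),
}
cost, plan = compose_policies(subpolicies, "hallway", "room")
# plan is ["open_door", "enter_room"] with cost 2: composition beats
# the direct but expensive "climb_window" policy.
```

When the goal changes, only the search is rerun; the sub-policies themselves are reused unchanged, which is the "act and learn by composition" point above.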

Inspired by this, Dynamic Latent Routing lets an LLM learn its own inner monologue. Each code acts like a small steering signal for a sub-policy inside the model. DLR searches for useful codes, trains the model to reuse them, and lets codes compose into longer thoughts. Across low-data fine-tuning settings, DLR matches or outperforms SFT, with learned codes and n-grams taking on distinct causal roles.
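The paper defines DLR precisely; as a loose toy sketch of the "search, select, update" loop described above, one step might look like the following, where a small codebook of discrete codes is searched against a hidden state, the selected code steers the state, and the code is then updated. Every name, shape, and update rule here is an assumption for illustration, not the paper's method.

```python
import numpy as np

def route_step(hidden, codebook, lr=0.1):
    """One hypothetical 'search, select, update' step over discrete codes.

    hidden: (d,) hidden state; codebook: (K, d) array of learnable codes.
    """
    dists = np.linalg.norm(codebook - hidden, axis=1)  # search: score every code
    k = int(np.argmin(dists))                          # select: route via nearest code
    steered = hidden + codebook[k]                     # code acts as a steering signal
    codebook[k] += lr * (hidden - codebook[k])         # update: pull code toward use
    return steered, k

# Toy usage with assumed dimensions (K=4 codes, d=8 hidden size).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 8))
hidden = rng.normal(size=8)
steered, k = route_step(hidden, codebook)
```

Repeating such steps over a sequence yields a chain of selected codes, which is one way to picture codes "composing into longer thoughts."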
