WTF GENIUS PAPERS
Papers that made me appreciate my major and my life a little more. obs=Observation, innov=Innovation. Most papers are about improving tiny models.
Paper • 2605.06548 • Published • 52 • Note: obs. Strong text generation need not be strictly left-to-right, but alternatives struggle to combine efficiency, scaling, and global semantic modeling. innov. Cola DLM maps text into continuous latents, diffuses a global semantic prior with a block-causal DiT, then decodes conditionally, separating semantic planning from local wording.
Scaling Latent Reasoning via Looped Language Models
Paper • 2510.25741 • Published • 229 • Note: obs. Standard LLMs mostly learn to "think" after pretraining (prompting/CoT), so pretraining compute is underused for learned multi-step manipulation. innov. Looped Language Models add iterative latent computation plus an entropy-regularized depth allocator during pretraining; the Ouro 1.4B/2.6B models report performance comparable to baselines of up to ~12B parameters across many evals.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Paper • 2502.05171 • Published • 155 • Note: obs. Token-level CoT scales compute by emitting more text, which costs latency and can be bottlenecked by context length. innov. A recurrent-depth LM unrolls a shared block to arbitrary depth at test time, scaling latent compute without extra tokens; a 3.5B prototype shows reasoning gains up to a compute load comparable to ~50B params.
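The core trick is easy to sketch: at inference you unroll one weight-shared block more or fewer times. A toy numpy sketch, where the sizes, weights, and tanh "block" are illustrative stand-ins rather than the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (illustrative)

# One weight-shared "recurrent" block; a stand-in for a full transformer block.
W = rng.normal(scale=0.3, size=(d, d))

def block(state, injected):
    # One latent step: mix the current state with the injected input embedding.
    return np.tanh(state @ W + injected)

def recurrent_depth_forward(x, depth):
    # Unroll the SAME block `depth` times at test time: more depth buys more
    # latent compute without adding parameters or emitting extra tokens.
    s = np.zeros(d)
    for _ in range(depth):
        s = block(s, x)
    return s

x = rng.normal(size=d)
shallow = recurrent_depth_forward(x, depth=2)
deep = recurrent_depth_forward(x, depth=32)
```

Changing `depth` changes compute but not parameter count, which is the sense in which a 3.5B model can spend ~50B-class compute.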
Pretraining Language Models to Ponder in Continuous Space
Paper • 2505.20674 • Published • 3 • Note: obs. One forward pass per token forces the model to commit early, while deeper deliberation typically requires emitting more tokens. innov. "Pondering" repeats forward passes within a token step by feeding back a probability-weighted embedding mixture; the mechanism is learned self-supervised during pretraining, and the paper reports that pondering-enhanced small models can match much larger vanilla counterparts.
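The feedback loop is simple enough to sketch: take the predicted next-token distribution, mix the embedding table with those probabilities, and feed the mixture back for another pass. A minimal numpy sketch with a one-matrix "body" and a tiny vocabulary, all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 10, 6                           # illustrative sizes
E = rng.normal(size=(vocab, d))            # token embedding table
W_h = rng.normal(scale=0.2, size=(d, d))   # the model "body" (one matrix here)
W_out = rng.normal(size=(d, vocab))        # output projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x):
    # One ordinary forward pass: hidden update, then next-token distribution.
    h = np.tanh(x @ W_h)
    return softmax(h @ W_out)

def ponder(x, n_ponder):
    # Instead of committing after one pass, feed back the probability-weighted
    # mixture of token embeddings and run the body again, n_ponder times.
    p = forward(x)
    for _ in range(n_ponder):
        soft_token = p @ E                 # continuous "pondered" embedding
        p = forward(soft_token)
    return p

x = rng.normal(size=d)
p_fast = ponder(x, n_ponder=0)             # ordinary single-pass prediction
p_slow = ponder(x, n_ponder=3)             # three extra latent passes
```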
Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts
Paper • 2510.07358 • Published • 4 • Note: obs. Interpretability evidence suggests reasoning computation concentrates in a subset of layers, but standard decoding traverses the full stack once. innov. Train the model to iterate a selected mid-layer block as a recursive latent "thinking" loop, then decode back to tokens for test-time reasoning scaling.
EMO: Pretraining Mixture of Experts for Emergent Modularity
Paper • 2605.06663 • Published • 5 • Note: obs. MoEs activate sparse experts, but using only a domain-relevant subset at inference usually breaks performance. innov. EMO constrains tokens within a document to choose from a shared expert pool, making semantic expert modules emerge from document boundaries; 25% expert retention costs only ~1% absolute drop.
From Growing to Looping: A Unified View of Iterative Computation in LLMs
Paper • 2602.16490 • Published • 1 • Note: obs. Depth growth and inference-time looping both improve reasoning, but their relationship has been unclear. innov. The paper shows they share depth-wise signatures of iterative computation and compose: looping middle blocks of depth-grown models can improve reasoning primitives up to 2× even without loop training.
JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation
Paper • 2512.19171 • Published • 4 • Note: obs. AR LMs entangle reasoning with surface token generation, so wording errors can corrupt the reasoning state. innov. JEPA-Reasoner reasons in latent space with a separate Talker for text realization, improving error containment and uncertainty handling; a 0.9B model strongly beats a coupled baseline.
Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge
Paper • 2601.08808 • Published • 39 • Note: obs. Discrete CoT increases length and bandwidth, but uncertainty is often localized to a small set of plausible next steps. innov. Sample K candidate tokens per step and merge their embeddings into one continuous multiplex token, enabling branch-and-merge reasoning optimized with on-policy RL.
Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
Paper • 2505.15778 • Published • 19 • Note: obs. Discrete CoT explores reasoning paths one token at a time, which limits branching and often inflates token budget. innov. Generate "soft" concept tokens as embedding mixtures (continuous-space reasoning) without extra training; reports pass@1 gains of up to ~2.48 points while reducing token usage by up to ~22.4% on math/coding benchmarks.
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
Paper • 2502.12134 • Published • 3 • Note: obs. Hard-token CoT constrains intermediate reasoning to discrete vocabulary and can be inefficient or brittle. innov. A lightweight assistant proposes soft-thought tokens that are projected into the base LLM’s representation space, improving reasoning with parameter-efficient tuning while leaving the base model unchanged.
SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning
Paper • 2505.11484 • Published • 6 • Note: obs. Continuous latent thoughts can be high-information, but a fixed latent trace limits exploration compared to multi-sample discrete CoT. innov. Perturb latent thoughts using specialized initial tokens and train for diversity (contrastive) so test-time scaling explores multiple soft-thought paths; reports stronger gains than SoftCoT baselines and compatibility with self-consistency.
Large Language Models Explore by Latent Distilling
Paper • 2604.24927 • Published • 72 • Note: obs. Stochastic sampling often gives lexical variety without true semantic exploration, limiting pass@k scaling. innov. ESamp trains a lightweight test-time distiller to predict deep hidden states from shallow ones, using prediction error as a novelty signal to bias decoding toward unexplored reasoning paths.
Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing
Paper • 2602.03845 • Published • 27 • Note: obs. Parallel reasoning (multiple CoT branches at once) wastes compute because branches evolve independently without coordination, leading to redundant trajectories and no way to know when consensus is reached. innov. A 2D probing interface extracts intermediate answers from all parallel branches (creating a branch×depth matrix), enabling global divergence pruning and stability-based stopping, cutting redundant computation by 60%+ while maintaining accuracy through training-free online control.
Attention Residuals
Paper • 2603.15031 • Published • 184 • Note: obs. PreNorm residual streams add all layer outputs uniformly, causing hidden-state growth and diluted layer contributions. innov. AttnRes learns input-dependent attention over previous layer outputs; Block AttnRes makes it practical and improves Kimi Linear models across pretraining and downstream evals.
Tensor Product Attention Is All You Need
Paper • 2501.06425 • Published • 91 • Note: obs. Long-context inference is dominated by KV-cache memory, limiting usable sequence length under fixed hardware. innov. Tensor Product Attention factorizes Q/K/V into compact contextual low-rank components, shrinking KV cache while improving quality; T6 beats MHA/MQA/GQA/MLA baselines.
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Paper • 2601.07372 • Published • 47 • Note: obs. Transformers spend compute simulating lookup of stable knowledge that could be stored more directly. innov. Engram adds O(1) conditional memory as a sparsity axis alongside MoE, improving knowledge, reasoning, code/math, and long-context tasks with a prefetchable memory module.
Interpretable-by-Design Transformers via Architectural Stream Independence
Paper • 2603.07482 • Published • 1 • Note: obs. Post-hoc interpretability fights residual-stream entanglement after it has already mixed token identity and context. innov. Late Fusion keeps symbolic token and contextual streams independent until output, preserving interpretable heads and making interventions less semantically destructive.
The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling
Paper • 2603.07461 • Published • 2 • Note: obs. A single residual stream hides which computation preserves token identity versus contextual meaning. innov. Dual-Stream separates token and context channels, with tunable mixing; Kronecker mixing keeps most LM performance while exposing cleaner, more modular computation.
Mixture of Cognitive Reasoners: Modular Reasoning with Brain-Like Specialization
Paper • 2506.13331 • Published • 2 • Note: obs. Monolithic LMs entangle logic, language, spatial, and social reasoning instead of specializing modules as brains do. innov. MiCRo partitions a pretrained LM into four cognitively aligned modules; routing/ablation shows causal specialization while matching or improving reasoning and alignment baselines.
A Neuroscience-Inspired Dual-Process Model of Compositional Generalization
Paper • 2507.18868 • Published • 2 • Note: obs. Neural sequence models still struggle with systematic compositional generalization. innov. Mirage pairs a fast meta-trained Transformer with an explicit schema engine for prioritized rule refinement, reaching >99% SCAN accuracy in a task-agnostic dual-process setup.
Disentangling Reasoning Capabilities from Language Models with Compositional Reasoning Transformers
Paper • 2210.11265 • Published • 1 • Note: obs. General reasoning needs composable skills, not a single undifferentiated hidden state. innov. ReasonFormer separates representation from specialized reasoning modules and composes them in parallel/cascade, improving performance and few-shot generalization across reasoning datasets.
Better & Faster Large Language Models via Multi-token Prediction
Paper • 2404.19737 • Published • 81 • Note: obs. Next-token-only training underutilizes supervision in the near-future context and limits sample efficiency. innov. Multiple future-token prediction heads on a shared trunk improve capability and enable multi-token generation speedups at inference.
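Mechanically this is one shared trunk with n output heads, one per future offset. A toy numpy sketch (the single-matrix trunk and sizes are illustrative, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, n_heads = 8, 12, 4  # predict the next 4 tokens (illustrative sizes)

trunk_W = rng.normal(scale=0.3, size=(d, d))
# One independent head per future offset t+1 .. t+n_heads.
heads = [rng.normal(size=(d, vocab)) for _ in range(n_heads)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def multi_token_predict(x):
    # The shared trunk computes one representation; each head reads it off
    # to predict a different future position, so one pass drafts n tokens.
    h = np.tanh(x @ trunk_W)
    return [softmax(h @ Wk) for Wk in heads]

dists = multi_token_predict(rng.normal(size=d))
greedy = [int(np.argmax(p)) for p in dists]  # a 4-token greedy draft
```

At inference the extra heads can seed speculative/multi-token decoding; at training time each head contributes its own cross-entropy term.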
Direct Multi-Token Decoding
Paper • 2510.11958 • Published • 9 • Note: obs. Decoder-only LMs re-run early/middle layers for every next token even when those layers have already produced a representation sufficient to support short-range continuation. innov. Direct Multi-Token Decoding reuses late layers to emit multiple tokens per cycle, reducing full forward passes; a fine-tuned Qwen3-4B variant reports up to ~2x speedup with only minor performance loss.
Shaping capabilities with token-level data filtering
Paper • 2601.21571 • Published • 29 • Note: obs. Removing entire documents to shape capabilities is blunt and destroys useful data, while existing unlearning methods are vulnerable to adversarial attacks. innov. Token-level filtering that masks specific tokens during backpropagation rather than removing documents, achieving precise capability shaping with 200× better efficiency than document filtering and 10× better robustness to adversarial fine-tuning.
Self-Improving Pretraining: using post-trained models to pretrain better models
Paper • 2601.21343 • Published • 19 • Note: obs. Standard pretraining bakes in hallucinations and unsafe patterns that post-training can't fully fix, and pretraining data quality is fixed rather than dynamically improvable. innov. A pretraining paradigm where a post-trained model acts as real-time judge and rewriter, using RL to improve the next K tokens during pretraining itself, preventing bad patterns from becoming embedded rather than trying to fix them later.
ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution
Paper • 2602.03075 • Published • 7 • Note: obs. Standard pretrain→post-train pipelines are one-way, so RL-improved reasoning signals rarely improve the base distribution. innov. ReMiT feeds RL-tuned policy signals back into mid-training by reweighting reasoning-relevant tokens, creating an iterative loop with reported gains in both pretraining and post-training evals.
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Paper • 2601.18734 • Published • 6 • Note: obs. On-policy distillation fixes train/inference mismatch, but usually needs a separate larger teacher and ignores privileged ground-truth traces. innov. OPSD makes one model act as teacher and student under different contexts, using privileged traces to supervise student rollouts with 4–8× better token efficiency than GRPO.
Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
Paper • 2601.09088 • Published • 63 • Note: obs. Common "sequence distillation" (SFT on teacher outputs) treats distillation as filtered imitation, underrepresenting the teacher's sequence distribution, mismatching it to student capacity, and adding exposure bias from teacher forcing vs AR inference. innov. Distribution-aligned sequence distillation with explicit teacher-student interaction; reports DASD-4B-Thinking as SOTA at its scale using ~448K samples, with model and dataset released.
Privileged Information Distillation for Language Models
Paper • 2602.04942 • Published • 28 • Note: obs. Privileged reasoning/action context helps long-horizon agent RL, but deployment policies often cannot use that extra information. innov. π-Distill trains PI-conditioned teacher and no-PI student policies inside one model, with OPSD as an RL variant; it distills stronger agentic behavior than standard SFT+RL transfer.
Distilling 100B+ Models 40x Faster with TRL
📝 78 • TRL distillation for 100B+ teachers, 40x faster
Note: obs. On-policy distillation from huge teachers is attractive but bottlenecked by teacher hosting, generation throughput, and logprob transfer. innov. TRL’s DistillationTrainer adds buffered vLLM generation, external teacher-server support, and compact binary logprob payloads, enabling practical 100B+ teacher distillation with up to 40× speedups.
Embarrassingly Simple Self-Distillation Improves Code Generation
Paper • 2604.01193 • Published • 48 • Note: obs. Code models already sample useful solutions from their own distribution, even without a teacher, verifier, or RL loop. innov. SSD samples a model’s own code outputs and SFTs on them, substantially improving LiveCodeBench and balancing precision with exploration.
Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening
Paper • 2601.21590 • Published • 14 • Note: obs. RL post-training is computationally expensive but mostly just reshapes the base model's distribution rather than teaching new reasoning, and MCMC-based power sampling is theoretically sound but 10× too slow for practical use. innov. A training-free autoregressive approximation of power distribution sampling that uses future trajectory quality estimates to sharpen generation, matching GRPO performance with only 2.5-3.5× inference overhead instead of massive training costs.
Falcon-H1-Tiny: A series of extremely small, yet powerful language models redefining capabilities at small scale
📝 42 • Generate text using extremely small yet powerful language models
Note: obs. Traditional scaling assumes reasoning requires billions of parameters and general pretraining before specialization, leaving small models incapable of complex tasks like tool calling or code generation. innov. "Anti-curriculum" pretraining (pure reasoning/code data from scratch), parallel Mamba-Attention hybrids, and GRPO alignment enable 90M-600M models to match 7B baselines, proving extreme specialization and depth beat brute-force scaling.
R-Zero: Self-Evolving Reasoning LLM from Zero Data
Paper • 2508.05004 • Published • 131 • Note: obs. Self-evolving reasoning systems are often bottlenecked by needing large, curated task/label sets to bootstrap training signals. innov. A data-free/self-evolving setup that generates and learns from its own experiences to grow reasoning behavior; positioned as reducing dependence on curated tasks and enabling iterative capability growth without human-labeled corpora.
Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
Paper • 2601.22975 • Published • 111 • Note: obs. RLVR training hits a wall when you run out of verifiable data, and most internet text (textbooks, articles) can't be used because there's no automatic way to check if the model's reasoning is correct. innov. A fill-in-the-middle multiple-choice formulation that converts any coherent text into verifiable RL tasks by masking critical reasoning steps and algorithmically generating plausible distractors, creating infinite training data from previously unusable corpora.
EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning
Paper • 2510.17928 • Published • 4 • Note: obs. Synthetic verifiable data pipelines often fail because verification artifacts are weak/trivial, so they do not reliably separate strong from weak solutions. innov. An evolutionary, task-agnostic framework that jointly synthesizes problems, candidate solutions, and executably-checkable verifiers, iteratively improving strategies via consistency-based evaluation; reports gains under verifiable-RL and distillation setups.
Reinforcement Learning from Meta-Evaluation: Aligning Language Models Without Ground-Truth Labels
Paper • 2601.21268 • Published • 4 • Note: obs. Traditional RL requires ground-truth labels or external verifiers which are expensive or unavailable for many domains, and LLM-as-judge methods still require reference answers. innov. Using natural-language meta-questions (e.g., "Is this logically consistent?") answered by an evaluator model as the reward signal, enabling training in domains where correctness is ambiguous or expensive to verify, with performance matching label-based RL.
Experiential Reinforcement Learning
Paper • 2602.13949 • Published • 74 • Note: obs. Sparse delayed rewards tell an LM that it failed, but not what behavioral change would have fixed the trajectory. innov. ERL trains attempt→feedback→reflection→retry episodes, reinforces the improved retry, and internalizes the correction so deployment keeps single-pass inference cost.
Expanding the Capabilities of Reinforcement Learning via Text Feedback
Paper • 2602.02482 • Published • 2 • Note: obs. RL from scalar rewards is uninformative (a single bit per rollout), while distillation requires expensive full demonstrations; text feedback is abundant in human-AI interaction but unused. innov. RLTF where models learn to internalize textual critiques during training to improve single-turn test performance, using either self-distillation (matching feedback-conditioned generations) or feedback modeling (auxiliary prediction), achieving dense supervision without demonstration costs.
F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare
Paper • 2602.06717 • Published • 74 • Note: obs. Small RLVR groups over-reinforce easy/common successes while missing rare-correct modes, improving average behavior but hurting exploration. innov. F-GRPO adds focal, difficulty-aware advantage scaling to GRPO-style methods, boosting rare-solution retention and pass@k without larger groups or extra sampling compute.
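The shape of the idea can be sketched with 0/1 group rewards: standard GRPO centers rewards on the group mean, and a focal-style factor then shrinks advantages for common outcomes while preserving them for rare ones. The scaling formula below is an illustrative focal-loss-inspired form, not the paper's exact method:

```python
import numpy as np

def grpo_advantages(rewards):
    # Standard GRPO-style advantage: reward minus the group mean.
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def focal_advantages(rewards, gamma=2.0):
    # Focal, difficulty-aware rescaling (illustrative): with 0/1 rewards the
    # group mean p is the success rate. Common outcomes (successes on easy
    # prompts) are shrunk by (1 - p)^gamma, so rare-correct modes on hard
    # prompts keep large advantages instead of being drowned out.
    r = np.asarray(rewards, dtype=float)
    p = r.mean()
    scale = np.where(r > 0, (1.0 - p) ** gamma, p ** gamma)
    return (r - p) * scale

easy = [1, 1, 1, 1, 1, 1, 1, 0]   # easy prompt: success is the common case
hard = [1, 0, 0, 0, 0, 0, 0, 0]   # hard prompt: one rare correct rollout

adv_easy = focal_advantages(easy)
adv_hard = focal_advantages(hard)
```

With this weighting, the lone success in `hard` keeps an advantage two orders of magnitude larger than a routine success in `easy`.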
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
Paper • 2605.00380 • Published • 4 • Note: obs. Negative-sample RL can preserve diversity, but naive penalties may suppress semantic regions shared by correct and incorrect responses. innov. ResRL projects negative hidden states onto a positive low-rank subspace and scales gradients by residuals, improving reasoning while preserving diversity across math, code, agent, and tool tasks.
Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning
Paper • 2509.24372 • Published • 14 • Note: obs. ES was dismissed as unscalable for billion-parameter LMs, leaving RL as the default fine-tuning route. innov. The paper scales full-parameter ES to LLMs and reports advantages over RL in sample efficiency, long-horizon rewards, robustness, stability, and reward-hacking resistance.
Evolution Strategies at the Hyperscale
Paper • 2511.16652 • Published • 5 • Note: obs. Naive ES becomes too expensive because full-rank perturbations require huge memory and batched matrix multiplies. innov. EGGROLL uses low-rank perturbations that average into high-rank updates, making ES competitive with GRPO for reasoning and enabling stable training of unusual recurrent/integer LMs.
The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
Paper • 2411.07279 • Published • 4 • Note: obs. Small LLMs (1B-8B parameters) fail on novel reasoning tasks because they cannot update their understanding during inference; few-shot prompting only activates existing patterns rather than learning new operational rules, creating a hard ceiling on abstract reasoning like ARC visual puzzles. innov. Test-Time Training: during inference, use the few-shot examples as training data to perform actual gradient updates on a small subset of model parameters (LoRA), temporarily adapting the model to the task at hand.
TEMPO: Scaling Test-time Training for Large Reasoning Models
Paper • 2604.19295 • Published • 34 • Note: obs. TTT for reasoning plateaus because self-generated rewards drift and diversity collapses as the policy updates. innov. TEMPO alternates policy refinement on unlabeled questions with critic recalibration on labeled data, formalized as EM; reports large sustained gains on AIME with Qwen3 and OLMo 3.
Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning
Paper • 2601.07641 • Published • 48 • Note: obs. Static tool libraries are intrinsically incomplete in scientific domains, so long-tail tasks fail even when reasoning is competent. innov. A test-time loop that synthesizes, verifies, and iteratively evolves executable tools, evaluated on a benchmark that couples tasks with evolved tools.
OpenClaw-RL: Train Any Agent Simply by Talking
Paper • 2603.10165 • Published • 153 • Note: obs. Agent interactions constantly produce next-state feedback, but most RL systems discard it or require domain-specific reward pipelines. innov. OpenClaw-RL extracts evaluative and directive signals from ordinary chat/tool/GUI/SWE interactions, using async serving, judging, and training to learn from live agent use.
Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration
Paper • 2602.03647 • Published • 7 • Note: obs. Training search-integrated agents with RL fails because sparse trajectory rewards can't distinguish high-quality reasoning from lucky guesses, so early retrieval mistakes cascade through the chain. innov. An Actor-Refiner framework wherein the Actor generates trajectories with tool calls and a Meta-Refiner performs "cut-and-regenerate" repairs on specific failed steps using a hybrid reward (outcome + information density of evidence), reporting ~16% gains.
Aster: Autonomous Scientific Discovery over 20x Faster Than Existing Methods
Paper • 2602.07040 • Published • 2 • Note: obs. Scientific-discovery agents bottleneck on slow evaluate-edit cycles, especially when each trial requires training, simulation, or expensive search. innov. Aster iteratively edits programs against evaluator scripts, accelerating autonomous discovery across math, kernels, biology, neuroscience, and LM training; reports >20× speed gains.
Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math
Paper • 2602.06291 • Published • 24 • Note: obs. Research-level math answers are hard to verify because plausible solutions may require scarce expert judgment. innov. Consequence-Based Utility tests a candidate solution by using it as an exemplar for related verifiable problems, ranking hard-math outputs better than reward models or LLM judges.
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Paper • 2603.12529 • Published • 19 • Note: obs. LRMs often reach the final answer early, then keep spending reasoning tokens on redundant continuation. innov. TERMINATOR learns first-answer/optimal-exit positions from traces and cuts CoT length by 14–55% across math, code, and QA while matching or beating early-stop baselines.
ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure
Paper • 2602.01472 • Published • 1 • Note: obs. LRM outputs become shorter when multiple questions compete for one context, revealing concise reasoning traces the model can already produce. innov. ConPress harvests these pressure-induced compressed traces for SFT, cutting reasoning tokens by 59% on MATH500 and 33% on AIME25 with only 8k examples.
Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
Paper • 2603.00578 • Published • 2 • Note: obs. Long CoT often couples reasoning ability with unnecessary verbosity. innov. Draft-Thinking teaches concise “draft” reasoning through progressive curriculum and adaptive prompting, reducing MATH500 reasoning budget by 82.6% for only a small accuracy drop.
TokenSkip: Controllable Chain-of-Thought Compression in LLMs
Paper • 2502.12067 • Published • 4 • Note: obs. Long CoT tokens have unequal semantic value, but normal decoding pays for every token. innov. TokenSkip identifies and skips lower-importance reasoning tokens for controllable compression, cutting GSM8K reasoning tokens by about 40% with negligible accuracy loss.
Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Paper • 2512.20578 • Published • 86 • Note: obs. External judges and text-based self-critique add compute and can correlate weakly with true correctness. innov. A small auxiliary predictor that reads hidden-state and attention traces from a frozen model to estimate correctness during generation with negligible overhead.
H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs
Paper • 2512.01797 • Published • 9 • Note: obs. Hallucination work often treats errors as data/objective failures, leaving neuron-level mechanisms unclear. innov. Finds a sparse set of hallucination-associated neurons, under 0.1%, that predict hallucinations, causally connect to over-compliance, and already arise during pretraining.
Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation
Paper • 2512.20908 • Published • 29 • Note: obs. After reasoning distillation, it is unclear which parts of a student's reasoning are inherited from the teacher versus reverting to the base student. innov. A provenance-tracing analysis that compares teacher/base/distilled probabilities at the sentence/action level and uses the signal to guide principled data selection.
When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners
Paper • 2505.15257 • Published • 1 • Note: obs. Multilingual reasoning can be bottlenecked by language-specific representations rather than the reasoning process itself. innov. Training-free causal ablation removes language-specific components while preserving top-layer language features, improving reasoning across 10 LLMs and 11 languages.
Are formal and functional linguistic mechanisms dissociated in language models?
Paper • 2503.11302 • Published • 1 • Note: obs. LMs can be fluent while failing facts/reasoning, raising the question of whether formal language and functional reasoning rely on separable circuits. innov. Circuit analysis across five LMs and ten tasks finds little overlap between formal and functional circuits, but also no single unified “formal language” circuit.
Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks
Paper • 2601.03448 • Published • 13 • Note: obs. Raw-text pretraining does not explicitly optimize linguistic competence, so grammatical generalizations emerge indirectly and slowly. innov. Transform text into structured language-learning tasks and mix them into pretraining to accelerate linguistic competence acquisition without specializing away from general tasks.
LSRIF: Logic-Structured Reinforcement Learning for Instruction Following
Paper • 2601.06431 • Published • 12 • Note: obs. Standard RLHF rewards treat multi-constraint instructions as flat; partial failures are not propagated, so models learn weak adherence to ordered and conditional requirements. innov. Build LsrInstruct with explicit parallel/sequential/conditional constraint structures, and a structure-aware reward (avg aggregation, failure-penalty propagation, branch selection) that improves in- and out-of-domain instruction following and general reasoning.
RECAST: Expanding the Boundaries of LLMs' Complex Instruction Following with Multi-Constraint Data
Paper • 2505.19030 • Published • 1 • Note: obs. When prompts contain many explicit requirements (often >10), instruction-following breaks down because evaluation and training data under-represent that regime. innov. Synthesize constraint-heavy examples from real prompts and add automatic verification (rule-based for quantitative constraints plus qualitative validation) to build RECAST-30K; fine-tuning yields improved complex instruction-following and supports reward construction.
OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment
Paper • 2510.07743 • Published • 14 • Note: obs. Scalar/pairwise reward models compress preferences too aggressively and miss multi-dimensional criteria that humans use in evaluation. innov. A large (prompt, rubric) collection plus contrastive rubric generation and noise filtering to train rubric-conditioned reward models; reports a ~6.8% gain over size-matched reward baselines with transfer to downstream policy performance.
Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report
Paper • 2507.06968 • Published • 1 • Note: obs. Large instruction corpora can still be shallow (low complexity) or narrow (poor rare-domain coverage), so scaling sample count alone plateaus. innov. A framework combining hierarchical labeling, informative seed selection, evolutionary synthesis, and deficiency diagnosis to iteratively expand coverage and depth; produces a ~1.5M-instruction set and reports improved instruction-following on multiple models.
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
Paper • 2409.02098 • Published • 3 • Note: obs. Traditional synthetic data generation creates generic text that doesn't match the specific distribution of downstream tasks, requiring massive filtering or domain adaptation that wastes compute and compromises quality. innov. A retrieval-augmented generation framework that crafts synthetic datasets by retrieving task-specific exemplars from a corpus and using them as seeds for targeted augmentation, producing training data that matches the exact task distribution without task-specific human annotation.
Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
Paper • 2602.10388 • Published • 244 • Note: obs. Text-surface diversity is a weak proxy for whether SFT data covers the features a model actually needs. innov. FAC measures missing coverage in SAE/interpretable feature space, then synthesizes examples targeting those gaps; the feature-space signal transfers across LLaMA, Mistral, and Qwen families.
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
Paper • 2402.13064 • Published • 51 • Note: obs. Instruction data generation often depends on seed examples, narrowing coverage and inheriting dataset biases. innov. GLAN builds a taxonomy of human knowledge/capabilities, expands it into subjects/syllabi/concepts, and generates broad instruction data without task-specific seed datasets.
BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation
Paper • 2502.01697 • Published • 2 • Note: obs. Instruct-tuned models follow directions but lack diversity; base models are more diverse but lower-quality and less controllable. innov. BARE uses a base model to generate diverse candidates and an instruct model to refine them, producing small synthetic datasets that improve code, math, and RAFT results.
Beyond Output Critique: Self-Correction via Task Distillation
Paper • 2602.00871 • Published • 2Note obs. Small models fail at self-correction because they lack capacity to simultaneously critique outputs and maintain task understanding; existing methods assume frontier model capacity. innov. Separates task abstraction (what must be solved) from solution generation, using structured templates as intermediate representations; small models can use templates distilled by larger models to achieve reliable self-correction without external verifiers.
DataDecide: How to Predict Best Pretraining Data with Small Experiments
Paper • 2504.11393 • Published • 20Note obs. Choosing pretraining data mixtures at target scale is too expensive to brute-force. innov. DataDecide runs controlled multi-corpus, multi-scale experiments and shows small proxy runs—especially likelihood-based signals—can predict many 1B-scale data choices at tiny compute cost.
Predicting LLM Reasoning Performance with Small Proxy Model
Paper • 2509.21013 • Published • 6Note obs. Small proxy models often fail to predict large-model reasoning because reasoning emerges more reliably past several billion parameters. innov. rBridge aligns proxy scoring with the pretraining objective and target tasks using frontier reasoning traces, cutting dataset-ranking cost by >100× while improving correlation to large-model reasoning.
Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice
Paper • 2512.24503 • Published • 1Note obs. “Fair” fixed-hyperparameter proxy runs can flip data-quality conclusions because each data recipe has different optimal training settings. innov. The paper argues proxy assessment should approximate data-specific tuning; simply reducing proxy learning rates improves correlation with fully tuned large-scale runs across 23 recipes.
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Paper • 2305.10429 • Published • 5Note obs. Pretraining data-domain proportions strongly affect performance, but downstream-tuned mixture search is expensive and task-specific. innov. DoReMi trains a small Group-DRO proxy to infer domain weights, then resamples large-model data; reports +6.5 few-shot points and baseline accuracy in 2.6× fewer steps.
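The Group-DRO reweighting at DoReMi's core reduces to a multiplicative-weights update on per-domain excess loss. A minimal sketch, assuming `proxy_losses`/`ref_losses` are per-domain average losses from the proxy and reference models (the function name, `eta`, and `smoothing` are illustrative, not the paper's exact settings):

```python
import math

def doremi_update(domain_weights, proxy_losses, ref_losses, eta=1.0, smoothing=1e-3):
    """One Group-DRO-style domain-weight step: upweight domains where the
    proxy model's loss exceeds the reference model's, then renormalize and
    mix with uniform so no domain's sampling probability collapses to zero."""
    excess = [max(p - r, 0.0) for p, r in zip(proxy_losses, ref_losses)]
    unnorm = [w * math.exp(eta * e) for w, e in zip(domain_weights, excess)]
    total = sum(unnorm)
    normed = [u / total for u in unnorm]
    k = len(normed)
    return [(1.0 - smoothing) * w + smoothing / k for w in normed]

# Domain 0 shows excess loss, so its sampling weight should grow.
weights = doremi_update([1/3, 1/3, 1/3],
                        proxy_losses=[3.0, 2.0, 2.0],
                        ref_losses=[2.0, 2.0, 2.0])
```

Domains where the proxy lags the reference gain sampling mass; the uniform mixing term keeps every domain in play for the large-model rerun.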
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Paper • 2504.16511 • Published • 23Note obs. Quality filtering and diversity balancing are usually optimized separately, missing the tradeoff under a fixed token budget. innov. QuaDMix jointly models data quality and domain diversity with a parameterized sampler, using proxy simulations plus LightGBM search; reports 7.2% average benchmark improvement.
Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning
Paper • 2506.11300 • Published • 2Note obs. Pretraining usually shuffles data randomly, underusing ordering signals that could improve early and mid-training efficiency. innov. A systematic curriculum study trains 200+ models and finds warmup curricula using compression ratio, lexical diversity, or readability can yield lasting gains up to ~3.5%.
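Of the difficulty proxies studied, compression ratio is computable with stdlib `zlib` alone. A sketch of ordering warmup documents easy-to-hard by it (function names are illustrative):

```python
import zlib

def compression_ratio(text):
    """Compressed-to-raw size ratio: lower means more redundant/simpler text."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

def warmup_curriculum(docs):
    """Order warmup documents from most compressible (easy) to least (hard)."""
    return sorted(docs, key=compression_ratio)

simple = "the cat sat on the mat. " * 40
varied = "Jackdaws love my big sphinx of quartz; vexed nymphs waltz by quickly."
ordered = warmup_curriculum([varied, simple])
```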
Target-Oriented Pretraining Data Selection via Neuron-Activated Graph
Paper • 2604.15706 • Published • 10Note obs. Targeted pretraining needs data matched to a desired capability, but black-box embeddings obscure what target features matter. innov. NAG selects sparse high-impact neurons for target examples and ranks candidate data by graph similarity, improving target pretraining and exposing a functional neuron backbone.
The Best Instruction-Tuning Data are Those That Fit
Paper • 2502.04194 • Published • 3Note obs. SFT responses from stronger or external LMs can be out-of-distribution for the target model, causing diminishing returns or robustness loss. innov. GRAPE samples responses from multiple LMs and selects the response the target model assigns highest probability, outperforming stronger-teacher and more-data baselines.
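GRAPE's selection rule reduces to "score each candidate under the target model, keep the argmax." A toy sketch where `logprob_fn` stands in for a real target-model scoring call and whitespace tokenization is a stand-in for the real tokenizer:

```python
def select_fitting_response(prompt, candidates, logprob_fn):
    """Keep the candidate the *target* model finds most probable, using a
    length-normalized sum of per-token log-probs."""
    def score(resp):
        tokens = resp.split()
        lps = [logprob_fn(prompt, tokens[:i], tokens[i]) for i in range(len(tokens))]
        return sum(lps) / max(len(lps), 1)
    return max(candidates, key=score)

# Toy target model: familiar tokens are high-probability.
familiar = {"the", "cat", "sat"}
def toy_logprob(prompt, prefix, token):
    return -1.0 if token in familiar else -5.0

best = select_fitting_response("p", ["quixotic zephyr", "the cat sat"], toy_logprob)
```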
LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning
Paper • 2505.07437 • Published • 2Note obs. Model-aware iterative data selection is useful but costly because it repeatedly runs full-dataset inference. innov. LEAD estimates sample utility inside the training loop using dynamic uncertainty plus cluster-level bandits, improving performance with only 2.5% of data and 5–10× less time.
Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities
Paper • 2501.12147 • Published • 2Note obs. Influence-based selection favors tasks with intrinsically high influence, hurting balanced capability learning. innov. BIDS normalizes influence scores and iteratively selects data for the most underrepresented task, letting a 15% subset beat full-data tuning with better balance.
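The BIDS loop can be sketched as per-task score normalization followed by round-robin-by-deficit selection. A minimal version, assuming `influence` maps task name to `(example_id, raw_score)` pairs (min-max normalization here is an assumption of this sketch):

```python
def bids_select(influence, budget):
    """Normalize influence within each task, then repeatedly hand the next
    pick to whichever task is currently most underrepresented."""
    pools = {}
    for task, pairs in influence.items():
        scores = [s for _, s in pairs]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid divide-by-zero on constant scores
        pools[task] = sorted(((eid, (s - lo) / span) for eid, s in pairs),
                             key=lambda p: -p[1])
    selected, counts = [], {t: 0 for t in pools}
    while len(selected) < budget:
        live = [t for t in pools if pools[t]]
        if not live:
            break
        task = min(live, key=lambda t: counts[t])  # most underrepresented task
        eid, _ = pools[task].pop(0)                # its best remaining example
        selected.append(eid)
        counts[task] += 1
    return selected

picked = bids_select({"math": [("m1", 9.0), ("m2", 5.0), ("m3", 1.0)],
                      "code": [("c1", 2.0), ("c2", 1.0)]}, budget=4)
```

Even though raw math influence dwarfs code influence here, the budget splits evenly across tasks.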
Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation
Paper • 2402.18191 • Published • 2Note obs. Instruction selection methods often rely on fragile APIs, inherit GPT biases, or collapse dataset diversity. innov. CaR ranks examples with an expert-aligned 550M scoring model, then clusters to preserve diversity; selecting 1.96% of Alpaca beats Alpaca by large margins in GPT-4 evals.
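The cluster-then-rank recipe can be sketched independently of the 550M scorer: the quality model and the clustering step are passed in as callables (names illustrative):

```python
from collections import defaultdict

def cluster_rank_select(examples, quality, cluster_of, per_cluster=1):
    """Rank by quality, but cap picks per cluster so the selection stays
    diverse instead of being swept by one dominant style."""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[cluster_of(ex)].append(ex)
    selected = []
    for members in buckets.values():
        members.sort(key=quality, reverse=True)
        selected.extend(members[:per_cluster])
    return selected

# Toy run: two clusters (odd/even), quality = the number itself.
sel = cluster_rank_select(list(range(1, 7)),
                          quality=lambda x: x,
                          cluster_of=lambda x: x % 2)
```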
SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking
Paper • 2406.10882 • Published • 3Note obs. Data quality alone is insufficient; response-style consistency affects how efficiently SFT transfers behavior. innov. SCAR ranks examples by consistency in linguistic form and instructional surprisal, sometimes matching or beating full-data tuning with only 0.7% of the dataset.
Towards Active Synthetic Data Generation for Finetuning Language Models
Paper • 2512.00884 • Published • 1Note obs. Synthetic SFT data is usually generated once before training, ignoring what the student still fails to learn. innov. Iterative active generation queries the teacher based on the current student state; simple active-learning criteria beat static generation under fixed sample/compute budgets.
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Paper • 2512.13586 • Published • 93Note obs. Token-level masked diffusion LMs can't use KV caching and must learn combinatorial token dependencies, raising compute and hurting coherence vs AR decoding. innov. Slot-level plan-and-infill: diffusion picks weakly dependent fixed-length slots; an AR infiller decodes selected slots in parallel under a unified causal/KV-cache setup, reporting >18× speedup vs prior MDMs and ~2.33× vs strong ARMs while narrowing the accuracy gap.
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Paper • 2503.09573 • Published • 77Note obs. Diffusion LMs can parallelize but often lag in likelihood modeling and can be constrained to fixed-length generation setups. innov. Block diffusion models define an AR distribution over blocks and run diffusion within each block, enabling arbitrary-length generation plus KV caching and parallel sampling; reports strong diffusion-model language-modeling results.
LLaDA2.1: Speeding Up Text Diffusion via Token Editing
Paper • 2602.08676 • Published • 72Note obs. Diffusion LMs have parallelism, but mask-to-token decoding makes speed/quality tradeoffs brittle. innov. LLaDA2.1 adds token-to-token editing, speedy/quality decoding modes, and dLLM-specific RL; releases 16B/100B models with strong benchmark and high-throughput coding results.
DMax: Aggressive Parallel Decoding for dLLMs
Paper • 2604.08302 • Published • 51Note obs. Aggressive parallel decoding in diffusion LMs accumulates errors when uncertain token guesses become bad context. innov. DMax reframes denoising as refinement from masks and bad predictions toward token embeddings, using On-Policy Uniform Training and Soft Parallel Decoding for much higher tokens-per-forward.
DFlash: Block Diffusion for Flash Speculative Decoding
Paper • 2602.06036 • Published • 76Note obs. Speculative decoding is still limited when the draft model itself generates tokens autoregressively. innov. DFlash uses a block-diffusion draft model to propose many tokens in one pass conditioned on target context, reporting >6× lossless acceleration and gains over EAGLE-3.
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Paper • 2505.22618 • Published • 45Note obs. Bidirectional attention in diffusion LMs prevents standard KV caching, making inference dominated by repeated recomputation across denoising steps. innov. A block-wise approximate KV cache reuses activations across steps, plus a confidence-aware parallel decoding fix for dependency disruption; reports large end-to-end speedups on diffusion LM baselines under long-generation settings.
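The confidence-aware part of the decoding fix can be sketched as a thresholded commit rule. A minimal sketch, assuming `confidences[i]` is the top-token probability at masked position `i` (the fallback-to-argmax rule is a common convention, stated here as an assumption):

```python
def parallel_unmask_step(confidences, masked, threshold=0.9):
    """One confidence-aware parallel-decoding step: commit every masked
    position whose top-token confidence clears the threshold; if none do,
    commit the single most confident position so decoding always advances."""
    ready = [i for i in masked if confidences[i] >= threshold]
    if not ready:
        ready = [max(masked, key=lambda i: confidences[i])]
    return sorted(ready)

conf = [0.95, 0.30, 0.99, 0.50]
```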
dKV-Cache: The Cache for Diffusion Language Models
Paper • 2505.15781 • Published • 16Note obs. Diffusion decoding recomputes QKV for all tokens at each step, even though many token representations change slowly over denoising. innov. Delayed/conditioned cache reuse (dKV-Cache) with two variants: near-lossless decoding-oriented caching and a more aggressive greedy mode; reports ~2–10x inference speedups across general, math, and code benchmarks.
Attention Is All You Need for KV Cache in Diffusion LLMs
Paper • 2510.14973 • Published • 42Note obs. KV states drift unevenly across layers and steps, yet diffusion decoders typically refresh everything uniformly. innov. Elastic-Cache uses an attention-aware drift test plus a depth-aware refresh schedule to recompute only where needed; reports substantial speedups on math/code tasks (including very large gains on long sequences) with negligible quality loss.
d^2Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching
Paper • 2509.23094 • Published • 5Note obs. Even with block diffusion, recomputing KV everywhere remains redundant because many tokens/layers exhibit limited change across steps. innov. A training-free dual adaptive caching framework selectively updates KV states for a subset of tokens per step and reuses the rest, aiming to preserve quality while substantially reducing redundant compute in diffusion decoding.
FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models
Paper • 2509.20624 • Published • 1Note obs. Discrete diffusion for text can require hundreds to thousands of model evaluations, so parallelism does not automatically translate into low latency. innov. Few-Step Discrete Flow-Matching trains step-budget consistency so small step counts approximate long trajectories; reports that ~8 steps can match a 1024-step baseline perplexity for long generation, enabling very large sampling speedups.
Learning to Parallel: Accelerating Diffusion Large Language Models via Adaptive Parallel Decoding
Paper • 2509.25188 • Published • 3Note obs. Fixed confidence-threshold heuristics for parallel decoding are input-agnostic and miss optimal speed/quality trade-offs. innov. Train a lightweight filter to predict whether a token is already final, enabling adaptive unmasking plus end-of-text prediction; reports up to ~22.6x speedup without performance drop (and higher with KV-cache) on the LLaDA benchmark.
From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models
Paper • 2511.21103 • Published • 1Note obs. Prior parallel decoding prioritizes high-confidence tokens, but those tokens carry little information, slowing progress per round. innov. A bits-to-rounds principle links required rounds to total information and per-round budget, motivating Explore-Then-Exploit decoding that targets uncertain tokens to trigger cascades; empirically reduces rounds without quality loss.
Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models
Paper • 2601.12247 • Published • 1Note obs. Many diffusion decoding strategies are reactive and do not use bidirectional context to enforce a global generation trajectory. innov. Plan-Verify-Fill builds a hierarchical skeleton via high-leverage anchors and uses quantitative verification to stop deliberation when marginal value drops; reports up to ~65% fewer function evaluations on LLaDA/Dream instruct models.
Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model
Paper • 2601.15892 • Published • 53Note obs. Code diffusion models often lag AR under matched budgets, in part due to unstable training and inefficient knowledge uptake. innov. Reuse an AR code pipeline but add block-diffusion continual pretraining with tailored warmup and noise scheduling to improve code benchmark performance.
Diffusion Language Models Know the Answer Before Decoding
Paper • 2508.19982 • Published • 27Note obs. Diffusion LMs often settle on the right answer early, then waste compute refining after their beliefs have effectively converged. innov. A training-free decoding rule that watches a simple confidence-gap signal to stop iterating and commit, cutting steps by up to ~3.4× while preserving accuracy (near-full correctness at half steps on GSM8K/MMLU).
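The stopping rule can be sketched as watching the top-1/top-2 probability gap across denoising steps. A minimal sketch; `patience` and `gap_threshold` are illustrative knobs, not the paper's exact criterion:

```python
def early_commit_step(top2_gaps, patience=3, gap_threshold=0.5):
    """Return the denoising step at which to stop: the first step where the
    top-1/top-2 probability gap has stayed above `gap_threshold` for
    `patience` consecutive steps (the model's answer has stopped moving)."""
    streak = 0
    for step, gap in enumerate(top2_gaps):
        streak = streak + 1 if gap >= gap_threshold else 0
        if streak >= patience:
            return step  # commit here, skip remaining refinement steps
    return len(top2_gaps) - 1  # never converged early: run to the end
```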
Inference-Time Hyper-Scaling with KV Cache Compression
Paper • 2506.05345 • Published • 30Note obs. Test-time scaling is not only limited by tokens; KV-cache memory becomes the bottleneck for long and parallel reasoning. innov. Dynamic Memory Sparsification compresses KV caches about 8× after short training, letting the same memory budget support more reasoning and improving AIME, GPQA, and LiveCodeBench.
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Paper • 2604.04921 • Published • 113Note obs. Long-reasoning KV compression based on recent post-RoPE attention scores is unstable because position rotation changes query orientation. innov. TriAttention scores keys from stable pre-RoPE Q/K centers and trigonometric distance preferences, matching full attention with major throughput or memory savings.
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Paper • 2312.12456 • Published • 45Note obs. LLM inference on consumer GPUs is bottlenecked by memory bandwidth and capacity, requiring expensive multi-GPU setups or cloud APIs because hot-activated neurons (frequently used) and cold-activated neurons (rarely used) are treated identically in the memory hierarchy. innov. A neuron-aware inference engine that predicts which neurons will activate (hot/cold classification) and keeps hot neurons in GPU memory while offloading cold neurons to CPU RAM, achieving up to 11× speedup on consumer GPUs while retaining model accuracy.
PowerInfer-2: Fast Large Language Model Inference on a Smartphone
Paper • 2406.06282 • Published • 39Note obs. Running 40B+ parameter models on smartphones was considered impossible due to DRAM limitations (8GB vs 80GB+ models) and thermal constraints, requiring either tiny models or cloud connectivity that compromises privacy and latency. innov. Neuron-aware adaptive scheduling with asymmetric tensor partitioning across heterogeneous memory (NPU, LPDDR, flash/SD card) and dynamic sparse expert selection based on local correlation, enabling 3B–40B model inference on smartphones at 11 tokens/sec.
River-LLM: Large Language Model Seamless Exit Based on KV Share
Paper • 2604.18396 • Published • 6Note obs. Decoder-only early exit lacks KV caches for skipped layers, so theoretical layer savings often fail to become wall-clock speedups. innov. River-LLM creates a lightweight KV-Shared Exit River and uses state-transition similarity to guide exits, achieving 1.71–2.16× practical speedup training-free.
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Paper • 2404.16710 • Published • 81Note obs. Early exit can speed LLMs, but exits are often inaccurate unless the model was trained to support them. innov. LayerSkip uses progressive layer dropout and shared early-exit loss, then performs self-speculative decoding where early layers draft and later layers verify, reaching up to ~2× speedups.
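The draft-then-verify loop can be sketched abstractly, with `draft_next` standing in for the early-exit head and `verify_next` for the full model (in the real system verification is batched, not token-by-token; this is a toy sketch):

```python
def self_speculative_decode(prompt, draft_next, verify_next, n_draft=4, max_len=16):
    """Draft a short continuation with the cheap early-exit path, then verify
    with the full model, keeping the longest agreeing prefix; the full model's
    token is always the one appended, so output matches plain full decoding."""
    out = list(prompt)
    while len(out) < max_len:
        ctx, draft = list(out), []
        for _ in range(n_draft):  # cheap draft pass
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        for t in draft:           # full-model verification
            full_t = verify_next(out)
            out.append(full_t)
            if full_t != t or len(out) >= max_len:
                break             # mismatch or done: discard rest of draft
    return out

# Toy next-token rule: the "full model" emits position mod 3.
full_model = lambda ctx: len(ctx) % 3
```

When the draft always agrees, each loop accepts `n_draft` tokens per full-model sweep; when it never agrees, decoding degrades gracefully to one verified token per loop.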
EPAS: Efficient Training with Progressive Activation Sharing
Paper • 2601.19089 • Published • 1Note obs. Efficient inference can require post-hoc distillation/pruning of a full model, and adding activation sharing to pretrained models at inference time is hard without performance loss. innov. Progressive activation sharing during training expands shared regions from deep to shallow layers, letting the model adapt to shared representations; one training run yields a family of efficient models (varying sharing percentages), and continual pretraining converts existing pretrained models into efficient variants without distillation.
Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better
Paper • 2602.05393 • Published • 8Note obs. Pretraining usually makes every target model learn from scratch, even though smaller pretrained models already contain useful late-layer abstractions. innov. LET transfers late-layer representations from a smaller teacher into earlier target-model layers/steps, speeding convergence and reporting up to 1.6× faster training plus ~5% downstream accuracy gains.
Horizon-LM: A RAM-Centric Architecture for LLM Training
Paper • 2602.04816 • Published • 20Note obs. Training large models requires expensive multi-GPU setups because systems persistently host model replicas and full computation graphs in limited GPU memory, despite terabytes of cheaper host RAM sitting idle. innov. A CPU-master/GPU-template execution model that streams layers from host RAM to GPU on-demand with double-buffered pipelining, eliminating persistent GPU-resident params and enabling 120B model training on single GPU while achieving 12× better throughput than SoTA offloading.
Chronicals: A High-Performance Framework for LLM Fine-Tuning with 3.51x Speedup over Unsloth
Paper • 2601.02609 • Published • 2Note obs. Training a 7B model requires 84GB VRAM (14B params + 14B grads + 56B optimizer states), making it impossible on single consumer GPUs despite the weights themselves being only 14GB. innov. Cut Cross-Entropy reduces vocabulary-logit memory by 37× via online softmax, fused kernels eliminate intermediate allocations, and LoRA+ applies a ~16× higher learning rate to the B matrix than to A, enabling full fine-tuning with 3.5× less memory and time than the current SOTA (Unsloth).
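The online-softmax idea behind Cut Cross-Entropy can be sketched in pure Python: stream over vocab rows in chunks, keeping only a running max and rescaled running sum, so the full logit vector never exists at once (the real method is a fused GPU kernel; this is a numerics sketch only):

```python
import math

def chunked_cross_entropy(hidden, vocab_embed, target_idx, chunk=4):
    """Negative log-likelihood of `target_idx` without materializing the full
    logit vector: an online log-sum-exp over vocab chunks."""
    running_max, running_sum = float("-inf"), 0.0
    target_logit = None
    for start in range(0, len(vocab_embed), chunk):
        for v in range(start, min(start + chunk, len(vocab_embed))):
            logit = sum(h * w for h, w in zip(hidden, vocab_embed[v]))
            if v == target_idx:
                target_logit = logit
            # Online log-sum-exp update: rescale the sum when the max moves.
            new_max = max(running_max, logit)
            running_sum = (running_sum * math.exp(running_max - new_max)
                           + math.exp(logit - new_max))
            running_max = new_max
    log_z = running_max + math.log(running_sum)
    return log_z - target_logit
```

Per-chunk peak memory is O(chunk) logits instead of O(vocab), while the result matches the naive log-softmax exactly up to floating-point error.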
Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers
Paper • 2602.06079 • Published • 20Note obs. Matrix-aware optimizers clash with tensor-parallel sharding because the optimizer wants global matrix structure while training fragments parameters. innov. Canzona separates logical optimizer assignment from physical tensor layout, using balanced partitioning and async scheduling; reports 1.57× faster iterations and 5.8× lower optimizer latency.
RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs
Paper • 2602.05367 • Published • 8Note obs. Residual binarization should make LLMs cheap, but independent binary paths co-adapt and lose useful correction capacity. innov. RaBiT builds a residual hierarchy from one shared full-precision weight so each binary path corrects previous error, improving 2-bit QAT and reporting 4.49× inference speedup.
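For context, plain residual binarization (the decomposition RaBiT restructures with its shared-weight hierarchy, which this sketch does not reproduce) looks like this: each binary path encodes the sign of the error left by all previous paths, with the L2-optimal scale being the mean absolute residual:

```python
def residual_binarize(w, n_paths=2):
    """Plain residual binarization: path k is sign(residual_k) scaled by
    mean(|residual_k|), so each added path corrects the remaining error.
    Returns (scales, sign_paths, reconstruction)."""
    residual = list(w)
    scales, sign_paths = [], []
    recon = [0.0] * len(w)
    for _ in range(n_paths):
        signs = [1.0 if x >= 0 else -1.0 for x in residual]
        alpha = sum(abs(x) for x in residual) / len(residual)  # L2-optimal scale
        scales.append(alpha)
        sign_paths.append(signs)
        recon = [r + alpha * s for r, s in zip(recon, signs)]
        residual = [x - alpha * s for x, s in zip(residual, signs)]
    return scales, sign_paths, recon
```

RaBiT's observation is that when these paths are trained independently they co-adapt and stop behaving like error correctors; deriving them from one shared full-precision weight restores the corrective structure.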
REAM: Merging Improves Pruning of Experts in LLMs
Paper • 2604.04356 • Published • 9Note obs. MoE expert pruning saves memory but deletes capacity and distorts learned expert mixtures. innov. REAM compresses MoEs by merging router-weighted expert activations/weights instead of dropping experts, preserving more behavior and often approaching original-model performance.
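One plausible reading of router-weighted merging, sketched below: fold each pruned expert into its most similar kept expert, weighted by routing mass. The L2-similarity assignment and flat weight vectors are assumptions of this sketch, not necessarily the paper's exact rule:

```python
def merge_experts(expert_weights, router_probs, keep=2):
    """Instead of deleting low-traffic experts, average them into the kept
    experts weighted by router probability, preserving some of their behavior.
    `expert_weights[i]` is expert i's flat weight vector; `router_probs[i]`
    is its average routing mass."""
    order = sorted(range(len(router_probs)), key=lambda i: -router_probs[i])
    kept_ids, pruned_ids = order[:keep], order[keep:]
    merged = {k: [router_probs[k] * w for w in expert_weights[k]] for k in kept_ids}
    mass = {k: router_probs[k] for k in kept_ids}
    for p in pruned_ids:
        # Assign the pruned expert to its nearest kept expert (L2 distance).
        tgt = min(kept_ids, key=lambda k: sum(
            (a - b) ** 2 for a, b in zip(expert_weights[k], expert_weights[p])))
        merged[tgt] = [m + router_probs[p] * w
                       for m, w in zip(merged[tgt], expert_weights[p])]
        mass[tgt] += router_probs[p]
    return {k: [m / mass[k] for m in merged[k]] for k in kept_ids}

# Expert 2 is similar to expert 0 and gets folded into it rather than dropped.
merged = merge_experts([[1.0, 0.0], [0.0, 1.0], [0.8, 0.0]],
                       [0.5, 0.4, 0.1], keep=2)
```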
Context Parametrization with Compositional Adapters
Paper • 2509.22158 • Published • 1Note obs. Fine-tuned adapters (LoRA) for different tasks/contexts interfere with each other when composed or switched dynamically, requiring separate forward passes or full model copies that multiply memory costs and prevent efficient multi-task serving. innov. Compositional adapters that can be added, subtracted, and combined arithmetically (A + B - C) while maintaining task fidelity, enabling dynamic context switching through vector arithmetic in adapter space without inference-time interference.
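The adapter arithmetic itself is just signed vector addition over delta weights. A toy sketch on flat vectors (real adapters are per-layer low-rank matrices; the flat-vector form is a simplification):

```python
def compose_adapters(base, *ops):
    """Task-arithmetic composition (e.g. A + B - C): add or subtract adapter
    delta-weight vectors onto a shared base in one pass, with no extra
    forward passes per task. Each op is ('+' or '-', delta_vector)."""
    out = list(base)
    for sign, delta in ops:
        s = 1.0 if sign == '+' else -1.0
        out = [o + s * d for o, d in zip(out, delta)]
    return out

# A + B - C with toy 2-d "adapters": here C exactly cancels A and B.
combined = compose_adapters([0.0, 0.0],
                            ('+', [1.0, 0.0]),
                            ('+', [0.0, 1.0]),
                            ('-', [1.0, 1.0]))
```

Because composition happens in weight space before the forward pass, switching contexts costs a vector add rather than another model copy.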
Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts
Paper • 2602.13367 • Published • 35Note obs. 3B models usually specialize narrowly; strong reasoning, coding, alignment, and long tool-use rarely coexist at that scale. innov. Nanbeige4.1-3B combines reward modeling, complexity-aware code RL, and turn-level search supervision, supporting up to 600 tool-call turns while beating similar and some larger baselines.
Omnilingual MT: Machine Translation for 1,600 Languages
Paper • 2603.16309 • Published • 22Note obs. MT systems still cover only a fraction of the world’s languages, especially low-resource ones. innov. OMT combines broad data creation/curation with specialized 1B–8B translation models for 1,600+ languages, matching or exceeding a 70B baseline and releasing eval resources.
UniWeTok: An Unified Binary Tokenizer with Codebook Size 2^{128} for Unified Multimodal Large Language Model
Paper • 2602.14178 • Published • 14Note obs. Visual tokenizers usually trade off reconstruction, semantic extraction, and generation/editing compatibility. innov. UniWeTok uses a huge binary codebook, pre-post distillation, generative-aware priors, and a hybrid encoder to unify perception, generation, and editing with strong reported efficiency.
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
Paper • 2604.18486 • Published • 90Note obs. Pure language CoT can miss causal world dynamics in embodied vision-language planning. innov. OneVL trains compact latent tokens with both text-CoT and future-frame decoders, then discards decoders at inference; it beats explicit CoT at answer-only latency.
LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers
Paper • 2310.15164 • Published • 4Note obs. LLMs remain unreliable at formal logical deduction even when prompted with CoT. innov. LINC uses the LM as a semantic parser into first-order logic, then delegates proof search to a symbolic prover, yielding large ProofWriter/FOLIO gains over pure CoT.
When Personalization Misleads: Understanding and Mitigating Hallucinations in Personalized LLMs
Paper • 2601.11000 • Published • 27Note obs. Personalization can entangle user-history signals with factual representations, causing answers to track prior context over objective truth. innov. An inference-time steering method that preserves personalization while suppressing factual distortion, plus a benchmark for joint factual and personalized QA.
Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model
Paper • 2602.07120 • Published • 2Note obs. Mixed-license LMs can memorize and reproduce protected text at inference, even after safety filtering. innov. Anchored Decoding constrains a risky model to stay within an information budget around a permissively trained safe LM, reducing copying risk with sequence-level guarantees.
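A toy per-step version of the budget idea: spend information (KL from the safe model) until it runs out, then anchor to the safe distribution. This sketch omits the paper's sequence-level guarantee machinery entirely; the step rule and `budget` accounting are illustrative:

```python
import math

def anchored_step(risky_logp, safe_logp, budget):
    """One decoding step under an information budget: if the risky model's
    KL divergence from the safe model exceeds the remaining budget, fall back
    to the safe distribution; otherwise use the risky one and spend the KL.
    Returns (chosen log-probs, remaining budget)."""
    kl = sum(math.exp(r) * (r - s) for r, s in zip(risky_logp, safe_logp))
    if kl <= budget:
        return risky_logp, budget - kl
    return safe_logp, budget
```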
IMU-1: Sample-Efficient Pre-training of Small Language Models
Paper • 2602.02522 • Published • 8Note obs. Small-LM pretraining can waste huge token budgets when recipe details lag modern scaling practice. innov. IMU-1 combines architecture tweaks, optimizer choices, staged training, and checkpoint EMA to make a 430M model trained on 72B tokens approach much more data-hungry baselines.
Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning
Paper • 2602.06600 • Published • 2Note obs. LRMs often restate the prompt before reasoning; what looks like wasted echoing may actually refocus attention and stabilize computation. innov. The paper formalizes echo costs, then uses Echo-Distilled SFT and Echoic Prompting to induce/use prompt echoes for stronger reasoning performance.
Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
Paper • 2605.05566 • Published • 26Note obs. GRPO can hit zero-advantage failure when all sampled rollouts fail, leaving hard prompts with no useful training signal. innov. LoPE prepends low-perplexity pseudo-Latin perturbations before resampling, shifting the output distribution enough to unlock alternative reasoning paths across 1.7B–7B models.
Introspective Diffusion Language Models
Paper • 2604.11035 • Published • 23