[ {"id":"A001","ground_truth":"The KV cache grows linearly with sequence length and often exceeds the memory footprint of the model itself, creating a primary memory bottleneck during autoregressive decoding that limits achievable batch sizes and context windows.","source":"MiniCache (2024), KV-CAR (2025)"}, {"id":"A002","ground_truth":"KV cache states exhibit high similarity between adjacent layers in the middle-to-deep portion of transformer models, enabling cross-layer compression by merging similar key-value states without significant accuracy loss.","source":"MiniCache (2024)"}, {"id":"A003","ground_truth":"KV cache compression combined with streaming enables long-context LLM serving by reducing the memory footprint required to store and transfer KV states, allowing faster context loading for repeated or shared prefixes.","source":"CacheGen (2023)"}, {"id":"A004","ground_truth":"Speculative decoding reduces LLM inference latency by using a faster draft model to propose multiple candidate tokens in parallel, which the target model then verifies in a single forward pass, achieving 2-3x speedup while maintaining identical output distribution.","source":"Unlocking Efficiency Survey (2024)"}, {"id":"A005","ground_truth":"Higher-quality draft models improve token acceptance rates and thus overall speedup, but require more computation per draft step, creating a fundamental trade-off between draft quality and draft speed that must be optimized per deployment scenario.","source":"Unlocking Efficiency Survey (2024)"}, {"id":"A006","ground_truth":"Self-speculative decoding eliminates the need for a separate draft model by using early exit layers or skipped layers of the target model itself as the draft mechanism, enabling inference acceleration without maintaining a separate smaller model.","source":"SWIFT (2024)"}, {"id":"A007","ground_truth":"Adaptive KV cache merging, guided by model-internal signals about which tokens are candidates for merging, consistently 
outperforms static eviction policies on long-context tasks by preserving information that fixed-budget eviction permanently removes.","source":"Model Tells You Where to Merge (2024)"}, {"id":"A008","ground_truth":"KV cache compression methods show disproportionate degradation on multi-step reasoning tasks compared to standard factual benchmarks, indicating that standard benchmark results overestimate compression safety for reasoning-heavy applications.","source":"Hold Onto That Thought (2025)"}, {"id":"A009","ground_truth":"LLM quantization benefits from higher dimensionality because quantization error averages out across more dimensions, making large models more amenable to aggressive quantization than small models.","source":"GPTVQ (2024)"}, {"id":"A010","ground_truth":"Compressed LLMs show meaningful capability degradation on agentic multi-step tasks even when they pass standard single-turn benchmarks, suggesting that compression evaluation must include agentic task suites beyond static QA benchmarks.","source":"Can Compressed LLMs Truly Act? 
(2025)"}, {"id":"A011","ground_truth":"Token importance for KV cache eviction is better captured by lag-relative information — measuring how much a token's information is already represented by more recent context — than by raw attention scores alone.","source":"LagKV (2025)"}, {"id":"A012","ground_truth":"There exist fundamental information-theoretic compression barriers for autoregressive transformers that bound how much the KV cache can be compressed regardless of method, with the bound tightening for longer sequences and more complex tasks.","source":"Compression Barriers (2025)"}, {"id":"A013","ground_truth":"Multi-draft speculative decoding, which proposes multiple diverse candidate continuations per step, achieves higher expected acceptance rates than single-draft methods, with optimal performance requiring careful diversity-quality trade-off in draft selection.","source":"Towards Optimal Multi-draft Speculative Decoding (2025)"}, {"id":"A014","ground_truth":"Draft models that can identify when to stop generating draft tokens via self-verification of confidence avoid speculative over-generation and achieve better throughput than always-draft-until-budget approaches.","source":"Draft Model Knows When to Stop (2024)"}, {"id":"A015","ground_truth":"Dynamic token sparsification during the prefill phase combined with KV cache compression during decoding addresses both the compute bottleneck and memory bottleneck of large vision-language models in a unified framework.","source":"ZipVL (2024)"}, {"id":"A016","ground_truth":"RLHF trains language models to align with human preferences by first training a reward model on human comparison data, then using reinforcement learning (typically PPO) to fine-tune the language model to maximize predicted reward.","source":"A Survey of RLHF (2023)"}, {"id":"A017","ground_truth":"The reward model in RLHF acts as a learned proxy for human judgment, translating pairwise human preference data into scalar reward signals that guide 
policy optimization, and its quality is the primary determinant of alignment outcome.","source":"A Survey of RLHF (2023)"}, {"id":"A018","ground_truth":"RLHF reward models systematically assign higher scores to longer responses regardless of quality due to length bias in human annotation, which can be mitigated through explicit length normalization or bias-fitting corrections.","source":"Bias Fitting (2025)"}, {"id":"A019","ground_truth":"LoRA adapts large pre-trained models by injecting trainable low-rank decomposition matrices into existing weight layers, reducing trainable parameters by orders of magnitude while keeping the pre-trained weights frozen.","source":"LoRA (2021)"}, {"id":"A020","ground_truth":"LoRA variants have proliferated across diverse adaptation scenarios including domain adaptation, task-specific fine-tuning, and multi-task learning, with key design choices being rank selection, target module selection, and initialization strategy.","source":"A Survey on LoRA (2024)"}, {"id":"A021","ground_truth":"LoRA reduces GPU memory requirements for fine-tuning by eliminating optimizer states for frozen parameters, enabling fine-tuning of large models on commodity hardware at a fraction of full fine-tuning memory cost.","source":"LoRA (2021)"}, {"id":"A022","ground_truth":"DPO eliminates the explicit reward model and RL optimization loop of PPO-based RLHF by directly optimizing the language model policy on preference pairs, showing the optimal policy under Bradley-Terry preferences is implicitly defined by the model's own likelihood ratio.","source":"DPO (2023)"}, {"id":"A023","ground_truth":"DPO training is more stable than PPO because it eliminates the online sampling loop, the separate reward model, and the KL penalty term optimization, reducing the number of hyperparameters and failure modes.","source":"DPO (2023)"}, {"id":"A024","ground_truth":"QLoRA enables fine-tuning of 65B+ parameter models on a single GPU by quantizing the base model to 4-bit 
NormalFloat precision and adding trainable LoRA adapters in BFloat16, backpropagating gradients through the frozen quantized weights into the adapters, with double quantization of the quantization constants further reducing memory.","source":"QLoRA (2023)"}, {"id":"A025","ground_truth":"QLoRA achieves performance matching full 16-bit fine-tuning on instruction following benchmarks while reducing memory usage by 4-6x, establishing 4-bit quantized LoRA as a practical default for fine-tuning models that exceed single-GPU memory limits.","source":"QLoRA (2023)"}, {"id":"A026","ground_truth":"LoftQ improves upon QLoRA by jointly optimizing the quantized model and the LoRA initialization to minimize the approximation gap between the full-precision pretrained weights and the quantized-plus-adapter representation.","source":"LoftQ (2023)"}, {"id":"A027","ground_truth":"Direct alignment methods including DPO, IPO, and their variants can be unified under a common theoretical framework as different parameterizations of the same preference learning objective, subsuming PPO-based RLHF as a special case.","source":"From RLHF to Direct Alignment (2026)"}, {"id":"A028","ground_truth":"Standard DPO exhibits gradient imbalance where gradients from winning and losing responses have asymmetric magnitudes, causing training to disproportionately suppress undesired outputs rather than reinforce desired ones.","source":"Gradient Imbalance in DPO (2025)"}, {"id":"A029","ground_truth":"LoRA rank is a critical hyperparameter where rank too low fails to capture task complexity and rank too high approaches full fine-tuning cost with diminishing returns, with optimal rank varying substantially by task and model scale.","source":"A Survey on LoRA (2024)"}, {"id":"A030","ground_truth":"Reward hacking — where the policy learns to maximize reward model scores through behaviors not reflective of true quality — is a fundamental challenge in RLHF that stems from the imperfect proxy nature of the reward model.","source":"Adversarial Preference Learning (2025)"}, 
{"id":"A031","ground_truth":"RAG enhances purely parametric LLMs on knowledge-intensive tasks by allowing the model to retrieve relevant documents at inference time, reducing factual hallucination and enabling knowledge updates without retraining.","source":"Lewis et al. RAG (2020)"}, {"id":"A032","ground_truth":"RAG evaluation requires multiple dimensions including retrieval quality, context faithfulness, and answer relevance, and current evaluation frameworks have moved toward multi-dimensional metrics that separately assess each pipeline component.","source":"Evaluation of RAG Survey (2024)"}, {"id":"A033","ground_truth":"Graph-based RAG improves over flat vector retrieval by encoding entity relationships and document connectivity into the retrieval process, enabling multi-hop reasoning that flat nearest-neighbor retrieval cannot support.","source":"Graph RAG Survey (2024)"}, {"id":"A034","ground_truth":"Standard RAG pipelines fail primarily through retrieval errors (returning irrelevant documents), context utilization errors (failing to use retrieved content correctly), and faithfulness errors (generating claims not supported by retrieved context).","source":"RAG for AI-Generated Content Survey (2024)"}, {"id":"A035","ground_truth":"Hybrid retrieval combining dense semantic vectors and sparse BM25-style matching consistently outperforms either method alone on scientific document retrieval, with each modality capturing complementary aspects of relevance.","source":"Sparse Meets Dense (2024)"}, {"id":"A036","ground_truth":"Semantic chunking improves retrieval quality on some tasks but its computational overhead often exceeds its benefit, with fixed-size chunking remaining competitive when chunk sizes are appropriately tuned, suggesting semantic chunking should be used selectively.","source":"Is Semantic Chunking Worth the Computational Cost? 
(2024)"}, {"id":"A037","ground_truth":"Hybrid retrieval in RAG benefits from dynamic interpolation weight adjustment between dense and sparse signals rather than fixed weights, with dynamic tuning providing consistent improvements across heterogeneous query types.","source":"DAT (2025)"}, {"id":"A038","ground_truth":"RAG outperforms long-context LLMs on tasks requiring precise multi-document retrieval and citation, while long-context LLMs outperform RAG on tasks requiring holistic understanding of lengthy documents, with neither approach dominating universally.","source":"LaRA (2025)"}, {"id":"A039","ground_truth":"Graph RAG enables customized LLMs for specialized domains by structuring domain knowledge as graphs that capture entity relationships, enabling more precise retrieval of domain-specific information than flat vector indices.","source":"Graph RAG for Customized LLMs Survey (2025)"}, {"id":"A040","ground_truth":"Large language models can be prompted to generate both dense semantic embeddings and sparse lexical representations in a single forward pass, enabling zero-shot unified retrieval without separate training of dense and sparse retrieval systems.","source":"PromptReps (2024)"}, {"id":"A041","ground_truth":"RAG for NLP has evolved from simple retrieve-then-generate pipelines to iterative and adaptive variants that refine retrieval queries based on intermediate generation outputs, improving accuracy on multi-step reasoning tasks.","source":"RAG for NLP Survey (2024)"}, {"id":"A042","ground_truth":"Advanced RAG architectures beyond the basic retrieve-then-generate paradigm include modular RAG, self-reflective RAG, and hybrid RAG, each addressing specific failure modes of naive RAG through architectural modifications.","source":"RAG and Beyond Survey (2024)"}, {"id":"A043","ground_truth":"No single chunking strategy outperforms all others across different domain-specific document types; scientific, legal, and conversational documents each favor different 
chunking granularities and boundary detection methods.","source":"Impact of Chunking Strategies (2025)"}, {"id":"A044","ground_truth":"RAG can automate systematic literature reviews by retrieving relevant papers, synthesizing findings across retrieved documents, and generating structured summaries, significantly reducing manual review effort while maintaining acceptable coverage.","source":"Automating SLRs with RAG (2024)"}, {"id":"A045","ground_truth":"Small embedding models combined with LLM-based re-ranking can match or outperform large embedding models in tri-modal hybrid retrieval for RAG, with the re-ranking stage compensating for lower embedding quality at lower total cost.","source":"Rethinking Hybrid Retrieval (2025)"}, {"id":"A046","ground_truth":"Token-level reward guidance in DPO provides finer-grained training signal than sequence-level preference labels, improving alignment quality on tasks where individual token choices determine response quality.","source":"TGDPO (2025)"}, {"id":"A047","ground_truth":"DPO applied to code LLMs improves code generation alignment with human preferences on correctness and safety without requiring human-labeled reward models, using pairs of passing and failing code solutions as preference data.","source":"Aligning CodeLLMs with DPO (2024)"}, {"id":"A048","ground_truth":"Lossless speculative decoding for diffusion LLMs requires a fundamentally different acceptance criterion than autoregressive speculative decoding due to the non-autoregressive nature of diffusion model inference.","source":"Spiffy (2025)"}, {"id":"A049","ground_truth":"Standard RAG pipelines struggle with non-factoid questions that require multi-aspect reasoning and synthesis across multiple retrieved documents, necessitating question type-aware decomposition strategies.","source":"Typed-RAG (2025)"}, {"id":"A050","ground_truth":"Graph-based RAG for large language model customization leverages structured knowledge graphs to provide domain-specific 
reasoning support that flat retrieval cannot, enabling deployment of specialized models without full retraining.","source":"Graph RAG for Customized LLMs Survey (2025)"}, {"id":"B001","older_consensus":"H2O and StreamingLLM established attention-score-based heavy hitter token eviction as the dominant KV cache management approach, permanently dropping low-attention tokens.","supersession":"Adaptive KV cache merging methods showed that merging similar tokens preserves information that eviction loses, improving long-context task accuracy over eviction baselines.","supersession_type":"soft","key_papers":["H2O (2023)","Model Tells You Where to Merge (2024)"]}, {"id":"B002","older_consensus":"KV cache compression methods operated independently per layer, applying the same compression logic uniformly across all transformer layers.","supersession":"MiniCache showed that KV cache states are highly similar between adjacent middle-to-deep layers, enabling cross-layer merging that achieves 5x compression ratio with near-lossless performance — a result impossible under layer-independent compression.","supersession_type":"hard","key_papers":["StreamingLLM (2023)","MiniCache (2024)"]}, {"id":"B003","older_consensus":"Original speculative decoding required maintaining a separate smaller draft model alongside the target model, adding memory overhead and system complexity.","supersession":"Self-speculative decoding methods like SWIFT demonstrated that the target model's own intermediate layers can serve as draft mechanisms, eliminating the separate draft model while maintaining comparable speedup.","supersession_type":"soft","key_papers":["Leviathan et al. 
(2022)","SWIFT (2024)"]}, {"id":"B004","older_consensus":"Standard autoregressive draft models for speculative decoding are trained to minimize next-token prediction loss, making them susceptible to sequential error accumulation over multi-token drafts.","supersession":"CTC-based draft models produce non-autoregressive drafts that avoid sequential error propagation, demonstrating higher acceptance rates on diverse inputs.","supersession_type":"hard","key_papers":["Autoregressive draft models (2022-2023)","CTC-based Draft Model (2024)"]}, {"id":"B005","older_consensus":"Original speculative decoding generates a single draft sequence of k tokens per step, with acceptance rate bounded by the quality of one draft path.","supersession":"Multi-draft speculative decoding generates multiple diverse candidate continuations and selects among them, achieving higher expected acceptance probability than any single draft.","supersession_type":"soft","key_papers":["Leviathan et al. (2022)","Towards Optimal Multi-draft Speculative Decoding (2025)"]}, {"id":"B006","older_consensus":"PPO-based RLHF requires a separate reward model training phase, an online data collection loop, and complex KL-constrained policy optimization, making it computationally expensive and difficult to tune.","supersession":"DPO showed that the RLHF objective can be optimized directly on offline preference data without a reward model or RL loop, achieving comparable alignment quality at a fraction of the computational cost.","supersession_type":"hard","key_papers":["InstructGPT/PPO RLHF (2022)","DPO (2023)"]}, {"id":"B007","older_consensus":"Standard LoRA fine-tuning requires 16-bit model weights in GPU memory, limiting fine-tuning of 65B+ parameter models to expensive multi-GPU setups.","supersession":"QLoRA demonstrated that 4-bit NormalFloat quantized base weights with BFloat16 LoRA adapters achieves full fine-tuning quality while fitting 65B parameter model fine-tuning on a single 48GB 
GPU.","supersession_type":"hard","key_papers":["LoRA (2021)","QLoRA (2023)"]}, {"id":"B008","older_consensus":"QLoRA initializes LoRA adapters at zero, creating a gap between the quantized model's representations and the original full-precision model's representations at the start of fine-tuning.","supersession":"LoftQ identified the initialization gap as a measurable source of downstream quality degradation and showed that jointly optimizing quantization and LoRA initialization eliminates the gap.","supersession_type":"soft","key_papers":["QLoRA (2023)","LoftQ (2023)"]}, {"id":"B009","older_consensus":"Standard DPO trains on fixed offline preference pairs, which can suffer from distribution shift when the policy drifts away from the data distribution used to collect preferences.","supersession":"RS-DPO showed that combining rejection sampling to generate on-policy preference pairs with DPO optimization reduces distribution shift and improves alignment quality.","supersession_type":"soft","key_papers":["DPO (2023)","RS-DPO (2024)"]}, {"id":"B010","older_consensus":"Standard RLHF reward models are trained to distinguish preferred from non-preferred responses, with the implicit assumption that reward model accuracy generalizes to adversarial inputs.","supersession":"Adversarial preference learning demonstrated that standard reward models can be systematically deceived by adversarially constructed preference pairs that exploit blind spots in the training distribution.","supersession_type":"hard","key_papers":["Standard RLHF reward model (2022)","Adversarial Preference Learning (2025)"]}, {"id":"B011","older_consensus":"Early RAG implementations used fixed-size chunking as the default document segmentation strategy based on simplicity and predictable chunk sizes.","supersession":"Semantic chunking was expected to substantially improve retrieval but empirical evaluation showed improvements are task-dependent and often do not justify the computational 
overhead.","supersession_type":"soft","key_papers":["Fixed-size chunking (2021)","Is Semantic Chunking Worth the Computational Cost? (2024)"]}, {"id":"B012","older_consensus":"Dense Passage Retrieval established dense vector retrieval as state-of-the-art for open-domain QA, assuming semantic embedding similarity captures relevance better than lexical matching.","supersession":"Hybrid dense-sparse retrieval demonstrated that dense and sparse signals capture complementary aspects of relevance, with hybrid methods consistently outperforming pure dense retrieval.","supersession_type":"hard","key_papers":["DPR (2020)","Sparse Meets Dense (2024)"]}, {"id":"B013","older_consensus":"RAG was established as the dominant approach for integrating external knowledge into LLMs, with the assumption that retrieval-then-generate is always preferable to relying on parametric model knowledge.","supersession":"LaRA benchmarking showed that long-context LLMs can match or outperform RAG on tasks where relevant information fits within the context window, with no silver bullet.","supersession_type":"soft","key_papers":["Lewis et al. RAG (2020)","LaRA (2025)"]}, {"id":"B014","older_consensus":"Original RAG treated all retrieved documents as independent context chunks with no modeling of relationships between documents.","supersession":"Graph RAG introduced structured retrieval over knowledge graphs and citation networks, enabling multi-hop reasoning by traversing entity relationships that flat retrieval cannot express.","supersession_type":"soft","key_papers":["Lewis et al. 
RAG (2020)","Graph RAG Survey (2024)"]}, {"id":"B015","older_consensus":"Dense retrieval and sparse retrieval were maintained as separate systems with separate indices, requiring late-fusion combination at query time.","supersession":"PromptReps demonstrated that a single LLM forward pass can simultaneously generate both dense and sparse representations, unifying the retrieval pipeline without separate system maintenance.","supersession_type":"hard","key_papers":["DPR + BM25 (2020)","PromptReps (2024)"]}, {"id":"B016","older_consensus":"Hybrid retrieval systems used static interpolation weights tuned on a validation set, applying the same weight to all queries regardless of query type.","supersession":"DAT demonstrated that per-query dynamic alpha tuning consistently outperforms static weight assignment across diverse query distributions.","supersession_type":"soft","key_papers":["Static hybrid retrieval (2022)","DAT (2025)"]}, {"id":"B017","older_consensus":"H2O established attention score magnitude as the primary criterion for KV cache token importance, evicting low-attention tokens based on the assumption that high attention implies high importance.","supersession":"Mixing Importance with Diversity showed that optimizing jointly for token importance and diversity outperforms pure importance-based eviction, preventing selection of redundant high-attention tokens.","supersession_type":"soft","key_papers":["H2O (2023)","Mixing Importance with Diversity (2025)"]}, {"id":"B018","older_consensus":"PPO-based RLHF was the established paradigm for LLM alignment, with DPO and other direct methods seen as heuristic simplifications rather than principled alternatives.","supersession":"Theoretical unification showed that PPO-based RLHF and all major direct alignment methods are special cases of a unified preference learning framework, placing direct methods on equal theoretical footing.","supersession_type":"hard","key_papers":["PPO RLHF (2022)","From RLHF to Direct 
Alignment (2026)"]}, {"id":"B019","older_consensus":"Early RAG evaluation used single-metric approaches (BLEU, ROUGE, EM) borrowed from extractive QA, not accounting for the multi-component nature of RAG pipelines.","supersession":"Structured RAG evaluation frameworks exposed that single metrics miss component-level failures, necessitating multi-dimensional evaluation covering retrieval precision, faithfulness, and answer relevance separately.","supersession_type":"soft","key_papers":["Early RAG evaluation (2021)","Evaluation of RAG Survey (2024)"]}, {"id":"B020","older_consensus":"KV cache quantization was the standard approach for reducing KV cache memory, operating on stored cache values after generation.","supersession":"Autoencoder-based KV compression demonstrated that learned compact representations achieve higher compression ratios than quantization alone while maintaining model fidelity through reconstruction.","supersession_type":"soft","key_papers":["KV cache quantization (2023)","KV-CAR (2025)"]}, {"id":"B021","older_consensus":"DPO was assumed to have stable gradient dynamics due to its simple binary cross-entropy formulation on preference pairs.","supersession":"Gradient imbalance analysis showed DPO gradients for winning responses are systematically smaller than for losing responses, causing the model to primarily learn what to avoid rather than what to generate.","supersession_type":"hard","key_papers":["DPO (2023)","Gradient Imbalance in DPO (2025)"]}, {"id":"B022","older_consensus":"Speculative decoding selected draft tokens greedily, prioritizing individual token probability without considering batch-level throughput optimization.","supersession":"TETRIS formulated batch speculative decoding as an optimal token selection problem, showing joint optimization of draft token selection across a batch significantly improves GPU utilization.","supersession_type":"soft","key_papers":["Greedy speculative decoding (2023)","TETRIS (2025)"]}, 
{"id":"B023","older_consensus":"Semantic chunking used single content boundaries to segment documents, assuming clean semantic unit separation is sufficient for good retrieval.","supersession":"Mix-of-Overlap showed that allowing multiple overlapping chunk boundaries captures context that single-boundary chunking misses, improving RAG recall on questions spanning chunk boundaries.","supersession_type":"hard","key_papers":["Single-boundary semantic chunking (2023)","Mix-Of-Overlap (2025)"]}, {"id":"B024","older_consensus":"Standard LoRA assigns the same rank to all target layers uniformly, based on the assumption that different transformer layers have similar intrinsic dimensionality for a given task.","supersession":"La-LoRA showed that layer-wise adaptive rank assignment, allocating higher rank to layers with higher task-relevant information, consistently outperforms uniform rank assignment with the same total parameter budget.","supersession_type":"soft","key_papers":["LoRA (2021)","La-LoRA (2025)"]}, {"id":"B025","older_consensus":"Standard RAG architectures prepend retrieved documents to the query and process them through the full attention mechanism.","supersession":"Cross-attention decoupling demonstrated that standard prepend-and-attend is computationally inefficient, with decoupled architectures achieving better faithfulness at lower compute cost.","supersession_type":"hard","key_papers":["Standard RAG (2020)","Decoupling Knowledge and Context (2025)"]}, {"id":"B026","older_consensus":"Fixed-budget KV eviction methods remove the lowest-importance tokens once the cache exceeds a threshold.","supersession":"Adaptive merging methods showed that merging similar tokens preserves information content while reducing cache size, leading to better long-context task performance than fixed-budget eviction.","supersession_type":"soft","key_papers":["H2O/SnapKV (2023)","Model Tells You Where to Merge (2024)"]}, {"id":"B027","older_consensus":"Standard LoRA uses random 
Gaussian initialization for the low-rank matrices, which is simple but may not align with the pretrained weight structure.","supersession":"QR-LoRA demonstrated that QR decomposition-based initialization aligns the adapter subspace with principal directions of the pretrained weight matrix, improving fine-tuning convergence and final quality.","supersession_type":"hard","key_papers":["LoRA (2021)","QR-LoRA (2025)"]}, {"id":"B028","older_consensus":"Flat vector retrieval treats all documents as independent points in embedding space with no structural relationships encoded.","supersession":"Graph-structured retrieval encodes document citations and entity co-occurrences as explicit traversal paths, enabling multi-hop reasoning that flat retrieval cannot achieve.","supersession_type":"soft","key_papers":["DPR (2020)","Graph RAG Survey (2024)"]}, {"id":"B029","older_consensus":"Draft models for speculative decoding were trained independently via standard language model pretraining on general corpora, without explicit alignment to the target model's reasoning patterns.","supersession":"Direct alignment of draft models using chain-of-thought distillation from the target model demonstrated substantially higher token acceptance rates by learning the target model's specific reasoning patterns.","supersession_type":"soft","key_papers":["Standard draft model training (2023)","Direct Alignment of Draft Model (2024)"]}, {"id":"B030","older_consensus":"Standard RLHF reward models were trained to classify preference pairs without mechanisms to prevent exploitation of spurious surface correlations.","supersession":"Discriminative reward modeling with attention hacking mitigation demonstrated that standard reward models attend to superficial features rather than semantic quality, with explicit debiasing substantially improving generalization.","supersession_type":"hard","key_papers":["Standard RLHF reward model (2022)","Alleviating Attention Hacking (2025)"]}, 
{"id":"B031","older_consensus":"RAG was assumed to be strictly preferable to relying on parametric model memory for knowledge-intensive tasks.","supersession":"LaRA benchmarking found no silver bullet: RAG and long-context LLMs each dominate on different task types, with optimal choice requiring task-level routing.","supersession_type":"soft","key_papers":["Lewis et al. RAG (2020)","LaRA (2025)"]}, {"id":"B032","older_consensus":"KV cache compression ratios were assumed to be limited only by the accuracy-compression empirical tradeoff, with no fundamental theoretical lower bound.","supersession":"Compression Barriers established information-theoretic lower bounds on KV cache size for autoregressive models, proving no compression method can exceed these bounds for arbitrary input sequences.","supersession_type":"hard","key_papers":["Empirical KV compression (2023)","Compression Barriers (2025)"]}, {"id":"B033","older_consensus":"LoRA was widely assumed to be a universal drop-in replacement for full fine-tuning across all tasks given its strong NLP benchmark performance.","supersession":"Empirical comparison on handwritten text recognition showed LoRA significantly underperforms full fine-tuning on tasks requiring fine-grained spatial feature adaptation.","supersession_type":"soft","key_papers":["LoRA (2021)","LoRA vs Fine-Tuning for HTR (2025)"]}, {"id":"B034","older_consensus":"Sparse retrieval methods required encoder models trained with sparse objectives and were not applicable to decoder-only architectures.","supersession":"Scaling sparse retrieval in decoder-only LLMs demonstrated that decoder architectures can generate effective sparse retrieval representations, expanding sparse retrieval beyond encoder-only model families.","supersession_type":"soft","key_papers":["SPLADE encoder-based sparse retrieval (2021)","Scaling Sparse and Dense in Decoder-Only LLMs (2025)"]}, {"id":"B035","older_consensus":"H2O and SnapKV used raw attention scores as the sole 
criterion for token importance in KV cache eviction decisions.","supersession":"LagKV showed that attention scores alone miss tokens whose information is no longer novel, while lag-relative importance scoring correctly preserves tokens with unique information not captured downstream.","supersession_type":"hard","key_papers":["H2O/SnapKV (2023)","LagKV (2025)"]}, {"id":"B036","older_consensus":"RLHF and DPO preference optimization methods were developed and evaluated exclusively for autoregressive language models generating discrete tokens.","supersession":"DPO for diffusion models demonstrated that preference optimization principles extend to continuous diffusion processes, challenging the assumption that preference alignment is specific to autoregressive generation.","supersession_type":"soft","key_papers":["DPO for LLMs (2023)","Diffusion Model Alignment with DPO (2023)"]}, {"id":"B037","older_consensus":"Standard RAG chunking operates in a forward pass through documents, creating chunks based on local content signals without considering downstream context needs.","supersession":"Context reconstruction strategies showed that augmenting chunks with backward context references substantially improves RAG retrieval on questions requiring understanding of cross-section relationships.","supersession_type":"soft","key_papers":["Forward-only chunking (2022)","Reconstructing Context (2025)"]}, {"id":"B038","older_consensus":"LLM serving systems managed KV cache entirely in GPU memory, requiring the entire cache to fit in VRAM and limiting context lengths to available GPU memory.","supersession":"CacheGen demonstrated that KV cache compression combined with streaming enables long-context serving beyond GPU memory limits, shifting the bottleneck from memory capacity to bandwidth.","supersession_type":"hard","key_papers":["In-memory KV cache serving (2022)","CacheGen (2023)"]}, {"id":"B039","older_consensus":"LoRA hyperparameters were selected manually based on practitioner 
experience and validation set performance, with no automated selection framework.","supersession":"AutoML-based PEFT selection demonstrated that automated hyperparameter optimization for LoRA configurations can match or outperform manual selection while reducing required expert knowledge.","supersession_type":"soft","key_papers":["Manual LoRA configuration (2021)","AutoAdapt (2025)"]}, {"id":"B040","older_consensus":"Standard dense vector RAG was used for domain adaptation by fine-tuning the embedding model on domain data, assuming better embeddings are sufficient for domain-specific retrieval.","supersession":"Graph RAG for specialized domains showed that domain expert knowledge requires structured relationship representation beyond embedding similarity, enabling reasoning over domain-specific ontologies.","supersession_type":"hard","key_papers":["Dense vector RAG for domains (2021)","Graph RAG for Customized LLMs (2025)"]}, {"id":"B041","older_consensus":"FP16 KV cache storage was the baseline for LLM inference memory usage, with quantization seen as an optional optimization.","supersession":"GPU-accelerated INT8 KV cache quantization demonstrated near-identical inference quality at half the memory cost with negligible latency overhead, establishing INT8 as a practical new default.","supersession_type":"soft","key_papers":["FP16 KV cache baseline (2022)","GPU-Accelerated INT8 KV Cache (2026)"]}, {"id":"B042","older_consensus":"BM25-only retrieval was the standard baseline for domain-specific question answering due to simplicity and strong empirical performance on in-domain queries.","supersession":"Hybrid dense-sparse retrieval substantially outperforms BM25 alone on domain-specific QA, particularly for paraphrase-heavy queries where lexical matching fails.","supersession_type":"soft","key_papers":["BM25 domain QA (2020)","Domain-specific QA with Hybrid Search (2024)"]}, {"id":"B043","older_consensus":"RLHF reward models were evaluated by preference prediction 
accuracy on held-out annotation pairs, with higher accuracy assumed to reflect higher alignment quality.","supersession":"Length bias analysis showed reward model preference accuracy is inflated by systematic preference for longer responses, and high-accuracy reward models may be measuring verbosity rather than quality.","supersession_type":"hard","key_papers":["Standard RLHF reward model evaluation (2022)","Bias Fitting (2025)"]}, {"id":"B044","older_consensus":"Speculative decoding methods always generated k draft tokens per step and submitted all for verification, potentially wasting compute on obviously incorrect later draft tokens.","supersession":"Self-verification speculative decoding showed that draft models can learn to stop early when confidence drops, reducing wasted verification compute and improving overall throughput.","supersession_type":"soft","key_papers":["Fixed-k speculative decoding (2023)","Draft Model Knows When to Stop (2024)"]}, {"id":"B045","older_consensus":"Standard RAG prepended retrieved documents as prefix context and relied on the language model's attention to identify relevant information within the concatenated context.","supersession":"Cross-attention decoupled RAG demonstrated that separating knowledge processing from query context processing reduces computational redundancy and improves grounding fidelity.","supersession_type":"hard","key_papers":["Standard prepend-context RAG (2020)","Decoupling Knowledge and Context (2025)"]}, {"id":"B046","older_consensus":"A single LoRA configuration was applied uniformly across diverse downstream tasks with the expectation it would generalize across task types.","supersession":"Task-adaptive LoRA frameworks showed that decomposing tasks and assigning different LoRA configurations per task cluster outperforms single-configuration LoRA across diverse benchmarks.","supersession_type":"soft","key_papers":["Single-config LoRA (2021)","SPM-LoRA (2025)"]}, {"id":"B047","older_consensus":"Static KV 
cache management policies applied uniform eviction or compression budgets across all layers and time steps regardless of heterogeneous importance of different cache regions.","supersession":"Dynamic retrieval-based KV cache management demonstrated that heterogeneous, input-adaptive cache policies substantially outperform static policies on diverse generation tasks.","supersession_type":"hard","key_papers":["Static KV cache management (2023)","HeteroCache (2026)"]}, {"id":"B048","older_consensus":"A single chunking strategy was assumed to provide reasonable performance across document types in RAG deployments.","supersession":"Systematic comparison across diverse document types demonstrated that no single strategy dominates, with optimal chunking being document-structure-dependent.","supersession_type":"soft","key_papers":["Single-strategy chunking (2022)","Chunking Techniques Comparison (2025)"]}, {"id":"B049","older_consensus":"LoftQ used standard alternating least squares to jointly optimize quantization and LoRA initialization without accounting for the downstream gradient landscape.","supersession":"GA-LoftQ showed that incorporating gradient information into the alternating least squares framework better aligns quantization with the fine-tuning objective, further reducing the initialization gap.","supersession_type":"hard","key_papers":["LoftQ (2023)","GA-LoftQ (2025)"]}, {"id":"B050","older_consensus":"Standard text-only RAG pipelines were applied to heterogeneous documents containing tables and figures, treating non-text content as noise or converting it to text via OCR.","supersession":"Heterogeneous document RAG frameworks demonstrated that joint retrieval over text, table, and figure modalities substantially outperforms text-only RAG on documents where key information resides in structured tables.","supersession_type":"soft","key_papers":["Text-only RAG (2020)","TableRAG (2025)"]}, {"id":"C001","label":"CONTESTED","camps":"Camp A: KV cache compression 
preserves reasoning performance within acceptable bounds for most standard benchmarks (MiniCache, SnapKV, H2O papers). Camp B: KV cache compression causes disproportionate reasoning degradation specifically on multi-step chain-of-thought tasks that standard perplexity and factual benchmarks fail to reveal (Hold Onto That Thought, 2025)."}, {"id":"C002","label":"CONTESTED","camps":"Camp A: Quantization at 4-bit precision preserves sufficient task capability for agentic workflows on standard instruction-following benchmarks. Camp B: Compressed LLMs exhibit significant capability degradation specifically on multi-step agentic tasks requiring planning, tool use, and error recovery that static benchmarks do not capture (Can Compressed LLMs Truly Act?, 2025)."}, {"id":"C003","label":"CONTESTED","camps":"Camp A: KV cache compression ratios are primarily limited by empirical accuracy-compression tradeoffs with no hard theoretical floor. Camp B: Information-theoretic compression barriers for autoregressive transformers establish method-independent lower bounds on achievable KV cache size (Compression Barriers, 2025)."}, {"id":"C004","label":"CONTESTED","camps":"Camp A: Speculative decoding with the standard token acceptance criterion guarantees lossless output distribution for autoregressive models. Camp B: Lossless speculative decoding for diffusion LLMs requires fundamentally different acceptance criteria not present in original autoregressive speculative decoding (Spiffy, 2025)."}, {"id":"C005","label":"CONTESTED","camps":"Camp A: Attention scores are a reliable and sufficient proxy for token importance in KV eviction, supported by H2O and SnapKV results. 
Camp B: Attention scores fail to capture redundancy between tokens, with lag-relative importance signals outperforming attention-score-only eviction on long-context tasks (LagKV, 2025)."}, {"id":"C006","label":"CONTESTED","camps":"Camp A: DPO provides more stable and simpler alignment than PPO and achieves comparable quality without a separate reward model. Camp B: PPO retains advantages over DPO on complex reasoning tasks where on-policy exploration matters, and DPO's offline nature limits adaptability (multiple RLHF survey papers, 2024-2025)."}, {"id":"C007","label":"CONTESTED","camps":"Camp A: RLHF reward models with diverse training data are robust to most naturally occurring adversarial inputs. Camp B: Standard RLHF reward models are systematically vulnerable to adversarially constructed preference pairs exploiting reward model blind spots (Adversarial Preference Learning, 2025)."}, {"id":"C008","label":"CONTESTED","camps":"Camp A: Length bias is a known but manageable artifact controllable through careful dataset curation. Camp B: Length bias is a fundamental and systematic distortion where reward models conflate verbosity with quality, requiring explicit architectural mitigation (Bias Fitting, 2025)."}, {"id":"C009","label":"CONTESTED","camps":"Camp A: LoRA achieves near-full-fine-tuning quality across most NLP tasks, supporting its use as the default fine-tuning approach. Camp B: LoRA significantly underperforms full fine-tuning on specialized tasks like handwritten text recognition where task-specific weight updates require higher expressive capacity (LoRA vs Fine-Tuning for HTR, 2025)."}, {"id":"C010","label":"CONTESTED","camps":"Camp A: DPO is inherently more stable than PPO because it eliminates online sampling and the separate reward model. 
Camp B: DPO exhibits gradient imbalance and distribution shift that destabilize training at scale, requiring hybrid approaches like RS-DPO (Gradient Imbalance in DPO, 2025; RS-DPO, 2024)."}, {"id":"C011","label":"CONTESTED","camps":"Camp A: QLoRA achieves full fine-tuning quality within noise thresholds on standard benchmarks, making 4-bit quantized fine-tuning production-viable. Camp B: QLoRA's quantization-induced initialization gap causes measurable downstream quality degradation requiring additional correction via LoftQ or GA-LoftQ (LoftQ, 2023; GA-LoftQ, 2025)."}, {"id":"C012","label":"CONTESTED","camps":"Camp A: RAG is more cost-efficient and controllable for knowledge-intensive tasks, consistently outperforming long-context LLMs on multi-document retrieval. Camp B: There is no silver bullet — long-context LLMs match or surpass RAG on many tasks within context window limits, and optimal choice requires task-level routing (LaRA, 2025)."}, {"id":"C013","label":"CONTESTED","camps":"Camp A: Semantic chunking consistently improves retrieval precision by preserving coherent meaning units across most RAG applications. Camp B: Semantic chunking's quality improvements are task-dependent and marginal, often not justifying its computational overhead over well-tuned fixed-size chunking (Is Semantic Chunking Worth the Computational Cost?, 2024)."}, {"id":"C014","label":"CONTESTED","camps":"Camp A: Fixed interpolation weights between BM25 and dense retrieval generalize well with minimal tuning for most hybrid retrieval applications. Camp B: Dynamic per-query alpha tuning is necessary for robust hybrid retrieval across heterogeneous query and document types (DAT, 2025)."}, {"id":"C015","label":"CONTESTED","camps":"Camp A: Large dense retrieval models with re-ranking can handle multi-hop reasoning without explicit graph structure through chain-of-thought generation. 
Camp B: Graph-based RAG is necessary for multi-hop reasoning because flat retrieval cannot encode cross-document entity relationships required for relational inference (Graph RAG Survey, 2024)."}, {"id":"C016","label":"CONTESTED","camps":"Camp A: Standard concatenation-based RAG is sufficient and simpler to implement and maintain in production. Camp B: Cross-attention decoupling reduces redundant computation and improves faithfulness by preventing retrieved content from interfering with query representation (Decoupling Knowledge and Context, 2025)."}, {"id":"C017","label":"CONTESTED","camps":"Camp A: KV cache eviction is sufficient because dropped tokens contribute marginally to model outputs. Camp B: KV cache merging preserves information that eviction permanently destroys, with merging methods showing measurably better long-context accuracy at the same cache budget (Model Tells You Where to Merge, 2024)."}, {"id":"C018","label":"CONTESTED","camps":"Camp A: RLHF models trained on sufficiently diverse human feedback generalize reasonably across cultural and linguistic contexts. Camp B: RLHF reward models trained predominantly on English and Western feedback exhibit systematic cultural bias limiting cross-cultural alignment quality (RLHF Cultural Survey, 2025)."}, {"id":"C019","label":"CONTESTED","camps":"Camp A: Faithfulness and answer relevance metrics (as in Ragas) provide a complete and sufficient evaluation of RAG system quality. Camp B: Complete RAG evaluation requires additional dimensions including retrieval precision, context utilization rate, and temporal relevance that standard metrics do not capture (Evaluation of RAG Survey, 2024)."}, {"id":"C020","label":"CONTESTED","camps":"Camp A: Speculative decoding provides reliable 2-3x speedup across serving scenarios when draft and target models are well-matched. 
Camp B: Speculative decoding speedup degrades significantly in high-batch serving scenarios and on diverse prompt distributions where draft model acceptance rates drop substantially (Speculative Decoding Survey, 2024)."}, {"id":"C021","label":"CONTESTED","camps":"Camp A: Low ranks (r=4 to r=16) generalize well across tasks and model sizes, making manual rank selection straightforward. Camp B: Optimal LoRA rank is highly task- and layer-dependent, requiring adaptive rank assignment methods for peak performance (La-LoRA, AutoAdapt, 2025)."}, {"id":"C022","label":"CONTESTED","camps":"Camp A: Semantic chunking with moderate overlap is the best general-purpose strategy across most document types. Camp B: No single chunking strategy dominates across diverse document types; optimal chunking is document-structure-dependent and requires task-specific configuration (Chunking Comparison, 2025)."}, {"id":"C023","label":"CONTESTED","camps":"Camp A: DPO and its variants provide sufficient alignment guarantees for safety-critical applications while being more tractable than RLHF. Camp B: Online RL methods like PPO-based RLHF remain necessary for robust safety alignment because DPO's offline nature cannot adapt to distribution shift during deployment (alignment theory papers, 2025-2026)."}, {"id":"C024","label":"CONTESTED","camps":"Camp A: KV cache compression methods transfer directly from text-only LLMs to vision-language models with minimal adaptation. Camp B: Vision-language models require joint importance and diversity criteria because visual tokens have fundamentally different importance distributions, causing uniform compression to disproportionately lose visual information (ZipVL, Mixing Importance with Diversity, 2024-2025)."}, {"id":"C025","label":"CONTESTED","camps":"Camp A: Hybrid retrieval consistently outperforms pure dense retrieval across domains due to complementary lexical and semantic matching. 
Camp B: Hybrid retrieval advantages are highly dataset-dependent, with strong domain-specific embeddings often matching hybrid performance on in-domain tasks (Rethinking Hybrid Retrieval, 2025; PIRB, 2024)."}, {"id":"C026","label":"CONTESTED","camps":"Camp A: QLoRA's 4-bit quantization enables large model fine-tuning with negligible quality cost for models exceeding GPU memory limits. Camp B: For models fitting in memory with standard LoRA, QLoRA introduces quantization noise requiring explicit correction, and its use should be restricted to memory-constrained scenarios (QLoRA 2023; LoftQ 2023; GA-LoftQ 2025)."}, {"id":"C027","label":"CONTESTED","camps":"Camp A: Standard small language model draft models achieve sufficient token acceptance rates without specialized alignment training. Camp B: Explicit draft model alignment with chain-of-thought distillation significantly improves acceptance rates beyond what standard draft model training achieves (Direct Alignment of Draft Model, 2024; AdaEAGLE, 2024)."}, {"id":"C028","label":"CONTESTED","camps":"Camp A: Graph RAG's higher computational cost is justified by substantially better multi-hop reasoning quality on complex domain tasks. Camp B: For standard single-hop QA workloads, graph construction and traversal overhead often exceeds quality benefits, making flat retrieval more practical (Graph RAG surveys, 2024)."}, {"id":"C029","label":"CONTESTED","camps":"Camp A: RLHF principles transfer directly to multimodal alignment with appropriate reward signal design for image-text pairs. Camp B: Multimodal RLHF requires substantially different reward modeling than text-only RLHF, and current methods have not solved the cross-modal reward specification problem (Preference Alignment on Diffusion Models Survey, 2025)."}, {"id":"C030","label":"CONTESTED","camps":"Camp A: Targeted eviction with careful importance scoring is sufficient and more practical than wholesale KV cache architecture rethinking. 
Camp B: Targeted eviction methods hit an inherent performance ceiling on long-generation tasks, requiring fundamental rethinking of KV cache compression architecture (Rethinking KV Cache Compression, 2025)."} ]