{"runs":[{"id":1,"domain":"aiml","started_at":"2026-02-26T00:52:14.175562+00:00","finished_at":"2026-02-26T00:58:20.497515+00:00","date_start":"2026-02-19","date_end":"2026-02-26","paper_count":732,"status":"completed"}],"papers":[{"id":2,"run_id":1,"domain":"aiml","arxiv_id":"2602.20739","entry_id":"","title":"PyVision-RL: Forging Open Agentic Vision Models via RL","authors":"[\"Shitian Zhao\", \"Shaoheng Lin\", \"Ming Li\", \"Haoquan Zhang\", \"Wenshuo Peng\", \"Kaipeng Zhang\", \"Chen Wei\"]","abstract":"Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn ","published":"2026-02-24T10:08:33+00:00","categories":"[\"cs.AI\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20739v1","arxiv_url":"http://arxiv.org/abs/2602.20739v1","comment":"preprint","source":"both","github_repo":"https://github.com/agents-x-project/PyVision-RL","github_stars":null,"hf_upvotes":22,"hf_models":"[{\"id\": \"Agents-X/PyVision-Image-7B-SFT\", \"likes\": 0}, {\"id\": \"Agents-X/PyVision-Image-7B-RL\", \"likes\": 0}, {\"id\": \"Agents-X/PyVision-Video-7B-RL\", \"likes\": 0}, {\"id\": \"Agents-X/PyVision-Video-7B-SFT\", \"likes\": 0}]","hf_datasets":"[{\"id\": \"Agents-X/PyVision-Image-SFT-Data\", \"likes\": 0}, {\"id\": \"Agents-X/PyVision-Image-RL-Data\", \"likes\": 0}, {\"id\": \"Agents-X/PyVision-Video-SFT-Data\", \"likes\": 0}]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":8.0,"score_axis_3":9.0,"composite":8.05,"summary":"PyVision-RL applies reinforcement learning to prevent interaction collapse in agentic multimodal models, enabling sustained multi-turn tool use and on-demand visual processing. Introduces unified training for image/video understanding with significantly reduced visual token usage through selective frame sampling.","reasoning":"Strong HF presence with multiple released models on HF. Novel RL-based approach to agentic behavior. High practical value with code available. Addresses real problem of interaction collapse.","code_url":"https://github.com/agents-x-project/PyVision-RL","s2_tldr":"PyVision-RL is introduced, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction and combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use.","s2_paper_id":"e14ec44b244378020736963dc55c1712e6dc75aa","topics":"[\"Agents\", \"RL\", \"Multimodal\"]"},{"id":3,"run_id":1,"domain":"aiml","arxiv_id":"2602.20161","entry_id":"","title":"Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device","authors":"[\"Abdelrahman Shaker\", \"Ahmed Heakl\", \"Jaseel Muhammad\", \"Ritesh Thawkar\", \"Omkar Thawakar\", \"Senmao Li\", \"Hisham Cholakkal\", \"Ian Reid\", \"Eric P. Xing\", \"Salman Khan\"]","abstract":"Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. 
We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment","published":"2026-02-23T18:59:58+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20161v2","arxiv_url":"http://arxiv.org/abs/2602.20161v2","comment":"Project page: https://amshaker.github.io/Mobile-O/","source":"both","github_repo":"https://github.com/Amshaker/Mobile-O","github_stars":null,"hf_upvotes":21,"hf_models":"[{\"id\": \"Amshaker/Mobile-O-0.5B-iOS\", \"likes\": 7}, {\"id\": \"Amshaker/Mobile-O-1.5B\", \"likes\": 6}, {\"id\": \"Amshaker/Mobile-O-0.5B\", \"likes\": 5}]","hf_datasets":"[{\"id\": \"Amshaker/Mobile-O-Post-Train\", \"likes\": 7}, {\"id\": \"Amshaker/Mobile-O-Pre-Train\", \"likes\": 5}, {\"id\": \"Amshaker/Mobile-O-SFT\", \"likes\": 4}]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":8.0,"score_axis_3":9.0,"composite":8.05,"summary":"Mobile-O achieves unified multimodal understanding and generation on mobile devices through its Mobile Conditioning Projector, running in ~3s per image on iPhone. The 0.5B-1.5B parameter models achieve 74% on GenEval and outperform Show-O/JanusFlow by 5-11% while running 6-11x faster, with models and mobile app publicly released.","reasoning":"Code AND weights on HuggingFace (3 model variants), plus GitHub repo and mobile app. Paradigm shift in on-device multimodal AI with strong practical applicability. 21 HF upvotes indicate community interest.","code_url":"https://github.com/Amshaker/Mobile-O","s2_tldr":"Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices, which it hopes will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency.","s2_paper_id":"52a073c8b426203a6fe390bece417b4f0a258772","topics":"[\"Multimodal\", \"Training\"]"},{"id":4,"run_id":1,"domain":"aiml","arxiv_id":"2602.18977","entry_id":"","title":"Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding","authors":"[\"Thinesh Thiyakesan Ponbagavathi\", \"Constantin Seibold\", \"Alina Roitberg\"]","abstract":"Adapting image-pretrained backbones to video typically relies on time-domain adapters tuned to a single temporal scale. Our experiments show that these modules pick up static image cues and very fast flicker changes, while overlooking medium-speed motion. Capturing dynamics across multiple time-scales is, however, crucial for fine-grained temporal analysis (i.e., opening vs. closing bottle). To address this, we introduce Frame2Freq -- a family of frequency-aware adapters that perform spectral ","published":"2026-02-21T23:05:53+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.18977v1","arxiv_url":"http://arxiv.org/abs/2602.18977v1","comment":"Accepted to CVPR 2026 (Main Track)","source":"arxiv","github_repo":"https://github.com/th-nesh/Frame2Freq","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":8.0,"score_axis_3":9.0,"composite":8.05,"summary":"Frame2Freq introduces frequency-aware adapters using FFT along time for image-to-video adaptation of VFMs. 
It learns frequency-band specific embeddings highlighting discriminative ranges, outperforming prior PEFT methods and surpassing fully fine-tuned models on four of five fine-grained activity recognition datasets.","reasoning":"Novel frequency-domain approach to video adaptation with code on GitHub, CVPR 2026. Highly practical for temporal modeling, strong empirical results across datasets.","code_url":"https://github.com/th-nesh/Frame2Freq","s2_tldr":"Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior PEFT methods and even surpasses fully fine-tuned models on four of them, providing encouraging evidence that frequency analysis methods are a powerful tool for modeling temporal dynamics in image-to-video transfer.","s2_paper_id":"90d23c22ebd63bc8ef75186b176dd6fe1d7aaceb","topics":"[]"},{"id":6,"run_id":1,"domain":"aiml","arxiv_id":"2602.21186","entry_id":"","title":"Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning","authors":"[\"Haoyi Jiang\", \"Liu Liu\", \"Xinjie Wang\", \"Yonghao He\", \"Wei Sui\", \"Zhizhong Su\", \"Wenyu Liu\", \"Xinggang Wang\"]","abstract":"While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space--a cornerstone of spatial intelligence--remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstruc","published":"2026-02-24T18:37:34+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21186v1","arxiv_url":"http://arxiv.org/abs/2602.21186v1","comment":"","source":"arxiv","github_repo":"https://github.com/hustvl/Spa3R","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":8.0,"score_axis_3":8.0,"composite":7.7,"summary":"Spa3R learns unified, view-invariant spatial representations from unposed multi-view images via Predictive Spatial Field Modeling (PSFM). When integrated into VLMs as Spa3-VLM, achieves 58.6% accuracy on 3D VQA tasks, demonstrating that spatial intelligence can emerge from 2D vision alone without explicit 3D supervision.","reasoning":"Novel PSFM paradigm and strong empirical results. Code available with good practical applicability for 3D reasoning from 2D images.","code_url":"https://github.com/hustvl/Spa3R","s2_tldr":"It is argued that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning, to introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images.","s2_paper_id":"6445fbf0d634310c2cd41a5b569707baccd1e1a6","topics":"[\"3D / Vision\", \"Reasoning\", \"Language Models\"]"},{"id":7,"run_id":1,"domain":"aiml","arxiv_id":"2602.21015","entry_id":"","title":"From Perception to Action: An Interactive Benchmark for Vision Reasoning","authors":"[\"Yuhao Wu\", \"Maojia Song\", \"Yihuai Lan\", \"Lei Wang\", \"Zhiqiang Hu\", \"Yao Xiao\", \"Heng Zhou\", \"Weihua Zheng\", \"Dylan Raharja\", \"Soujanya Poria\"]","abstract":"Understanding the physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. 
Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy","published":"2026-02-24T15:33:02+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21015v1","arxiv_url":"http://arxiv.org/abs/2602.21015v1","comment":"Work in processing. Website: https://social-ai-studio.github.io/CHAIN/","source":"both","github_repo":"https://github.com/Social-AI-Studio/CHAIN","github_stars":null,"hf_upvotes":20,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":8.0,"score_axis_3":8.0,"composite":7.7,"summary":"CHAIN is an interactive 3D physics-driven benchmark evaluating VLMs' ability to understand physical structure and execute structured action sequences. Shifts evaluation from passive perception to active problem-solving with tasks like mechanical puzzles and 3D stacking. Reveals top VLMs struggle with physical constraints and long-horizon planning.","reasoning":"Novel interactive benchmark addressing critical gap in VLM evaluation. Good HF interest, code/demo available. Practical for embodied AI development and evaluation.","code_url":"https://github.com/Social-AI-Studio/CHAIN","s2_tldr":"This work conducts a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings, showing that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions.","s2_paper_id":"8492422c8919925da6373a4de6abc149a6c832ab","topics":"[\"Benchmark\", \"Reasoning\", \"Language Models\"]"},{"id":8,"run_id":1,"domain":"aiml","arxiv_id":"2602.20903","entry_id":"","title":"TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering","authors":"[\"Hanshen Zhu\", \"Yuliang Liu\", \"Xuecheng Wu\", \"An-Lan Wang\", \"Hao Feng\", \"Dingkang Yang\", \"Chao Feng\", \"Can Huang\", \"Jingqun Tang\", \"Xiang Bai\"]","abstract":"Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. 
As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still strug","published":"2026-02-24T13:40:23+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20903v1","arxiv_url":"http://arxiv.org/abs/2602.20903v1","comment":"Code: https://github.com/CIawevy/TextPecker","source":"both","github_repo":"https://github.com/CIawevy/TextPecker","github_stars":null,"hf_upvotes":0,"hf_models":"[{\"id\": \"CIawevy/TextPecker-8B-InternVL3\", \"likes\": 1}, {\"id\": \"CIawevy/TextPecker-8B-Qwen3VL\", \"likes\": 1}, {\"id\": \"CIawevy/QwenImage-TextPecker-SQPA\", \"likes\": 1}, {\"id\": \"CIawevy/SD3.5M-TextPecker-SQPA\", \"likes\": 0}, {\"id\": \"CIawevy/Flux.1-dev-TextPecker-SQPA\", \"likes\": 0}]","hf_datasets":"[{\"id\": \"CIawevy/TextPecker-1.5M\", \"likes\": 0}]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":9.0,"composite":7.7,"summary":"TextPecker addresses structural anomalies in visual text rendering through novel RL strategy with stroke-level anomaly perception. Constructs recognition dataset with character-level structural annotations and stroke-editing synthesis engine. Achieves 4% structural fidelity gain and 8.7% semantic alignment improvement on Qwen-Image for Chinese text, with code and models on HF/GitHub.","reasoning":"High practical value addressing critical VTR problem with code/weights available. Novel stroke-level anomaly perception and RL approach. Strong performance improvements on SOTA models.","code_url":"https://github.com/CIawevy/TextPecker","s2_tldr":"TextPecker is proposed, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any textto-image generator, providing a foundational step towards reliable and structural faithful visual text generation.","s2_paper_id":"dd6577c25fc0ad788ccbb1069b5f22f59987adff","topics":"[\"RL\", \"Image Generation\", \"Training\"]"},{"id":10,"run_id":1,"domain":"aiml","arxiv_id":"2602.20160","entry_id":"","title":"tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction","authors":"[\"Chen Wang\", \"Hao Tan\", \"Wang Yifan\", \"Zhiqin Chen\", \"Yuheng Liu\", \"Kalyan Sunkavalli\", \"Sai Bi\", \"Lingjie Liu\", \"Yiwei Hu\"]","abstract":"We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model's capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream ap","published":"2026-02-23T18:59:45+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20160v1","arxiv_url":"http://arxiv.org/abs/2602.20160v1","comment":"Accepted by CVPR 2026. Project Page: https://cwchenwang.github.io/tttLRM","source":"both","github_repo":"https://github.com/cwchenwang/tttLRM","github_stars":null,"hf_upvotes":4,"hf_models":"[{\"id\": \"chenwang/tttLRM\", \"likes\": 0}]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":8.0,"score_axis_3":8.0,"composite":7.7,"summary":"tttLRM introduces Test-Time Training layers for long-context autoregressive 3D reconstruction with linear complexity, supporting progressive refinement from streaming observations. 
The method compresses observations into fast weights forming implicit 3D representations decodable to Gaussian Splats, with pretrained model released on HuggingFace.","reasoning":"Code on GitHub and model on HuggingFace (4 upvotes). Novel TTT application to 3D reconstruction with strong architectural innovation and practical streaming capability.","code_url":"https://github.com/cwchenwang/tttLRM","s2_tldr":null,"s2_paper_id":"aed85330b57feee38b7d7d2710ffda0ca350eeef","topics":"[\"3D / Vision\", \"Efficiency\"]"},{"id":11,"run_id":1,"domain":"aiml","arxiv_id":"2602.20089","entry_id":"","title":"StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues","authors":"[\"Zanxi Ruan\", \"Qiuyu Kong\", \"Songqun Gao\", \"Yiming Wang\", \"Marco Cristani\"]","abstract":"Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them ","published":"2026-02-23T17:57:37+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20089v1","arxiv_url":"http://arxiv.org/abs/2602.20089v1","comment":"Accepted by CVPR 2026","source":"arxiv","github_repo":"https://github.com/intelligolabs/StructXLIP","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":8.0,"score_axis_3":8.0,"composite":7.7,"summary":"StructXLIP enhances vision-language models by aligning structural cues (edge maps) across modalities during fine-tuning. It introduces structure-centric losses that improve cross-modal retrieval by maximizing mutual information between multimodal structural representations, achieving state-of-the-art results on general and specialized domains.","reasoning":"Code and pretrained models publicly available on GitHub; novel approach to vision-language alignment using structural representations; strong practical applicability for cross-modal retrieval tasks.","code_url":"https://github.com/intelligolabs/StructXLIP","s2_tldr":"This work introduces StructXLIP, a fine-tuning alignment paradigm that extracts edge maps, treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them\"structure-centric\".","s2_paper_id":"56854cee9fe21b833913ff76d3b25e6e6397d2e9","topics":"[\"Language Models\", \"Multimodal\", \"Retrieval / RAG\"]"},{"id":12,"run_id":1,"domain":"aiml","arxiv_id":"2602.19870","entry_id":"","title":"ApET: Approximation-Error Guided Token Compression for Efficient VLMs","authors":"[\"Qiankun Ma\", \"Ziyao Zhang\", \"Haofei Wang\", \"Jie Chen\", \"Zhen Song\", \"Hairong Zheng\"]","abstract":"Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically relies on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. 
Despite promising results, such solutions are prone to introduce positional bias and, more critically, are incompatible with efficient attention kernels such ","published":"2026-02-23T14:15:37+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19870v1","arxiv_url":"http://arxiv.org/abs/2602.19870v1","comment":"CVPR2026","source":"arxiv","github_repo":"https://github.com/MaQianKun0/ApET","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":9.0,"composite":7.7,"summary":"ApET proposes attention-free visual token compression for VLMs using approximation error to identify redundant tokens. The method retains 95.2% performance on image tasks while compressing tokens by 88.9%, seamlessly integrating with FlashAttention for practical VLM acceleration.","reasoning":"Code available on GitHub; novel information-theoretic approach to token compression; excellent practical applicability for efficient VLM deployment.","code_url":"https://github.com/MaQianKun0/ApET","s2_tldr":"ApET, an Approximation-Error guided Token compression framework, which first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens.","s2_paper_id":"10edff9c9044cccef655de3d0656adab8d0501f5","topics":"[\"Efficiency\", \"Language Models\", \"Multimodal\"]"},{"id":13,"run_id":1,"domain":"aiml","arxiv_id":"2602.19161","entry_id":"","title":"Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation","authors":"[\"Lunjie Zhu\", \"Yushi Huang\", \"Xingtong Ge\", \"Yufei Xue\", \"Zhening Liu\", \"Yumeng Zhang\", \"Zehong Lin\", \"Jun Zhang\"]","abstract":"Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to eff","published":"2026-02-22T12:43:50+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19161v1","arxiv_url":"http://arxiv.org/abs/2602.19161v1","comment":"Code will be released at https://github.com/Aoko955/Flash-VAED","source":"arxiv","github_repo":"https://github.com/Aoko955/Flash-VAED","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":9.0,"composite":7.7,"summary":"Flash-VAED accelerates VAE decoders for video generation through independence-aware channel pruning and stage-wise operator optimization. Achieves ~6x decoder speedup with 96.9% reconstruction performance, accelerating end-to-end generation by 36% with minimal quality loss.","reasoning":"High practical value addressing inference bottleneck, code available, strong efficiency gains. 
Universal acceleration framework applicable to multiple VAE decoders.","code_url":"https://github.com/Aoko955/Flash-VAED","s2_tldr":"This work proposes a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution, and designs a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED.","s2_paper_id":"a61a1d7ae585d2678ab9c1956b18ba2cad21a013","topics":"[\"Video Generation\", \"Efficiency\", \"Image Generation\"]"},{"id":15,"run_id":1,"domain":"aiml","arxiv_id":"2602.21202","entry_id":"","title":"Multi-Vector Index Compression in Any Modality","authors":"[\"Hanxiang Qin\", \"Alexander Martin\", \"Rohan Jha\", \"Chunsheng Zuo\", \"Reno Kriz\", \"Benjamin Van Durme\"]","abstract":"We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce","published":"2026-02-24T18:57:33+00:00","categories":"[\"cs.IR\", \"cs.CL\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21202v1","arxiv_url":"http://arxiv.org/abs/2602.21202v1","comment":"12 pages, 4 figures","source":"both","github_repo":"https://github.com/hanxiangqin/omni-col-press","github_stars":null,"hf_upvotes":18,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"Introduces query-agnostic compression methods for multi-vector late interaction retrieval across text, images, and video. The novel attention-guided clustering (AGC) approach consistently outperforms parametric compression methods while providing flexible index sizing, addressing the linear cost scaling problem of late interaction models.","reasoning":"Solid novelty with AGC method and practical impact on retrieval efficiency. Code available on GitHub, good HF community interest, applicable across modalities.","code_url":"https://github.com/hanxiangqin/omni-col-press","s2_tldr":"This work introduces four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC), which shows that attention-guided clustering consistently outperforms other parameterized compression methods, provides greater flexibility in index size, and achieves competitive or improved performance compared to a full, uncompressed index.","s2_paper_id":"5267c2b437e21dccbc84308b15d18b9a6b57f998","topics":"[\"Efficiency\", \"Retrieval / RAG\"]"},{"id":16,"run_id":1,"domain":"aiml","arxiv_id":"2602.21053","entry_id":"","title":"OCR-Agent: Agentic OCR with Capability and Memory Reflection","authors":"[\"Shimin Wen\", \"Zeyu Zhang\", \"Xingdou Bian\", \"Hongjie Zhu\", \"Lulu He\", \"Layi Shama\", \"Daji Ergu\", \"Ying Cai\"]","abstract":"Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods.However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. 
Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality.To address this issue, we propose a novel ","published":"2026-02-24T16:10:27+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21053v1","arxiv_url":"http://arxiv.org/abs/2602.21053v1","comment":"","source":"both","github_repo":"https://github.com/AIGeeksGroup/OCR-Agent","github_stars":null,"hf_upvotes":1,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"OCR-Agent introduces iterative self-correction framework with Capability Reflection and Memory Reflection for VLMs. Model diagnoses errors, generates correction plans, reviews past attempts to avoid repetition, then re-reasons. Outperforms InternVL3-8B by +2.0 on English OCRBench v2, achieving SOTA on Visual Understanding (79.9) and Reasoning (66.5).","reasoning":"Strong practical results with code available. Novel reflection mechanisms improve VLM reasoning without additional training. Good HF presence and clear applicability.","code_url":"https://github.com/AIGeeksGroup/OCR-Agent","s2_tldr":"This paper proposes a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection, and demonstrates that structured, self-aware reflection can significantly enhance VLMs'reasoning robustness without additional training.","s2_paper_id":"ab722ae1a1c820f463aabdb76f5ba147cf1cfc90","topics":"[\"Agents\", \"Language Models\", \"Multimodal\"]"},{"id":17,"run_id":1,"domain":"aiml","arxiv_id":"2602.21042","entry_id":"","title":"OmniOCR: Generalist OCR for Ethnic Minority Languages","authors":"[\"Bonan Liu\", \"Zeyu Zhang\", \"Bingbing Meng\", \"Han Wang\", \"Hanshuo Zhang\", \"Chengping Wang\", \"Daji Ergu\", \"Ying Cai\"]","abstract":"Optical character recognition (OCR) has advanced rapidly with deep learning and multimodal models, yet most methods focus on well-resourced scripts such as Latin and Chinese. Ethnic minority languages remain underexplored due to complex writing systems, scarce annotations, and diverse historical and modern forms, making generalization in low-resource or zero-shot settings challenging. To address these challenges, we present OmniOCR, a universal framework for ethnic minority scripts. OmniOCR intr","published":"2026-02-24T16:02:49+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21042v1","arxiv_url":"http://arxiv.org/abs/2602.21042v1","comment":"","source":"both","github_repo":"https://github.com/AIGeeksGroup/OmniOCR","github_stars":null,"hf_upvotes":1,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"OmniOCR is a universal framework for ethnic minority script OCR using Dynamic Low-Rank Adaptation that allocates model capacity across layers and scripts. Sparsity regularization ensures compact adaptation without inference overhead. Achieves 39%-66% accuracy improvement over SOTA on TibetanMNIST, Shui, ancient Yi, and Dongba scripts.","reasoning":"Addresses important low-resource language problem with novel Dynamic LoRA approach. 
Strong empirical results, code available, practical for underserved communities.","code_url":"https://github.com/AIGeeksGroup/OmniOCR","s2_tldr":"OmniOCR introduces Dynamic Low-Rank Adaptation (Dynamic LoRA) to allocate model capacity across layers and scripts, enabling effective adaptation while preserving knowledge and a sparsity regularization prunes redundant updates, ensuring compact and efficient adaptation without extra inference cost.","s2_paper_id":"4202a21511724df16462b4df13de4d79566521d8","topics":"[\"Multimodal\"]"},{"id":19,"run_id":1,"domain":"aiml","arxiv_id":"2602.20951","entry_id":"","title":"See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis","authors":"[\"Jaehyun Park\", \"Minyoung Ahn\", \"Minkyu Kim\", \"Jonghyun Lee\", \"Jae-Gil Lee\", \"Dongmin Park\"]","abstract":"Despite recent advances in diffusion models, AI generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a highly crucial area of study. Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach","published":"2026-02-24T14:34:13+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20951v1","arxiv_url":"http://arxiv.org/abs/2602.20951v1","comment":"","source":"both","github_repo":"https://github.com/krafton-ai/ArtiAgent","github_stars":null,"hf_upvotes":11,"hf_models":"[]","hf_datasets":"[{\"id\": \"KRAFTON/ArtiBench\", \"likes\": 3}]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"ArtiAgent automatically generates 100K artifact-annotated image pairs through agentic framework with perception, synthesis, and curation agents. Uses novel patch-wise embedding manipulation in diffusion transformers for artifact injection. Demonstrated efficacy across artifact detection, VLM/diffusion model training, and restoration tasks with open code release.","reasoning":"Strong practical value with large dataset, code on GitHub (11 HF upvotes, both sources), and novel agentic synthesis approach. High applicability for training artifact-aware models.","code_url":"https://github.com/krafton-ai/ArtiAgent","s2_tldr":"This paper proposes ArtiAgent, which efficiently creates pairs of real and artifact-injected images that synthesize 100K images with rich artifact annotations and demonstrates both efficacy and versatility across diverse applications.","s2_paper_id":"75aa377c8e9178b95900fa26bcf4fd63f283a9e9","topics":"[\"Agents\", \"Benchmark\"]"},{"id":20,"run_id":1,"domain":"aiml","arxiv_id":"2602.20913","entry_id":"","title":"LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding","authors":"[\"Jihao Qiu\", \"Lingxi Xie\", \"Xinyue Huo\", \"Qi Tian\", \"Qixiang Ye\"]","abstract":"This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. 
During inference, the age","published":"2026-02-24T13:49:47+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20913v1","arxiv_url":"http://arxiv.org/abs/2602.20913v1","comment":"17 pages, 9 figures, 8 tables, accepted to CVPR 2026","source":"arxiv","github_repo":"https://github.com/qiujihao19/LongVideo-R1","github_stars":null,"hf_upvotes":0,"hf_models":"[{\"id\": \"ChurchillQAQ/LongVideo-R1-Qwen2.5\", \"likes\": 0}]","hf_datasets":"[{\"id\": \"ChurchillQAQ/LongVideo-R1-Data\", \"likes\": 0}]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"LongVideo-R1 introduces active reasoning-equipped MLLM agent for efficient long video understanding with low compute budget. Uses hierarchical navigation with high-level visual cues, trained via SFT+RL on 33K GPT-5-generated trajectories. Achieves superior QA accuracy-efficiency tradeoff by selectively processing relevant clips, with code and data publicly available.","reasoning":"Novel reasoning-based navigation approach with code/weights on GitHub and HF model. Strong practical value for long video understanding efficiency. SFT+RL training paradigm well-executed.","code_url":"https://github.com/qiujihao19/LongVideo-R1","s2_tldr":"The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation.","s2_paper_id":"7f9b85329fa3743968ac02255ff96d194274e34f","topics":"[\"Robotics\", \"Language Models\", \"Multimodal\"]"},{"id":21,"run_id":1,"domain":"aiml","arxiv_id":"2602.20685","entry_id":"","title":"RAYNOVA: 3D-Geometry-Free Auto-Regressive Driving World Modeling with Unified Spatio-Temporal Representation","authors":"[\"Yichen Xie\", \"Chensheng Peng\", \"Mazen Abdelfattah\", \"Yihan Hu\", \"Jiezhi Yang\", \"Eric Higgins\", \"Ryan Brigden\", \"Masayoshi Tomizuka\", \"Wei Zhan\"]","abstract":"World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-free world model that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong ","published":"2026-02-24T08:41:40+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20685v1","arxiv_url":"http://arxiv.org/abs/2602.20685v1","comment":"Accepted by CVPR 2026; Project website: http://yichen928.github.io/raynova","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"RAYNOVA proposes geometry-free 4D world modeling using dual-causal autoregressive framework with unified spatio-temporal reasoning. Achieves SOTA multi-view video generation on nuScenes with 3D-free representation based on Pl\u00fccker-ray encoding, enabling generalization across camera setups.","reasoning":"No code/weights currently (website suggests future release). Novel unified 4D representation without 3D priors. 
High practical value for autonomous driving simulation.","code_url":"http://yichen928.github.io/raynova","s2_tldr":"RAYNOVA is proposed, a geometry-free world model that employs a dual-causal autoregressive framework that follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning.","s2_paper_id":"083af0159a6fbfd3202bc37fe70772727e6bc184","topics":"[\"World Models\", \"3D / Vision\", \"Language Models\"]"},{"id":23,"run_id":1,"domain":"aiml","arxiv_id":"2602.20417","entry_id":"","title":"gQIR: Generative Quanta Image Reconstruction","authors":"[\"Aryan Garg\", \"Sizhuo Ma\", \"Mohit Gupta\"]","abstract":"Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw \\emph{quanta frames} contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standar","published":"2026-02-23T23:33:00+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20417v1","arxiv_url":"http://arxiv.org/abs/2602.20417v1","comment":"CVPR 2026","source":"arxiv","github_repo":"https://github.com/Aryan-Garg/gQIR","github_stars":null,"hf_upvotes":0,"hf_models":"[{\"id\": \"aRy4n/gQIR\", \"likes\": 0}]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":8.0,"score_axis_3":7.0,"composite":7.35,"summary":"gQIR adapts large text-to-image latent diffusion models to photon-limited quanta imaging by handling Bernoulli photon statistics. Processes SPAD burst frames for denoising, alignment, and demosaicing, producing high-quality reconstructions even under extreme photon sparsity and high-speed motion.","reasoning":"Code and HF model available (CVPR 2026 accepted). Highly novel adaptation of generative priors to extreme sensing regime with first color SPAD dataset. Strong practical potential for computational imaging applications.","code_url":"https://github.com/Aryan-Garg/gQIR","s2_tldr":"This work presents an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging, and substantially improves perceptual quality over classical and modern learning-based baselines.","s2_paper_id":"18964d133984ef6fa28f3dbd892c7c58ecdd2b6b","topics":"[\"Training\"]"},{"id":27,"run_id":1,"domain":"aiml","arxiv_id":"2602.19994","entry_id":"","title":"RADE-Net: Robust Attention Network for Radar-Only Object Detection in Adverse Weather","authors":"[\"Christof Leitgeb\", \"Thomas Puchleitner\", \"Max Peter Ronecker\", \"Daniel Watzenig\"]","abstract":"Automotive perception systems are obligated to meet high requirements. While optical sensors such as Camera and Lidar struggle in adverse weather conditions, Radar provides a more robust perception performance, effectively penetrating fog, rain, and snow. Since full Radar tensors have large data sizes and very few datasets provide them, most Radar-based approaches work with sparse point clouds or 2D projections, which can result in information loss. 
Additionally, deep learning methods show poten","published":"2026-02-23T16:01:31+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19994v1","arxiv_url":"http://arxiv.org/abs/2602.19994v1","comment":"Accepted to 2026 IEEE Intelligent Vehicles Symposium (IV)","source":"arxiv","github_repo":"https://github.com/chr-is-tof/RADE-Net","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"RADE-Net proposes a 3D projection method for 4D Radar tensors that reduces data size by 91.9% while preserving rich Doppler and Elevation features. The lightweight model with spatial and channel-attention outperforms Radar-only baselines by 16.7% and surpasses Lidar in adverse weather.","reasoning":"Code available on GitHub; novel Radar tensor projection approach; strong practical applicability for autonomous driving in adverse weather.","code_url":"https://github.com/chr-is-tof/RADE-Net","s2_tldr":"RADE-Net is introduced, a lightweight model tailored to 3D projections of the RADE tensor, which outperform several Lidar approaches in scenarios with adverse weather conditions and achieves a 16.7% improvement compared to their baseline, as well as 6.5% improvement over current Radar-only models.","s2_paper_id":"1a8a7f34e732e53dc2797ae3bae77c6dd2c762c6","topics":"[\"3D / Vision\", \"Benchmark\"]"},{"id":30,"run_id":1,"domain":"aiml","arxiv_id":"2602.19611","entry_id":"","title":"RAID: Retrieval-Augmented Anomaly Detection","authors":"[\"Mingxiu Cai\", \"Zhe Zhang\", \"Gaochang Wu\", \"Tianyou Chai\", \"Xiatian Zhu\"]","abstract":"Unsupervised Anomaly Detection (UAD) aims to identify abnormal regions by establishing correspondences between test images and normal templates. Existing methods primarily rely on image reconstruction or template retrieval but face a fundamental challenge: matching between test images and normal templates inevitably introduces noise due to intra-class variations, imperfect correspondences, and limited templates. Observing that Retrieval-Augmented Generation (RAG) leverages retrieved samples dire","published":"2026-02-23T08:54:27+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19611v1","arxiv_url":"http://arxiv.org/abs/2602.19611v1","comment":"","source":"arxiv","github_repo":"https://github.com/Mingxiu-Cai/RAID","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"RAID reinterprets unsupervised anomaly detection through the lens of Retrieval-Augmented Generation, using hierarchical retrieval and a guided MoE network to suppress matching noise. It achieves SOTA on MVTec, VisA, MPDD, and BTAD across full-shot, few-shot, and multi-dataset settings, offering a practical noise-resilient framework for industrial anomaly detection.","reasoning":"Code available on GitHub. Novel RAG-based perspective on UAD is interesting, but the domain (anomaly detection) sees incremental SOTA improvements often. 
Practical for industrial inspection.","code_url":"https://github.com/Mingxiu-Cai/RAID","s2_tldr":"RAID is introduced, a retrieval-augmented UAD framework designed for noise-resilient anomaly detection and localization that achieves state-of-the-art performance across full-shot, few-shot, and multi-dataset settings on MVTec, VisA, MPDD, and BTAD benchmarks.","s2_paper_id":"9476edc16927d7a634607204dfa43583bed8f656","topics":"[\"Retrieval / RAG\"]"},{"id":31,"run_id":1,"domain":"aiml","arxiv_id":"2602.19517","entry_id":"","title":"Classroom Final Exam: An Instructor-Tested Reasoning Benchmark","authors":"[\"Chongyang Gao\", \"Diji Yang\", \"Shuyan Zhou\", \"Xichen Yan\", \"Luchuan Song\", \"Shuo Li\", \"Kezhen Chen\"]","abstract":"We introduce \\CFE{} (\\textbf{C}lassroom \\textbf{F}inal \\textbf{E}xam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. \\CFE{} is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. \\CFE{} presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69\\%, w","published":"2026-02-23T05:17:41+00:00","categories":"[\"cs.AI\", \"cs.CE\", \"cs.CL\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19517v1","arxiv_url":"http://arxiv.org/abs/2602.19517v1","comment":"","source":"arxiv","github_repo":"https://github.com/Analogy-AI/CFE_Bench","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"Classroom Final Exam (CFE) is a multimodal benchmark of 20+ STEM domains curated from authentic university exams with instructor-provided solutions. Frontier models achieve ~60% accuracy; diagnostic analysis reveals suboptimal step efficiency and error accumulation in multi-step reasoning.","reasoning":"Code on GitHub. Novel authentic benchmark for STEM reasoning is valuable for practitioners. No weights, but benchmark dataset is practical.","code_url":"https://github.com/Analogy-AI/CFE_Bench","s2_tldr":"It is found that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions.","s2_paper_id":"d2a3696f6f7375eb467a10f3af949d3fd2b080b4","topics":"[\"Reasoning\", \"Benchmark\", \"Language Models\"]"},{"id":33,"run_id":1,"domain":"aiml","arxiv_id":"2602.20205","entry_id":"","title":"OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport","authors":"[\"Xiwen Chen\", \"Wenhui Zhu\", \"Gen Li\", \"Xuanzhao Dong\", \"Yujian Xiong\", \"Hao Wang\", \"Peijie Qiu\", \"Qingquan Song\", \"Zhipeng Wang\", \"Shao Tang\"]","abstract":"Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, while existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). 
By minimizing the 2-Wasserstein distance between th","published":"2026-02-22T21:02:47+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20205v1","arxiv_url":"http://arxiv.org/abs/2602.20205v1","comment":"Accepted by CVPR2026 (Findings). arXiv admin note: text overlap with arXiv:2503.02175 by other authors","source":"arxiv","github_repo":"https://github.com/xiwenc1/OTPrune","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"OTPrune formulates visual token pruning as optimal transport problem, minimizing 2-Wasserstein distance between full and pruned distributions. Provides theoretically-grounded submodular optimization with provable monotonicity/submodularity properties, achieving superior performance-efficiency tradeoffs on MLLMs.","reasoning":"Accepted at CVPR 2026 with code on GitHub. Strong theoretical foundation and practical efficiency gains for accelerating MLLM inference.","code_url":"https://github.com/xiwenc1/OTPrune","s2_tldr":"This work proposes OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT), and derives a tractable submodular objective that enables efficient optimization, and theoretically proves its monotonicity and submodularity.","s2_paper_id":"edd8ab5b9b401d86b87c13c296173f988466575f","topics":"[\"Efficiency\", \"Language Models\", \"Reasoning\"]"},{"id":34,"run_id":1,"domain":"aiml","arxiv_id":"2602.19316","entry_id":"","title":"Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition","authors":"[\"Alexandros Haliassos\", \"Rodrigo Mira\", \"Stavros Petridis\"]","abstract":"Unified Speech Recognition (USR) has emerged as a semi-supervised framework for training a single model for audio, visual, and audiovisual speech recognition, achieving state-of-the-art results on in-distribution benchmarks. However, its reliance on autoregressive pseudo-labelling makes training expensive, while its decoupled supervision of CTC and attention branches increases susceptibility to self-reinforcing errors, particularly under distribution shifts involving longer sequences, noise, or ","published":"2026-02-22T19:38:21+00:00","categories":"[\"cs.CV\", \"cs.SD\"]","pdf_url":"https://arxiv.org/pdf/2602.19316v1","arxiv_url":"http://arxiv.org/abs/2602.19316v1","comment":"ICLR 2026. Code: https://github.com/ahaliassos/usr2","source":"arxiv","github_repo":"https://github.com/ahaliassos/usr2","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"USR 2.0 improves Unified Speech Recognition through CTC-driven teacher forcing for efficient pseudo-labelling, replacing expensive autoregressive beam search with single-pass generation. Halves training time, improves out-of-distribution robustness, and achieves SOTA on LRS3, LRS2, and WildVSR benchmarks.","reasoning":"Accepted at ICLR 2026 with code on GitHub. 
Significant efficiency improvements and strong empirical results make this highly practical for speech recognition.","code_url":"https://github.com/ahaliassos/usr2","s2_tldr":"CTC-driven teacher forcing is proposed, where greedily decoded CTC pseudo-labels are fed into the decoder to generate attention targets in a single forward pass, which halves training time, improves robustness to out-of-distribution inputs, and achieves state-of-the-art results on LRS3, LRS2, and WildVSR, surpassing USR and modality-specific self-supervised baselines.","s2_paper_id":"510450b66d62fc575a81ba68dfa8e574b8c56d24","topics":"[\"Speech / Audio\", \"Benchmark\"]"},{"id":37,"run_id":1,"domain":"aiml","arxiv_id":"2602.19089","entry_id":"","title":"Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling","authors":"[\"Qi Sun\", \"Can Wang\", \"Jiaxiang Shang\", \"Yingchun Liu\", \"Jing Liao\"]","abstract":"Current 3D human animation methods struggle to achieve photorealism: kinematics-based approaches lack non-rigid dynamics (e.g., clothing dynamics), while methods that leverage video diffusion priors can synthesize non-rigid motion but suffer from quality artifacts and identity loss. To overcome these limitations, we present Ani3DHuman, a framework that marries kinematics-based animation with video diffusion priors. We first introduce a layered motion representation that disentangles rigid motion","published":"2026-02-22T08:07:28+00:00","categories":"[\"cs.CV\", \"cs.GR\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19089v1","arxiv_url":"http://arxiv.org/abs/2602.19089v1","comment":"CVPR 2026","source":"both","github_repo":"https://github.com/qiisun/ani3dhuman","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"Ani3DHuman combines kinematics-based animation with video diffusion priors for photorealistic 3D human animation. It introduces layered motion representation separating rigid from non-rigid motion and novel self-guided stochastic sampling to overcome out-of-distribution problems in diffusion restoration.","reasoning":"Novel architecture combining kinematics with diffusion, code available on GitHub. Practical for animation/graphics with clear improvements over existing methods.","code_url":"https://github.com/qiisun/ani3dhuman","s2_tldr":"A novel self-guided stochastic sampling method is proposed, which effectively addresses the out-of-distribution problem by combining stochastic sampling (for photorealistic quality) with self-guidance (for identity fidelity).","s2_paper_id":"f8196fe597f63d1a49043bc0a39eeec6ead76a28","topics":"[\"3D / Vision\", \"Video Generation\"]"},{"id":40,"run_id":1,"domain":"aiml","arxiv_id":"2602.19053","entry_id":"","title":"TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation","authors":"[\"Qingwen Zhang\", \"Chenhan Jiang\", \"Xiaomeng Zhu\", \"Yunqi Miao\", \"Yushan Zhang\", \"Olov Andersson\", \"Patric Jensfelt\"]","abstract":"Self-supervised feed-forward methods for scene flow estimation offer real-time efficiency, but their supervision from two-frame point correspondences is unreliable and often breaks down under occlusions. 
Multi-frame supervision has the potential to provide more stable guidance by incorporating motion cues from past frames, yet naive extensions of two-frame objectives are ineffective because point correspondences vary abruptly across frames, producing inconsistent signals. In the paper, we presen","published":"2026-02-22T05:50:16+00:00","categories":"[\"cs.CV\", \"cs.RO\"]","pdf_url":"https://arxiv.org/pdf/2602.19053v1","arxiv_url":"http://arxiv.org/abs/2602.19053v1","comment":"CVPR 2026; 15 pages, 8 figures","source":"arxiv","github_repo":"https://github.com/KTH-RPL/OpenSceneFlow","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"TeFlow enables multi-frame supervision for self-supervised feed-forward scene flow estimation through temporal ensembling. It mines temporally consistent motion cues across frames, achieving 33% improvement over baselines on Argoverse 2/nuScenes while matching optimization-based methods but 150x faster.","reasoning":"Novel temporal consistency approach with open code and weights on GitHub. Highly practical for robotics/AV with significant speed and accuracy gains.","code_url":"https://github.com/KTH-RPL/OpenSceneFlow","s2_tldr":"TeFlow introduces a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames, establishing a new state-of-the-art for self-supervised feed-forward methods.","s2_paper_id":"ab33c920f0e4769707ca7a3e0db296e617922ab5","topics":"[]"},{"id":41,"run_id":1,"domain":"aiml","arxiv_id":"2602.18996","entry_id":"","title":"Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction","authors":"[\"Shannan Yan\", \"Leqi Zheng\", \"Keyu Lv\", \"Jingchen Ni\", \"Hongyang Wei\", \"Jiajun Zhang\", \"Guangting Wang\", \"Jing Lyu\", \"Chun Yuan\", \"Fengyun Rao\"]","abstract":"We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle","published":"2026-02-22T00:53:03+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.18996v1","arxiv_url":"http://arxiv.org/abs/2602.18996v1","comment":"The paper has been accepted to CVPR 2026","source":"both","github_repo":"https://github.com/shannany0606/CCMP","github_stars":null,"hf_upvotes":14,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"Establishes object-level correspondence across ego/exo views using conditional binary segmentation with cycle-consistency training. The framework encodes query masks to guide localization, with bidirectional reconstruction providing self-supervision and enabling test-time training, achieving SOTA on Ego-Exo4D and HANDAL-X.","reasoning":"Novel cycle-consistency approach with code on GitHub, CVPR 2026 accepted. 
Highly practical for ego-exo video understanding with strong TTT results.","code_url":"https://github.com/shannany0606/CCMP","s2_tldr":"A simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video, to encourage robust, view-invariant representations.","s2_paper_id":"2d34e01961d223b897ce4956fada8a0e326a0a59","topics":"[\"3D / Vision\"]"},{"id":46,"run_id":1,"domain":"aiml","arxiv_id":"2602.19626","entry_id":"","title":"Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding","authors":"[\"Roberto Tacconelli\"]","abstract":"We present Nacrith, a lossless compression system that combines a 135M-parameter transformer language model (SmolLM2-135M) with an ensemble of lightweight online predictors and a 32-bit arithmetic coder, achieving the best compression results among the systems evaluated in this study on natural language text. Beyond the base LLM-plus-arithmetic-coding paradigm, Nacrith introduces several contributions: (1) a CDF precision upgrade from 2^16 to 2^24 that eliminates ~75% of quantization overhead ca","published":"2026-02-23T09:14:05+00:00","categories":"[\"cs.IT\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19626v2","arxiv_url":"http://arxiv.org/abs/2602.19626v2","comment":"10 pages","source":"both","github_repo":"https://github.com/robtacconelli/Nacrith-GPU","github_stars":null,"hf_upvotes":2,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"Presents Nacrith, a neural lossless compression system combining 135M-parameter transformer with ensemble predictors and 32-bit arithmetic coder. Achieves 0.918 bpb on alice29, outperforming gzip by 3.1x and CMIX by 44%. Introduces hybrid binary format (NC06) for arbitrary files and runs on consumer GPUs with ~500MB weights. Code available on GitHub.","reasoning":"Novel compression architecture with practical implementation and strong results. GPU code available, though weights not on HF. High practical applicability for compression tasks.","code_url":"https://github.com/robtacconelli/Nacrith-GPU","s2_tldr":null,"s2_paper_id":"5950557e831a53b7bc28ff9b7e40319ab87e431b","topics":"[\"Efficiency\", \"Language Models\", \"Architecture\"]"},{"id":47,"run_id":1,"domain":"aiml","arxiv_id":"2602.19058","entry_id":"","title":"Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer","authors":"[\"Chenhang Cui\", \"An Zhang\", \"Yuxin Chen\", \"Gelei Deng\", \"Jingnan Zheng\", \"Zhenkai Liang\", \"Xiang Wang\", \"Tat-Seng Chua\"]","abstract":"Large vision-language models (LVLMs) have rapidly advanced across various domains, yet they still lag behind strong text-only large language models (LLMs) on tasks that require multi-step inference and compositional decision-making. Motivated by their shared transformer architectures, we investigate whether the two model families rely on common internal computation for such inference. 
At the neuron level, we uncover a surprisingly large overlap: more than half of the top-activated units during m","published":"2026-02-22T06:04:05+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19058v1","arxiv_url":"http://arxiv.org/abs/2602.19058v1","comment":"","source":"arxiv","github_repo":"https://github.com/chenhangcuisg-code/Do-LLMs-VLMs-Share-Neurons","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"Discovers that LLMs and LVLMs share over half of top-activated neurons during multi-step inference, revealing a modality-invariant inference subspace. Proposes SNRF to transfer inference circuitry from LLMs to LVLMs via low-rank fusion in shared-neuron subspace, consistently enhancing LVLM reasoning with minimal parameters and no multimodal fine-tuning.","reasoning":"Novel discovery with practical transfer method and code on GitHub. High impact for improving LVLM reasoning capabilities efficiently, strong mechanistic insights with actionable technique.","code_url":"https://github.com/chenhangcuisg-code/Do-LLMs-VLMs-Share-Neurons","s2_tldr":"It is demonstrated that shared neurons form an interpretable bridge between LLMs and LVLMs, enabling low-cost transfer of inference ability into multimodal models, and across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.","s2_paper_id":"b5dba7d72e5b5f0f394bfe4332595b5547a96d64","topics":"[\"Language Models\", \"Multimodal\", \"Architecture\"]"},{"id":48,"run_id":1,"domain":"aiml","arxiv_id":"2602.19049","entry_id":"","title":"IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning","authors":"[\"Yinhan He\", \"Yaochen Zhu\", \"Mingjia Shi\", \"Wendy Zheng\", \"Lin Su\", \"Xiaoqing Wang\", \"Qi Guo\", \"Jundong Li\"]","abstract":"Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (M","published":"2026-02-22T05:30:14+00:00","categories":"[\"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19049v1","arxiv_url":"http://arxiv.org/abs/2602.19049v1","comment":"","source":"arxiv","github_repo":"https://github.com/YinhanHe123/IAPO","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"IAPO proposes information-theoretic post-training using conditional mutual information to assign token-wise advantages, identifying informative reasoning steps while suppressing low-utility exploration. Achieves improved accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods with code available on GitHub.","reasoning":"Novel principled approach to token-efficient reasoning with strong empirical results and theoretical grounding. 
High practical value for reducing inference costs; code on GitHub boosts accessibility.","code_url":"https://github.com/YinhanHe123/IAPO","s2_tldr":"This work proposes IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information with the final answer, providing an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration.","s2_paper_id":"63c10311e07318c8208e1736313242a822d0390b","topics":"[\"Efficiency\", \"Reasoning\", \"Optimization\"]"},{"id":49,"run_id":1,"domain":"aiml","arxiv_id":"2602.19020","entry_id":"","title":"Learning to Detect Language Model Training Data via Active Reconstruction","authors":"[\"Junjie Oscar Yin\", \"John X. Morris\", \"Vitaly Shmatikov\", \"Sewon Min\", \"Hannaneh Hajishirzi\"]","abstract":"Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce \\textbf{Active Data Reconstruction Attack} (ADRA), a family of MIA that actively induces a model to reconstruct a given text through training. We hypothesize that training data are \\textit{more reconstructible} than non-members, and the difference in their reconstr","published":"2026-02-22T03:20:06+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19020v1","arxiv_url":"http://arxiv.org/abs/2602.19020v1","comment":"","source":"both","github_repo":"https://github.com/oseyosey/MIA-RL","github_stars":null,"hf_upvotes":1,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":8.0,"score_axis_3":7.0,"composite":7.35,"summary":"Introduces Active Data Reconstruction Attack (ADRA), a novel membership inference attack that uses RL to actively induce models to reconstruct training data. ADRA outperforms existing MIAs by 10.7% average, with specific improvements of 18.8% on BookMIA and 7.6% on AIME, demonstrating superior detection of pre-training, post-training, and distillation data.","reasoning":"Novel RL-based approach to MIA with code available on GitHub and strong empirical results. Practical for security research but limited direct practitioner applicability beyond auditing.","code_url":"https://github.com/oseyosey/MIA-RL","s2_tldr":"This work introduces Active Data Reconstruction Attack (ADRA), a family of MIA that actively induces a model to reconstruct a given text through training, motivated by findings that reinforcement learning sharpens behaviors already encoded in weights.","s2_paper_id":"b9000e0514f23f771497458c70cf844bf303d9a9","topics":"[\"Language Models\"]"},{"id":50,"run_id":1,"domain":"aiml","arxiv_id":"2602.18232","entry_id":"","title":"Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning","authors":"[\"Lexiang Tang\", \"Weihao Gao\", \"Bingchen Zhao\", \"Lu Ma\", \"Qiao jin\", \"Bang Yang\", \"Yuexian Zou\"]","abstract":"Recent work on test-time scaling for large language model (LLM) reasoning typically assumes that allocating more inference-time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. 
Motivated by this observation, we propose Thinking by Subtraction, a confidence-driven contrastive decoding approach that impro","published":"2026-02-20T14:13:22+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.18232v1","arxiv_url":"http://arxiv.org/abs/2602.18232v1","comment":"","source":"arxiv","github_repo":"https://github.com/bolo-web/CCD","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"Confidence-Driven Contrastive Decoding (CCD) improves LLM reasoning by detecting low-confidence tokens and applying targeted contrastive intervention. Achieves significant accuracy improvements across mathematical reasoning benchmarks while reducing output length with minimal KV-cache overhead as a training-free method.","reasoning":"Code on GitHub, novel confidence-driven approach. Training-free with practical efficiency gains. Strong results across benchmarks without requiring weights.","code_url":"https://github.com/bolo-web/CCD","s2_tldr":"This work proposes Thinking by Subtraction, a confidence-driven contrastive decoding approach that improves reasoning reliability through targeted token-level intervention that significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length.","s2_paper_id":"15e630a7b276eeb24afb7651152e810ecafa4257","topics":"[\"Language Models\", \"Reasoning\"]"},{"id":51,"run_id":1,"domain":"aiml","arxiv_id":"2602.18176","entry_id":"","title":"Improving Sampling for Masked Diffusion Models via Information Gain","authors":"[\"Kaisen Yang\", \"Jayden Teoh\", \"Kaicheng Yang\", \"Yitong Zhang\", \"Alex Lamb\"]","abstract":"Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful planning to achieve high-quality generation. Existing samplers typically adopt greedy heuristics, prioritizing positions with the highest local certainty to decode at each step. Through failure case analysis, we identify a fundamental limitation of this approach: it neglects the downstream impact of current decoding choices on subsequent steps and fails to minimize cumulative","published":"2026-02-20T12:26:03+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18176v1","arxiv_url":"http://arxiv.org/abs/2602.18176v1","comment":"https://github.com/yks23/Information-Gain-Sampler","source":"arxiv","github_repo":"https://github.com/yks23/Information-Gain-Sampler","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"Info-Gain Sampler improves Masked Diffusion Model decoding by balancing immediate uncertainty with information gain over future masked tokens. Achieves 3.6% accuracy improvement on reasoning and 63.1% win-rate in creative writing, reducing cumulative uncertainty from 78.4 to 48.6 on reasoning tasks.","reasoning":"Code on GitHub with strong empirical results across diverse tasks. Novel principled approach to MDM sampling. 
Practical improvements without training.","code_url":"https://github.com/yks23/Information-Gain-Sampler","s2_tldr":"The Info-Gain Sampler is proposed, a principled decoding framework that balances immediate uncertainty with information gain over future masked tokens, and demonstrates that Info-Gain Sampler consistently outperforms existing samplers for MDMs.","s2_paper_id":"f88c1d5decfd2fe980dacbeb9752925448db8f5c","topics":"[\"Agents\"]"},{"id":53,"run_id":1,"domain":"aiml","arxiv_id":"2602.17664","entry_id":"","title":"Sink-Aware Pruning for Diffusion Language Models","authors":"[\"Aidar Myrzakhan\", \"Tianyi Li\", \"Bowei Guo\", \"Shengkun Tang\", \"Zhiqiang Shen\"]","abstract":"Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across ti","published":"2026-02-19T18:59:50+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.17664v1","arxiv_url":"http://arxiv.org/abs/2602.17664v1","comment":"Code at: https://github.com/VILA-Lab/Sink-Aware-Pruning","source":"both","github_repo":"https://github.com/VILA-Lab/Sink-Aware-Pruning","github_stars":null,"hf_upvotes":3,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"Introduces Sink-Aware Pruning for Diffusion Language Models, showing that attention sinks in DLMs are transient unlike in AR models. Proposes automatic identification and pruning of unstable sinks, achieving better quality-efficiency trade-off without retraining. Code and implementation publicly available on GitHub.","reasoning":"Novel insight about DLM attention patterns with practical efficiency gains. Code available on GitHub enables immediate adoption. Strong practical value.","code_url":"https://github.com/VILA-Lab/Sink-Aware-Pruning","s2_tldr":"This work proposes Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs), and achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute.","s2_paper_id":"ff5e4167f46efdf9a68b310a5a68786cf5ce789a","topics":"[\"Language Models\", \"Efficiency\"]"},{"id":54,"run_id":1,"domain":"aiml","arxiv_id":"2602.17645","entry_id":"","title":"Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting","authors":"[\"Xiaohan Zhao\", \"Zhaoyi Li\", \"Yaxin Luo\", \"Jiacheng Cui\", \"Zhiqiang Shen\"]","abstract":"Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. 
We attribute this to (i) ViT translation sensitivity tha","published":"2026-02-19T18:54:32+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.17645v1","arxiv_url":"http://arxiv.org/abs/2602.17645v1","comment":"Code at: https://github.com/vila-lab/M-Attack-V2","source":"arxiv","github_repo":"https://github.com/vila-lab/M-Attack-V2","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[{\"id\": \"MBZUAI-LLM/M-Attack-V2-Adversarial-Samples\", \"likes\": 0}]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":7.35,"summary":"M-Attack-V2 substantially improves black-box adversarial attacks on frontier LVLMs by addressing gradient variance through Multi-Crop Alignment and Auxiliary Target Alignment. The method achieves 30% success on Claude-4.0 (up from 8%) and 100% on GPT-5 (up from 98%), with code and data publicly available.","reasoning":"Strong practical contribution with open code, addresses real security concerns for deployed LVLMs, significant empirical improvements over prior work.","code_url":"https://github.com/vila-lab/M-Attack-V2","s2_tldr":"M-Attack-V2 is a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks.","s2_paper_id":"1065a81ce2c8db90072d3c315968f3a5873d362a","topics":"[\"Language Models\", \"Multimodal\", \"Training\"]"},{"id":56,"run_id":1,"domain":"aiml","arxiv_id":"2602.21203","entry_id":"","title":"Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics","authors":"[\"Abdulaziz Almuzairee\", \"Henrik I. Christensen\"]","abstract":"Visual reinforcement learning is appealing for robotics but expensive -- off-policy methods are sample-efficient yet slow; on-policy methods parallelize well but waste samples. Recent work has shown that off-policy methods can train faster than on-policy methods in wall-clock time for state-based control. Extending this to vision remains challenging, where high-dimensional input images complicate training dynamics and introduce substantial storage and encoding overhead. To address these challeng","published":"2026-02-24T18:58:11+00:00","categories":"[\"cs.RO\", \"cs.CV\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.21203v1","arxiv_url":"http://arxiv.org/abs/2602.21203v1","comment":"For website and code, see https://aalmuzairee.github.io/squint","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":8.0,"composite":7.0,"summary":"Squint is a visual Soft Actor Critic method for sim-to-real robotics that achieves faster wall-clock training than prior off-policy and on-policy methods. It trains manipulation policies in under 15 minutes on a single RTX 3090, with most tasks converging in under 6 minutes, and demonstrates successful sim-to-real transfer.","reasoning":"Strong practical applicability with impressive efficiency gains and real robot validation. Novelty is incremental (optimizations to existing SAC), but engineering contributions are valuable. 
Code availability helps.","code_url":"https://aalmuzairee.github.io/squint","s2_tldr":"This work introduces Squint, a visual Soft Actor Critic method that achieves faster wall-clock training than prior visual off-policy and on-policy methods, via parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update-to-data ratio, and an optimized implementation.","s2_paper_id":"edaeb99457466b0cea0a896729d3a9caf3e63050","topics":"[\"Robotics\", \"Training\", \"RL\"]"},{"id":57,"run_id":1,"domain":"aiml","arxiv_id":"2602.21198","entry_id":"","title":"Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs","authors":"[\"Yining Hong\", \"Huang Huang\", \"Manling Li\", \"Li Fei-Fei\", \"Jiajun Wu\", \"Yejin Choi\"]","abstract":"Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \\textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflec","published":"2026-02-24T18:55:18+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\", \"cs.CV\", \"cs.RO\"]","pdf_url":"https://arxiv.org/pdf/2602.21198v1","arxiv_url":"http://arxiv.org/abs/2602.21198v1","comment":"","source":"both","github_repo":"https://github.com/Reflective-Test-Time-Planning/Reflective-Test-Time-Planning","github_stars":null,"hf_upvotes":4,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"Proposes Reflective Test-Time Planning for embodied LLMs, integrating reflection-in-action (test-time scaling with candidate action evaluation) and reflection-on-action (test-time training with policy updates). Shows significant gains on long-horizon household tasks and real-robot trials with improved credit assignment through retrospective reflection.","reasoning":"Interesting blend of test-time scaling and training with practical robotics validation. Code available, but no mention of weights. Novelty is moderate (combines existing concepts).","code_url":"https://github.com/Reflective-Test-Time-Planning/Reflective-Test-Time-Planning","s2_tldr":"Drawing upon human reflective practitioners, Reflective Test-Time Planning is introduced, which integrates two modes of reflection: reflection-in-action and reflection-on-action, which uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution.","s2_paper_id":"ebd3d544b0e8fb318f9035e1b5dd5e707b13e08a","topics":"[\"Agents\", \"Robotics\", \"Reasoning\"]"},{"id":61,"run_id":1,"domain":"aiml","arxiv_id":"2602.20901","entry_id":"","title":"SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models","authors":"[\"Yuechen Xie\", \"Xiaoyan Zhang\", \"Yicheng Shan\", \"Hao Zhu\", \"Rui Tang\", \"Rong Wei\", \"Mingli Song\", \"Yuanyu Wan\", \"Jie Song\"]","abstract":"Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. 
Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects i","published":"2026-02-24T13:38:37+00:00","categories":"[\"cs.CV\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.20901v1","arxiv_url":"http://arxiv.org/abs/2602.20901v1","comment":"Accepted by CVPR 2026","source":"arxiv","github_repo":"https://github.com/xieyc99/SpatiaLQA","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":8.0,"composite":7.0,"summary":"SpatiaLQA benchmark evaluates spatial logical reasoning in VLMs through 9,605 QA pairs from 241 real indoor scenes. Tests both spatial relationship understanding and logical task dependencies. Proposes recursive scene graph assisted reasoning using visual foundation models, outperforming existing methods with code and dataset available.","reasoning":"Valuable benchmark with code/dataset release addressing important VLM capability gap. Moderate novelty in evaluation approach. High practical utility for robotics/embodied AI.","code_url":"https://github.com/xieyc99/SpatiaLQA","s2_tldr":"A method called recursive scene graph assisted reasoning is proposed, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs, outperforming all previous methods.","s2_paper_id":"0027642e76e7cc1ee1187530cad707d92c508668","topics":"[\"Language Models\", \"Multimodal\", \"Reasoning\"]"},{"id":62,"run_id":1,"domain":"aiml","arxiv_id":"2602.20807","entry_id":"","title":"RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction","authors":"[\"Yangfan Zhao\", \"Hanwei Zhang\", \"Ke Huang\", \"Qiufeng Wang\", \"Zhenzhou Shao\", \"Dengyu Wu\"]","abstract":"Combining 3D Gaussian splatting with Simultaneous Localization and Mapping (SLAM) has gained popularity as it enables continuous 3D environment reconstruction during motion. However, existing methods struggle in dynamic environments, particularly moving objects complicate 3D reconstruction and, in turn, hinder reliable tracking. The emergence of 4D reconstruction, especially 4D Gaussian splatting, offers a promising direction for addressing these challenges, yet its potential for 4D-aware SLAM r","published":"2026-02-24T11:47:43+00:00","categories":"[\"cs.CV\", \"cs.RO\"]","pdf_url":"https://arxiv.org/pdf/2602.20807v1","arxiv_url":"http://arxiv.org/abs/2602.20807v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"RU4D-SLAM extends 3D Gaussian splatting SLAM to 4D reconstruction in dynamic scenes, introducing uncertainty-aware perception and semantic-guided reweighting to handle moving objects and motion blur. The method improves trajectory accuracy and reconstruction quality in challenging conditions with blurred or occluded inputs.","reasoning":"Novel integration of 4D Gaussian splatting with SLAM and uncertainty modeling. Code URL provided but no evidence of released weights or HF presence. 
Strong practical value for robotics/AR applications.","code_url":"https://ru4d-slam.github.io","s2_tldr":"This work proposes a robust and efficient framework, namely Reweighting Uncertainty in Gaussian Splatting SLAM (RU4D-SLAM) for 4D scene reconstruction, that introduces temporal factors into spatial 3D representation while incorporating uncertainty-aware perception of scene changes, blurred image synthesis, and dynamic scene reconstruction.","s2_paper_id":"775380f56a495d48033342249f2d9655dae4e26a","topics":"[\"3D / Vision\"]"},{"id":63,"run_id":1,"domain":"aiml","arxiv_id":"2602.20731","entry_id":"","title":"Communication-Inspired Tokenization for Structured Image Representations","authors":"[\"Aram Davtyan\", \"Yusuf Sahin\", \"Yasaman Haghighi\", \"Sebastian Stapf\", \"Pablo Acuaviva\", \"Alexandre Alahi\", \"Paolo Favaro\"]","abstract":"Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a fra","published":"2026-02-24T09:53:50+00:00","categories":"[\"cs.CV\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.20731v1","arxiv_url":"http://arxiv.org/abs/2602.20731v1","comment":"Project website: https://araachie.github.io/comit/","source":"both","github_repo":"https://github.com/araachie/comit","github_stars":null,"hf_upvotes":4,"hf_models":"[{\"id\": \"cvg-unibe/comit-xl\", \"likes\": 0}, {\"id\": \"cvg-unibe/comit-l\", \"likes\": 0}, {\"id\": \"cvg-unibe/comit-b\", \"likes\": 0}]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"COMiT introduces communication-inspired tokenization for structured discrete visual representations, iteratively encoding images through localized crops and recurrent token updates. Achieves better compositional generalization and object-centric structure compared to reconstruction-focused tokenizers.","reasoning":"Code and multiple model weights on HF. Novel communication-inspired approach to tokenization. Good practical applicability for vision transformers and multimodal systems.","code_url":"https://github.com/araachie/comit","s2_tldr":"This work introduces COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences and demonstrates that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.","s2_paper_id":"6c08e3e406d5443c13b45250cea3de07473f6f0c","topics":"[\"Multimodal\", \"Efficiency\", \"Architecture\"]"},{"id":68,"run_id":1,"domain":"aiml","arxiv_id":"2602.20312","entry_id":"","title":"N4MC: Neural 4D Mesh Compression","authors":"[\"Guodong Chen\", \"Huanshuo Dong\", \"Mallesham Dasari\"]","abstract":"We present N4MC, the first 4D neural compression framework to efficiently compress time-varying mesh sequences by exploiting their temporal redundancy. 
Unlike prior neural mesh compression methods that treat each mesh frame independently, N4MC takes inspiration from inter-frame compression in 2D video codecs, and learns motion compensation in long mesh sequences. Specifically, N4MC converts consecutive irregular mesh frames into regular 4D tensors to provide a uniform and compact representation.","published":"2026-02-23T19:58:30+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20312v1","arxiv_url":"http://arxiv.org/abs/2602.20312v1","comment":"","source":"arxiv","github_repo":"https://github.com/frozzzen3/N4MC","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"N4MC introduces the first neural 4D mesh compression framework that exploits temporal redundancy using transformer-based interpolation for motion compensation. The method achieves superior rate-distortion performance over state-of-the-art while enabling real-time decoding of 4D mesh sequences.","reasoning":"Code available on GitHub. Novel 4D compression approach with strong practical applicability for dynamic content. Real-time decoding capability enhances usability.","code_url":"https://github.com/frozzzen3/N4MC","s2_tldr":"N4MC is presented, the first 4D neural compression framework to efficiently compress time-varying mesh sequences by exploiting their temporal redundancy, and introduces a transformer-based interpolation model that predicts intermediate mesh frames conditioned on latent embeddings derived from tracked volume centers, eliminating motion ambiguities.","s2_paper_id":"f00833f3f94cfd44eb04844668414435b9366865","topics":"[\"Efficiency\"]"},{"id":70,"run_id":1,"domain":"aiml","arxiv_id":"2602.20068","entry_id":"","title":"The Invisible Gorilla Effect in Out-of-distribution Detection","authors":"[\"Harry Anthony\", \"Ziyun Liang\", \"Hermione Warr\", \"Konstantinos Kamnitsas\"]","abstract":"Deep Neural Networks achieve high performance in vision tasks by learning features from regions of interest (ROI) within images, but their performance degrades when deployed on out-of-distribution (OOD) data that differs from training data. This challenge has led to OOD detection methods that aim to identify and reject unreliable predictions. Although prior work shows that OOD detection performance varies by artefact type, the underlying causes remain underexplored. To this end, we identify a pr","published":"2026-02-23T17:24:18+00:00","categories":"[\"cs.CV\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.20068v1","arxiv_url":"http://arxiv.org/abs/2602.20068v1","comment":"Accepted at CVPR 2026","source":"arxiv","github_repo":"https://github.com/HarryAnthony/Invisible_Gorilla_Effect","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"This work identifies the \"Invisible Gorilla Effect\" in OOD detection, where detection performance drops when artefacts differ visually from the model's regions of interest. 
The study annotates 11,355 images and evaluates 40 OOD methods across 7 benchmarks, revealing a previously unreported bias.","reasoning":"Code and annotations available on GitHub; novel insight into OOD detection failure modes; strong practical value for improving detector robustness.","code_url":"https://github.com/HarryAnthony/Invisible_Gorilla_Effect","s2_tldr":"This work identifies a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact shares visual similarity with the model's ROI and drops when it does not - a phenomenon the authors term the Invisible Gorilla Effect.","s2_paper_id":"d7482679b47a405ce2166598d26a8342f5d3cc12","topics":"[]"},{"id":75,"run_id":1,"domain":"aiml","arxiv_id":"2602.19512","entry_id":"","title":"Variational Trajectory Optimization of Anisotropic Diffusion Schedules","authors":"[\"Pengxi Liu\", \"Zeyu Michael Li\", \"Xiang Cheng\"]","abstract":"We introduce a variational framework for diffusion models with anisotropic noise schedules parameterized by a matrix-valued path $M_t(\u03b8)$ that allocates noise across subspaces. Central to our framework is a trajectory-level objective that jointly trains the score network and learns $M_t(\u03b8)$, which encompasses general parameterization classes of matrix-valued noise schedules. We further derive an estimator for the derivative with respect to $\u03b8$ of the score that enables efficient optimization of ","published":"2026-02-23T04:56:41+00:00","categories":"[\"cs.LG\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19512v1","arxiv_url":"http://arxiv.org/abs/2602.19512v1","comment":"","source":"arxiv","github_repo":"https://github.com/lizeyu090312/anisotropic-diffusion-paper","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"Proposes variational framework for diffusion models with anisotropic noise schedules via matrix-valued paths and a trajectory-level objective. Includes efficient reverse-ODE solver generalizing Heun discretization. Improves upon EDM baseline across CIFAR-10, AFHQv2, FFHQ, ImageNet-64.","reasoning":"Code on GitHub. Anisotropic diffusion schedules are novel; trajectory optimization is interesting. Practical improvements but incremental SOTA in image generation.","code_url":"https://github.com/lizeyu090312/anisotropic-diffusion-paper","s2_tldr":"A variational framework for diffusion models with anisotropic noise schedules parameterized by a matrix-valued path that allocates noise across subspaces and develops an efficiently-implementable reverse-ODE solver that is an anisotropic generalization of the second-order Heun discretization algorithm.","s2_paper_id":"8cb6e009d49d9b059f4192482d049b527439755b","topics":"[\"Optimization\", \"Efficiency\"]"},{"id":76,"run_id":1,"domain":"aiml","arxiv_id":"2602.19497","entry_id":"","title":"MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models","authors":"[\"Mingrui Wu\", \"Hang Liu\", \"Jiayi Ji\", \"Xiaoshuai Sun\", \"Rongrong Ji\"]","abstract":"Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. 
However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce \\textbf{MICON-Bench}, a comprehensive benchmark covering six tasks that evaluate ","published":"2026-02-23T04:32:52+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19497v1","arxiv_url":"http://arxiv.org/abs/2602.19497v1","comment":"CVPR2026","source":"arxiv","github_repo":"https://github.com/Angusliuuu/MICON-Bench","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"MICON-Bench benchmarks multi-image context generation in unified multimodal models across six tasks evaluating cross-image composition and identity preservation. Dynamic Attention Rebalancing (DAR) is a training-free plug-and-play mechanism improving coherence by dynamically adjusting attention during inference.","reasoning":"Code on GitHub; CVPR 2026. Novel benchmark for multi-image generation and training-free DAR method. Practical for evaluating and improving UMMs.","code_url":"https://github.com/Angusliuuu/MICON-Bench","s2_tldr":"A comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation, and an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, where multimodal large language model (MLLM) serves as a verifier.","s2_paper_id":"88a82808aa2c2f59af5b42e3cb8b88c646aad496","topics":"[\"Image Generation\", \"Multimodal\", \"Benchmark\"]"},{"id":77,"run_id":1,"domain":"aiml","arxiv_id":"2602.19432","entry_id":"","title":"CountEx: Fine-Grained Counting via Exemplars and Exclusion","authors":"[\"Yifeng Huang\", \"Gia Khanh Nguyen\", \"Minh Hoai\"]","abstract":"This paper presents CountEx, a discriminative visual counting framework designed to address a key limitation of existing prompt-based methods: the inability to explicitly exclude visually similar distractors. While current approaches allow users to specify what to count via inclusion prompts, they often struggle in cluttered scenes with confusable object categories, leading to ambiguity and overcounting. CountEx enables users to express both inclusion and exclusion intent, specifying what to cou","published":"2026-02-23T02:01:44+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19432v1","arxiv_url":"http://arxiv.org/abs/2602.19432v1","comment":"","source":"arxiv","github_repo":"https://github.com/bbvisual/CountEx","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"CountEx enables fine-grained visual counting with both inclusion and exclusion prompts through multimodal inputs (language + visual exemplars). Introduces discriminative query refinement to handle visually similar distractors and presents CoCount benchmark with 1,780 videos across 97 category pairs.","reasoning":"Novel discriminative counting approach with code on GitHub. 
Strong practical utility for practitioners needing to count specific objects while excluding confusables.","code_url":"https://github.com/bbvisual/CountEx","s2_tldr":"The proposed CountEx enables users to express both inclusion and exclusion intent, specifying what to count and what to ignore, through multimodal prompts including natural language descriptions and optional visual exemplars.","s2_paper_id":"73e9a071556b445040b22bb29568fdbd431be6e0","topics":"[]"},{"id":83,"run_id":1,"domain":"aiml","arxiv_id":"2602.19248","entry_id":"","title":"No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection","authors":"[\"Zunkai Dai\", \"Ke Li\", \"Jiajia Liu\", \"Jie Yang\", \"Yuanyuan Qiao\"]","abstract":"The collection and detection of video anomaly data has long been a challenging problem due to its rare occurrence and spatio-temporal scarcity. Existing video anomaly detection (VAD) methods under perform in open-world scenarios. Key contributing factors include limited dataset diversity, and inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exp","published":"2026-02-22T16:03:43+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19248v1","arxiv_url":"http://arxiv.org/abs/2602.19248v1","comment":"Accepted by CVPR 2026","source":"arxiv","github_repo":"https://github.com/VitaminCreed/LAVIDA","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"LAVIDA achieves zero-shot video anomaly detection using MLLMs and pseudo-anomalies generated from segmented objects. The method includes an Anomaly Exposure Sampler and reverse attention-based token compression, training only on synthetic anomalies without real VAD data.","reasoning":"Strong novelty with zero-shot capability and code available. Addresses practical problem of anomaly data scarcity, CVPR 2026 accepted.","code_url":"https://github.com/VitaminCreed/LAVIDA","s2_tldr":"Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting, and a token compression approach based on reverse attention to handle the spatio-temporal scarcity of anomalous patterns and decrease computational cost.","s2_paper_id":"816b3e521cb9af5f5fd5b22a9249eb41fc6bd7d9","topics":"[\"Benchmark\"]"},{"id":84,"run_id":1,"domain":"aiml","arxiv_id":"2602.19202","entry_id":"","title":"UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models","authors":"[\"Gang Xu\", \"Zhiyu Zhu\", \"Junhui Hou\"]","abstract":"Event cameras excel at high-speed, low-power, and high-dynamic-range scene perception. However, as they fundamentally record only relative intensity changes rather than absolute intensity, the resulting data streams suffer from a significant loss of spatial information and static texture details. In this paper, we address this limitation by leveraging the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data. 
Specifically, we fir","published":"2026-02-22T14:06:49+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19202v1","arxiv_url":"http://arxiv.org/abs/2602.19202v1","comment":"","source":"arxiv","github_repo":"https://github.com/CS-GangXu/UniE2F","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"UniE2F reconstructs high-fidelity video frames from sparse event camera data using pre-trained video diffusion models. Introduces event-based inter-frame residual guidance and extends to zero-shot frame interpolation and prediction through modulated sampling.","reasoning":"Novel unified framework for event-to-frame reconstruction with code available. Leverages video foundation models effectively for specialized sensor data.","code_url":"https://github.com/CS-GangXu/UniE2F","s2_tldr":"This paper establishes a baseline model by directly applying event data as a condition to synthesize videos and introduces the event-based inter-frame residual guidance to enhance the accuracy of video frame reconstruction, thereby creating a unified event-to-frame reconstruction framework.","s2_paper_id":"29efb159e1083e0b1a59d34e267b5b52090ecdfd","topics":"[\"Language Models\", \"Video Generation\"]"},{"id":88,"run_id":1,"domain":"aiml","arxiv_id":"2602.19004","entry_id":"","title":"MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment","authors":"[\"Duc Duy Nguyen\", \"Tat-Jun Chin\", \"Minh Hoai\"]","abstract":"We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving f","published":"2026-02-22T01:54:29+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19004v1","arxiv_url":"http://arxiv.org/abs/2602.19004v1","comment":"8 pages, 6 tables, 7 figures, accepted to CVPR26","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"MoBind learns joint IMU-video pose representations through hierarchical contrastive learning, aligning IMU with skeletal motion at both token and body-part levels. Achieves fine-grained temporal alignment enabling cross-modal retrieval, synchronization, localization, and action recognition across mRi, TotalCapture, and EgoHumans.","reasoning":"Novel hierarchical approach for IMU-video alignment with code available, CVPR 2026. 
Practical for AR/VR and human activity recognition applications.","code_url":"https://github.com/bbvisual/MoBind","s2_tldr":"MoBind is introduced, a hierarchical contrastive learning framework designed to address three challenges: filtering out irrelevant visual background, modeling structured multi-sensor IMU configurations, and achieving fine-grained, sub-second temporal alignment.","s2_paper_id":"3c90c67612e5c0a000b032a2064b9a6f6beedea0","topics":"[\"Training\", \"Retrieval / RAG\"]"},{"id":89,"run_id":1,"domain":"aiml","arxiv_id":"2602.18896","entry_id":"","title":"Beyond Stationarity: Rethinking Codebook Collapse in Vector Quantization","authors":"[\"Hao Lu\", \"Onur C. Koyun\", \"Yongxin Guo\", \"Zhengjie Zhu\", \"Abbas Alili\", \"Metin Nafi Gurcan\"]","abstract":"Vector Quantization (VQ) underpins many modern generative frameworks such as VQ-VAE, VQ-GAN, and latent diffusion models. Yet, it suffers from the persistent problem of codebook collapse, where a large fraction of code vectors remains unused during training. This work provides a new theoretical explanation by identifying the nonstationary nature of encoder updates as the fundamental cause of this phenomenon. We show that as the encoder drifts, unselected code vectors fail to receive updates and ","published":"2026-02-21T16:36:50+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.18896v1","arxiv_url":"http://arxiv.org/abs/2602.18896v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"NSVQ and TransVQ address codebook collapse in vector quantization by identifying nonstationary encoder updates as root cause. NSVQ propagates encoder drift through kernel-based rules, while TransVQ uses lightweight mapping to transform codebooks. Both achieve near-complete codebook utilization with superior reconstruction quality on CelebA-HQ.","reasoning":"Code available on GitHub. Novel theoretical explanation and practical solutions for fundamental VQ problem. Strong experimental validation with open-sourced implementation.","code_url":"https://github.com/CAIR-LAB-WFUSM/NSVQ-TransVQ.git","s2_tldr":"Two new methods are proposed: Non-Stationary Vector Quantization (NSVQ), which propagates encoder drift to non-selected codes through a kernel-based rule, and Transformer-based Vector Quantization (TransVQ), which employs a lightweight mapping to adaptively transform the entire codebook while preserving convergence to the k-means solution.","s2_paper_id":"125c7d3f294e0086726f35b059781bd38ba42552","topics":"[\"Efficiency\", \"Image Generation\"]"},{"id":93,"run_id":1,"domain":"aiml","arxiv_id":"2602.20710","entry_id":"","title":"Counterfactual Simulation Training for Chain-of-Thought Faithfulness","authors":"[\"Peter Hase\", \"Christopher Potts\"]","abstract":"Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. 
We apply CST in two settings:","published":"2026-02-24T09:15:30+00:00","categories":"[\"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20710v1","arxiv_url":"http://arxiv.org/abs/2602.20710v1","comment":"","source":"arxiv","github_repo":"https://github.com/peterbhase/counterfactual-simulation-training","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"Counterfactual Simulation Training (CST) improves CoT faithfulness by rewarding chains that enable accurate prediction over counterfactual inputs. Achieves 35-point accuracy gain on cue-based counterfactuals with models up to 235B parameters, with code available on GitHub.","reasoning":"Novel training method addressing important faithfulness problem with strong results. GitHub repo available significantly boosts code score.","code_url":"https://github.com/peterbhase/counterfactual-simulation-training","s2_tldr":"Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals as well as simulatability over generic counterfactuals, and suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring.","s2_paper_id":"1af1bc0ab00f7fe4ff30014684b150aeb06c7632","topics":"[\"Reasoning\", \"Language Models\", \"RL\"]"},{"id":96,"run_id":1,"domain":"aiml","arxiv_id":"2602.20517","entry_id":"","title":"Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination","authors":"[\"Rakshit Trivedi\", \"Kartik Sharma\", \"David C Parkes\"]","abstract":"Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors. However, current methods struggle to capture the inherent diversity and non-Markovian nature of human behavior and lack the ability to steer behavior at inference time. Drawing inspiration from the th","published":"2026-02-24T03:37:42+00:00","categories":"[\"cs.AI\", \"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.20517v1","arxiv_url":"http://arxiv.org/abs/2602.20517v1","comment":"Spotlight paper at NeurIPS 2025","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"MIMIC uses vision-language models as linguistic scaffolding to generate inner speech representations that guide imitation learning for human-AI coordination. A conditional VAE generates behavior-specific inner speech from observations, then a diffusion policy selects actions conditioned on observations and speech. Enables fine-grained behavioral steering at inference time without additional demonstrations.","reasoning":"Novel use of language as internal behavioral representation with strong theoretical grounding. 
Code and pre-trained agents are openly released at https://mimic-research.github.io, enhancing reproducibility and adoption.","s2_tldr":"MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent that enables fine-grained steering of behavior at inference time by conditioning the agent on behavior-specific speech.","s2_paper_id":"ac391373d8b52ac61c0c93b9013e057017bad095","topics":"[\"Speech / Audio\"]"},{"id":98,"run_id":1,"domain":"aiml","arxiv_id":"2602.20122","entry_id":"","title":"NanoKnow: How to Know What Your Language Model Knows","authors":"[\"Lingwei Gu\", \"Nour Jedidi\", \"Jimmy Lin\"]","abstract":"How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a \"black box\" -- unknown or inaccessible. The recent release of nanochat -- a family of small LLMs with fully open pre-training data -- addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions q","published":"2026-02-23T18:37:49+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.IR\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.20122v1","arxiv_url":"http://arxiv.org/abs/2602.20122v1","comment":"","source":"arxiv","github_repo":"https://github.com/castorini/NanoKnow","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"NanoKnow is a benchmark dataset partitioning Natural Questions and SQuAD based on whether answers appear in nanochat's open pre-training corpus. Enables disentangling parametric vs. external knowledge sources. Experiments with eight checkpoints show closed-book accuracy depends on answer frequency, external evidence is complementary but doesn't fully compensate, and non-relevant contexts harm performance.","reasoning":"Novel benchmark leveraging fully open pre-training data to understand knowledge sources. Code and data openly released at GitHub. Actionable insights for RAG and knowledge-grounded generation.","code_url":"https://github.com/castorini/NanoKnow","s2_tldr":"NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus, is released, demonstrating that parametric and external knowledge are complementary, and non-relevant information is harmful.","s2_paper_id":"32d089d724ff7d2ae3ce40a9f6135e9f967e310b","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":99,"run_id":1,"domain":"aiml","arxiv_id":"2602.19127","entry_id":"","title":"AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG","authors":"[\"Qijie You\", \"Wenkai Yu\", \"Wentao Zhang\"]","abstract":"With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. 
However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop quer","published":"2026-02-22T10:55:21+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19127v1","arxiv_url":"http://arxiv.org/abs/2602.19127v1","comment":"","source":"arxiv","github_repo":"https://github.com/YqjMartin/AgenticRAGTracer","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":8.0,"composite":7.0,"summary":"AgenticRAGTracer introduces the first hop-aware benchmark for multi-step retrieval reasoning in Agentic RAG, with 1,305 automatically generated data points supporting step-by-step validation. Experiments show even GPT-5 achieves only 22.6% accuracy on hardest portions, with failures primarily from distorted reasoning chains.","reasoning":"Novel benchmark with code on GitHub addressing important gap in Agentic RAG evaluation. High practical value for researchers building multi-hop retrieval systems, with both source indicator suggesting good visibility.","code_url":"https://github.com/YqjMartin/AgenticRAGTracer","s2_tldr":"AgenticRAGTracer is introduced, the first Agentic RAG benchmark that is primarily constructed automatically by large language models and designed to support step-by-step validation.","s2_paper_id":"7ed984c9a29d00ef01dc8311601c7df7b0055949","topics":"[\"Agents\", \"Retrieval / RAG\", \"Reasoning\"]"},{"id":100,"run_id":1,"domain":"aiml","arxiv_id":"2602.18998","entry_id":"","title":"Benchmark Test-Time Scaling of General LLM Agents","authors":"[\"Xiaochuan Li\", \"Ryan Ming\", \"Pranav Setlur\", \"Abhijay Paladugu\", \"Andy Tang\", \"Hao Kang\", \"Shuai Shao\", \"Rong Jin\", \"Chenyan Xiong\"]","abstract":"LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents ac","published":"2026-02-22T01:08:02+00:00","categories":"[\"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18998v1","arxiv_url":"http://arxiv.org/abs/2602.18998v1","comment":"","source":"both","github_repo":"https://github.com/cxcscmu/General-AgentBench","github_stars":null,"hf_upvotes":3,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":8.0,"composite":7.0,"summary":"Presents General AgentBench, a unified benchmark evaluating LLM agents across search, coding, reasoning, and tool-use domains. Evaluates 10 leading agents under sequential and parallel scaling, revealing performance degradation in general settings and fundamental limitations in both scaling approaches. Code publicly available.","reasoning":"Valuable benchmark with code for agent evaluation. Practical for assessing agent capabilities but incremental contribution to methodology. 
High practical value for practitioners building agent systems.","code_url":"https://github.com/cxcscmu/General-AgentBench","s2_tldr":"This work introduces General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains and finds that neither scaling methodology yields effective performance improvements in practice.","s2_paper_id":"4fc75f878df0f4815c6192d31f28a5fd560dcc60","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":101,"run_id":1,"domain":"aiml","arxiv_id":"2602.18920","entry_id":"","title":"DeepInnovator: Triggering the Innovative Capabilities of LLMs","authors":"[\"Tianyu Fan\", \"Fengji Zhang\", \"Yuxiang Zheng\", \"Bei Chen\", \"Xinyao Niu\", \"Chengen Huang\", \"Junyang Lin\", \"Chao Huang\"]","abstract":"The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously generate novel and significant research ideas. Existing approaches predominantly rely on sophisticated prompt engineering and lack a systematic training paradigm. To address this, we propose DeepInnovator, a training framework designed to trigger the innovative","published":"2026-02-21T18:07:18+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.18920v1","arxiv_url":"http://arxiv.org/abs/2602.18920v1","comment":"","source":"arxiv","github_repo":"https://github.com/HKUDS/DeepInnovator","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"Introduces DeepInnovator, a training framework to trigger innovative capabilities in LLMs for research idea generation. Constructs automated data extraction pipeline from scientific literature and uses 'Next Idea Prediction' training paradigm. DeepInnovator-14B achieves 80.53-93.81% win rates over baselines with performance comparable to leading LLMs.","reasoning":"Novel training approach for scientific discovery with code/data open-sourced. Interesting paradigm for research automation but narrow application domain.","code_url":"https://github.com/HKUDS/DeepInnovator","s2_tldr":"This work introduces a ``Next Idea Prediction'' training paradigm, which models the generation of research ideas as an iterative process of continuously predicting, evaluating, and refining a plausible and novel next idea.","s2_paper_id":"244f3942376ac7c7df74363b0723c695181271f5","topics":"[\"Language Models\", \"Efficiency\"]"},{"id":103,"run_id":1,"domain":"aiml","arxiv_id":"2602.18420","entry_id":"","title":"SPQ: An Ensemble Technique for Large Language Model Compression","authors":"[\"Jiamin Yao\", \"Eren Gultepe\"]","abstract":"This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD reduces attention projections into compact low-rank factors, iii) and 8-bit quantization uniformly compresses all linear layers. 
A","published":"2026-02-20T18:44:16+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18420v1","arxiv_url":"http://arxiv.org/abs/2602.18420v1","comment":"Accepted to LREC 2026 Main Conference","source":"arxiv","github_repo":"https://github.com/JiaminYao/SPQ_LLM_Compression","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":8.0,"composite":7.0,"summary":"SPQ combines SVD, pruning, and quantization for LLM compression, achieving 75% memory reduction on LLaMA-2-7B while improving perplexity and maintaining accuracy. Demonstrates 1.9x inference speedup over GPTQ with competitive memory usage (6.86GB vs 7.16GB).","reasoning":"Practical compression method with code on GitHub. Strong efficiency gains with empirical validation. However, ensemble approach rather than novel architecture.","code_url":"https://github.com/JiaminYao/SPQ_LLM_Compression","s2_tldr":"An ensemble technique for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization, SPQ offers competitive perplexity and accuracy while using less memory.","s2_paper_id":"955f7cd98960622b825ce3778db3326c0f47d3f6","topics":"[\"Language Models\", \"Efficiency\"]"},{"id":107,"run_id":1,"domain":"aiml","arxiv_id":"2602.17003","entry_id":"","title":"Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History","authors":"[\"Serin Kim\", \"Sangam Lee\", \"Dongha Lee\"]","abstract":"Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on","published":"2026-02-19T01:54:26+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.17003v1","arxiv_url":"http://arxiv.org/abs/2602.17003v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":7.0,"summary":"Persona2Web introduces first benchmark for personalized web agents on open web, requiring agents to resolve ambiguous queries by inferring user preferences from history. Includes reasoning-aware evaluation framework and reveals key challenges across various agent architectures with publicly available code/data.","reasoning":"Moderate code_and_weights (code/data mentioned as public but under anonymous submission). Strong novelty in personalized web agent evaluation paradigm. 
Strong practical applicability for developing context-aware web agents.","code_url":"https://anonymous.4open.science/r/Persona2Web-73E8","s2_tldr":"Persona2Web is the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions.","s2_paper_id":"dd70e363f61f05c1269eb2b28a040edd9dea1751","topics":"[\"Benchmark\", \"Reasoning\", \"Language Models\"]"},{"id":1,"run_id":1,"domain":"aiml","arxiv_id":"2602.20159","entry_id":"","title":"A Very Big Video Reasoning Suite","authors":"[\"Maijunxian Wang\", \"Ruisi Wang\", \"Juyi Lin\", \"Ran Ji\", \"Thadd\\u00e4us Wiedemer\", \"Qingying Gao\", \"Dezhi Luo\", \"Yaoyao Qian\", \"Lianyu Huang\", \"Zelong Hong\"]","abstract":"Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To addr","published":"2026-02-23T18:59:41+00:00","categories":"[\"cs.CV\", \"cs.AI\", \"cs.LG\", \"cs.MM\", \"cs.RO\"]","pdf_url":"https://arxiv.org/pdf/2602.20159v2","arxiv_url":"http://arxiv.org/abs/2602.20159v2","comment":"Homepage: https://video-reason.com/","source":"both","github_repo":"","github_stars":null,"hf_upvotes":380,"hf_models":"[{\"id\": \"Video-Reason/VBVR-Wan2.2\", \"likes\": 38}]","hf_datasets":"[{\"id\": \"Video-Reason/VBVR-Dataset\", \"likes\": 11}, {\"id\": \"Video-Reason/VBVR-Bench-Data\", \"likes\": 3}, {\"id\": \"abs794/VBVR-Bench-Data\", \"likes\": 0}]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":9.0,"score_axis_3":9.0,"composite":6.9,"summary":"VBVR introduces the largest video reasoning dataset with 1M+ clips across 200 tasks (3 orders of magnitude larger than existing datasets) and VBVR-Bench with rule-based verifiable evaluation. The work demonstrates early emergent generalization to unseen reasoning tasks through unprecedented scaling studies, with data, benchmark, and model (Wan2.2) publicly released.","reasoning":"Major dataset release (380 HF upvotes!) with model weights. Paradigm-shifting scale for video reasoning research with strong practical impact on video understanding benchmarking.","code_url":null,"s2_tldr":"The Very Big Video Reasoning (VBVR) Dataset is introduced, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets.","s2_paper_id":"d3e893d3d9a722aea0373ed05c282a81f310a52a","topics":"[\"Reasoning\"]"},{"id":108,"run_id":1,"domain":"aiml","arxiv_id":"2602.21175","entry_id":"","title":"Seeing Through Words: Controlling Visual Retrieval Quality with Language Models","authors":"[\"Jianglin Lu\", \"Simon Jenni\", \"Kushal Kafle\", \"Jing Shi\", \"Handong Zhao\", \"Yun Fu\"]","abstract":"Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. 
Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queri","published":"2026-02-24T18:20:57+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21175v1","arxiv_url":"http://arxiv.org/abs/2602.21175v1","comment":"","source":"arxiv","github_repo":"https://github.com/Jianglin954/QCQC","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":6.65,"summary":"Proposes quality-controllable text-to-image retrieval by using language models to enrich short queries with quality-aware context. Works with any pretrained VLM without modification, providing interpretable query enrichment and explicit quality control over retrieved images.","reasoning":"Practical approach with code available. Novelty is moderate (applying LLMs to query expansion), but quality-controllability is useful. Works without VLM retraining.","code_url":"https://github.com/Jianglin954/QCQC","s2_tldr":"This work proposes a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries.","s2_paper_id":"68fdfe9e20a313070e0c0ffaed9a6ca3eb00b5c3","topics":"[\"Retrieval / RAG\", \"Language Models\", \"Image Generation\"]"},{"id":110,"run_id":1,"domain":"aiml","arxiv_id":"2602.21078","entry_id":"","title":"ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning","authors":"[\"Duowen Chen\", \"Yan Wang\"]","abstract":"Federated Semi-Supervised Learning (FSSL) aims to collaboratively train a global model across clients by leveraging partially-annotated local data in a privacy-preserving manner. In FSSL, data heterogeneity is a challenging issue, which exists both across clients and within clients. External heterogeneity refers to the data distribution discrepancy across different clients, while internal heterogeneity represents the mismatch between labeled and unlabeled data within clients. Most FSSL methods t","published":"2026-02-24T16:41:16+00:00","categories":"[\"cs.LG\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21078v1","arxiv_url":"http://arxiv.org/abs/2602.21078v1","comment":"CVPR 2026. code: https://github.com/DuowenC/FSSLlib","source":"arxiv","github_repo":"https://github.com/DuowenC/FSSLlib","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":6.65,"summary":"ProxyFL addresses external and internal heterogeneity in Federated Semi-Supervised Learning by using learnable classifier weights as proxies to simulate category distributions. Optimizes global proxy on server and uses positive-negative proxy pool locally to re-include discarded samples, improving convergence and performance.","reasoning":"Solid federated learning contribution with code available. CVPR 2026 paper with practical applicability to federated scenarios. 
Novelty is moderate (proxy-based aggregation).","code_url":"https://github.com/DuowenC/FSSLlib","s2_tldr":"ProxyFL is a proxy-guided framework that focuses on simultaneously mitigating external and internal heterogeneity via a unified proxy, treating the learnable classifier weights as proxies to simulate the category distribution both locally and globally.","s2_paper_id":"9b89c563861960da26f17afa05887a605876f971","topics":"[]"},{"id":118,"run_id":1,"domain":"aiml","arxiv_id":"2602.20689","entry_id":"","title":"MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision","authors":"[\"Bedrettin Cetinkaya\", \"Sinan Kalkan\", \"Emre Akbas\"]","abstract":"Generating crisp, i.e., one-pixel-wide, edge maps remains one of the fundamental challenges in edge detection, affecting both traditional and learning-based methods. To obtain crisp edges, most existing approaches rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning, which are non-differentiable and hinder end-to-end optimization. Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfact","published":"2026-02-24T08:45:49+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20689v1","arxiv_url":"http://arxiv.org/abs/2602.20689v1","comment":"Accepted to CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":6.65,"summary":"MatchED introduces differentiable, matching-based supervision for crisp edge detection, eliminating need for hand-crafted NMS post-processing. Achieves 2-4x improvement in crispness metrics and 20-35% boost in performance under crispness-emphasized evaluation.","reasoning":"No code/weights yet (but website suggests forthcoming release). Novel end-to-end training paradigm for fundamental vision task. Good practical applicability as plug-and-play module.","code_url":"https://cvpr26-matched.github.io","s2_tldr":"This work proposes MatchED, a lightweight (only $\\sim$21K additional parameters), plug-and-play matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges, substantially improving the performance of existing edge detection models.","s2_paper_id":"942e93967e3e5b9669324acef5e994f19c5cb200","topics":"[\"Optimization\"]"},{"id":120,"run_id":1,"domain":"aiml","arxiv_id":"2602.20650","entry_id":"","title":"Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression","authors":"[\"Chenyue Yu\", \"Lingao Xiao\", \"Jinhong Deng\", \"Ivor W. Tsang\", \"Yang He\"]","abstract":"Large-scale image datasets are fundamental to deep learning, but their high storage demands pose challenges for deployment in resource-constrained environments. While existing approaches reduce dataset size by discarding samples, they often ignore the significant redundancy within each image -- particularly in the color space. 
To address this, we propose Dataset Color Quantization (DCQ), a unified framework that compresses visual datasets by reducing color-space redundancy while preserving infor","published":"2026-02-24T07:53:58+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20650v1","arxiv_url":"http://arxiv.org/abs/2602.20650v1","comment":"Accepted by ICLR 2026","source":"arxiv","github_repo":"https://github.com/he-y/Dataset-Color-Quantization","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":6.65,"summary":"Dataset Color Quantization (DCQ) compresses visual datasets by reducing color-space redundancy while preserving training-critical information. Enforces consistent palettes across similar images and retains semantically important colors guided by model perception.","reasoning":"Code available on GitHub. Novel dataset-level compression approach with practical value for resource-constrained training. Incremental but useful contribution.","code_url":"https://github.com/he-y/Dataset-Color-Quantization","s2_tldr":"Extensive experiments show that DCQ significantly improves training performance under aggressive compression, offering a scalable and robust solution for dataset-level storage reduction.","s2_paper_id":"04314968a897acd85e49b3eb6f8cae8df04857ef","topics":"[\"Efficiency\", \"Benchmark\"]"},{"id":122,"run_id":1,"domain":"aiml","arxiv_id":"2602.20597","entry_id":"","title":"Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing","authors":"[\"Yuejiao Su\", \"Yi Wang\", \"Lei Yao\", \"Yawen Cui\", \"Lap-Pui Chau\"]","abstract":"A fine-grained understanding of egocentric human-environment interactions is crucial for developing next-generation embodied agents. One fundamental challenge in this area involves accurately parsing hands and active objects. While transformer-based architectures have demonstrated considerable potential for such tasks, several key limitations remain unaddressed: 1) existing query initialization mechanisms rely primarily on semantic cues or learnable parameters, demonstrating limited adaptability","published":"2026-02-24T06:39:18+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20597v1","arxiv_url":"http://arxiv.org/abs/2602.20597v1","comment":"","source":"arxiv","github_repo":"https://github.com/yuggiehk/InterFormer","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":6.65,"summary":"InterFormer tackles egocentric hand-object parsing with a transformer architecture featuring dynamic query generation from hand-object contact, dual-context feature selection, and a co-occurrence loss for physical consistency. Achieves SOTA on EgoHOS and mini-HOI4D benchmarks with strong generalization to out-of-distribution data.","reasoning":"Code and models available on GitHub. Novel architecture components (DQG, DFS, CoCo loss) address real limitations in transformer-based parsing. 
Fairly specialized application domain limits broad applicability.","code_url":"https://github.com/yuggiehk/InterFormer","s2_tldr":"This work proposes an end-to-end Interaction-aware Transformer (InterFormer), which integrates three key components, i.e., a Dynamic Query Generator, a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss, and achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets.","s2_paper_id":"40ab723e76e9e4f7cabc73e582f9b7cbcb9a7843","topics":"[\"Robotics\", \"Architecture\"]"},{"id":123,"run_id":1,"domain":"aiml","arxiv_id":"2602.20537","entry_id":"","title":"PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning","authors":"[\"Xinyong Cai\", \"Changbin Sun\", \"Yong Wang\", \"Hongyu Yang\", \"Yuankai Wu\"]","abstract":"Spatiotemporal predictive learning (STPL) aims to forecast future frames from past observations and is essential across a wide range of applications. Compared with recurrent or hybrid architectures, pure convolutional models offer superior efficiency and full parallelism, yet their fixed receptive fields limit their ability to adaptively capture spatially varying motion patterns. Inspired by biological center-surround organization and frequency-selective signal processing, we propose PFGNet, a f","published":"2026-02-24T04:31:12+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20537v1","arxiv_url":"http://arxiv.org/abs/2602.20537v1","comment":"Accepted to CVPR 2026","source":"arxiv","github_repo":"https://github.com/fhjdqaq/PFGNet","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":6.65,"summary":"PFGNet is a fully convolutional spatiotemporal prediction network using frequency-guided peripheral gating to adaptively modulate receptive fields. Achieves SOTA/near-SOTA on Moving MNIST, TaxiBJ, Human3.6M, and KTH with fewer parameters and FLOPs via 1D kernel decomposition.","reasoning":"Code available on GitHub (accepted CVPR 2026). Novel frequency-guided gating mechanism with efficient decomposed convolutions. Strong practical applicability for spatiotemporal forecasting tasks.","code_url":"https://github.com/fhjdqaq/PFGNet","s2_tldr":null,"s2_paper_id":"1625d9e24b997484c2ab22e3bf6eeddb65bd3f3b","topics":"[\"Efficiency\", \"Reasoning\"]"},{"id":127,"run_id":1,"domain":"aiml","arxiv_id":"2602.20354","entry_id":"","title":"3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism","authors":"[\"Bhavik Chandna\", \"Kelsey R. Allen\"]","abstract":"AI video generation is evolving rapidly. For video generators to be useful for applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process -- requiring human annotation or bespoke evaluation datasets which have restricted scope. 
Here we develop an automated evaluation framework for video realism which captures both semantics and coherent 3D structure and which does not req","published":"2026-02-23T21:00:48+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20354v1","arxiv_url":"http://arxiv.org/abs/2602.20354v1","comment":"","source":"arxiv","github_repo":"https://github.com/TheProParadox/3dspa_code","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":6.65,"summary":"3DSPA presents a 3D semantic point autoencoder for automated video realism evaluation, integrating 3D trajectories, depth cues, and DINO features. The method identifies physical law violations and motion artifacts more reliably than existing metrics while aligning better with human judgments across multiple datasets.","reasoning":"Code and pretrained weights available on GitHub. Novel approach to video quality assessment with practical utility for evaluating generative video models. Strong practical applicability.","code_url":"https://github.com/TheProParadox/3dspa_code","s2_tldr":"The results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models, and implicitly captures physical rule violations.","s2_paper_id":"e3ae80f9882b56ca3abdd23b313d207f4d322388","topics":"[\"3D / Vision\", \"Benchmark\", \"Video Generation\"]"},{"id":134,"run_id":1,"domain":"aiml","arxiv_id":"2602.19753","entry_id":"","title":"RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing","authors":"[\"Kaifa Yang\", \"Qi Yang\", \"Yiling Xu\", \"Zhu Li\"]","abstract":"3D Gaussian Splatting (3DGS) has emerged as a leading technology for high-quality 3D scene reconstruction. However, the iterative refinement and densification process leads to the generation of a large number of primitives, each contributing to the reconstruction to a substantially different extent. Estimating primitive importance is thus crucial, both for removing redundancy during reconstruction and for enabling efficient compression and transmission. Existing methods typically rely on renderi","published":"2026-02-23T12:02:03+00:00","categories":"[\"cs.CV\", \"cs.GR\"]","pdf_url":"https://arxiv.org/pdf/2602.19753v1","arxiv_url":"http://arxiv.org/abs/2602.19753v1","comment":"Accepted by CVPR 2026","source":"arxiv","github_repo":"https://github.com/yyyykf/RAP","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":8.0,"composite":6.65,"summary":"RAP introduces fast feedforward rendering-free primitive importance prediction for 3D Gaussian Splatting using intrinsic attributes and MLP. Achieves efficient plug-and-play integration for reconstruction, compression, and transmission with code on GitHub.","reasoning":"Code available on GitHub, practical efficiency improvements for 3DGS. 
Strong applicability but incremental architecture.","code_url":"https://github.com/yyyykf/RAP","s2_tldr":"RAP is proposed, a fast feedforward rendering-free attribute-guided method for efficient importance score prediction in 3DGS that infers primitive significance directly from intrinsic Gaussian attributes and local neighborhood statistics, avoiding rendering-based or visibility-dependent computations.","s2_paper_id":"e2c718397699e19406921e954cdc0519cda33074","topics":"[\"Efficiency\", \"3D / Vision\"]"},{"id":136,"run_id":1,"domain":"aiml","arxiv_id":"2602.19715","entry_id":"","title":"Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision","authors":"[\"Kartik Kuckreja\", \"Parul Gupta\", \"Muhammad Haris Khan\", \"Abhinav Dhall\"]","abstract":"Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation, that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suit","published":"2026-02-23T11:08:46+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19715v1","arxiv_url":"http://arxiv.org/abs/2602.19715v1","comment":"CVPR-2026, Code is available here: https://github.com/KjAeRsTuIsK/DeepfakeJudge","source":"arxiv","github_repo":"https://github.com/KjAeRsTuIsK/DeepfakeJudge","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":6.65,"summary":"DeepfakeJudge introduces first RL-based MLLM for deepfake detection with reasoning supervision via bootstrapped generator-evaluator process. Achieves 96.2% accuracy and 98.9% pairwise agreement on meta-evaluation, with 70% user preference for generated reasonings.","reasoning":"Code and datasets open-sourced on GitHub. Novel RL approach for interpretable deepfake detection with strong practical value.","code_url":"https://github.com/KjAeRsTuIsK/DeepfakeJudge","s2_tldr":"The proposed DeepfakeJudge is a framework for scalable reasoning supervision and evaluation that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that specialize in evaluating reasoning rationales without the need for explicit ground-truth rationales.","s2_paper_id":"0660ac1393de67263167f300aa0932e39fa6c8d5","topics":"[\"Reasoning\", \"Benchmark\"]"},{"id":139,"run_id":1,"domain":"aiml","arxiv_id":"2602.19624","entry_id":"","title":"Accurate Planar Tracking With Robust Re-Detection","authors":"[\"Jonas Serych\", \"Jiri Matas\"]","abstract":"We present SAM-H and WOFTSAM, novel planar trackers that combine robust long-term segmentation tracking provided by SAM 2 with 8 degrees-of-freedom homography pose estimation. SAM-H estimates homographies from segmentation mask contours and is thus highly robust to target appearance changes. WOFTSAM significantly improves the current state-of-the-art planar tracker WOFT by exploiting lost target re-detection provided by SAM-H. 
The proposed methods are evaluated on POT-210 and PlanarTrack trackin","published":"2026-02-23T09:13:55+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19624v1","arxiv_url":"http://arxiv.org/abs/2602.19624v1","comment":"","source":"arxiv","github_repo":"https://github.com/serycjon/WOFTSAM","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":8.0,"composite":6.65,"summary":"SAM-H and WOFTSAM combine SAM 2 segmentation with homography estimation for robust planar tracking. WOFTSAM outperforms WOFT by +15.2pp on PlanarTrack, setting new SOTA with improved ground-truth annotations and code on GitHub.","reasoning":"Code and re-annotations available on GitHub. Practical improvement combining existing methods for SOTA tracking.","code_url":"https://github.com/serycjon/WOFTSAM","s2_tldr":"Novel planar trackers that combine robust long-term segmentation tracking provided by SAM 2 with 8 degrees-of-freedom homography pose estimation and improved ground-truth annotations of initial PlanarTrack poses are presented, enabling more accurate benchmarking in the high-precision p@5 metric.","s2_paper_id":"438ceda649a626b0c0afce3bda2d3c42e732c984","topics":"[\"3D / Vision\", \"Benchmark\"]"},{"id":144,"run_id":1,"domain":"aiml","arxiv_id":"2602.19505","entry_id":"","title":"Test-Time Computing for Referring Multimodal Large Language Models","authors":"[\"Mingrui Wu\", \"Hao Chen\", \"Jiayi Ji\", \"Xiaoshuai Sun\", \"Zhiyuan Liu\", \"Liujuan Cao\", \"Ming-Ming Cheng\", \"Rongrong Ji\"]","abstract":"We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy fun","published":"2026-02-23T04:42:10+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19505v1","arxiv_url":"http://arxiv.org/abs/2602.19505v1","comment":"arXiv admin note: substantial text overlap with arXiv:2407.21534","source":"arxiv","github_repo":"https://github.com/mrwu-mac/ControlMLLM","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":6.65,"summary":"ControlMLLM++ injects learnable visual prompts into frozen MLLMs for test-time region-based reasoning. It optimizes latent visual tokens via task-specific energy functions with improved optimization strategy and prompt debiasing, supporting diverse prompt types (boxes, masks, scribbles, points).","reasoning":"Code on GitHub. Test-time adaptation for MLLMs is practical but incremental. 
Visual prompting is emerging; multiple prompt types increase usability.","code_url":"https://github.com/mrwu-mac/ControlMLLM","s2_tldr":null,"s2_paper_id":"b6b417f5374187447f3f960db021a1fc9ca0c91d","topics":"[\"Language Models\", \"Multimodal\", \"Reasoning\"]"},{"id":146,"run_id":1,"domain":"aiml","arxiv_id":"2602.19418","entry_id":"","title":"PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention","authors":"[\"Hefei Mei\", \"Zirui Wang\", \"Chang Xu\", \"Jianyuan Guo\", \"Minjing Dong\"]","abstract":"Large Vision-Language Models (LVLMs) are foundational to modern multimodal applications, yet their susceptibility to adversarial attacks remains a critical concern. Prior white-box attacks rarely generalize across tasks, and black-box methods depend on expensive transfer, which limits efficiency. The vision encoder, standardized and often shared across LVLMs, provides a stable gray-box pivot with strong cross-model transfer. Building on this premise, we introduce PA-Attack (Prototype-Anchored At","published":"2026-02-23T01:20:43+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19418v1","arxiv_url":"http://arxiv.org/abs/2602.19418v1","comment":"","source":"arxiv","github_repo":"https://github.com/hefeimei06/PA-Attack","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":6.65,"summary":"PA-Attack introduces prototype-anchored attentive attack targeting LVLM vision encoders with gray-box access. Uses prototype guidance and two-stage attention enhancement to achieve 75.1% average score reduction rate with strong cross-model transferability and task generalization.","reasoning":"Novel attack methodology with code on GitHub. Useful for adversarial robustness research but limited direct practitioner applicability beyond security testing.","code_url":"https://github.com/hefeimei06/PA-Attack","s2_tldr":"PA-Attack is introduced, featuring a two-stage attention enhancement mechanism that leverages token-level attention scores to concentrate perturbations on critical visual tokens and adaptively recalibrates attention weights to track the evolving attention during the adversarial process.","s2_paper_id":"24e13aadf2fbbd97c01b0912efdcfddb51abd6b1","topics":"[\"Language Models\", \"Multimodal\"]"},{"id":155,"run_id":1,"domain":"aiml","arxiv_id":"2602.18941","entry_id":"","title":"Global Commander and Local Operative: A Dual-Agent Framework for Scene Navigation","authors":"[\"Kaiming Jin\", \"Yuefan Wu\", \"Shengqiong Wu\", \"Bobo Li\", \"Shuicheng Yan\", \"Tat-Seng Chua\"]","abstract":"Vision-and-Language Scene navigation is a fundamental capability for embodied human-AI collaboration, requiring agents to follow natural language instructions to execute coherent action sequences in complex environments. 
Existing approaches either rely on multiple agents, incurring high coordination and resource costs, or adopt a single-agent paradigm, which overloads the agent with both global planning and local perception, often leading to degraded reasoning and instruction drift in long-horiz","published":"2026-02-21T19:19:55+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.18941v1","arxiv_url":"http://arxiv.org/abs/2602.18941v1","comment":"18 pages, 9 figures","source":"arxiv","github_repo":"https://github.com/ChocoWu/DACo","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":6.65,"summary":"DACo introduces a dual-agent architecture for vision-and-language navigation: a Global Commander for strategic planning and Local Operative for execution. Achieves 4.9-6.5% improvements over baselines in zero-shot settings, generalizing across both closed (GPT-4o) and open-source (Qwen-VL) backbones.","reasoning":"GitHub code available, novel decoupling approach for navigation. Practical zero-shot performance improvements and backbone flexibility boost applicability. No explicit mention of open weights.","code_url":"https://github.com/ChocoWu/DACo","s2_tldr":"By disentangling global reasoning from local action, DACo alleviates cognitive overload and improves long-horizon stability, and provides a principled and extensible paradigm for robust long-horizon navigation.","s2_paper_id":"f7497c525aa63bd173d6ca878683d70ef47a93a3","topics":"[\"Agents\", \"Robotics\", \"Reasoning\"]"},{"id":158,"run_id":1,"domain":"aiml","arxiv_id":"2602.21201","entry_id":"","title":"Aletheia tackles FirstProof autonomously","authors":"[\"Tony Feng\", \"Junehyuk Jung\", \"Sang-hyun Kim\", \"Carlo Pagano\", \"Sergei Gukov\", \"Chiang-Chiang Tsai\", \"David Woodruff\", \"Adel Javanmard\", \"Aryan Mokhtari\", \"Dawsen Hwang\"]","abstract":"We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as","published":"2026-02-24T18:56:10+00:00","categories":"[\"cs.AI\", \"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.21201v1","arxiv_url":"http://arxiv.org/abs/2602.21201v1","comment":"34 pages. Project page: https://github.com/google-deepmind/superhuman/tree/main/aletheia","source":"both","github_repo":"https://github.com/google-deepmind/superhuman","github_stars":null,"hf_upvotes":2,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":6.65,"summary":"Aletheia, powered by Gemini 3 Deep Think, autonomously solved 6 out of 10 problems in the inaugural FirstProof challenge. Full transparency provided with raw prompts and outputs available on GitHub, demonstrating mathematical research agent capabilities.","reasoning":"Code/prompts available on GitHub with 2 HF upvotes. Demonstrates practical mathematical reasoning capabilities. 
Strong transparency with disclosed experiments but built on proprietary Gemini model.","code_url":"https://github.com/google-deepmind/superhuman/tree/main/aletheia","s2_tldr":null,"s2_paper_id":"bcde00903e22fe6f41a53953e2dce28b9f4db4a4","topics":"[\"Agents\", \"Reasoning\"]"},{"id":179,"run_id":1,"domain":"aiml","arxiv_id":"2602.18307","entry_id":"","title":"VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean","authors":"[\"Yutong Xin\", \"Qiaochu Chen\", \"Greg Durrett\", \"I\\u015fil Dillig\"]","abstract":"Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve real","published":"2026-02-20T16:05:06+00:00","categories":"[\"cs.SE\", \"cs.CL\", \"cs.LG\", \"cs.PL\"]","pdf_url":"https://arxiv.org/pdf/2602.18307v1","arxiv_url":"http://arxiv.org/abs/2602.18307v1","comment":"","source":"arxiv","github_repo":"https://github.com/utopia-group/VeriSoftBench","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[{\"id\": \"maxRyeery/VeriSoftBench\", \"likes\": 0}]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":6.65,"summary":"VeriSoftBench provides 500 Lean 4 formal verification proof obligations from real software verification repositories with preserved cross-file dependencies. Shows Mathlib-tuned provers transfer poorly to repository-centric settings, with success strongly correlated to dependency closure size.","reasoning":"Code available on GitHub with practical benchmark for software verification. Addresses real gap in formal methods evaluation. No model weights but valuable resource.","code_url":"https://github.com/utopia-group/VeriSoftBench","s2_tldr":"VeriSoftBench is introduced, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies; results show that success is strongly correlated with transitive repository dependence.","s2_paper_id":"c2f7a5082b5fd1770e0ad93b6b6b777b7f03fa70","topics":"[\"Benchmark\", \"Language Models\", \"Reasoning\"]"},{"id":189,"run_id":1,"domain":"aiml","arxiv_id":"2602.17431","entry_id":"","title":"Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study","authors":"[\"Dylan Bouchard\", \"Mohit Singh Chauhan\", \"Viren Bajaj\", \"David Skarbrevik\"]","abstract":"Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. 
We formalize several families of consis","published":"2026-02-19T15:02:29+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.17431v1","arxiv_url":"http://arxiv.org/abs/2602.17431v1","comment":"UQLM repository: https://github.com/cvs-health/uqlm","source":"arxiv","github_repo":"https://github.com/cvs-health/uqlm","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":8.0,"composite":6.65,"summary":"Introduces taxonomy for fine-grained uncertainty quantification in long-form LLM outputs, comparing consistency-based black-box scorers. Finds claim-response entailment outperforms complex claim-level scorers, claim-level beats sentence-level, and uncertainty-aware decoding significantly improves factuality in long-form generation.","reasoning":"Practical framework with code available, addresses important hallucination detection problem. Moderate novelty (systematization of existing approaches), high practical value.","code_url":"https://github.com/cvs-health/uqlm","s2_tldr":"A taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation is introduced.","s2_paper_id":"02f74553f0ad2f1dc62f8715fe3e9a436b496998","topics":"[\"Language Models\"]"},{"id":204,"run_id":1,"domain":"aiml","arxiv_id":"2602.20556","entry_id":"","title":"WildGHand: Learning Anti-Perturbation Gaussian Hand Avatars from Monocular In-the-Wild Videos","authors":"[\"Hanhui Li\", \"Xuan Huang\", \"Wanquan Liu\", \"Yuhao Cheng\", \"Long Chen\", \"Yiqiang Yan\", \"Xiaodan Liang\", \"Chenqiang Gao\"]","abstract":"Despite recent progress in 3D hand reconstruction from monocular videos, most existing methods rely on data captured in well-controlled environments and therefore degrade in real-world settings with severe perturbations, such as hand-object interactions, extreme poses, illumination changes, and motion blur. To tackle these issues, we introduce WildGHand, an optimization-based framework that enables self-adaptive 3D Gaussian splatting on in-the-wild videos and produces high-fidelity hand avatars.","published":"2026-02-24T05:14:05+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20556v1","arxiv_url":"http://arxiv.org/abs/2602.20556v1","comment":"","source":"arxiv","github_repo":"https://github.com/XuanHuang0/WildGHand","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":6.3,"summary":"WildGHand reconstructs high-fidelity 3D hand avatars from in-the-wild monocular videos using 3D Gaussian splatting with dynamic perturbation disentanglement and perturbation-aware optimization. Achieves 15.8% PSNR gain and 23.1% LPIPS reduction over base model on challenging real-world data.","reasoning":"Code and implementation available on GitHub. Solid optimization-based approach with clear improvements. 
Somewhat specialized application (hand avatar reconstruction) but demonstrates robustness to real-world perturbations.","code_url":"https://github.com/XuanHuang0/WildGHand","s2_tldr":"WildGHand is an optimization-based framework that enables self-adaptive 3D Gaussian splatting on in-the-wild videos and produces high-fidelity hand avatars; it achieves state-of-the-art performance and substantially improves over its base model across multiple metrics.","s2_paper_id":"a8852ccf5c685ece536dd5b1666b881b6d1eae5e","topics":"[\"3D / Vision\", \"Optimization\"]"},{"id":206,"run_id":1,"domain":"aiml","arxiv_id":"2602.20409","entry_id":"","title":"CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation","authors":"[\"Mainak Singha\", \"Sarthak Mehrotra\", \"Paolo Casari\", \"Subhasis Chaudhuri\", \"Elisa Ricci\", \"Biplab Banerjee\"]","abstract":"Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation b","published":"2026-02-23T23:17:12+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20409v1","arxiv_url":"http://arxiv.org/abs/2602.20409v1","comment":"Accepted in CVPR 2026","source":"arxiv","github_repo":"https://github.com/SarthakM320/CLIPoint3D","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":6.3,"summary":"CLIPoint3D introduces few-shot unsupervised 3D point cloud domain adaptation using frozen CLIP with prompt tuning, achieving 3-16% accuracy gains over baselines on PointDA-10 and GraspNetPC-10. The method projects 3D samples to depth maps, applies parameter-efficient fine-tuning, and uses entropy-guided view sampling with optimal transport alignment.","reasoning":"Code available on GitHub; novel application of CLIP to 3D domain adaptation with practical efficiency gains. However, no HuggingFace weights indicated, limiting immediate practitioner adoption.","code_url":"https://github.com/SarthakM320/CLIPoint3D","s2_tldr":"This work introduces CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP, which applies parameter-efficient fine-tuning to CLIP's encoders and designs an entropy-guided view sampling strategy for selecting confident projections.","s2_paper_id":"4471786ed9df981fd2a26c08e41384ba99ac46de","topics":"[\"3D / Vision\", \"Language Models\", \"Multimodal\"]"},{"id":212,"run_id":1,"domain":"aiml","arxiv_id":"2602.19706","entry_id":"","title":"HDR Reconstruction Boosting with Training-Free and Exposure-Consistent Diffusion","authors":"[\"Yo-Tin Lin\", \"Su-Kai Chen\", \"Hou-Ning Hu\", \"Yen-Yu Lin\", \"Yu-Lun Liu\"]","abstract":"Single LDR to HDR reconstruction remains challenging for over-exposed regions where traditional methods often fail due to complete information loss. We present a training-free approach that enhances existing indirect and direct HDR reconstruction methods through diffusion-based inpainting. 
Our method combines text-guided diffusion models with SDEdit refinement to generate plausible content in over-exposed areas while maintaining consistency across multi-exposure LDR images. Unlike previous appro","published":"2026-02-23T10:57:22+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19706v1","arxiv_url":"http://arxiv.org/abs/2602.19706v1","comment":"WACV 2026. Project page: https://github.com/EusdenLin/HDR-Reconstruction-Boosting","source":"arxiv","github_repo":"https://github.com/EusdenLin/HDR-Reconstruction-Boosting","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":4.0,"score_axis_3":8.0,"composite":6.3,"summary":"Training-free approach enhancing HDR reconstruction via diffusion-based inpainting for over-exposed regions. Combines text-guided diffusion with SDEdit refinement and iterative compensation for luminance coherence, with code on GitHub.","reasoning":"Code available, training-free makes it highly practical. Incremental application of existing diffusion techniques.","code_url":"https://github.com/EusdenLin/HDR-Reconstruction-Boosting","s2_tldr":"This work presents a training-free approach that enhances existing indirect and direct HDR reconstruction methods through diffusion-based inpainting, integrating seamlessly with existing pipelines via an iterative compensation mechanism that ensures luminance coherence across multiple exposures.","s2_paper_id":"3a2a4bc37522bc924f34a57abf16b3790f4d4234","topics":"[]"},{"id":222,"run_id":1,"domain":"aiml","arxiv_id":"2602.19206","entry_id":"","title":"GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning","authors":"[\"Zehao Deng\", \"An Liu\", \"Yan Wang\"]","abstract":"Zero-shot 3D Anomaly Detection is an emerging task that aims to detect anomalies in a target dataset without any target training data, which is particularly important in scenarios constrained by sample scarcity and data privacy concerns. While current methods adapt CLIP by projecting 3D point clouds into 2D representations, they face challenges. The projection inherently loses some geometric details, and the reliance on a single 2D modality provides an incomplete visual understanding, limiting t","published":"2026-02-22T14:30:41+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19206v1","arxiv_url":"http://arxiv.org/abs/2602.19206v1","comment":"","source":"arxiv","github_repo":"https://github.com/zhushengxinyue/GS-CLIP","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":6.3,"summary":"GS-CLIP adapts CLIP for zero-shot 3D anomaly detection through geometry-aware prompts and synergistic view learning. GDDM generates prompts with 3D geometric priors, while SRM fuses rendered and depth image features for better geometric anomaly detection.","reasoning":"Novel geometric-aware approach for 3D anomaly detection with code available. 
Addresses CLIP's limitations for 3D but narrow application.","code_url":"https://github.com/zhushengxinyue/GS-CLIP","s2_tldr":"This work proposes the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process, and shows that GS-CLIP achieves superior performance in detection.","s2_paper_id":"8a1a2c4eb5ecff94b2c1ac68a0743ae15ab3a9cc","topics":"[\"3D / Vision\", \"Benchmark\"]"},{"id":236,"run_id":1,"domain":"aiml","arxiv_id":"2602.19895","entry_id":"","title":"DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning","authors":"[\"Zhongwei Wan\", \"Yun Shen\", \"Zhihao Dou\", \"Donghao Zhou\", \"Yu Zhang\", \"Xin Wang\", \"Hui Shen\", \"Jing Xiong\", \"Chaofan Tao\", \"Zixuan Zhong\"]","abstract":"Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DS","published":"2026-02-23T14:37:01+00:00","categories":"[\"cs.LG\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19895v1","arxiv_url":"http://arxiv.org/abs/2602.19895v1","comment":"","source":"both","github_repo":"https://github.com/SUSTechBruce/DSDR","github_stars":null,"hf_upvotes":11,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":6.3,"summary":"Introduces DSDR (Dual-Scale Diversity Regularization), a novel RL framework for LLM reasoning that decomposes diversity into global and coupling components. Promotes diversity among correct trajectories while preventing entropy collapse. Provides theoretical support and shows consistent improvements in accuracy and pass@k on reasoning benchmarks. Code available on GitHub.","reasoning":"Novel approach to exploration in RLVR with code available. No weights but practical framework. Strong theoretical foundation with experimental validation.","code_url":"https://github.com/SUSTechBruce/DSDR","s2_tldr":"DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components, is proposed and experiments demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR.","s2_paper_id":"69439d553e8356da7d5a3b4f83741a2b4a7396ed","topics":"[\"Language Models\", \"Reasoning\", \"Training\"]"},{"id":239,"run_id":1,"domain":"aiml","arxiv_id":"2602.19320","entry_id":"","title":"Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations","authors":"[\"Dongming Jiang\", \"Yi Li\", \"Songtao Wei\", \"Jinxin Yang\", \"Ayushi Kishore\", \"Alysa Zhao\", \"Dingyi Kang\", \"Xu Hu\", \"Feng Chen\", \"Qiannan Li\"]","abstract":"Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. 
Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. T","published":"2026-02-22T19:50:01+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19320v1","arxiv_url":"http://arxiv.org/abs/2602.19320v1","comment":"","source":"both","github_repo":"https://github.com/FredJiang0324/Anatomy-of-Agentic-Memory","github_stars":null,"hf_upvotes":5,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":6.3,"summary":"This survey analyzes agentic memory systems for LLMs, presenting a taxonomy of memory structures and identifying key limitations including benchmark saturation, metric misalignment, backbone-dependent performance, and system overhead. Provides structured analysis connecting memory architectures to empirical failure modes.","reasoning":"Valuable survey with code available on GitHub. Practical insights for building agentic systems, though primarily analysis rather than novel method. HF presence (5 upvotes) indicates community interest.","code_url":"https://github.com/FredJiang0324/Anatomy-of-Agentic-Memory","s2_tldr":"By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.","s2_paper_id":"b11a1d01b7dd7624ba5f6d923b1f5ffe6b592776","topics":"[\"Agents\", \"Benchmark\", \"Language Models\"]"},{"id":241,"run_id":1,"domain":"aiml","arxiv_id":"2602.18734","entry_id":"","title":"Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem","authors":"[\"Lichang Song\", \"Ting Long\", \"Yi Chang\"]","abstract":"Retrieval-Augmented Generation (RAG) has demonstrated strong effectiveness in knowledge-intensive tasks by grounding language generation in external evidence. Despite its success, many existing RAG systems are built based on a ranking-centric, asymmetric dependency paradigm, where the generation quality of the generator is highly dependent on reranking results of the reranker. To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Coopera","published":"2026-02-21T06:32:36+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.18734v1","arxiv_url":"http://arxiv.org/abs/2602.18734v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":6.3,"summary":"Reformulates RAG as cooperative multi-agent decision-making problem with CoRAG framework where reranker and generator act as peer decision-makers. Jointly optimizes both components toward shared objective, improving generation stability. Demonstrates generalization on ~10K PopQA samples with model release.","reasoning":"Novel cooperative paradigm for RAG with model release planned, but limited current availability. 
Interesting architectural shift but needs broader validation.","code_url":"https://anonymous.4open.science/r/CoRAG-D63F","s2_tldr":"This work reformulates RAG as a cooperative multi-agent decision-making problem and proposes Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-makers rather than being connected through an asymmetric dependency pipeline.","s2_paper_id":"d48032843bc08c95c0fa71d82818afb34bba55b0","topics":"[\"Retrieval / RAG\", \"Agents\"]"},{"id":245,"run_id":1,"domain":"aiml","arxiv_id":"2602.18262","entry_id":"","title":"Simplifying Outcomes of Language Model Component Analyses with ELIA","authors":"[\"Aaron Louis Eidt\", \"Nils Feldhus\"]","abstract":"While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key tec","published":"2026-02-20T14:45:27+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.18262v1","arxiv_url":"http://arxiv.org/abs/2602.18262v1","comment":"EACL 2026 System Demonstrations. GitHub: https://github.com/aaron0eidt/ELIA","source":"arxiv","github_repo":"https://github.com/aaron0eidt/ELIA","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":6.3,"summary":"ELIA is an interactive web application for mechanistic interpretability that integrates attribution analysis, function vectors, and circuit tracing with AI-generated natural language explanations. User study shows interactive interfaces preferred and AI explanations bridge knowledge gap across experience levels.","reasoning":"Code on GitHub, practical tool for interpretability. Application-focused rather than novel method. Strong user validation but incremental contribution.","code_url":"https://github.com/aaron0eidt/ELIA","s2_tldr":"ELIA, an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience, is presented; the study concludes that an AI system can indeed simplify complex model analyses, but that its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.","s2_paper_id":"3e549fb27f489711f28909c65e8e3ba8a17db226","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":252,"run_id":1,"domain":"aiml","arxiv_id":"2602.17588","entry_id":"","title":"Modeling Distinct Human Interaction in Web Agents","authors":"[\"Faria Huq\", \"Zora Zhiruo Wang\", \"Zhanqiu Guo\", \"Venu Arvind Arangarajan\", \"Tianyue Ou\", \"Frank Xu\", \"Shuyan Zhou\", \"Graham Neubig\", \"Jeffrey P. Bigham\"]","abstract":"Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. 
In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset o","published":"2026-02-19T18:11:28+00:00","categories":"[\"cs.CL\", \"cs.HC\"]","pdf_url":"https://arxiv.org/pdf/2602.17588v1","arxiv_url":"http://arxiv.org/abs/2602.17588v1","comment":"Preprint","source":"both","github_repo":"https://github.com/oaishi/PlowPilot","github_stars":null,"hf_upvotes":3,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":6.3,"summary":"Introduces CowCorpus, a dataset of 400 user web navigation trajectories with 4,200+ interleaved actions, and trains LMs to predict user interventions based on interaction styles. Deployment in live agents shows 61.4-63.4% improvement in intervention prediction and 26.5% increase in user-rated usefulness.","reasoning":"Practical contribution with code and dataset, addresses real human-agent interaction challenges. Moderate novelty (applied ML to UX problem).","code_url":"https://github.com/oaishi/PlowPilot","s2_tldr":"This work introduces the task of modeling human intervention to support collaborative web task execution and trains language models to anticipate when users are likely to intervene based on their interaction styles, showing structured modeling of human intervention leads to more adaptive, collaborative agents.","s2_paper_id":"64f868cc66f0e92fdc51417fbe7ad0e7f7a3d42c","topics":"[\"Agents\", \"Training\", \"Benchmark\"]"},{"id":253,"run_id":1,"domain":"aiml","arxiv_id":"2602.17526","entry_id":"","title":"The Anxiety of Influence: Bloom Filters in Transformer Attention Heads","authors":"[\"Peter Balogh\"]","abstract":"Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question \"has this token appeared before in the context?\" We identify these heads across four language models (GPT-2 small, medium, and large; Pythia-160M) and show that they form a spectrum of membership-testing strategies. Two heads (L0H1 and L0H5 in GPT-2 small) function as high-precision membership filters with false positive rates of 0-4\\% even at 180 unique context tokens -- we","published":"2026-02-19T16:37:16+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17526v1","arxiv_url":"http://arxiv.org/abs/2602.17526v1","comment":"13 pages, 8 figures, code at https://github.com/pbalogh/anxiety-of-influence v2: L3H0 reclassified as prefix-attention head following confound control. Capacity analysis updated. Duplicate-token head overlap experiment added v3: All experiments were independently validated on CPU to rule out hardware-specific computation artifacts. Results are consistent across backends","source":"arxiv","github_repo":"https://github.com/pbalogh/anxiety-of-influence","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":7.0,"score_axis_3":5.0,"composite":6.3,"summary":"Identifies transformer attention heads that function as membership testers with Bloom filter-like properties across GPT-2 and Pythia models. Two heads achieve 0-4% false positive rates at 180 tokens, while another follows theoretical Bloom filter capacity curves with R\u00b2=1.0, revealing a multi-resolution membership-testing system in early layers.","reasoning":"Novel mechanistic interpretability finding with code available. 
Limited immediate practical applicability but important for understanding transformer internals.","code_url":"https://github.com/pbalogh/anxiety-of-influence","s2_tldr":"Three genuine membership-testing heads form a multi-resolution system concentrated in early layers (0-1), taxonomically distinct from induction and previous-token heads, with false positive rates that decay monotonically with embedding distance -- consistent with distance-sensitive Bloom filters.","s2_paper_id":"33781935be06f06efbf58ac5e4f9ad2f4460ba02","topics":"[\"Architecture\", \"Language Models\"]"},{"id":5,"run_id":1,"domain":"aiml","arxiv_id":"2602.21204","entry_id":"","title":"Test-Time Training with KV Binding Is Secretly Linear Attention","authors":"[\"Junchen Liu\", \"Sven Elflein\", \"Or Litany\", \"Zan Gojcic\", \"Ruilong Li\"]","abstract":"Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model","published":"2026-02-24T18:59:30+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21204v1","arxiv_url":"http://arxiv.org/abs/2602.21204v1","comment":"Webpage: https://research.nvidia.com/labs/sil/projects/tttla/","source":"both","github_repo":"","github_stars":null,"hf_upvotes":19,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":9.0,"score_axis_3":7.0,"composite":6.2,"summary":"Reframes Test-Time Training (TTT) with KV binding as a form of learned linear attention rather than online meta-learning. This new perspective enables architectural simplifications, fully parallel formulations, and systematic reduction of TTT variants to standard linear attention, improving both efficiency and understanding of the mechanism.","reasoning":"Paradigm-shifting theoretical contribution that reinterprets an existing architecture class. Moderate code/weights score due to no open weights or implementation provided despite high HF interest.","code_url":null,"s2_tldr":"Overall, the results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity, which enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form.","s2_paper_id":"c7f6958b3f9477a1f76d526af59d5b374e8bd9f2","topics":"[]"},{"id":9,"run_id":1,"domain":"aiml","arxiv_id":"2602.20496","entry_id":"","title":"Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching","authors":"[\"Jintu Zheng\", \"Qizhe Liu\", \"HuangXin Xu\", \"Zhuojie Chen\"]","abstract":"While iterative stereo matching achieves high accuracy, its dependence on Recurrent Neural Networks (RNN) hinders edge deployment, a challenge underexplored in existing research. We analyze iterative refinement and reveal that disparity updates are spatially sparse and temporally redundant. 
First, we introduce a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference. Second, we propose a ","published":"2026-02-24T02:51:37+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20496v1","arxiv_url":"http://arxiv.org/abs/2602.20496v1","comment":"Accepted to CVPR 2026 (3D vision track)","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":9.0,"composite":6.2,"summary":"Pip-Stereo enables real-time stereo matching on edge devices via progressive iteration pruning, monocular prior transfer, and FlashGRU (hardware-aware RNN operator). Achieves 75ms on Jetson Orin NX at 640\u00d7320 with 7.28x speedup and 76.6% memory reduction through structured sparsity.","reasoning":"Code available (CVPR 2026 accepted). Novel iteration pruning strategy with exceptional edge deployment results. Very high practical applicability for embedded vision systems but unclear on open weights.","code_url":null,"s2_tldr":"This work introduces a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference and proposes a collaborative monocular prior transfer framework that implicitly embeds depth priors without requiring a dedicated monocular encoder.","s2_paper_id":"dc4ccd2fd84640ae71b66edb7bbef35c06ecaa7b","topics":"[\"Optimization\", \"Efficiency\"]"},{"id":14,"run_id":1,"domain":"aiml","arxiv_id":"2602.21193","entry_id":"","title":"On Data Engineering for Scaling LLM Terminal Capabilities","authors":"[\"Renjie Pi\", \"Grace Lam\", \"Mohammad Shoeybi\", \"Pooya Jannaty\", \"Bryan Catanzaro\", \"Wei Ping\"]","abstract":"Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training stra","published":"2026-02-24T18:51:04+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.21193v1","arxiv_url":"http://arxiv.org/abs/2602.21193v1","comment":"","source":"both","github_repo":"","github_stars":null,"hf_upvotes":60,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":9.0,"composite":6.2,"summary":"Nemotron-Terminal presents Terminal-Task-Gen for synthetic terminal task generation and trains models (8B, 14B, 32B) achieving 13-27.4% performance on Terminal-Bench 2.0. Open-sources model checkpoints and synthetic datasets at HuggingFace, with comprehensive data engineering analysis including curriculum learning and scaling behavior.","reasoning":"High score: 60 HF upvotes, both source, open-source models and datasets. Practical synthetic data generation pipeline with strong performance gains. 
Detailed training strategies disclosed.","code_url":null,"s2_tldr":"A systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior.","s2_paper_id":"24aa45a9d95530701963400e2aa27ca1c642587f","topics":"[\"Language Models\"]"},{"id":264,"run_id":1,"domain":"aiml","arxiv_id":"2602.20792","entry_id":"","title":"SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking","authors":"[\"Muhammad Saif Ullah Khan\", \"Didier Stricker\"]","abstract":"Modeling spinal motion is fundamental to understanding human biomechanics, yet remains underexplored in computer vision due to the spine's complex multi-joint kinematics and the lack of large-scale 3D annotations. We present a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Using this framework, we create the first open dataset, named SIMSPINE, which provides sparse","published":"2026-02-24T11:31:20+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20792v1","arxiv_url":"http://arxiv.org/abs/2602.20792v1","comment":"Accepted at CVPR 2026","source":"both","github_repo":"https://github.com/dfki-av/simspine","github_stars":null,"hf_upvotes":2,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":5.95,"summary":"SIMSPINE introduces a biomechanics-aware simulation framework for generating anatomically consistent 3D spinal keypoint annotations at scale (2.14M frames). The paper provides pretrained baselines for 2D/3D spine pose estimation and establishes benchmarks for biomechanically valid motion analysis.","reasoning":"Code on GitHub, modest HF presence. Domain-specific (biomechanics/medical) reduces general applicability, but provides valuable open dataset. Incremental over general pose estimation.","code_url":"https://github.com/dfki-av/simspine","s2_tldr":"A biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling, and creates the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full-body motions in indoor multi-camera capture without external restraints.","s2_paper_id":"663aadc6882d91d03c120a415968fa3edae166a5","topics":"[\"3D / Vision\", \"Benchmark\"]"},{"id":279,"run_id":1,"domain":"aiml","arxiv_id":"2602.19697","entry_id":"","title":"BayesFusion-SDF: Probabilistic Signed Distance Fusion with View Planning on CPU","authors":"[\"Soumya Mazumdar\", \"Vineet Kumar Rakesh\", \"Tapas Samanta\"]","abstract":"A key part of robotics, augmented reality, and digital inspection is dense 3D reconstruction from depth observations. Traditional volumetric fusion techniques, including truncated signed distance functions (TSDF), enable efficient and deterministic geometry reconstruction; however, they depend on heuristic weighting and fail to transparently convey uncertainty in a systematic way. 
Recent neural implicit methods, on the other hand, achieve very high fidelity but usually require substantial GPU power for opti","published":"2026-02-23T10:44:15+00:00","categories":"[\"cs.CV\", \"cs.GR\", \"cs.RO\"]","pdf_url":"https://arxiv.org/pdf/2602.19697v1","arxiv_url":"http://arxiv.org/abs/2602.19697v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":5.95,"summary":"BayesFusion-SDF presents CPU-centric probabilistic signed distance fusion using sparse Gaussian fields and heteroscedastic Bayesian formulation. Enables uncertainty-aware surface extraction and next-best-view planning with sparse linear algebra on CPU.","reasoning":"Novel probabilistic formulation for CPU-based 3D reconstruction. GitHub page linked, but the CPU-efficiency focus limits relevance for GPU-dependent applications.","code_url":"https://mazumdarsoumya.github.io/BayesFusionSDF","s2_tldr":"BayesFusion-SDF is presented, a CPU-centric probabilistic signed distance fusion framework that conceptualizes geometry as a sparse Gaussian random field with a defined posterior distribution over voxel distances that gives useful estimates of uncertainty for active sensing.","s2_paper_id":"bc15726dd3d2d3a3d1cf4687178f99ce11fb428a","topics":"[\"Agents\", \"Efficiency\", \"3D / Vision\"]"},{"id":281,"run_id":1,"domain":"aiml","arxiv_id":"2602.19596","entry_id":"","title":"Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception","authors":"[\"Yihang Tao\", \"Senkang Hu\", \"Haonan An\", \"Zhengru Fang\", \"Hangcheng Cao\", \"Yuguang Fang\"]","abstract":"Collaborative perception (CP) enables data sharing among connected and autonomous vehicles (CAVs) to enhance driving safety. However, CP systems are vulnerable to adversarial attacks where malicious agents forge false objects via feature-level perturbations. Current defensive systems use threshold-based consensus verification by comparing collaborative and ego detection results. Yet, these defenses remain vulnerable to more sophisticated attack strategies that could exploit two critical weakness","published":"2026-02-23T08:38:27+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19596v1","arxiv_url":"http://arxiv.org/abs/2602.19596v1","comment":"Accepted by CVPR'26","source":"arxiv","github_repo":"https://github.com/yihangtao/MVIG.git","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":5.95,"summary":"MVIG attack learns adaptive adversarial strategies for collaborative perception systems by capturing vulnerability knowledge via mutual view information graphs and temporal graph learning. It reduces defense success rates by up to 62% at 29.9 FPS, exposing critical security gaps in connected autonomous vehicle perception.","reasoning":"Code on GitHub; CVPR 2026. Adversarial attacks are specialized and incremental. 
Practical for security researchers but niche application (autonomous vehicles).","code_url":"https://github.com/yihangtao/MVIG.git","s2_tldr":"MVIG attack is proposed, a novel adaptive adversarial CP framework learning to capture vulnerability knowledge disclosed by different defensive CP systems from a unified mutual view information graph (MVIG) representation that combines MVIG representation with temporal graph learning to generate evolving fabrication risk maps.","s2_paper_id":"09a2d2a7086b7ad310f2520e2fc080020f3f2fba","topics":"[]"},{"id":295,"run_id":1,"domain":"aiml","arxiv_id":"2602.19086","entry_id":"","title":"Restoration-Guided Kuzushiji Character Recognition Framework under Seal Interference","authors":"[\"Rui-Yang Ju\", \"Kohei Yamashita\", \"Hirotaka Kameko\", \"Shinsuke Mori\"]","abstract":"Kuzushiji was one of the most popular writing styles in pre-modern Japan and was widely used in both personal letters and official documents. However, due to its highly cursive forms and extensive glyph variations, most modern Japanese readers cannot directly interpret Kuzushiji characters. Therefore, recent research has focused on developing automated Kuzushiji character recognition methods, which have achieved satisfactory performance on relatively clean Kuzushiji document images. However, exi","published":"2026-02-22T07:58:29+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19086v1","arxiv_url":"http://arxiv.org/abs/2602.19086v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":5.95,"summary":"RG-KCR is a three-stage framework for Kuzushiji character recognition under seal interference in historical Japanese documents. It uses YOLOv12 for detection, restoration to remove seal interference, and ViT-based classification (Metom), improving Top-1 accuracy from 93.45% to 95.33% with the restoration stage.","reasoning":"Code available, but very narrow application (historical Japanese documents). Good engineering but limited novelty and practical applicability outside specific domain.","code_url":"https://ruiyangju.github.io/RG-KCR","s2_tldr":"This work proposes a three-stage restoration-guided Kuzushiji character recognition (RG-KCR) framework specifically designed to mitigate seal interference, and constructs datasets for evaluating Kuzushiji character detection (Stage 1) and classification (Stage 3).","s2_paper_id":"0408abddfcf3cd8f19576ae2cc3bbfa2e6e5cb9c","topics":"[]"},{"id":310,"run_id":1,"domain":"aiml","arxiv_id":"2602.18915","entry_id":"","title":"AAVGen: Precision Engineering of Adeno-associated Viral Capsids for Renal Selective Targeting","authors":"[\"Mohammadreza Ghaffarzadeh-Esfahani\", \"Yousof Gheisari\"]","abstract":"Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency. Engineering capsids to overcome these hurdles is challenging due to the vast sequence space and the difficulty of simultaneously optimizing multiple functional properties. 
This complexity is compounded in the kidney, which presents unique anatomical barriers and cellular targets that require precise and efficien","published":"2026-02-21T17:46:34+00:00","categories":"[\"q-bio.QM\", \"cs.AI\", \"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.18915v1","arxiv_url":"http://arxiv.org/abs/2602.18915v1","comment":"22 pages, 6 figures, and 5 supplementary files. Corresponding author: ygheisari@med.mui.ac.ir, Kaggle notebook is available at https://www.kaggle.com/code/mohammadgh009/aavgen","source":"both","github_repo":"https://github.com/mohammad-gh009/AAVGen","github_stars":null,"hf_upvotes":1,"hf_models":"[{\"id\": \"Moreza009/AAV-Kidney-Tropism\", \"likes\": 0}, {\"id\": \"Moreza009/AAV-Thermostability\", \"likes\": 0}, {\"id\": \"Moreza009/AAV-Fitness\", \"likes\": 0}, {\"id\": \"Moreza009/AAVGen\", \"likes\": 0}]","hf_datasets":"[{\"id\": \"Moreza009/AAV_datasets\", \"likes\": 0}]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":5.95,"summary":"Presents AAVGen, a generative AI framework for de novo design of AAV capsids with enhanced kidney targeting, fitness, and thermostability. Uses a PLM with supervised fine-tuning and novel Group Sequence Policy Optimization (GSPO) with multi-objective rewards. Includes three ESM-2 regression models on HuggingFace and code on GitHub.","reasoning":"Specialized biomedical application (gene therapy vectors) with models and code available. Novel RL approach but very narrow domain reduces general practitioner applicability.","code_url":"https://github.com/mohammad-gh009/AAVGen","s2_tldr":"A generative artificial intelligence framework for de novo design of AAV capsids with enhanced multi-trait profiles that establishes a foundation for data-driven viral vector engineering, accelerating the development of next-generation AAV vectors with tailored functional characteristics.","s2_paper_id":"777cb39bfc37cd1967600a554f10f43038503e9f","topics":"[\"Optimization\"]"},{"id":322,"run_id":1,"domain":"aiml","arxiv_id":"2602.17288","entry_id":"","title":"ArXiv-to-Model: A Practical Study of Scientific LM Training","authors":"[\"Anuj Gupta\"]","abstract":"While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive va","published":"2026-02-19T11:47:30+00:00","categories":"[\"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17288v1","arxiv_url":"http://arxiv.org/abs/2602.17288v1","comment":"15 pages, 6 figures, 1 table","source":"both","github_repo":"https://github.com/kitefishai/KiteFish-A1-1.5B-Math","github_stars":null,"hf_upvotes":7,"hf_models":"[{\"id\": \"KiteFishAI/KiteFish-A1-1.5B-Math\", \"likes\": 1}]","hf_datasets":"[{\"id\": \"KiteFishAI/arxiv-tex-corpus-full\", \"likes\": 1}, {\"id\": \"KiteFishAI/arxiv-tex-corpus-medium\", \"likes\": 1}]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":3.0,"score_axis_3":8.0,"composite":5.95,"summary":"Detailed engineering case study of training a 1.36B scientific LM from raw arXiv LaTeX on 2xA100 GPUs. 
Provides transparent, reproducible pipeline covering data preprocessing, tokenization, and training stability under constrained compute, with model weights and code publicly available.","reasoning":"High code_and_weights with model on HuggingFace and code on GitHub. Lower novelty as engineering-focused case study without architectural innovation. Strong practical applicability for researchers with moderate compute budgets.","code_url":"https://github.com/kitefishai/KiteFish-A1-1.5B-Math","s2_tldr":"This work presents a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics and highlights how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors.","s2_paper_id":"4818609c84b7b21a185a915b3580929543bd7107","topics":"[\"Language Models\", \"Reasoning\"]"},{"id":18,"run_id":1,"domain":"aiml","arxiv_id":"2602.21010","entry_id":"","title":"Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design","authors":"[\"Jiannan Huang\", \"Aditya Kane\", \"Fengzhe Zhou\", \"Yunchao Wei\", \"Humphrey Shi\"]","abstract":"Real-time object detection is crucial for real-world applications as it requires high accuracy with low latency. While Detection Transformers (DETR) have demonstrated significant performance improvements, current real-time DETR models are challenging to reproduce from scratch due to excessive pre-training overheads on the backbone, constraining research advancements by hindering the exploration of novel backbone architectures. In this paper, we want to show that by using general good design, it ","published":"2026-02-24T15:29:55+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21010v1","arxiv_url":"http://arxiv.org/abs/2602.21010v1","comment":"CVPR Findings","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"Le-DETR achieves new SOTA in real-time object detection (52.9-55.1 mAP on COCO) with 80% less pre-training data than prior methods. Key innovations include EfficientNAT backbone architecture and redesigned hybrid encoder with local attention, enabling competitive performance with YOLOv12 and DEIM-D-FINE while reducing training costs. Promises open-sourced code and weights.","reasoning":"Strong practical value with SOTA real-time detection and explicitly promised open code/weights. Novel architecture design (EfficientNAT, hybrid encoder) but incremental DETR improvements.","code_url":null,"s2_tldr":"It is demonstrated that, with well-designed components, real-time DETR models can achieve strong performance without the need for complex and computationally expensive pretraining.","s2_paper_id":"a2d729c6a56e9d31932acdb9da92a28cc61f1055","topics":"[\"Architecture\", \"Efficiency\", \"3D / Vision\"]"},{"id":22,"run_id":1,"domain":"aiml","arxiv_id":"2602.20497","entry_id":"","title":"LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration","authors":"[\"Peiliang Cai\", \"Jiacheng Liu\", \"Haowen Xu\", \"Xinyu Wang\", \"Chang Zou\", \"Linfeng Zhang\"]","abstract":"Diffusion models have achieved remarkable success in image and video generation tasks. 
However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain co","published":"2026-02-24T02:53:28+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20497v1","arxiv_url":"http://arxiv.org/abs/2602.20497v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"LESA accelerates Diffusion Transformers via learnable KAN-based stage-aware predictors for feature caching. Achieves 5-6.25x speedup on FLUX.1-dev, Qwen-Image, and HunyuanVideo with minimal quality loss through multi-stage, multi-expert architecture trained on temporal feature mappings.","reasoning":"Code included in supplementary (will be on GitHub). Novel learnable predictor approach with strong practical speedups. High applicability for practitioners deploying large diffusion models.","code_url":null,"s2_tldr":"A LEarnable Stage-Aware (LESA) predictor framework based on two-stage training that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting and state-of-the-art performance on both text-to-image and text-to-video synthesis.","s2_paper_id":"673fafaebc3800ea4e92898d297125c3ffb9c58b","topics":"[\"Efficiency\", \"Video Generation\", \"Architecture\"]"},{"id":24,"run_id":1,"domain":"aiml","arxiv_id":"2602.20412","entry_id":"","title":"SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images","authors":"[\"Aayush Dhakal\", \"Subash Khanal\", \"Srikumar Sastry\", \"Jacob Arndt\", \"Philipe Ambrozio Dias\", \"Dalton Lunga\", \"Nathan Jacobs\"]","abstract":"The rapid advancement of generative models has made the detection of AI-generated images a critical challenge for both research and society. Recent works have shown that most state-of-the-art fake image detection methods overfit to their training data and catastrophically fail when evaluated on curated hard test sets with strong distribution shifts. In this work, we argue that it is more principled to learn a tight decision boundary around the real image distribution and treat the fake category ","published":"2026-02-23T23:22:41+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20412v1","arxiv_url":"http://arxiv.org/abs/2602.20412v1","comment":"Accepted to CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"SimLBR treats fake detection as learning a tight real-image boundary using Latent Blending Regularization, achieving +24.85% accuracy and +69.62% recall on Chameleon benchmark. Orders of magnitude faster training than existing methods with strong cross-generator generalization.","reasoning":"Code/models to be released on HF+GitHub (CVPR 2026 accepted). Novel framing of fake detection problem with strong practical results and efficiency. 
High applicability for content authentication but pending open release.","code_url":null,"s2_tldr":"This work argues that it is more principled to learn a tight decision boundary around the real image distribution and treat the fake category as a sink class, and proposes SimLBR, a simple and efficient framework for fake image detection using Latent Blending Regularization (LBR).","s2_paper_id":"efaab2ba6123855db263d5ff6cf7aa0c7cdf3aa1","topics":"[\"Benchmark\"]"},{"id":25,"run_id":1,"domain":"aiml","arxiv_id":"2602.20360","entry_id":"","title":"Momentum Guidance: Plug-and-Play Guidance for Flow Models","authors":"[\"Runlong Liao\", \"Jian Yu\", \"Baiyu Su\", \"Chi Zhang\", \"Lizhang Chen\", \"Qiang Liu\"]","abstract":"Flow-based generative models have become a strong framework for high-quality generative modeling, yet pretrained models are rarely used in their vanilla conditional form: conditional samples without guidance often appear diffuse and lack fine-grained detail due to the smoothing effects of neural networks. Existing guidance techniques such as classifier-free guidance (CFG) improve fidelity but double the inference cost and typically reduce sample diversity. We introduce Momentum Guidance (MG), a ","published":"2026-02-23T21:06:35+00:00","categories":"[\"cs.LG\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20360v1","arxiv_url":"http://arxiv.org/abs/2602.20360v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"Momentum Guidance (MG) introduces a plug-and-play guidance technique for flow models that extrapolates velocity using exponential moving averages of past trajectories. MG achieves 36.68% FID improvement without classifier-free guidance while maintaining single-evaluation-per-step cost, demonstrating effectiveness on ImageNet-256, Stable Diffusion 3, and FLUX.1-dev.","reasoning":"No code released yet. Novel trajectory-based guidance with strong practical applicability across major models. Significant efficiency gains make it highly relevant for practitioners.","code_url":null,"s2_tldr":"Momentum Guidance is introduced, a new dimension of guidance that leverages the ODE trajectory itself and matches the effect of standard guidance without extra computation and can further improve quality when combined with CFG.","s2_paper_id":"71b63a16fc08d79d0eb91f5c0ac4a1b948d51a5c","topics":"[]"},{"id":26,"run_id":1,"domain":"aiml","arxiv_id":"2602.20119","entry_id":"","title":"NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning","authors":"[\"Jiahui Fu\", \"Junyu Nan\", \"Lingfeng Sun\", \"Hongyu Li\", \"Jianing Qian\", \"Jennifer L. Barry\", \"Kris Kitani\", \"George Konidaris\"]","abstract":"Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. 
At the high lev","published":"2026-02-23T18:35:18+00:00","categories":"[\"cs.RO\", \"cs.AI\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20119v1","arxiv_url":"http://arxiv.org/abs/2602.20119v1","comment":"25 pages, 15 figures. Project webpage: https://nova-plan.github.io/","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"NovaPlan presents a hierarchical framework combining closed-loop VLM planning with geometrically grounded execution for zero-shot long-horizon manipulation. The system extracts keypoints and hand poses from generated videos as kinematic priors, with switching mechanism maintaining stable execution under occlusion, demonstrating complex assembly without demonstrations.","reasoning":"No code yet but strong practical demonstration. Novel closed-loop integration of VLM and video planning with geometric grounding. High applicability for real-world robotics.","code_url":null,"s2_tldr":"NovaPlan is introduced, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation and can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training.","s2_paper_id":"08b59b6b33a3332060064245ec0eda1eecd16c32","topics":"[\"Agents\", \"Robotics\", \"Video Generation\"]"},{"id":28,"run_id":1,"domain":"aiml","arxiv_id":"2602.19946","entry_id":"","title":"When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators","authors":"[\"Krzysztof Adamkiewicz\", \"Brian Moser\", \"Stanislav Frolov\", \"Tobias Christian Nauen\", \"Federico Raue\", \"Andreas Dengel\"]","abstract":"Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression. We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data","published":"2026-02-23T15:15:53+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19946v2","arxiv_url":"http://arxiv.org/abs/2602.19946v2","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"This work reveals a surprising performance regression in using modern T2I models as synthetic data generators. 
Despite improved visual fidelity, classifiers trained solely on synthetic data from newer T2I models show declining accuracy on real test data due to collapse toward aesthetic-centric distributions.","reasoning":"No code indicated; important negative result challenging assumptions about T2I progress; high practical value for practitioners considering synthetic data.","code_url":null,"s2_tldr":"The promise of synthetic data as a scalable substitute for real training sets is revisited and a surprising performance regression is uncovered, highlighting an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.","s2_paper_id":"82c90424b6477dc445901b83340bba76c461db78","topics":"[\"Image Generation\", \"Benchmark\"]"},{"id":29,"run_id":1,"domain":"aiml","arxiv_id":"2602.19872","entry_id":"","title":"GOAL: Geometrically Optimal Alignment for Continual Generalized Category Discovery","authors":"[\"Jizhou Han\", \"Chenhao Ding\", \"SongLin Dong\", \"Yuhang He\", \"Shaokun Wang\", \"Qiang Wang\", \"Yihong Gong\"]","abstract":"Continual Generalized Category Discovery (C-GCD) requires identifying novel classes from unlabeled data while retaining knowledge of known classes over time. Existing methods typically update classifier weights dynamically, resulting in forgetting and inconsistent feature alignment. We propose GOAL, a unified framework that introduces a fixed Equiangular Tight Frame (ETF) classifier to impose a consistent geometric structure throughout learning. GOAL conducts supervised alignment for labeled sam","published":"2026-02-23T14:15:56+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19872v1","arxiv_url":"http://arxiv.org/abs/2602.19872v1","comment":"Accept by AAAI 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"GOAL addresses Continual Generalized Category Discovery by introducing a fixed Equiangular Tight Frame classifier for consistent geometric structure. The framework conducts supervised and confidence-guided alignment, reducing forgetting by 16.1% and boosting novel class discovery by 3.2% over the prior method.","reasoning":"No code/weights indicated; novel ETF-based approach for continual discovery; strong practical applicability for long-horizon learning scenarios.","code_url":null,"s2_tldr":"GOAL, a unified framework that introduces a fixed Equiangular Tight Frame (ETF) classifier to impose a consistent geometric structure throughout learning, is proposed, enabling stable integration of new classes without disrupting old ones.","s2_paper_id":"1a59401d249b2cba712a494f0c582d37fbe58ac3","topics":"[\"Training\"]"},{"id":32,"run_id":1,"domain":"aiml","arxiv_id":"2602.19506","entry_id":"","title":"Relational Feature Caching for Accelerating Diffusion Transformers","authors":"[\"Byunggwan Son\", \"Jeimin Jeon\", \"Jeongwoo Choi\", \"Bumsub Ham\"]","abstract":"Feature caching approaches accelerate diffusion transformers (DiTs) by storing the output features of computationally expensive modules at certain timesteps, and exploiting them for subsequent steps to reduce redundant computations. Recent forecasting-based caching approaches employ temporal extrapolation techniques to approximate the output features with cached ones. 
Although effective, relying exclusively on temporal extrapolation still suffers from significant prediction errors, leading to pe","published":"2026-02-23T04:45:38+00:00","categories":"[\"cs.CV\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19506v1","arxiv_url":"http://arxiv.org/abs/2602.19506v1","comment":"Accepted to ICLR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"Relational feature caching (RFC) accelerates diffusion transformers by leveraging input-output relationships to predict cached features via relational feature estimation and cache scheduling. It significantly outperforms prior caching methods across various DiT models.","reasoning":"ICLR 2026; project page suggests code likely available but not explicit. Novel caching approach for DiT acceleration is practical for inference efficiency.","code_url":null,"s2_tldr":"Relational feature caching (RFC) is proposed, a novel framework that leverages the input-output relationship to enhance the accuracy of the feature prediction and introduces relational feature estimation (RFE) to estimate the magnitude of changes in the output features from the inputs, enabling more accurate feature predictions.","s2_paper_id":"3453756cb5145fd5b00a0924a661be587dcdb8cd","topics":"[\"Efficiency\", \"Architecture\"]"},{"id":35,"run_id":1,"domain":"aiml","arxiv_id":"2602.20200","entry_id":"","title":"Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation","authors":"[\"Zaijing Li\", \"Bing Hu\", \"Rui Shao\", \"Gongwei Chen\", \"Dongmei Jiang\", \"Pengwei Xie\", \"Jianye Hao\", \"Liqiang Nie\"]","abstract":"Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. Such a model typically comprises a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, its performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency. A pronounced distributional gap between isotropic noise priors and target action distributions, which increases denoising st","published":"2026-02-22T15:39:34+00:00","categories":"[\"cs.RO\", \"cs.AI\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20200v1","arxiv_url":"http://arxiv.org/abs/2602.20200v1","comment":"17 pages, 8 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"OptimusVLA introduces dual-memory augmentation for Vision-Language-Action models: Global Prior Memory replaces Gaussian noise with task-level priors, and Local Consistency Memory enforces temporal coherence. Achieves 98.6% success on LIBERO and 2.9x inference speedup over pi_0.","reasoning":"Novel dual-memory architecture with strong practical results and efficiency gains. 
No code/weights available yet but addresses key VLA limitations.","code_url":null,"s2_tldr":"OptimusVLA is introduced, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM) that replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations (NFE).","s2_paper_id":"770dd1c450a3412d3e9de8d0e1a5ca9b3e001421","topics":"[\"Multimodal\", \"Robotics\", \"Efficiency\"]"},{"id":36,"run_id":1,"domain":"aiml","arxiv_id":"2602.19163","entry_id":"","title":"JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation","authors":"[\"Kai Liu\", \"Yanhao Zheng\", \"Kai Wang\", \"Shengqiong Wu\", \"Rongjunchen Zhang\", \"Jiebo Luo\", \"Dimitrios Hatzinakos\", \"Ziwei Liu\", \"Hao Fei\", \"Tat-Seng Chua\"]","abstract":"AIGC has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human p","published":"2026-02-22T12:44:28+00:00","categories":"[\"cs.CV\", \"cs.MM\", \"cs.SD\"]","pdf_url":"https://arxiv.org/pdf/2602.19163v1","arxiv_url":"http://arxiv.org/abs/2602.19163v1","comment":"Accepted by ICLR 2026. Homepage: https://JavisVerse.github.io/JavisDiT2-page","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"JavisDiT++ achieves unified joint audio-video generation through MS-MoE for cross-modal interaction, TA-RoPE for frame-level synchronization, and AV-DPO for human preference alignment. Achieves SOTA with only ~1M training samples, with code/model/dataset release planned.","reasoning":"Strong practical results, novel architecture components, and promising release. ICLR 2026 accepted. Code/weights coming soon.","code_url":null,"s2_tldr":"This paper introduces a modality-specific mixture-of-experts (MS-MoE) design that enables cross-modal interaction efficacy while enhancing single-modal generation quality, and proposes a temporal-aligned RoPE strategy to achieve explicit, frame-level synchronization between audio and video tokens.","s2_paper_id":"1fe826c3e75ceff8beaf78e0ef63f3748bfd0245","topics":"[\"Video Generation\", \"Optimization\", \"Image Generation\"]"},{"id":38,"run_id":1,"domain":"aiml","arxiv_id":"2602.19083","entry_id":"","title":"ChordEdit: One-Step Low-Energy Transport for Image Editing","authors":"[\"Liangsi Lu\", \"Xuhang Chen\", \"Minzhe Guo\", \"Shichu Li\", \"Jingchao Wang\", \"Yang Shi\"]","abstract":"The advent of one-step text-to-image (T2I) models offers unprecedented synthesis speed. However, their application to text-guided image editing remains severely hampered, as forcing existing training-free editors into a single inference step fails. This failure manifests as severe object distortion and a critical loss of consistency in non-edited regions, resulting from the high-energy, erratic trajectories produced by naive vector arithmetic on the models' structured fields. 
To address this pro","published":"2026-02-22T07:40:50+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19083v1","arxiv_url":"http://arxiv.org/abs/2602.19083v1","comment":"Accepted by CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"ChordEdit enables one-step text-guided image editing by reformulating it as a transport problem using dynamic optimal transport theory. It derives a low-energy control strategy that produces smoothed, variance-reduced editing fields, achieving real-time editing with high fidelity on one-step T2I models.","reasoning":"Novel theoretical approach using optimal transport for editing, CVPR 2026 accepted. Highly practical for real-time applications, but no code/weights available.","code_url":null,"s2_tldr":"ChordEdit is introduced, a model agnostic, training-free, and inversion-free method that facilitates high-fidelity one-step editing on T2I models and recast editing as a transport problem between the source and target distributions defined by the source and target text prompts.","s2_paper_id":"072156ff8d995993685db7fa9c702e0142c200af","topics":"[\"Image Generation\"]"},{"id":39,"run_id":1,"domain":"aiml","arxiv_id":"2602.19063","entry_id":"","title":"Direction-aware 3D Large Multimodal Models","authors":"[\"Quan Liu\", \"Weihao Xuan\", \"Junjue Wang\", \"Naoto Yokoya\", \"Ling Shao\", \"Shijian Lu\"]","abstract":"3D large multimodal models (3D LMMs) rely heavily on ego poses for enabling directional question-answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, making them inherently ill-posed in 3D large multimodal modelling. In this work, we redefine a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying and supplementing ego poses into point cloud benchmarks and transforming the cor","published":"2026-02-22T06:31:28+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19063v1","arxiv_url":"http://arxiv.org/abs/2602.19063v1","comment":"In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"This work enables direction-aware 3D LMMs by introducing PoseRecover (automatic ego pose recovery pipeline) and PoseAlign (point cloud transformation to ego poses). It improves ScanRefer mIoU by 30% and Scan2Cap accuracy by 11.7% across multiple 3D LMM backbones with only instruction tuning required.","reasoning":"Novel paradigm for 3D LMMs with substantial improvements, CVPR 2026. 
Very practical and training-efficient, but no code/weights available.","code_url":null,"s2_tldr":"This work redefine a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying and supplementing ego poses into point cloud benchmarks and transforming the corresponding point cloud data according to the identified ego poses.","s2_paper_id":"dad90ef9f1b533f9e23ddb7930a0a99bb89013b1","topics":"[\"Multimodal\", \"3D / Vision\", \"Reasoning\"]"},{"id":42,"run_id":1,"domain":"aiml","arxiv_id":"2602.18993","entry_id":"","title":"SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models","authors":"[\"Jiwoo Chung\", \"Sangeek Hyun\", \"MinKyu Lee\", \"Byeongju Han\", \"Geonho Cha\", \"Dongyoon Wee\", \"Youngjun Hong\", \"Jae-Pil Heo\"]","abstract":"Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure appears early and high-frequency detail is ","published":"2026-02-22T00:48:03+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.18993v1","arxiv_url":"http://arxiv.org/abs/2602.18993v1","comment":"Accepted to CVPR 2026 Main. Project page:https://jiwoogit.github.io/SeaCache","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"SeaCache accelerates diffusion models through spectral-evolution-aware caching that bases reuse on spectrally-filtered representations rather than raw features. It derives SEA filters preserving content while suppressing noise, achieving SOTA latency-quality trade-offs through dynamic, content-adaptive schedules respecting spectral priors.","reasoning":"Novel spectral approach to diffusion acceleration, CVPR 2026. Training-free and practical for inference optimization, but no code available despite project page.","code_url":null,"s2_tldr":"Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation that preserves content-relevant components while suppressing noise, is introduced.","s2_paper_id":"53d65f8729eb4775883b7d42f7c19fa97d061d31","topics":"[\"Efficiency\"]"},{"id":43,"run_id":1,"domain":"aiml","arxiv_id":"2602.18904","entry_id":"","title":"PCA-VAE: Differentiable Subspace Quantization without Codebook Collapse","authors":"[\"Hao Lu\", \"Onur C. Koyun\", \"Yongxin Guo\", \"Zhengjie Zhu\", \"Abbas Alili\", \"Metin Nafi Gurcan\"]","abstract":"Vector-quantized autoencoders deliver high-fidelity latents but suffer inherent flaws: the quantizer is non-differentiable, requires straight-through hacks, and is prone to collapse. We address these issues at the root by replacing VQ with a simple, principled, and fully differentiable alternative: an online PCA bottleneck trained via Oja's rule. The resulting model, PCA-VAE, learns an orthogonal, variance-ordered latent basis without codebooks, commitment losses, or lookup noise. 
Despite its si","published":"2026-02-21T16:57:58+00:00","categories":"[\"cs.LG\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.18904v1","arxiv_url":"http://arxiv.org/abs/2602.18904v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":8.0,"score_axis_3":7.0,"composite":5.85,"summary":"PCA-VAE replaces vector quantization with fully differentiable PCA bottleneck trained via Oja's rule, eliminating codebook collapse and straight-through hacks. Exceeds VQ-GAN and SimVQ reconstruction quality while using 10-100x fewer latent bits and producing interpretable dimensions without disentanglement objectives.","reasoning":"Paradigm-shifting alternative to VQ with strong theoretical grounding and practical results. No code/weights shared explicitly. High novelty in replacing fundamental VQ approach with PCA.","code_url":null,"s2_tldr":"The results suggest that PCA is a viable replacement for VQ: mathematically grounded, stable, bit-efficient, and semantically structured, offering a new direction for generative models beyond vector quantization.","s2_paper_id":"3f16dcc4c4dafc6c02deb47aa1cd610b140df097","topics":"[\"Efficiency\"]"},{"id":44,"run_id":1,"domain":"aiml","arxiv_id":"2602.20743","entry_id":"","title":"Adaptive Text Anonymization: Learning Privacy-Utility Trade-offs via Prompt Optimization","authors":"[\"Gabriel Loiseau\", \"Damien Sileo\", \"Damien Riquet\", \"Maxime Meyer\", \"Marc Tommasi\"]","abstract":"Anonymizing textual documents is a highly context-sensitive problem: the appropriate balance between privacy protection and utility preservation varies with the data domain, privacy objectives, and downstream application. However, existing anonymization methods rely on static, manually designed strategies that lack the flexibility to adjust to diverse requirements and often fail to generalize across domains. We introduce adaptive text anonymization, a new task formulation in which anonymization ","published":"2026-02-24T10:12:40+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20743v1","arxiv_url":"http://arxiv.org/abs/2602.20743v1","comment":"","source":"both","github_repo":"","github_stars":null,"hf_upvotes":1,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"Introduces adaptive text anonymization via task-specific prompt optimization, automatically constructing anonymization instructions for LMs based on privacy-utility requirements. Framework achieves better privacy-utility trade-offs than baselines across five diverse datasets while remaining computationally efficient.","reasoning":"Novel adaptive framework addressing real privacy concerns with strong empirical results. 
HF presence (1 upvote) and practical applicability boost score.","code_url":null,"s2_tldr":"This work proposes a framework for task-specific prompt optimization that automatically constructs anonymization instructions for language models, enabling adaptation to different privacy goals, domains, and downstream usage patterns, and shows that the framework consistently achieves a better privacy-utility trade-off than existing baselines.","s2_paper_id":"bf0eb1400563499b293f0f2e432a848bab6e9384","topics":"[\"Optimization\"]"},{"id":45,"run_id":1,"domain":"aiml","arxiv_id":"2602.20727","entry_id":"","title":"ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition","authors":"[\"Xindian Ma\", \"Rundong Kong\", \"Peng Zhang\", \"Ruoxiang Huang\", \"Yongyu Jiang\"]","abstract":"LoRA has become a universal Parameter-Efficient Fine-Tuning (PEFT) technique that equips Large Language Models (LLMs) to adapt quickly to new tasks. However, when these models are scaled up, even the latest LoRA variants still introduce considerable overhead in trainable parameters. Conversely, aggressively lowering the rank to curb this overhead markedly degrades performance in complex multi-task settings. We propose ID-LoRA, a novel PEFT framework that breaks the trade-off. Its core innovation","published":"2026-02-24T09:45:10+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20727v1","arxiv_url":"http://arxiv.org/abs/2602.20727v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"ID-LoRA extracts clustered parameter groups from pretrained weights to form multiple low-rank components sharing a single trainable matrix. Outperforms LoRA and recent variants while using 46-54% fewer trainable parameters across diverse benchmarks.","reasoning":"Novel PEFT approach with strong empirical results and significant parameter efficiency gains. High practical value for fine-tuning, but no code/weights released.","code_url":null,"s2_tldr":"In multi-task scenarios, ID-LoRA surpasses LoRA and its recent variants on both Code and MMLU tasks, yet requires only 54% of the trainable parameters demanded by the conventional LoRA.","s2_paper_id":"276674062954ef348da000449a4083629d970e8b","topics":"[\"Efficiency\", \"Language Models\"]"},{"id":52,"run_id":1,"domain":"aiml","arxiv_id":"2602.18037","entry_id":"","title":"Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards","authors":"[\"Johannes Ackermann\", \"Michael Noukhovitch\", \"Takashi Ishida\", \"Masashi Sugiyama\"]","abstract":"Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the reward and learn an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. 
We propose a different framing: Train the LM in a way that biases policy updates towards regions","published":"2026-02-20T07:32:22+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18037v1","arxiv_url":"http://arxiv.org/abs/2602.18037v1","comment":"25 pages, 15 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"Proposes gradient regularization to prevent reward hacking in RLHF/RLVR by biasing policy updates toward flatter regions where reward models are more accurate. Shows theoretical connection between reward accuracy and optimum flatness, with empirical results demonstrating better performance than KL penalties across diverse RL experiments including GPT-judged win-rates and math tasks.","reasoning":"Novel theoretical insight with strong practical results addressing critical RLHF problem. Well-validated across tasks. Missing code limits reproducibility.","code_url":null,"s2_tldr":"GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.","s2_paper_id":"aa0ad1d0ca61d790671250ed89873bc3cccc37ad","topics":"[\"Training\", \"RL\", \"Optimization\"]"},{"id":55,"run_id":1,"domain":"aiml","arxiv_id":"2602.17546","entry_id":"","title":"Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning","authors":"[\"Jyotin Goel\", \"Souvik Maji\", \"Pratik Mazumder\"]","abstract":"Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches","published":"2026-02-19T16:59:54+00:00","categories":"[\"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.17546v1","arxiv_url":"http://arxiv.org/abs/2602.17546v1","comment":"Work in progress (30 pages)","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":8.0,"composite":5.85,"summary":"Introduces adaptive regularization framework that dynamically adjusts training constraints based on safety risk during fine-tuning, using either judge-based Safety Critic or activation-based risk prediction. Consistently reduces attack success rates while preserving downstream performance with no inference-time cost.","reasoning":"Novel safety approach with practical benefits (no inference cost), but no code/weights mentioned. 
Addresses critical safety-utility tradeoff in LLM deployment.","code_url":null,"s2_tldr":"This work introduces a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning, and demonstrates a principled mechanism for maintaining safety without sacrificing utility.","s2_paper_id":"1e6872bbfbb77b3d1e59216d83864b565114e281","topics":"[\"Language Models\"]"},{"id":336,"run_id":1,"domain":"aiml","arxiv_id":"2602.19467","entry_id":"","title":"Can Large Language Models Replace Human Coders? Introducing ContentBench","authors":"[\"Michael Haman\"]","abstract":"Can low-cost large language models (LLMs) take over the interpretive coding work that still anchors much of empirical content analysis? This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks. The suite uses versioned tracks that invite researchers to contribute new benchmark datasets. I report results from the first track, ContentBench-Re","published":"2026-02-23T03:26:17+00:00","categories":"[\"cs.CY\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19467v1","arxiv_url":"http://arxiv.org/abs/2602.19467v1","comment":"Project website: https://contentbench.github.io","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":5.6,"summary":"ContentBench introduces a benchmark for evaluating low-cost LLMs on interpretive coding tasks, specifically content analysis of social media posts. Results show top models achieve 97-99% agreement with reference labels for as little as a few dollars per 50,000 posts, though small open-weight models still struggle with nuanced content like sarcasm.","reasoning":"Useful benchmark for practitioners evaluating LLM annotation capabilities, but primarily an evaluation framework rather than a novel method. Limited code/weights availability, moderate novelty as benchmarking work.","code_url":"https://contentbench.github.io","s2_tldr":"ContentBench is a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.","s2_paper_id":"ecad6527092992e8ecb6a4142b2394e96c4beaf2","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":58,"run_id":1,"domain":"aiml","arxiv_id":"2602.21101","entry_id":"","title":"Event-Aided Sharp Radiance Field Reconstruction for Fast-Flying Drones","authors":"[\"Rong Zou\", \"Marco Cannici\", \"Davide Scaramuzza\"]","abstract":"Fast-flying aerial robots promise rapid inspection under limited battery constraints, with direct applications in infrastructure inspection, terrain exploration, and search and rescue. However, high speeds lead to severe motion blur in images and induce significant drift and noise in pose estimates, making dense 3D reconstruction with Neural Radiance Fields (NeRFs) particularly challenging due to their high sensitivity to such degradations. 
In this work, we present a unified framework that lever","published":"2026-02-24T17:02:56+00:00","categories":"[\"cs.CV\", \"cs.RO\"]","pdf_url":"https://arxiv.org/pdf/2602.21101v1","arxiv_url":"http://arxiv.org/abs/2602.21101v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"Leverages event cameras alongside motion-blurred frames to reconstruct high-fidelity Neural Radiance Fields from fast-flying drones. Embeds event-image fusion into NeRF optimization and jointly refines visual-inertial odometry, recovering sharp radiance fields with 50% performance gain over SOTA despite severe motion blur.","reasoning":"Novel fusion of event cameras with NeRF for extreme motion scenarios. Strong practical applicability for drone inspection but no code/weights provided.","code_url":null,"s2_tldr":"This work presents a unified framework that leverages asynchronous event streams alongside motion-blurred frames to reconstruct high-fidelity radiance fields from agile drone flights and preserves fine scene details, delivering a performance gain of over 50% on real-world data compared to state-of-the-art methods.","s2_paper_id":"783816f4efaa0d71a1257302122044ddb6a5843d","topics":"[\"3D / Vision\", \"Robotics\"]"},{"id":59,"run_id":1,"domain":"aiml","arxiv_id":"2602.21054","entry_id":"","title":"VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation","authors":"[\"Seongheon Park\", \"Changdae Oh\", \"Hyeong Kyu Choi\", \"Xuefeng Du\", \"Sharon Li\"]","abstract":"Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model's ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that expl","published":"2026-02-24T16:11:14+00:00","categories":"[\"cs.CV\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.21054v1","arxiv_url":"http://arxiv.org/abs/2602.21054v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"VAUQ is a vision-aware uncertainty quantification framework for LVLM self-evaluation. Introduces Image-Information Score that measures output dependence on visual evidence and unsupervised core-region masking. Training-free scoring function combining predictive entropy with masked IS reliably reflects answer correctness across multiple datasets.","reasoning":"Novel approach to uncertainty quantification for VLMs with strong practical value for safe deployment. 
Training-free is advantageous but no code/weights provided.","code_url":null,"s2_tldr":"VAUQ is proposed, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model's output depends on visual evidence and introduces the Image-Information Score, which captures the reduction in predictive uncertainty attributable to visual input.","s2_paper_id":"b72bb31b74fd016f47eec75faee74f8f6e820b49","topics":"[\"Benchmark\", \"Language Models\", \"Multimodal\"]"},{"id":60,"run_id":1,"domain":"aiml","arxiv_id":"2602.20985","entry_id":"","title":"EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer","authors":"[\"Munish Monga\", \"Vishal Chudasama\", \"Pankaj Wasnik\", \"C. V. Jawahar\"]","abstract":"Real-world object detection must operate in evolving environments where new classes emerge, domains shift, and unseen objects must be identified as \"unknown\": all without accessing prior data. We introduce Evolving World Object Detection (EWOD), a paradigm coupling incremental learning, domain adaptation, and unknown detection under exemplar-free constraints. To tackle EWOD, we propose EW-DETR framework that augments DETR-based detectors with three synergistic modules: Incremental LoRA Adapters ","published":"2026-02-24T15:06:04+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20985v1","arxiv_url":"http://arxiv.org/abs/2602.20985v1","comment":"Accepted at CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"EW-DETR tackles Evolving World Object Detection combining incremental learning, domain adaptation, and unknown detection without storing prior data. Uses Incremental LoRA adapters with DETR-based detectors and introduces FOGS metric for holistic evaluation. Achieves 57.24% improvement in FOGS score on benchmarks.","reasoning":"Novel paradigm addressing practical evolving-world scenarios with strong performance. No code/weights commitment found despite acceptance at CVPR 2026.","code_url":null,"s2_tldr":"This work proposes EW-DETR framework that augments DETR-based detectors with three synergistic modules: Incremental LoRA Adapters for exemplar-free incremental learning under evolving domains; a Query-Norm Objectness Adapter that decouples objectness-aware features from DETR decoder queries; and Entropy-Aware Unknown Mixing for calibrated unknown detection.","s2_paper_id":"f10fb90a87a38a3b3455b98dca8e59628a5ff3e1","topics":"[\"3D / Vision\", \"Architecture\"]"},{"id":64,"run_id":1,"domain":"aiml","arxiv_id":"2602.20630","entry_id":"","title":"From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection","authors":"[\"Yepeng Liu\", \"Hao Li\", \"Liwen Yang\", \"Fangzhen Li\", \"Xudi Ge\", \"Yuliang Gu\", \"Kuang Gao\", \"Bing Wang\", \"Guang Chen\", \"Hangjun Ye\"]","abstract":"Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. 
We introduce TraqPoint, a novel, end-to-end Reinforcement Learn","published":"2026-02-24T07:24:25+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20630v1","arxiv_url":"http://arxiv.org/abs/2602.20630v1","comment":"Accepted by CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"TraqPoint reframes keypoint detection as sequential RL problem, optimizing track-quality across image sequences via track-aware policy gradients. Outperforms SOTA on sparse matching, relative pose estimation, and 3D reconstruction benchmarks.","reasoning":"No code/weights. Novel RL formulation for fundamental vision problem. High practical value for SLAM/SfM pipelines but lack of released implementation limits immediate impact.","code_url":null,"s2_tldr":"TraqPoint is introduced, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes, guided by a policy gradient method.","s2_paper_id":"3f1413731eed6b12962d384c92a895ec7931138c","topics":"[\"Optimization\", \"RL\", \"3D / Vision\"]"},{"id":65,"run_id":1,"domain":"aiml","arxiv_id":"2602.20569","entry_id":"","title":"AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents","authors":"[\"Jiaqi Wu\", \"Yuchen Zhou\", \"Muduo Xu\", \"Zisheng Liang\", \"Simiao Ren\", \"Jiayu Xue\", \"Meige Yang\", \"Siying Chen\", \"Jingheng Huan\"]","abstract":"We present AIForge-Doc, the first dedicated benchmark targeting exclusively diffusion-model-based inpainting in financial and form documents with pixel-level annotation. Existing document forgery datasets rely on traditional digital editing tools (e.g., Adobe Photoshop, GIMP), creating a critical gap: state-of-the-art detectors are blind to the rapidly growing threat of AI-forged document fraud. AIForge-Doc addresses this gap by systematically forging numeric fields in real-world receipt and for","published":"2026-02-24T05:37:35+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20569v1","arxiv_url":"http://arxiv.org/abs/2602.20569v1","comment":"17 pages, 10 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"AIForge-Doc introduces the first benchmark for detecting AI diffusion-based inpainting in financial documents with pixel-level annotations. Shows existing detectors (TruFor, DocTamper, GPT-4o) catastrophically fail on AI-forged documents (near-chance performance), revealing a critical security gap.","reasoning":"Dataset likely available but unclear on detection model code/weights. Highly novel problem formulation with immediate practical relevance for document fraud detection. 
Important benchmark contribution.","code_url":null,"s2_tldr":"The first dedicated benchmark targeting exclusively diffusion-model-based inpainting in financial and form documents with pixel-level annotation is presented, confirming that AI-forged values are indistinguishable to automated detectors and VLMs.","s2_paper_id":"af3077fbbf36dc19b5f270a0c1bda8042106efdc","topics":"[\"Benchmark\"]"},{"id":66,"run_id":1,"domain":"aiml","arxiv_id":"2602.20566","entry_id":"","title":"BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model","authors":"[\"Haosheng Li\", \"Weixin Mao\", \"Zihan Lan\", \"Hongwei Xiong\", \"Hongan Wang\", \"Chenyang Si\", \"Ziwei Liu\", \"Xiaoming Deng\", \"Hua Chen\"]","abstract":"Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision Language Models (VLMs) to jointly interpret instructions and visual inputs. However, the substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. Existing acceleration techniques for VLMs, such as token pruning, often result in degraded performance when directly applied to VLA models, as they overlook the relationshi","published":"2026-02-24T05:31:52+00:00","categories":"[\"cs.RO\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20566v1","arxiv_url":"http://arxiv.org/abs/2602.20566v1","comment":"9 pages, 10 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":8.0,"composite":5.5,"summary":"BFA++ introduces hierarchical token pruning for multi-view VLA models with intra-view and inter-view importance predictors. Achieves 1.8x speedup on \u03c00 and 1.5x on RDT while improving success rates by ~10% on robotic manipulation tasks.","reasoning":"No code/weights provided. Novel hierarchical pruning strategy for VLA models with strong practical results on real robots. High applicability for robotics practitioners but lacks open implementation.","code_url":null,"s2_tldr":"BFA++ is proposed, a dynamic token pruning framework designed specifically for VLA models that highlights that context-sensitive and task-aware token pruning serves as a more effective strategy than full visual processing, enabling faster inference and improved manipulation accuracy in real-world robotic systems.","s2_paper_id":"57f096f836819351f9f36aa51276c658c4d72fd8","topics":"[\"Multimodal\", \"Language Models\", \"Efficiency\"]"},{"id":67,"run_id":1,"domain":"aiml","arxiv_id":"2602.20551","entry_id":"","title":"CAD-Prompted SAM3: Geometry-Conditioned Instance Segmentation for Industrial Objects","authors":"[\"Zhenran Tang\", \"Rohan Nagabhirava\", \"Changliu Liu\"]","abstract":"Verbal-prompted segmentation is inherently limited by the expressiveness of natural language and struggles with uncommon, instance-specific, or difficult-to-describe objects: scenarios frequently encountered in manufacturing and 3D printing environments. While image exemplars provide an alternative, they primarily encode appearance cues such as color and texture, which are often unrelated to a part's geometric identity. 
In industrial settings, a single component may be produced in different mate","published":"2026-02-24T05:10:22+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20551v1","arxiv_url":"http://arxiv.org/abs/2602.20551v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"Proposes CAD-prompted segmentation using SAM3 with multi-view CAD renderings as geometry-based prompts for industrial objects. Enables instance segmentation independent of surface appearance (material, color, finish) using synthetic training data from mesh renderings.","reasoning":"No code/weights provided. Novel prompt modality (CAD geometry vs. appearance/language) with strong practical relevance for manufacturing. Lack of open implementation limits accessibility.","code_url":null,"s2_tldr":"This work proposes a CAD-prompted segmentation framework built on SAM3 that uses canonical multi-view renderings of a CAD model as prompt input, and enables single-stage, CAD-prompted mask prediction, extending promptable segmentation to objects that cannot be robustly described by language or appearance alone.","s2_paper_id":"7d3a10c5f62fdf7d5cc1d3771ad1a4ba232d3b1c","topics":"[\"3D / Vision\"]"},{"id":69,"run_id":1,"domain":"aiml","arxiv_id":"2602.20157","entry_id":"","title":"Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning","authors":"[\"Zhongxiao Cong\", \"Qitao Zhao\", \"Minsik Jeon\", \"Shubham Tulsiani\"]","abstract":"Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision -- expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences (`flow') as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from ","published":"2026-02-23T18:59:30+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20157v1","arxiv_url":"http://arxiv.org/abs/2602.20157v1","comment":"CVPR 2026. Project website: https://flow3r-project.github.io/","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"Flow3r enables scalable 3D/4D reconstruction from unlabeled monocular videos by factoring flow prediction into geometry and pose latents. Trained on ~800K unlabeled videos, it achieves SOTA across eight benchmarks for static and dynamic scenes, with largest gains on in-the-wild dynamic videos.","reasoning":"No code released yet (CVPR 2026). Novel factored flow formulation enabling training on unlabeled data addresses key data scarcity issue. 
High practical potential when released.","code_url":null,"s2_tldr":"Flow3r, a framework that augments visual geometry learning with dense 2D correspondences (`flow') as supervision, enabling scalable training from unlabeled monocular videos, achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes.","s2_paper_id":"05b7f6e0214788a8bc6a4e734ca87797d5e603da","topics":"[\"3D / Vision\"]"},{"id":71,"run_id":1,"domain":"aiml","arxiv_id":"2602.20053","entry_id":"","title":"Decoupling Defense Strategies for Robust Image Watermarking","authors":"[\"Jiahui Chen\", \"Zehang Deng\", \"Zeyu Zhang\", \"Chaoyang Li\", \"Lianchen Jia\", \"Lifeng Sun\"]","abstract":"Deep learning-based image watermarking, while robust against conventional distortions, remains vulnerable to advanced adversarial and regeneration attacks. Conventional countermeasures, which jointly optimize the encoder and decoder via a noise layer, face 2 inevitable challenges: (1) decrease of clean accuracy due to decoder adversarial training and (2) limited robustness due to simultaneous training of all three advanced attacks. To overcome these issues, we propose AdvMark, a novel two-stage ","published":"2026-02-23T17:02:55+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20053v1","arxiv_url":"http://arxiv.org/abs/2602.20053v1","comment":"CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"AdvMark proposes a two-stage fine-tuning framework for robust image watermarking that decouples defense strategies. Stage 1 uses tailored adversarial training on the encoder, while stage 2 employs constrained image optimization, achieving up to 46% accuracy improvement against adversarial attacks.","reasoning":"No code indicated; novel decoupling approach for watermarking robustness; strong practical applicability for content protection.","code_url":null,"s2_tldr":"AdvMark, a novel two-stage fine-tuning framework that decouples the defense strategies, outperforms with the highest image quality and comprehensive robustness, i.e. up to 29\\%, 33\\% and 46\\% accuracy improvement for distortion, regeneration and adversarial attacks, respectively.","s2_paper_id":"de29eede15f195957c0daefd37e2339dcc0993b7","topics":"[\"Optimization\"]"},{"id":72,"run_id":1,"domain":"aiml","arxiv_id":"2602.19916","entry_id":"","title":"Augmented Radiance Field: A General Framework for Enhanced Gaussian Splatting","authors":"[\"Yixin Yang\", \"Bojian Wu\", \"Yang Zhou\", \"Hui Huang\"]","abstract":"Due to the real-time rendering performance, 3D Gaussian Splatting (3DGS) has emerged as the leading method for radiance field reconstruction. However, its reliance on spherical harmonics for color encoding inherently limits its ability to separate diffuse and specular components, making it challenging to accurately represent complex reflections. To address this, we propose a novel enhanced Gaussian kernel that explicitly models specular effects through view-dependent opacity. Meanwhile, we intro","published":"2026-02-23T14:55:31+00:00","categories":"[\"cs.CV\", \"cs.GR\"]","pdf_url":"https://arxiv.org/pdf/2602.19916v1","arxiv_url":"http://arxiv.org/abs/2602.19916v1","comment":"Accepted to ICLR 2026. 
Project page: https://xiaoxinyyx.github.io/augs","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"This paper proposes augmented radiance fields for enhanced Gaussian Splatting by introducing view-dependent opacity to model specular effects and an error-driven compensation strategy. The method surpasses state-of-the-art NeRF methods in rendering performance with greater parameter efficiency.","reasoning":"Project page available but no explicit code/weights; novel approach to enhanced Gaussian kernels; strong practical applicability for real-time rendering.","code_url":null,"s2_tldr":"A novel enhanced Gaussian kernel that explicitly models specular effects through view-dependent opacity is proposed that not only surpasses state-of-the-art NeRF methods in rendering performance but also achieves greater parameter efficiency.","s2_paper_id":"c8a8c4dc05fd97243c1fad6ab6314d83c62c56ae","topics":"[\"3D / Vision\"]"},{"id":73,"run_id":1,"domain":"aiml","arxiv_id":"2602.19900","entry_id":"","title":"ExpPortrait: Expressive Portrait Generation via Personalized Representation","authors":"[\"Junyi Wang\", \"Yudong Guo\", \"Boyang Guo\", \"Shengming Yang\", \"Juyong Zhang\"]","abstract":"While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized details due to their sparse or low-rank representation. Therefore, existing methods based on these models struggle to accurately preserve subjec","published":"2026-02-23T14:41:35+00:00","categories":"[\"cs.CV\", \"cs.GR\"]","pdf_url":"https://arxiv.org/pdf/2602.19900v1","arxiv_url":"http://arxiv.org/abs/2602.19900v1","comment":"Accepted to CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"ExpPortrait introduces a high-fidelity personalized head representation that disentangles expression and identity for expressive portrait video generation. The method uses a DiT-based generator conditioned on sophisticated head models, outperforming previous methods in identity preservation and expression accuracy.","reasoning":"No code/weights indicated; novel personalized head representation; strong applicability for portrait video synthesis but limited ecosystem presence.","code_url":null,"s2_tldr":"A high-fidelity personalized head representation is proposed that more effectively disentangles expression and identity and outperforms previous models in terms of identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.","s2_paper_id":"6dffa046c5be8e523bda7efb443845a06c5f98f3","topics":"[]"},{"id":74,"run_id":1,"domain":"aiml","arxiv_id":"2602.19542","entry_id":"","title":"Vinedresser3D: Agentic Text-guided 3D Editing","authors":"[\"Yankuan Chi\", \"Xiang Li\", \"Zixuan Huang\", \"James M. Rehg\"]","abstract":"Text-guided 3D editing aims to modify existing 3D assets using natural-language instructions. 
Current methods struggle to jointly understand complex prompts, automatically localize edits in 3D, and preserve unedited content. We introduce Vinedresser3D, an agentic framework for high-quality text-guided 3D editing that operates directly in the latent space of a native 3D generative model. Given a 3D asset and an editing prompt, Vinedresser3D uses a multimodal large language model to infer rich des","published":"2026-02-23T06:30:36+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19542v1","arxiv_url":"http://arxiv.org/abs/2602.19542v1","comment":"CVPR 2026, Project website: https://vinedresser3d.github.io/","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"Vinedresser3D is an agentic framework for text-guided 3D editing that uses a multimodal LLM to infer edit regions and guidance, then applies inversion-based rectified-flow inpainting in 3D latent space. It outperforms baselines on diverse 3D edits with precise, coherent, mask-free results.","reasoning":"CVPR 2026; project website suggests code/demo likely available. Agentic 3D editing is novel and practical for 3D content creation. No explicit weights mentioned.","code_url":null,"s2_tldr":"Vinedresser3D, an agentic framework for high-quality text-guided 3D editing that operates directly in the latent space of a native 3D generative model, outperforms prior baselines in both automatic metrics and human preference studies, while enabling precise, coherent, and mask-free 3D editing.","s2_paper_id":"2115f8185f8b126a30bb0c7e453ee39b2d611727","topics":"[\"Agents\", \"3D / Vision\", \"Language Models\"]"},{"id":78,"run_id":1,"domain":"aiml","arxiv_id":"2602.19372","entry_id":"","title":"Seeing Farther and Smarter: Value-Guided Multi-Path Reflection for VLM Policy Optimization","authors":"[\"Yanting Yang\", \"Shenyuan Gao\", \"Qingwen Bu\", \"Li Chen\", \"Dimitris N. Metaxas\"]","abstract":"Solving complex, long-horizon robotic manipulation tasks requires a deep understanding of physical interactions, reasoning about their long-term consequences, and precise high-level planning. Vision-Language Models (VLMs) offer a general perceive-reason-act framework for this goal. However, previous approaches using reflective planning to guide VLMs in correcting actions encounter significant limitations. These methods rely on inefficient and often inaccurate implicit learning of state-values fr
Novel approach to VLM planning with efficiency gains.","code_url":null,"s2_tldr":"This work proposes a novel test-time computation framework that decouples state evaluation from action generation, and introduces a lightweight, confidence-based trigger that allows for early exit when direct predictions are reliable, invoking reflection only when necessary.","s2_paper_id":"fd83dd2409c674c954187727652955f8655aa0d2","topics":"[\"Multimodal\", \"Optimization\", \"Language Models\"]"},{"id":79,"run_id":1,"domain":"aiml","arxiv_id":"2602.19358","entry_id":"","title":"Referring Layer Decomposition","authors":"[\"Fangyi Chen\", \"Yaojie Shen\", \"Lu Xu\", \"Ye Yuan\", \"Shu Zhang\", \"Yulei Niu\", \"Longyin Wen\"]","abstract":"Precise, object-aware control over visual content is essential for advanced image editing and compositional generation. Yet, most existing approaches operate on entire images holistically, limiting the ability to isolate and manipulate individual scene elements. In contrast, layered representations, where scenes are explicitly separated into objects, environmental context, and visual effects, provide a more intuitive and structured framework for interpreting and editing visual content. To bridge","published":"2026-02-22T22:05:17+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19358v1","arxiv_url":"http://arxiv.org/abs/2602.19358v1","comment":"ICLR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"Introduces Referring Layer Decomposition (RLD) task for generating complete RGBA layers from RGB images via flexible prompts. Presents RefLade dataset (1.11M image-layer-prompt triplets) and RefLayer baseline model with strong zero-shot generalization for compositional editing.","reasoning":"Accepted at ICLR 2026 with large-scale dataset and novel task formulation. High practical value for image editing but no released code/weights yet.","code_url":null,"s2_tldr":"RefLayer is presented, a simple baseline designed for prompt-conditioned layer decomposition, achieving high visual fidelity and semantic alignment, and establishes RLD as a well-defined and benchmarkable research task.","s2_paper_id":"e2670724b1b87439aa764f7394f131813575b7c9","topics":"[\"Image Generation\", \"Robotics\"]"},{"id":80,"run_id":1,"domain":"aiml","arxiv_id":"2602.19350","entry_id":"","title":"PoseCraft: Tokenized 3D Body Landmark and Camera Conditioning for Photorealistic Human Image Synthesis","authors":"[\"Zhilin Guo\", \"Jing Yang\", \"Kyle Fogarty\", \"Jingyi Wan\", \"Boqiao Zhang\", \"Tianhao Wu\", \"Weihao Xia\", \"Chenliang Zhou\", \"Sakar Khattar\", \"Fangcheng Zhong\"]","abstract":"Digitizing humans and synthesizing photorealistic avatars with explicit 3D pose and camera controls are central to VR, telepresence, and entertainment. Existing skinning-based workflows require laborious manual rigging or template-based fittings, while neural volumetric methods rely on canonical templates and re-optimization for each unseen pose. 
We present PoseCraft, a diffusion framework built around tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, w","published":"2026-02-22T21:50:24+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19350v1","arxiv_url":"http://arxiv.org/abs/2602.19350v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"PoseCraft uses tokenized 3D landmarks and camera extrinsics as conditioning tokens for photorealistic human image synthesis via diffusion. Preserves 3D semantics better than 2D projection methods and introduces GenHumanRF data generation workflow from volumetric reconstructions.","reasoning":"Novel tokenized 3D conditioning approach for human synthesis with strong results, but no released code/weights. High practical value for avatar creation.","code_url":null,"s2_tldr":"PoseCraft is presented, a diffusion framework built around tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, it encodes sparse 3D landmarks and camera extrinsics as discrete conditioning tokens and injects them into diffusion via cross-attention.","s2_paper_id":"bd2eb3b1446f1ba0decf72891ee9ab80fe307a63","topics":"[\"3D / Vision\", \"Image Generation\", \"Optimization\"]"},{"id":81,"run_id":1,"domain":"aiml","arxiv_id":"2602.19349","entry_id":"","title":"UP-Fuse: Uncertainty-guided LiDAR-Camera Fusion for 3D Panoptic Segmentation","authors":"[\"Rohit Mohan\", \"Florian Drews\", \"Yakov Miron\", \"Daniele Cattaneo\", \"Abhinav Valada\"]","abstract":"LiDAR-camera fusion enhances 3D panoptic segmentation by leveraging camera images to complement sparse LiDAR scans, but it also introduces a critical failure mode. Under adverse conditions, degradation or failure of the camera sensor can significantly compromise the reliability of the perception system. To address this problem, we introduce UP-Fuse, a novel uncertainty-aware fusion framework in the 2D range-view that remains robust under camera sensor degradation, calibration drift, and sensor f","published":"2026-02-22T21:34:29+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19349v1","arxiv_url":"http://arxiv.org/abs/2602.19349v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"UP-Fuse introduces uncertainty-guided LiDAR-camera fusion for 3D panoptic segmentation that remains robust under camera degradation/failure. Uses uncertainty maps learned from representational divergence to modulate cross-modal interaction dynamically in range-view space.","reasoning":"Novel uncertainty-aware fusion with safety-critical applications, but no code/weights. 
Strong robustness results on multiple benchmarks including new Panoptic Waymo.","code_url":null,"s2_tldr":"UP-Fuse is introduced, a novel uncertainty-aware fusion framework in the 2D range-view that remains robust under camera sensor degradation, calibration drift, and sensor failure, making it well suited for robotic perception in safety-critical settings.","s2_paper_id":"61f715ec6713b19467cb3c915e0fefe259e3b39f","topics":"[\"3D / Vision\"]"},{"id":82,"run_id":1,"domain":"aiml","arxiv_id":"2602.19348","entry_id":"","title":"MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose","authors":"[\"Sirine Bhouri\", \"Lan Wei\", \"Jian-Qing Zheng\", \"Dandan Zhang\"]","abstract":"Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and struc","published":"2026-02-22T21:31:24+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19348v1","arxiv_url":"http://arxiv.org/abs/2602.19348v1","comment":"Accepted by 2026 ICRA","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"MultiDiffSense generates vision-based tactile sensor images (ViTac, TacTip, ViTacTip) using unified diffusion model conditioned on CAD depth maps and 4-DoF contact pose. Outperforms cGAN baseline by +36-135% SSIM and enables 50% synthetic data mixing for downstream pose estimation.","reasoning":"Accepted at ICRA 2026, novel multi-modal tactile synthesis approach. Addresses critical data bottleneck in robotics but no released code/weights.","code_url":null,"s2_tldr":"MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.","s2_paper_id":"f85bec2ce86a680e91f3bed8e82e99337f54c5a5","topics":"[\"Image Generation\", \"Benchmark\"]"},{"id":85,"run_id":1,"domain":"aiml","arxiv_id":"2602.19064","entry_id":"","title":"L3DR: 3D-aware LiDAR Diffusion and Rectification","authors":"[\"Quan Liu\", \"Xiaoqin Zhang\", \"Ling Shao\", \"Shijian Lu\"]","abstract":"Range-view (RV) based LiDAR diffusion has recently made huge strides towards 2D photo-realism. However, it neglects 3D geometry realism and often generates various RV artifacts such as depth bleeding and wavy surfaces. We design L3DR, a 3D-aware LiDAR Diffusion and Rectification framework that can regress and cancel RV artifacts in 3D space and restore local geometry accurately. 
Our theoretical and empirical analysis reveals that 3D models are inherently superior to 2D models in generating sharp","published":"2026-02-22T06:31:58+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19064v1","arxiv_url":"http://arxiv.org/abs/2602.19064v1","comment":"In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"L3DR addresses range-view LiDAR diffusion artifacts through 3D-aware rectification. It uses a 3D residual regression network predicting point-level offsets with Welsch Loss to focus on local geometry, achieving SOTA generation quality and geometry realism across KITTI, nuScenes, and Waymo benchmarks.","reasoning":"Novel 3D-aware approach for LiDAR generation with strong theoretical analysis, CVPR 2026. Practical for autonomous driving, but no code/weights available.","code_url":null,"s2_tldr":"L3DR is designed, a 3D-aware LiDAR Diffusion and Rectification framework that can regress and cancel RV artifacts in 3D space and restore local geometry accurately and achieves superb geometry realism by predicting point-level offsets in 3D space.","s2_paper_id":"ffac0049de3c8b549335de958063ad5a0883b13e","topics":"[\"3D / Vision\"]"},{"id":86,"run_id":1,"domain":"aiml","arxiv_id":"2602.19033","entry_id":"","title":"A Markovian View of Iterative-Feedback Loops in Image Generative Models: Neural Resonance and Model Collapse","authors":"[\"Vibhas Kumar Vats\", \"David J. Crandall\", \"Samuel Goree\"]","abstract":"AI training datasets will inevitably contain AI-generated examples, leading to ``feedback'' in which the output of one model impacts the training of another. It is known that such iterative feedback can lead to model collapse, yet the mechanisms underlying this degeneration remain poorly understood. Here we show that a broad class of feedback processes converges to a low-dimensional invariant structure in latent space, a phenomenon we call neural resonance. By modeling iterative feedback as a Ma","published":"2026-02-22T04:05:04+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19033v1","arxiv_url":"http://arxiv.org/abs/2602.19033v1","comment":"A preprint -- Under review","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":8.0,"score_axis_3":6.0,"composite":5.5,"summary":"Models iterative feedback in generative models as Markov Chains, introducing neural resonance\u2014convergence to low-dimensional invariant structures. Proposes eight-pattern taxonomy of collapse behaviors and provides unified explanation for model collapse through ergodicity and directional contraction analysis.","reasoning":"Novel theoretical framework for understanding model collapse with strong analysis. 
Important for AI safety research but less immediately practical for practitioners.","code_url":null,"s2_tldr":"Neural resonance provides a unified explanation for long-term degenerate behavior in generative models and provides practical diagnostics for identifying, characterizing, and eventually mitigating collapse.","s2_paper_id":"7c5595db126174d483c6bc6b16fddbe25c1cac1f","topics":"[\"Image Generation\", \"Benchmark\"]"},{"id":87,"run_id":1,"domain":"aiml","arxiv_id":"2602.19019","entry_id":"","title":"TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery","authors":"[\"Li Zhang\", \"Shruti Agarwal\", \"John Collomosse\", \"Pengtao Xie\", \"Vishal Asnani\"]","abstract":"Generative AI models pose a significant challenge to intellectual property (IP), as they can replicate unique artistic styles and concepts without attribution. While watermarking offers a potential solution, existing methods often fail in complex scenarios where multiple concepts (e.g., an object and an artistic style) are composed within a single image. These methods struggle to disentangle and attribute each concept individually. In this work, we introduce TokenTrace, a novel proactive waterma","published":"2026-02-22T03:18:45+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19019v1","arxiv_url":"http://arxiv.org/abs/2602.19019v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"TokenTrace embeds secret signatures into text prompt embeddings and latent noise for multi-concept attribution in generative AI. Uses query-based retrieval to disentangle and verify multiple concepts (objects/styles) from single images, achieving SOTA on single and multi-concept attribution while maintaining visual quality.","reasoning":"Novel approach to IP protection problem with practical query-based mechanism. Important for attribution but specialized use case, no code available.","code_url":null,"s2_tldr":"TokenTrace is introduced, a novel proactive watermarking framework for robust, multi-concept attribution that achieves state-of-the-art performance on both single-concept and multi-concept attribution tasks, significantly outperforming existing baselines while maintaining high visual quality and robustness to common transformations.","s2_paper_id":"043400c91a2eda40483ab6cc399d3f8d469f277d","topics":"[]"},{"id":90,"run_id":1,"domain":"aiml","arxiv_id":"2602.21009","entry_id":"","title":"HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders","authors":"[\"Kun Yuan\", \"Junyu Bi\", \"Daixuan Cheng\", \"Changfa Wu\", \"Shuwen Xiao\", \"Binbin Cao\", \"Jian Wu\", \"Yuning Jiang\"]","abstract":"Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints. While summarizing history via interest centers offers a practical alternative, existing methods struggle to (1) identify user-specific centers at appropriate granularity and (2) accurately assign behaviors, leading to quantization errors and loss of long-tail preferences. 
To alleviate these issues, we pro","published":"2026-02-24T15:28:58+00:00","categories":"[\"cs.IR\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.21009v1","arxiv_url":"http://arxiv.org/abs/2602.21009v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":8.0,"composite":5.5,"summary":"HiSAC introduces hierarchical sparse activation compression for ultra-long sequence modeling in recommenders through multi-level semantic IDs and global codebook. Deployed on Taobao achieving 1.65% CTR uplift with significant cost reduction through personalized interest-agent activation and Soft-Routing Attention.","reasoning":"Strong production deployment results on Taobao with measurable business impact. No code/weights shared. Novel hierarchical approach for sequence compression with real-world validation.","code_url":null,"s2_tldr":"Hierarchical Sparse Activation Compression (HiSAC) is proposed, an efficient framework for personalized sequence modeling that achieves significant compression and cost reduction, with online A/B tests showing a consistent 1.65% CTR uplift -- demonstrating its scalability and real-world effectiveness.","s2_paper_id":"1878382ff7e2ced6e9ca9079154682efdfa7e087","topics":"[\"Efficiency\", \"Training\"]"},{"id":91,"run_id":1,"domain":"aiml","arxiv_id":"2602.20995","entry_id":"","title":"Generative Pseudo-Labeling for Pre-Ranking with LLMs","authors":"[\"Junyu Bi\", \"Xinting Niu\", \"Daixuan Cheng\", \"Kun Yuan\", \"Tao Wang\", \"Binbin Cao\", \"Jian Wu\", \"Yuning Jiang\"]","abstract":"Pre-ranking is a critical stage in industrial recommendation systems, tasked with efficiently scoring thousands of recalled items for downstream ranking. A key challenge is the train-serving discrepancy: pre-ranking models are trained only on exposed interactions, yet must score all recalled candidates -- including unexposed items -- during online serving. This mismatch not only induces severe sample selection bias but also degrades generalization, especially for long-tail content. Existing debi","published":"2026-02-24T15:14:49+00:00","categories":"[\"cs.IR\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20995v1","arxiv_url":"http://arxiv.org/abs/2602.20995v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":8.0,"composite":5.5,"summary":"GPL leverages LLMs to generate unbiased pseudo-labels for unexposed items in recommendation pre-ranking, addressing train-serving distribution mismatch. 
Deployed in production, it achieved 3.07% CTR improvement and better long-tail discovery by aligning training with serving distributions without adding online latency.","reasoning":"Novel application of LLMs to industrial recommender systems with strong real-world results, but no code/weights available and incremental to existing RecSys paradigms.","code_url":null,"s2_tldr":"Generative Pseudo-Labeling (GPL) is proposed, a framework that leverages large language models (LLMs) to generate unbiased, content-aware pseudo-labels for unexposed items, explicitly aligning the training distribution with the online serving space.","s2_paper_id":"6040988eeb71b5a6bcf81fb3265c487b2ba22595","topics":"[\"Efficiency\"]"},{"id":92,"run_id":1,"domain":"aiml","arxiv_id":"2602.20945","entry_id":"","title":"The Art of Efficient Reasoning: Data, Reward, and Optimization","authors":"[\"Taiqiang Wu\", \"Zenan Zu\", \"Bo Zhou\", \"Ngai Wong\"]","abstract":"Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length","published":"2026-02-24T14:28:16+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20945v1","arxiv_url":"http://arxiv.org/abs/2602.20945v1","comment":"Tech Report, Insights on Efficient Reasoning via Reward Shaping","source":"both","github_repo":"","github_stars":null,"hf_upvotes":4,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"Comprehensive study of efficient reasoning in LLMs via reward shaping with RL, revealing a two-stage training paradigm and demonstrating 'small models, big coverage' effect. Provides practical guidelines from 0.2M GPU hours of experiments across Qwen3 series (0.6B-30B parameters).","reasoning":"Strong empirical study with actionable insights on efficiency, validated across model scales. HF presence (4 upvotes) signals community interest, but no open weights yet.","code_url":null,"s2_tldr":"A key finding is to train on relatively easier prompts, ensuring the density of positive reward signals and thus avoiding the length collapse, and the learned length bias can be generalized across domains.","s2_paper_id":"51b8f70ac8069d6a731d9c61f54f19dc6c95f2d0","topics":"[\"Efficiency\", \"Reasoning\", \"RL\"]"},{"id":94,"run_id":1,"domain":"aiml","arxiv_id":"2602.20532","entry_id":"","title":"Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training","authors":"[\"Zhengyao Gu\", \"Jonathan Light\", \"Raul Astudillo\", \"Ziyu Ye\", \"Langzhou He\", \"Henry Peng Zou\", \"Wei Cheng\", \"Santiago Paternain\", \"Philip S. Yu\", \"Yisong Yue\"]","abstract":"Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). 
ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for ex","published":"2026-02-24T04:19:48+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20532v1","arxiv_url":"http://arxiv.org/abs/2602.20532v1","comment":"37 pages, 8 figures, 1 table. Preprint under review. Equal contribution by first two authors","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"Actor-Curator presents an automated curriculum learning framework for LLM post-training using reinforcement learning. A neural curator dynamically selects training problems from large banks by optimizing expected policy improvement, formulated as a non-stationary stochastic bandit problem. Achieves 28.6% gains on AIME2024 and up to 80% training speedup over baselines.","reasoning":"Strong novelty in framing curriculum learning as bandits with theoretical guarantees. Impressive empirical gains on challenging reasoning tasks. No code release limits immediate adoption.","code_url":null,"s2_tldr":"This work proposes ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs), which learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement.","s2_paper_id":"f09d311304f1b0462b3e3871e74c35b1723280ed","topics":"[\"RL\", \"Language Models\", \"Training\"]"},{"id":95,"run_id":1,"domain":"aiml","arxiv_id":"2602.20528","entry_id":"","title":"Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning","authors":"[\"Justin Lovelace\", \"Christian Belardi\", \"Sofian Zalouk\", \"Adhitya Polavaram\", \"Srivatsa Kundurthy\", \"Kilian Q. Weinberger\"]","abstract":"The Stop-Think-AutoRegress Language Diffusion Model (STAR-LDM) integrates latent diffusion planning with autoregressive generation. Unlike conventional autoregressive language models limited to token-by-token decisions, STAR-LDM incorporates a \"thinking\" phase that pauses generation to refine a semantic plan through diffusion before continuing. This enables global planning in continuous space prior to committing to discrete tokens. Evaluations show STAR-LDM significantly outperforms similar-size","published":"2026-02-24T04:09:31+00:00","categories":"[\"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.20528v1","arxiv_url":"http://arxiv.org/abs/2602.20528v1","comment":"COLM 2025","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":8.0,"score_axis_3":6.0,"composite":5.5,"summary":"STAR-LDM integrates latent diffusion planning with autoregressive generation, adding a \"thinking\" phase that pauses token generation to refine semantic plans in continuous space before continuing. Outperforms similar-sized models with >70% win rates on narrative coherence and enables lightweight classifier-based control without retraining.","reasoning":"Novel architecture combining diffusion planning with autoregression is a paradigm shift. Strong controllability results. 
No code/weights available significantly limits practical adoption.","code_url":null,"s2_tldr":"The Stop-Think-AutoRegress Language Diffusion Model (STAR-LDM) integrates latent diffusion planning with autoregressive generation and offers straightforward control through lightweight classifiers, enabling fine-grained steering of attributes without model retraining while maintaining better fluency-control trade-offs than specialized approaches.","s2_paper_id":"b5101c1b8dc7cadf34a2fa8f106fc61bf5a70fb0","topics":"[\"Image Generation\", \"Language Models\", \"Agents\"]"},{"id":97,"run_id":1,"domain":"aiml","arxiv_id":"2602.20133","entry_id":"","title":"AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization","authors":"[\"Mert Cemri\", \"Shubham Agrawal\", \"Akshat Gupta\", \"Shu Liu\", \"Audrey Cheng\", \"Qiuyang Mang\", \"Ashwin Naren\", \"Lutfi Eren Erdogan\", \"Koushik Sen\", \"Matei Zaharia\"]","abstract":"The paradigm of automated program generation is shifting from one-shot generation to inference-time search, where Large Language Models (LLMs) function as semantic mutation operators within evolutionary loops. While effective, these systems are currently governed by static schedules that fail to account for the non-stationary dynamics of the search process. This rigidity results in substantial computational waste, as resources are indiscriminately allocated to stagnating populations while promis","published":"2026-02-23T18:45:31+00:00","categories":"[\"cs.NE\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20133v1","arxiv_url":"http://arxiv.org/abs/2602.20133v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"AdaEvolve reformulates LLM-driven evolution as hierarchical adaptive optimization with three levels: local adaptation (exploration intensity), global adaptation (bandit-based resource routing), and meta-guidance (novel tactics generation). Outperforms baselines across 185 optimization problems including combinatorial, systems, and algorithm design.","reasoning":"Novel hierarchical framework addressing inefficiency in LLM-driven evolutionary search. Strong empirical results across diverse problem types. No code release but conceptually actionable.","code_url":null,"s2_tldr":"This work introduces AdaEvolve, a framework that reformulates LLM-driven evolution as a hierarchical adaptive optimization problem and demonstrates that AdaEvolve consistently outperforms the open-sourced baselines across 185 different open-ended optimization problems including combinatorial, systems optimization and algorithm design problems.","s2_paper_id":"103ca617a9dc23aeede1ff3fb2de60d939050a37","topics":"[\"Language Models\", \"Optimization\"]"},{"id":102,"run_id":1,"domain":"aiml","arxiv_id":"2602.18671","entry_id":"","title":"Spilled Energy in Large Language Models","authors":"[\"Adrian Robert Minut\", \"Hazem Dewidar\", \"Iacopo Masi\"]","abstract":"We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track \"energy spills\" during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. 
Crucially, however, we","published":"2026-02-21T00:38:47+00:00","categories":"[\"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18671v1","arxiv_url":"http://arxiv.org/abs/2602.18671v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"Reinterprets LLM softmax as Energy-Based Model, introducing training-free hallucination detection via 'spilled energy' and 'marginalized energy' metrics from output logits. Achieves competitive detection across nine benchmarks on LLaMA, Mistral, Gemma without trained probes or activation ablations.","reasoning":"Novel energy-based perspective enabling training-free hallucination detection with strong empirical results, but no code available. Highly practical approach but needs reproducibility artifacts.","code_url":null,"s2_tldr":"This work reinterprets the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference, and introduces two completely training-free metrics derived directly from output logits: spilled energy and marginalized energy.","s2_paper_id":"a6751db3c139beb12d6e471de00d2e3b5399053a","topics":"[\"Language Models\"]"},{"id":104,"run_id":1,"domain":"aiml","arxiv_id":"2602.18333","entry_id":"","title":"On the \"Induction Bias\" in Sequence Models","authors":"[\"M. Reza Ebrahimi\", \"Micha\\u00ebl Defferrard\", \"Sunny Panchal\", \"Roland Memisevic\"]","abstract":"Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking. In particular, a growing body of literature has shown this limitation primarily through failures in out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We conduct a large-scale experimental study of the data efficiency of transform","published":"2026-02-20T16:39:07+00:00","categories":"[\"cs.LG\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18333v1","arxiv_url":"http://arxiv.org/abs/2602.18333v1","comment":"","source":"both","github_repo":"","github_stars":null,"hf_upvotes":3,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"Demonstrates transformers require exponentially more training data than RNNs for state tracking as state-space size and sequence length increase, with negligible weight sharing across lengths. Shows transformers learn length-specific solutions in isolation while RNNs exhibit effective amortized learning.","reasoning":"Important empirical findings on transformer limitations with 3 HF upvotes. No code/weights but significant implications for architecture design. 
Strong practical insights.","code_url":null,"s2_tldr":"A large-scale experimental study of the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes finds that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than for RNNs.","s2_paper_id":"9a229ebb0cff9ef8db97048911cfd377ccc81bdb","topics":"[\"Language Models\", \"Architecture\"]"},{"id":105,"run_id":1,"domain":"aiml","arxiv_id":"2602.17053","entry_id":"","title":"RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models","authors":"[\"Yunseok Han\", \"Yejoon Lee\", \"Jaeyoung Do\"]","abstract":"Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To o","published":"2026-02-19T03:49:37+00:00","categories":"[\"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17053v3","arxiv_url":"http://arxiv.org/abs/2602.17053v3","comment":"Accepted in ICLR 2026 Poster: https://iclr.cc/virtual/2026/poster/10011763","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[{\"id\": \"snu-aidas/RFEval\", \"likes\": 2}]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":7.0,"composite":5.5,"summary":"RFEval introduces formal framework for reasoning faithfulness with 7,186 instances testing stance consistency and causal influence via counterfactual interventions. Finds 49.7% unfaithfulness in LRMs, concentrated in math/code domains, with RL objectives reducing faithfulness even when maintaining accuracy.","reasoning":"High code_and_weights with code and dataset availability. Strong novelty in formal faithfulness framework with causal interventions. Strong practical applicability for auditing reasoning model reliability.","code_url":null,"s2_tldr":"This work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process, as well as establishing a formal framework for reasoning faithfulness.","s2_paper_id":"c080f0b18fa4853469993c1b5c798da89169ce2e","topics":"[\"Reasoning\", \"Benchmark\"]"},{"id":106,"run_id":1,"domain":"aiml","arxiv_id":"2602.17004","entry_id":"","title":"Arcee Trinity Large Technical Report","authors":"[\"Varun Singh\", \"Lucas Krauss\", \"Sami Jaghouar\", \"Matej Sirovatka\", \"Charles Goddard\", \"Fares Obied\", \"Jack Min Ong\", \"Jannik Straube\", \"Fern\", \"Aria Harley\"]","abstract":"We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. Additionally, we report on Trinity Nano and Trinity Mini, with Trinity Nano having 6B total parameters with 1B activated per token, Trinity Mini having 26B total parameters with 3B activated per token. 
The models' modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for M","published":"2026-02-19T01:58:50+00:00","categories":"[\"cs.LG\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17004v1","arxiv_url":"http://arxiv.org/abs/2602.17004v1","comment":"","source":"both","github_repo":"","github_stars":null,"hf_upvotes":16,"hf_models":"[{\"id\": \"arcee-ai/Trinity-Large-Preview\", \"likes\": 149}, {\"id\": \"arcee-ai/Trinity-Large-Preview-W4A16\", \"likes\": 5}]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":8.0,"composite":5.5,"summary":"Technical report for Arcee Trinity family: sparse MoE models (Trinity Large: 400B params/13B activated; Trinity Mini: 26B/3B; Trinity Nano: 6B/1B) featuring modern architecture with interleaved attention, SMEBU routing, and Muon optimizer. Zero loss spikes during training, all checkpoints on HuggingFace.","reasoning":"High code_and_weights with models on HuggingFace, 16 upvotes. Moderate novelty in architecture choices and SMEBU routing. Strong practical applicability with open weights across multiple scales.","code_url":null,"s2_tldr":"The technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token, and a new MoE load balancing strategy titled Soft-clamped Momentum Expert Bias Updates (SMEBU) are presented.","s2_paper_id":"648ff261c40e20f0f858ac55ea6c2eeb98c53834","topics":"[\"Architecture\"]"},{"id":354,"run_id":1,"domain":"aiml","arxiv_id":"2602.20316","entry_id":"","title":"Inspectorch: Efficient rare event exploration in solar observations","authors":"[\"C. J. D\\u00edaz Baso\", \"I. J. Soler Poquet\", \"C. Kuckein\", \"M. van Noort\", \"N. Poirier\"]","abstract":"The Sun is observed in unprecedented detail, enabling studies of its activity on very small spatiotemporal scales. However, the large volume of data collected by our telescopes cannot be fully analyzed with conventional methods. Popular machine learning methods identify general trends from observations, but tend to overlook unusual events due to their low frequency of occurrence. We study the applicability of unsupervised probabilistic methods to efficiently identify rare events in multidimensio","published":"2026-02-23T20:03:08+00:00","categories":"[\"astro-ph.SR\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20316v1","arxiv_url":"http://arxiv.org/abs/2602.20316v1","comment":"Comments: 12+1 pages, 11+2 figures, submitted to A&A","source":"arxiv","github_repo":"https://github.com/cdiazbas/inspectorch","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":5.25,"summary":"Inspectorch applies flow-based probabilistic models to identify rare events in solar observations by assigning probability scores to each sample. The framework demonstrates effectiveness across multiple solar instruments and is released as an open-source Python package for astronomical data analysis.","reasoning":"Code available on GitHub. Novel application of flow-based models to solar physics but domain-specific (astronomy/remote sensing). 
Limited broader applicability beyond specialized use case.","code_url":"https://github.com/cdiazbas/inspectorch","s2_tldr":"Inspectorch, an open-source framework that utilizes flow-based models: flexible density estimators capable of learning the multidimensional distribution of solar observations, demonstrates that density estimation using flow-based models offers a powerful approach to identifying rare events in large solar datasets.","s2_paper_id":"05709834d2950d5be9702f6942979975959293ce","topics":"[\"Efficiency\"]"},{"id":359,"run_id":1,"domain":"aiml","arxiv_id":"2602.19562","entry_id":"","title":"A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data","authors":"[\"Joseph Bingham\"]","abstract":"Establishing stable mappings between natural language expressions and visual percepts is a foundational problem for both cognitive science and artificial intelligence. Humans routinely ground linguistic reference in noisy, ambiguous perceptual contexts, yet the mechanisms supporting such cross-modal alignment remain poorly understood. In this work, we introduce a computational framework designed to model core aspects of human referential interpretation by integrating linguistic utterances with p","published":"2026-02-23T07:20:11+00:00","categories":"[\"cs.AI\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19562v1","arxiv_url":"http://arxiv.org/abs/2602.19562v1","comment":"19 Pages, 6 figures, preprint","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":5.25,"summary":"A computational framework aligns linguistic descriptions with visual percepts using SIFT and UQI for perceptual similarity on crowd-sourced imagery. Evaluated on Stanford Repeated Reference Game corpus, it requires 65% fewer utterances than humans for stable mappings.","reasoning":"Code on anonymous repo (preprint). Cognitive modeling is interesting but incremental and niche (linguistic grounding). Limited immediate applicability.","code_url":"https://anonymous.4open.science/r/metasequoia-9D13/README.md","s2_tldr":"A computational framework designed to model core aspects of human referential interpretation by integrating linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery achieves robust referential grounding and offers insights into models of grounded communication, perceptual inference, and cross-modal concept formation.","s2_paper_id":"c120c71128720f0ba1af8e058a25214e8cf19070","topics":"[\"Multimodal\", \"Training\"]"},{"id":364,"run_id":1,"domain":"aiml","arxiv_id":"2602.19022","entry_id":"","title":"An interpretable framework using foundation models for fish sex identification","authors":"[\"Zheng Miao\", \"Tien-Chieh Hung\"]","abstract":"Accurate sex identification in fish is vital for optimizing breeding and management strategies in aquaculture, particularly for species at the risk of extinction. However, most existing methods are invasive or stressful and may cause additional mortality, posing severe risks to threatened or endangered fish populations. 
To address these challenges, we propose FishProtoNet, a robust, non-invasive computer vision-based framework for sex identification of delta smelt (Hypomesus transpacificus), an ","published":"2026-02-22T03:21:26+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19022v1","arxiv_url":"http://arxiv.org/abs/2602.19022v1","comment":"","source":"arxiv","github_repo":"https://github.com/zhengmiao1/Fish_sex_identification","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":5.25,"summary":"FishProtoNet is a non-invasive framework for delta smelt sex identification using foundation models for ROI extraction and interpretable prototype networks. Achieves 74-81% accuracy in early/post-spawning stages but struggles with subadult fish due to less pronounced morphological differences.","reasoning":"Code available on GitHub, but very narrow ecological application (endangered fish species). Limited general applicability despite interesting use of foundation models.","code_url":"https://github.com/zhengmiao1/Fish_sex_identification","s2_tldr":"The proposed FishProtoNet framework is a robust, non-invasive computer vision-based framework for sex identification of delta smelt, an endangered fish species native to California, across its full life cycle and provides interpretability through learned prototype representations while improving robustness by leveraging foundation models to reduce the influence of background noise.","s2_paper_id":"864aa1ceede3ceb7a1abf003d708fbf961c3f6d3","topics":"[\"Language Models\", \"Optimization\"]"},{"id":377,"run_id":1,"domain":"aiml","arxiv_id":"2602.19883","entry_id":"","title":"Denotational Semantics for ODRL: Knowledge-Based Constraint Conflict Detection","authors":"[\"Daham Mustafa\", \"Diego Collarana\", \"Yixin Peng\", \"Rafiqul Haque\", \"Christoph Lange-Bever\", \"Christoph Quix\", \"Stephan Decker\"]","abstract":"ODRL's six set-based operators -- isA, isPartOf, hasPart, isAnyOf, isAllOf, isNoneOf -- depend on external domain knowledge that the W3C specification leaves unspecified. Without it, every cross-dataspace policy comparison defaults to Unknown. We present a denotational semantics that maps each ODRL constraint to the set of knowledge-base concepts satisfying it. Conflict detection reduces to denotation intersection under a three-valued verdict -- Conflict, Compatible, or Unknown -- that is sound ","published":"2026-02-23T14:28:13+00:00","categories":"[\"cs.CL\", \"cs.LO\"]","pdf_url":"https://arxiv.org/pdf/2602.19883v1","arxiv_url":"http://arxiv.org/abs/2602.19883v1","comment":"17 pages, 6 tables. Working draft. Supplementary material (154 TPTP/SMT-LIB benchmarks, Isabelle/HOL theory file) will be made available at https://github.com/Daham-Mustaf/odrl-benchmark upon publication","source":"arxiv","github_repo":"https://github.com/Daham-Mustaf/odrl-benchmark","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":5.25,"summary":"Presents denotational semantics for ODRL policy language to enable knowledge-based constraint conflict detection across dataspaces. Maps ODRL constraints to KB concept sets and provides sound three-valued verdict system (Conflict/Compatible/Unknown). 
Validated on 154 benchmarks across six KB families using Vampire and Z3 solvers.","reasoning":"Specialized formal methods application for policy language. Benchmark code on GitHub but no ML weights. Limited to ODRL/dataspace domain.","code_url":"https://github.com/Daham-Mustaf/odrl-benchmark","s2_tldr":"This work presents a denotational semantics that maps each ODRL constraint to the set of knowledge-base concepts satisfying it, and proves two guarantees: conflicts are preserved across different KB standards, and unmapped concepts degrade gracefully to Unknown -- never to false conflicts.","s2_paper_id":"1e4e856bdf40ed1d09e9762fb8afc5784fb494a2","topics":"[\"Retrieval / RAG\"]"},{"id":384,"run_id":1,"domain":"aiml","arxiv_id":"2602.18964","entry_id":"","title":"Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language","authors":"[\"Toheeb Aduramomi Jimoh\", \"Tabea De Wille\", \"Nikola S. Nikolov\"]","abstract":"Sarcasm detection poses a fundamental challenge in computational semantics, requiring models to resolve disparities between literal and intended meaning. The challenge is amplified in low-resource languages where annotated datasets are scarce or nonexistent. We present \\textbf{Yor-Sarc}, the first gold-standard dataset for sarcasm detection in Yor\u00f9b\u00e1, a tonal Niger-Congo language spoken by over $50$ million people. The dataset comprises 436 instances annotated by three native speakers from diver","published":"2026-02-21T22:10:18+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18964v1","arxiv_url":"http://arxiv.org/abs/2602.18964v1","comment":"","source":"arxiv","github_repo":"https://github.com/toheebadura/yor-sarc","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[{\"id\": \"toheebadura/yor-sarc\", \"likes\": 0}]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":5.25,"summary":"Introduces Yor-Sarc, the first gold-standard sarcasm detection dataset for Yor\u00f9b\u00e1 with 436 annotated instances. Achieves substantial inter-annotator agreement (Fleiss' \u03ba=0.7660) using culture-informed annotation protocol. Dataset available on GitHub to support low-resource African language NLP.","reasoning":"Important dataset for underrepresented language but limited scale (436 instances). More relevant for low-resource NLP research than broad practitioner adoption.","code_url":"https://github.com/toheebadura/yor-sarc","s2_tldr":"Yor-Sarc is the first gold-standard dataset for sarcasm detection in Yor\u00f9b\u00e1, a tonal Niger-Congo language spoken by over $50$ million people and expected to facilitate research on semantic interpretation and culturally informed NLP for low-resource African languages.","s2_paper_id":"a921792289e07e458ef615efe405703f5a2b9e2c","topics":"[\"Benchmark\"]"},{"id":109,"run_id":1,"domain":"aiml","arxiv_id":"2602.21100","entry_id":"","title":"Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction","authors":"[\"No\\u00e9 Artru\", \"Rukhshanda Hussain\", \"Emeline Got\", \"Alexandre Messier\", \"David B. Lindell\", \"Abdallah Dib\"]","abstract":"Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. 
Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while","published":"2026-02-24T17:02:11+00:00","categories":"[\"cs.CV\", \"cs.GR\"]","pdf_url":"https://arxiv.org/pdf/2602.21100v1","arxiv_url":"http://arxiv.org/abs/2602.21100v1","comment":"14 pages, 8 figures, to be published in proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Skullptor combines multi-view surface normal prediction with cross-view attention and inverse rendering optimization for high-fidelity 3D head reconstruction. Achieves quality on par with dense-view photogrammetry while reducing camera requirements and computational cost through a hybrid foundation model and optimization approach.","reasoning":"Practical hybrid approach combining foundation models with optimization. CVPR 2026 paper but no code/weights released yet. Solid technical contribution for 3D reconstruction.","code_url":null,"s2_tldr":"This work introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass and leverages these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details.","s2_paper_id":"f41f3995300631b5aa68d751941bd042488ab361","topics":"[\"3D / Vision\", \"Language Models\", \"Efficiency\"]"},{"id":111,"run_id":1,"domain":"aiml","arxiv_id":"2602.21035","entry_id":"","title":"Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning","authors":"[\"Junhao Xiao\", \"Zhiyu Wu\", \"Hao Lin\", \"Yi Chen\", \"Yahui Liu\", \"Xiaoran Zhao\", \"Zixu Wang\", \"Zejiang He\"]","abstract":"Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching \"no dog\" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from ","published":"2026-02-24T15:55:39+00:00","categories":"[\"cs.CV\", \"cs.MM\"]","pdf_url":"https://arxiv.org/pdf/2602.21035v1","arxiv_url":"http://arxiv.org/abs/2602.21035v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"CLIPGlasses is a plug-and-play framework enhancing CLIP's understanding of negated visual descriptions without fine-tuning. Lens module disentangles negated semantics; Frame module predicts context-aware repulsion strength integrated into modified similarity computation. Shows strong cross-domain generalization and robustness under low-resource conditions.","reasoning":"Practical plug-and-play approach for important CLIP limitation. Novelty is moderate (architectural additions), no code/weights provided. 
Good cross-domain results.","code_url":null,"s2_tldr":"Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization, indicating stronger robustness across domains.","s2_paper_id":"0cb02ab95a3a407028c0d36df02775e80799ad36","topics":"[\"Language Models\", \"Multimodal\", \"Retrieval / RAG\"]"},{"id":112,"run_id":1,"domain":"aiml","arxiv_id":"2602.20981","entry_id":"","title":"Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models","authors":"[\"Christian Simon\", \"Masato Ishii\", \"Wei-Yao Wang\", \"Koichi Saito\", \"Akio Hayakawa\", \"Dongseok Shim\", \"Zhi Zhong\", \"Shuyang Cui\", \"Shusuke Takahashi\", \"Takashi Shibuya\"]","abstract":"Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present multimodal hierarchical networks so-called MMHNet, an enhanced extension of state-of-the-art video-to-au","published":"2026-02-24T15:01:39+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20981v1","arxiv_url":"http://arxiv.org/abs/2602.20981v1","comment":"Accepted to CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"MMHNet enables video-to-audio generation models to scale to 5+ minute audio from short training instances. Uses hierarchical networks with non-causal Mamba for long-form generation, demonstrating that training on short videos generalizes to long test sequences without additional long-duration training data.","reasoning":"Solves practical length generalization problem in video-to-audio. No code/weights commitment. Moderate novelty in architecture modifications.","code_url":null,"s2_tldr":"This work tackles the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing, and presents multimodal hierarchical networks so-called MMHNet, an enhanced extension of state-of-the-art video-to-audio models.","s2_paper_id":"10d689f88b6a004715405b9b117562cf59a7757a","topics":"[\"Speech / Audio\", \"Multimodal\", \"Training\"]"},{"id":113,"run_id":1,"domain":"aiml","arxiv_id":"2602.20980","entry_id":"","title":"CrystaL: Spontaneous Emergence of Visual Latents in MLLMs","authors":"[\"Yang Zhang\", \"Danyang Li\", \"Yuxuan Li\", \"Xin Zhang\", \"Tianyu Xie\", \"Mingming Cheng\", \"Xiang Li\"]","abstract":"Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. 
However, existing heuristically predefined supervision signals in latent CoT provide limited guidance for preserving critical visual information in intermediate latent","published":"2026-02-24T15:01:30+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20980v1","arxiv_url":"http://arxiv.org/abs/2602.20980v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"CrystaL improves latent Chain-of-Thought in MLLMs by crystallizing latent representations into task-relevant visual semantics through dual-path processing of intact/corrupted images with attention alignment. Single-stage framework without auxiliary annotations achieves gains on perception-intensive benchmarks while maintaining reasoning capabilities.","reasoning":"Solid technical contribution to latent CoT but incremental improvement. No code/weights mentioned. Practical for MLLM reasoning.","code_url":null,"s2_tldr":"CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images, crystallizes latent representations into task-relevant visual semantics, without relying on auxiliary annotations or external modules.","s2_paper_id":"ccd5076838cceebd7099b9e8ec6fb62b7ed3dc48","topics":"[\"Language Models\", \"Multimodal\", \"Reasoning\"]"},{"id":114,"run_id":1,"domain":"aiml","arxiv_id":"2602.20933","entry_id":"","title":"Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting","authors":"[\"Shuangkang Fang\", \"I-Chao Shen\", \"Xuanyang Zhang\", \"Zesheng Wang\", \"Yufeng Wang\", \"Wenrui Ding\", \"Gang Yu\", \"Takeo Igarashi\"]","abstract":"Recent 3D Gaussian Splatting (3DGS) Dropout methods address overfitting under sparse-view conditions by randomly nullifying Gaussian opacities. However, we identify a neighbor compensation effect in these approaches: dropped Gaussians are often compensated by their neighbors, weakening the intended regularization. Moreover, these methods overlook the contribution of high-degree spherical harmonic coefficients (SH) to overfitting. To address these issues, we propose DropAnSH-GS, a novel anchor-ba","published":"2026-02-24T14:11:56+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20933v1","arxiv_url":"http://arxiv.org/abs/2602.20933v1","comment":"Accepted by CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"DropAnSH-GS addresses sparse-view 3D Gaussian Splatting overfitting through anchor-based dropout of Gaussian neighborhoods and high-degree spherical harmonics. Disrupts neighbor compensation effect in prior methods while enabling post-training model compression via SH truncation. Integrates easily into existing 3DGS variants with negligible overhead.","reasoning":"Solid technical improvement to 3DGS with practical benefits. No code/weights commitment despite project website mention. 
Incremental dropout strategy enhancement.","code_url":null,"s2_tldr":"This work proposes DropAnSH-GS, a novel anchor-based Dropout strategy that substantially outperforms existing Dropout methods with negligible computational overhead, and can be readily integrated into various 3DGS variants to enhance their performances.","s2_paper_id":"3e9b7cc776d3b70d62016c944b1f1a7b165b4066","topics":"[\"3D / Vision\", \"Efficiency\"]"},{"id":115,"run_id":1,"domain":"aiml","arxiv_id":"2602.20925","entry_id":"","title":"LST-SLAM: A Stereo Thermal SLAM System for Kilometer-Scale Dynamic Environments","authors":"[\"Zeyu Jiang\", \"Kuan Xu\", \"Changhao Chen\"]","abstract":"Thermal cameras offer strong potential for robot perception under challenging illumination and weather conditions. However, thermal Simultaneous Localization and Mapping (SLAM) remains difficult due to unreliable feature extraction, unstable motion tracking, and inconsistent global pose and map construction, particularly in dynamic large-scale outdoor environments. To address these challenges, we propose LST-SLAM, a novel large-scale stereo thermal SLAM system that achieves robust performance in","published":"2026-02-24T14:04:54+00:00","categories":"[\"cs.RO\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20925v1","arxiv_url":"http://arxiv.org/abs/2602.20925v1","comment":"ICRA 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"LST-SLAM presents large-scale stereo thermal SLAM system for kilometer-scale dynamic outdoor environments. Combines self-supervised thermal feature learning, stereo dual-level tracking, and semantic-geometric hybrid constraints. Introduces online incremental bag-of-words for loop closure, outperforming AirSLAM and DROID-SLAM on thermal datasets.","reasoning":"Practical thermal SLAM for challenging conditions but niche robotics application. No code/weights mentioned. Solid engineering contribution.","code_url":null,"s2_tldr":"LST-SLAM is proposed, a novel large-scale stereo thermal SLAM system that achieves robust performance in complex, dynamic scenes and introduces a semantic-geometric hybrid constraint that suppresses potentially dynamic features lacking strong inter-frame geometric consistency.","s2_paper_id":"68a316de957bddb4fad7e750adfd0c6488c14177","topics":"[\"Robotics\"]"},{"id":116,"run_id":1,"domain":"aiml","arxiv_id":"2602.20860","entry_id":"","title":"DA-Cal: Towards Cross-Domain Calibration in Semantic Segmentation","authors":"[\"Wangkai Li\", \"Rui Sun\", \"Zhaoyang Li\", \"Yujia Chen\", \"Tianzhu Zhang\"]","abstract":"While existing unsupervised domain adaptation (UDA) methods greatly enhance target domain performance in semantic segmentation, they often neglect network calibration quality, resulting in misalignment between prediction confidence and actual accuracy -- a significant risk in safety-critical applications. 
Our key insight emerges from observing that performance degrades substantially when soft pseudo-labels replace hard pseudo-labels in cross-domain scenarios due to poor calibration, despite the ","published":"2026-02-24T13:03:41+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20860v1","arxiv_url":"http://arxiv.org/abs/2602.20860v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"DA-Cal addresses calibration quality in unsupervised domain adaptation for semantic segmentation by transforming target domain calibration into soft pseudo-label optimization. Introduces Meta Temperature Network for pixel-level calibration parameters with bi-level optimization and domain-mixing strategies. Improves calibration without inference overhead across multiple UDA benchmarks.","reasoning":"Important calibration problem in UDA but incremental solution. No code/weights commitment. Practical for safety-critical segmentation applications.","code_url":null,"s2_tldr":"DA-Cal is proposed, a dedicated cross-domain calibration framework that transforms target domain calibration into soft pseudo-label optimization and seamlessly integrates with existing self-training frameworks across multiple UDA segmentation benchmarks, significantly improving target domain calibration while delivering performance gains without inference overhead.","s2_paper_id":"84a22522a5a04b34113dc9373d66aebdba2ca607","topics":"[\"3D / Vision\", \"Training\"]"},{"id":117,"run_id":1,"domain":"aiml","arxiv_id":"2602.20839","entry_id":"","title":"Training-Free Multi-Concept Image Editing","authors":"[\"Niki Foteinopoulou\", \"Ignas Budvytis\", \"Stephan Liwicki\"]","abstract":"Editing images with diffusion models without training remains challenging. While recent optimisation-based methods achieve strong zero-shot edits from text, they struggle to preserve identity or capture details that language alone cannot express. Many visual concepts such as facial structure, material texture, or object geometry are impossible to express purely through text prompts alone. To address this gap, we introduce a training-free framework for concept-based image editing, which unifies O","published":"2026-02-24T12:27:51+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20839v1","arxiv_url":"http://arxiv.org/abs/2602.20839v1","comment":"17 pages, 13 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Training-free multi-concept image editing framework unifying Optimised DDS with LoRA-driven concept composition. Enables combining multiple visual concepts with semantic text guidance through pretrained adapters. Refines DDS through ordered timesteps, regularization, and negative-prompt guidance, improving over InstructPix2Pix and ComposLoRA baselines.","reasoning":"Useful training-free editing approach but no code commitment. Moderate novelty in combining existing techniques. 
Practical for users with pretrained LoRAs.","code_url":null,"s2_tldr":"This work introduces a training-free framework for concept-based image editing, which unifies Optimised DDS with LoRA-driven concept composition, where the training data of the LoRA represent the concept.","s2_paper_id":"555049cd20970930a7d0b6b8f66ad33e2037d101","topics":"[\"Image Generation\", \"Optimization\"]"},{"id":119,"run_id":1,"domain":"aiml","arxiv_id":"2602.20672","entry_id":"","title":"BBQ-to-Image: Numeric Bounding Box and Qolor Control in Large-Scale Text-to-Image Models","authors":"[\"Eliran Kachlon\", \"Alexander Visheratin\", \"Nimrod Sarid\", \"Tal Hacham\", \"Eyal Gutflaish\", \"Saar Huberman\", \"Hezi Zisman\", \"David Ruppin\", \"Ron Mokady\"]","abstract":"Text-to-image models have rapidly advanced in realism and controllability, with recent approaches leveraging long, detailed captions to support fine-grained generation. However, a fundamental parametric gap remains: existing models rely on descriptive language, whereas professional workflows require precise numeric control over object location, size, and color. In this work, we introduce BBQ, a large-scale text-to-image model that directly conditions on numeric bounding boxes and RGB triplets wi","published":"2026-02-24T08:22:42+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20672v1","arxiv_url":"http://arxiv.org/abs/2602.20672v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"BBQ introduces numeric control (bounding boxes, RGB triplets) into text-to-image models via structured-text conditioning, enabling precise spatial and color specification without architectural changes. Provides intuitive UI controls like object dragging and color pickers.","reasoning":"No code/weights available. Solid practical contribution for controllable generation. Incremental architectural novelty but useful paradigm shift toward parametric control.","code_url":null,"s2_tldr":null,"s2_paper_id":"cef99e346093a6845bb073bc74805d3234865e61","topics":"[\"Image Generation\"]"},{"id":121,"run_id":1,"domain":"aiml","arxiv_id":"2602.20627","entry_id":"","title":"Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection","authors":"[\"Zhaonian Kuang\", \"Rui Ding\", \"Meng Yang\", \"Xinhu Zheng\", \"Gang Hua\"]","abstract":"Monocular 3D object detection (M3OD) is intrinsically ill-posed, hence training a high-performance deep learning based M3OD model requires a humongous amount of labeled data with complicated visual variation from diverse scenes, variety of objects and camera poses. However, we observe that, due to strong human bias, the three independent entities, i.e., object, scene, and camera pose, are always tightly entangled when an image is captured to construct training data. More specifically, specific 3D","published":"2026-02-24T07:22:58+00:00","categories":"[\"cs.CV\", \"cs.RO\"]","pdf_url":"https://arxiv.org/pdf/2602.20627v1","arxiv_url":"http://arxiv.org/abs/2602.20627v1","comment":"IJCV","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Proposes online object-scene-camera decomposition and recomposition for data-efficient monocular 3D object detection. 
Decomposes training images into textured 3D objects and backgrounds, then continuously recomposes with varied object placements and camera poses across epochs.","reasoning":"No code/weights. Clever data augmentation approach with practical value. Incremental contribution to 3D detection training efficiency.","code_url":null,"s2_tldr":"An online object-scene-camera decomposition and recomposition data manipulation scheme to more efficiently exploit the training data and serve as a plug-and-play component to boost M3OD models, working flexibly with both fully and sparsely supervised settings.","s2_paper_id":"26817fd3b2b3a5f8d37752370a48c6c8e20bfaca","topics":"[\"3D / Vision\", \"Efficiency\"]"},{"id":124,"run_id":1,"domain":"aiml","arxiv_id":"2602.20501","entry_id":"","title":"Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models","authors":"[\"Qing Zhang\", \"Xuesong Li\", \"Jing Zhang\"]","abstract":"What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent's actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures","published":"2026-02-24T02:59:15+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20501v1","arxiv_url":"http://arxiv.org/abs/2602.20501v1","comment":"11 pages, 12 figures, Accepted to CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"Probes affordance understanding in VFMs by separating geometric (DINO part structures) and interaction (Flux attention maps) perception. Demonstrates training-free, zero-shot affordance estimation via simple fusion of these components achieves weakly-supervised performance.","reasoning":"No code/weights provided (CVPR 2026 accepted). Novel mechanistic analysis of affordance in foundation models with strong conceptual contribution. Practical zero-shot method but lacks open implementation.","code_url":null,"s2_tldr":"This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.","s2_paper_id":"de9b94c979beb75dd049a8a09ab2ee678f6433a0","topics":"[\"Language Models\", \"Reasoning\", \"Agents\"]"},{"id":125,"run_id":1,"domain":"aiml","arxiv_id":"2602.20500","entry_id":"","title":"Strategy-Supervised Autonomous Laparoscopic Camera Control via Event-Driven Graph Mining","authors":"[\"Keyu Zhou\", \"Peisen Xu\", \"Yahao Wu\", \"Jiming Chen\", \"Gaofeng Li\", \"Shunlei Li\"]","abstract":"Autonomous laparoscopic camera control must maintain a stable and safe surgical view under rapid tool-tissue interactions while remaining interpretable to surgeons. We present a strategy-grounded framework that couples high-level vision-language inference with low-level closed-loop control. Offline, raw surgical videos are parsed into camera-relevant temporal events (e.g., interaction, working-distance deviation, and view-quality degradation) and structured as attributed event graphs. 
Mining the","published":"2026-02-24T02:56:39+00:00","categories":"[\"cs.RO\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20500v1","arxiv_url":"http://arxiv.org/abs/2602.20500v1","comment":"Submitted to IEEE Transactions on Robotics (T-RO). 19 pages, 9 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Strategy-grounded laparoscopic camera control combining VLM inference with IBVS-RCM control. Mines temporal event graphs from surgical videos to extract reusable strategy primitives; VLM predicts strategies online for autonomous camera positioning. Outperforms junior surgeons by 35% in field-of-view centering.","reasoning":"No code/weights provided (submitted T-RO). Novel structured approach to surgical camera control with strong practical results. High applicability for surgical robotics but lacks open implementation.","code_url":null,"s2_tldr":"The proposed system outperforms junior surgeons in standardized camera-handling evaluations, reducing field-of-view centering error and image shaking by 35.26% and image shaking by 62.33%, while preserving smooth motion and stable working-distance regulation.","s2_paper_id":"b9d13ed448cd587acd418ed519c2c9ed5786c3e6","topics":"[\"Multimodal\"]"},{"id":126,"run_id":1,"domain":"aiml","arxiv_id":"2602.20476","entry_id":"","title":"SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens","authors":"[\"Anindita Ghosh\", \"Vladislav Golyanik\", \"Taku Komura\", \"Philipp Slusallek\", \"Christian Theobalt\", \"Rishabh Dabral\"]","abstract":"Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent (\"walk to the couch\") and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene repr","published":"2026-02-24T02:09:12+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20476v1","arxiv_url":"http://arxiv.org/abs/2602.20476v1","comment":"13 pages, 6 figures, 4 tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"SceMoS synthesizes text-driven 3D human motion in scenes by disentangling global planning (BEV with DINOv2 features) from local execution (heightmap-based VQ-VAE tokens). 2D scene representations replace expensive 3D volumetric data while maintaining physics grounding, achieving SOTA on TRUMANS with 50%+ fewer parameters.","reasoning":"No code/weights provided. Novel 2D factorization approach for scene-aware motion synthesis with efficiency gains. 
Strong conceptual contribution but lacks open implementation.","code_url":null,"s2_tldr":"SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.","s2_paper_id":"bbee1636ad9a6e780f1e7a7ddbe1b55edac2b2e0","topics":"[\"Agents\", \"3D / Vision\", \"Reasoning\"]"},{"id":128,"run_id":1,"domain":"aiml","arxiv_id":"2602.20231","entry_id":"","title":"UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models","authors":"[\"Manish Kumar Govind\", \"Dominick Reilly\", \"Pu Wang\", \"Srijan Das\"]","abstract":"Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incor","published":"2026-02-23T18:41:41+00:00","categories":"[\"cs.RO\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20231v1","arxiv_url":"http://arxiv.org/abs/2602.20231v1","comment":"https://manishgovind.github.io/unilact-vla/","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"UniLACT introduces depth-aware latent action learning for vision-language-action models through UniLARN framework that learns shared embeddings for RGB and depth. The transformer-based VLA model enables stronger spatial priors for manipulation tasks, consistently outperforming RGB-only baselines in both simulation and real-world settings.","reasoning":"No code released. Novel depth integration into VLA models with practical robotics applicability. Strong experimental validation but awaiting public release.","code_url":null,"s2_tldr":"This work introduces UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors, and proposes UniLARN, a unified latent action learning framework that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions.","s2_paper_id":"5835f5ce0ccfe9610566a0d308674388c7c7f976","topics":"[\"Multimodal\", \"3D / Vision\", \"Robotics\"]"},{"id":129,"run_id":1,"domain":"aiml","arxiv_id":"2602.20079","entry_id":"","title":"SemanticNVS: Improving Semantic Scene Understanding in Generative Novel View Synthesis","authors":"[\"Xinya Chen\", \"Christopher Wewer\", \"Jiahao Xie\", \"Xinting Hu\", \"Jan Eric Lenssen\"]","abstract":"We present SemanticNVS, a camera-conditioned multi-view diffusion model for novel view synthesis (NVS), which improves generation quality and consistency by integrating pre-trained semantic feature extractors. Existing NVS methods perform well for views near the input view, however, they tend to generate semantically implausible and distorted images under long-range camera motion, revealing severe degradation. 
We speculate that this degradation is due to current models failing to fully understan","published":"2026-02-23T17:45:21+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20079v1","arxiv_url":"http://arxiv.org/abs/2602.20079v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"SemanticNVS improves novel view synthesis quality under long-range camera motion by integrating pre-trained semantic feature extractors into camera-conditioned multi-view diffusion models. The approach achieves 4.69%-15.26% FID improvement across multiple datasets by enhancing semantic scene understanding.","reasoning":"No code/weights indicated; moderate novelty in combining semantic features with NVS; good practical applicability for view synthesis tasks.","code_url":null,"s2_tldr":"SemanticNVS is presented, a camera-conditioned multi-view diffusion model for novel view synthesis (NVS), which improves generation quality and consistency by integrating pre-trained semantic feature extractors.","s2_paper_id":"be12516f33264db8e56bd5f9eca7838222f8e332","topics":"[]"},{"id":130,"run_id":1,"domain":"aiml","arxiv_id":"2602.19944","entry_id":"","title":"Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation","authors":"[\"Yilong Yang\", \"Jianxin Tian\", \"Shengchuan Zhang\", \"Liujuan Cao\"]","abstract":"Current zero-shot Camouflaged Object Segmentation methods typically employ a two-stage pipeline (discover-then-segment): using MLLMs to obtain visual prompts, followed by SAM segmentation. However, relying solely on MLLMs for camouflaged object discovery often leads to inaccurate localization, false positives, and missed detections. To address these issues, we propose the \\textbf{D}iscover-\\textbf{S}egment-\\textbf{S}elect (\\textbf{DSS}) mechanism, a progressive framework designed to refine segme","published":"2026-02-23T15:15:37+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19944v1","arxiv_url":"http://arxiv.org/abs/2602.19944v1","comment":"Accepted by CVPR 2026 (main conference)","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"DSS proposes a Discover-Segment-Select mechanism for zero-shot camouflaged object segmentation that progressively refines segmentation. 
The framework combines feature-coherent object discovery, SAM segmentation, and semantic-driven mask selection, achieving state-of-the-art performance without training.","reasoning":"No code/weights available; moderate novelty in progressive refinement approach; good practical applicability for zero-shot scenarios.","code_url":null,"s2_tldr":"The proposed DSS mechanism contains a Feature-coherent Object Discovery module that leverages visual features to generate diverse object proposals, a segmentation module that refines these proposals through SAM segmentation, and a Semantic-driven Mask Selection module that employs MLLMs to evaluate and select the optimal segmentation mask from multiple candidates.","s2_paper_id":"84a44921755e2b30431b4ad3a72178d65e9f72fb","topics":"[\"3D / Vision\"]"},{"id":131,"run_id":1,"domain":"aiml","arxiv_id":"2602.19937","entry_id":"","title":"Learning Positive-Incentive Point Sampling in Neural Implicit Fields for Object Pose Estimation","authors":"[\"Yifei Shi\", \"Boyan Wan\", \"Xin Xu\", \"Kai Xu\"]","abstract":"Learning neural implicit fields of 3D shapes is a rapidly emerging field that enables shape representation at arbitrary resolutions. Due to the flexibility, neural implicit fields have succeeded in many research areas, including shape reconstruction, novel view image synthesis, and more recently, object pose estimation. Neural implicit fields enable learning dense correspondences between the camera space and the object's canonical space-including unobserved regions in camera space-significantly ","published":"2026-02-23T15:10:00+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19937v1","arxiv_url":"http://arxiv.org/abs/2602.19937v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"This work combines SO(3)-equivariant convolutional implicit networks with positive-incentive point sampling for object pose estimation. The method predicts canonical coordinates with SO(3)-equivariance and dynamically determines sampling locations, outperforming state-of-the-art in challenging scenarios.","reasoning":"No code/weights indicated; moderate novelty in combining equivariance with adaptive sampling; good applicability for pose estimation tasks.","code_url":null,"s2_tldr":"A method combining an SO(3)-equivariant convolutional implicit network and a positive-incentive point sampling strategy, demonstrating superior performance compared to most existing baselines and demonstrating significant improvements in challenging scenarios, such as objects captured with unseen pose, high occlusion, novel geometry, and severe noise.","s2_paper_id":"363595550e0aa34e747198782867b5df57c08f08","topics":"[\"Image Generation\", \"3D / Vision\"]"},{"id":132,"run_id":1,"domain":"aiml","arxiv_id":"2602.19910","entry_id":"","title":"Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery","authors":"[\"Wei He\", \"Xianghan Meng\", \"Zhiyuan Huang\", \"Xianbiao Qi\", \"Rong Xiao\", \"Chun-Guang Li\"]","abstract":"Generalized Category Discovery (GCD) aims to identify both known and unknown categories, with only partial labels given for the known categories, posing a challenging open-set recognition problem. 
State-of-the-art approaches for the GCD task are usually built on multi-modality representation learning, which is heavily dependent upon inter-modality alignment. However, few of them enforce a proper intra-modality alignment to generate a desired underlying structure of representation distributions. In this","published":"2026-02-23T14:51:09+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19910v1","arxiv_url":"http://arxiv.org/abs/2602.19910v1","comment":"15 pages, accepted by CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"SSR\u00b2-GCD proposes a multi-modal representation learning framework for Generalized Category Discovery via Semi-Supervised Rate Reduction. The approach emphasizes intra-modality alignment and integrates prompt candidates through VLM inter-modal alignment, achieving superior performance on generic and fine-grained benchmarks.","reasoning":"No code/weights indicated; moderate novelty in rate reduction for GCD; good practical applicability for open-set recognition tasks.","code_url":null,"s2_tldr":"A novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR$^2$-GCD, to learn cross-modality representations with desired structural properties by properly aligning intra-modality relationships is proposed.","s2_paper_id":"36d2741e293aad7ab2a1284bbcf8c81633e3a4ab","topics":"[\"Training\"]"},{"id":133,"run_id":1,"domain":"aiml","arxiv_id":"2602.19756","entry_id":"","title":"Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis","authors":"[\"Junhyeok Choi\", \"Sangwoo Mo\", \"Minwoo Chae\"]","abstract":"Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillati","published":"2026-02-23T12:08:28+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19756v1","arxiv_url":"http://arxiv.org/abs/2602.19756v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Proposes learning-free multimodal dataset distillation using CLIP embeddings, prototypes, and unCLIP decoder for image synthesis. Eliminates full-dataset training while achieving SOTA cross-architecture generalization on multimodal tasks.","reasoning":"Novel training-free paradigm for dataset distillation with strong practical efficiency. 
No code/weights available.","code_url":null,"s2_tldr":"This work proposes a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures, achieving state-of-the-art cross-architecture generalization.","s2_paper_id":"c8ee00ad0dae3503e772b74c6d9e28dddc2f4348","topics":"[\"Multimodal\", \"Efficiency\", \"Benchmark\"]"},{"id":135,"run_id":1,"domain":"aiml","arxiv_id":"2602.19719","entry_id":"","title":"Generative 6D Pose Estimation via Conditional Flow Matching","authors":"[\"Amir Hamza\", \"Davide Boscaini\", \"Weihang Li\", \"Benjamin Busam\", \"Fabio Poiesi\"]","abstract":"Existing methods for instance-level 6D pose estimation typically rely on neural networks that either directly regress the pose in $\\mathrm{SE}(3)$ or estimate it indirectly via local feature matching. The former struggle with object symmetries, while the latter fail in the absence of distinctive local features. To overcome these limitations, we propose a novel formulation of 6D pose estimation as a conditional flow matching problem in $\\mathbb{R}^3$. We introduce Flose, a generative method that ","published":"2026-02-23T11:15:12+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19719v1","arxiv_url":"http://arxiv.org/abs/2602.19719v1","comment":"Project Website : https://tev-fbk.github.io/Flose/","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Flose formulates 6D pose estimation as conditional flow matching in R\u00b3, combining appearance-based semantic features with geometric guidance to handle object symmetries. Achieves +4.5pp improvement on BOP benchmark with RANSAC-based outlier handling.","reasoning":"Novel generative formulation for pose estimation with SOTA results. Project website but no explicit code/weights link.","code_url":null,"s2_tldr":"This work introduces Flose, a generative method that infers object poses via a denoising process conditioned on local features and integrates appearance-based semantic features to mitigate ambiguities caused by object symmetries.","s2_paper_id":"4cc6a8b3c421defbb4f9df376fb2ef74d7c82bde","topics":"[]"},{"id":137,"run_id":1,"domain":"aiml","arxiv_id":"2602.19710","entry_id":"","title":"Universal Pose Pretraining for Generalizable Vision-Language-Action Policies","authors":"[\"Haitao Lin\", \"Hanyang Yu\", \"Jingshun Huang\", \"He Zhang\", \"Yonggen Ling\", \"Ping Tan\", \"Xiangyang Xue\", \"Yanwei Fu\"]","abstract":"Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. 
To resolve these misalignments, we propose Pose-VLA, a decoupled paradig","published":"2026-02-23T11:00:08+00:00","categories":"[\"cs.CV\", \"cs.LG\", \"cs.RO\"]","pdf_url":"https://arxiv.org/pdf/2602.19710v1","arxiv_url":"http://arxiv.org/abs/2602.19710v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Pose-VLA decouples VLA training into universal 3D spatial pretraining (camera-centric) and embodiment-specific post-training via discrete pose tokens. Achieves 79.5% on RoboTwin 2.0, 96.0% on LIBERO with efficient real-world generalization from 100 demos.","reasoning":"Novel decoupled pretraining paradigm for robotics with strong efficiency gains. No code/weights available.","code_url":null,"s2_tldr":"Pose-VLA is proposed, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space.","s2_paper_id":"9b68406f1b01eb18349a94776f5e8087e963c852","topics":"[\"Multimodal\", \"3D / Vision\", \"Training\"]"},{"id":138,"run_id":1,"domain":"aiml","arxiv_id":"2602.19631","entry_id":"","title":"Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection","authors":"[\"Uichan Lee\", \"Jeonghyeon Kim\", \"Sangheum Hwang\"]","abstract":"Recent advances in text-to-image (T2I) diffusion models have seen rapid and widespread adoption. However, their powerful generative capabilities raise concerns about potential misuse for synthesizing harmful, private, or copyrighted content. To mitigate such risks, concept erasure techniques have emerged as a promising solution. Prior works have primarily focused on fine-tuning the denoising component (e.g., the U-Net backbone). However, recent causal tracing studies suggest that visual attribut","published":"2026-02-23T09:18:27+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19631v1","arxiv_url":"http://arxiv.org/abs/2602.19631v1","comment":"Accepted at ICLR 2026. The first two authors contributed equally","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"HiRM erases concepts in T2I diffusion by misdirecting text encoder's high-level representations while updating only early layers containing visual attribute causal states. Achieves precise removal with minimal impact on non-target concepts, transferable to Flux.","reasoning":"Novel localized concept erasure approach with strong practical implications. 
No code/weights despite safety focus.","code_url":null,"s2_tldr":"High-Level Representation Misdirection (HiRM) is proposed, which misdirects high-level semantic representations of target concepts in the text encoder toward designated vectors such as random directions or semantically defined directions, while updating only early layers that contain causal states of visual attributes.","s2_paper_id":"f81ba3ba30234aac1c3d4f70adaa439c3e6eb21c","topics":"[\"Image Generation\", \"Language Models\"]"},{"id":140,"run_id":1,"domain":"aiml","arxiv_id":"2602.19570","entry_id":"","title":"VALD: Multi-Stage Vision Attack Detection for Efficient LVLM Defense","authors":"[\"Nadav Kadvil\", \"Ayellet Tal\"]","abstract":"Large Vision-Language Models (LVLMs) can be vulnerable to adversarial images that subtly bias their outputs toward plausible yet incorrect responses. We introduce a general, efficient, and training-free defense that combines image transformations with agentic data consolidation to recover correct model behavior. A key component of our approach is a two-stage detection mechanism that quickly filters out the majority of clean inputs. We first assess image consistency under content-preserving trans","published":"2026-02-23T07:39:43+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19570v1","arxiv_url":"http://arxiv.org/abs/2602.19570v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"VALD defends LVLMs against adversarial images via training-free multi-stage detection combining image transformations, text-embedding consistency checks, and LLM-based consolidation. Most clean images skip costly processing, achieving SOTA accuracy with minimal overhead.","reasoning":"No code/weights. Training-free defense is practical and efficient, but incremental improvement in adversarial robustness. Could be useful for practitioners.","code_url":null,"s2_tldr":"This work introduces a general, efficient, and training-free defense that combines image transformations with agentic data consolidation to recover correct model behavior and achieves state-of-the-art accuracy while maintaining notable efficiency.","s2_paper_id":"63b902b1f34d46024c1f6ae075523b02474cbe63","topics":"[\"Efficiency\", \"Language Models\", \"Multimodal\"]"},{"id":141,"run_id":1,"domain":"aiml","arxiv_id":"2602.19549","entry_id":"","title":"Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework","authors":"[\"Yibo Yan\", \"Mingdong Ou\", \"Yi Cao\", \"Xin Zou\", \"Jiahao Huo\", \"Shuliang Liu\", \"James Kwok\", \"Xuming Hu\"]","abstract":"Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods like pruning and merging address imperfectly, creating a difficult trade-off between compression rate and feature fidelity. 
To overcome this dilemma, we introduce Prune-t","published":"2026-02-23T06:45:19+00:00","categories":"[\"cs.CL\", \"cs.CV\", \"cs.IR\"]","pdf_url":"https://arxiv.org/pdf/2602.19549v1","arxiv_url":"http://arxiv.org/abs/2602.19549v1","comment":"Under review","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Prune-then-Merge is a two-stage framework for multi-vector visual document retrieval that adaptively prunes low-information patches then hierarchically merges refined embeddings. It extends near-lossless compression range and improves performance at high compression ratios on 29 VDR datasets.","reasoning":"No code/weights. Incremental compression method for document retrieval. Practical for retrieval applications but not a paradigm shift.","code_url":null,"s2_tldr":"This work introduces Prune-then-Merge, a novel two-stage framework that synergizes these complementary approaches to summarizing semantic content without the noise-induced feature dilution seen in single-stage methods.","s2_paper_id":"52124a9a0f0b3d57674892b74c4de8dc0a646734","topics":"[\"Efficiency\", \"Retrieval / RAG\", \"Multimodal\"]"},{"id":142,"run_id":1,"domain":"aiml","arxiv_id":"2602.19536","entry_id":"","title":"Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection","authors":"[\"Zhiwei Ning\", \"Xuanang Gao\", \"Jiaxi Cao\", \"Runze Yang\", \"Huiying Xu\", \"Xinzhong Zhu\", \"Jie Yang\", \"Wei Liu\"]","abstract":"Linear modeling methods like Mamba have emerged as an effective backbone for the 3D object detection task. However, previous Mamba-based methods apply bidirectional encoding to the whole non-empty voxel sequence, which contains abundant useless background information in the scenes. Though directly encoding foreground voxels appears to be a plausible solution, it tends to degrade detection performance. We attribute this to the response attenuation and restricted context representation","published":"2026-02-23T06:03:07+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19536v1","arxiv_url":"http://arxiv.org/abs/2602.19536v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Fore-Mamba3D proposes a Mamba-based backbone for 3D object detection that enhances foreground encoding via regional-to-global sliding windows (RGSW) and semantic-assisted state spatial fusion (SASFMamba). It addresses response attenuation and restricted context in foreground-only sequences.","reasoning":"No code/weights. Mamba for 3D detection is emerging; foreground enhancement is incremental. 
Practical for 3D detection but not a major shift.","code_url":null,"s2_tldr":"This work proposes a novel backbone, termed Fore-Mamba3D, to focus on the foreground enhancement by modifying Mamba-based encoder, and emphasizes foreground-only encoding and alleviates the distance-based and causal dependencies in the linear autoregression model.","s2_paper_id":"e9093edb5ae4f82d8416d2a2dc03a684f90a4811","topics":"[\"3D / Vision\", \"Architecture\"]"},{"id":143,"run_id":1,"domain":"aiml","arxiv_id":"2602.19530","entry_id":"","title":"ORION: ORthonormal Text Encoding for Universal VLM AdaptatION","authors":"[\"Omprakash Chakraborty\", \"Jose Dolz\", \"Ismail Ben Ayed\"]","abstract":"Vision language models (VLMs) have demonstrated remarkable generalization across diverse tasks, yet their performance remains constrained by the quality and geometry of the textual prototypes used to represent classes. Standard zero shot classifiers, derived from frozen text encoders and handcrafted prompts, may yield correlated or weakly separated embeddings that limit task specific discriminability. We introduce ORION, a text encoder fine tuning framework that improves pretrained VLMs using on","published":"2026-02-23T05:47:28+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19530v1","arxiv_url":"http://arxiv.org/abs/2602.19530v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"ORION fine-tunes VLM text encoders using low-rank adaptation with a loss promoting pairwise orthogonality between class prototypes and penalizing deviations. It improves zero-shot, few-shot, and test-time adaptation across 11 benchmarks and three VLM backbones as a plug-and-play module.","reasoning":"No code/weights. Orthogonality constraint for text embeddings is incremental but practical. Consistent improvements over SOTA methods are valuable.","code_url":null,"s2_tldr":"This work introduces ORION, a text encoder fine tuning framework that improves pretrained VLMs using only class names, and provides a probabilistic interpretation of the orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens theorem.","s2_paper_id":"93eec3ff6dc19cf77c945a6f244b0d46a10e5bbb","topics":"[\"Multimodal\", \"Language Models\", \"Retrieval / RAG\"]"},{"id":145,"run_id":1,"domain":"aiml","arxiv_id":"2602.19461","entry_id":"","title":"Laplacian Multi-scale Flow Matching for Generative Modeling","authors":"[\"Zelin Zhao\", \"Petr Molodyk\", \"Haotian Xue\", \"Yongxin Chen\"]","abstract":"In this paper, we present Laplacian multiscale flow matching (LapFlow), a novel framework that enhances flow matching by leveraging multi-scale representations for image generative modeling. Our approach decomposes images into Laplacian pyramid residuals and processes different scales in parallel through a mixture-of-transformers (MoT) architecture with causal attention mechanisms. 
Unlike previous cascaded approaches that require explicit renoising between scales, our model generates multi-scale","published":"2026-02-23T03:09:56+00:00","categories":"[\"cs.CV\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19461v1","arxiv_url":"http://arxiv.org/abs/2602.19461v1","comment":"Accepted to appear in ICLR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"Introduces Laplacian multiscale flow matching (LapFlow) for image generation using parallel multi-scale representations via mixture-of-transformers. Achieves superior sample quality with fewer GFLOPs and faster inference than baselines, scaling effectively to 1024\u00d71024 resolution while eliminating cascaded renoising steps.","reasoning":"Accepted at ICLR 2026 with novel architecture for flow matching, but no released code/weights yet. Efficiency improvements are valuable for practitioners.","code_url":null,"s2_tldr":"This paper presents Laplacian multiscale flow matching (LapFlow), a novel framework that enhances flow matching by leveraging multi-scale representations for image generative modeling and achieves superior sample quality with fewer GFLOPs and faster inference compared to single-scale and multi-scale flow matching baselines.","s2_paper_id":"ed8ec3846a1755403ed7af5ec46650d745995912","topics":"[\"Image Generation\", \"Architecture\"]"},{"id":147,"run_id":1,"domain":"aiml","arxiv_id":"2602.19308","entry_id":"","title":"WildOS: Open-Vocabulary Object Search in the Wild","authors":"[\"Hardik Shah\", \"Erica Tevere\", \"Deegan Atha\", \"Marcel Kaufmann\", \"Shehryar Khattak\", \"Manthan Patel\", \"Marco Hutter\", \"Jonas Frey\", \"Patrick Spieler\"]","abstract":"Autonomous navigation in complex, unstructured outdoor environments requires robots to operate over long ranges without prior maps and with limited depth sensing. In such settings, relying solely on geometric frontiers for exploration is often insufficient, and the ability to reason semantically about where to go and what is safe to traverse is crucial for robust, efficient exploration. This work presents WildOS, a unified system for long-range, open-vocabulary object search that combin","published":"2026-02-22T19:14:00+00:00","categories":"[\"cs.RO\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19308v1","arxiv_url":"http://arxiv.org/abs/2602.19308v1","comment":"28 pages, 16 figures, 2 tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"WildOS enables long-range, open-vocabulary object search for autonomous robots in unstructured outdoor environments. The system combines safe geometric exploration with semantic visual reasoning using foundation models, building a sparse navigation graph scored by visual similarity and traversability predictions.","reasoning":"Novel application of foundation models to robotics with practical field deployment. 
No code/weights available, limiting reproducibility and immediate use.","code_url":null,"s2_tldr":"This work presents WildOS, a unified system for long-range, open-vocabulary object search that combines safe geometric exploration with semantic visual reasoning and introduces a particle-filter-based method for coarse localization of the open-vocabulary target query, enabling effective planning toward distant goals.","s2_paper_id":"03331c5b6592af945c030ef9acb2aedea3b719ff","topics":"[\"Efficiency\", \"Robotics\"]"},{"id":148,"run_id":1,"domain":"aiml","arxiv_id":"2602.19188","entry_id":"","title":"PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration","authors":"[\"Chen Duan\", \"Zhentao Guo\", \"Pei Fu\", \"Zining Wang\", \"Kai Zhou\", \"Pengfei Yan\"]","abstract":"In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing, and thus inherently lacks the positional reasoning required for precise visual tasks, such as text spotting and text","published":"2026-02-22T13:36:48+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19188v1","arxiv_url":"http://arxiv.org/abs/2602.19188v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"PositionOCR integrates text spotting specialists with MLLMs through a 131M-parameter hybrid architecture. Combines positional accuracy of specialists with contextual reasoning of LLMs for improved text grounding and spotting in OCR-centric VQA tasks.","reasoning":"Novel hybrid approach addressing real limitation of MLLMs in spatial reasoning. Parameter-efficient but no code/weights.","code_url":null,"s2_tldr":"This work introduces PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model's positional strengths with an LLM's contextual reasoning, and demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting.","s2_paper_id":"3be5be059967b30a389804100c34615c8ed51b2f","topics":"[\"Language Models\", \"Multimodal\", \"Reasoning\"]"},{"id":149,"run_id":1,"domain":"aiml","arxiv_id":"2602.19140","entry_id":"","title":"CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion","authors":"[\"Sijie Mai\", \"Shiqin Han\"]","abstract":"Modality gap significantly restricts the effectiveness of multimodal fusion. Previous methods often use techniques such as diffusion models and adversarial learning to reduce the modality gap, but they typically focus on one-to-one alignment without exposing the data points of the source modality to the global distribution information of the target modality. 
To this end, leveraging the characteristic of rectified flow that can map one distribution to another via a straight trajectory, we extend ","published":"2026-02-22T12:12:05+00:00","categories":"[\"cs.CV\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19140v1","arxiv_url":"http://arxiv.org/abs/2602.19140v1","comment":"Accepted by CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"CaReFlow extends rectified flow for multimodal fusion through one-to-many distribution mapping with adaptive relaxed alignment and cyclic rectified flow for information preservation. Addresses modality gap through global distribution exposure and achieves competitive results on affective computing.","reasoning":"Novel extension of rectified flow to multimodal learning. CVPR 2026 accepted but no code/weights available.","code_url":null,"s2_tldr":"This work extends rectified flow to achieve more accurate distribution mapping and address the ambiguous flow directions in one-to-many mapping, and designs `adaptive relaxed alignment', enforcing stricter alignment for modality pairs belonging to the same sample, while applying relaxed mapping for pairs not belonging to the same sample or category.","s2_paper_id":"1713330c8b4b474ab845c4a0c5fa76d59b70125c","topics":"[\"Multimodal\", \"Training\"]"},{"id":150,"run_id":1,"domain":"aiml","arxiv_id":"2602.19117","entry_id":"","title":"Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models","authors":"[\"Jaeyun Jang\", \"Seunghui Shin\", \"Taeho Park\", \"Hyoseok Hwang\"]","abstract":"Perspective-aware spatial reasoning involves understanding spatial relationships from specific viewpoints-either egocentric (observer-centered) or allocentric (object-centered). While vision-language models (VLMs) perform well in egocentric settings, their performance deteriorates when reasoning from allocentric viewpoints, where spatial relations must be inferred from the perspective of objects within the scene. In this study, we address this underexplored challenge by introducing Symbolic Proj","published":"2026-02-22T10:18:54+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19117v2","arxiv_url":"http://arxiv.org/abs/2602.19117v2","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"SymPL reformulates allocentric spatial reasoning (object-centered viewpoints) as symbolic-layout representations that VLMs handle well. Using projection, abstraction, bipartition, and localization, it substantially improves both allocentric and egocentric spatial reasoning performance, showing robust results under visual illusions and multi-view scenarios.","reasoning":"Novel approach to a challenging problem in VLMs (allocentric reasoning), but no code/weights available. 
Practical method but limited to specific spatial reasoning tasks.","code_url":null,"s2_tldr":"Symbolic Projective Layout (SymPL) is introduced, a framework that reformulates allocentric reasoning into symbolic-layout forms that VLMs inherently handle well and substantially improves performance in both allocentric and egocentric tasks.","s2_paper_id":"47c053616ebb89bf6ff0fe6cde8267ff2e261d10","topics":"[\"Language Models\", \"Multimodal\", \"Reasoning\"]"},{"id":151,"run_id":1,"domain":"aiml","arxiv_id":"2602.19091","entry_id":"","title":"CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension","authors":"[\"Lihao Liu\", \"Yan Wang\", \"Biao Yang\", \"Da Li\", \"Jiangxia Cao\", \"Yuxiao Luo\", \"Xiang Chen\", \"Xiangyu Wu\", \"Wei Yuan\", \"Fan Yang\"]","abstract":"Multimodal Large Language Models (MLLMs) have shown remarkable success in comprehension tasks such as visual description and visual question answering. However, their direct application to embedding-based tasks like retrieval remains challenging due to the discrepancy between output formats and optimization objectives. Previous approaches often employ contrastive fine-tuning to adapt MLLMs for retrieval, but at the cost of losing their generative capabilities. We argue that both generative and e","published":"2026-02-22T08:09:51+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19091v1","arxiv_url":"http://arxiv.org/abs/2602.19091v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"CREM unifies multimodal retrieval and generation in MLLMs through compression-driven training. It uses learnable chorus tokens to aggregate multimodal semantics and a compression-driven strategy integrating contrastive and generative objectives, achieving SOTA retrieval on MMEB while maintaining strong generative performance.","reasoning":"Solid unification approach bridging retrieval and generation, but no code/weights. Practical for multi-task scenarios but incremental improvement over existing MLLM methods.","code_url":null,"s2_tldr":"This work proposes CREM (Compression-driven Representation Enhanced Model), with a unified framework that enhances multimodal representations for retrieval while preserving generative ability and highlights that generative supervision can further improve the representational quality of MLLMs under the proposed compression-driven paradigm.","s2_paper_id":"960335d02b55cff9bd349c46de8e717f9acd361e","topics":"[\"Multimodal\", \"Retrieval / RAG\", \"Efficiency\"]"},{"id":152,"run_id":1,"domain":"aiml","arxiv_id":"2602.19024","entry_id":"","title":"Towards Calibrating Prompt Tuning of Vision-Language Models","authors":"[\"Ashshak Sharifdeen\", \"Fahad Shamshad\", \"Muhammad Akhtar Munir\", \"Abhishek Basu\", \"Mohamed Insaf Ismithdeen\", \"Jeyapriyan Jeyamohan\", \"Chathurika Sewwandi Silva\", \"Karthik Nandakumar\", \"Muhammad Haris Khan\"]","abstract":"Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and unreliable predictive uncertainty. We address this problem by proposing a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. 
Our approach extends the standard cross-entropy loss with two ","published":"2026-02-22T03:26:23+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19024v1","arxiv_url":"http://arxiv.org/abs/2602.19024v1","comment":"Accepted to CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Proposes calibration framework for CLIP prompt tuning using mean-variance margin penalty and text moment-matching loss. Tested across 7 prompt-tuning methods and 11 datasets, significantly reducing Expected Calibration Error while preserving CLIP embedding space geometry for robust generalization.","reasoning":"Practical calibration approach for CLIP, CVPR 2026. Useful for reliable predictions but incremental improvement, no code/weights available.","code_url":null,"s2_tldr":"This work proposes a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization and reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques on both base and novel classes.","s2_paper_id":"bfd284603d35968d2b32f3a00759af3951a0a12a","topics":"[\"Language Models\", \"Multimodal\", \"Efficiency\"]"},{"id":153,"run_id":1,"domain":"aiml","arxiv_id":"2602.19001","entry_id":"","title":"A Benchmark and Knowledge-Grounded Framework for Advanced Multimodal Personalization Study","authors":"[\"Xia Hu\", \"Honglei Zhuang\", \"Brian Potetz\", \"Alireza Fathi\", \"Bo Hu\", \"Babak Samari\", \"Howard Zhou\"]","abstract":"The powerful reasoning of modern Vision Language Models opens a new frontier for advanced personalization study. However, progress in this area is critically hampered by the lack of suitable benchmarks. To address this gap, we introduce Life-Bench, a comprehensive, synthetically generated multimodal benchmark built on simulated user digital footprints. Life-Bench features over questions evaluating a wide spectrum of capabilities, from persona understanding to complex reasoning over historical dat","published":"2026-02-22T01:44:16+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19001v1","arxiv_url":"http://arxiv.org/abs/2602.19001v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Introduces Life-Bench, a comprehensive multimodal benchmark for advanced personalization study built on simulated user footprints, and LifeGraph framework organizing personal context into knowledge graphs. Experiments reveal existing methods fail on complex personalized tasks, with large performance gaps in relational/temporal/aggregative reasoning.","reasoning":"Important benchmark contribution for personalization research. 
LifeGraph shows promise but complex tasks remain open challenges; no code available.","code_url":null,"s2_tldr":"This work introduces Life-Bench, a comprehensive, synthetically generated multimodal benchmark built on simulated user digital footprints, and proposes LifeGraph, an end-to-end framework that organizes personal context into a knowledge graph to facilitate structured retrieval and reasoning.","s2_paper_id":"9f9abe950a2808b3c34e835d0ff1b233b68141b5","topics":"[\"Multimodal\", \"Benchmark\", \"Language Models\"]"},{"id":154,"run_id":1,"domain":"aiml","arxiv_id":"2602.18990","entry_id":"","title":"IDSelect: A RL-Based Cost-Aware Selection Agent for Video-based Multi-Modal Person Recognition","authors":"[\"Yuyang Ji\", \"Yixuan Shen\", \"Kien Nguyen\", \"Lifeng Zhou\", \"Feng Liu\"]","abstract":"Video-based person recognition achieves robust identification by integrating face, body, and gait. However, current systems waste computational resources by processing all modalities with fixed heavyweight ensembles regardless of input complexity. To address these limitations, we propose IDSelect, a reinforcement learning-based cost-aware selector that chooses one pre-trained model per modality per sequence to optimize the accuracy-efficiency trade-off. Our key insight is that an input-condition","published":"2026-02-22T00:32:59+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.18990v1","arxiv_url":"http://arxiv.org/abs/2602.18990v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"IDSelect uses RL to dynamically select one pre-trained model per modality (face/body/gait) per video sequence for person recognition, optimizing accuracy-efficiency trade-off. Achieves 95.9% Rank-1 on CCVID with 92.4% less computation than baselines, and 41.3% reduction on MEVID while maintaining performance.","reasoning":"Practical RL-based selection for computational efficiency. Solid engineering but incremental approach applying RL to model selection, no code available.","code_url":null,"s2_tldr":"IDSelect is a reinforcement learning-based cost-aware selector that chooses one pre-trained model per modality per sequence to optimize the accuracy-efficiency trade-off, and shows that an input-conditioned selector can discover complementary model choices that surpass fixed ensembles while using substantially fewer resources.","s2_paper_id":"ceacc0443343980d03d5417723fcfd3241bb0190","topics":"[\"RL\", \"Agents\", \"Training\"]"},{"id":156,"run_id":1,"domain":"aiml","arxiv_id":"2602.18887","entry_id":"","title":"SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World","authors":"[\"Jungho Kim\", \"Jiyong Oh\", \"Seunghoon Yu\", \"Hongjae Shin\", \"Donghyuk Kwak\", \"Jun Won Choi\"]","abstract":"The end-to-end (E2E) paradigm, which maps sensor inputs directly to driving decisions, has recently attracted significant attention due to its unified modeling capability and scalability. However, ensuring safety in this unified framework remains one of the most critical challenges. In this work, we propose SafeDrive, an E2E planning framework designed to perform explicit and interpretable safety reasoning through a trajectory-conditioned Sparse World Model. 
SafeDrive comprises two complementary","published":"2026-02-21T16:17:28+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.18887v1","arxiv_url":"http://arxiv.org/abs/2602.18887v1","comment":"Accepted to CVPR 2026, 19 pages, 9 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"SafeDrive is an end-to-end autonomous driving framework that performs explicit safety reasoning through trajectory-conditioned Sparse World Model. Achieves 91.6 PDMS on NAVSIM with only 0.5% collision rate and 66.8% driving score on Bench2Drive through agent-specific collision risk evaluation.","reasoning":"Novel approach to E2E driving safety with strong benchmark results. CVPR 2026 accepted. No code/weights shared. High practical value for autonomous driving applications.","code_url":null,"s2_tldr":"This work proposes SafeDrive, an E2E planning framework designed to perform explicit and interpretable safety reasoning through a trajectory-conditioned Sparse World Model, which achieves state-of-the-art performance on both open-loop and closed-loop benchmarks.","s2_paper_id":"bed1322acc45b08b2a7ab5ca0fa3b73368875671","topics":"[\"Reasoning\", \"Agents\", \"World Models\"]"},{"id":157,"run_id":1,"domain":"aiml","arxiv_id":"2602.18886","entry_id":"","title":"PhysConvex: Physics-Informed 3D Dynamic Convex Radiance Fields for Reconstruction and Simulation","authors":"[\"Dan Wang\", \"Xinrui Cui\", \"Serge Belongie\", \"Ravi Ramamoorthi\"]","abstract":"Reconstructing and simulating dynamic 3D scenes with both visual realism and physical consistency remains a fundamental challenge. Existing neural representations, such as NeRFs and 3DGS, excel in appearance reconstruction but struggle to capture complex material deformation and dynamics. We propose PhysConvex, a Physics-informed 3D Dynamic Convex Radiance Field that unifies visual rendering and physical simulation. PhysConvex represents deformable radiance fields using physically grounded conve","published":"2026-02-21T16:16:33+00:00","categories":"[\"cs.CV\", \"cs.GR\"]","pdf_url":"https://arxiv.org/pdf/2602.18886v1","arxiv_url":"http://arxiv.org/abs/2602.18886v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"PhysConvex unifies visual rendering and physical simulation using physics-informed dynamic convex radiance fields governed by continuum mechanics. Introduces boundary-driven representation with reduced-order simulation using neural skinning eigenmodes, achieving high-fidelity reconstruction of geometry, appearance, and physical properties from videos.","reasoning":"Novel physics-based approach to dynamic scene reconstruction. No code/weights shared. 
Strong theoretical contribution but uncertain practical deployment complexity for practitioners.","code_url":null,"s2_tldr":"PhysConvex is proposed, a Physics-informed 3D Dynamic Convex Radiance Field that unifies visual rendering and physical simulation, and achieves high-fidelity reconstruction of geometry, appearance, and physical properties from videos, outperforming existing methods.","s2_paper_id":"fd88efe37a43efa7d81fdcc2d99e331b77ccafe1","topics":"[\"3D / Vision\"]"},{"id":159,"run_id":1,"domain":"aiml","arxiv_id":"2602.21143","entry_id":"","title":"A Benchmark for Deep Information Synthesis","authors":"[\"Debjit Paul\", \"Daniel Murphy\", \"Milan Gritta\", \"Ronald Cardenas\", \"Victor Prokhorov\", \"Lena Sophia Bolliger\", \"Aysim Toker\", \"Roy Miles\", \"Andreea-Maria Oncescu\", \"Jasivan Alex Sivakumar\"]","abstract":"Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming probl","published":"2026-02-24T17:43:32+00:00","categories":"[\"cs.AI\", \"cs.CL\", \"cs.IR\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.21143v1","arxiv_url":"http://arxiv.org/abs/2602.21143v1","comment":"Accepted at ICLR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"DEEPSYNTH introduces benchmark with 120 tasks across 7 domains evaluating LLM agents on information synthesis requiring multi-source reasoning. State-of-the-art models achieve maximum 17.5 F1 on LLM-judge metric. ICLR 2026 accepted benchmark reveals current agents struggle with hallucinations and large information spaces.","reasoning":"ICLR accepted benchmark. No code/weights but important evaluation resource. Identifies critical gaps in current agent capabilities with rigorous multi-stage annotation pipeline.","code_url":null,"s2_tldr":"DEEPSYNTH is a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights, and reveals that current agents struggle with hallucinations and reasoning over large information spaces.","s2_paper_id":"4e2b7a8edceeada9462218fbf45e1370de41a32a","topics":"[\"Benchmark\", \"Language Models\", \"Agents\"]"},{"id":160,"run_id":1,"domain":"aiml","arxiv_id":"2602.21103","entry_id":"","title":"Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning","authors":"[\"Sanket Badhe\", \"Deep Shah\"]","abstract":"Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). 
We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressiv","published":"2026-02-24T17:03:21+00:00","categories":"[\"cs.CL\", \"cs.IR\"]","pdf_url":"https://arxiv.org/pdf/2602.21103v1","arxiv_url":"http://arxiv.org/abs/2602.21103v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":8.0,"composite":5.15,"summary":"Prompt-Level Distillation (PLD) extracts reasoning patterns from Teacher models into structured instructions for Student System Prompts, achieving non-parametric knowledge transfer. Improved Gemma-3 4B Macro F1 from 57% to 90% on StereoSet and 67% to 83% on Contract-NLI with negligible latency overhead and full interpretability.","reasoning":"Novel non-parametric approach avoiding fine-tuning overhead. No code/weights. High practical value for edge devices and regulated industries through transparent reasoning.","code_url":null,"s2_tldr":null,"s2_paper_id":"16931d834a9ae893e747b7757fadfa436bacab17","topics":"[\"Language Models\", \"Efficiency\", \"Reasoning\"]"},{"id":161,"run_id":1,"domain":"aiml","arxiv_id":"2602.20816","entry_id":"","title":"Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation","authors":"[\"Sayantan Dasgupta\", \"Trevor Cohn\", \"Timothy Baldwin\"]","abstract":"The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the next tokens with the highest probabilities, i.e., the teacher's modes, thereby diminishing the influence of less probable yet potentially informative components of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K","published":"2026-02-24T11:54:06+00:00","categories":"[\"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.20816v1","arxiv_url":"http://arxiv.org/abs/2602.20816v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Proposes tail-aware KL divergence for language model distillation that decouples top-K probabilities from tail, increasing contribution of less probable but informative predictions. Achieves competitive performance in pre-training and supervised distillation with modest computational budget.","reasoning":"Novel divergence formulation for distillation with practical efficiency gains, but no code/weights released. Useful for practitioners doing model compression.","code_url":null,"s2_tldr":"A new tail-aware divergence is proposed that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions, while maintaining the same computational profile as the KL Divergence.","s2_paper_id":"56083e25fd201e76d66a23f09130af424978ecd7","topics":"[\"Language Models\", \"Efficiency\"]"},{"id":162,"run_id":1,"domain":"aiml","arxiv_id":"2602.20759","entry_id":"","title":"Overton Pluralistic Reinforcement Learning for Large Language Models","authors":"[\"Yu Fu\", \"Seongho Son\", \"Ilija Bogunovic\"]","abstract":"Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. 
Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow","published":"2026-02-24T10:39:27+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20759v1","arxiv_url":"http://arxiv.org/abs/2602.20759v1","comment":"28 pages, 8 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"OP-GRPO enables a single LLM to generate pluralistic responses capturing diverse perspectives without explicit prompting. Using dual-reward system with similarity estimator, achieves 37.4% relative gain over 20B baseline using only 3B parameters.","reasoning":"Novel RL framework for value pluralism with strong efficiency results, but no open weights. Interesting alignment approach though limited to specific use case.","code_url":null,"s2_tldr":"OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration, is introduced.","s2_paper_id":"4c1e2e2b9671c7f92698fd63f3f049f921047cc7","topics":"[\"Language Models\", \"Training\", \"RL\"]"},{"id":163,"run_id":1,"domain":"aiml","arxiv_id":"2602.20670","entry_id":"","title":"CAMEL: Confidence-Gated Reflection for Reward Modeling","authors":"[\"Zirui Zhu\", \"Hailun Xu\", \"Yang Luo\", \"Yong Liu\", \"Kanchan Sarkar\", \"Kun Xu\", \"Yang You\"]","abstract":"Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance di","published":"2026-02-24T08:20:08+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20670v1","arxiv_url":"http://arxiv.org/abs/2602.20670v1","comment":"Preprint. 13 pages","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"CAMEL uses confidence-gated reflection for reward modeling, performing lightweight single-token decisions and selectively invoking reflection for low-confidence instances. 
Achieves state-of-the-art 82.9% average accuracy on reward-model benchmarks, outperforming 70B models using only 14B parameters.","reasoning":"Novel confidence-based approach with strong efficiency-accuracy trade-off and SOTA results, but no code/weights released.","code_url":null,"s2_tldr":"CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances, is proposed, establishing a strictly better accuracy-efficiency Pareto frontier.","s2_paper_id":"1ab46b041e6dfdeb31ec8c6ad866a8ca7c007af5","topics":"[\"Training\", \"RL\", \"Language Models\"]"},{"id":164,"run_id":1,"domain":"aiml","arxiv_id":"2602.20574","entry_id":"","title":"GATES: Self-Distillation under Privileged Context with Consensus Gating","authors":"[\"Alex Stein\", \"Furong Huang\", \"Tom Goldstein\"]","abstract":"We study self-distillation in settings where supervision is unreliable: there are no ground truth labels, verifiable rewards, or external graders to evaluate answers. We focus on document-grounded question answering with asymmetric context, where a single model serves as both tutor (with access to a relevant source document during training) and student (answering from the question alone at test time). Rather than assuming tutor correctness, we derive supervision online from tutor consensus by sa","published":"2026-02-24T05:56:20+00:00","categories":"[\"cs.LG\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20574v1","arxiv_url":"http://arxiv.org/abs/2602.20574v1","comment":"10 Pages of main text with an additional 7 pages of supplementary material","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"GATES introduces consensus-gated self-distillation for document-grounded QA where a model learns from its own reasoning traces when multiple samples agree, without ground truth labels. The method improves document-free accuracy from 46% to 62% in-domain and 20.2% to 35.4% on math benchmarks by distilling full reasoning trajectories rather than just answers.","reasoning":"Novel approach to self-distillation using consensus as supervision signal. No code/weights available hurts reproducibility. Strong empirical gains suggest practical value.","code_url":null,"s2_tldr":"This work focuses on document-grounded question answering with asymmetric context, where a single model serves as both tutor and student, and derives supervision online from tutor consensus by sampling multiple document-grounded reasoning traces and using agreement to gate learning.","s2_paper_id":"84746060eaa629bd427d317f1e778fe78d8badc1","topics":"[\"Efficiency\", \"Benchmark\", \"RL\"]"},{"id":165,"run_id":1,"domain":"aiml","arxiv_id":"2602.20332","entry_id":"","title":"No One Size Fits All: QueryBandits for Hallucination Mitigation","authors":"[\"Nicole Cho\", \"William Watson\", \"Alec Koppel\", \"Sumitra Ganesh\", \"Manuela Veloso\"]","abstract":"Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. 
We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the o","published":"2026-02-23T20:28:48+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.20332v1","arxiv_url":"http://arxiv.org/abs/2602.20332v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"QueryBandits is a model-agnostic contextual bandit framework that adaptively learns optimal query-rewrite strategies to mitigate LLM hallucinations in closed-source models. Thompson Sampling achieves 87.5% win rate over no-rewrite baseline and outperforms static policies by 42.6-60.3%. Demonstrates that rigid query-rewriting can worsen hallucinations.","reasoning":"Novel application of bandits to query rewriting for hallucination mitigation. Works with closed-source models (practical). Strong empirical results across 16 QA scenarios. No code released.","code_url":null,"s2_tldr":"QueryBandits is introduced, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function, substantiating the finding that there is no single rewrite policy optimal for all queries.","s2_paper_id":"4812337292e96cbeb5aefb0592f6bd2560f53ad8","topics":"[\"RL\", \"Language Models\", \"Reasoning\"]"},{"id":166,"run_id":1,"domain":"aiml","arxiv_id":"2602.20135","entry_id":"","title":"KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration","authors":"[\"Mohammad Amanlou\", \"Erfan Shafiee Moghaddam\", \"Yasaman Amou Jafari\", \"Mahdi Noori\", \"Farhan Farsi\", \"Behnam Bahrak\"]","abstract":"Large language models (LLMs) have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entiti","published":"2026-02-23T18:46:27+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.IR\"]","pdf_url":"https://arxiv.org/pdf/2602.20135v1","arxiv_url":"http://arxiv.org/abs/2602.20135v1","comment":"Accepted at the Third Conference on Parsimony and Learning (CPAL 2026). 36 pages, 12 figures. (Equal contribution: Yasaman Amou Jafari and Mahdi Noori.)","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"KNIGHT is an LLM-based framework for generating MCQ datasets from external sources using knowledge graphs. Constructs reusable topic-specific KGs that enable instructor-controlled difficulty and multi-hop questions without re-feeding full source text. Produces six datasets in History, Biology, and Mathematics with quality evaluation across five criteria.","reasoning":"Novel use of KGs for efficient, reusable MCQ generation with difficulty control. Practical for education/evaluation tasks. Token-efficient approach. 
No code release limits adoption.","code_url":null,"s2_tldr":"KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.","s2_paper_id":"34d59f9a0b3d26d63836d018f50556f0fdcd74ff","topics":"[\"Retrieval / RAG\", \"Language Models\", \"Benchmark\"]"},{"id":167,"run_id":1,"domain":"aiml","arxiv_id":"2602.19455","entry_id":"","title":"SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning","authors":"[\"Zelin He\", \"Boran Han\", \"Xiyuan Zhang\", \"Shuai Zhang\", \"Haotian Lin\", \"Qi Zhu\", \"Haoyang Fang\", \"Danielle C. Maddix\", \"Abdul Fatir Ansari\", \"Akash Chandrayan\"]","abstract":"Time-series diagnostic reasoning is essential for many applications, yet existing solutions face a persistent gap: general reasoning large language models (GRLMs) possess strong reasoning skills but lack the domain-specific knowledge to understand complex time-series patterns. Conversely, fine-tuned time-series LLMs (TSLMs) understand these patterns but lack the capacity to generalize reasoning for more complicated questions. To bridge this gap, we propose a hybrid knowledge-injection framework ","published":"2026-02-23T02:55:32+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\", \"stat.ML\"]","pdf_url":"https://arxiv.org/pdf/2602.19455v1","arxiv_url":"http://arxiv.org/abs/2602.19455v1","comment":"Accepted by the 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026)","source":"both","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"SenTSR-Bench proposes a hybrid framework injecting time-series-specific knowledge from specialized models into general reasoning LLMs, using RL with verifiable rewards to elicit knowledge-rich traces. The method achieves 9.1%-26.1% improvements over specialized models and 7.9%-22.4% over general LLMs on diagnostic reasoning tasks.","reasoning":"Novel approach bridging domain-specific and general LLMs through knowledge injection. Practical for time-series applications, though code availability uncertain and 'both' source suggests moderate visibility.","code_url":null,"s2_tldr":"This work proposes a hybrid knowledge-injection framework that injects TSLM-generated insights directly into GRLM's reasoning trace, thereby achieving strong time-series reasoning with in-domain knowledge.","s2_paper_id":"5737bb82342f65fa9518d87e74641b22a37533e5","topics":"[\"Reasoning\", \"Language Models\"]"},{"id":168,"run_id":1,"domain":"aiml","arxiv_id":"2602.19111","entry_id":"","title":"Astra: Activation-Space Tail-Eigenvector Low-Rank Adaptation of Large Language Models","authors":"[\"Kainan Liu\", \"Yong Zhang\", \"Ning Cheng\", \"Yun Zhu\", \"Yanmeng Wang\", \"Shaojun Wang\", \"Jing Xiao\"]","abstract":"Parameter-Efficient Fine-Tuning (PEFT) methods, especially LoRA, are widely used for adapting pre-trained models to downstream tasks due to their computational and storage efficiency. However, in the context of LoRA and its variants, the potential of activation subspaces corresponding to tail eigenvectors remains substantially under-exploited, which may lead to suboptimal fine-tuning performance. 
In this work, we propose Astra (Activation-Space Tail-Eigenvector Low-Rank Adaptation), a novel PEFT","published":"2026-02-22T09:54:40+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19111v1","arxiv_url":"http://arxiv.org/abs/2602.19111v1","comment":"22 pages, 10 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Astra proposes a PEFT method leveraging tail eigenvectors of model output activations from calibration data to construct task-adaptive low-rank adapters. Achieves faster convergence and improved performance with reduced parameters, outperforming LoRA baselines across 16 NLU/NLG benchmarks and sometimes surpassing full fine-tuning.","reasoning":"Novel approach to parameter-efficient fine-tuning with strong empirical results. High practical value for practitioners, though no code/weights mentioned despite promising performance.","code_url":null,"s2_tldr":"Astra (Activation-Space Tail-Eigenvector Low-Rank Adaptation), a novel PEFT method that leverages the tail eigenvectors of the model output activations-estimated from a small task-specific calibration set-to construct task-adaptive low-rank adapters achieves faster convergence and improved downstream performance with a significantly reduced parameter budget.","s2_paper_id":"87cdf9591e9fe510261a4c580ab0aeab21368cbf","topics":"[\"Language Models\", \"Efficiency\"]"},{"id":169,"run_id":1,"domain":"aiml","arxiv_id":"2602.19079","entry_id":"","title":"TriTopic: Tri-Modal Graph-Based Topic Modeling with Iterative Refinement and Archetypes","authors":"[\"Roman Egger\"]","abstract":"Topic modeling extracts latent themes from large text collections, but leading approaches like BERTopic face critical limitations: stochastic instability, loss of lexical precision (\"Embedding Blur\"), and reliance on a single data perspective. We present TriTopic, a framework that addresses these weaknesses through a tri-modal graph fusing semantic embeddings, TF-IDF, and metadata. Three core innovations drive its performance: hybrid graph construction via Mutual kNN and Shared Nearest Neighbo","published":"2026-02-22T07:29:53+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19079v1","arxiv_url":"http://arxiv.org/abs/2602.19079v1","comment":"11 pages, 7 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"TriTopic introduces tri-modal topic modeling fusing semantic embeddings, TF-IDF, and metadata through hybrid graph construction, consensus clustering, and iterative refinement. Achieves highest NMI across all benchmarks (mean 0.575 vs 0.513 for BERTopic) with 100% corpus coverage and is available as open-source PyPI library.","reasoning":"Novel approach addressing BERTopic's limitations with strong empirical results and open-source availability. 
Good practical value for text analysis practitioners, though not weights-based.","code_url":null,"s2_tldr":"TriTopic, a framework that addresses weaknesses through a tri-modal graph fusing semantic embeddings, TF-IDF, and metadata, achieves the highest NMI on every dataset, guarantees 100% corpus coverage with 0% outliers, and is available as an open-source PyPI library.","s2_paper_id":"bcad657ac67338045eb683a4ccba25f354b628a7","topics":"[\"Retrieval / RAG\"]"},{"id":170,"run_id":1,"domain":"aiml","arxiv_id":"2602.19043","entry_id":"","title":"Uncovering Context Reliance in Unstructured Knowledge Editing","authors":"[\"Zisheng Zhou\", \"Mengqi Zhang\", \"Shiguang Wu\", \"Xiaotian Ye\", \"Chi Zhang\", \"Zhumin Chen\", \"Pengjie Ren\"]","abstract":"Editing Large language models (LLMs) with real-world, unstructured knowledge is essential for correcting and updating their internal parametric knowledge. In this work, we revisit the fundamental next-token prediction (NTP) as a candidate paradigm for unstructured editing. We identify Context Reliance as a critical failure mode of NTP-based approaches, where knowledge acquired from edited text becomes highly dependent on its preceding context, leading to recall failures when that context is abse","published":"2026-02-22T04:44:34+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19043v1","arxiv_url":"http://arxiv.org/abs/2602.19043v1","comment":"21 pages, 14 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Identifies 'Context Reliance' as a critical failure mode in unstructured knowledge editing where acquired knowledge becomes dependent on preceding context. Proposes COIN framework to mitigate this by encouraging local knowledge focus, reducing Context Reliance by 45.2% and achieving 23.6% higher editing success rate.","reasoning":"Novel identification of fundamental failure mode with practical solution. Important for knowledge editing applications, though no code/weights mentioned despite strong empirical improvements.","code_url":null,"s2_tldr":"This work revisits the fundamental next-token prediction (NTP) as a candidate paradigm for unstructured editing and proposes a simple yet effective COntext-INdependent editing framework (COIN), encouraging model to focus on knowledge within local scope rather than memorizing contextual patterns.","s2_paper_id":"7ad3a2868bf3b0f9a0a40f9e0514ea4ab3971f23","topics":"[\"Language Models\"]"},{"id":171,"run_id":1,"domain":"aiml","arxiv_id":"2602.19008","entry_id":"","title":"Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks","authors":"[\"Wilson Y. Lee\"]","abstract":"Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs) and agent success depends critically on whether a trajectory stays within this path's operating envelope. 
We establish this causally using a ","published":"2026-02-22T02:37:57+00:00","categories":"[\"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19008v1","arxiv_url":"http://arxiv.org/abs/2602.19008v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"Analyzes why LLM agents fail on tasks they can solve by identifying 'canonical path deviation' as a causal mechanism. Using Toolathlon benchmark across 22 models, shows that successful runs adhere more closely to canonical solution paths. Proposes a simple restart-based intervention that improves success rates by 8.8 points.","reasoning":"Interesting causal analysis of agent failures with actionable intervention, but no code/weights available. Novel diagnostic framework but limited by lack of reproducibility artifacts.","code_url":null,"s2_tldr":"The findings imply that agent reliability cannot be improved by capability scaling alone, but offer a highly actionable intervention: a simple monitor that restarts the bottom tercile of runs based on mid-trajectory canonical adherence lifts success rates by ~8.8 percentage points among intervened runs.","s2_paper_id":"f95e01ad5375cf0390209bffab07bff4dcd30ab2","topics":"[\"Agents\"]"},{"id":172,"run_id":1,"domain":"aiml","arxiv_id":"2602.20191","entry_id":"","title":"MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs","authors":"[\"Dongwei Wang\", \"Jinhee Kim\", \"Seokho Han\", \"Denis Gudovskiy\", \"Yohei Nakata\", \"Tomoyuki Okuno\", \"KhayTze Peong\", \"Kang Eun Jeon\", \"Jong Hwan Ko\", \"Yiran Chen\"]","abstract":"Changing runtime complexity on cloud and edge devices necessitates elastic large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources. However, it has been observed that the calibration parameters for quantization are typically linked to specific precisions, which presents challenges during elastic-precision calibration and precision switching at runtime. In this work, we attribute the source of varying cali","published":"2026-02-21T21:11:08+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20191v1","arxiv_url":"http://arxiv.org/abs/2602.20191v1","comment":"17 pages, 12 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"Proposes MoBiQuant, a Mixture-of-Bits quantization framework for elastic LLM deployment that adjusts weight precision based on token sensitivity. Uses many-in-one recursive residual quantization and token-aware routing to enable smooth precision switching without repeated calibration. Matches PTQ performance on LLaMA3-8B.","reasoning":"Novel approach to elastic quantization addressing practical deployment needs, but no code/weights provided. 
Strong technical contribution but limited immediate reproducibility.","code_url":null,"s2_tldr":"This work proposes a novel Mixture-of-Bits quantization framework that adjusts weight precision for elastic LLM inference based on token sensitivity, and proposes the many-in-one recursive residual quantization that can iteratively reconstruct higher-precision weights and the token-aware router to dynamically select the number of residual bit slices.","s2_paper_id":"8b25e954589e020c88e153b26b5f8eaf16a108ef","topics":"[\"Efficiency\", \"Language Models\"]"},{"id":173,"run_id":1,"domain":"aiml","arxiv_id":"2602.18922","entry_id":"","title":"Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning","authors":"[\"Abhinaba Basu\"]","abstract":"Personal AI agents incur substantial cost via repeated LLM calls. We show existing caching methods fail: GPTCache achieves 37.9% accuracy on real benchmarks; APC achieves 0-12%. The root cause is optimizing for the wrong property -- cache effectiveness requires key consistency and precision, not classification accuracy. We observe cache-key evaluation reduces to clustering evaluation and apply V-measure decomposition to separate these on n=8,682 points across MASSIVE, BANKING77, CLINC150, and ","published":"2026-02-21T18:25:18+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.18922v1","arxiv_url":"http://arxiv.org/abs/2602.18922v1","comment":"28 pages, 15 figures, 8 tables, 5 appendices","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Identifies fundamental failures in agent caching systems (GPTCache 37.9%, APC 0-12% accuracy) and proposes W5H2 structured intent decomposition with SetFit. Achieves 91.1% accuracy on MASSIVE in ~2ms vs 68.8% for 20B-parameter LLM at 3,447ms. Projects 97.5% cost reduction for agent systems.","reasoning":"Highly practical solution to agent cost reduction with strong empirical results, but no code released. Significant practical value but limited by missing artifacts.","code_url":null,"s2_tldr":"This work introduces W5H2, a structured intent decomposition framework and provides risk-controlled selective prediction guarantees via RCPS with nine bound families, and shows existing caching methods fail.","s2_paper_id":"b709f4500272a73108ac5fd920f42e7fa4517e1e","topics":"[\"Agents\", \"Language Models\", \"Benchmark\"]"},{"id":174,"run_id":1,"domain":"aiml","arxiv_id":"2602.18721","entry_id":"","title":"ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models","authors":"[\"Zefang Liu\", \"Chenyang Zhu\", \"Sangwoo Cho\", \"Shi-Xiong Zhang\"]","abstract":"Semi-supervised learning in automatic speech recognition (ASR) typically relies on pseudo-labeling, which often suffers from confirmation bias and error accumulation due to noisy supervision. To address this limitation, we propose ReHear, a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop. 
Unlike conventional text-based correctors, our approach conditions the LLM on both the ASR hypothesis and","published":"2026-02-21T05:04:22+00:00","categories":"[\"cs.CL\", \"eess.AS\"]","pdf_url":"https://arxiv.org/pdf/2602.18721v1","arxiv_url":"http://arxiv.org/abs/2602.18721v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Introduces ReHear, framework for iterative pseudo-label refinement in semi-supervised ASR using instruction-tuned audio-aware LLMs. Conditions LLM on both ASR hypothesis and source audio to recover phonetically accurate transcripts. Effectively mitigates error propagation and outperforms supervised and pseudo-labeling baselines.","reasoning":"Novel approach to semi-supervised ASR with strong practical benefits, but no code/weights available. Addresses important problem but limited by missing artifacts.","code_url":null,"s2_tldr":"ReHear is a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop, allowing it to recover phonetically accurate transcripts even from severe recognition errors.","s2_paper_id":"b68fd339dd91994befd9984230200e93c43da487","topics":"[\"Language Models\", \"Speech / Audio\"]"},{"id":175,"run_id":1,"domain":"aiml","arxiv_id":"2602.18700","entry_id":"","title":"Watermarking LLM Agent Trajectories","authors":"[\"Wenlong Meng\", \"Chen Gong\", \"Terry Yue Zhuo\", \"Fan Zhang\", \"Kecen Li\", \"Zheng Liu\", \"Zhou Yang\", \"Chengkun Wei\", \"Wenzhi Chen\"]","abstract":"LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering. Despite the high cost of creating these datasets, existing literature has overlooked copyright protection for LLM agent trajectories. This gap leaves creators vulnerable to data theft and makes it difficult to trace misuse or enforce ownership rights. This paper introduces ActHook, the fir","published":"2026-02-21T03:12:29+00:00","categories":"[\"cs.CR\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18700v1","arxiv_url":"http://arxiv.org/abs/2602.18700v1","comment":"20 pages, 9 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"Presents ActHook, first watermarking method for LLM agent trajectory datasets. Embeds hook actions activated by secret key without altering task outcomes. Achieves 94.3 detection AUC on Qwen-2.5-Coder-7B with negligible performance degradation across math, web search, and software engineering agents.","reasoning":"Novel application of watermarking to agent trajectories addressing important IP protection problem, but no code released. 
Creative approach but limited immediate applicability.","code_url":null,"s2_tldr":"ActHook is introduced, the first watermarking method tailored for agent trajectory datasets that embeds hook actions that are activated by a secret input key and do not alter the original task outcome, enabling reliable black-box detection.","s2_paper_id":"b01c6b129af19e8b65cc37cd2b84b83fe9953d41","topics":"[\"Language Models\", \"Agents\", \"Benchmark\"]"},{"id":176,"run_id":1,"domain":"aiml","arxiv_id":"2602.18633","entry_id":"","title":"DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning","authors":"[\"Fangyuan Xu\", \"Sihao Chen\", \"Zinan Lin\", \"Taiwei Shi\", \"Sydney Graham\", \"Pei Zhou\", \"Mengting Wan\", \"Alex Stein\", \"Virginia Estellers\", \"Charles Chen\"]","abstract":"Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data, where data owners cannot provide eyes-on access to individual examples. Generating DP synthetic data typically involves a difficult trade-off. On one hand, DP finetuning methods train an LLM as a synthetic data generator with formal privacy guarantees, yet it still requires the raw content of private examples for model training. However, methods that avoid direct","published":"2026-02-20T22:03:56+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18633v1","arxiv_url":"http://arxiv.org/abs/2602.18633v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"Introduces DP-RFT, online RL algorithm for differentially private synthetic text generation using DP-protected nearest-neighbor votes as reward signal. Leverages PPO to train LLM without eyes-on access to individual private examples, closing gap between private evolution and DP finetuning for news, transcripts, medical abstracts.","reasoning":"Novel application of RL to privacy-preserving synthetic data generation addressing important problem, but no code available. Strong technical contribution but limited by missing implementation.","code_url":null,"s2_tldr":"The experiments show that DP-RFT closes the gap between private evolution and DP finetuning methods in terms of the fidelity and downstream utility of the generated synthetic data, while respecting the private data boundary.","s2_paper_id":"ada37a8f925827ed5498f1c4d84dce8f28d3526c","topics":"[\"Language Models\"]"},{"id":177,"run_id":1,"domain":"aiml","arxiv_id":"2602.18613","entry_id":"","title":"Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools","authors":"[\"Baris Arat\", \"Emre Sefer\"]","abstract":"Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever. This setup couples ranking behavior with retrieval quality, so differences in output cannot be attributed to the ranking policy alone. We introduce a controlled diagnostic that isolates reranking by using Multi-News clusters as fixed evidence pools. We limit each pool to exactly eight documents and pass identical inputs to all rankers. 
Within this setup, BM25 and MMR serve as interpretable re","published":"2026-02-20T21:07:32+00:00","categories":"[\"cs.LG\", \"cs.CL\", \"cs.IR\"]","pdf_url":"https://arxiv.org/pdf/2602.18613v1","arxiv_url":"http://arxiv.org/abs/2602.18613v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Introduces a diagnostic framework for evaluating LLM rerankers using fixed evidence pools from Multi-News clusters, isolating ranking behavior from retrieval quality. Shows LLMs diverge from both BM25 and MMR baselines with varying redundancy patterns, revealing fundamental limitations in lexical coverage at small budgets.","reasoning":"Novel diagnostic methodology but lacks code/weights. Practical for understanding reranker behavior, though incremental in advancing SOTA. Model-agnostic approach is valuable.","code_url":null,"s2_tldr":"A controlled diagnostic is introduced that isolates reranking by using Multi-News clusters as fixed evidence pools and is model-agnostic and applicable to any ranker, including open source systems and proprietary APIs.","s2_paper_id":"ed0e05630b4cd0564c020e539283a33e7cd9af82","topics":"[\"Language Models\", \"Retrieval / RAG\", \"Benchmark\"]"},{"id":178,"run_id":1,"domain":"aiml","arxiv_id":"2602.18425","entry_id":"","title":"RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering","authors":"[\"Deniz Qian\", \"Hung-Ting Chen\", \"Eunsol Choi\"]","abstract":"Comprehensively retrieving diverse documents is crucial to address queries that admit a wide range of valid answers. We introduce retrieve-verify-retrieve (RVR), a multi-round retrieval framework designed to maximize answer coverage. Initially, a retriever takes the original query and returns a candidate document set, followed by a verifier that identifies a high-quality subset. For subsequent rounds, the query is augmented with previously verified documents to uncover answers that are not yet c","published":"2026-02-20T18:48:05+00:00","categories":"[\"cs.CL\", \"cs.IR\"]","pdf_url":"https://arxiv.org/pdf/2602.18425v1","arxiv_url":"http://arxiv.org/abs/2602.18425v1","comment":"18 pages, 12 figures, 12 tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Presents RVR (Retrieve-Verify-Retrieve), a multi-round retrieval framework for comprehensive answer coverage using iterative query augmentation with verified documents. Achieves 10%+ relative gains in complete recall on QAMPARI and consistent improvements on QUEST/WebQuestionsSP across different retrievers.","reasoning":"Solid iterative retrieval methodology with strong empirical results. No code/weights provided. 
Practical for multi-answer search scenarios but incremental architecture.","code_url":null,"s2_tldr":"This work presents a promising iterative approach for comprehensive answer recall leveraging a verifier and adapting retrievers to a new inference scenario, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).","s2_paper_id":"d577302822c8ca6fa0738fe5593aa45e451c2212","topics":"[\"Retrieval / RAG\"]"},{"id":180,"run_id":1,"domain":"aiml","arxiv_id":"2602.18297","entry_id":"","title":"Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory","authors":"[\"Usman Anwar\", \"Tim Bakker\", \"Dana Kianfar\", \"Cristina Pinneri\", \"Christos Louizos\"]","abstract":"Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation. In this paper, we use information-theoretic analysis to show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability. We identify two sources of approximation error that may undermine the performance of CoT monitors in practice: informa","published":"2026-02-20T15:50:30+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\", \"cs.IT\"]","pdf_url":"https://arxiv.org/pdf/2602.18297v1","arxiv_url":"http://arxiv.org/abs/2602.18297v1","comment":"First two authors contributed equally","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"Uses information theory to show non-zero mutual information between CoT and output is necessary but insufficient for monitorability. Proposes oracle-based and label-free training approaches to improve CoT monitor accuracy while preventing degeneration across multiple environments.","reasoning":"Strong theoretical framework with practical training methods. No code/weights. Novel information-theoretic analysis of CoT monitoring with validation.","code_url":null,"s2_tldr":"Information-theoretic analysis is used to show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability, and proposes two complementary approaches that significantly improve monitor accuracy while preventing CoT degeneration even when training against a monitor.","s2_paper_id":"e310eb83a6b8780e75ff04d989002491aeae8785","topics":"[\"Reasoning\", \"Language Models\", \"Code\"]"},{"id":181,"run_id":1,"domain":"aiml","arxiv_id":"2602.18145","entry_id":"","title":"Detecting Contextual Hallucinations in LLMs with Frequency-Aware Attention","authors":"[\"Siya Qi\", \"Yudong Chen\", \"Runcong Zhao\", \"Qinglin Zhu\", \"Zhanghao Hu\", \"Wei Liu\", \"Yulan He\", \"Zheng Yuan\", \"Lin Gui\"]","abstract":"Hallucination detection is critical for ensuring the reliability of large language models (LLMs) in context-based generation. Prior work has explored intrinsic signals available during generation, among which attention offers a direct view of grounding behavior. However, existing approaches typically rely on coarse summaries that fail to capture fine-grained instabilities in attention. 
Inspired by signal processing, we introduce a frequency-aware perspective on attention by analyzing its variati","published":"2026-02-20T11:18:45+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18145v1","arxiv_url":"http://arxiv.org/abs/2602.18145v1","comment":"25 pages, 10 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Introduces frequency-aware attention analysis for hallucination detection in LLMs by treating attention distributions as discrete signals and extracting high-frequency components. Shows that hallucinated tokens exhibit high-frequency attention energy reflecting unstable grounding, enabling lightweight detection that outperforms existing methods on RAGTruth and HalluRAG benchmarks.","reasoning":"Novel signal processing perspective on attention for hallucination detection, practically useful for RAG systems. No code/weights available limits immediate adoption.","code_url":null,"s2_tldr":"This work develops a lightweight hallucination detector using high-frequency attention features that achieves performance gains over verification-based, internal-representation-based, and attention-based methods across models and tasks.","s2_paper_id":"96bfb3821f0d8088b160a230af89d01209758e6c","topics":"[\"Language Models\"]"},{"id":182,"run_id":1,"domain":"aiml","arxiv_id":"2602.17867","entry_id":"","title":"ADAPT: Hybrid Prompt Optimization for LLM Feature Visualization","authors":"[\"Jo\\u00e3o N. Cardoso\", \"Arlindo L. Oliveira\", \"Bruno Martins\"]","abstract":"Understanding what features are encoded by learned directions in LLM activation space requires identifying inputs that strongly activate them. Feature visualization, which optimizes inputs to maximally activate a target direction, offers an alternative to costly dataset search approaches, but remains underexplored for LLMs due to the discrete nature of text. Furthermore, existing prompt optimization techniques are poorly suited to this domain, which is highly prone to local minima. To overcome t","published":"2026-02-19T22:03:25+00:00","categories":"[\"cs.LG\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17867v1","arxiv_url":"http://arxiv.org/abs/2602.17867v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Introduces ADAPT, a hybrid feature visualization method for LLMs combining beam search initialization with adaptive gradient-guided mutation. Designed to overcome discrete text optimization challenges and local minima, consistently outperforms prior methods on Sparse Autoencoder latents from Gemma 2 2B.","reasoning":"Novel approach to feature visualization for LLMs addressing real limitations. Practically useful for interpretability. 
No code/weights shared.","code_url":null,"s2_tldr":"This work introduces ADAPT, a hybrid method combining beam search initialization with adaptive gradient-guided mutation, designed around these failure modes, and establishes that feature visualization for LLMs is tractable, but requires design assumptions tailored to the domain.","s2_paper_id":"3954ec7e8afdf1abac5e71ab75ef57dc7bb3a50b","topics":"[\"Language Models\", \"Optimization\", \"Benchmark\"]"},{"id":183,"run_id":1,"domain":"aiml","arxiv_id":"2602.17837","entry_id":"","title":"TFL: Targeted Bit-Flip Attack on Large Language Model","authors":"[\"Jingkai Guo\", \"Chaitali Chakrabarti\", \"Deliang Fan\"]","abstract":"Large language models (LLMs) are increasingly deployed in safety and security critical applications, raising concerns about their robustness to model parameter fault injection attacks. Recent studies have shown that bit-flip attacks (BFAs), which exploit computer main memory (i.e., DRAM) vulnerabilities to flip a small number of bits in model weights, can severely disrupt LLM behavior. However, existing BFA on LLM largely induce un-targeted failure or general performance degradation, offering li","published":"2026-02-19T20:59:47+00:00","categories":"[\"cs.CR\", \"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.17837v1","arxiv_url":"http://arxiv.org/abs/2602.17837v1","comment":"13 pages, 11 figures. Preprint","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"Presents TFL, a novel targeted bit-flip attack framework enabling precise manipulation of LLM outputs for selected prompts with <50 bit flips while maintaining performance on unrelated inputs. Uses keyword-focused attack loss and auxiliary utility score, demonstrating effectiveness across multiple LLMs and benchmarks.","reasoning":"Novel security research with important implications for deployed LLMs. Demonstrates new attack class but no defensive code shared.","code_url":null,"s2_tldr":"The experiments show that TFL achieves successful targeted LLM output manipulations with less than 50 bit flips and significantly reduced effect on unrelated queries compared to prior BFA approaches, demonstrating the effectiveness of TFL and positions it as a new class of stealthy and targeted LLM model attack.","s2_paper_id":"aa355b1b1a1d43c18e21036a2df307edc000b17f","topics":"[\"Language Models\"]"},{"id":184,"run_id":1,"domain":"aiml","arxiv_id":"2602.17655","entry_id":"","title":"What Language is This? Ask Your Tokenizer","authors":"[\"Clara Meister\", \"Ahmetcan Yavuz\", \"Pietro Lesci\", \"Tiago Pimentel\"]","abstract":"Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. 
We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its pr","published":"2026-02-19T18:58:39+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17655v1","arxiv_url":"http://arxiv.org/abs/2602.17655v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"UniLID: efficient language identification method based on UnigramLM tokenization algorithm treating segmentation as language-specific. Supports incremental language addition without retraining, achieves >70% accuracy with 5 samples per language in low-resource settings, and delivers large gains on dialect identification versus fastText/GlotLID/CLD3.","reasoning":"Novel approach to LID with strong low-resource performance and practical integration path. Solid empirical results but no code/weights shared yet.","code_url":null,"s2_tldr":"UniLID is introduced, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy, which supports incremental addition of new languages without retraining existing models and can naturally be integrated into existing language model tokenization pipelines.","s2_paper_id":"b4fcd117eca0035b3b62cb41eff62ff444ccc961","topics":"[\"Language Models\", \"Efficiency\", \"Benchmark\"]"},{"id":185,"run_id":1,"domain":"aiml","arxiv_id":"2602.17598","entry_id":"","title":"The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\\rightarrow$LLM Pipelines?","authors":"[\"Jayadev Billa\"]","abstract":"Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($\u03ba{=}0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text represen","published":"2026-02-19T18:22:39+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"eess.AS\"]","pdf_url":"https://arxiv.org/pdf/2602.17598v1","arxiv_url":"http://arxiv.org/abs/2602.17598v1","comment":"10 pages, 6 figures, 7 tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"This paper demonstrates that current speech LLMs largely perform implicit ASR, behaving equivalently to Whisper\u2192LLM cascades on transcript-solvable tasks. Using logit lens and LEACE concept erasure, the authors show text representations are causally necessary, revealing most speech LLMs are expensive cascades that perform worse under noise.","reasoning":"Novel mechanistic analysis revealing fundamental architectural limitations, but no code/weights or new models. 
Important practical insights for speech LLM deployment.","code_url":null,"s2_tldr":"Matched-backbone testing across four speech LLMs and six tasks shows cascade equivalence is architecture-dependent, not universal, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.","s2_paper_id":"ccc3aab3163ce60c24c92777f66203bbfb1e1529","topics":"[\"Language Models\", \"Speech / Audio\"]"},{"id":186,"run_id":1,"domain":"aiml","arxiv_id":"2602.17547","entry_id":"","title":"KLong: Training LLM Agent for Extremely Long-horizon Tasks","authors":"[\"Yue Liu\", \"Zhiyuan Hu\", \"Flood Sung\", \"Jiaheng Zhang\", \"Bryan Hooi\"]","abstract":"This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using ","published":"2026-02-19T17:01:08+00:00","categories":"[\"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17547v1","arxiv_url":"http://arxiv.org/abs/2602.17547v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"KLong trains LLM agents for extremely long-horizon tasks using trajectory-splitting SFT and progressive RL. The 106B model surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench and generalizes well to coding benchmarks like SWE-bench Verified and MLE-bench.","reasoning":"Strong practical results but no mention of open weights/code. Novel training methodology (trajectory-splitting, progressive RL) with good benchmark performance.","code_url":null,"s2_tldr":"This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks, and proposes a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories.","s2_paper_id":"53025935a9d9d3100b7c3d333ccc6efe976e3c0c","topics":"[\"Language Models\", \"Agents\", \"Benchmark\"]"},{"id":187,"run_id":1,"domain":"aiml","arxiv_id":"2602.17483","entry_id":"","title":"What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data","authors":"[\"Dimitri Staufer\", \"Kirsten Morehouse\"]","abstract":"Large language models (LLMs), and conversational agents based on them, are exposed to personal data (PD) during pre-training and during user interactions. Prior work shows that PD can resurface, yet users lack insight into how strongly models associate specific information to their identity. 
We audit PD across eight LLMs (3 open-source; 5 API-based, including GPT-4o), introduce LMP2 (Language Model Privacy Probe), a human-centered, privacy-preserving audit tool refined through two formative stud","published":"2026-02-19T15:53:29+00:00","categories":"[\"cs.HC\", \"cs.AI\", \"cs.CL\", \"cs.CY\"]","pdf_url":"https://arxiv.org/pdf/2602.17483v1","arxiv_url":"http://arxiv.org/abs/2602.17483v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"LMP2 is a privacy-preserving audit tool that reveals LLMs confidently generate personal data for well-known individuals and 11 features with 60%+ accuracy for everyday users. Study of EU residents shows 72% seek control over model-generated associations, raising questions about privacy rights for LLM-generated data.","reasoning":"Important privacy research with practical implications, but no code/weights. Moderate novelty in audit methodology, high practical relevance for AI governance.","code_url":null,"s2_tldr":"It is demonstrated empirically that models confidently generate multiple PD categories for well-known individuals, and introduced LMP2 (Language Model Privacy Probe), a human-centered, privacy-preserving audit tool refined through two formative studies.","s2_paper_id":"9bb78e15e373593d32f89cfa3241b5a574f8a5ff","topics":"[\"Language Models\"]"},{"id":188,"run_id":1,"domain":"aiml","arxiv_id":"2602.17445","entry_id":"","title":"ABCD: All Biases Come Disguised","authors":"[\"Mateusz Nowak\", \"Xavier Cadet\", \"Peter Chin\"]","abstract":"Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model either uses the answer position, the label in front of the answer, the distributions of correct answers present in the few-shot prompt, or a combination of all to answer each MCQ question.","published":"2026-02-19T15:12:33+00:00","categories":"[\"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.17445v1","arxiv_url":"http://arxiv.org/abs/2602.17445v1","comment":"29 pages, 20 figures, pre-print, 12 tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Proposes bias-reduced evaluation protocol for MCQ benchmarks that replaces answer labels with uniform ones and uses sentence similarity for matching. Reduces mean accuracy variance 3\u00d7 with minimal performance drop, exposing genuine LLM capabilities under reduced evaluation artifacts across multiple benchmarks.","reasoning":"Practical evaluation improvement, no code/weights mentioned. 
Moderate novelty in addressing evaluation bias, high practical value for fair benchmarking.","code_url":null,"s2_tldr":"A simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented is proposed, which substantially improves the robustness to answer permutations.","s2_paper_id":"02db335d036be2116fb88116ec56208a8416a848","topics":"[\"Retrieval / RAG\", \"Benchmark\"]"},{"id":190,"run_id":1,"domain":"aiml","arxiv_id":"2602.17413","entry_id":"","title":"DAVE: A Policy-Enforcing LLM Spokesperson for Secure Multi-Document Data Sharing","authors":"[\"Ren\\u00e9 Brinkhege\", \"Prahlad Menon\"]","abstract":"In current inter-organizational data spaces, usage policies are enforced mainly at the asset level: a whole document or dataset is either shared or withheld. When only parts of a document are sensitive, providers who want to avoid leaking protected information typically must manually redact documents before sharing them, which is costly, coarse-grained, and hard to maintain as policies or partners change. We present DAVE, a usage policy-enforcing LLM spokesperson that answers questions over priv","published":"2026-02-19T14:43:48+00:00","categories":"[\"cs.CR\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17413v1","arxiv_url":"http://arxiv.org/abs/2602.17413v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"DAVE proposes LLM spokesperson architecture for policy-enforcing multi-document data sharing, introducing virtual redaction to suppress sensitive information at query time. Describes integration with Eclipse Dataspace Components and ODRL policies, though full enforcement pipeline not yet implemented.","reasoning":"Novel architectural contribution for secure data sharing, but primarily conceptual (no full implementation). Practical relevance for data governance scenarios.","code_url":null,"s2_tldr":"An evaluation methodology to assess security, utility, and performance trade-offs under benign and adversarial querying as a basis for future empirical work on systematically governed LLM access to multi-party data spaces is outlined.","s2_paper_id":"b21b4dcd6dfabe2f2b5306f67fe2a907793314ec","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":191,"run_id":1,"domain":"aiml","arxiv_id":"2602.17283","entry_id":"","title":"Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective","authors":"[\"Yukun Chen\", \"Xinyu Zhang\", \"Jialong Tang\", \"Yu Wan\", \"Baosong Yang\", \"Yiming Li\", \"Zhan Qin\", \"Kui Ren\"]","abstract":"While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital content. To bridge this gap, we introduce X-Value, a novel Cross-lingual Values Assessment Benchmark designed to evaluate LLMs' ability to assess deep-level values of content from a global perspective. 
X-Value consists of more than 5,000 QA pairs across 18 lan","published":"2026-02-19T11:41:34+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.17283v1","arxiv_url":"http://arxiv.org/abs/2602.17283v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"Introduces X-Value, a cross-lingual values assessment benchmark with 5,000+ QA pairs across 18 languages and 7 value domains. Reveals SOTA LLMs struggle with nuanced value assessment (<77% accuracy) with significant cross-lingual disparities (>20% gap), highlighting need for values-aware content moderation.","reasoning":"Good code_and_weights with dataset on HuggingFace. Moderate novelty in applying Schwartz's value theory to LLM evaluation. Strong practical applicability for content safety and cross-cultural AI deployment.","code_url":null,"s2_tldr":"X-Value, a novel Cross-lingual Values Assessment Benchmark designed to evaluate LLMs'ability to assess deep-level values of content from a global perspective, is introduced and a unique two-stage annotation framework is proposed that first identifies whether an issue falls under global consensus or pluralism, and subsequently conducts a multi-party evaluation of the latent values embedded within the content.","s2_paper_id":"22452c440c561bf4b52b768c8df1a7c10f50be19","topics":"[\"Language Models\", \"Speech / Audio\", \"Benchmark\"]"},{"id":192,"run_id":1,"domain":"aiml","arxiv_id":"2602.17063","entry_id":"","title":"Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression","authors":"[\"Akira Sakai\", \"Yuma Ichikawa\"]","abstract":"Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness ","published":"2026-02-19T04:10:05+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.17063v1","arxiv_url":"http://arxiv.org/abs/2602.17063v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"Identifies 'sign lock-in' phenomenon where weight signs resist change during training, bottlenecking sub-bit compression. Proposes sign lock-in theory with stopping-time analysis and introduces gap-based initialization plus outward-drift regularizer reducing flip rate to 10^-3 with minimal perplexity increase.","reasoning":"Low code_and_weights (no code/weights mentioned). Strong novelty in identifying fundamental compression bottleneck with theoretical analysis. 
Good practical applicability for model compression efforts.","code_url":null,"s2_tldr":"This work formalizes sign lock-in theory, a stopping-time analysis of sign flips under SGD noise, and introduces a gap-based initialization and a lightweight outward-drift regularizer, reducing the effective flip rate to approximately $10^{-3}$ with only about a one-point increase in perplexity.","s2_paper_id":"b57e3b20264749d2f385de1d48818682b46fe884","topics":"[\"Efficiency\", \"Architecture\"]"},{"id":193,"run_id":1,"domain":"aiml","arxiv_id":"2602.17045","entry_id":"","title":"Large Language Models Persuade Without Planning Theory of Mind","authors":"[\"Jared Moore\", \"Rasmus Overmark\", \"Ned Cooper\", \"Beba Cibralic\", \"Nick Haber\", \"Cameron R. Jones\"]","abstract":"A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks. However, theoretical work in the field suggests that first-personal interaction is a crucial part of ToM and that such predictive, spectatorial tasks may fail to evaluate it. We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategicall","published":"2026-02-19T03:31:31+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17045v1","arxiv_url":"http://arxiv.org/abs/2602.17045v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":6.0,"composite":5.15,"summary":"Introduces interactive ToM task requiring strategic information revelation for persuasion. LLMs excel at persuasion without explicit ToM planning but struggle with multi-step reasoning in hidden mental state conditions, while humans perform moderately across conditions.","reasoning":"Low code_and_weights (no code/weights mentioned). Strong novelty in interactive ToM evaluation paradigm. Good practical applicability for understanding persuasive AI capabilities and risks.","code_url":null,"s2_tldr":"It is suggested that effective persuasion can occur without explicit ToM reasoning (e.g., through rhetorical strategies) and that LLMs excel at this form of persuasion.","s2_paper_id":"169e8411077438dbb1f64a24da07000cf9f625a3","topics":"[\"Language Models\", \"Agents\", \"Benchmark\"]"},{"id":194,"run_id":1,"domain":"aiml","arxiv_id":"2602.17022","entry_id":"","title":"ReIn: Conversational Error Recovery with Reasoning Inception","authors":"[\"Takyoung Kim\", \"Jinseok Nam\", \"Chandrayee Basu\", \"Xing Fan\", \"Chengyuan Ma\", \"Heng Ji\", \"Gokhan Tur\", \"Dilek Hakkani-T\\u00fcr\"]","abstract":"Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses on error recovery, which necessitates the accurate diagnosis of erroneous dialogue contexts and execution of proper recovery plans. 
Under realistic constraints precluding model fine-tuning or prompt modification due to signific","published":"2026-02-19T02:37:29+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.17022v1","arxiv_url":"http://arxiv.org/abs/2602.17022v1","comment":"ICLR 2026","source":"both","github_repo":"","github_stars":null,"hf_upvotes":1,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":7.0,"composite":5.15,"summary":"ReIn (Reasoning Inception) enables test-time error recovery in conversational agents without fine-tuning or prompt modification by planting recovery reasoning. Consistently outperforms prompt-modification approaches across diverse agent/inception module combinations, improving task success on ambiguous/unsupported user requests.","reasoning":"Moderate code_and_weights (ICLR acceptance suggests eventual release). Moderate novelty in test-time intervention approach. Strong practical applicability for production conversational systems.","code_url":null,"s2_tldr":"Reasoning Inception is proposed, a test-time intervention method that plants an initial reasoning into the agent's decision-making process and generates recovery plans, which are subsequently integrated into the agent's internal reasoning process to guide corrective actions, without modifying its parameters or system prompts.","s2_paper_id":"0f849aeba593ee4465fc5433f91a5a5503be166d","topics":"[\"Reasoning\", \"Language Models\", \"Benchmark\"]"},{"id":399,"run_id":1,"domain":"aiml","arxiv_id":"2602.19863","entry_id":"","title":"Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation","authors":"[\"Filip Wolf\", \"Bla\\u017e Rolih\", \"Luka \\u010cehovin Zajc\"]","abstract":"Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastiv","published":"2026-02-23T14:09:01+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19863v2","arxiv_url":"http://arxiv.org/abs/2602.19863v2","comment":"Accepted to CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":5.0,"score_axis_3":3.0,"composite":4.9,"summary":"Proposes dual-teacher distillation framework combining multispectral and optical vision foundation models for Earth Observation. Achieves SOTA on optical and multispectral benchmarks with improvements across segmentation (+3.64pp), change detection (+1.2pp), and classification (+1.31pp).","reasoning":"Remote sensing application with no code/weights available. 
Incremental method applying existing distillation techniques to EO domain.","code_url":"https://wolfilip.github.io/DEO/","s2_tldr":"This work proposes a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student's pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs).","s2_paper_id":"8568d64d74bd8f5062f4ddc176033e501b59d826","topics":"[\"Efficiency\", \"Language Models\"]"},{"id":195,"run_id":1,"domain":"aiml","arxiv_id":"2602.21188","entry_id":"","title":"Human Video Generation from a Single Image with 3D Pose and View Control","authors":"[\"Tiantian Wang\", \"Chun-Han Yao\", \"Tao Hu\", \"Mallikarjun Byrasandra Ramalinga Reddy\", \"Ming-Hsuan Yang\", \"Varun Jampani\"]","abstract":"Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-","published":"2026-02-24T18:42:20+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21188v1","arxiv_url":"http://arxiv.org/abs/2602.21188v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"HVG is a latent video diffusion model for generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. Features articulated pose modulation and progressive spatio-temporal sampling to maintain consistency in long animations.","reasoning":"Solid technical contribution for human video generation, but no code/weights provided. Application-specific (human video generation) with incremental novelty over existing video diffusion methods.","code_url":null,"s2_tldr":"HVG is presented, a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control and outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.","s2_paper_id":"050f3d8b4a3414e3f925acd0feb990187d75de0c","topics":"[\"Video Generation\", \"3D / Vision\"]"},{"id":196,"run_id":1,"domain":"aiml","arxiv_id":"2602.21141","entry_id":"","title":"SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object Perception","authors":"[\"Jose Moises Araya-Martinez\", \"Thushar Tom\", \"Adri\\u00e1n Sanchis Reig\", \"Pablo Rey Valiente\", \"Jens Lambrecht\", \"J\\u00f6rg Kr\\u00fcger\"]","abstract":"Object perception is fundamental for tasks such as robotic material handling and quality inspection. However, modern supervised deep-learning perception models require large datasets for robust automation under semi-uncontrolled conditions. The cost of acquiring and annotating such data for proprietary parts is a major barrier for widespread deployment. In this context, we release SynthRender, an open source framework for synthetic image generation with Guided Domain Randomization capabilities. 
","published":"2026-02-24T17:42:34+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21141v1","arxiv_url":"http://arxiv.org/abs/2602.21141v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":4.8,"summary":"SynthRender is an open-source framework for synthetic image generation with Guided Domain Randomization for industrial object perception. Includes IRIS dataset with 32 categories and 20K labels, achieving 99.1% mAP@50 on robotics datasets through sim-to-real transfer.","reasoning":"Practical framework for industrial applications with strong empirical results. Novelty is incremental (domain randomization techniques), but open-source framework and dataset are valuable.","code_url":null,"s2_tldr":null,"s2_paper_id":"7d9d8f49ecc464bc2dfc86193d18e7e4a0943898","topics":"[\"Benchmark\", \"Image Generation\", \"Robotics\"]"},{"id":197,"run_id":1,"domain":"aiml","arxiv_id":"2602.21105","entry_id":"","title":"BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting","authors":"[\"Jiaxing Yu\", \"Dongyang Ren\", \"Hangyu Xu\", \"Zhouyuxiao Yang\", \"Yuanqi Li\", \"Jie Guo\", \"Zhengkang Zhou\", \"Yanwen Guo\"]","abstract":"The boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering B-rep representation from unstructured data is a challenging and valuable task of computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that le","published":"2026-02-24T17:03:45+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21105v1","arxiv_url":"http://arxiv.org/abs/2602.21105v1","comment":"Accepted to CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"BrepGaussian recovers boundary representation (B-rep) models from multi-view 2D images using Gaussian Splatting with learnable features. Two-stage learning framework first captures geometry and edges, then refines patch features to achieve clean geometry and coherent instance representations for CAD reconstruction.","reasoning":"Novel application of Gaussian Splatting to B-rep recovery. Accepted to CVPR 2026 but no code/weights provided yet. 
Practical for CAD/manufacturing applications.","code_url":null,"s2_tldr":"This work proposes B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations.","s2_paper_id":"bd0e9fee182c70e85c8a881c1158752e41cf4c63","topics":"[\"3D / Vision\"]"},{"id":198,"run_id":1,"domain":"aiml","arxiv_id":"2602.20989","entry_id":"","title":"Cycle-Consistent Tuning for Layered Image Decomposition","authors":"[\"Zheng Gu\", \"Min Lu\", \"Zhida Sun\", \"Dani Lischinski\", \"Daniel Cohen-Or\", \"Hui Huang\"]","abstract":"Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layered separation. We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on wh","published":"2026-02-24T15:10:31+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20989v1","arxiv_url":"http://arxiv.org/abs/2602.20989v1","comment":"Accepted to CVPR 2026. Project page: https://vcc.tech/research/2026/ImgDecom","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Presents in-context image decomposition framework leveraging diffusion models for layered separation tasks like logo-object decomposition. Uses cycle-consistent LoRA tuning with bidirectional supervision and progressive self-improvement. Demonstrates accurate decomposition across multiple layer types without requiring paired training data.","reasoning":"Moderate novelty in applying cycle-consistency to diffusion-based decomposition. Project page mentioned but no explicit code/weights commitment. Niche application domain.","code_url":null,"s2_tldr":"This work presents an in-context image decomposition framework that leverages large diffusion foundation models for layered separation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images.","s2_paper_id":"fc5e011e5348209d9fe21858d3ec45c705b96531","topics":"[\"Language Models\"]"},{"id":199,"run_id":1,"domain":"aiml","arxiv_id":"2602.20911","entry_id":"","title":"From Isolation to Integration: Building an Adaptive Expert Forest for Pre-Trained Model-based Class-Incremental Learning","authors":"[\"Ruiqi Liu\", \"Boyu Diao\", \"Hangda Liu\", \"Zhulin An\", \"Fei Wang\", \"Yongjun Xu\"]","abstract":"Class-Incremental Learning (CIL) requires models to learn new classes without forgetting old ones. A common method is to freeze a pre-trained model and train a new, lightweight adapter for each task. While this prevents forgetting, it treats the learned knowledge as a simple, unstructured collection and fails to use the relationships between tasks. 
To this end, we propose the Semantic-guided Adaptive Expert Forest (SAEF), a new method that organizes adapters into a structured hierarchy for bette","published":"2026-02-24T13:48:13+00:00","categories":"[\"cs.LG\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20911v1","arxiv_url":"http://arxiv.org/abs/2602.20911v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"SAEF organizes class-incremental learning adapters into semantic-guided hierarchical forest structure for better knowledge sharing. Groups tasks into conceptual clusters and builds balanced expert trees through adapter merging. Uses confidence-weighted expert activation for inference, achieving SOTA on multiple CIL benchmarks.","reasoning":"Interesting adapter organization approach but no code/weights. Incremental improvement in CIL methodology with moderate practical impact.","code_url":null,"s2_tldr":"The Semantic-guided Adaptive Expert Forest (SAEF) is proposed, a new method that organizes adapters into a structured hierarchy for better knowledge sharing and achieves SOTA performance.","s2_paper_id":"a852cd57f62061262e94e90aabf1531f1af91b69","topics":"[]"},{"id":200,"run_id":1,"domain":"aiml","arxiv_id":"2602.20725","entry_id":"","title":"Bridging Physically Based Rendering and Diffusion Models with Stochastic Differential Equation","authors":"[\"Junwei Shu\", \"Wenjie Liu\", \"Changgu Chen\", \"Hantang Liu\", \"Yang Li\", \"Changbo Wang\"]","abstract":"Diffusion-based image generators excel at producing realistic content from text or image conditions, but they offer only limited explicit control over low-level, physically grounded shading and material properties. In contrast, physically based rendering (PBR) offers fine-grained physical control but lacks prompt-driven flexibility. Although these two paradigms originate from distinct communities, both share a common evolution -- from noisy observations to clean images. In this paper, we propose","published":"2026-02-24T09:44:12+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20725v1","arxiv_url":"http://arxiv.org/abs/2602.20725v1","comment":"preprint","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Bridges physically based rendering (PBR) and diffusion models through unified stochastic differential equation formulation. Enables physically grounded control over shading and materials in diffusion-generated images via path tracing-inspired noise modeling.","reasoning":"No code/weights. Interesting theoretical unification but incremental practical impact. 
Niche application area (graphics/rendering control).","code_url":null,"s2_tldr":"This paper proposes a unified stochastic formulation that bridges Monte Carlo rendering and diffusion-based generative modeling and provides a systematic analysis of how the physical characteristics of path tracing can be extended to existing diffusion models from the perspective of noise variance.","s2_paper_id":"e4bc5250dde59247f861b17c9c7e0769f26c3041","topics":"[\"Image Generation\"]"},{"id":201,"run_id":1,"domain":"aiml","arxiv_id":"2602.20666","entry_id":"","title":"BoxSplitGen: A Generative Model for 3D Part Bounding Boxes in Varying Granularity","authors":"[\"Juil Koo\", \"Wei-Tung Lin\", \"Chanho Park\", \"Chanhyeok Park\", \"Minhyuk Sung\"]","abstract":"Human creativity follows a perceptual process, moving from abstract ideas to finer details during creation. While 3D generative models have advanced dramatically, models specifically designed to assist human imagination in 3D creation -- particularly for detailing abstractions from coarse to fine -- have not been explored. We propose a framework that enables intuitive and interactive 3D shape generation by iteratively splitting bounding boxes to refine the set of bounding boxes. The main technic","published":"2026-02-24T08:15:25+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20666v1","arxiv_url":"http://arxiv.org/abs/2602.20666v1","comment":"Project page: https://boxsplitgen.github.io","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"BoxSplitGen generates 3D part bounding boxes with varying granularity by learning iterative box-splitting sequences. Combines box-splitting generative model with box-to-shape diffusion model for intuitive, coarse-to-fine 3D shape creation.","reasoning":"No code/weights. Novel iterative refinement approach for 3D generation. Moderate practical applicability, primarily useful for 3D modeling workflows.","code_url":null,"s2_tldr":"A framework that enables intuitive and interactive 3D shape generation by iteratively splitting bounding boxes to refine the set of bounding boxes and demonstrates that the box-splitting generative model outperforms token prediction models and the inpainting approach with an unconditional diffusion model.","s2_paper_id":"c3895411b3e87ed8c0e88e911fe5cf0f5c7f7f55","topics":"[\"3D / Vision\"]"},{"id":202,"run_id":1,"domain":"aiml","arxiv_id":"2602.20618","entry_id":"","title":"RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces","authors":"[\"Haonan An\", \"Xiaohui Ye\", \"Guang Hua\", \"Yihang Tao\", \"Hangcheng Cao\", \"Xiangyu Yu\", \"Yuguang Fang\"]","abstract":"The proliferation of AI-generated content has facilitated sophisticated face manipulation, severely undermining visual integrity and posing unprecedented challenges to intellectual property. In response, a common proactive defense leverages fragile watermarks to detect, localize, or even recover manipulated regions. However, these methods always assume an adversary unaware of the embedded watermark, overlooking their inherent vulnerability to watermark removal attacks. 
Furthermore, this fragilit","published":"2026-02-24T07:11:40+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20618v1","arxiv_url":"http://arxiv.org/abs/2602.20618v1","comment":"Accepted by CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"RecoverMark embeds protected face content into background for robust watermarking achieving simultaneous manipulation localization, recovery, and ownership verification. Uses two-stage training with comprehensive distortion simulation for robustness against removal attacks.","reasoning":"No code/weights. Novel application of content-as-watermark with practical security value. Domain-specific (watermarking/IP protection).","code_url":null,"s2_tldr":"RecoverMark is a watermarking framework that achieves robust manipulation localization, content recovery, and ownership verification simultaneously and exploits a critical real-world constraint: an adversary must preserve the background's semantic consistency to avoid visual detection, even if they apply global, imperceptible watermark removal attacks.","s2_paper_id":"b729b5bc30115fbb3678b021b2841f5dd7885d45","topics":"[\"Robotics\"]"},{"id":203,"run_id":1,"domain":"aiml","arxiv_id":"2602.20608","entry_id":"","title":"VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos","authors":"[\"Aihua Mao\", \"Kaihang Huang\", \"Yong-Jin Liu\", \"Chee Seng Chan\", \"Ying He\"]","abstract":"3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI), a capability essential to embodied visual reasoning. However, most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. As a result, they often struggle to localize the true contact regions involved in real interactions. We take a different perspective. Humans learn how to use objects by observing and imita","published":"2026-02-24T07:00:38+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20608v1","arxiv_url":"http://arxiv.org/abs/2602.20608v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"VAGNet introduces video-guided 3D affordance grounding, leveraging dynamic HOI sequences rather than static cues. Proposes PVAD dataset (first video-3D pairing for affordances) and framework aligning video-derived interaction cues with 3D structure.","reasoning":"No code/weights (promised public release). Novel video-guided approach and new dataset. Domain-specific (embodied AI) but solid contribution.","code_url":null,"s2_tldr":"VAGNet is proposed, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address, and PVAD is introduced, the first HOI video-3D pairing affordance dataset, providing functional supervision unavailable in prior works.","s2_paper_id":"9ba8f7c7a6e3b88d8d7374a7cdea7e5d546183ed","topics":"[\"3D / Vision\", \"Robotics\", \"Reasoning\"]"},{"id":205,"run_id":1,"domain":"aiml","arxiv_id":"2602.20549","entry_id":"","title":"Sample-efficient evidence estimation of score based priors for model selection","authors":"[\"Frederic Wang\", \"Katherine L. Bouman\"]","abstract":"The choice of prior is central to solving ill-posed imaging inverse problems, making it essential to select one consistent with the measurements $y$ to avoid severe bias. In Bayesian inverse problems, this could be achieved by evaluating the model evidence $p(y \\mid M)$ under different models $M$ that specify the prior and then selecting the one with the highest value. Diffusion models are the state-of-the-art approach to solving inverse problems with a data-driven prior; however, directly compu","published":"2026-02-24T05:06:46+00:00","categories":"[\"cs.LG\", \"cs.CV\", \"stat.ME\"]","pdf_url":"https://arxiv.org/pdf/2602.20549v1","arxiv_url":"http://arxiv.org/abs/2602.20549v1","comment":"ICLR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Proposes sample-efficient model evidence estimation for diffusion priors by integrating over time-marginals during posterior sampling. Enables Bayesian model selection for inverse problems using only ~20 posterior samples, demonstrated on black hole imaging.","reasoning":"No code/weights provided. Novel approach to intractable evidence computation for diffusion models. Specialized application (inverse problems, model selection) but methodologically interesting.","code_url":null,"s2_tldr":"The proposed estimator matches the model evidence when it can be computed analytically, and it is able to both select the correct diffusion model prior and diagnose prior misfit under different highly ill-conditioned, non-linear inverse problems, including a real-world black hole imaging problem.","s2_paper_id":"77b1801381ade83b6bef45c8230df109a140c4d1","topics":"[\"Efficiency\", \"Benchmark\"]"},{"id":207,"run_id":1,"domain":"aiml","arxiv_id":"2602.20328","entry_id":"","title":"GSNR: Graph Smooth Null-Space Representation for Inverse Problems","authors":"[\"Romario Gualdr\\u00f3n-Hurtado\", \"Roman Jacome\", \"Rafael S. Suarez\", \"Henry Arguello\"]","abstract":"Inverse problems in imaging are ill-posed, leading to infinitely many solutions consistent with the measurements due to the non-trivial null-space of the sensing matrix. Common image priors promote solutions on the general image manifold, such as sparsity, smoothness, or score function. However, as these priors do not constrain the null-space component, they can bias the reconstruction. Thus, we aim to incorporate meaningful null-space information in the reconstruction framework. Inspired by smo","published":"2026-02-23T20:24:00+00:00","categories":"[\"cs.CV\", \"eess.IV\", \"math.OC\"]","pdf_url":"https://arxiv.org/pdf/2602.20328v1","arxiv_url":"http://arxiv.org/abs/2602.20328v1","comment":"23 pages, 24 figures, Accepted to The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"GSNR proposes Graph-Smooth Null-Space Representation to constrain the null-space component in inverse imaging problems using graph Laplacians. The method improves convergence and achieves up to 4.3 dB PSNR gains over baselines across deblurring, compressed sensing, demosaicing, and super-resolution tasks.","reasoning":"No code available. Novel mathematical framework for inverse problems with solid theoretical contributions. 
Practical for imaging applications but requires implementation.","code_url":null,"s2_tldr":"GSNR is incorporated into well-known inverse problem solvers, e.g., PnP, DIP, and diffusion solvers, in four scenarios: image deblurring, compressed sensing, demosaicing, and image super-resolution, providing consistent improvement of up to 4.3 dB over baseline formulations and up to 1 dB compared with end-to-end learned models in terms of PSNR.","s2_paper_id":"d6f01b1b06f69787c4f1862dce74245990341634","topics":"[]"},{"id":208,"run_id":1,"domain":"aiml","arxiv_id":"2602.20150","entry_id":"","title":"Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization","authors":"[\"Wei-Cheng Huang\", \"Jiaheng Han\", \"Xiaohan Ye\", \"Zherong Pan\", \"Kris Hauser\"]","abstract":"Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. Regretfully, existing methods struggle in cluttered environments, often exhibiting prohibitive computational cost, poor robustness, and restricted generality when scaling to multiple interacting objects. We propose a unified optimization-based formulation for real-to-sim scene estimation that jointly recovers the shapes and poses of multiple rigid objects under physical c","published":"2026-02-23T18:58:24+00:00","categories":"[\"cs.RO\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20150v1","arxiv_url":"http://arxiv.org/abs/2602.20150v1","comment":"15 pages, 13 figures, in submission","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"This paper presents physics-aware joint shape and pose optimization for simulation-ready cluttered scene estimation using shape-differentiable contact models. The method efficiently scales to scenes with up to 5 objects through structured sparse Hessian exploitation and integrates learning-based initialization with differentiable texture refinement.","reasoning":"No code available. Novel physics-constrained optimization for robotics simulation. Practical for sim2real but limited scale (5 objects) and no implementation released.","code_url":null,"s2_tldr":"This work proposes a unified optimization-based formulation for real-to-sim scene estimation that jointly recovers the shapes and poses of multiple rigid objects under physical constraints and develops an end-to-end real-to-sim scene estimation pipeline that integrates learning-based object initialization, physics-constrained joint shape-pose optimization, and differentiable texture refinement.","s2_paper_id":"afeb12f06a0c692874cc60a6e8fd9392c2499b85","topics":"[\"Optimization\", \"Agents\"]"},{"id":209,"run_id":1,"domain":"aiml","arxiv_id":"2602.20051","entry_id":"","title":"SEAL-pose: Enhancing 3D Human Pose Estimation via a Learned Loss for Structural Consistency","authors":"[\"Yeonsung Kim\", \"Junggeun Do\", \"Seunguk Do\", \"Sangmin Kim\", \"Jaesik Park\", \"Jay-Yoon Lee\"]","abstract":"3D human pose estimation (HPE) is characterized by intricate local and global dependencies among joints. Conventional supervised losses are limited in capturing these correlations because they treat each joint independently. 
Previous studies have attempted to promote structural consistency through manually designed priors or rule-based constraints; however, these approaches typically require manual specification and are often non-differentiable, limiting their use as end-to-end training objectiv","published":"2026-02-23T17:00:35+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20051v1","arxiv_url":"http://arxiv.org/abs/2602.20051v1","comment":"17 pages","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"SEAL-pose introduces a learnable loss-net that trains pose-nets by evaluating structural plausibility through joint-graph-based design. This data-driven approach learns complex structural dependencies without hand-crafted priors, reducing per-joint errors across three 3D HPE benchmarks with eight backbones.","reasoning":"No code/weights available; moderate novelty in learned loss design; good applicability for 3D pose estimation tasks.","code_url":null,"s2_tldr":"Extensive experiments on three 3D HPE benchmarks with eight backbones show that SEAL-pose reduces per-joint errors and improves pose plausibility compared with the corresponding backbones across all settings, and outperforms models with explicit structural constraints, despite not enforcing any such constraints.","s2_paper_id":"972480689eb7ea4a63a7dc3ce07c86e3d784b3b9","topics":"[\"3D / Vision\"]"},{"id":210,"run_id":1,"domain":"aiml","arxiv_id":"2602.19974","entry_id":"","title":"RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection","authors":"[\"Tianyu Wang\", \"Zhiyuan Ma\", \"Qian Wang\", \"Xinyi Zhang\", \"Xinwei Long\", \"Bowen Zhou\"]","abstract":"Recent advancements in image generation have achieved impressive results in producing high-quality images. However, existing image generation models still generally struggle with a spatial reasoning dilemma, lacking the ability to accurately capture fine-grained spatial relationships from the prompt and correctly generate scenes with structural integrity. To mitigate this dilemma, we propose RL-RIG, a Reinforcement Learning framework for Reflection-based Image Generation. Our architecture compri","published":"2026-02-23T15:39:53+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19974v1","arxiv_url":"http://arxiv.org/abs/2602.19974v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"RL-RIG introduces a Reinforcement Learning framework for image generation using a Generate-Reflect-Edit paradigm with four components: Diffuser, Checker, Actor, and Inverse Diffuser. 
The approach improves spatial reasoning in image generation, achieving 11% improvement on spatial accuracy metrics.","reasoning":"No code/weights indicated; moderate novelty in RL-based image generation; good applicability for controllable generation but limited ecosystem presence.","code_url":null,"s2_tldr":"RL-RIG, a Reinforcement Learning framework for Reflection-based Image Generation that outperforms existing state-of-the-art open-source models by up to 11% in terms of controllable and precise spatial reasoning in image generation.","s2_paper_id":"4c7759a740f66a13e1b26fc9eb5947a2b38b2202","topics":"[\"RL\", \"Image Generation\", \"Reasoning\"]"},{"id":211,"run_id":1,"domain":"aiml","arxiv_id":"2602.19931","entry_id":"","title":"Expanding the Role of Diffusion Models for Robust Classifier Training","authors":"[\"Pin-Han Huang\", \"Shang-Tse Chen\", \"Hsuan-Tien Lin\"]","abstract":"Incorporating diffusion-generated synthetic data into adversarial training (AT) has been shown to substantially improve the training of robust image classifiers. In this work, we extend the role of diffusion models beyond merely generating synthetic data, examining whether their internal representations, which encode meaningful features of the data, can provide additional benefits for robust classifier training. Through systematic experiments, we show that diffusion models offer representations ","published":"2026-02-23T15:06:52+00:00","categories":"[\"cs.LG\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19931v1","arxiv_url":"http://arxiv.org/abs/2602.19931v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"This work extends diffusion models' role in adversarial training beyond synthetic data generation by incorporating their internal representations as auxiliary learning signals. The approach consistently improves robustness across CIFAR-10, CIFAR-100, and ImageNet, encouraging more disentangled features.","reasoning":"No code/weights available; moderate novelty in leveraging diffusion representations; good applicability for robust classifier training.","code_url":null,"s2_tldr":"It is shown that diffusion models offer representations that are both diverse and partially robust, and that explicitly incorporating diffusion representations as an auxiliary learning signal during AT consistently improves robustness across settings.","s2_paper_id":"74c834991c84e4cdf6abf3b27a3cea461ec33658","topics":"[\"Optimization\"]"},{"id":213,"run_id":1,"domain":"aiml","arxiv_id":"2602.19615","entry_id":"","title":"Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness","authors":"[\"Xin Hu\", \"Haomiao Ni\", \"Yunbei Zhang\", \"Jihun Hamm\", \"Zechen Li\", \"Zhengming Ding\"]","abstract":"Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods are still computationally intensive during finetuning VLMs and don't fully exploit the original training data. 
In this paper, we introduce an efficien","published":"2026-02-23T09:02:40+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19615v1","arxiv_url":"http://arxiv.org/abs/2602.19615v1","comment":"Accepted by CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":4.8,"summary":"Proposes plug-and-play module improving VLM reasoning on rare objects via multi-modal class embeddings from vision foundation models and synonym-augmented text. Refines visual tokens and enriches prompts without VLM finetuning.","reasoning":"Practical plug-and-play approach without finetuning VLMs. No code/weights despite strong applicability.","code_url":null,"s2_tldr":"This paper proposes to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples, and proposes a lightweight attention-based enhancement module that improves fine-grained object details.","s2_paper_id":"4d4b51eccf99681d4f79f12206348c0dbfd8fee2","topics":"[\"Language Models\", \"Multimodal\", \"Reasoning\"]"},{"id":214,"run_id":1,"domain":"aiml","arxiv_id":"2602.19575","entry_id":"","title":"ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization","authors":"[\"Minseo Kim\", \"Minchan Kwon\", \"Dongyeun Lee\", \"Yunho Jeon\", \"Junmo Kim\"]","abstract":"Personalized text-to-image generation suffers from concept entanglement, where irrelevant residual information from reference images is captured, leading to a trade-off between concept fidelity and text alignment. Recent disentanglement approaches attempt to solve this utilizing manual guidance, such as linguistic cues or segmentation masks, which limits their applicability and fails to fully articulate the target concept. In this paper, we propose ConceptPrism, a novel framework that automatica","published":"2026-02-23T07:46:19+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19575v1","arxiv_url":"http://arxiv.org/abs/2602.19575v1","comment":"Accepted to CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"ConceptPrism disentangles shared visual concepts from image-specific residuals in personalized diffusion models by jointly optimizing target and residual tokens without manual guidance. It achieves improved fidelity-alignment trade-off in personalized text-to-image generation.","reasoning":"No code/weights found. CVPR 2026 acceptance suggests novelty, but lack of public resources limits immediate applicability. 
Personalized diffusion is a growing area.","code_url":null,"s2_tldr":"ConceptPrism is a novel framework that automatically disentangles the shared visual concept from image-specific residuals by comparing images within a set; it jointly optimizes a target token and image-wise residual tokens using two complementary objectives: a reconstruction loss to ensure fidelity and a novel exclusion loss that compels residual tokens to discard the shared concept.","s2_paper_id":"4e7720c7dbe7a9330ab6a5db53d813ce59ac0f6d","topics":"[\"Optimization\", \"Image Generation\", \"3D / Vision\"]"},{"id":215,"run_id":1,"domain":"aiml","arxiv_id":"2602.19571","entry_id":"","title":"HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies","authors":"[\"Chang Liu\", \"Yunfan Ye\", \"Qingyang Zhou\", \"Xichen Tan\", \"Mengxuan Luo\", \"Zhenyu Qiu\", \"Wei Peng\", \"Zhiping Cai\"]","abstract":"Video-LLMs have improved steadily on semantic perception, but they still fall short on predictive world modeling, which is central to physically grounded intelligence. We introduce HOCA-Bench, a benchmark that frames physical anomalies through a Hegelian lens. HOCA-Bench separates anomalies into two types: ontological anomalies, where an entity violates its own definition or persistence, and causal anomalies, where interactions violate physical relations. Using state-of-the-art generative video ","published":"2026-02-23T07:40:32+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19571v1","arxiv_url":"http://arxiv.org/abs/2602.19571v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":5.0,"composite":4.8,"summary":"HOCA-Bench benchmarks Video-LLMs on physical world modeling by separating anomalies into ontological (entity violations) and causal (interaction violations) using generative video models as adversarial simulators. Evaluations on 17 models reveal a 20%+ performance drop on causal reasoning tasks, exposing a gap in physical understanding.","reasoning":"No code/weights. Novel Hegelian framing and benchmark for video-LLM physical reasoning is interesting but remains a survey/benchmark rather than a new method.","code_url":null,"s2_tldr":"System-2 \"Thinking\" modes improve reasoning, but they do not close the gap, suggesting that current architectures recognize visual patterns more readily than they apply basic physical laws.","s2_paper_id":"4ce3512766cda1a6aaa59afe882f9a15453efcff","topics":"[\"Reasoning\", \"World Models\", \"Benchmark\"]"},{"id":216,"run_id":1,"domain":"aiml","arxiv_id":"2602.19565","entry_id":"","title":"DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces","authors":"[\"Li Zhang\", \"Mingyu Mei\", \"Ailing Wang\", \"Xianhui Meng\", \"Yan Zhong\", \"Xinyuan Song\", \"Liu Liu\", \"Rujing Wang\", \"Zaixing He\", \"Cewu Lu\"]","abstract":"Articulated object pose estimation is a core task in embodied AI. Existing methods typically regress poses in a continuous space, but often struggle with 1) navigating a large, complex search space and 2) failing to incorporate intrinsic kinematic constraints. In this work, we introduce DICArt (DIsCrete Diffusion for Articulation Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. 
Instead of operating in a continuous domain, DICArt pro","published":"2026-02-23T07:30:47+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19565v1","arxiv_url":"http://arxiv.org/abs/2602.19565v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"DICArt formulates articulated object pose estimation as conditional discrete diffusion with a hierarchical kinematic coupling strategy and flexible flow decider. It improves modeling fidelity on synthetic and real-world datasets for category-level 6D pose estimation.","reasoning":"No code/weights. Discrete diffusion for articulation is novel but specialized. Embodied AI applications are emerging but still niche.","code_url":null,"s2_tldr":"This work introduces DICArt (DIsCrete Diffusion for Articulation Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process that integrates discrete generative modeling with structural priors and incorporates a hierarchical kinematic coupling strategy.","s2_paper_id":"0e694e3a990be4a2719dc8b7a725e373e6713dc3","topics":"[\"Architecture\", \"Robotics\"]"},{"id":217,"run_id":1,"domain":"aiml","arxiv_id":"2602.19471","entry_id":"","title":"Forgetting-Resistant and Lesion-Aware Source-Free Domain Adaptive Fundus Image Analysis with Vision-Language Model","authors":"[\"Zheang Huai\", \"Hui Tang\", \"Hualiang Wang\", \"Xiaomeng Li\"]","abstract":"Source-free domain adaptation (SFDA) aims to adapt a model trained in the source domain to perform well in the target domain, with only unlabeled target domain data and the source model. Taking into account that conventional SFDA methods are inevitably error-prone under domain shift, recently greater attention has been directed to SFDA assisted with off-the-shelf foundation models, e.g., vision-language (ViL) models. However, existing works of leveraging ViL models for SFDA confront two issues: ","published":"2026-02-23T03:29:54+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19471v1","arxiv_url":"http://arxiv.org/abs/2602.19471v1","comment":"10 pages","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"FRLA addresses source-free domain adaptation for fundus image diagnosis with vision-language models via forgetting-resistant and lesion-aware modules. Preserves confident predictions and leverages patch-wise fine-grained knowledge from ViL models, outperforming SOTA methods.","reasoning":"Code release promised. Medical imaging application (fundus) limits broader impact. 
SFDA with ViL is emerging but specialized domain.","code_url":null,"s2_tldr":"A forgetting-resistant and lesion-aware (FRLA) method for SFDA of fundus image diagnosis with ViL model that not only significantly outperforms the vision-language model, but also achieves consistent improvements over the state-of-the-art methods.","s2_paper_id":"6a6a00a384f5a45bff44a91579586c740567577f","topics":"[\"Language Models\", \"Multimodal\"]"},{"id":218,"run_id":1,"domain":"aiml","arxiv_id":"2602.19385","entry_id":"","title":"Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition","authors":"[\"Minxue Tang\", \"Yangyang Yu\", \"Aolin Ding\", \"Maziyar Baran Pouyan\", \"Taha Belkhouja\", \"Yujia Bao\"]","abstract":"Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI. However, tackling long-tail pattern recognition tasks remains challenging for current pre-trained foundation models such as LLMs and VLMs. While finetuning pre-trained models can improve accuracy in recognizing implicit patterns, it is usually infeasible due to a lack of training data and high computational overhead. In this paper, we propose ADAMAB, an efficient embedding calibration fram","published":"2026-02-22T23:39:21+00:00","categories":"[\"cs.CV\", \"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19385v1","arxiv_url":"http://arxiv.org/abs/2602.19385v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"ADAMAB proposes adaptive data augmentation with multi-armed bandit for few-shot pattern recognition in VLMs/LLMs. Trains light-weight calibrators on fixed embeddings using modified UCB algorithm, achieving 40% accuracy improvement with <5 samples per class.","reasoning":"Useful sample-efficient approach for few-shot learning, but no code/weights and incremental improvement over existing methods.","code_url":null,"s2_tldr":"ADAMAB trains embedder-agnostic light-weight calibrators on top of fixed embedding models without accessing their parameters to maximally reduce the computational costs and mitigate the need for large-scale training data.","s2_paper_id":"d6125537e0e2fd487ea2b9ebb5adea150bf64d53","topics":"[\"Efficiency\", \"Retrieval / RAG\", \"RL\"]"},{"id":219,"run_id":1,"domain":"aiml","arxiv_id":"2602.19323","entry_id":"","title":"DefenseSplat: Enhancing the Robustness of 3D Gaussian Splatting via Frequency-Aware Filtering","authors":"[\"Yiran Qiao\", \"Yiren Lu\", \"Yunlai Zhou\", \"Rui Yang\", \"Linlin Hou\", \"Yu Yin\", \"Jing Ma\"]","abstract":"3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for real-time and high-fidelity 3D reconstruction from posed images. However, recent studies reveal its vulnerability to adversarial corruptions in input views, where imperceptible yet consistent perturbations can drastically degrade rendering quality, increase training and rendering time, and inflate memory usage, even leading to server denial-of-service. 
In our work, to mitigate this issue, we begin by analyzing the distinct behav","published":"2026-02-22T20:00:02+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19323v1","arxiv_url":"http://arxiv.org/abs/2602.19323v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"DefenseSplat enhances 3D Gaussian Splatting robustness against adversarial attacks through frequency-aware filtering using wavelet transforms. Filters high-frequency noise while preserving low-frequency content, substantially improving robustness without ground-truth supervision or significant clean-data performance loss.","reasoning":"Addresses important security vulnerability in 3DGS with simple yet effective approach. No code/weights but good practical value for production systems.","code_url":null,"s2_tldr":"This work designs a simple yet effective frequency-aware defense strategy that reconstructs training views by filtering high-frequency noise while preserving low-frequency content, which effectively suppresses adversarial artifacts while maintaining the authenticity of the original scene.","s2_paper_id":"1d796df7f686c1c52f970d85dc490f51bf205c95","topics":"[\"3D / Vision\"]"},{"id":220,"run_id":1,"domain":"aiml","arxiv_id":"2602.19322","entry_id":"","title":"US-JEPA: A Joint Embedding Predictive Architecture for Medical Ultrasound","authors":"[\"Ashwath Radhachandran\", \"Vedrana Ivezi\\u0107\", \"Shreeram Athreya\", \"Ronit Anilkumar\", \"Corey W. Arnold\", \"William Speier\"]","abstract":"Ultrasound (US) imaging poses unique challenges for representation learning due to its inherently noisy acquisition process. The low signal-to-noise ratio and stochastic speckle patterns hinder standard self-supervised learning methods relying on a pixel-level reconstruction objective. Joint-Embedding Predictive Architectures (JEPAs) address this drawback by predicting masked latent representations rather than raw pixels. However, standard approaches depend on hyperparameter-brittle and computat","published":"2026-02-22T19:56:56+00:00","categories":"[\"cs.CV\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19322v1","arxiv_url":"http://arxiv.org/abs/2602.19322v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"US-JEPA applies Joint-Embedding Predictive Architecture with Static-teacher Asymmetric Latent Training (SALT) for ultrasound self-supervised learning. Uses frozen domain-specific teacher for stable latent targets, achieving competitive or superior performance to foundation models on UltraBench classification tasks.","reasoning":"Novel application of JEPA to ultrasound with rigorous benchmark comparison. 
Domain-specific medical application limits broader impact, no code/weights.","code_url":null,"s2_tldr":"This work proposes US-JEPA, a self-supervised framework that adopts the Static-teacher Asymmetric Latent Training (SALT) objective and demonstrates that masked latent prediction provides a stable and efficient path toward robust ultrasound representations.","s2_paper_id":"305d83fdb6ba24114cc0f8dc6b8a15be3b11ca6a","topics":"[\"Retrieval / RAG\"]"},{"id":221,"run_id":1,"domain":"aiml","arxiv_id":"2602.19274","entry_id":"","title":"DD-CAM: Minimal Sufficient Explanations for Vision Models Using Delta Debugging","authors":"[\"Krishna Khadka\", \"Yu Lei\", \"Raghu N. Kacker\", \"D. Richard Kuhn\"]","abstract":"We introduce a gradient-free framework for identifying minimal, sufficient, and decision-preserving explanations in vision models by isolating the smallest subset of representational units whose joint activation preserves predictions. Unlike existing approaches that aggregate all units, often leading to cluttered saliency maps, our approach, DD-CAM, identifies a 1-minimal subset whose joint activation suffices to preserve the prediction (i.e., removing any unit from the subset alters the predict","published":"2026-02-22T17:12:31+00:00","categories":"[\"cs.CV\", \"cs.SE\"]","pdf_url":"https://arxiv.org/pdf/2602.19274v1","arxiv_url":"http://arxiv.org/abs/2602.19274v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"DD-CAM uses delta debugging from software engineering to identify minimal sufficient explanations in vision models. Unlike aggregation-based CAM methods, it isolates the smallest subset of representational units whose activation preserves predictions, producing cleaner saliency maps.","reasoning":"Novel gradient-free approach borrowing from software debugging. Improves interpretability but no code/weights available yet.","code_url":null,"s2_tldr":null,"s2_paper_id":"3bf85e86654b075989124388bfdc87386e14c833","topics":"[\"Optimization\"]"},{"id":223,"run_id":1,"domain":"aiml","arxiv_id":"2602.19180","entry_id":"","title":"VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery","authors":"[\"Wenhao Shen\", \"Hao Wang\", \"Wanqi Yin\", \"Fayao Liu\", \"Xulei Yang\", \"Chao Liang\", \"Zhongang Cai\", \"Guosheng Lin\"]","abstract":"Human mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent diffusion-based methods tackle this by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this, we introduce a dual-memory augmented HMR critique agent with self-reflection to ","published":"2026-02-22T13:19:06+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19180v1","arxiv_url":"http://arxiv.org/abs/2602.19180v1","comment":"Accepted to CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"VLM-guided group preference alignment for diffusion-based human mesh recovery. 
A dual-memory critique agent generates context-aware quality scores for building preference datasets, used to finetune diffusion models for more physically plausible and image-consistent predictions.","reasoning":"Novel application of preference learning to HMR. CVPR 2026 accepted but no code/weights available.","code_url":null,"s2_tldr":"This work introduces a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes, and proposes a group preference alignment framework for finetuning diffusion-based HMR models.","s2_paper_id":"86e015fd9fa8e42bbf9ccfecbe537050ca1cc9d8","topics":"[\"Multimodal\", \"Training\", \"Agents\"]"},{"id":224,"run_id":1,"domain":"aiml","arxiv_id":"2602.19112","entry_id":"","title":"Universal 3D Shape Matching via Coarse-to-Fine Language Guidance","authors":"[\"Qinfeng Xiao\", \"Guofeng Mei\", \"Bo Yang\", \"Liying Zhang\", \"Jian Zhang\", \"Kit-lun Yick\"]","abstract":"Establishing dense correspondences between shapes is a crucial task in computer vision and graphics, while prior approaches depend on near-isometric assumptions and homogeneous subject types (i.e., only operate for human shapes). However, building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To achieve this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences betwe","published":"2026-02-22T10:07:03+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19112v2","arxiv_url":"http://arxiv.org/abs/2602.19112v2","comment":"Accepted by CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"UniMatch enables universal 3D shape matching across different object categories using coarse-to-fine language guidance. It combines class-agnostic segmentation, MLLMs for part naming, VLMs for semantic matching, and rank-based contrastive learning to establish dense correspondences between non-isometric shapes without predefined part proposals.","reasoning":"Interesting multi-modal approach for cross-category matching, but no code/weights available despite CVPR 2026 acceptance. Practical for graphics applications.","code_url":null,"s2_tldr":"This work proposes UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories, and demonstrates UniMatch consistently outperforms competing methods in various challenging scenarios.","s2_paper_id":"003649ca5ce4f9230d6426053666bff9c63c55df","topics":"[\"3D / Vision\"]"},{"id":225,"run_id":1,"domain":"aiml","arxiv_id":"2602.18906","entry_id":"","title":"Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates","authors":"[\"Shengjie Zhu\", \"Ahmed Abdelkader\", \"Mark J. Matthews\", \"Xiaoming Liu\", \"Wen-Sheng Chu\"]","abstract":"Structure-from-Motion (SfM) is a fundamental 3D vision task for recovering camera parameters and scene geometry from multi-view images. While recent deep learning advances enable accurate Monocular Depth Estimation (MDE) from single images without depending on camera motion, integrating MDE into SfM remains a challenge. Unlike conventional triangulated sparse point clouds, MDE produces dense depth maps with significantly higher error variance. 
Inspired by modern RANSAC estimators, we propose Mar","published":"2026-02-21T17:01:32+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.18906v1","arxiv_url":"http://arxiv.org/abs/2602.18906v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Marginalized Bundle Adjustment (MBA) integrates monocular depth estimation into Structure-from-Motion by mitigating MDE error variance through dense depth maps. Achieves state-of-the-art results in SfM and camera relocalization across varying scales from few-frame to thousands of images.","reasoning":"Novel integration of MDE into SfM with RANSAC-inspired approach. No explicit code/weights mentioned. Strong practical results but unclear implementation details limit immediate replicability.","code_url":null,"s2_tldr":"This work shows that MDE depth maps are sufficiently accurate to yield SoTA or competitive results in SfM and camera relocalization tasks, and proposes Marginalized Bundle Adjustment (MBA) to mitigate MDE error variance leveraging its density.","s2_paper_id":"f8f72afd48f6a3bdce45162b63fe2e23935b9fe6","topics":"[\"3D / Vision\"]"},{"id":226,"run_id":1,"domain":"aiml","arxiv_id":"2602.20751","entry_id":"","title":"SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing","authors":"[\"Yifei Xu\", \"Guilherme Potje\", \"Shivam Shandilya\", \"Tiancheng Yuan\", \"Leonardo de Oliveira Nunes\", \"Rakshanda Agarwal\", \"Saeid Asgari\", \"Adam Atkinson\", \"Emre K\\u0131c\\u0131man\", \"Songwu Lu\"]","abstract":"Designing aligned and robust rewards for open-ended generation remains a key barrier to RL post-training. Rubrics provide structured, interpretable supervision, but scaling rubric construction is difficult: expert rubrics are costly, prompted rubrics are often superficial or inconsistent, and fixed-pool discriminative rubrics can saturate and drift, enabling reward hacking. We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory b","published":"2026-02-24T10:28:44+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.20751v1","arxiv_url":"http://arxiv.org/abs/2602.20751v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"SibylSense adapts frozen rubric generators through tunable memory bank for RL reward design. Alternates memory tuning with adversarial policy updates to prevent reward hacking and maintain discriminative power for open-ended generation tasks.","reasoning":"Novel inference-time learning approach for reward design but no code/weights available. 
Addresses important RL alignment problem.","code_url":null,"s2_tldr":"SibylSense is presented, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items, and yields more discriminative rubrics and improves downstream RL performance over static and non-adaptive baselines.","s2_paper_id":"4dd8a2405cb9599d72e7a7e9d4637996c091f726","topics":"[\"RL\"]"},{"id":227,"run_id":1,"domain":"aiml","arxiv_id":"2602.20735","entry_id":"","title":"RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition","authors":"[\"Kun Ran\", \"Marwah Alaofi\", \"Danula Hettiachchi\", \"Chenglong Ma\", \"Khoi Nguyen Dinh Anh\", \"Khoi Vo Nguyen\", \"Sachin Pathiyan Cherumanal\", \"Lida Rashidi\", \"Falk Scholer\", \"Damiano Spina\"]","abstract":"This paper presents the award-winning RMIT-ADM+S system for the Text-to-Text track of the NeurIPS~2025 MMU-RAG Competition. We introduce Routing-to-RAG (R2RAG), a research-focused retrieval-augmented generation (RAG) architecture composed of lightweight components that dynamically adapt the retrieval strategy based on inferred query complexity and evidence sufficiency. The system uses smaller LLMs, enabling operation on a single consumer-grade GPU while supporting complex research ta","published":"2026-02-24T09:58:25+00:00","categories":"[\"cs.IR\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20735v1","arxiv_url":"http://arxiv.org/abs/2602.20735v1","comment":"MMU-RAG NeurIPS 2025 winning system","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":4.8,"summary":"R2RAG (Routing-to-RAG) won Best Dynamic Evaluation at NeurIPS 2025 MMU-RAG Competition using lightweight components that dynamically adapt retrieval strategy based on query complexity. Runs on single consumer GPU while supporting complex research tasks.","reasoning":"Award-winning system with practical efficiency gains. Competition code likely available, good practical applicability for RAG systems.","code_url":null,"s2_tldr":null,"s2_paper_id":"727925c518aa8f3a35186f4b5819a19bf70a2ff8","topics":"[\"Retrieval / RAG\"]"},{"id":228,"run_id":1,"domain":"aiml","arxiv_id":"2602.20610","entry_id":"","title":"SpecMind: Cognitively Inspired, Interactive Multi-Turn Framework for Postcondition Inference","authors":"[\"Cuong Chi Le\", \"Minh V. T Pham\", \"Tung Vu Duy\", \"Cuong Duc Van\", \"Huy N. Phan\", \"Hoang N. Phan\", \"Tien N. Nguyen\"]","abstract":"Specifications are vital for ensuring program correctness, yet writing them manually remains challenging and time-intensive. Recent large language model (LLM)-based methods have shown successes in generating specifications such as postconditions, but existing single-pass prompting often yields inaccurate results. In this paper, we present SpecMind, a novel framework for postcondition generation that treats LLMs as interactive and exploratory reasoners rather than one-shot generators. 
SpecMind em","published":"2026-02-24T07:01:17+00:00","categories":"[\"cs.SE\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20610v1","arxiv_url":"http://arxiv.org/abs/2602.20610v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"SpecMind uses feedback-driven multi-turn prompting for postcondition generation in program verification, enabling LLMs to iteratively refine specifications via implicit/explicit feedback. Significantly outperforms state-of-the-art single-pass approaches in accuracy and completeness.","reasoning":"Novel interactive framework for formal methods but no code/weights. Useful for software verification practitioners but niche application domain.","code_url":null,"s2_tldr":null,"s2_paper_id":"c46c8b1a2de32f1da11018119cdd30cc961b74ad","topics":"[\"Language Models\"]"},{"id":229,"run_id":1,"domain":"aiml","arxiv_id":"2602.20580","entry_id":"","title":"Personal Information Parroting in Language Models","authors":"[\"Nishant Subramani\", \"Kshitish Ghate\", \"Mona Diab\"]","abstract":"Modern language models (LM) are trained on large scrapes of the Web, containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization: finding that 13.6% are parroted verbatim by the Pythia-6.9b m","published":"2026-02-24T06:02:03+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.CR\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.20580v1","arxiv_url":"http://arxiv.org/abs/2602.20580v1","comment":"EACL Findings 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":4.8,"summary":"Develops R&R detector suite outperforming regex-based PI detectors, finding 13.6% of personal information instances are verbatim parroted by Pythia-6.9b. Analysis across model sizes (160M-6.9B) shows both size and pretraining positively correlate with PI memorization.","reasoning":"Important privacy analysis with practical implications for dataset curation, but no code/weights released. Useful findings for practitioners.","code_url":null,"s2_tldr":"The regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses is developed, which outperforms the best regex-based PI detectors and strongly recommends that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.","s2_paper_id":"13bef8598cc4bfb63ba580b725a98db2e9380f96","topics":"[\"Language Models\"]"},{"id":230,"run_id":1,"domain":"aiml","arxiv_id":"2602.20459","entry_id":"","title":"PreScience: A Benchmark for Forecasting Scientific Contributions","authors":"[\"Anirudh Ajith\", \"Amanpreet Singh\", \"Jay DeYoung\", \"Nadav Kunievsky\", \"Austin C. Kozlowski\", \"Oyvind Tafjord\", \"James Evans\", \"Daniel S. Weld\", \"Tom Hope\", \"Doug Downey\"]","abstract":"Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? 
Such a capability could help researchers identify collaborators and impactful research directions, and anticipate which problems and methods will become central next. We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generatio","published":"2026-02-24T01:37:53+00:00","categories":"[\"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20459v1","arxiv_url":"http://arxiv.org/abs/2602.20459v1","comment":"10 pages (53 with bibliography and appendix), 4 figures (13 with appendix), 4 tables (10 with appendix), 1 algorithm","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"PreScience is a benchmark for forecasting scientific contributions with 98K AI papers, decomposing research into four tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. Introduces LACERScore for measuring contribution similarity. Frontier LLMs achieve only moderate performance (GPT-5 scores 5.6/10), with synthetic corpora being less diverse and novel than human research.","reasoning":"Novel benchmark addressing important scientific forecasting tasks with comprehensive evaluation. Reveals substantial headroom for improvement. No code/weights released limits reproducibility.","code_url":null,"s2_tldr":"PreScience is introduced -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement.","s2_paper_id":"3402a5f78154a76e6b31f6dc179df6c1e82065f4","topics":"[\"Benchmark\"]"},{"id":231,"run_id":1,"domain":"aiml","arxiv_id":"2602.20449","entry_id":"","title":"Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference","authors":"[\"Anna Hart\", \"Chi Han\", \"Jeonghwan Kim\", \"Huimin Zhao\", \"Heng Ji\"]","abstract":"Modern Protein Language Models (PLMs) apply transformer-based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties. However, protein language has key differences from natural language, such as a rich functional space despite a vocabulary of only 20 amino acids. These differences motivate research into how transformer-based architectures operate differently in the protein domain and how we can better leverage PLMs t","published":"2026-02-24T01:18:30+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\", \"q-bio.BM\"]","pdf_url":"https://arxiv.org/pdf/2602.20449v1","arxiv_url":"http://arxiv.org/abs/2602.20449v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Compares information distribution across attention heads in protein vs. natural language transformers, revealing key differences. 
Adapts early-exit techniques to protein property prediction, achieving 0.4-7.01 percentage point gains and 10%+ efficiency improvements by selecting representations from intermediate layers based on task and protein.","reasoning":"Solid comparative analysis with practical efficiency improvements. Novel insights into domain-specific transformer behavior. No code/weights hurts reproducibility.","code_url":null,"s2_tldr":"This work begins by directly comparing how the distribution of information stored across layers of attention heads differs between the protein and natural language domain and opens up an area of research directly comparing how language models change behavior when moved into the protein domain and advances language modeling in biological domains.","s2_paper_id":"8f9395e4f42c28b21abf5aefe8081d371c44b3f6","topics":"[\"Language Models\", \"Reasoning\", \"Architecture\"]"},{"id":232,"run_id":1,"domain":"aiml","arxiv_id":"2602.20379","entry_id":"","title":"Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems","authors":"[\"Mukul Chhabra\", \"Luigi Medrano\", \"Arush Verma\"]","abstract":"Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints, structured identifiers (e.g., error codes, versions), and resolution workflows. Existing RAG evaluation frameworks are primarily designed for benchmark-style or single-turn settings and often fail to capture enterprise-specific failure modes such as case misidentification, workflow misalignment, ","published":"2026-02-23T21:37:06+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20379v1","arxiv_url":"http://arxiv.org/abs/2602.20379v1","comment":"12 pages including appendix, 6 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":4.8,"summary":"Presents case-aware LLM-as-a-Judge framework for evaluating enterprise multi-turn RAG systems in operational contexts like technical support. Eight operationally grounded metrics evaluate retrieval quality, grounding, answer utility, precision integrity, and workflow alignment. Severity-aware scoring improves diagnostic clarity for production monitoring and regression testing.","reasoning":"Practically useful for enterprise RAG deployment with actionable metrics. Incremental advancement over existing evaluation methods. No open models or novel architectures.","code_url":null,"s2_tldr":"This work presents a case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems, showing that generic proxy metrics provide ambiguous signals, while the proposed framework exposes enterprise-critical tradeoffs that are actionable for system improvement.","s2_paper_id":"c5e1627bbb1d7098fb201a1c1d1bb07a0336d157","topics":"[\"Retrieval / RAG\", \"Benchmark\", \"Language Models\"]"},{"id":233,"run_id":1,"domain":"aiml","arxiv_id":"2602.20300","entry_id":"","title":"What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance","authors":"[\"William Watson\", \"Nicole Cho\", \"Sumitra Ganesh\", \"Manuela Veloso\"]","abstract":"Large Language Model (LLM) hallucinations are usually treated as defects of the model or its decoding strategy. 
Drawing on classical linguistics, we argue that a query's form can also shape a listener's (and model's) response. We operationalize this insight by constructing a 22-dimension query feature vector covering clause complexity, lexical rarity, and anaphora, negation, answerability, and intention grounding, all known to affect human comprehension. Using 369,837 real-world queries, we ask:","published":"2026-02-23T19:30:08+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20300v1","arxiv_url":"http://arxiv.org/abs/2602.20300v1","comment":"EACL 2026 Findings","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Constructs 22-dimension query feature vector covering linguistic properties (clause complexity, lexical rarity, anaphora, negation, answerability) to predict LLM hallucination risk. Analysis of 369K queries reveals consistent \"risk landscape\" where features like deep nesting increase hallucination while clear intention grounding reduces it.","reasoning":"Novel linguistic feature engineering approach to understanding hallucinations. Large-scale empirical analysis provides actionable insights. No code/models but useful for guided query rewriting.","code_url":null,"s2_tldr":"It is argued that a query's form can also shape a listener's (and model's) response, and an empirically observable query-feature representation correlated with hallucination risk is established, paving the way for guided query rewriting and future intervention studies.","s2_paper_id":"5b30891711c4049bbb4030a71c480f9f9f3b477f","topics":"[\"Language Models\"]"},{"id":234,"run_id":1,"domain":"aiml","arxiv_id":"2602.20294","entry_id":"","title":"InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation","authors":"[\"Yu Li\", \"Pranav Narayanan Venkit\", \"Yada Pruksachatkun\", \"Chien-Sheng Wu\"]","abstract":"Simulating real personalities with large language models requires grounding generation in authentic personal data. Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said. We address this gap with an interview-grounded evaluation framework for personality simulation at a large scale. We extract over 671,000 question-answer pairs from 23,000 verified interview t","published":"2026-02-23T19:21:10+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.CY\"]","pdf_url":"https://arxiv.org/pdf/2602.20294v1","arxiv_url":"http://arxiv.org/abs/2602.20294v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"InterviewSim presents interview-grounded evaluation framework for personality simulation using 671K QA pairs from 23K verified interview transcripts across 1K public personalities. Proposes multi-dimensional evaluation with four metrics measuring content similarity, factual consistency, personality alignment, and knowledge retention. Reveals trade-off between retrieval and chronological methods.","reasoning":"Novel large-scale evaluation framework grounded in authentic interview data. Solid empirical analysis revealing method trade-offs. 
No models released, primarily evaluation-focused.","code_url":null,"s2_tldr":"An interview-grounded evaluation framework for personality simulation at a large scale with a trade-off in how interview data is best utilized is revealed: retrieval-augmented methods excel at capturing personality style and response quality, while chronological-based methods better preserve factual consistency and knowledge retention.","s2_paper_id":"cb5b9879c23cef00a0435d2a8d3c7ad2c06b1000","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":235,"run_id":1,"domain":"aiml","arxiv_id":"2602.20091","entry_id":"","title":"How Retrieved Context Shapes Internal Representations in RAG","authors":"[\"Samuel Yeh\", \"Sharon Li\"]","abstract":"Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial. In realistic retrieval settings, the retrieved document set often contains a mixture of documents that vary in relevance and usefulness. While prior work has largely examined these phenomena through output behavior, little is known about how retrieved context shapes the internal representations that mediat","published":"2026-02-23T18:02:04+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20091v1","arxiv_url":"http://arxiv.org/abs/2602.20091v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Analyzes how retrieved context shapes internal representations in RAG systems by examining hidden states across four QA datasets and three LLMs. Reveals how context relevancy and layer-wise processing influence representations in single- and multi-document settings, providing explanations for output behaviors and insights for RAG system design.","reasoning":"Solid interpretability analysis providing insights into RAG mechanisms. Useful for understanding model behavior but no novel architectures or methods. No code released.","code_url":null,"s2_tldr":"This work systematically analyzes how different types of retrieved documents affect the hidden states of LLMs, and how these internal representation shifts relate to downstream generation behavior, revealing how context relevancy and layer-wise processing influence internal representations.","s2_paper_id":"289267e934ac7aabc5d463a8787d8ae7a0e02ae4","topics":"[\"Retrieval / RAG\", \"Language Models\"]"},{"id":237,"run_id":1,"domain":"aiml","arxiv_id":"2602.19548","entry_id":"","title":"Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining","authors":"[\"Jeffrey Li\", \"Josh Gardner\", \"Doug Kang\", \"Fangping Shi\", \"Karanjeet Singh\", \"Chun-Liang Li\", \"Herumb Shandilya\", \"David Hall\", \"Oncel Tuzel\", \"Percy Liang\"]","abstract":"One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. 
We first show that while different extractors may lead to similar model performance on standard language understanding tas","published":"2026-02-23T06:41:57+00:00","categories":"[\"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19548v1","arxiv_url":"http://arxiv.org/abs/2602.19548v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":4.8,"summary":"Investigates HTML-to-text extraction for LLM pretraining, showing that different extractors lead to different surviving pages despite similar downstream performance. Proposes taking Union over extractors to increase token yield by up to 71%. Shows extractor choice significantly impacts structured content tasks (10pp on WikiTQ, 3pp on HumanEval).","reasoning":"Practical data preprocessing insight but no code/weights. Important for dataset construction but incremental methodological contribution.","code_url":null,"s2_tldr":"This work shows that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially, and suggests a simple intervention: by taking a Union over different extractors, this can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance.","s2_paper_id":"251d0e6cb247ddfb0c4bbb5dcafb12a9f678144c","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":238,"run_id":1,"domain":"aiml","arxiv_id":"2602.19509","entry_id":"","title":"Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference","authors":"[\"Arindam Khaled\"]","abstract":"Large Language Models (LLMs) face a persistent trade-off between inference cost and reasoning capability. While \"Oracle\" models (e.g., Llama-3-70B) achieve state-of-the-art accuracy, they are prohibitively expensive for high-volume deployment. Smaller models (e.g., 8B parameters) are cost-effective but struggle with complex tasks. In this work, we propose \"Pyramid MoA\", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessar","published":"2026-02-23T04:47:47+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19509v1","arxiv_url":"http://arxiv.org/abs/2602.19509v1","comment":"6 pages, 4 figures, 1 table","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":4.8,"summary":"Proposes Pyramid MoA, a hierarchical Mixture-of-Agents architecture using lightweight Router to dynamically escalate queries to larger models only when needed. Achieves 93.0% accuracy on GSM8K (vs. the Oracle's 98.0%) while reducing compute costs by 61% with negligible latency overhead (+0.82s).","reasoning":"Practical cost-optimization approach but no code/weights. Useful for deployment but incremental application of ensemble routing. 
Good cost-accuracy tradeoff demonstrated.","code_url":null,"s2_tldr":"This work proposes \"Pyramid MoA\", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary, and demonstrates that the system introduces negligible latency overhead and allows for a tunable trade-off between performance and budget.","s2_paper_id":"cdab5bca82dbee3e77d493deea5a8cea87307472","topics":"[\"Optimization\", \"Language Models\", \"Reasoning\"]"},{"id":240,"run_id":1,"domain":"aiml","arxiv_id":"2602.18806","entry_id":"","title":"Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models","authors":"[\"Abraham Paul Elenjical\", \"Vivek Hruday Kavuri\", \"Vasudeva Varma\"]","abstract":"Large Language Models (LLMs) demonstrate strong reasoning performance, yet their ability to reliably monitor, diagnose, and correct their own errors remains limited. We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight dual-process MetaController for adaptive effort allocation. Across diverse reasoning and diagn","published":"2026-02-21T11:45:12+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.18806v1","arxiv_url":"http://arxiv.org/abs/2602.18806v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Introduces metacognitive framework operationalizing Ann Brown's regulatory cycle (Planning, Monitoring, Evaluation) for LLM reasoning with MetaController for adaptive effort allocation. Achieves threefold improvement in self-correction on GSM8K, MBPP, AIME using Llama-3 and Qwen-3 (8B). Human evaluations show 84% preference for trustworthiness.","reasoning":"Psychologically grounded approach to reasoning with strong results, but no code released. Practical prompting technique but limited by missing reproducibility artifacts.","code_url":null,"s2_tldr":"A psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle as a structured prompting architecture is introduced and its integration within a lightweight dual-process MetaController for adaptive effort allocation is studied.","s2_paper_id":"8edf50874d45bd5bcd1b1eb65771c3a69220dc14","topics":"[\"Language Models\", \"Reasoning\", \"Agents\"]"},{"id":242,"run_id":1,"domain":"aiml","arxiv_id":"2602.18693","entry_id":"","title":"Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM","authors":"[\"Md Badsha Biswas\", \"Ozlem Uzuner\"]","abstract":"The spread of misinformation across digital platforms can pose significant societal risks. Claim verification, a.k.a. fact-checking, systems can help identify potential misinformation. However, their efficacy is limited by the knowledge sources that they rely on. Most automated claim verification systems depend on a single knowledge source and utilize the supporting evidence from that source; they ignore the disagreement of their source with others. 
This limits their knowledge coverage and trans","published":"2026-02-21T02:21:31+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18693v1","arxiv_url":"http://arxiv.org/abs/2602.18693v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Presents multi-source claim verification system using LLMs with dual-perspective evidence retrieval (original and negated claims) from Wikipedia, PubMed, and Google. Aggregates evidence across sources and quantifies inter-source disagreement. Evaluated on four benchmarks with five LLMs showing improved verification and transparency.","reasoning":"Comprehensive approach to fact-checking addressing source diversity, but no code released. Practical system design but limited by missing implementation.","code_url":null,"s2_tldr":"This work presents a novel system for open-domain claim verification (ODCV) that leverages large language models (LLMs), multi-perspective evidence retrieval, and cross-source disagreement analysis that shows that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning.","s2_paper_id":"e1f6437c15db45ba119b6e97374ec8bcc6a11d6b","topics":"[\"Language Models\", \"Retrieval / RAG\"]"},{"id":243,"run_id":1,"domain":"aiml","arxiv_id":"2602.18417","entry_id":"","title":"Subgroups of $U(d)$ Induce Natural RNN and Transformer Architectures","authors":"[\"Joshua Nunley\"]","abstract":"This paper presents a direct framework for sequence models with hidden states on closed subgroups of U(d). We use a minimal axiomatic setup and derive recurrent and transformer templates from a shared skeleton in which subgroup choice acts as a drop-in replacement for state space, tangent projection, and update map. We then specialize to O(d) and evaluate orthogonal-state RNN and transformer models on Tiny Shakespeare and Penn Treebank under parameter-matched settings. We also report a general l","published":"2026-02-20T18:35:43+00:00","categories":"[\"cs.LG\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18417v1","arxiv_url":"http://arxiv.org/abs/2602.18417v1","comment":"12 pages, 3 figures, 8 tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":5.0,"composite":4.8,"summary":"Proposes deriving RNN and transformer architectures from closed subgroups of U(d) using minimal axiomatic setup. Evaluates orthogonal-state models on Tiny Shakespeare and Penn Treebank, showing subgroup choice as drop-in replacement for state space design.","reasoning":"Novel mathematical framework connecting group theory to sequence architectures. Limited empirical validation and no code/weights. 
More theoretical than immediately practical.","code_url":null,"s2_tldr":"A minimal axiomatic setup is used and recurrent and transformer templates from a shared skeleton in which subgroup choice acts as a drop-in replacement for state space, tangent projection, and update map and a general linear-mixing extension in tangent space is reported.","s2_paper_id":"32ab012c58f6913f8efc0236422062e9e16bf867","topics":"[\"Architecture\", \"Benchmark\"]"},{"id":244,"run_id":1,"domain":"aiml","arxiv_id":"2602.18301","entry_id":"","title":"On the Semantic and Syntactic Information Encoded in Proto-Tokens for One-Step Text Reconstruction","authors":"[\"Ivan Bondarenko\", \"Egor Palkin\", \"Fedor Tikunov\"]","abstract":"Autoregressive large language models (LLMs) generate text token-by-token, requiring n forward passes to produce a sequence of length n. Recent work, Exploring the Latent Capacity of LLMs for One-Step Text Reconstruction (Mezentsev and Oseledets), shows that frozen LLMs can reconstruct hundreds of tokens from only two learned proto-tokens in a single forward pass, suggesting a path beyond the autoregressive paradigm. In this paper, we study what information these proto-tokens encode and how they ","published":"2026-02-20T15:54:10+00:00","categories":"[\"cs.LG\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18301v1","arxiv_url":"http://arxiv.org/abs/2602.18301v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Analyzes semantic vs syntactic information in proto-tokens for one-step text reconstruction from frozen LLMs. Shows m-token captures semantics while e-token less so; proposes relational distillation to transfer batch-level semantic relations without sacrificing reconstruction quality.","reasoning":"Novel analysis of proto-token representations with practical insights. No code/weights provided. Contributes to understanding non-autoregressive generation.","code_url":null,"s2_tldr":"Results indicate that the m-token tends to capture semantic information more strongly than the e-token under standard optimization; anchor-based constraints trade off sharply with reconstruction accuracy; and relational distillation can transfer batch-level semantic relations into the proto-token space without sacrificing reconstruction quality, supporting the feasibility of future non-autoregressive seq2seq systems that predict proto-tokens as an intermediate representation.","s2_paper_id":"4970009849ad60cff489d0a767c608ce3941cc2f","topics":"[\"Language Models\"]"},{"id":246,"run_id":1,"domain":"aiml","arxiv_id":"2602.18217","entry_id":"","title":"Information-Theoretic Storage Cost in Sentence Comprehension","authors":"[\"Kohei Kajikawa\", \"Shinnosuke Isono\", \"Ethan Gotlieb Wilcox\"]","abstract":"Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have been formalized, largely, using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. 
This study proposes a measure of processing storage cost based on an information-theoretic formalization, as the amount of info","published":"2026-02-20T13:55:56+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18217v1","arxiv_url":"http://arxiv.org/abs/2602.18217v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Proposes information-theoretic measure of sentence processing storage cost as information previous words carry about future context under uncertainty. Recovers known processing asymmetries, correlates with grammar-based costs, and predicts reading times in naturalistic datasets using neural LM estimates.","reasoning":"Novel theoretical contribution to psycholinguistics using neural LMs. No code/weights. Incremental validation on known phenomena limits immediate practical impact.","code_url":null,"s2_tldr":"","s2_paper_id":"6659a935090ca0d0b77b41515d871f9cb63cf00e","topics":"[]"},{"id":247,"run_id":1,"domain":"aiml","arxiv_id":"2602.18171","entry_id":"","title":"Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models","authors":"[\"Wojciech Michaluk\", \"Tymoteusz Urban\", \"Mateusz Kubita\", \"Soveatin Kuntur\", \"Anna Wroblewska\"]","abstract":"Clickbait headlines degrade the quality of online information and undermine user trust. We present a hybrid approach to clickbait detection that combines transformer-based text embeddings with linguistically motivated informativeness features. Using natural language processing techniques, we evaluate classical vectorizers, word embedding baselines, and large language model embeddings paired with tree-based classifiers. Our best-performing model, XGBoost over embeddings augmented with 15 explicit","published":"2026-02-20T12:16:08+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.18171v1","arxiv_url":"http://arxiv.org/abs/2602.18171v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":4.8,"summary":"Hybrid clickbait detection combining transformer embeddings with 15 linguistically motivated informativeness features. XGBoost achieves 91% F1-score, outperforming TF-IDF, Word2Vec, GloVe, and LLM prompting baselines with interpretable features highlighting attention cues.","reasoning":"Practical system but ensemble approach rather than novel method. Code/models promised but no weights on HF. 
Incremental improvement over baselines.","code_url":null,"s2_tldr":"This work presents a hybrid approach to clickbait detection that combines transformer-based text embeddings with linguistically motivated informativeness features, and evaluates classical vectorizers, word embedding baselines, and large language model embeddings paired with tree-based classifiers.","s2_paper_id":"f52bca98cc83c80c8c29d14a33a63fee2d880292","topics":"[\"Language Models\", \"Retrieval / RAG\", \"Architecture\"]"},{"id":248,"run_id":1,"domain":"aiml","arxiv_id":"2602.18152","entry_id":"","title":"The Statistical Signature of LLMs","authors":"[\"Ortal Hadad\", \"Edoardo Loru\", \"Jacopo Nudo\", \"Niccol\\u00f2 Di Marco\", \"Matteo Cinelli\", \"Walter Quattrociocchi\"]","abstract":"Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controll","published":"2026-02-20T11:33:37+00:00","categories":"[\"cs.CL\", \"cs.CY\", \"physics.soc-ph\"]","pdf_url":"https://arxiv.org/pdf/2602.18152v1","arxiv_url":"http://arxiv.org/abs/2602.18152v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Shows lossless compression provides model-agnostic measure differentiating LLM-generated from human text across three information ecosystems. LLM text exhibits higher structural regularity and compressibility, though separation attenuates in fragmented interaction environments at small scales.","reasoning":"Novel compression-based analysis revealing structural signatures of LLM generation. No code/weights. Interesting theoretical contribution but limited immediate practical applications.","code_url":null,"s2_tldr":"It is shown that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text and introduces a simple and robust framework for quantifying how generative systems reshape textual production.","s2_paper_id":"4f118e12b0737b2d071e0fc1865b87e30e0962fa","topics":"[\"Language Models\", \"Efficiency\"]"},{"id":249,"run_id":1,"domain":"aiml","arxiv_id":"2602.18008","entry_id":"","title":"NIMMGen: Learning Neural-Integrated Mechanistic Digital Twins with LLMs","authors":"[\"Zihan Guan\", \"Rituparna Datta\", \"Mengxuan Hu\", \"Shunshun Liu\", \"Aiying Zhang\", \"Prasanna Balachandran\", \"Sheng Li\", \"Anil Vullikanti\"]","abstract":"Mechanistic models encode scientific knowledge about dynamical systems and are widely used in downstream scientific and policy applications. Recent work has explored LLM-based agentic frameworks to automatically construct mechanistic models from data; however, existing problem settings substantially oversimplify real-world conditions, leaving it unclear whether LLM-generated mechanistic models are reliable in practice. 
To address this gap, we introduce the Neural-Integrated Mechanistic Modeling ","published":"2026-02-20T05:46:54+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18008v1","arxiv_url":"http://arxiv.org/abs/2602.18008v1","comment":"19 pages, 6 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Introduces NIMM evaluation framework for LLM-generated mechanistic models under realistic partial observation settings, and proposes NIMMgen agentic framework for neural-integrated mechanistic modeling. Demonstrates strong performance across three scientific domains with support for counterfactual intervention simulation.","reasoning":"Novel approach to mechanistic modeling with LLMs. Practical for scientific applications but limited immediate applicability. No code/weights available.","code_url":null,"s2_tldr":"The Neural-Integrated Mechanistic Modeling (NIMM) evaluation framework is introduced, which evaluates LLM-generated mechanistic models under realistic settings with partial observations and diversified task objectives, and designs NIMMgen, an agentic framework for neural-integrated mechanistic modeling that enhances code correctness and practical validity through iterative refinement.","s2_paper_id":"9e1cea2ce51ace11a15f9a2c4af2ebf47fca7293","topics":"[\"Language Models\", \"Agents\"]"},{"id":250,"run_id":1,"domain":"aiml","arxiv_id":"2602.17881","entry_id":"","title":"Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations","authors":"[\"Joschka Braun\"]","abstract":"Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at inference time. Although effective on average, steering effect sizes vary across samples and are unreliable for many target behaviors. In my thesis, I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data. First, I find that higher cosine similarity between training activation differences predicts more rel","published":"2026-02-19T22:37:05+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.17881v1","arxiv_url":"http://arxiv.org/abs/2602.17881v1","comment":"Master's Thesis, University of T\u00fcbingen. 89 pages, 34 figures. Portions of this work were published at the ICLR 2025 Workshop on Foundation Models in the Wild (see arXiv:2505.22637)","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Master's thesis investigating steering vector reliability in LLMs, finding that higher cosine similarity in training activations and better separation along steering direction predict more reliable steering. Shows steering vectors are unreliable when linear approximations fail to capture non-linear latent representations.","reasoning":"Solid diagnostic analysis of steering vector limitations with practical insights. Academic work with limited immediate tooling. 
Thesis format.","code_url":null,"s2_tldr":"This thesis investigates why steering reliability differs across behaviors and how it is impacted by steering vector training data and suggests that steering vectors are unreliable when the latent target behavior representation is not effectively approximated by the linear steering direction.","s2_paper_id":"f2211da63fd389a854897f86fe8769c9da2a4828","topics":"[\"Language Models\"]"},{"id":251,"run_id":1,"domain":"aiml","arxiv_id":"2602.17784","entry_id":"","title":"QueryPlot: Generating Geological Evidence Layers using Natural Language Queries for Mineral Exploration","authors":"[\"Meng Ye\", \"Xiao Lin\", \"Georgina Lukoczki\", \"Graham W. Lederer\", \"Yi Yao\"]","abstract":"Mineral prospectivity mapping requires synthesizing heterogeneous geological knowledge, including textual deposit models and geospatial datasets, to identify regions likely to host specific mineral deposit types. This process is traditionally manual and knowledge-intensive. We present QueryPlot, a semantic retrieval and mapping framework that integrates large-scale geological text corpora with geologic map data using modern Natural Language Processing techniques. We curate descriptive deposit mo","published":"2026-02-19T19:31:37+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.17784v1","arxiv_url":"http://arxiv.org/abs/2602.17784v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":4.8,"summary":"QueryPlot: semantic retrieval framework for mineral prospectivity mapping using NLP to integrate geological text with geospatial data. Enables natural language queries over 120 deposit types, achieving high recall of known tungsten skarn deposits and alignment with expert-defined tracts. Web-based system with GIS export.","reasoning":"Novel application of NLP to specialized domain with practical web system. Limited broader ML community interest due to domain specificity.","code_url":null,"s2_tldr":null,"s2_paper_id":"9a96235c78b1583e9967b3752ae9a70de0412da7","topics":"[\"Reasoning\", \"Retrieval / RAG\", \"Benchmark\"]"},{"id":254,"run_id":1,"domain":"aiml","arxiv_id":"2602.17327","entry_id":"","title":"WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval","authors":"[\"Michael Dinzinger\", \"Laura Caspari\", \"Ali Salman\", \"Irvin Topi\", \"Jelena Mitrovi\\u0107\", \"Michael Granitzer\"]","abstract":"We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages. Compared to the previous version, it significantly expands multilingual coverage and the number of bilingual aligned QA pairs to over 14.3M, making it the largest FAQ-based resource. 
Unlike the original release, WebFAQ 2.0 uses a novel data collection strategy that directly crawls and extracts relevant web content, resulting in a substantially more di","published":"2026-02-19T12:45:58+00:00","categories":"[\"cs.IR\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17327v1","arxiv_url":"http://arxiv.org/abs/2602.17327v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":8.0,"composite":4.8,"summary":"WebFAQ 2.0 provides a massive multilingual FAQ dataset (198M QA pairs, 108 languages) with 1.25M hard negatives across 20 languages for training dense retrievers. Includes cross-encoder scores and training scripts for contrastive learning and knowledge distillation, directly supporting practical IR applications.","reasoning":"High code_and_weights due to datasets on HuggingFace and GitHub. Moderate novelty as it's primarily a dataset expansion with standard hard negative mining. Strong practical applicability for multilingual retrieval practitioners.","code_url":null,"s2_tldr":"A new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages, is introduced, significantly expands multilingual coverage and the number of bilingual aligned QA pairs to over 14.3M, making it the largest FAQ-based resource.","s2_paper_id":"94f4145553e71f9d9015bd13da104591f6fd6eb8","topics":"[\"Benchmark\", \"Retrieval / RAG\"]"},{"id":255,"run_id":1,"domain":"aiml","arxiv_id":"2602.17287","entry_id":"","title":"Representation Collapse in Machine Translation Through the Lens of Angular Dispersion","authors":"[\"Evgeniia Tokarchuk\", \"Maya K. Nachesa\", \"Sergey Troshin\", \"Vlad Niculae\"]","abstract":"Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets. A standard next-token prediction training strategy, while widely adopted in practice, may lead to overlooked artifacts such as representation collapse. Previous works have shown that this problem is especially pronounced in the representation of the deeper Transformer layers, where it often fails to efficiently utilize the geometric spa","published":"2026-02-19T11:46:38+00:00","categories":"[\"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.17287v1","arxiv_url":"http://arxiv.org/abs/2602.17287v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"Analyzes representation collapse in neural MT through angular dispersion, showing deeper layers fail to utilize geometric space efficiently. Proposes gap-based initialization and angular dispersion regularization that improves translation quality while mitigating collapse, with benefits preserved after quantization.","reasoning":"Low code_and_weights (no code/weights mentioned). Moderate novelty in connecting angular dispersion to representation collapse. 
Decent practical applicability for MT practitioners concerned with model efficiency.","code_url":null,"s2_tldr":"This work incorporates an existing regularization method based on angular dispersion and demonstrates empirically that it not only mitigates collapse but also improves translation quality, and shows that quantized models exhibit similar collapse behavior and that the benefits of regularization are preserved even after quantization.","s2_paper_id":"517ca8e8358b70f37de09d9e17a930f1e62eea8e","topics":"[\"Efficiency\", \"Architecture\", \"Benchmark\"]"},{"id":256,"run_id":1,"domain":"aiml","arxiv_id":"2602.17262","entry_id":"","title":"Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study","authors":"[\"Kensuke Okada\", \"Yui Furukawa\", \"Kyosuke Bunji\"]","abstract":"Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers-a form of socially desirable responding (SDR)-biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-b","published":"2026-02-19T11:07:24+00:00","categories":"[\"cs.CL\", \"stat.ME\"]","pdf_url":"https://arxiv.org/pdf/2602.17262v1","arxiv_url":"http://arxiv.org/abs/2602.17262v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":5.0,"composite":4.8,"summary":"Applies Thematic Apperception Test (TAT) and psychometric theory to quantify socially desirable responding (SDR) in LLMs. Proposes graded forced-choice inventory that substantially attenuates SDR while maintaining persona profile recovery, revealing model-dependent SDR-recovery trade-offs.","reasoning":"Low code_and_weights (no code/weights mentioned). Strong novelty in applying clinical psychology frameworks to LLM evaluation. Moderate practical applicability primarily for bias auditing and personality assessment contexts.","code_url":null,"s2_tldr":"","s2_paper_id":"d42cfed8d6b68a4a7a1606142d171b9c3bbd474f","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":257,"run_id":1,"domain":"aiml","arxiv_id":"2602.17072","entry_id":"","title":"BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios","authors":"[\"Yunseung Lee\", \"Subin Kim\", \"Youngjun Kwak\", \"Jaegul Choo\"]","abstract":"Large language models (LLMs)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations-including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. 
Such tasks require multi-step numerical reasoning and context","published":"2026-02-19T04:27:47+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17072v1","arxiv_url":"http://arxiv.org/abs/2602.17072v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":7.0,"composite":4.8,"summary":"Introduces BankMathBench, domain-specific benchmark for numerical reasoning in banking scenarios with three difficulty levels. Shows tool-augmented fine-tuning achieves 57-75% accuracy improvements over zero-shot baselines, addressing practical limitations in financial chatbots.","reasoning":"Low code_and_weights (no code/weights mentioned). Moderate novelty in domain-specific benchmark design. Strong practical applicability for financial services deploying LLM chatbots.","code_url":null,"s2_tldr":"BankMathBench is proposed, a domain-specific dataset that reflects realistic banking tasks and exhibited notable improvements in both formula generation and numerical reasoning accuracy when trained on open-source LLMs, demonstrating the dataset's effectiveness in enhancing domain-specific reasoning.","s2_paper_id":"ddf2df282df5abd6db07881aca2f516d041820a2","topics":"[\"Reasoning\", \"Benchmark\", \"Language Models\"]"},{"id":258,"run_id":1,"domain":"aiml","arxiv_id":"2602.17054","entry_id":"","title":"ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning","authors":"[\"Hussein S. Al-Olimat\", \"Ahmad Alshareef\"]","abstract":"While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic & Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks. While broad-coverage benchmarks prioritize scale and multi-task coverage, ALPS targets the depth of linguistic understanding through 53","published":"2026-02-19T03:51:37+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.17054v1","arxiv_url":"http://arxiv.org/abs/2602.17054v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":6.0,"composite":4.8,"summary":"ALPS provides 531 expert-curated Arabic linguistic challenges across 15 tasks testing deep semantics and pragmatics. Evaluating 23 models reveals 36.5% error rates on morpho-syntactic dependencies despite high fluency, with substantial gaps between commercial and Arabic-native models.","reasoning":"Low code_and_weights (no code/weights mentioned). Moderate novelty in expert-curated linguistic diagnostics. 
Good practical applicability for Arabic NLP development and evaluation.","code_url":null,"s2_tldr":"This work introduces ALPS (Arabic Linguistic&Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks.","s2_paper_id":"cadab8d70d050d4aa2ad0b7bd5ff81169d451d70","topics":"[\"Reasoning\", \"Benchmark\"]"},{"id":409,"run_id":1,"domain":"aiml","arxiv_id":"2602.19314","entry_id":"","title":"IPv2: An Improved Image Purification Strategy for Real-World Ultra-Low-Dose Lung CT Denoising","authors":"[\"Guoliang Gong\", \"Man Yu\"]","abstract":"The image purification strategy constructs an intermediate distribution with aligned anatomical structures, which effectively corrects the spatial misalignment between real-world ultra-low-dose CT and normal-dose CT images and significantly enhances the structural preservation ability of denoising models. However, this strategy exhibits two inherent limitations. First, it suppresses noise only in the chest wall and bone regions while leaving the image background untreated. Second, it lacks a ded","published":"2026-02-22T19:28:31+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19314v1","arxiv_url":"http://arxiv.org/abs/2602.19314v1","comment":"","source":"arxiv","github_repo":"https://github.com/MonkeyDadLufy/Image-Purification-Strategy-v2","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":7.0,"score_axis_2":4.0,"score_axis_3":3.0,"composite":4.55,"summary":"IPv2 improves ultra-low-dose CT denoising by addressing limitations in the original image purification strategy. The method adds three modules (Remove Background, Add noise, Remove noise) to better denoise background and lung parenchyma regions while maintaining anatomical structure alignment.","reasoning":"Medical imaging application with incremental improvements to existing method. Code available but no model weights. Limited to specific medical domain.","code_url":"https://github.com/MonkeyDadLufy/Image-Purification-Strategy-v2","s2_tldr":"This work systematically redesigns the original image purification strategy and proposes an improved version termed IPv2, which consistently improves background suppression and lung parenchyma restoration across multiple mainstream denoising models.","s2_paper_id":"008fc1085c7f7a8cb1ef299969e9c25ae42324e9","topics":"[\"Training\"]"},{"id":259,"run_id":1,"domain":"aiml","arxiv_id":"2602.21064","entry_id":"","title":"Motivation is Something You Need","authors":"[\"Mehdi Acheli\", \"Walid Gaaloul\"]","abstract":"This work introduces a novel training paradigm that draws from affective neuroscience. Inspired by the interplay of emotions and cognition in the human brain and more specifically the SEEKING motivational state, we design a dual-model framework where a smaller base model is trained continuously, while a larger motivated model is activated intermittently during predefined \"motivation conditions\". 
The framework mimics the emotional state of high curiosity and anticipation of reward in which broade","published":"2026-02-24T16:26:52+00:00","categories":"[\"cs.AI\", \"cs.CV\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.21064v1","arxiv_url":"http://arxiv.org/abs/2602.21064v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"Proposes dual-model training framework inspired by affective neuroscience where a larger 'motivated' model is activated intermittently during high-curiosity conditions. Uses shared weight updates between small base model and larger extension, achieving competitive performance with lower training costs on image classification.","reasoning":"Interesting neuroscience-inspired training paradigm. Limited practical validation (only image classification). No code/weights provided, unclear broader applicability.","code_url":null,"s2_tldr":"Empirical evaluation on the image classification task demonstrates that, not only does the alternating training scheme efficiently and effectively enhance the base model compared to a traditional scheme, in some cases, the motivational model also surpasses its standalone counterpart despite seeing less data per epoch.","s2_paper_id":"6b8ac42e6ab9092bde4be887734b13b89c28d144","topics":"[\"RL\"]"},{"id":260,"run_id":1,"domain":"aiml","arxiv_id":"2602.20972","entry_id":"","title":"Are Multimodal Large Language Models Good Annotators for Image Tagging?","authors":"[\"Ming-Kun Xie\", \"Jia-Hao Xiao\", \"Zhiqiang Kou\", \"Zhongnian Li\", \"Gang Niu\", \"Masashi Sugiyama\"]","abstract":"Image tagging, a fundamental vision task, traditionally relies on human-annotated datasets to train multi-label classifiers, which incurs significant labor and costs. While Multimodal Large Language Models (MLLMs) offer promising potential to automate annotation, their capability to replace human annotators remains underexplored. This paper aims to analyze the gap between MLLM-generated and human annotations and to propose an effective solution that enables MLLM-based annotation to replace manua","published":"2026-02-24T14:53:16+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20972v1","arxiv_url":"http://arxiv.org/abs/2602.20972v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Analyzes MLLMs as automated image annotators for tagging tasks, showing they achieve 50-80% of human annotation quality at 1/1000th the cost while reaching 90%+ performance on downstream training. Proposes TagLLM framework with structured group-wise prompting and interactive label disambiguation to narrow the gap.","reasoning":"Useful analysis of MLLM annotation capabilities with practical cost-benefit insights. Incremental method contributions. 
No code/weights mentioned.","code_url":null,"s2_tldr":"TagLLM is proposed, a novel framework for image tagging that substantially narrows the gap between MLLM-generated and human annotations, especially in downstream training performance, where it closes about 60% to 80% of the difference.","s2_paper_id":"c29bb177deb4269dfaf412017b06852e2f5cbed6","topics":"[\"Language Models\", \"Multimodal\", \"Benchmark\"]"},{"id":261,"run_id":1,"domain":"aiml","arxiv_id":"2602.20947","entry_id":"","title":"Estimation of Confidence Bounds in Binary Classification using Wilson Score Kernel Density Estimation","authors":"[\"Thorbj\\u00f8rn Mosekj\\u00e6r Iversen\", \"Zebin Duan\", \"Frederik Hagelskj\\u00e6r\"]","abstract":"The performance and ease of use of deep learning-based binary classifiers have improved significantly in recent years. This has opened up the potential for automating critical inspection tasks, which have traditionally only been trusted to be done manually. However, the application of binary classifiers in critical operations depends on the estimation of reliable confidence bounds such that system performance can be ensured up to a given statistical significance. We present Wilson Score Kernel D","published":"2026-02-24T14:31:28+00:00","categories":"[\"cs.LG\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20947v1","arxiv_url":"http://arxiv.org/abs/2602.20947v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Wilson Score Kernel Density Classification provides novel kernel-based method for estimating confidence bounds in binary classification. Shows comparable performance to Gaussian Process Classification but with lower computational complexity. Evaluated on selective classification tasks across four datasets including vision foundation models.","reasoning":"Theoretical contribution with limited practical novelty. No code/weights. Narrow application to binary classification confidence estimation.","code_url":null,"s2_tldr":"This work presents Wilson Score Kernel Density Classification, which is a novel kernel-based method for estimating confidence bounds in binary classification, and shows similar performance to Gaussian Process Classification, but at a lower computational complexity.","s2_paper_id":"f3cc02e8942f8444c2f1a6f739e7c06f930f4f90","topics":"[]"},{"id":262,"run_id":1,"domain":"aiml","arxiv_id":"2602.20930","entry_id":"","title":"Computing a Characteristic Orientation for Rotation-Independent Image Analysis","authors":"[\"Cristian Valero-Abundio\", \"Emilio Sansano-Sansano\", \"Ra\\u00fal Montoliu\", \"Marina Mart\\u00ednez Garc\\u00eda\"]","abstract":"Handling geometric transformations, particularly rotations, remains a challenge in deep learning for computer vision. Standard neural networks lack inherent rotation invariance and typically rely on data augmentation or architectural modifications to improve robustness. Although effective, these approaches increase computational demands, require specialised implementations, or alter network structures, limiting their applicability. 
This paper introduces General Intensity Direction (GID), a prepr","published":"2026-02-24T14:08:12+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20930v1","arxiv_url":"http://arxiv.org/abs/2602.20930v1","comment":"Accepted for publication at the 21st International Conference on Computer Vision Theory and Applications (VISAPP 2026). 8 pages","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"General Intensity Direction (GID) introduces preprocessing method for rotation robustness by estimating global image orientation and aligning to canonical reference. Achieves higher accuracy than rotation-invariant architectures on rotated MNIST without modifying network architecture, maintaining compatibility with standard CNNs.","reasoning":"Limited novelty as preprocessing-based rotation handling. No code/weights. Narrow scope evaluated mainly on MNIST.","code_url":null,"s2_tldr":"General Intensity Direction is introduced, a preprocessing method that improves rotation robustness without modifying the network architecture, allowing standard models to process inputs more consistently across different rotations, making it compatible with convolutional networks.","s2_paper_id":"52274503c4a75a9ad7c54930b678b7bca7c2743b","topics":"[]"},{"id":263,"run_id":1,"domain":"aiml","arxiv_id":"2602.20845","entry_id":"","title":"FLIM Networks with Bag of Feature Points","authors":"[\"Jo\\u00e3o Deltregia Martinelli\", \"Marcelo Luis Rodrigues Filho\", \"Felipe Crispim da Rocha Salvagnini\", \"Gilson Junior Soares\", \"Jefersson A. dos Santos\", \"Alexandre X. Falc\\u00e3o\"]","abstract":"Convolutional networks require extensive image annotation, which can be costly and time-consuming. Feature Learning from Image Markers (FLIM) tackles this challenge by estimating encoder filters (i.e., kernel weights) from user-drawn markers on discriminative regions of a few representative images without traditional optimization. Such an encoder combined with an adaptive decoder comprises a FLIM network fully trained without backpropagation. Prior research has demonstrated their effectiveness i","published":"2026-02-24T12:36:22+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20845v1","arxiv_url":"http://arxiv.org/abs/2602.20845v1","comment":"Accepted at the 28th Iberoamerican Congress on Pattern Recognition (CIARP 2025). To appear in Lecture Notes in Computer Science (LNCS), Springer","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"FLIM-BoFP accelerates Feature Learning from Image Markers by performing single clustering at input to create bag of feature points, replacing multi-block patch clustering. Trains networks without backpropagation from user markers, demonstrating efficiency improvements for salient object detection and parasite detection in microscopy.","reasoning":"Efficiency improvement over prior FLIM method but limited novelty. No code/weights. 
Narrow application scope to marker-based training.","code_url":null,"s2_tldr":"This study revisits FLIM SOD and introduces FLIM-Bag of Feature Points (FLIM-BoFP), a considerably faster filter estimation method compared to FLIM-Cluster and other state-of-the-art baselines for parasite detection in optical microscopy images.","s2_paper_id":"2ebf6f0ebac49540f864e56d2d90efdd7eef029d","topics":"[\"Optimization\"]"},{"id":265,"run_id":1,"domain":"aiml","arxiv_id":"2602.20790","entry_id":"","title":"Real-time Motion Segmentation with Event-based Normal Flow","authors":"[\"Sheng Zhong\", \"Zhongyang Ren\", \"Xiya Zhu\", \"Dehao Yuan\", \"Cornelia Fermuller\", \"Yi Zhou\"]","abstract":"Event-based cameras are bio-inspired sensors with pixels that independently and asynchronously respond to brightness changes at microsecond resolution, offering the potential to handle visual tasks in challenging scenarios. However, due to the sparse information content in individual events, directly processing the raw event data to solve vision tasks is highly inefficient, which severely limits the applicability of state-of-the-art methods in real-time tasks, such as motion segmentation, a fund","published":"2026-02-24T11:29:07+00:00","categories":"[\"cs.CV\", \"cs.RO\"]","pdf_url":"https://arxiv.org/pdf/2602.20790v1","arxiv_url":"http://arxiv.org/abs/2602.20790v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Proposes a real-time motion segmentation framework for event cameras using normal flow as intermediate representation. Achieves 800x speedup over SOTA through efficient graph-cut optimization and normal flow-based motion model initialization.","reasoning":"No code/weights available. Incremental efficiency improvement over existing event-based methods. Limited to specialized hardware (event cameras).","code_url":null,"s2_tldr":"This work proposes a normal flow-based motion segmentation framework for event-based vision that significantly reduces the computational complexity and ensures real-time performance, achieving nearly an 800x speedup in comparison to the open-source state-of-the-art method.","s2_paper_id":"b6a5a92f517f513eb769d48c619a2d5c526e7e0f","topics":"[\"3D / Vision\", \"Efficiency\"]"},{"id":266,"run_id":1,"domain":"aiml","arxiv_id":"2602.20721","entry_id":"","title":"CleanStyle: Plug-and-Play Style Conditioning Purification for Text-to-Image Stylization","authors":"[\"Xiaoman Feng\", \"Mingkun Lei\", \"Yang Wang\", \"Dingwen Fu\", \"Chi Zhang\"]","abstract":"Style transfer in diffusion models enables controllable visual generation by injecting the style of a reference image. However, recent encoder-based methods, while efficient and tuning-free, often suffer from content leakage, where semantic elements from the style image undesirably appear in the output, impairing prompt fidelity and stylistic consistency. 
In this work, we introduce CleanStyle, a plug-and-play framework that filters out content-related noise from the style embedding without retra","published":"2026-02-24T09:33:05+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20721v1","arxiv_url":"http://arxiv.org/abs/2602.20721v1","comment":"26 pages","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"CleanStyle addresses content leakage in style transfer by filtering tail components of style embeddings via SVD. Introduces style-specific classifier-free guidance for improved prompt fidelity, functioning as a plug-and-play module without retraining.","reasoning":"No code/weights available. Incremental improvement to existing encoder-based style transfer methods. Limited novelty (SVD-based filtering is conventional).","code_url":null,"s2_tldr":"This work introduces CleanStyle, a plug-and-play framework that filters out content-related noise from the style embedding without retraining and can be seamlessly integrated into existing encoder-based diffusion models.","s2_paper_id":"9964fda118778076e179c686b02ba09fd2349923","topics":"[\"Image Generation\", \"Efficiency\", \"Retrieval / RAG\"]"},{"id":267,"run_id":1,"domain":"aiml","arxiv_id":"2602.20664","entry_id":"","title":"AnimeAgent: Is the Multi-Agent via Image-to-Video models a Good Disney Storytelling Artist?","authors":"[\"Hailong Yan\", \"Shice Liu\", \"Tao Wang\", \"Xiangtao Zhang\", \"Yijie Zhong\", \"Jinwei Chen\", \"Le Zhang\", \"Bo Li\"]","abstract":"Custom Storyboard Generation (CSG) aims to produce high-quality, multi-character consistent storytelling. Current approaches based on static diffusion models, whether used in a one-shot manner or within multi-agent frameworks, face three key limitations: (1) Static models lack dynamic expressiveness and often resort to \"copy-paste\" pattern. (2) One-shot inference cannot iteratively correct missing attributes or poor prompt adherence. (3) Multi-agents rely on non-robust evaluators, ill-suited for","published":"2026-02-24T08:14:24+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20664v1","arxiv_url":"http://arxiv.org/abs/2602.20664v1","comment":"Tech Report","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"AnimeAgent proposes first Image-to-Video multi-agent framework for custom storyboard generation, addressing consistency and expressiveness limitations of static diffusion models. Leverages I2V motion priors and mixed subjective-objective evaluation for iterative refinement.","reasoning":"No code/weights. Domain-specific (animation/storytelling). 
Incremental application of I2V models to storyboarding with engineering contributions.","code_url":null,"s2_tldr":"Inspired by Disney's \"Combination of Straight Ahead and Pose to Pose\" workflow, AnimeAgent leverages I2V's implicit motion prior to enhance consistency and expressiveness, while a mixed subjective-objective reviewer enables reliable iterative refinement.","s2_paper_id":"44dd611e315d9d5f7087ca5dd3898e99dee249bc","topics":"[\"Agents\", \"Benchmark\"]"},{"id":268,"run_id":1,"domain":"aiml","arxiv_id":"2602.20583","entry_id":"","title":"PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models","authors":"[\"Wonyong Seo\", \"Jaeho Moon\", \"Jaehyup Lee\", \"Soo Ye Kim\", \"Munchurl Kim\"]","abstract":"Propagation-based video editing enables precise user control by propagating a single edited frame into following frames while maintaining the original context such as motion and structures. However, training such models requires large-scale, paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose the PropFly, a training pipeline for Propagation-based video editing, relying on on-the-Fly supervision from pre-trained video diffusion models (VDMs) inste","published":"2026-02-24T06:11:08+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20583v1","arxiv_url":"http://arxiv.org/abs/2602.20583v1","comment":"The first two authors contributed equally to this work (equal contribution)","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"PropFly enables propagation-based video editing without paired datasets by synthesizing source/edited pairs on-the-fly from pretrained VDMs using varying CFG scales. Learns propagation via Guidance-Modulated Flow Matching loss for temporally consistent editing.","reasoning":"No code/weights provided. Novel training strategy (on-the-fly supervision from VDMs) but incremental application to video editing. Training approach could be useful for practitioners.","code_url":null,"s2_tldr":"The PropFly is proposed, a training pipeline for Propagation-based video editing, relying on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets.","s2_paper_id":"7e060d694d6612f8ff6abb5464089feb36a321fd","topics":"[\"Video Generation\", \"Benchmark\"]"},{"id":269,"run_id":1,"domain":"aiml","arxiv_id":"2602.20520","entry_id":"","title":"How Do Inpainting Artifacts Propagate to Language?","authors":"[\"Pratham Yashwante\", \"Davit Abrahamyan\", \"Shresth Grover\", \"Sukruth Rao\"]","abstract":"We study how visual artifacts introduced by diffusion-based inpainting affect language generation in vision-language models. We use a two-stage diagnostic setup in which masked image regions are reconstructed and then provided to captioning models, enabling controlled comparisons between captions generated from original and reconstructed inputs. Across multiple datasets, we analyze the relationship between reconstruction fidelity and downstream caption quality. 
We observe consistent associations","published":"2026-02-24T03:46:33+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20520v1","arxiv_url":"http://arxiv.org/abs/2602.20520v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Studies how diffusion-based inpainting artifacts propagate to vision-language model captions through controlled two-stage experiments. Shows consistent associations between reconstruction quality metrics and caption quality, with layer-dependent attention pattern changes.","reasoning":"No code/weights provided. Diagnostic study rather than method contribution. Useful insights for multimodal systems but limited direct applicability.","code_url":null,"s2_tldr":"A two-stage diagnostic setup is used in which masked image regions are reconstructed and then provided to captioning models, enabling controlled comparisons between captions generated from original and reconstructed inputs.","s2_paper_id":"c0f75468b12eaf1ed48a3acf906d72149ff051db","topics":"[\"Language Models\", \"Multimodal\", \"Benchmark\"]"},{"id":270,"run_id":1,"domain":"aiml","arxiv_id":"2602.20363","entry_id":"","title":"Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field","authors":"[\"Sheyang Tang\", \"Armin Shafiee Sarvestani\", \"Jialu Xu\", \"Xiaoyu Xu\", \"Zhou Wang\"]","abstract":"The aesthetic quality of a scene depends strongly on camera viewpoint. Existing approaches for aesthetic viewpoint suggestion are either single-view adjustments, predicting limited camera adjustments from a single image without understanding scene geometry, or 3D exploration approaches, which rely on dense captures or prebuilt 3D environments coupled with costly reinforcement learning (RL) searches. In this work, we introduce the notion of 3D aesthetic field that enables geometry-grounded aesthe","published":"2026-02-23T21:08:23+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20363v1","arxiv_url":"http://arxiv.org/abs/2602.20363v1","comment":"14 pages, 10 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"This paper introduces 3D aesthetic fields for camera viewpoint suggestion using 3D Gaussian Splatting to distill 2D aesthetic models into 3D space. The approach enables efficient viewpoint search without dense RL exploration, requiring only sparse input views for geometry-grounded aesthetic reasoning.","reasoning":"No code or weights available. Novel 3D aesthetic field concept but limited immediate usability without implementations. Practical for content creation if released.","code_url":null,"s2_tldr":"This work introduces the notion of 3D aesthetic field that enables geometry-grounded aesthetic reasoning in 3D with sparse captures, allowing efficient viewpoint suggestions in contrast to costly RL searches and proposes a two-stage search pipeline that combines coarse viewpoint sampling with gradient-based refinement.","s2_paper_id":"5ad106c3e878a25eb7afd2aa12d2a82fb47d440f","topics":"[\"3D / Vision\", \"Training\", \"RL\"]"},{"id":271,"run_id":1,"domain":"aiml","arxiv_id":"2602.20351","entry_id":"","title":"BiRQA: Bidirectional Robust Quality Assessment for Images","authors":"[\"Aleksandr Gushchin\", \"Dmitriy S. 
Vatolin\", \"Anastasia Antsiferova\"]","abstract":"Full-Reference image quality assessment (FR IQA) is important for image compression, restoration and generative modeling, yet current neural metrics remain slow and vulnerable to adversarial perturbations. We present BiRQA, a compact FR IQA metric model that processes four fast complementary features within a bidirectional multiscale pyramid. A bottom-up attention module injects fine-scale cues into coarse levels through an uncertainty-aware gate, while a top-down cross-gating block routes seman","published":"2026-02-23T20:52:56+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20351v1","arxiv_url":"http://arxiv.org/abs/2602.20351v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"BiRQA is a compact full-reference image quality metric using bidirectional multiscale pyramids with fast complementary features. It achieves state-of-the-art performance 3x faster than previous methods and demonstrates strong adversarial robustness through Anchored Adversarial Training, lifting SROCC from 0.30-0.57 to 0.60-0.84 under attacks.","reasoning":"No code or weights available. Incremental improvement in IQA metrics with efficiency gains. Practical for compression/restoration but not paradigm-shifting.","code_url":null,"s2_tldr":"BiRQA is the only FR IQA model combining competitive accuracy with real-time throughput and strong adversarial resilience, and to the authors' knowledge is the only FR IQA model combining competitive accuracy with real-time throughput and strong adversarial resilience.","s2_paper_id":"28b2b4aa00309b7e7a1256bb3e74253e066c13ce","topics":"[\"Efficiency\"]"},{"id":272,"run_id":1,"domain":"aiml","arxiv_id":"2602.20330","entry_id":"","title":"Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking","authors":"[\"Jingcheng Yang\", \"Tianhu Xiong\", \"Shengyi Qian\", \"Klara Nahrstedt\", \"Mingyuan Wu\"]","abstract":"Vision-language models (VLMs) are powerful but remain opaque black boxes. We introduce the first framework for transparent circuit tracing in VLMs to systematically analyze multimodal reasoning. By utilizing transcoders, attribution graphs, and attention-based methods, we uncover how VLMs hierarchically integrate visual and semantic concepts. We reveal that distinct visual feature circuits can handle mathematical reasoning and support cross-modal associations. Validated through feature steering ","published":"2026-02-23T20:26:45+00:00","categories":"[\"cs.CV\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.20330v1","arxiv_url":"http://arxiv.org/abs/2602.20330v1","comment":"To appear in the Findings of CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"This work introduces the first circuit tracing framework for vision-language models, using transcoders and attribution graphs to analyze how VLMs integrate visual and semantic concepts. The method reveals distinct circuits for mathematical reasoning and cross-modal associations, validated through feature steering and circuit patching.","reasoning":"No code released. Novel interpretability approach for VLMs but primarily research-oriented. 
Limited immediate practical applicability without tools.","code_url":null,"s2_tldr":"This work introduces the first framework for transparent circuit tracing in VLMs to systematically analyze multimodal reasoning and reveals that distinct visual feature circuits can handle mathematical reasoning and support cross-modal associations.","s2_paper_id":"a3f616abde4bad6389405fbd2d9eba9ba23e9f65","topics":"[\"Language Models\", \"Multimodal\", \"Reasoning\"]"},{"id":273,"run_id":1,"domain":"aiml","arxiv_id":"2602.20291","entry_id":"","title":"De-rendering, Reasoning, and Repairing Charts with Vision-Language Models","authors":"[\"Valentin Bonas\", \"Martin Sinnona\", \"Viviana Siless\", \"Emmanuel Iarussi\"]","abstract":"Data visualizations are central to scientific communication, journalism, and everyday decision-making, yet they are frequently prone to errors that can distort interpretation or mislead audiences. Rule-based visualization linters can flag violations, but they miss context and do not suggest meaningful design changes. Directly querying general-purpose LLMs about visualization quality is unreliable: lacking training to follow visualization design principles, they often produce inconsistent or inco","published":"2026-02-23T19:16:27+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20291v1","arxiv_url":"http://arxiv.org/abs/2602.20291v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"This work presents a framework combining chart de-rendering, automated analysis, and iterative improvement using VLMs to provide actionable visualization design feedback. Evaluated on 1,000 charts, the system generated 10,452 recommendations across 10 coherent categories like axis formatting and color accessibility.","reasoning":"No code available. Useful application but primarily engineering existing VLM capabilities. Practical for visualization tools but not technically novel.","code_url":null,"s2_tldr":"This work introduces a framework that combines chart de-rendering, automated analysis, and iterative improvement to deliver actionable, interpretable feedback on visualization design, and highlights the promise of LLM-driven recommendation systems for delivering structured, principle-based feedback on visualization design.","s2_paper_id":"9f5da3be05a2b6835856273f20c62ffe3785537d","topics":"[\"Language Models\", \"Multimodal\", \"Reasoning\"]"},{"id":274,"run_id":1,"domain":"aiml","arxiv_id":"2602.20114","entry_id":"","title":"Benchmarking Unlearning for Vision Transformers","authors":"[\"Kairan Zhao\", \"Iurie Luca\", \"Peter Triantafillou\"]","abstract":"Research in machine unlearning (MU) has gained strong momentum: MU is now widely regarded as a critical capability for building safe and fair AI. In parallel, research into transformer architectures for computer vision tasks has been highly successful: Increasingly, Vision Transformers (VTs) emerge as strong alternatives to CNNs. Yet, MU research for vision tasks has largely centered on CNNs, not VTs. 
While benchmarking MU efforts have addressed LLMs, diffusion models, and CNNs, none exist for V","published":"2026-02-23T18:33:16+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20114v1","arxiv_url":"http://arxiv.org/abs/2602.20114v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"First comprehensive benchmarking of machine unlearning algorithms on Vision Transformers across different families (ViT, Swin-T) and capacities. The work characterizes VT memorization relative to CNNs, evaluates algorithms leveraging memorization proxies, and establishes baseline performance for future comparisons.","reasoning":"No code available. Important benchmarking work but primarily evaluation-focused with incremental contributions. Useful reference for unlearning research community.","code_url":null,"s2_tldr":"This work sheds light on how well existing algorithms work in VT settings, establishing a promising reference performance baseline and offering a benchmarking basis, enabling reproducible, fair, and comprehensive comparisons of existing (and future) MU algorithms on VTs.","s2_paper_id":"11eb86ebe34237afee50fb50908b2ed9485dc54c","topics":"[\"Architecture\", \"Benchmark\"]"},{"id":275,"run_id":1,"domain":"aiml","arxiv_id":"2602.20055","entry_id":"","title":"To Move or Not to Move: Constraint-based Planning Enables Zero-Shot Generalization for Interactive Navigation","authors":"[\"Apoorva Vashisth\", \"Manav Kulshrestha\", \"Pranav Bakshi\", \"Damon Conover\", \"Guillaume Sartoretti\", \"Aniket Bera\"]","abstract":"Visual navigation typically assumes the existence of at least one obstacle-free path between start and goal, which must be discovered/planned by the robot. However, in real-world scenarios, such as home environments and warehouses, clutter can block all routes. Targeted at such cases, we introduce the Lifelong Interactive Navigation problem, where a mobile robot with manipulation abilities can move clutter to forge its own path to complete sequential object-placement tasks - each involving plac","published":"2026-02-23T17:10:00+00:00","categories":"[\"cs.RO\", \"cs.AI\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20055v1","arxiv_url":"http://arxiv.org/abs/2602.20055v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"An LLM-driven constraint-based planning framework for interactive navigation that enables mobile robots to move clutter and forge paths in blocked environments. 
The system uses structured scene graphs for reasoning and active perception, outperforming baselines in ProcTHOR-10k simulator.","reasoning":"No code/weights available; robotics application with moderate novelty; limited to specific interactive navigation scenarios.","code_url":null,"s2_tldr":"An LLM-driven, constraint-based planning framework with active perception that allows the LLM to reason over a structured scene graph of discovered objects and obstacles, deciding which object to move, where to place it, and where to look next to discover task-relevant information is proposed.","s2_paper_id":"9d19dab8737504e055cd65334ce2e9eb2ca61cef","topics":"[\"Robotics\", \"Agents\"]"},{"id":276,"run_id":1,"domain":"aiml","arxiv_id":"2602.19768","entry_id":"","title":"TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding","authors":"[\"Fan Yang\", \"Shurong Zheng\", \"Hongyin Zhao\", \"Yufei Zhan\", \"Xin Li\", \"Yousong Zhu\", \"Chaoyang Zhao\", \"Ming Tang\", \"Jinqiao Wang\"]","abstract":"Recent Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in image understanding and natural language generation. However, current approaches focus predominantly on global image understanding, struggling to simulate human visual attention trajectories and explain associations between descriptions and specific regions. We propose TraceVision, a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a T","published":"2026-02-23T12:18:26+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19768v2","arxiv_url":"http://arxiv.org/abs/2602.19768v2","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"TraceVision unifies trajectory-aware spatial understanding in VLMs through end-to-end framework with TVP module for visual-trajectory fusion. Introduces RILN dataset and achieves SOTA on trajectory-guided tasks including captioning, prediction, and segmentation.","reasoning":"Novel architecture integrating trajectory information into VLMs. No code/weights despite strong practical potential.","code_url":null,"s2_tldr":"This work proposes TraceVision, a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework and constructs the Reasoning-based Interactive Localized Narratives (RILN) dataset to enhance logical reasoning and interpretability.","s2_paper_id":"ebf4128f6f69a61cee63ecf8c4151a66b00bdf78","topics":"[\"Language Models\", \"Multimodal\"]"},{"id":277,"run_id":1,"domain":"aiml","arxiv_id":"2602.19766","entry_id":"","title":"One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image","authors":"[\"Pengfei Wang\", \"Liyi Chen\", \"Zhiyuan Ma\", \"Yanjun Guo\", \"Guowen Zhang\", \"Lei Zhang\"]","abstract":"Generating explorable 3D scenes from a single image is a highly challenging problem in 3D vision. Existing methods struggle to support free exploration, often producing severe geometric distortions and noisy artifacts when the viewpoint moves far from the original perspective. We introduce One2Scene, an effective framework that decomposes this ill-posed problem into three tractable sub-tasks to enable immersive explorable scene generation. 
We first use a panorama generator to produce an","published":"2026-02-23T12:15:54+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19766v1","arxiv_url":"http://arxiv.org/abs/2602.19766v1","comment":"ICLR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"One2Scene generates explorable 3D scenes from single images via three-stage pipeline: panorama generation, feed-forward Gaussian Splatting scaffold, and novel view synthesis. Achieves geometrically consistent scene exploration with bidirectional feature fusion.","reasoning":"Novel decomposition approach for single-image 3D scene generation. Code promised but not yet available.","code_url":null,"s2_tldr":"One2Scene is introduced, an effective framework that decomposes this ill-posed problem into three tractable sub-tasks to enable immersive explorable scene generation and works stably under large camera motions, supporting immersive scene exploration.","s2_paper_id":"5e131382f22ed8143924cb4916a581542adc795a","topics":"[\"3D / Vision\"]"},{"id":278,"run_id":1,"domain":"aiml","arxiv_id":"2602.19708","entry_id":"","title":"ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets","authors":"[\"Hoyoung Kim\", \"Minwoo Jang\", \"Jabin Koo\", \"Sangdoo Yun\", \"Jungseul Ok\"]","abstract":"Beyond general recognition tasks, specialized domains including privacy-constrained medical applications and fine-grained settings often encounter data scarcity, especially for tail classes. To obtain less biased and more reliable models under such scarcity, practitioners leverage diffusion models to supplement underrepresented regions of real data. Specifically, recent studies fine-tune pretrained diffusion models with LoRA on few-shot real sets to synthesize additional images. While an image-w","published":"2026-02-23T10:59:41+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19708v1","arxiv_url":"http://arxiv.org/abs/2602.19708v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"ChimeraLoRA addresses few-shot data scarcity by separating LoRA into class-shared (priors) and per-image (details) components with semantic boosting. Combines diversity and fine-grained details via Dirichlet-weighted mixture for improved downstream classification.","reasoning":"Practical approach for few-shot synthesis but no code/weights. Incremental improvement over existing LoRA methods.","code_url":null,"s2_tldr":"Across diverse datasets, the synthesized images are both diverse and detail-rich while closely aligning with the few-shot real distribution, yielding robust gains in downstream classification accuracy.","s2_paper_id":"c72ecaafaf27433d7d463132a0b162446cfdc3df","topics":"[\"Benchmark\", \"Language Models\"]"},{"id":280,"run_id":1,"domain":"aiml","arxiv_id":"2602.19679","entry_id":"","title":"TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures","authors":"[\"Hyeongjin Nam\", \"Daniel Sungho Jung\", \"Kyoung Mu Lee\"]","abstract":"Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. 
First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human-object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by lo","published":"2026-02-23T10:22:52+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19679v1","arxiv_url":"http://arxiv.org/abs/2602.19679v1","comment":"Published at CVPR 2026, 20 pages including the supplementary material","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"TeHOR performs joint 3D human-object reconstruction from single images using text-guided semantic alignment for non-contact interactions and appearance cues for holistic context. Achieves SOTA on human-object interaction reconstruction.","reasoning":"Novel text-guided approach for 3D reconstruction but no code/weights available. Strong practical application.","code_url":null,"s2_tldr":"This work introduces TeHOR, a framework that leverages text descriptions of human-object interactions to enforce semantic alignment between the 3D reconstruction and its textual cues, enabling reasoning over a wider spectrum of interactions, including non-contact cases.","s2_paper_id":"b4fa235591597c9766ba790b29d6f8a470cd43a8","topics":"[\"3D / Vision\", \"Robotics\"]"},{"id":282,"run_id":1,"domain":"aiml","arxiv_id":"2602.19539","entry_id":"","title":"Can a Teenager Fool an AI? Evaluating Low-Cost Cosmetic Attacks on Age Estimation Systems","authors":"[\"Xingyu Shen\", \"Tommy Duong\", \"Xiaodong An\", \"Zengqi Zhao\", \"Zebang Hu\", \"Haoyu Hu\", \"Ziyou Wang\", \"Finn Guo\", \"Simiao Ren\"]","abstract":"Age estimation systems are increasingly deployed as gatekeepers for age-restricted online content, yet their robustness to cosmetic modifications has not been systematically evaluated. We investigate whether simple, household-accessible cosmetic changes, including beards, grey hair, makeup, and simulated wrinkles, can cause AI age estimators to classify minors as adults. To study this threat at scale without ethical concerns, we simulate these physical attacks on 329 facial images of individuals","published":"2026-02-23T06:13:52+00:00","categories":"[\"cs.CV\", \"cs.CR\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19539v1","arxiv_url":"http://arxiv.org/abs/2602.19539v1","comment":"13 pages, 6 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Evaluates low-cost cosmetic attacks (beards, grey hair, makeup, wrinkles) on eight age estimation models using VLM-simulated modifications on 329 faces. Full attacks achieve up to 83% attack conversion rate, exposing vulnerabilities in age-verification systems.","reasoning":"No code/weights. Adversarial evaluation is practical for security/safety but incremental. 
Specialized application (age verification).","code_url":null,"s2_tldr":"Whether simple, household-accessible cosmetic changes, including beards, grey hair, makeup, and simulated wrinkles, can cause AI age estimators to classify minors as adults is investigated; the authors call for adversarial robustness evaluation as a mandatory criterion for model selection.","s2_paper_id":"eecec51d34b055f59257a93fb8197a731b477c78","topics":"[\"Benchmark\"]"},{"id":283,"run_id":1,"domain":"aiml","arxiv_id":"2602.19503","entry_id":"","title":"A Text-Guided Vision Model for Enhanced Recognition of Small Instances","authors":"[\"Hyun-Ki Jung\"]","abstract":"As drone-based object detection technology continues to evolve, the demand is shifting from merely detecting objects to enabling users to accurately identify specific targets. For example, users can input particular targets as prompts to precisely detect desired objects. To address this need, an efficient text-guided object detection model has been developed to enhance the detection of small objects. Specifically, an improved version of the existing YOLO-World model is introduced. The proposed m","published":"2026-02-23T04:40:14+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19503v1","arxiv_url":"http://arxiv.org/abs/2602.19503v1","comment":"Accepted for publication in Applied Computer Science (2026)","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Enhances YOLO-World for text-guided small object detection in drone imagery by replacing C2f with C3k2 layers for better local features. Improves precision from 40.6% to 41.6% on VisDrone while reducing parameters and FLOPs.","reasoning":"No code/weights. Incremental YOLO modification for drone object detection. Practical for drone applications but limited novelty.","code_url":null,"s2_tldr":"An improved version of the existing YOLO-World model is introduced that replaces the C2f layer in the YOLOv8 backbone with a C3k2 layer, enabling more precise representation of local features, particularly for small objects or those with clearly defined boundaries.","s2_paper_id":"763bcb5d450e3a640c3ffbceb556f69d8f500b25","topics":"[\"Efficiency\", \"3D / Vision\", \"World Models\"]"},{"id":284,"run_id":1,"domain":"aiml","arxiv_id":"2602.19470","entry_id":"","title":"Physics-informed Active Polarimetric 3D Imaging for Specular Surfaces","authors":"[\"Jiazhang Wang\", \"Hyelim Yang\", \"Tianyi Wang\", \"Florian Willomitzer\"]","abstract":"3D imaging of specular surfaces remains challenging in real-world scenarios, such as in-line inspection or hand-held scanning, requiring fast and accurate measurement of complex geometries. Optical metrology techniques such as deflectometry achieve high accuracy but typically rely on multi-shot acquisition, making them unsuitable for dynamic environments. 
Fourier-based single-shot approaches alleviate this constraint, yet their performance deteriorates when measuring surfaces with high spatial f","published":"2026-02-23T03:28:41+00:00","categories":"[\"cs.CV\", \"physics.optics\"]","pdf_url":"https://arxiv.org/pdf/2602.19470v1","arxiv_url":"http://arxiv.org/abs/2602.19470v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"Proposes a physics-informed deep learning framework for single-shot 3D imaging of specular surfaces by combining polarimetric cues with structured illumination. Uses dual-encoder architecture with mutual feature modulation to handle nonlinear coupling, enabling fast inference for complex geometries without multi-shot acquisition.","reasoning":"Novel approach combining polarimetry with deflectometry via deep learning, but no code/weights available and targets narrow industrial metrology applications.","code_url":null,"s2_tldr":"","s2_paper_id":"5b1a936390878a473ff6e7523d975595906c5aca","topics":"[\"3D / Vision\"]"},{"id":285,"run_id":1,"domain":"aiml","arxiv_id":"2602.19430","entry_id":"","title":"TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation","authors":"[\"Dong-Guw Lee\", \"Tai Hyoung Rhee\", \"Hyunsoo Jang\", \"Young-Sik Shin\", \"Ukcheol Shin\", \"Ayoung Kim\"]","abstract":"Despite the inherent advantages of thermal infrared(TIR) imaging, large-scale data collection and annotation remain a major bottleneck for TIR-based perception. A practical alternative is to synthesize pseudo TIR data via image translation; however, most RGB-to-TIR approaches heavily rely on RGB-centric priors that overlook thermal physics, yielding implausible heat distributions. In this paper, we introduce TherA, a controllable RGB-to-TIR translation framework that produces diverse and thermal","published":"2026-02-23T01:56:29+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19430v1","arxiv_url":"http://arxiv.org/abs/2602.19430v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"TherA performs controllable RGB-to-TIR translation using thermal-aware VLM embeddings coupled with latent diffusion. 
Enables scene/object-level control via user prompts (time, weather, object state) and achieves 33% improvement in zero-shot translation over baselines.","reasoning":"Novel thermal-aware conditioning approach but narrow application domain (TIR imaging) and no released code/weights.","code_url":null,"s2_tldr":"TherA is introduced, a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both scene and object level and achieves state-of-the-art translation performance, demonstrating an improvement in zero-shot translation of up to 33% averaged across all metrics.","s2_paper_id":"042f13449dbd47c40c6a98ee7eac0d90f471abce","topics":"[]"},{"id":286,"run_id":1,"domain":"aiml","arxiv_id":"2602.19423","entry_id":"","title":"Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy","authors":"[\"Jiabao Chen\", \"Shan Xiong\", \"Jialin Peng\"]","abstract":"Domain adaptive segmentation (DAS) is a promising paradigm for delineating intracellular structures from various large-scale electron microscopy (EM) without incurring extensive annotated data in each domain. However, the prevalent unsupervised domain adaptation (UDA) strategies often demonstrate limited and biased performance, which hinders their practical applications. In this study, we explore sparse points and local human preferences as weak labels in the target domain, thereby presenting a ","published":"2026-02-23T01:39:03+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19423v1","arxiv_url":"http://arxiv.org/abs/2602.19423v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"Prefer-DAS combines sparse point prompts with local human preference alignment for domain adaptive segmentation in electron microscopy. Introduces promptable multitask model with Local/Sparse Direct Preference Optimization (LPO/SLPO) and Unsupervised Preference Optimization for annotation-efficient adaptation.","reasoning":"Interesting preference alignment approach for specialized microscopy domain, but limited scope and no released artifacts.","code_url":null,"s2_tldr":"This study develops Prefer-DAS, which pioneers sparse promptable learning and local preference alignment, and introduces Local direct Preference Optimization and sparse LPO, plug-and-play solutions for alignment with spatially varying human feedback or sparse feedback.","s2_paper_id":"e2e22d73276c4247f2a9008373074c2252a1b96a","topics":"[\"3D / Vision\", \"Training\"]"},{"id":287,"run_id":1,"domain":"aiml","arxiv_id":"2602.19367","entry_id":"","title":"Time Series, Vision, and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces","authors":"[\"Pratham Yashwante\", \"Rose Yu\"]","abstract":"The Platonic Representation Hypothesis posits that learned representations from models trained on different modalities converge to a shared latent structure of the world. However, this hypothesis has largely been examined in vision and language, and it remains unclear whether time series participate in such convergence. 
We first examine this in a trimodal setting and find that independently pretrained time series, vision, and language encoders exhibit near-orthogonal geometry in the absence of e","published":"2026-02-22T22:39:37+00:00","categories":"[\"cs.AI\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19367v1","arxiv_url":"http://arxiv.org/abs/2602.19367v1","comment":"24 Figures, 12 Tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"Examines multimodal alignment between time series, vision, and language through contrastive learning. Finds time series aligns more strongly with vision than text, and alignment improves with model size but plateaus beyond certain information density thresholds.","reasoning":"Analysis paper exploring alignment limits rather than proposing new methods. Useful insights but limited direct applicability and no code/weights.","code_url":null,"s2_tldr":"This investigation reveals that overall alignment in contrastive representation spaces improves with model size, but this alignment is asymmetric: time series align more strongly with visual representations than with text, and images can act as effective intermediaries between time series and language.","s2_paper_id":"54a645b908e8b99ed819316244ee65d98d12aab1","topics":"[\"Training\", \"Optimization\"]"},{"id":288,"run_id":1,"domain":"aiml","arxiv_id":"2602.19357","entry_id":"","title":"MentalBlackboard: Evaluating Spatial Visualization via Mathematical Transformations","authors":"[\"Nilay Yilmaz\", \"Maitreya Patel\", \"Naga Sai Abhiram Kusumba\", \"Yixuan He\", \"Yezhou Yang\"]","abstract":"Spatial visualization is the mental ability to imagine, transform, and manipulate the spatial characteristics of objects and actions. This intelligence is a part of human cognition where actions and perception are connected on a mental level. To explore whether state-of-the-art Vision-Language Models (VLMs) exhibit this ability, we develop MentalBlackboard, an open-ended spatial visualization benchmark for Paper Folding and Hole Punching tests within two core tasks: prediction and planning. Our ","published":"2026-02-22T22:05:11+00:00","categories":"[\"cs.CV\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19357v1","arxiv_url":"http://arxiv.org/abs/2602.19357v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"MentalBlackboard benchmark evaluates VLM spatial visualization abilities through paper folding/hole punching tests. Finds models struggle with symmetrical transformations and rotations, with best model (o3) achieving only 25% on text-based prediction despite 71.6% on non-spatial tasks.","reasoning":"Benchmark paper revealing VLM limitations in spatial reasoning. 
Useful diagnostic but no new methods or artifacts, limited direct applicability.","code_url":null,"s2_tldr":"MentalBlackboard is developed, an open-ended spatial visualization benchmark for Paper Folding and Hole Punching tests within two core tasks: prediction and planning, which reveals limitations of models in analyzing symmetrical relationships and in implementing the multi-stage symmetry process.","s2_paper_id":"dd68dcdbe28ec185452d815e863998a6e2ce55ce","topics":"[\"Benchmark\", \"Reasoning\", \"Language Models\"]"},{"id":289,"run_id":1,"domain":"aiml","arxiv_id":"2602.19268","entry_id":"","title":"CORVET: A CORDIC-Powered, Resource-Frugal Mixed-Precision Vector Processing Engine for High-Throughput AIoT applications","authors":"[\"Sonu Kumar\", \"Mohd Faisal Khan\", \"Mukul Lokhande\", \"Santosh Kumar Vishvakarma\"]","abstract":"This brief presents a runtime-adaptive, performance-enhanced vector engine featuring a low-resource, iterative CORDIC-based MAC unit for edge AI acceleration. The proposed design enables dynamic reconfiguration between approximate and accurate modes, exploiting the latency-accuracy trade-off for a wide range of workloads. Its resource-efficient approach further enables up to 4x throughput improvement within the same hardware resources by leveraging vectorised, time-multiplexed execution and flex","published":"2026-02-22T16:51:17+00:00","categories":"[\"cs.AR\", \"cs.AI\", \"cs.CV\", \"cs.NE\", \"eess.IV\"]","pdf_url":"https://arxiv.org/pdf/2602.19268v1","arxiv_url":"http://arxiv.org/abs/2602.19268v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"CORVET presents a CORDIC-based vector processing engine for edge AI with runtime-adaptive approximate/accurate modes. Achieves 4x throughput improvement through vectorized execution with flexible precision (4/8/16-bit), delivering 4.83 TOPS/mm\u00b2 compute density.","reasoning":"Novel hardware architecture for edge AI with strong efficiency gains. ASIC implementation but no open weights/code for broader use.","code_url":null,"s2_tldr":"A runtime-adaptive, performance-enhanced vector engine featuring a low-resource, iterative CORDIC-based MAC unit for edge AI acceleration that enables dynamic reconfiguration between approximate and accurate modes, exploiting the latency-accuracy trade-off for a wide range of workloads.","s2_paper_id":"c4c040a19e5ad6350a8eb202da179b34b10d8e69","topics":"[\"Efficiency\"]"},{"id":290,"run_id":1,"domain":"aiml","arxiv_id":"2602.19254","entry_id":"","title":"RegionRoute: Regional Style Transfer with Diffusion Model","authors":"[\"Bowen Chen\", \"Jake Zuena\", \"Alan C. Bovik\", \"Divya Kothandaraman\"]","abstract":"Precise spatial control in diffusion-based style transfer remains challenging. This challenge arises because diffusion models treat style as a global feature and lack explicit spatial grounding of style representations, making it difficult to restrict style application to specific objects or regions. 
To our knowledge, existing diffusion models are unable to perform true localized style transfer, typically relying on handcrafted masks or multi-stage post-processing that introduce boundary artifac","published":"2026-02-22T16:11:07+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19254v1","arxiv_url":"http://arxiv.org/abs/2602.19254v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"RegionRoute enables precise spatial control in diffusion-based style transfer through attention supervision. Two losses (Focus KL-divergence and Cover BCE) teach the model to apply style to specific regions, with LoRA-MoE for multi-style adaptation.","reasoning":"Novel attention-supervised approach for localized style transfer. Addresses real limitation but no code/weights available.","code_url":null,"s2_tldr":"An attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training is proposed, producing regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.","s2_paper_id":"13a414b152969ca9e08ed6a9014207778d14b7f7","topics":"[]"},{"id":291,"run_id":1,"domain":"aiml","arxiv_id":"2602.19219","entry_id":"","title":"Controlled Face Manipulation and Synthesis for Data Augmentation","authors":"[\"Joris Kirchner\", \"Amogh Gudi\", \"Marian Bittner\", \"Chirag Raman\"]","abstract":"Deep learning vision models excel with abundant supervision, but many applications face label scarcity and class imbalance. Controllable image editing can augment scarce labeled data, yet edits often introduce artifacts and entangle non-target attributes. We study this in facial expression analysis, targeting Action Unit (AU) manipulation where annotation is costly and AU co-activation drives entanglement. We present a facial manipulation method that operates in the semantic latent space of a pr","published":"2026-02-22T15:03:06+00:00","categories":"[\"cs.CV\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19219v1","arxiv_url":"http://arxiv.org/abs/2602.19219v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Controlled facial AU manipulation in semantic latent space of pre-trained face generator (Diffusion Autoencoder) for data augmentation. Uses dependency-aware conditioning and orthogonal projection to reduce entanglement, improving AU detection training.","reasoning":"Solid data augmentation approach but narrow domain (facial expressions). 
No code/weights, incremental improvement.","code_url":null,"s2_tldr":"A facial manipulation method that operates in the semantic latent space of a pre-trained face generator (Diffusion Autoencoder) that reduces entanglement of semantic features via dependency-aware conditioning that accounts for AU co-activation and orthogonal projection that removes nuisance attribute directions.","s2_paper_id":"431b4b54fb15673359a751649de10e45656881a9","topics":"[\"Robotics\", \"Image Generation\"]"},{"id":292,"run_id":1,"domain":"aiml","arxiv_id":"2602.19198","entry_id":"","title":"Prompt Tuning for CLIP on the Pretrained Manifold","authors":"[\"Xi Yang\", \"Yuanrong Xu\", \"Weigang Zhang\", \"Guangming Lu\", \"David Zhang\", \"Jie Wen\"]","abstract":"Prompt tuning introduces learnable prompt vectors that adapt pretrained vision-language models to downstream tasks in a parameter-efficient manner. However, under limited supervision, prompt tuning alters pretrained representations and drives downstream features away from the pretrained manifold toward directions that are unfavorable for transfer. This drift degrades generalization. To address this limitation, we propose ManiPT, a framework that performs prompt tuning on the pretrained manifold.","published":"2026-02-22T13:58:41+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19198v1","arxiv_url":"http://arxiv.org/abs/2602.19198v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"ManiPT performs prompt tuning on the pretrained manifold using cosine consistency constraints and structural bias for incremental corrections. Prevents drift from pretrained representations under limited supervision, improving generalization across few-shot, domain, and cross-dataset settings.","reasoning":"Solid theoretical contribution to prompt tuning but incremental improvement. No code/weights available.","code_url":null,"s2_tldr":"ManiPT introduces cosine consistency constraints in both the text and image modalities to confine the learned representations within the pretrained geometric neighborhood and introduces a structural bias that enforces incremental corrections, guiding the adaptation along transferable directions to mitigate reliance on shortcut learning.","s2_paper_id":"ca5df08274021205b3852505bd9f97fad9b7b9ca","topics":"[\"Language Models\", \"Multimodal\", \"Efficiency\"]"},{"id":293,"run_id":1,"domain":"aiml","arxiv_id":"2602.19170","entry_id":"","title":"BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment","authors":"[\"Kanglei Zhou\", \"Chang Li\", \"Qingyi Pan\", \"Liyuan Wang\"]","abstract":"Action Quality Assessment (AQA) aims to score how well an action is performed and is widely used in sports analysis, rehabilitation assessment, and human skill evaluation. Multi-modal AQA has recently achieved strong progress by leveraging complementary visual and kinematic cues, yet real-world deployments often suffer from non-stationary modality imbalance, where certain modalities become missing or intermittently available due to sensor failures or annotation gaps. 
Existing continual AQA metho","published":"2026-02-22T13:00:52+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19170v1","arxiv_url":"http://arxiv.org/abs/2602.19170v1","comment":"Accepted to CVPR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"BriMA addresses multi-modal continual action quality assessment under modality-missing conditions. Memory-guided bridging imputes missing modalities using task-agnostic/specific representations, while modality-aware replay prioritizes informative samples.","reasoning":"Addresses practical deployment challenge but narrow sports/rehabilitation domain. CVPR 2026 accepted, no code.","code_url":null,"s2_tldr":"BriMA is introduced, an innovative approach to multi-modal continual AQA under modality-missing conditions that consists of a memory-guided bridging imputation module that reconstructs missing modalities using both task-agnostic and task-specific representations, and a modality-aware replay mechanism that prioritizes informative samples based on modality distortion and distribution drift.","s2_paper_id":"7773523378f638580a0cff8a34ac80b5914c6f9c","topics":"[\"Benchmark\"]"},{"id":294,"run_id":1,"domain":"aiml","arxiv_id":"2602.19146","entry_id":"","title":"VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval","authors":"[\"Diogo Gl\\u00f3ria-Silva\", \"David Semedo\", \"Jo\\u00e3o Maglh\\u00e3es\"]","abstract":"We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work which focuses mainly on text-only guidance, or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the mode","published":"2026-02-22T12:20:28+00:00","categories":"[\"cs.CV\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19146v1","arxiv_url":"http://arxiv.org/abs/2602.19146v1","comment":"Accepted at EACL 2026 Findings","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"VIGiA is a multimodal dialogue model for instructional video guidance that reasons over visual inputs, plans, and user interactions. Supports plan-aware VQA and retrieves relevant steps in text/visual format, achieving >90% accuracy on plan-aware tasks.","reasoning":"Novel task formulation but narrow domain (instructional videos). 
EACL 2026 Findings, no code/weights.","code_url":null,"s2_tldr":"This work introduces VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans, and shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting.","s2_paper_id":"d6238c0f7ea57b24a8acc049b4277eba3f71ff53","topics":"[\"Reasoning\", \"Retrieval / RAG\", \"Multimodal\"]"},{"id":296,"run_id":1,"domain":"aiml","arxiv_id":"2602.18907","entry_id":"","title":"DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation","authors":"[\"Yangchen Zeng\"]","abstract":"Recent generative recommendation frameworks have demonstrated remarkable scaling potential by reformulating item prediction as autoregressive Semantic ID (SID) generation. However, existing methods primarily rely on shallow behavioral signals, encoding items solely through surface-level textual features such as titles and descriptions. This reliance results in a critical Shallow Interest problem: the model fails to capture the latent, semantically rich interests underlying user interactions, lim","published":"2026-02-21T17:03:06+00:00","categories":"[\"cs.LG\", \"cs.CV\", \"cs.CY\"]","pdf_url":"https://arxiv.org/pdf/2602.18907v1","arxiv_url":"http://arxiv.org/abs/2602.18907v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"DeepInterestGR leverages multiple multi-modal LLMs to extract deep textual and visual interest representations for generative recommendation. Uses reinforcement learning with reward-labeled deep interests to improve semantic ID generation, outperforming baselines on Amazon Review benchmarks.","reasoning":"Interesting multi-LLM approach for interest mining. No code/weights shared. Practical for recommendation systems but limited to specific domain with unclear scalability.","code_url":null,"s2_tldr":"DeepInterestGR adopts a two-stage training pipeline: supervised fine-tuning aligns the generative model with deep interest signals and collaborative filtering patterns, followed by reinforcement learning with GRPO optimized by the authors' Interest-Aware Reward.","s2_paper_id":"6444f4d21395ef02514234cbb7c80c5e9c44aee7","topics":"[]"},{"id":297,"run_id":1,"domain":"aiml","arxiv_id":"2602.21158","entry_id":"","title":"SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards","authors":"[\"Dengjia Zhang\", \"Xiaoou Liu\", \"Lu Cheng\", \"Yaqing Wang\", \"Kenton Murray\", \"Hua Wei\"]","abstract":"Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. 
We introduce SELAUR: Self Evolv","published":"2026-02-24T18:04:54+00:00","categories":"[\"cs.LG\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.21158v1","arxiv_url":"http://arxiv.org/abs/2602.21158v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"SELAUR introduces uncertainty-aware rewards for LLM agents through entropy-, least-confidence-, and margin-based metrics combined into token-level estimates. Improves success rates on ALFWorld and WebShop benchmarks through failure-aware reward reshaping and enhanced exploration.","reasoning":"No code/weights shared. Novel integration of uncertainty into RL rewards for LLM agents. Practical improvements on benchmarks but limited evaluation scope.","code_url":null,"s2_tldr":"SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability.","s2_paper_id":"c90aae7ec10e0bbfa63cea6d4018f18242439aca","topics":"[\"Language Models\", \"RL\", \"Agents\"]"},{"id":298,"run_id":1,"domain":"aiml","arxiv_id":"2602.21045","entry_id":"","title":"PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly Q&A","authors":"[\"Anna Martin-Boyle\", \"Cara A. C. Leckey\", \"Martha C. Brown\", \"Harmanpreet Kaur\"]","abstract":"Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast amounts of literature. However, these systems often produce subtle errors (e.g., unsupported claims, errors of omission), and current provenance mechanisms like source citations are not granular enough for the rigorous verification that scholarly domain requires. To address this, we introduce PaperTrail, a novel interface that decomposes both LLM answers and source ","published":"2026-02-24T16:04:50+00:00","categories":"[\"cs.HC\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.21045v1","arxiv_url":"http://arxiv.org/abs/2602.21045v1","comment":"25 pages, 3 figures. Accepted at the ACM CHI conference on Human Factors in Computing Systems 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"PaperTrail interface decomposes LLM answers and source documents into claims and evidence for scholarly QA verification. Within-subjects study with 26 researchers shows significantly lowered trust but unchanged behavior, revealing cognition-burden challenges. CHI 2026 accepted work on granular provenance mechanisms.","reasoning":"CHI accepted HCI work. No code/weights. Important findings on trust vs. behavior gap but interface-focused rather than model innovation. 
Limited to scholarly settings.","code_url":null,"s2_tldr":"This work introduces PaperTrail, a novel interface that decomposes both LLM answers and source documents into discrete claims and evidence, mapping them to reveal supported assertions, unsupported claims, and information omitted from the source texts.","s2_paper_id":"f3759457c6702b4ee9e798e1df6188d1702af65d","topics":"[\"Language Models\"]"},{"id":299,"run_id":1,"domain":"aiml","arxiv_id":"2602.20749","entry_id":"","title":"Explicit Grammar Semantic Feature Fusion for Robust Text Classification","authors":"[\"Azrin Sultana\", \"Firoz Ahmed\"]","abstract":"Natural Language Processing enables computers to understand human language by analysing and classifying text efficiently with deep-level grammatical and semantic features. Existing models capture features by learning from large corpora with transformer models, which are computationally intensive and unsuitable for resource-constrained environments. Therefore, our proposed study incorporates comprehensive grammatical rules alongside semantic information to build a robust, lightweight classificati","published":"2026-02-24T10:25:29+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20749v1","arxiv_url":"http://arxiv.org/abs/2602.20749v1","comment":"30 pages, 9 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Proposes explicit encoding of grammatical structure (syntax, phrase patterns, complexity) into compact grammar vectors fused with frozen contextual embeddings. Lightweight approach outperforms baselines by 2-15% across heterogeneous domains without heavy transformers.","reasoning":"Practical lightweight alternative to full transformers for text classification, but no code/weights and incremental novelty.","code_url":null,"s2_tldr":"This study incorporates comprehensive grammatical rules alongside semantic information to build a robust, lightweight classification model without resorting to full parameterised transformer models or heavy deep learning architectures, resulting in a very lightweight model that delivers better performance on edge devices.","s2_paper_id":"10ceea14d9d9379618a07e438b94db61b3d428b8","topics":"[\"Efficiency\", \"Architecture\"]"},{"id":300,"run_id":1,"domain":"aiml","arxiv_id":"2602.20648","entry_id":"","title":"CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models","authors":"[\"Anqi Li\", \"Chenxiao Wang\", \"Yu Lu\", \"Renjun Xu\", \"Lizhi Ma\", \"Zhenzhong Lan\"]","abstract":"Client perceptions of the therapeutic alliance are critical for counseling effectiveness. Accurately capturing these perceptions remains challenging, as traditional post-session questionnaires are burdensome and often delayed, while existing computational approaches produce coarse scores, lack interpretable rationales, and fail to model holistic session context. 
We present CARE, an LLM-based framework to automatically predict multi-dimensional alliance scores and generate interpretable rationale","published":"2026-02-24T07:52:56+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20648v1","arxiv_url":"http://arxiv.org/abs/2602.20648v1","comment":"14 pages, 4 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"CARE uses LLaMA-3.1-8B-Instruct fine-tuned on 9,516 expert-curated rationales to predict multi-dimensional therapeutic alliance scores and generate interpretable rationales from counseling transcripts. Achieves 70% higher correlation with client ratings than counselor evaluations.","reasoning":"Domain-specific application with good results but limited to mental health counseling. No code/weights available, incremental ML application.","code_url":null,"s2_tldr":"Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings, demonstrating its potential as an AI-assisted tool for supporting mental health care.","s2_paper_id":"3a6a6f5f1c7391a515b2fdffb887c64c20107e01","topics":"[\"Language Models\"]"},{"id":301,"run_id":1,"domain":"aiml","arxiv_id":"2602.20647","entry_id":"","title":"Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28,606 Books","authors":"[\"W. Frederick Zimmerman\"]","abstract":"I introduce semantic novelty--cosine distance between each paragraph's sentence embedding and the running centroid of all preceding paragraphs--as an information-theoretic measure of narrative structure at corpus scale. Applying it to 28,606 books in PG19 (pre-1920 English literature), I compute paragraph-level novelty curves using 768-dimensional SBERT embeddings, then reduce each to a 16-segment Piecewise Aggregate Approximation (PAA). Ward-linkage clustering on PAA vectors reveals eight canon","published":"2026-02-24T07:52:35+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20647v1","arxiv_url":"http://arxiv.org/abs/2602.20647v1","comment":"six figures. dataset available at Hugging Face","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"Introduces semantic novelty measure (cosine distance to running centroid) for narrative structure analysis at scale. Applied to 28,606 books in PG19, reveals eight canonical narrative archetypes and identifies volume as strongest readership predictor. 
Dataset available on Hugging Face.","reasoning":"Novel information-theoretic approach to literary analysis with HF dataset, but limited practical applicability outside computational humanities research.","code_url":null,"s2_tldr":null,"s2_paper_id":"e692a6eb885a7c7398c0036bb68ae71232a0bd37","topics":"[\"Retrieval / RAG\"]"},{"id":302,"run_id":1,"domain":"aiml","arxiv_id":"2602.20042","entry_id":"","title":"Position: General Alignment Has Hit a Ceiling; Edge Alignment Must Be Taken Seriously","authors":"[\"Han Bao\", \"Yue Huang\", \"Xiaoda Wang\", \"Zheyuan Zhang\", \"Yujun Zhou\", \"Carl Yang\", \"Xiangliang Zhang\", \"Yanfang Ye\"]","abstract":"Large language models are being deployed in complex socio-technical systems, which exposes limits in current alignment practice. We take the position that the dominant paradigm of General Alignment, which compresses diverse human values into a single scalar reward, reaches a structural ceiling in settings with conflicting values, plural stakeholders, and irreducible uncertainty. These failures follow from the mathematics and incentives of scalarization and lead to \\textbf{structural} value flatt","published":"2026-02-23T16:51:43+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20042v1","arxiv_url":"http://arxiv.org/abs/2602.20042v1","comment":"26 pages, 5 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":4.0,"composite":4.45,"summary":"Position paper arguing that current single-scalar reward alignment (\"General Alignment\") hits a ceiling in complex socio-technical systems with conflicting values. Proposes \"Edge Alignment\" as a multi-dimensional approach preserving value structure and supporting plural stakeholder representation through seven interdependent pillars across three phases.","reasoning":"Conceptual/position paper with no code or weights. Novel framework proposal but purely theoretical with limited immediate practical applicability for practitioners.","code_url":null,"s2_tldr":"This work introduces Edge Alignment as a distinct approach in which systems preserve multi dimensional value structure, support plural and democratic representation, and incorporate epistemic mechanisms for interaction and clarification.","s2_paper_id":"dabc73530a6a1118c7ff64d3fec4ee8cd5ca79b8","topics":"[\"Training\", \"Language Models\", \"Efficiency\"]"},{"id":303,"run_id":1,"domain":"aiml","arxiv_id":"2602.20020","entry_id":"","title":"gencat: Generative computerized adaptive testing","authors":"[\"Wanyong Feng\", \"Andrew Lan\"]","abstract":"Existing computerized Adaptive Testing (CAT) frameworks are typically built on predicting the correctness of a student response to a question. Although effective, this approach fails to leverage textual information in questions and responses, especially for open-ended questions. In this work, we propose GENCAT (\\textbf{GEN}erative \\textbf{CAT}), a novel CAT framework that leverages Large Language Models for knowledge estimate and question selection. 
First, we develop a Generative Item Response T","published":"2026-02-23T16:28:46+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20020v1","arxiv_url":"http://arxiv.org/abs/2602.20020v1","comment":"19 pages, 2 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"Proposes GENCAT, a generative CAT framework using LLMs for knowledge estimation and question selection from open-ended responses. Introduces Generative Item Response Theory (GIRT) trained via SFT then preference optimization, plus three question selection algorithms. Achieves up to 4.32% AUC improvement on programming datasets.","reasoning":"Educational domain application with incremental novelty. No code/weights released. Limited to specific educational testing scenarios.","code_url":null,"s2_tldr":"GENCAT is proposed, a novel CAT framework that leverages Large Language Models for knowledge estimate and question selection and introduces three question selection algorithms that leverage the generative capabilities of the GIRT model, based on the uncertainty, linguistic diversity, and information of sampled student responses.","s2_paper_id":"906d499901758c1a905c9ed9a3698aa50d773d66","topics":"[\"Language Models\"]"},{"id":304,"run_id":1,"domain":"aiml","arxiv_id":"2602.20017","entry_id":"","title":"QUIETT: Query-Independent Table Transformation for Robust Reasoning","authors":"[\"Gaurav Najpande\", \"Tampu Ravi Kumar\", \"Manan Roy Choudhury\", \"Neha Valeti\", \"Yanjie Fu\", \"Vivek Gupta\"]","abstract":"Real-world tables often exhibit irregular schemas, heterogeneous value formats, and implicit relational structure, which degrade the reliability of downstream table reasoning and question answering. Most existing approaches address these issues in a query-dependent manner, entangling table cleanup with reasoning and thus limiting generalization. We introduce QuIeTT, a query-independent table transformation framework that preprocesses raw tables into a single SQL-ready canonical representation be","published":"2026-02-23T16:23:49+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20017v1","arxiv_url":"http://arxiv.org/abs/2602.20017v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Introduces QuIeTT, a query-independent table transformation framework that preprocesses irregular tables into SQL-ready canonical format before reasoning. Decouples table cleanup from reasoning, enabling cleaner and more efficient querying. Shows consistent gains across WikiTQ, HiTab, NQ-Table, and SequentialQA benchmarks.","reasoning":"Practical preprocessing approach but incremental innovation. No code/weights. 
Solves real problem in table QA but not paradigm-shifting.","code_url":null,"s2_tldr":"QuIeTT is introduced, a query-independent table transformation framework that preprocesses raw tables into a single SQL-ready canonical representation before any test-time queries are observed, enabling cleaner, more reliable, and highly efficient querying without modifying downstream models.","s2_paper_id":"3b776fcdfc8a7ac6e5eedc69bc992019d80a328c","topics":"[\"Reasoning\"]"},{"id":305,"run_id":1,"domain":"aiml","arxiv_id":"2602.19612","entry_id":"","title":"Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning","authors":"[\"Borisiuk Anna\", \"Andrey Savchenko\", \"Alexander Panchenko\", \"Elena Tutubalina\"]","abstract":"Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience","published":"2026-02-23T08:58:48+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19612v2","arxiv_url":"http://arxiv.org/abs/2602.19612v2","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"Introduces DUAL benchmark (28.6k Wikidata triplets) to study machine unlearning across pretraining and SFT stages. Shows that SFT models respond differently to unlearning than pretrained models: SFT on forget data yields 10-50% higher retention and more stable tuning, while direct unlearning on pretrained models remains unstable.","reasoning":"Important insights on unlearning but no code/weights. Benchmark contribution with empirical findings. Moderate practical applicability for safety researchers.","code_url":null,"s2_tldr":"DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores, shows that pretrained and SFT models respond differently to unlearning.","s2_paper_id":"7284509ddcb64bd2f53f86829b69272e88c27850","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":306,"run_id":1,"domain":"aiml","arxiv_id":"2602.19543","entry_id":"","title":"Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation","authors":"[\"Rizhuo Huang\", \"Yifan Feng\", \"Rundong Xue\", \"Shihui Ying\", \"Jun-Hai Yong\", \"Chuan Shi\", \"Shaoyi Du\", \"Yue Gao\"]","abstract":"Knowledge hypergraphs surpass traditional binary knowledge graphs by encapsulating complex $n$-ary atomic facts, providing a more comprehensive paradigm for semantic representation. However, constructing high-quality hypergraphs remains challenging due to the \\textit{scenario gap}: generic extractors struggle to generalize across diverse domains with specific jargon, while existing methods often fail to balance structural skeletons with fine-grained details. 
To bridge this gap, we propose \\textb","published":"2026-02-23T06:32:00+00:00","categories":"[\"cs.CL\", \"cs.IR\"]","pdf_url":"https://arxiv.org/pdf/2602.19543v1","arxiv_url":"http://arxiv.org/abs/2602.19543v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"Proposes Hyper-KGGen, a skill-driven framework for knowledge hypergraph extraction that reformulates extraction as dynamic skill-evolving process. Uses coarse-to-fine mechanism and adaptive skill acquisition module with stability-based feedback loop. Introduces HyperDocRED benchmark and outperforms baselines.","reasoning":"Novel framework for hypergraph extraction but no code/weights. Introduces useful benchmark but limited to specific KG extraction task.","code_url":null,"s2_tldr":"The proposed Hyper-KGGen is a skill-driven framework that reformulates extraction as a dynamic skill-evolving process that significantly outperforms strong baselines, validating that evolved skills provide substantially richer guidance than static few-shot examples in multi-scenario settings.","s2_paper_id":"3574d30622fa798e399ec7c22bcf1b49ed567bb6","topics":"[\"Retrieval / RAG\"]"},{"id":307,"run_id":1,"domain":"aiml","arxiv_id":"2602.19526","entry_id":"","title":"How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1","authors":"[\"Yinuo Xu\", \"Shuo Lu\", \"Jianjie Cheng\", \"Meng Wang\", \"Qianlong Xie\", \"Xingxing Wang\", \"Ran He\", \"Jian Liang\"]","abstract":"Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better per","published":"2026-02-23T05:33:17+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19526v1","arxiv_url":"http://arxiv.org/abs/2602.19526v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Systematic study of RL for Deep Research agents across prompt template, reward function, and policy optimization. Finds Fast Thinking template outperforms Slow Thinking; F1 reward causes collapse but action-level penalties help; REINFORCE outperforms PPO. Introduces Search-R1++ improving performance from 0.403 to 0.442 (Qwen2.5-7B).","reasoning":"Useful empirical study for research agents but no code/weights. Incremental improvement rather than novel architecture. 
Moderate practical value for agent builders.","code_url":null,"s2_tldr":"A systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization reveals that the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work.","s2_paper_id":"933315358a57a965fc28755f38109072e761451a","topics":"[\"Optimization\", \"RL\", \"Agents\"]"},{"id":308,"run_id":1,"domain":"aiml","arxiv_id":"2602.19317","entry_id":"","title":"Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering","authors":"[\"Maryam Amirizaniani\", \"Alireza Salemi\", \"Hamed Zamani\"]","abstract":"Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context. Existing state-of-the-art methods primarily rely on retrieval-augmented generation (RAG) solutions that construct personal context by retrieving relevant items from the user's profile. Existing methods use the user's query directly to retrieve personal documents, and such strategies often lead to surface-level personalization. We propose PR2 ","published":"2026-02-22T19:43:43+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.IR\"]","pdf_url":"https://arxiv.org/pdf/2602.19317v1","arxiv_url":"http://arxiv.org/abs/2602.19317v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"PR2 introduces a reinforcement learning framework for personalized question answering that learns adaptive retrieval-reasoning policies, determining when and what to retrieve from user profiles. Achieves 8.8%-12% relative improvement over baselines on LaMP-QA benchmark by optimizing multi-turn reasoning trajectories.","reasoning":"Novel RL approach to personalized QA with solid improvements, but no code/weights mentioned. Practical for personalization applications, though evaluation limited to single benchmark.","code_url":null,"s2_tldr":"This work proposes PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that integrates reasoning and retrieval from personal context for personalization in Question Answering.","s2_paper_id":"df6b25d0948f58f599170245946e199a9fce6caa","topics":"[\"Retrieval / RAG\", \"Training\"]"},{"id":309,"run_id":1,"domain":"aiml","arxiv_id":"2602.19157","entry_id":"","title":"Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs","authors":"[\"Wenqiu Tang\", \"Zhen Wan\", \"Takahiro Komamizu\", \"Ichiro Ide\"]","abstract":"Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-specific corpora. While SFT can be effective, it requires persona-labeled data and retraining for new roles, limiting flexibility. 
In contrast, prompt- and RAG-based signals are easy to apply but can be diluted in long dialogues, leading to drifting and sometim","published":"2026-02-22T12:39:02+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19157v1","arxiv_url":"http://arxiv.org/abs/2602.19157v1","comment":"Accepted in PAKDD 2026 special session on Data Science :Foundation and Applications","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Proposes contrastive SAE framework for facet-level personality control in role-playing agents, using Big Five 30-facet model with learned control vectors and trait-activated routing. Outperforms Contrastive Activation Addition and prompt-only baselines while maintaining dialogue coherence.","reasoning":"Novel approach to personality control in LLMs with interesting use of SAE, but requires specialized training data. Moderate practical value for role-playing applications, no code/weights mentioned.","code_url":null,"s2_tldr":"A contrastive Sparse AutoEncoder (SAE) framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model is proposed, confirming that contrastively trained latent vectors can enhance persona control while preserving dialogue coherence.","s2_paper_id":"814dce0994c6ba7b3ccba52dda147d58cf2c3e2c","topics":"[\"Language Models\", \"Retrieval / RAG\"]"},{"id":311,"run_id":1,"domain":"aiml","arxiv_id":"2602.18905","entry_id":"","title":"TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning","authors":"[\"Yujiao Yang\"]","abstract":"Large language models (LLMs) have demonstrated strong capabilities in complex reasoning tasks, yet their decision-making processes remain difficult to interpret. Existing explanation methods often lack trustworthy structural insight and are limited to single-instance analysis, failing to reveal reasoning stability and systematic failure mechanisms. To address these limitations, we propose the Trustworthy Unified Explanation Framework (TRUE), which integrates executable reasoning verification, fe","published":"2026-02-21T17:00:54+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18905v1","arxiv_url":"http://arxiv.org/abs/2602.18905v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"Proposes TRUE framework for trustworthy LLM reasoning explanation through executable verification, feasible-region DAG modeling, and causal failure mode analysis. Introduces multi-level explanations including executable structures, feasible-region representations, and Shapley value-based failure mode importance quantification.","reasoning":"Interesting interpretability framework but no code/weights provided. 
Novel theoretical contribution to explainability but limited practical adoption without implementation.","code_url":null,"s2_tldr":"The Trustworthy Unified Explanation Framework (TRUE), which integrates executable reasoning verification, feasible-region directed acyclic graph (DAG) modeling, and causal failure mode analysis, is proposed, establishing a unified and principled paradigm for improving the interpretability and reliability of LLM reasoning systems.","s2_paper_id":"075a9a286e71bc847036bd689ad0b8f22772b72c","topics":"[\"Language Models\", \"Reasoning\"]"},{"id":312,"run_id":1,"domain":"aiml","arxiv_id":"2602.18764","entry_id":"","title":"The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol","authors":"[\"Andreas Schlapbach\"]","abstract":"This paper establishes a fundamental convergence: Schema-Guided Dialogue (SGD) and the Model Context Protocol (MCP) represent two manifestations of a unified paradigm for deterministic, auditable LLM-agent interaction. SGD, designed for dialogue-based API discovery (2019), and MCP, now the de facto standard for LLM-tool integration, share the same core insight -- that schemas can encode not just tool signatures but operational constraints and reasoning guidance. By analyzing this convergence, we","published":"2026-02-21T09:02:35+00:00","categories":"[\"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18764v1","arxiv_url":"http://arxiv.org/abs/2602.18764v1","comment":"18 sections, 4 figures, 7 tables, 38 references. Original research presenting: (1) formal framework mapping Schema-Guided Dialogue principles to Model Context Protocol concepts, (2) five foundational design principles for LLM-native schema authoring, (3) architectural patterns for secure, scalable agent orchestration. Research supported by SBB (Swiss Federal Railways)","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Establishes convergence between Schema-Guided Dialogue (SGD) and Model Context Protocol (MCP), extracting five foundational principles for schema design including semantic completeness, action boundaries, failure documentation, progressive disclosure, and inter-tool relationships. Provides concrete design patterns for LLM-tool integration.","reasoning":"Conceptual framework unifying SGD and MCP with practical design principles, but no implementation. Useful for system architects but lacks empirical validation or code.","code_url":null,"s2_tldr":"Five foundational principles for schema design are extracted that position schema-driven governance as a scalable mechanism for AI system oversight without requiring proprietary system inspection -- central to Software 3.0.","s2_paper_id":"0637d35d48a598b4b3947523f05bb90bc623c216","topics":"[\"Optimization\", \"Language Models\", \"Agents\"]"},{"id":313,"run_id":1,"domain":"aiml","arxiv_id":"2602.18582","entry_id":"","title":"Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications","authors":"[\"Zhiqin Qian\", \"Ryan Diaz\", \"Sangwon Seo\", \"Vaibhav Unhelkar\"]","abstract":"When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed. As AI agents tackle increasingly complex tasks, aligning their behavior with human-provided specifications becomes critical for responsible AI deployment. 
Reward design provides a direct channel for such alignment by translating human expectations into reward functions that guide reinforcement learning (RL). However, existing methods are often to","published":"2026-02-20T19:41:17+00:00","categories":"[\"cs.AI\", \"cs.CL\", \"cs.HC\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.18582v1","arxiv_url":"http://arxiv.org/abs/2602.18582v1","comment":"Extended version of an identically-titled paper accepted at AAMAS 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Proposes HRDL (Hierarchical Reward Design from Language) and L2HR solution for translating human behavioral specifications into hierarchical RL rewards. Experiments show agents trained with L2HR-designed rewards better adhere to human specifications while completing tasks effectively.","reasoning":"Extends reward design to hierarchical RL with language interface. No code/weights provided. Application-specific to RL alignment, moderate practical impact.","code_url":null,"s2_tldr":"Hierarchical Reward Design from Language is introduced: a problem formulation that extends classical reward design to encode richer behavioral specifications for hierarchical RL agents and Language to Hierarchical Rewards (L2HR) is proposed as a solution to HRDL.","s2_paper_id":"f4f788aad3722da37d1e27eb108e573867f5d70c","topics":"[\"Training\", \"RL\", \"Agents\"]"},{"id":314,"run_id":1,"domain":"aiml","arxiv_id":"2602.18137","entry_id":"","title":"Agentic Adversarial QA for Improving Domain-Specific LLMs","authors":"[\"Vincent Grari\", \"Ciprian Tomoiaga\", \"Sylvain Lamprier\", \"Tatsunori Hashimoto\", \"Marcin Detyniecki\"]","abstract":"Large Language Models (LLMs), despite extensive pretraining on broad internet corpora, often struggle to adapt effectively to specialized domains. There is growing interest in fine-tuning these models for such domains; however, progress is constrained by the scarcity and limited coverage of high-quality, task-relevant data. To address this, synthetic data generation methods such as paraphrasing or knowledge extraction are commonly applied. Although these approaches excel at factual recall and co","published":"2026-02-20T10:53:09+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.18137v1","arxiv_url":"http://arxiv.org/abs/2602.18137v1","comment":"9 pages, 1 Figure","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Proposes adversarial question-generation framework for domain-specific LLM fine-tuning that produces compact, semantically challenging questions by comparing model outputs to expert models. Achieves better accuracy with fewer synthetic samples on LegalBench, addressing sample efficiency issues in specialized domain adaptation.","reasoning":"Incremental improvement in synthetic data generation for domain adaptation. Useful for practitioners but not paradigm-shifting. 
No code available.","code_url":null,"s2_tldr":"This work proposes an adversarial question-generation framework that produces a compact set of semantically challenging questions that are constructed by comparing the outputs of the model to be adapted and a robust expert model grounded in reference documents, using an iterative, feedback-driven process.","s2_paper_id":"c90b3d11d85e37061dc18bf9b458e9da4f3627b0","topics":"[\"Agents\", \"Language Models\"]"},{"id":315,"run_id":1,"domain":"aiml","arxiv_id":"2602.17907","entry_id":"","title":"Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions","authors":"[\"Raymond Li\", \"Amirhossein Abaskohi\", \"Chuyuan Li\", \"Gabriel Murray\", \"Giuseppe Carenini\"]","abstract":"Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we propose a novel approach to construct semantically-grounded soft label targets using Language Models (LMs) by projecting the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals. By training the t","published":"2026-02-20T00:12:04+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.17907v1","arxiv_url":"http://arxiv.org/abs/2602.17907v1","comment":"20 pages, 5 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Proposes semantically-grounded soft label distributions for neural topic models by projecting LM next-token probabilities onto vocabulary using specialized prompts. Achieves substantial improvements in topic coherence and purity over baselines, with better performance on retrieval-oriented applications.","reasoning":"Novel use of LM guidance for topic modeling with solid improvements. Practically useful but limited scope. No code/weights shared.","code_url":null,"s2_tldr":"This work proposes a novel approach to construct semantically-grounded soft label targets using Language Models (LMs) by projecting the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals.","s2_paper_id":"845dee333952332787a5a9bef4b44694540946f6","topics":"[\"Language Models\", \"Optimization\"]"},{"id":316,"run_id":1,"domain":"aiml","arxiv_id":"2602.17544","entry_id":"","title":"Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability","authors":"[\"Shashank Aggarwal\", \"Ram Vikas Mishra\", \"Amit Awekar\"]","abstract":"In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. 
Reusa","published":"2026-02-19T16:59:11+00:00","categories":"[\"cs.AI\", \"cs.CL\", \"cs.IR\"]","pdf_url":"https://arxiv.org/pdf/2602.17544v1","arxiv_url":"http://arxiv.org/abs/2602.17544v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"Proposes reusability and verifiability metrics for evaluating Chain-of-Thought reasoning beyond accuracy using a Thinker-Executor framework. Finds that specialized reasoning models' CoTs are not consistently more reusable than general-purpose LLMs, revealing blind spots in accuracy-based leaderboards.","reasoning":"Novel evaluation framework but limited practical impact (evaluation-focused). No code/weights mentioned. Useful insights for CoT research.","code_url":null,"s2_tldr":"Surprisingly, it is found that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.","s2_paper_id":"14964e1a3be57069bb972b6e5ed68d3aec63c073","topics":"[\"Reasoning\", \"Benchmark\", \"Language Models\"]"},{"id":317,"run_id":1,"domain":"aiml","arxiv_id":"2602.17542","entry_id":"","title":"Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems","authors":"[\"Zhangqi Duan\", \"Arnav Kankaria\", \"Dhruv Kartik\", \"Andrew Lan\"]","abstract":"Fine-grained skill representations, commonly referred to as knowledge components (KCs), are fundamental to many approaches in student modeling and learning analytics. However, KC-level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions typically involve multiple KCs simultaneously. Simply propagating problem-level correctness to all associated KCs obscures partial mastery and often leads to poorly fitted learning curves. To","published":"2026-02-19T16:58:34+00:00","categories":"[\"cs.CL\", \"cs.CY\"]","pdf_url":"https://arxiv.org/pdf/2602.17542v1","arxiv_url":"http://arxiv.org/abs/2602.17542v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Uses LLMs to automatically generate knowledge component (KC)-level correctness labels for open-ended programming tasks, addressing the challenge of partial mastery in student modeling. Shows improved learning curve fit and predictive performance compared to problem-level labeling.","reasoning":"Applied work in educational domain, limited broader ML applicability. 
No code/weights, moderate novelty (applying LLMs to existing problem).","code_url":null,"s2_tldr":"This work proposes an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code and introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code.","s2_paper_id":"78eca04faeedfe42a84f292069979627c9908df7","topics":"[\"Benchmark\"]"},{"id":318,"run_id":1,"domain":"aiml","arxiv_id":"2602.17443","entry_id":"","title":"AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue","authors":"[\"Adib Sakhawat\", \"Fardeen Sadab\", \"Rakin Shahriar\"]","abstract":"Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring c","published":"2026-02-19T15:09:12+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17443v1","arxiv_url":"http://arxiv.org/abs/2602.17443v1","comment":"16 pages, 5 figures, 13 tables. Includes appendix and supplementary materials","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"AIDG introduces game-theoretic framework revealing LLMs have 350 ELO advantage in information containment over extraction (Cohen's d=5.47). Identifies information dynamics and constraint adherence as key bottlenecks, showing models excel at local defensive coherence but struggle with global state tracking for strategic inquiry.","reasoning":"Novel evaluation framework but limited practical applicability beyond analysis. No code/weights, interesting theoretical insights for dialogue systems.","code_url":null,"s2_tldr":"AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction and information containment in dialogue, suggests that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.","s2_paper_id":"48954b8d0cf0085551e380ea60493be0e407c64f","topics":"[\"Benchmark\", \"Language Models\", \"Reasoning\"]"},{"id":319,"run_id":1,"domain":"aiml","arxiv_id":"2602.17366","entry_id":"","title":"RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering","authors":"[\"Yiming Zhang\", \"Siyue Zhang\", \"Junbo Zhao\", \"Chen Zhao\"]","abstract":"Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulties when generalizing to rare or niche knowledge. 
In this study, we introduce RPDR, a novel data augmentation fram","published":"2026-02-19T13:49:39+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17366v1","arxiv_url":"http://arxiv.org/abs/2602.17366v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"RPDR framework improves dense retrievers for long-tail question answering through synthetic data generation, round-trip prediction for data selection, and targeted training. Shows substantial improvements over BM25 and Contriever on PopQA and EntityQuestions, with dynamic routing mechanism for query specialization.","reasoning":"Practical RAG improvement but no code/weights mentioned. Moderate novelty (data augmentation approach), good empirical results on long-tail benchmarks.","code_url":null,"s2_tldr":"This study introduces RPDR, a novel data augmentation framework that selects high-quality easy-to-learn training data, to enhance dense retrievers and proposes a dynamic routing mechanism to dynamically route queries to specialized retrieval modules to further improve retrieval performance.","s2_paper_id":"960cc1ebccffecf0b8a8a1ae8d411653e5c5035b","topics":"[\"Language Models\", \"Retrieval / RAG\"]"},{"id":320,"run_id":1,"domain":"aiml","arxiv_id":"2602.17744","entry_id":"","title":"Bayesian Optimality of In-Context Learning with Selective State Spaces","authors":"[\"Di Zhang\", \"Jiaqi Xing\"]","abstract":"We propose Bayesian optimal sequential prediction as a new principle for understanding in-context learning (ICL). Unlike interpretations framing Transformers as performing implicit gradient descent, we formalize ICL as meta-learning over latent sequence tasks. For tasks governed by Linear Gaussian State Space Models (LG-SSMs), we prove a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor, converging to the posterior predictive mean. We further establish a statistica","published":"2026-02-19T12:41:28+00:00","categories":"[\"cs.LG\", \"cs.CL\", \"math.ST\", \"stat.ML\"]","pdf_url":"https://arxiv.org/pdf/2602.17744v1","arxiv_url":"http://arxiv.org/abs/2602.17744v1","comment":"17 pages","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":7.0,"score_axis_3":4.0,"composite":4.45,"summary":"Theoretical work proving selective SSMs achieve Bayes-optimal in-context learning for Linear Gaussian State Space Models, with statistical separation from gradient descent. Provides theoretical foundation explaining why selective SSMs can outperform Transformers on structured sequence tasks.","reasoning":"Low code_and_weights (no code/weights mentioned). Strong novelty in theoretical analysis connecting SSMs to Bayesian optimality. 
Limited immediate practical applicability due to theoretical focus and narrow task scope.","code_url":null,"s2_tldr":"This reframes ICL from \"implicit optimization\" to \"optimal inference,\" explaining the efficiency of selective SSMs and offering a principled basis for architecture design.","s2_paper_id":"89995c2d5ab695473b22a6d1276bb2bfa1adfddd","topics":"[\"Retrieval / RAG\", \"Architecture\", \"Optimization\"]"},{"id":321,"run_id":1,"domain":"aiml","arxiv_id":"2602.17316","entry_id":"","title":"Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation","authors":"[\"Bogdan Kosti\\u0107\", \"Conor Fallon\", \"Julian Risch\", \"Alexander L\\u00f6ser\"]","abstract":"The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ","published":"2026-02-19T12:24:42+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.17316v1","arxiv_url":"http://arxiv.org/abs/2602.17316v1","comment":"Accepted at LREC 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":6.0,"composite":4.45,"summary":"Examines how lexical/syntactic perturbations affect 23 LLMs across MMLU, SQuAD, and AMEGA benchmarks. Finds models rely heavily on surface-level lexical patterns rather than abstract linguistic understanding, with lexical changes causing significant performance drops and leaderboard instability.","reasoning":"Low code_and_weights (no code/weights mentioned). Moderate novelty in systematic perturbation analysis. Decent practical applicability for understanding model robustness and evaluation reliability.","code_url":null,"s2_tldr":"Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.","s2_paper_id":"b6462230a2cc1d0a03597e68c8a430918de6b84f","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":323,"run_id":1,"domain":"aiml","arxiv_id":"2602.17127","entry_id":"","title":"The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI","authors":"[\"Dusan Bosnjakovic\"]","abstract":"As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes a critical requirement for safety and governance. 
Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies -- the ``prevailing mindsets'' embedded during training and alignment that outlive individu","published":"2026-02-19T06:56:01+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17127v1","arxiv_url":"http://arxiv.org/abs/2602.17127v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":5.0,"composite":4.45,"summary":"Introduces psychometric framework using ordinal vignettes to audit latent biases in LLMs, revealing persistent 'lab signals' that cluster behaviors by provider. Identifies optimization bias, sycophancy, and status-quo legitimization as compounding risks in multi-agent AI systems.","reasoning":"Low code_and_weights (no code/weights mentioned). Moderate novelty in applying psychometrics to detect provider-level signatures. Moderate practical applicability for AI safety and governance contexts.","code_url":null,"s2_tldr":"These findings demonstrate that in ``locked-in'' provider ecosystems, latent biases are not merely static errors but compounding variables that risk creating recursive ideological echo chambers in multi-layered AI architectures.","s2_paper_id":"651997dea3d72ea15e146ee9b8ba693c88400c81","topics":"[\"Training\", \"Language Models\", \"Agents\"]"},{"id":324,"run_id":1,"domain":"aiml","arxiv_id":"2602.21098","entry_id":"","title":"Optimizing Occupancy Sensor Placement in Smart Environments","authors":"[\"Hao Lu\", \"Richard J. Radke\"]","abstract":"Understanding the locations of occupants in a commercial built environment is critical for realizing energy savings by delivering lighting, heating, and cooling only where it is needed. The key to achieving this goal is being able to recognize zone occupancy in real time, without impeding occupants' activities or compromising privacy. While low-resolution, privacy-preserving time-of-flight (ToF) sensor networks have demonstrated good performance in zone counting, the performance depends on caref","published":"2026-02-24T17:01:36+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21098v1","arxiv_url":"http://arxiv.org/abs/2602.21098v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":4.1,"summary":"Proposes automatic sensor placement optimization for time-of-flight sensor networks in commercial buildings to maximize zone occupancy counting accuracy. Formulates placement as integer linear programming problem and validates through trajectory simulations across different office environments.","reasoning":"Narrow application domain (building sensor placement). Limited novelty (applying ILP to sensor placement). 
Practical for specific use case but not broadly applicable.","code_url":null,"s2_tldr":"This work proposes an automatic sensor placement method that determines optimal sensor layouts for a given number of sensors, and can predict the counting accuracy of such a layout, and demonstrates the effectiveness of the proposed method based on simulations of several different office environments.","s2_paper_id":"56757615de773e5cfdb6001eb77a73a72a1370de","topics":"[\"Optimization\"]"},{"id":325,"run_id":1,"domain":"aiml","arxiv_id":"2602.20818","entry_id":"","title":"GatedCLIP: Gated Multimodal Fusion for Hateful Memes Detection","authors":"[\"Yingying Guo\", \"Ke Zhang\", \"Zirong Zeng\"]","abstract":"Detecting hateful content in multimodal memes presents unique challenges, as harmful messages often emerge from the complex interplay between benign images and text. We propose GatedCLIP, a Vision-Language model that enhances CLIP's multimodal capabilities with specialized architectural improvements for hateful memes detection. Our approach introduces learned projection heads that map CLIP embeddings to a task-optimized semantic space, a dynamic gated fusion mechanism that adaptively weights vis","published":"2026-02-24T11:54:54+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20818v1","arxiv_url":"http://arxiv.org/abs/2602.20818v1","comment":"Preprint","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":5.0,"composite":4.1,"summary":"GatedCLIP enhances CLIP for hateful memes detection through learned projection heads, dynamic gated fusion of visual/textual features, and contrastive learning objective. Achieves 0.66 AUROC on Hateful Memes dataset, substantially improving over CLIP baseline (0.49) with only 350K trainable parameters.","reasoning":"Incremental CLIP improvement for specific task. No code/weights. Modest novelty with limited generalization beyond hateful content detection.","code_url":null,"s2_tldr":"GatedCLIP is proposed, a Vision-Language model that enhances CLIP's multimodal capabilities with specialized architectural improvements for hateful memes detection and introduces learned projection heads that map CLIP embeddings to a task-optimized semantic space, a dynamic gated fusion mechanism that adaptively weights visual and textual features, and a contrastive learning objective that maintains cross-modal semantic alignment.","s2_paper_id":"6f1811fc3a7b388275debe5d534400325d4cbfbf","topics":"[\"Multimodal\", \"Language Models\", \"Retrieval / RAG\"]"},{"id":326,"run_id":1,"domain":"aiml","arxiv_id":"2602.20543","entry_id":"","title":"Beyond Human Performance: A Vision-Language Multi-Agent Approach for Quality Control in Pharmaceutical Manufacturing","authors":"[\"Subhra Jyoti Mandal\", \"Lara Rachidi\", \"Puneet Jain\", \"Matthieu Duvinage\", \"Sander W. Timmer\"]","abstract":"Colony-forming unit (CFU) detection is critical in pharmaceutical manufacturing, serving as a key component of Environmental Monitoring programs and ensuring compliance with stringent quality standards. Manual counting is labor-intensive and error-prone, while deep learning (DL) approaches, though accurate, remain vulnerable to sample quality variations and artifacts. 
Building on our earlier CNN-based framework (Beznik et al., 2020), we evaluated YOLOv5, YOLOv7, and YOLOv8 for CFU detection; how","published":"2026-02-24T04:48:05+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20543v1","arxiv_url":"http://arxiv.org/abs/2602.20543v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":4.1,"summary":"Multi-agent framework combining Detectron2 (99% detection) with VLM agents for CFU detection in pharmaceutical manufacturing. VLM classifies plates and estimates counts; agreement within 5% triggers automation, otherwise routes to expert review with feedback loop.","reasoning":"No code/weights provided. Incremental combination of existing methods (Detectron2 + VLMs) for specialized industrial application. Practical for pharmaceutical QC but not broadly applicable.","code_url":null,"s2_tldr":"The proposed system provides a scalable, auditable, and regulation-ready solution for microbiological quality control, advancing automation in biopharmaceutical production.","s2_paper_id":"c8e2a5691756db74127df72d2a3368e760a89d0a","topics":"[\"Multimodal\", \"Agents\", \"Benchmark\"]"},{"id":327,"run_id":1,"domain":"aiml","arxiv_id":"2602.20342","entry_id":"","title":"Large-scale Photorealistic Outdoor 3D Scene Reconstruction from UAV Imagery Using Gaussian Splatting Techniques","authors":"[\"Christos Maikos\", \"Georgios Angelidis\", \"Georgios Th. Papadopoulos\"]","abstract":"In this study, we present an end-to-end pipeline capable of converting drone-captured video streams into high-fidelity 3D reconstructions with minimal latency. Unmanned aerial vehicles (UAVs) are extensively used in aerial real-time perception applications. Moreover, recent advances in 3D Gaussian Splatting (3DGS) have demonstrated significant potential for real-time neural rendering. However, their integration into end-to-end UAV-based reconstruction and visualization systems remains underexplo","published":"2026-02-23T20:40:26+00:00","categories":"[\"cs.CV\", \"cs.RO\"]","pdf_url":"https://arxiv.org/pdf/2602.20342v1","arxiv_url":"http://arxiv.org/abs/2602.20342v1","comment":"7 pages, 2 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":4.1,"summary":"This paper presents an end-to-end pipeline for converting UAV video streams into 3D reconstructions using 3D Gaussian Splatting. The system integrates RTMP streaming, sensor fusion, and continuous model updates for low-latency AR/VR deployment, achieving visual fidelity within 4-7% of offline references.","reasoning":"No code available. Application-focused work applying existing 3DGS techniques to UAV scenarios. 
Practical for specific use cases but incremental in novelty.","code_url":null,"s2_tldr":"An end-to-end pipeline capable of converting drone-captured video streams into high-fidelity 3D reconstructions with minimal latency is presented, confirming the suitability of the proposed system for real-time, scalable augmented perception from aerial platforms.","s2_paper_id":"a68c016d3bc380917ff5f36ebdd3cd0635f632c4","topics":"[\"3D / Vision\"]"},{"id":328,"run_id":1,"domain":"aiml","arxiv_id":"2602.20137","entry_id":"","title":"Do Large Language Models Understand Data Visualization Rules?","authors":"[\"Martin Sinnona\", \"Valentin Bonas\", \"Emmanuel Iarussi\", \"Viviana Siless\"]","abstract":"Data visualization rules-derived from decades of research in design and perception-ensure trustworthy chart communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they can reason about and enforce visualization rules directly. Constraint-based systems such as Draco encode these rules as logical constraints for precise automated checks, but maintaining symbolic encodings requires expert effort, motivat","published":"2026-02-23T18:47:51+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20137v1","arxiv_url":"http://arxiv.org/abs/2602.20137v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":5.0,"composite":4.1,"summary":"This work systematically evaluates LLMs' ability to enforce data visualization rules using hard-verification ground truth from Answer Set Programming. Tested on 2,000 Vega-Lite specs, frontier models achieve high adherence (up to 100%) and reliably detect common violations (F1 up to 0.82) but struggle with subtle perceptual rules.","reasoning":"No code available. Evaluation-focused study with practical insights but no novel methods. Useful for understanding LLM capabilities in specialized domain.","code_url":null,"s2_tldr":"The first systematic evaluation of LLMs against visualization rules using hard-verification ground truth derived from Answer Set Programming (ASP) is presented, which demonstrates the potential of LLMs as flexible, language-driven validators while highlighting their current limitations compared to symbolic solvers.","s2_paper_id":"f2faf46ba50f9c5b69838c26700d22ec6d465dc6","topics":"[\"Language Models\", \"Reasoning\"]"},{"id":329,"run_id":1,"domain":"aiml","arxiv_id":"2602.19623","entry_id":"","title":"PedaCo-Gen: Scaffolding Pedagogical Agency in Human-AI Collaborative Video Authoring","authors":"[\"Injun Baek\", \"Yearim Kim\", \"Nojun Kwak\"]","abstract":"While advancements in Text-to-Video (T2V) generative AI offer a promising path toward democratizing content creation, current models are often optimized for visual fidelity rather than instructional efficacy. This study introduces PedaCo-Gen, a pedagogically-informed human-AI collaborative video generating system for authoring instructional videos based on Mayer's Cognitive Theory of Multimedia Learning (CTML). 
Moving away from traditional \"one-shot\" generation, PedaCo-Gen introduces an Intermed","published":"2026-02-23T09:12:13+00:00","categories":"[\"cs.CV\", \"cs.AI\", \"cs.HC\"]","pdf_url":"https://arxiv.org/pdf/2602.19623v1","arxiv_url":"http://arxiv.org/abs/2602.19623v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":4.1,"summary":"PedaCo-Gen enables pedagogically-informed human-AI collaborative instructional video authoring based on Mayer's CTML. Introduces Intermediate Representation phase with AI reviewer for iterative blueprint refinement, achieving high production efficiency (M=4.26).","reasoning":"Novel application of AI for educational content but no code/weights. Domain-specific for pedagogy.","code_url":null,"s2_tldr":"PedaCo-Gen is introduced, a pedagogically-informed human-AI collaborative video generating system for authoring instructional videos based on Mayer's Cognitive Theory of Multimedia Learning (CTML), enabling educators to interactively review and refine video blueprints-comprising scripts and visual descriptions-with an AI reviewer.","s2_paper_id":"f24b7e098ecd8fd5a0eacea91d059ce327f253ff","topics":"[\"Video Generation\", \"Optimization\"]"},{"id":330,"run_id":1,"domain":"aiml","arxiv_id":"2602.19474","entry_id":"","title":"Structured Bitmap-to-Mesh Triangulation for Geometry-Aware Discretization of Image-Derived Domains","authors":"[\"Wei Feng\", \"Haiyong Zheng\"]","abstract":"We propose a template-driven triangulation framework that embeds raster- or segmentation-derived boundaries into a regular triangular grid for stable PDE discretization on image-derived domains. Unlike constrained Delaunay triangulation (CDT), which may trigger global connectivity updates, our method retriangulates only triangles intersected by the boundary, preserves the base mesh, and supports synchronization-free parallel execution. To ensure determinism and scalability, we classify all local","published":"2026-02-23T03:36:55+00:00","categories":"[\"cs.CG\", \"cs.CV\", \"cs.GR\"]","pdf_url":"https://arxiv.org/pdf/2602.19474v1","arxiv_url":"http://arxiv.org/abs/2602.19474v1","comment":"Revised version after peer review; under review at Graphical Models. Earlier version appeared on SSRN","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":5.0,"composite":4.1,"summary":"Template-driven triangulation framework embeds raster-derived boundaries into regular triangular grids for PDE discretization on image-derived domains. Uses finite symbolic lookup table for deterministic parallel triangulation with bounded angles, avoiding global connectivity updates of CDT.","reasoning":"No code/weights. Computational geometry method for PDE discretization is niche. 
Practical for geometric analysis but not mainstream ML.","code_url":null,"s2_tldr":"A template-driven triangulation framework that embeds raster- or segmentation-derived boundaries into a regular triangular grid for stable PDE discretization on image-derived domains and proves that the resulting mesh is closed, has bounded angles, and is compatible with cotangent-based discretizations and standard finite element methods.","s2_paper_id":"87f0e4098c5f1b321d5668a96f9a6e843317ef23","topics":"[\"3D / Vision\"]"},{"id":331,"run_id":1,"domain":"aiml","arxiv_id":"2602.18936","entry_id":"","title":"CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion","authors":"[\"Yu Li\", \"Yujun Cai\", \"Chi Zhang\"]","abstract":"Personalized image generation requires effectively balancing content fidelity with stylistic consistency when synthesizing images based on text and reference examples. Low-Rank Adaptation (LoRA) offers an efficient personalization approach, with potential for precise control through combining LoRA weights on different concepts. However, existing combination techniques face persistent challenges: entanglement between content and style representations, insufficient guidance for controlling element","published":"2026-02-21T19:05:11+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.18936v2","arxiv_url":"http://arxiv.org/abs/2602.18936v2","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":4.1,"summary":"CRAFT-LoRA proposes rank-constrained backbone fine-tuning and training-free fusion for content-style personalization in image generation. Combines low-rank adaptation with prompt-guided control and timestep-dependent classifier-free guidance to improve content-style disentanglement without retraining overhead.","reasoning":"No code/weights shared. Novel approach to LoRA combination with practical training-free fusion, but incremental improvement over existing LoRA methods. Limited evaluation details provided.","code_url":null,"s2_tldr":"CRAFT-LoRA significantly improves content-style disentanglement, enables flexible semantic control over LoRA module combinations, and achieves high-fidelity generation without additional retraining overhead.","s2_paper_id":"56583c33107c351b8fb1db51c817a311c1b1b17e","topics":"[\"Image Generation\", \"Efficiency\"]"},{"id":332,"run_id":1,"domain":"aiml","arxiv_id":"2602.21059","entry_id":"","title":"An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems","authors":"[\"Anna Martin-Boyle\", \"William Humphreys\", \"Martha Brown\", \"Cara Leckey\", \"Harmanpreet Kaur\"]","abstract":"Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the ass","published":"2026-02-24T16:16:44+00:00","categories":"[\"cs.HC\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.21059v1","arxiv_url":"http://arxiv.org/abs/2602.21059v1","comment":"24 pages, 2 figures. 
Accepted at ACM CHI conference on Human Factors in Computing Systems, 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":4.1,"summary":"Develops expert-validated schema for evaluating LLM errors in scholarly QA through collaboration with domain experts, identifying 20 error patterns across seven categories. CHI 2026 accepted work shows structured evaluation schemas help scientists detect previously overlooked issues through systematic assessment strategies.","reasoning":"CHI accepted. No code/weights. Important HCI contribution for LLM evaluation but limited to scholarly domain. Evaluation framework rather than technical innovation.","code_url":null,"s2_tldr":"A schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists and shows not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues.","s2_paper_id":"3c66da6266c6beb3fff8bbb338d5239c573df9a7","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":333,"run_id":1,"domain":"aiml","arxiv_id":"2602.20973","entry_id":"","title":"Linear Reasoning vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving","authors":"[\"Yuliang Ji\", \"Fuchen Shen\", \"Jian Wu\", \"Qiujie Xie\", \"Yue Zhang\"]","abstract":"To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced abundant mathematical reasoning datasets. However, most existing datasets primarily focus on linear reasoning, neglecting other parts such as proof by contradiction and proof by cases, which are crucial for investigating LLMs' reasoning abilities. To address this limitation, we first introduce a novel first-order logic (FOL) dataset named PC-FOL, annotated by professio","published":"2026-02-24T14:53:34+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20973v1","arxiv_url":"http://arxiv.org/abs/2602.20973v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":4.0,"composite":4.1,"summary":"Introduces PC-FOL dataset focusing on case-based reasoning problems in first-order logic, annotated by mathematicians. 
Reveals substantial performance gaps between linear and case-based reasoning in LLMs, with theoretical analysis via graphical models explaining the disparity.","reasoning":"Novel benchmark for mathematical reasoning but no code/weights, limited to research evaluation rather than practical deployment.","code_url":null,"s2_tldr":"This work introduces a novel first-order logic (FOL) dataset named PC-FOL, annotated by professional mathematicians, focusing on case-based reasoning problems, and provides a theoretical analysis grounded in graphical models that explains the observed disparity between the two types of reasoning problems.","s2_paper_id":"6ec91bfee3e5d33788f2c68832e4360000f1ea46","topics":"[\"Language Models\", \"Reasoning\", \"Benchmark\"]"},{"id":334,"run_id":1,"domain":"aiml","arxiv_id":"2602.20433","entry_id":"","title":"Disentangling Geometry, Performance, and Training in Language Models","authors":"[\"Atharva Kulkarni\", \"Jacob Mitchell Springer\", \"Arjun Subramonian\", \"Swabha Swayamdipta\"]","abstract":"Geometric properties of Transformer weights, particularly the unembedding matrix, have been widely useful in language model interpretability research. Yet, their utility for estimating downstream performance remains unclear. In this work, we systematically investigate the relationship between model performance and the unembedding matrix geometry, particularly its effective rank. Our experiments, involving a suite of 108 OLMo-style language models trained under controlled variation, reveal severa","published":"2026-02-24T00:31:04+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20433v1","arxiv_url":"http://arxiv.org/abs/2602.20433v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":5.0,"composite":4.1,"summary":"Investigates relationship between LLM performance and unembedding matrix geometry (effective rank) using 108 controlled OLMo-style models. Finds that high effective rank correlates with best performance but isn't universal; low rank co-occurs with but doesn't cause performance degradation. Geometry primarily reflects training choices rather than performance.","reasoning":"Useful empirical analysis with controlled experiments, but largely incremental findings. No novel methods or architectures. Limited immediate applicability beyond understanding existing models.","code_url":null,"s2_tldr":"The findings suggest that the model's geometry, as captured by existing metrics, primarily reflects training choices rather than performance, which is in contrast to prior work.","s2_paper_id":"6e8bc61fefd9c4dfce67b1a3784e396cb38b6c37","topics":"[\"Language Models\", \"Retrieval / RAG\", \"Architecture\"]"},{"id":335,"run_id":1,"domain":"aiml","arxiv_id":"2602.19840","entry_id":"","title":"SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation","authors":"[\"Jingzhuo Wu\", \"Jiajun Zhang\", \"Keyan Jin\", \"Dehua Ma\", \"Junbo Wang\"]","abstract":"Modern large language models (LLMs) excel at generating fluent and faithful translations. However, they struggle to preserve an author's unique literary style, often producing semantically correct but generic outputs. This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic variations. 
To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal p","published":"2026-02-23T13:40:44+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19840v1","arxiv_url":"http://arxiv.org/abs/2602.19840v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":5.0,"composite":4.1,"summary":"Proposes SAMAS (Style-Adaptive Multi-Agent System) for literary translation that quantifies style into Stylistic Feature Spectrum using wavelet packet transform. Dynamically assembles specialized translation agents based on source text patterns. Achieves competitive semantic accuracy with statistically significant advantage in style fidelity.","reasoning":"Interesting multi-agent approach but no code/weights. Specific to literary translation. Incremental improvement on existing LLM translation.","code_url":null,"s2_tldr":"The Style-Adaptive Multi-Agent System (SAMAS) is introduced, a novel framework that treats style preservation as a signal processing task and quantifies literary style into a Stylistic Feature Spectrum (SFS) using the wavelet packet transform.","s2_paper_id":"9643f44ff5d9f50790aec2a72d83ad534e93aedf","topics":"[\"Agents\", \"Language Models\"]"},{"id":337,"run_id":1,"domain":"aiml","arxiv_id":"2602.19177","entry_id":"","title":"Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content","authors":"[\"Simon M\\u00fcnker\", \"Nils Schwager\", \"Kai Kugler\", \"Michael Heseltine\", \"Achim Rettinger\"]","abstract":"The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their \"naive\" application, where they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel,","published":"2026-02-22T13:14:27+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19177v1","arxiv_url":"http://arxiv.org/abs/2602.19177v1","comment":"8 pages (12 including references), 2 figures and 2 tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":4.1,"summary":"Introduces a history-conditioned reply prediction dataset for X/Twitter to evaluate linguistic discrepancies between LLM-generated and human content. Provides quantitative framework using stylistic and content metrics to assess synthetic data quality for computational social science applications.","reasoning":"Useful dataset for evaluating LLM authenticity, but focused on social science validation rather than novel methods. 
Moderate practical value for researchers using LLMs as human proxies.","code_url":null,"s2_tldr":"This paper introduces a novel, history-conditioned reply prediction task on authentic X data, to create a dataset designed to evaluate the linguistic output of LLMs against human-generated content, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data.","s2_paper_id":"d05edb016549da74c4053bcb4a1c75b92ef88e0b","topics":"[\"Benchmark\", \"Language Models\", \"Reasoning\"]"},{"id":338,"run_id":1,"domain":"aiml","arxiv_id":"2602.19115","entry_id":"","title":"How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders","authors":"[\"Michael McCoubrey\", \"Angelo Salatino\", \"Francesco Osborne\", \"Enrico Motta\"]","abstract":"In recent years, there has been a growing use of generative AI, and large language models (LLMs) in particular, to support both the assessment and generation of scientific work. Although some studies have shown that LLMs can, to a certain extent, evaluate research according to perceived quality, our understanding of the internal mechanisms that enable this capability remains limited. This paper presents the first study that investigates how LLMs encode the concept of scientific quality through r","published":"2026-02-22T10:12:20+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.DL\"]","pdf_url":"https://arxiv.org/pdf/2602.19115v1","arxiv_url":"http://arxiv.org/abs/2602.19115v1","comment":"Presented at SESAME 2025: Smarter Extraction of ScholArly MEtadata using Knowledge Graphs and Language Models, @ JCDL 2025","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":5.0,"composite":4.1,"summary":"First study investigating how LLMs encode scientific quality through monosemantic features from sparse autoencoders. Identifies four recurring feature types capturing research quality dimensions: methodologies, publication types, high-impact fields, and scientific jargon, validated through citation and journal metric prediction.","reasoning":"Novel interpretability work for understanding research quality representation in LLMs. Moderate practical value, primarily analytical rather than actionable, no code/weights.","code_url":null,"s2_tldr":"This paper presents the first study that investigates how LLMs encode the concept of scientific quality through relevant monosemantic features extracted using sparse autoencoders, and identifies four recurring types of features that capture key aspects of how research quality is represented.","s2_paper_id":"4f2a3156d886a3d041bbfe67d7097553d0c98cd4","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":339,"run_id":1,"domain":"aiml","arxiv_id":"2602.19101","entry_id":"","title":"Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models","authors":"[\"Seong Hah Cho\", \"Junyi Li\", \"Anna Leshinskaya\"]","abstract":"Value alignment of Large Language Models (LLMs) requires us to empirically measure these models' actual, acquired representation of value. Among the characteristics of value representation in humans is that they distinguish among value of different kinds. We investigate whether LLMs likewise distinguish three different kinds of good: moral, grammatical, and economic. 
By probing model behavior, embeddings, and residual stream activations, we report pervasive cases of value entanglement: a conflat","published":"2026-02-22T09:11:26+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19101v1","arxiv_url":"http://arxiv.org/abs/2602.19101v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":5.0,"composite":4.1,"summary":"Investigates value representation in LLMs, finding pervasive 'value entanglement' where grammatical and economic valuation are overly influenced by moral value. Proposes selective ablation of morality-associated activation vectors to repair this conflation and align with human norms.","reasoning":"Interesting interpretability work on value alignment with novel findings, but limited practical applicability beyond alignment research. No code/weights mentioned.","code_url":null,"s2_tldr":"","s2_paper_id":"01901262ae2ddb5e5543b7118703cf36f81098ff","topics":"[\"Language Models\", \"Retrieval / RAG\", \"Training\"]"},{"id":340,"run_id":1,"domain":"aiml","arxiv_id":"2602.18776","entry_id":"","title":"ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models","authors":"[\"Anas Alhumud\", \"Abdulaziz Alhammadi\", \"Muhammad Badruddin Khan\"]","abstract":"We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9). We evaluate 71 models from 10 providers using four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT) on 210 number reading tasks spanning six contextual categories: pure numerals, addresses, dates, quantities, and prices. Our evaluation comprises 59,010 individu","published":"2026-02-21T10:00:56+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.18776v1","arxiv_url":"http://arxiv.org/abs/2602.18776v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":4.1,"summary":"Presents ArabicNumBench, comprehensive benchmark evaluating 71 models across 10 providers on Arabic number reading tasks. Evaluates 59,010 test cases across 210 tasks with four prompting strategies. Few-shot CoT achieves 2.8x improvement over zero-shot (80.06% vs 28.76%).","reasoning":"Useful benchmark for Arabic NLP but incremental contribution. Provides baselines for model selection but no novel methods or architectures.","code_url":null,"s2_tldr":"Comprehensive evaluation of 281 model-strategy combinations demonstrates that numerical accuracy and instruction-following represent distinct capabilities, establishing baselines for Arabic number comprehension and providing actionable guidance for model selection in production Arabic NLP systems.","s2_paper_id":"00f3621f0ade52c6b230f970c8d584aa12475a63","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":341,"run_id":1,"domain":"aiml","arxiv_id":"2602.18652","entry_id":"","title":"PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation","authors":"[\"Nina Hosseini-Kivanani\"]","abstract":"Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings. 
We introduced PolyFrame, our system for the MWE-2026 AdMIRe2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision--language encoders and the multilingual BGE M3 encoder, training only lightweight mod","published":"2026-02-20T23:07:55+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18652v1","arxiv_url":"http://arxiv.org/abs/2602.18652v1","comment":"Accepted at AdMIRe 2 shared task (Advancing Multimodal Idiomaticity Representation) colocated with 22nd Workshop on Multiword Expressions (MWE 2026) @EACL2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":5.0,"composite":4.1,"summary":"Introduces PolyFrame system for MWE-2026 AdMIRe2 multimodal idiom disambiguation. Unified pipeline for image+text and text-only ranking using frozen CLIP/BGE encoders with lightweight modules. Achieves 60% Top-1 on English dev, 0.35/0.73 Top-1/NDCG on multilingual test across 15 languages.","reasoning":"Task-specific system for shared task with moderate performance. Limited generalizability beyond idiom disambiguation and no code released.","code_url":null,"s2_tldr":"PolyFrame is introduced, a system for the MWE-2026 AdMIRe2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking and text-only caption ranking, and sentence-type prediction and multimodal fusion enhance robustness.","s2_paper_id":"b9fc6ec83530c83b0e2bd0dce3c4795dbe56fffc","topics":"[\"Multimodal\"]"},{"id":342,"run_id":1,"domain":"aiml","arxiv_id":"2602.17981","entry_id":"","title":"Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering","authors":"[\"Amine Kobeissi\", \"Philippe Langlais\"]","abstract":"Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings. We study a frequent failure mode in which the correct document is retrieved but the page or chunk that contains the answer is missed, leading the generator to extrapolate from incomplete context. Despite its practical significance, this within-document retrieval failure mode ha","published":"2026-02-20T04:31:40+00:00","categories":"[\"cs.CL\", \"cs.IR\"]","pdf_url":"https://arxiv.org/pdf/2602.17981v1","arxiv_url":"http://arxiv.org/abs/2602.17981v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":4.1,"summary":"Analyzes within-document retrieval failures in RAG for financial QA, introducing oracle-based analysis across document/page/chunk granularity. Proposes domain fine-tuned bi-encoder for page-level retrieval that significantly improves recall on FinanceBench by treating pages as intermediate semantic units.","reasoning":"Incremental improvement in RAG retrieval for specialized domain. Practical for financial applications but limited novelty. 
No code shared.","code_url":null,"s2_tldr":"A domain fine-tuned page scorer is introduced that treats pages as an intermediate retrieval unit between documents and chunks and fine-tune a bi-encoder specifically for page-level relevance on financial filings, exploiting the semantic coherence of pages.","s2_paper_id":"c16e8450de9afc9a39248ce87bb1922bb855c961","topics":"[\"Retrieval / RAG\"]"},{"id":343,"run_id":1,"domain":"aiml","arxiv_id":"2602.17937","entry_id":"","title":"Analyzing LLM Instruction Optimization for Tabular Fact Verification","authors":"[\"Xiaotang Du\", \"Giwon Hong\", \"Wai-Chung Kwan\", \"Rohit Saxena\", \"Ivan Titov\", \"Pasquale Minervini\", \"Emily Allaway\"]","abstract":"Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization, based on the DSPy optimization framework, for tabular fact verification. We evaluate four out-of-the-box prompting techniques that cover both text-only prompting and code use: direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution","published":"2026-02-20T01:56:27+00:00","categories":"[\"cs.CL\", \"cs.PL\"]","pdf_url":"https://arxiv.org/pdf/2602.17937v1","arxiv_url":"http://arxiv.org/abs/2602.17937v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":4.1,"summary":"First systematic comparison of DSPy instruction optimization for tabular fact verification, evaluating COPRO, MiPROv2, and SIMBA optimizers across four benchmarks. Finds MiPROv2 best for CoT, SIMBA best for ReAct agents, with behavioral analysis showing SIMBA encourages more direct reasoning paths.","reasoning":"Useful empirical study for practitioners using DSPy framework. Incremental evaluation work rather than novel methods. No new code.","code_url":null,"s2_tldr":"It is found that instruction optimization consistently improves verification accuracy, with MiPROv2 yielding the most stable gains for CoT, and SIMBA providing the largest benefits for ReAct agents, particularly at larger model scales.","s2_paper_id":"d7163ab2fb1ea250a23707c92373fcadc831b90b","topics":"[\"Language Models\", \"Optimization\", \"Reasoning\"]"},{"id":344,"run_id":1,"domain":"aiml","arxiv_id":"2602.17467","entry_id":"","title":"PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions","authors":"[\"Greta Damo\", \"St\\u00e9phane Petiot\", \"Elena Cabrio\", \"Serena Villata\"]","abstract":"The increasing volume of hate speech on online platforms poses significant societal challenges. While the Natural Language Processing community has developed effective methods to automatically detect the presence of hate speech, responses to it, called counter-speech, are still an open challenge. We present PEACE 2.0, a novel tool that, besides analysing and explaining why a message is considered hateful or not, also generates a response to it. 
More specifically, PEACE 2.0 has three main new fun","published":"2026-02-19T15:33:56+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17467v1","arxiv_url":"http://arxiv.org/abs/2602.17467v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":4.1,"summary":"PEACE 2.0 extends hate speech detection with RAG-based grounded explanations and evidence-backed counter-speech generation. The tool analyzes why messages are hateful and automatically generates contextually-appropriate responses for both explicit and implicit hate speech.","reasoning":"Incremental improvement over existing tool, no mention of code/weights. Practical application for content moderation but limited technical novelty.","code_url":null,"s2_tldr":"PEACE 2.0 is presented, a novel tool that, besides analysing and explaining why a message is considered hateful or not, also generates a response to it, and explores the characteristics of counter-speech replies.","s2_paper_id":"d8d279a3a889e6460d28894452a86b8293a07df4","topics":"[\"Speech / Audio\"]"},{"id":345,"run_id":1,"domain":"aiml","arxiv_id":"2602.17424","entry_id":"","title":"Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference","authors":"[\"Anastasia Zhukova\", \"Felix Hamborg\", \"Karsten Donnay\", \"Norman Meuschke\", \"Bela Gipp\"]","abstract":"Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme","published":"2026-02-19T14:56:01+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17424v1","arxiv_url":"http://arxiv.org/abs/2602.17424v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":5.0,"composite":4.1,"summary":"Proposes revised cross-document coreference annotation scheme treating discourse elements as conceptual units, accommodating identity and near-identity relations. Reannotates NewsWCL50 and ECB+ subset to capture lexical diversity and framing variation in polarized news coverage.","reasoning":"Dataset contribution but no code/weights for models. 
Moderate novelty in annotation methodology, limited immediate practical impact.","code_url":null,"s2_tldr":"This paper proposes a revised CDCR annotation scheme of the NewsWCL50 dataset, treating coreference chains as discourse elements (DEs) and conceptual units of analysis, and reannotates the NewsWCL50 and a subset of ECB+ using a unified codebook and evaluates the new datasets through lexical diversity metrics and a same-head-lemma baseline.","s2_paper_id":"e0b8e64534e52d6e1729ac0f3d73e401fc01e8e5","topics":"[\"Benchmark\"]"},{"id":346,"run_id":1,"domain":"aiml","arxiv_id":"2602.17051","entry_id":"","title":"Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data","authors":"[\"Deepak Uniyal\", \"Md Abul Bashar\", \"Richi Nayak\"]","abstract":"Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span across diverse languages. This study investigates how different approaches for cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013--2022) for topic discovery. The ","published":"2026-02-19T03:46:11+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.17051v1","arxiv_url":"http://arxiv.org/abs/2602.17051v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":6.0,"composite":4.1,"summary":"Evaluates four cross-lingual classification approaches for multilingual social media analysis using 9M hydrogen energy tweets (2013-2022) across English, Japanese, Hindi, Korean. Highlights trade-offs between translation and multilingual strategies for large-scale topic discovery.","reasoning":"Low code_and_weights (no code/weights mentioned). Moderate novelty as systematic comparison study. Good practical applicability for multilingual social media analysis pipelines.","code_url":null,"s2_tldr":"This study investigates how different approaches for cross-lingual text classification can support reliable analysis of global conversations, using hydrogen energy as a case study and highlights key trade-offs between translation and multilingual approaches.","s2_paper_id":"050fb8046f36f72aa887d3bfdb6176d7f94d741b","topics":"[\"Benchmark\"]"},{"id":347,"run_id":1,"domain":"aiml","arxiv_id":"2602.21137","entry_id":"","title":"UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics","authors":"[\"Joseph Raj Vishal\", \"Nagasiri Poluri\", \"Katha Naik\", \"Rutuja Patil\", \"Kashyap Hegde Kota\", \"Krishna Vinod\", \"Prithvi Jai Ramesh\", \"Mohammad Farhadi\", \"Yezhou Yang\", \"Bharatesh Chakravarthi\"]","abstract":"Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. 
It employs an event-driven dynamic blur technique to ensure privacy preservation with","published":"2026-02-24T17:33:12+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.21137v1","arxiv_url":"http://arxiv.org/abs/2602.21137v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"UDVideoQA is a traffic video QA benchmark with 28K question-answer pairs from 8 hours of annotated urban traffic footage. Features privacy-preserving dynamic blur and hierarchical reasoning taxonomy from basic understanding to counterfactual inference, benchmarking 10 SOTA VideoLMs.","reasoning":"Dataset contribution for specific domain (traffic). Limited novelty (benchmark creation), no code/weights. Niche application area (urban traffic) reduces broader applicability.","code_url":null,"s2_tldr":"The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning.","s2_paper_id":"03d3540f6bbaffd1a711c79a4dd0835c63220334","topics":"[\"Benchmark\", \"Reasoning\", \"Language Models\"]"},{"id":348,"run_id":1,"domain":"aiml","arxiv_id":"2602.20718","entry_id":"","title":"Monocular Endoscopic Tissue 3D Reconstruction with Multi-Level Geometry Regularization","authors":"[\"Yangsen Chen\", \"Hao Wang\"]","abstract":"Reconstructing deformable endoscopic tissues is crucial for achieving robot-assisted surgery. However, 3D Gaussian Splatting-based approaches encounter challenges in achieving consistent tissue surface reconstruction, while existing NeRF-based methods lack real-time rendering capabilities. In pursuit of both smooth deformable surfaces and real-time rendering, we introduce a novel approach based on 3D Gaussian Splatting. Specifically, we introduce surface-aware reconstruction, initially employing","published":"2026-02-24T09:29:36+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20718v1","arxiv_url":"http://arxiv.org/abs/2602.20718v1","comment":"ijcnn 2025","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Proposes 3D Gaussian Splatting-based approach for monocular endoscopic tissue reconstruction with multi-level geometry regularization. Combines SDF-based mesh construction with local rigidity constraints for deformable soft tissue.","reasoning":"Domain-specific medical application. No code/weights. Incremental over existing 3DGS methods. 
Limited general applicability.","code_url":null,"s2_tldr":"This work introduces surface-aware reconstruction, initially employing a Sign Distance Field-based method to construct a mesh, subsequently utilizing this mesh to constrain the Gaussian Splatting reconstruction process, and incorporates local rigidity and global non-rigidity restrictions to guide Gaussian deformation.","s2_paper_id":"f103b169e7b1736ddd45293f182789315f907ad0","topics":"[\"3D / Vision\", \"Robotics\"]"},{"id":349,"run_id":1,"domain":"aiml","arxiv_id":"2602.20700","entry_id":"","title":"NGL-Prompter: Training-Free Sewing Pattern Estimation from a Single Image","authors":"[\"Anna Badalyan\", \"Pratheba Selvaraju\", \"Giorgio Becherini\", \"Omid Taheri\", \"Victoria Fernandez Abrevaya\", \"Michael Black\"]","abstract":"Estimating sewing patterns from images is a practical approach for creating high-quality 3D garments. Due to the lack of real-world pattern-image paired data, prior approaches fine-tune large vision language models (VLMs) on synthetic garment datasets generated by randomly sampling from a parametric garment model GarmentCode. However, these methods often struggle to generalize to in-the-wild images, fail to capture real-world correlations between garment parts, and are typically restricted to si","published":"2026-02-24T09:01:11+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20700v1","arxiv_url":"http://arxiv.org/abs/2602.20700v1","comment":"10 pages, 7 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"NGL-Prompter introduces training-free pipeline for sewing pattern estimation from images using large VLMs. Proposes Natural Garment Language (NGL) as intermediate representation that bridges VLM descriptions and parametric GarmentCode, enabling multi-layer garment reconstruction.","reasoning":"No code/weights. Domain-specific (fashion/garment). Clever prompting approach but limited architectural novelty. Incremental over existing VLM applications.","code_url":null,"s2_tldr":"NGL (Natural Garment Language), a novel intermediate language that restructures GarmentCode into a representation more understandable to language models is proposed, and NGL-Prompter, a training-free pipeline that queries large VLMs to extract structured garment parameters, which are then deterministically mapped to valid GarmentCode are introduced.","s2_paper_id":"26ed24de00e32862a1ea041b82f4378ad17ab94e","topics":"[\"Language Models\", \"Multimodal\", \"3D / Vision\"]"},{"id":350,"run_id":1,"domain":"aiml","arxiv_id":"2602.20658","entry_id":"","title":"Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video","authors":"[\"Mohammad Sadra Rajabi\", \"Aanuoluwapo Ojelade\", \"Sunwook Kim\", \"Maury A. Nussbaum\"]","abstract":"Manual lifting tasks are a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and informing ergonomic interventions. 
The Revised NIOSH Lifting Equation (RNLE) is a widely used ergonomic risk assessment tool for lifting tasks that relies on six task variables, including horizontal (H) and vertical (V) hand distances; such distances are typically obtained through manual measurement or specialized sensi","published":"2026-02-24T08:01:49+00:00","categories":"[\"cs.CV\", \"cs.AI\", \"cs.HC\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.20658v1","arxiv_url":"http://arxiv.org/abs/2602.20658v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"Evaluates vision-language models for ergonomic assessment, estimating horizontal/vertical hand distances from RGB video for NIOSH Lifting Equation. Develops detection and segmentation-based pipelines achieving ~6-8cm errors for practical workplace safety applications.","reasoning":"No code/weights. Domain-specific (ergonomics/safety). Application paper with limited ML novelty. Niche use case.","code_url":null,"s2_tldr":"The feasibility of using innovative vision-language models (VLMs) to non-invasively estimate H and V from RGB video streams is evaluated, and the results support the feasibility of VLM-based pipelines for video-based estimation of RNLE distance parameters.","s2_paper_id":"6a8c16a64f35dea159a22b1d38251c676d1b7ce5","topics":"[\"Language Models\", \"Multimodal\"]"},{"id":351,"run_id":1,"domain":"aiml","arxiv_id":"2602.20584","entry_id":"","title":"Long-Term Multi-Session 3D Reconstruction Under Substantial Appearance Change","authors":"[\"Beverley Gorry\", \"Tobias Fischer\", \"Michael Milford\", \"Alejandro Fontan\"]","abstract":"Long-term environmental monitoring requires the ability to reconstruct and align 3D models across repeated site visits separated by months or years. However, existing Structure-from-Motion (SfM) pipelines implicitly assume near-simultaneous image capture and limited appearance change, and therefore fail when applied to long-term monitoring scenarios such as coral reef surveys, where substantial visual and structural change is common. In this paper, we show that the primary limitation of current ","published":"2026-02-24T06:12:51+00:00","categories":"[\"cs.CV\", \"cs.RO\"]","pdf_url":"https://arxiv.org/pdf/2602.20584v1","arxiv_url":"http://arxiv.org/abs/2602.20584v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Proposes joint SfM reconstruction across long-term sessions (years apart) by enforcing cross-session correspondences within reconstruction rather than post-hoc alignment. Combines handcrafted and learned features for coral reef monitoring under substantial appearance change.","reasoning":"No code/weights provided. Incremental improvement to SfM for specialized domain (coral reefs, environmental monitoring). 
Limited general applicability.","code_url":null,"s2_tldr":"","s2_paper_id":"17431b8b2d7f2e6f87932a9cfe376809a329318f","topics":"[\"3D / Vision\"]"},{"id":352,"run_id":1,"domain":"aiml","arxiv_id":"2602.20539","entry_id":"","title":"Progressive Per-Branch Depth Optimization for DEFOM-Stereo and SAM3 Joint Analysis in UAV Forestry Applications","authors":"[\"Yida Lin\", \"Bing Xue\", \"Mengjie Zhang\", \"Sam Schofield\", \"Richard Green\"]","abstract":"Accurate per-branch 3D reconstruction is a prerequisite for autonomous UAV-based tree pruning; however, dense disparity maps from modern stereo matchers often remain too noisy for individual branch analysis in complex forest canopies. This paper introduces a progressive pipeline integrating DEFOM-Stereo foundation-model disparity estimation, SAM3 instance segmentation, and multi-stage depth optimization to deliver robust per-branch point clouds. Starting from a naive baseline, we systematically ","published":"2026-02-24T04:37:18+00:00","categories":"[\"eess.IV\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20539v1","arxiv_url":"http://arxiv.org/abs/2602.20539v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"Progressive pipeline combining DEFOM-Stereo, SAM3, and multi-stage depth optimization for per-branch 3D reconstruction in UAV forestry. Reduces per-branch depth std by 82% through skeleton-preserving erosion, color validation, and five-stage depth refinement.","reasoning":"Code/data released but unclear on model weights. Incremental engineering improvements for specialized domain (UAV forestry). Limited novelty but solid practical results for niche application.","code_url":null,"s2_tldr":"A progressive pipeline integrating DEFOM-Stereo foundation-model disparity estimation, SAM3 instance segmentation, and multi-stage depth optimization to deliver robust per-branch point clouds suitable for autonomous pruning tool positioning is introduced.","s2_paper_id":"ae04b629262b8a20a735aed792d7fec11ad2807e","topics":"[\"Optimization\", \"Language Models\", \"Efficiency\"]"},{"id":353,"run_id":1,"domain":"aiml","arxiv_id":"2602.20531","entry_id":"","title":"A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata","authors":"[\"Azrin Sultana\", \"Firoz Ahmed\"]","abstract":"App ratings are among the most significant indicators of the quality, usability, and overall user satisfaction of mobile applications. However, existing app rating prediction models are largely limited to textual data or user interface (UI) features, overlooking the importance of jointly leveraging UI and semantic information. To address these limitations, this study proposes a lightweight vision--language framework that integrates both mobile UI and semantic information for app rating predictio","published":"2026-02-24T04:17:50+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20531v1","arxiv_url":"http://arxiv.org/abs/2602.20531v1","comment":"24 pages, 10 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"Lightweight vision-language framework for app rating prediction combining MobileNetV3 (UI features) and DistilBERT (text features) with gated fusion. 
Achieves MAE 0.1060, R\u00b2 0.8529 on mobile UI dataset with efficient edge deployment potential.","reasoning":"No code/weights provided. Straightforward combination of existing encoders for specialized task (app rating prediction). Practical for developers but limited novelty and general applicability.","code_url":null,"s2_tldr":"A lightweight vision-language framework that integrates both mobile UI and semantic information for app rating prediction, which combines MobileNetV3 to extract visual features from UI layouts and DistilBERT to extract textual features, is proposed.","s2_paper_id":"1e2b522c8f38312117b62039a11a4177d66fa20e","topics":"[\"Multimodal\"]"},{"id":355,"run_id":1,"domain":"aiml","arxiv_id":"2602.20084","entry_id":"","title":"Do Large Language Models Understand Data Visualization Principles?","authors":"[\"Martin Sinnona\", \"Valentin Bonas\", \"Viviana Siless\", \"Emmanuel Iarussi\"]","abstract":"Data visualization principles, derived from decades of research in design and perception, ensure proper visual communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they and their vision-language counterparts (VLMs) can reason about and enforce visualization principles directly. Constraint-based systems encode these principles as logical rules for precise automated checks, but translating them into f","published":"2026-02-23T17:51:06+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20084v1","arxiv_url":"http://arxiv.org/abs/2602.20084v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"This paper evaluates whether LLMs and VLMs can reason about data visualization principles using Answer Set Programming ground truth. It assesses models on detecting and correcting principle violations in Vega-Lite charts, revealing an asymmetry where models are better at fixing than detecting violations.","reasoning":"No code or weights available; evaluation-focused study with limited novelty; narrow applicability to data visualization domain.","code_url":null,"s2_tldr":"This work presents the first systematic evaluation of both LLMs and VLMs on their ability to reason about visualization principles, using hard verification ground truth derived from Answer Set Programming (ASP).","s2_paper_id":"c7af56a88e68f422edc047ade44b2994d7dc7d6a","topics":"[\"Language Models\", \"Multimodal\", \"Reasoning\"]"},{"id":356,"run_id":1,"domain":"aiml","arxiv_id":"2602.20041","entry_id":"","title":"EEG-Driven Intention Decoding: Offline Deep Learning Benchmarking on a Robotic Rover","authors":"[\"Ghadah Alosaimi\", \"Maha Alsayyari\", \"Yixin Sun\", \"Stamos Katsigiannis\", \"Amir Atapour-Abarghouei\", \"Toby P. Breckon\"]","abstract":"Brain-computer interfaces (BCIs) provide a hands-free control modality for mobile robotics, yet decoding user intent during real-world navigation remains challenging. This work presents a brain-robot control framework for offline decoding of driving commands during robotic rover operation. A 4WD Rover Pro platform was remotely operated by 12 participants who navigated a predefined route using a joystick, executing the commands forward, reverse, left, right, and stop. 
Electroencephalogram (EEG) s","published":"2026-02-23T16:50:21+00:00","categories":"[\"cs.RO\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20041v1","arxiv_url":"http://arxiv.org/abs/2602.20041v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"A BCI framework for offline EEG-driven robotic rover control using deep learning models. ShallowConvNet achieved the best performance for action and intent prediction from 16-channel EEG signals during navigation tasks, providing a reproducible benchmark for BCI systems.","reasoning":"No code/weights indicated; incremental application of existing deep learning to BCI; limited to offline analysis and specific hardware.","code_url":null,"s2_tldr":"By combining real-world robotic control with multi-horizon EEG intention decoding, this study introduces a reproducible benchmark and reveals key design insights for predictive deep learning-based BCI systems.","s2_paper_id":"999cb8dbee00e86c6551bf532a33c9b467ebb0ce","topics":"[\"Robotics\", \"Benchmark\"]"},{"id":357,"run_id":1,"domain":"aiml","arxiv_id":"2602.19874","entry_id":"","title":"BigMaQ: A Big Macaque Motion and Animation Dataset Bridging Image and 3D Pose Representations","authors":"[\"Lucas Martini\", \"Alexander Lappe\", \"Anna Bogn\\u00e1r\", \"Rufin Vogels\", \"Martin A. Giese\"]","abstract":"The recognition of dynamic and social behavior in animals is fundamental for advancing ethology, ecology, medicine and neuroscience. Recent progress in deep learning has enabled automated behavior recognition from video, yet an accurate reconstruction of the three-dimensional (3D) pose and shape has not been integrated into this process. Especially for non-human primates, mesh-based tracking efforts lag behind those for other species, leaving pose descriptions restricted to sparse keypoints that","published":"2026-02-23T14:21:15+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19874v1","arxiv_url":"http://arxiv.org/abs/2602.19874v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"BigMaQ introduces a large-scale dataset of 750+ scenes of interacting rhesus macaques with detailed 3D pose descriptions using textured avatars. 
The dataset provides BigMaQ500, an action recognition benchmark demonstrating substantial mAP improvements when pose information is included.","reasoning":"Code and data publicly available; moderate novelty as dataset contribution; limited applicability outside primate research domain.","code_url":null,"s2_tldr":"The BigMaQ dataset is introduced, a large-scale dataset comprising more than 750 scenes of interacting rhesus macaques with detailed 3D pose descriptions that establishes the first dataset that both integrates dynamic 3D pose-shape representations into the learning task of animal action recognition and provides a rich resource to advance the study of visual appearance, posture, and social interaction in non-human primates.","s2_paper_id":"f74d83fae9df475bd4631b54199c449b9c55d257","topics":"[\"3D / Vision\", \"Benchmark\"]"},{"id":358,"run_id":1,"domain":"aiml","arxiv_id":"2602.19828","entry_id":"","title":"TextShield-R1: Reinforced Reasoning for Tampered Text Detection","authors":"[\"Chenfan Qu\", \"Yiwu Zhong\", \"Jian Liu\", \"Xuekang Zhu\", \"Bohan Yu\", \"Lianwen Jin\"]","abstract":"The growing prevalence of tampered images poses serious security threats, highlighting the urgent need for reliable detection methods. Multimodal large language models (MLLMs) demonstrate strong potential in analyzing tampered images and generating interpretations. However, they still struggle with identifying micro-level artifacts, exhibit low accuracy in localizing tampered text regions, and heavily rely on expensive annotations for forgery interpretation. To this end, we introduce TextShield-","published":"2026-02-23T13:26:18+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19828v1","arxiv_url":"http://arxiv.org/abs/2602.19828v1","comment":"AAAI 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"TextShield-R1 introduces first RL-based MLLM for tampered text detection with forensic reasoning. Includes new TFR benchmark with 45k+ images across 16 languages and 10 tampering techniques, achieving 96.2% accuracy and 70% user preference.","reasoning":"No code/weights available despite strong practical application. Novel use of RL for forensic reasoning but lacks open artifacts.","code_url":null,"s2_tldr":"This approach introduces Forensic Continual Pre-training, an easy-to-hard curriculum that prepares the MLLM for tampered text detection by harnessing large-scale, cheap data from natural image forensics and OCR tasks, and introduces the Text Forensics Reasoning (TFR) benchmark, comprising over 45k real and tampered images across 16 languages, 10 tampering techniques, and diverse domains.","s2_paper_id":"1bc859b364aa35331b2d1a32e4b11e0d2f3eb799","topics":"[\"Reasoning\", \"Language Models\", \"Multimodal\"]"},{"id":360,"run_id":1,"domain":"aiml","arxiv_id":"2602.19442","entry_id":"","title":"UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment","authors":"[\"Yecheng Zhang\", \"Rong Zhao\", \"Zhizhou Sha\", \"Yong Li\", \"Lei Wang\", \"Ce Hou\", \"Wen Ji\", \"Hao Huang\", \"Yunshan Wan\", \"Jian Yu\"]","abstract":"Aligning vision-language model (VLM) outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and GPU compute. 
We show that for subjective perception tasks, this alignment can be achieved without any model training: VLMs are already strong concept extractors but poor decision calibrators, and the gap can be closed externally. We propose a training-free post-hoc concept-bottleneck pipeline consisting of th","published":"2026-02-23T02:24:55+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19442v1","arxiv_url":"http://arxiv.org/abs/2602.19442v1","comment":"26 pages","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"UrbanAlign provides training-free post-hoc calibration for VLM outputs on subjective urban perception tasks through concept mining, multi-agent scoring, and geometric calibration. Achieves 72.2% accuracy on Place Pulse 2.0, outperforming baselines by +15pp without modifying model weights.","reasoning":"Incremental approach to VLM calibration for narrow domain application. Training-free is useful but limited scope and no code/weights.","code_url":null,"s2_tldr":"A training-free post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages: concept mining, multi-agent structured scoring, and geometric calibration, unified by an end-to-end dimension optimization loop, is proposed.","s2_paper_id":"678bbff48353d358d377dcb2c98a04fbf465173c","topics":"[\"Multimodal\", \"Training\", \"Language Models\"]"},{"id":361,"run_id":1,"domain":"aiml","arxiv_id":"2602.19437","entry_id":"","title":"FinSight-Net: A Physics-Aware Decoupled Network with Frequency-Domain Compensation for Underwater Fish Detection in Smart Aquaculture","authors":"[\"Jinsong Yang\", \"Zeyuan Hu\", \"Yichen Li\", \"Hong Yu\"]","abstract":"Underwater fish detection (UFD) is a core capability for smart aquaculture and marine ecological monitoring. While recent detectors improve accuracy by stacking feature extractors or introducing heavy attention modules, they often incur substantial computational overhead and, more importantly, neglect the physics that fundamentally limits UFD: wavelength-dependent absorption and turbidity-induced scattering significantly degrade contrast, blur fine structures, and introduce backscattering noise,","published":"2026-02-23T02:12:47+00:00","categories":"[\"cs.CV\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19437v1","arxiv_url":"http://arxiv.org/abs/2602.19437v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"FinSight-Net targets underwater fish detection by addressing physics-based degradation (wavelength absorption, scattering) through multi-scale decoupled dual-stream processing and frequency-domain compensation. 
Achieves 92.8% mAP with 29% fewer parameters than YOLOv11s on specialized aquaculture benchmarks.","reasoning":"Domain-specific application (aquaculture) with physics-aware design, but limited generalizability and no code/weights available.","code_url":null,"s2_tldr":"FinSight-Net is proposed, an efficient and physics-aware detection framework tailored for complex aquaculture environments that achieves state-of-the-art performance and introduces a Multi-Scale Decoupled Dual-Stream Processing bottleneck that explicitly targets frequency-specific information loss via heterogeneous convolutional branches.","s2_paper_id":"f5b61821ef454d5d5a6a111ae1b60dc2bed008d7","topics":"[\"Reasoning\"]"},{"id":362,"run_id":1,"domain":"aiml","arxiv_id":"2602.19123","entry_id":"","title":"StreetTree: A Large-Scale Global Benchmark for Fine-Grained Tree Species Classification","authors":"[\"Jiapeng Li\", \"Yingjing Huang\", \"Fan Zhang\", \"Yu liu\"]","abstract":"The fine-grained classification of street trees is a crucial task for urban planning, streetscape management, and the assessment of urban ecosystem services. However, progress in this field has been significantly hindered by the lack of large-scale, geographically diverse, and publicly available benchmark datasets specifically designed for street trees. To address this critical gap, we introduce StreetTree, the world's first large-scale benchmark dataset dedicated to fine-grained street tree cla","published":"2026-02-22T10:43:43+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19123v1","arxiv_url":"http://arxiv.org/abs/2602.19123v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"StreetTree introduces a large-scale benchmark with 12M+ images covering 8,300+ street tree species from 133 countries for fine-grained classification. Includes hierarchical taxonomy and establishes baselines showing limitations of existing vision models under urban complexity.","reasoning":"Large dataset contribution but primarily a benchmark paper. Narrow domain (urban trees), no novel methods or models.","code_url":null,"s2_tldr":"","s2_paper_id":"916657ec3cfca7923deb1e98da1f68e9a181c58f","topics":"[\"Benchmark\", \"Agents\"]"},{"id":363,"run_id":1,"domain":"aiml","arxiv_id":"2602.19055","entry_id":"","title":"Automated Disentangling Analysis of Skin Colour for Lesion Images","authors":"[\"Wenbo Yang\", \"Eman Rezk\", \"Walaa M. Moursi\", \"Zhou Wang\"]","abstract":"Machine-learning models working on skin images often have degraded performance when the skin colour captured in images (SCCI) differs between training and deployment. Such differences arise from entangled environmental factors (e.g., illumination, camera settings), and intrinsic factors (e.g., skin tone) that cannot be accurately described by a single \"skin tone\" scalar. 
To mitigate such colour mismatch, we propose a skin-colour disentangling framework that adapts disentanglement-by-compression ","published":"2026-02-22T05:52:39+00:00","categories":"[\"eess.IV\", \"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19055v1","arxiv_url":"http://arxiv.org/abs/2602.19055v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Proposes a skin-colour disentangling framework for dermatology images using disentanglement-by-compression with randomized decolourization and geometry-aligned post-processing. Enables counterfactual editing and colour transfer for educational visualization, achieving competitive lesion classification through dataset augmentation.","reasoning":"Narrow medical imaging application with limited general applicability. Incremental approach using existing techniques, no code/weights.","code_url":null,"s2_tldr":"A skin-colour disentangling framework that adapts disentanglement-by-compression to learn a structured, manipulable latent space for SCCI from unlabelled dermatology images is proposed and it is demonstrated that dataset-level augmentation and colour normalization based on this framework achieve competitive lesion classification performance.","s2_paper_id":"3bb881216b9477ce719522986b53ae5e1be3e392","topics":"[\"Efficiency\"]"},{"id":365,"run_id":1,"domain":"aiml","arxiv_id":"2602.18965","entry_id":"","title":"Face Presentation Attack Detection via Content-Adaptive Spatial Operators","authors":"[\"Shujaat Khan\"]","abstract":"Face presentation attack detection (FacePAD) is critical for securing facial authentication against print, replay, and mask-based spoofing. This paper proposes CASO-PAD, an RGB-only, single-frame model that enhances MobileNetV3 with content-adaptive spatial operators (involution) to better capture localized spoof cues. Unlike spatially shared convolution kernels, the proposed operator generates location-specific, channel-shared kernels conditioned on the input, improving spatial selectivity with","published":"2026-02-21T22:13:31+00:00","categories":"[\"cs.CV\", \"eess.IV\"]","pdf_url":"https://arxiv.org/pdf/2602.18965v1","arxiv_url":"http://arxiv.org/abs/2602.18965v1","comment":"14 Pages, 8 Figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"CASO-PAD introduces content-adaptive spatial operators (involution) to MobileNetV3 for single-frame RGB face presentation attack detection. The model achieves strong performance (near-perfect accuracy on multiple benchmarks) while remaining lightweight (3.6M params, 0.64 GFLOPs), making it suitable for on-device deployment without auxiliary sensors.","reasoning":"Scores low on code_and_weights (no GitHub/HF links, 0 upvotes) but offers practical mobile-class efficiency and decent novelty through adaptive spatial operators. 
Security applications limit broader interest.","code_url":null,"s2_tldr":"CASO-PAD is proposed, an RGB-only, single-frame model that enhances MobileNetV3 with content-adaptive spatial operators (involution) to better capture localized spoof cues, and provides a practical pathway for robust, on-device FacePAD with mobile-class compute and without auxiliary sensors or temporal stacks.","s2_paper_id":"3fabf8673563f75cd8dc681e143b3fe0ffecc299","topics":"[]"},{"id":366,"run_id":1,"domain":"aiml","arxiv_id":"2602.20976","entry_id":"","title":"Evaluating Proactive Risk Awareness of Large Language Models","authors":"[\"Xuan Luo\", \"Yubin Chen\", \"Zhiyu Hou\", \"Linpu Yu\", \"Geng Tu\", \"Jing Li\", \"Ruifeng Xu\"]","abstract":"As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks. In this work, we introduce a proactive risk awareness evaluation framework that measures whether LLMs can anticipate potential harms and provide warnings before damage occurs. We construct the Butterfly dataset to instantiate this framework in the environmental and ecological ","published":"2026-02-24T15:00:00+00:00","categories":"[\"cs.CL\", \"cs.CY\"]","pdf_url":"https://arxiv.org/pdf/2602.20976v1","arxiv_url":"http://arxiv.org/abs/2602.20976v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Introduces a proactive risk awareness evaluation framework and Butterfly dataset with 1,094 ecological queries to measure whether LLMs can anticipate unintended harms before damage occurs. Experiments across five LLMs reveal significant performance gaps under length constraints and cross-lingual settings.","reasoning":"Safety evaluation work with domain-specific dataset but no code/weights, limited practitioner utility beyond safety research, incremental to existing safety benchmarks.","code_url":null,"s2_tldr":"A proactive risk awareness evaluation framework that measures whether LLMs can anticipate potential harms and provide warnings before damage occurs is introduced, and the Butterfly dataset is constructed to instantiate this framework in the environmental and ecological domain.","s2_paper_id":"52b5d8d6e85fc7aedd2b6ce313cacf27e5c73e7c","topics":"[\"Language Models\", \"Benchmark\", \"Reasoning\"]"},{"id":367,"run_id":1,"domain":"aiml","arxiv_id":"2602.20966","entry_id":"","title":"Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models","authors":"[\"Paola Merlo\", \"Chunyang Jiang\", \"Giuseppe Samo\", \"Vivi Nastase\"]","abstract":"This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, within each candidate answer. 
Because of their rich structure, these curated, but naturalistic datasets are key to answer some core ques","published":"2026-02-24T14:45:08+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20966v1","arxiv_url":"http://arxiv.org/abs/2602.20966v1","comment":"Under review, 46 pages, 5 tables, 28 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"Presents Blackbird Language Matrices (BLM), intelligence test-inspired multiple-choice linguistic tasks with rich multi-level structure. Shows LLMs can solve BLMs and detect systematic patterns across sentences, though primarily focused on linguistic analysis rather than practical applications.","reasoning":"Curated evaluation dataset for linguistic competence with some code likely available, but focus is on interpretability research rather than breakthrough methods.","code_url":null,"s2_tldr":"It is shown that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models.","s2_paper_id":"faa770fc44a31c66e7cdd9b501fb8acb95eb12f7","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":368,"run_id":1,"domain":"aiml","arxiv_id":"2602.20918","entry_id":"","title":"Predicting Sentence Acceptability Judgments in Multimodal Contexts","authors":"[\"Hyewon Jang\", \"Nikolai Ilinykh\", \"Sharid Lo\\u00e1iciga\", \"Jey Han Lau\", \"Shalom Lappin\"]","abstract":"Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both independently of context, and in document contexts. We consider the effect of prior exposure to visual images (i.e., visual context) on these judgments for humans and large language models (LLMs). Our results suggest that, in contrast to textual context, visual images appear to have little if any impact on human acceptability ratings. However,","published":"2026-02-24T13:54:38+00:00","categories":"[\"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20918v1","arxiv_url":"http://arxiv.org/abs/2602.20918v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Studies impact of visual context on sentence acceptability judgments in humans and LLMs. 
Finds visual images have little effect on human ratings but cause compression effects in LLMs, with model-specific variations in judgment distribution patterns.","reasoning":"Multimodal evaluation study without code/weights, interesting for research but limited immediate practical application.","code_url":null,"s2_tldr":"The results suggest that, in contrast to textual context, visual images appear to have little if any impact on human acceptability ratings; however, LLMs display the compression effect seen in previous work on human judgments in document contexts.","s2_paper_id":"580b64457e203d8b577fcec939274bc740064563","topics":"[\"Multimodal\", \"Language Models\", \"Architecture\"]"},{"id":369,"run_id":1,"domain":"aiml","arxiv_id":"2602.20859","entry_id":"","title":"FinAnchor: Aligned Multi-Model Representations for Financial Prediction","authors":"[\"Zirui He\", \"Huopu Zhang\", \"Yanguang Liu\", \"Sirui Wu\", \"Mengnan Du\"]","abstract":"Financial prediction from long documents involves significant challenges, as actionable signals are often sparse and obscured by noise, and the optimal LLM for generating embeddings varies across tasks and time periods. In this paper, we propose FinAnchor (Financial Anchored Representations), a lightweight framework that integrates embeddings from multiple LLMs without fine-tuning the underlying models. FinAnchor addresses the incompatibility of feature spaces by selecting an anchor embedding spa","published":"2026-02-24T13:02:09+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20859v1","arxiv_url":"http://arxiv.org/abs/2602.20859v1","comment":"11 pages, 4 figures, 5 tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"FinAnchor aligns embeddings from multiple LLMs for financial prediction without fine-tuning by selecting an anchor space and learning linear mappings. Lightweight framework consistently outperforms single-model baselines across multiple financial NLP tasks.","reasoning":"Practical ensemble approach for financial domain but no code/weights available, incremental to existing ensemble methods.","code_url":null,"s2_tldr":"Across multiple financial NLP tasks, FinAnchor consistently outperforms strong single-model baselines and standard ensemble methods, demonstrating the effectiveness of anchoring heterogeneous representations for robust financial prediction.","s2_paper_id":"899cd7b7abefad12226acb85c4a5d22f84686bac","topics":"[\"Language Models\", \"Retrieval / RAG\"]"},{"id":370,"run_id":1,"domain":"aiml","arxiv_id":"2602.20513","entry_id":"","title":"From Performance to Purpose: A Sociotechnical Taxonomy for Evaluating Large Language Model Utility","authors":"[\"Gavin Levinson\", \"Keith Feldman\"]","abstract":"As large language models (LLMs) continue to improve at completing discrete tasks, they are being integrated into increasingly complex and diverse real-world systems. However, task-level success alone does not establish a model's fit for use in practice. In applied, high-stakes settings, LLM effectiveness is driven by a wider array of sociotechnical determinants that extend beyond conventional performance measures. 
Although a growing set of metrics capture many of these considerations, they are r","published":"2026-02-24T03:31:07+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20513v1","arxiv_url":"http://arxiv.org/abs/2602.20513v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"LUX proposes a taxonomy for evaluating LLM utility beyond task performance, organized into four domains: performance, interaction, operations, and governance. Provides structured framework with hierarchical dimensions and components, accompanied by a web tool connecting components to relevant metrics for applied evaluation in high-stakes enterprise settings.","reasoning":"Incremental work organizing existing evaluation concepts into taxonomy. Useful for practitioners but not architecturally novel. No models or code, purely framework/survey-oriented.","code_url":null,"s2_tldr":"The Language Model Utility Taxonomy (LUX) is introduced, a comprehensive framework that structures utility evaluation across four domains: performance, interaction, operations, and governance.","s2_paper_id":"07aa3b45b548494b8b9e6d57d0b1e2dd75ae65f9","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":371,"run_id":1,"domain":"aiml","arxiv_id":"2602.20336","entry_id":"","title":"Natural Language Processing Models for Robust Document Categorization","authors":"[\"Radoslaw Roszczyk\", \"Pawel Tecza\", \"Maciej Stodolski\", \"Krzysztof Siwek\"]","abstract":"This article presents an evaluation of several machine learning methods applied to automated text classification, alongside the design of a demonstrative system for unbalanced document categorization and distribution. The study focuses on balancing classification accuracy with computational efficiency, a key consideration when integrating AI into real-world automation pipelines. Three models of varying complexity were examined: a Naive Bayes classifier, a bidirectional LSTM network, and a fine t","published":"2026-02-23T20:33:22+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20336v1","arxiv_url":"http://arxiv.org/abs/2602.20336v1","comment":"13 pages, 1 figure, 5 tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":6.0,"composite":3.75,"summary":"Compares Naive Bayes, BiLSTM, and fine-tuned BERT for document classification. BERT achieves 99%+ accuracy but requires significant resources; BiLSTM provides 98.56% accuracy with moderate cost; Naive Bayes trains in milliseconds but only reaches 94.5%. Implements demonstrative system for automated routing of technical requests.","reasoning":"Standard comparison of well-known methods with no novel techniques. 
Practical system implementation but purely incremental work surveying existing approaches.","code_url":null,"s2_tldr":"The study concludes that BiLSTM offers the most balanced solution for the examined scenario, while also outlining opportunities for future improvements and further exploration of transformer architectures.","s2_paper_id":"bda709906cfdb5c855e9b04028f7fca44e1197d7","topics":"[\"Benchmark\"]"},{"id":372,"run_id":1,"domain":"aiml","arxiv_id":"2602.20092","entry_id":"","title":"BabyLM Turns 4 and Goes Multilingual: Call for Papers for the 2026 BabyLM Workshop","authors":"[\"Leshem Choshen\", \"Ryan Cotterell\", \"Mustafa Omer Gul\", \"Jaap Jumelet\", \"Tal Linzen\", \"Aaron Mueller\", \"Suchir Salhan\", \"Raj Sanjay Shah\", \"Alex Warstadt\", \"Ethan Gotlieb Wilcox\"]","abstract":"The goal of the BabyLM is to stimulate new research connections between cognitive modeling and language model pretraining. We invite contributions in this vein to the BabyLM Workshop, which will also include the 4th iteration of the BabyLM Challenge. As in previous years, the challenge features two \"standard\" tracks (Strict and Strict-Small), in which participants must train language models on under 100M or 10M words of data, respectively. This year, we move beyond our previous English-only pr","published":"2026-02-23T18:02:23+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20092v2","arxiv_url":"http://arxiv.org/abs/2602.20092v2","comment":"8 pages, 1 table. arXiv admin note: substantial text overlap with arXiv:2502.10645","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"Call for papers for 4th BabyLM Challenge and Workshop focusing on cognitive modeling and efficient language model pretraining. Features Strict (100M words), Strict-Small (10M words), and new Multilingual track (English, Dutch, Chinese). Solicits papers on training efficiency, small-scale datasets, cognitive modeling, evaluation, and architecture innovation.","reasoning":"Workshop call for papers with no novel research contributions. Useful for community but not a research paper. Multilingual track is incremental extension.","code_url":null,"s2_tldr":"The goal of BabyLM is to stimulate new research connections between cognitive modeling and language model pretraining, and the workshop calls for papers related to the overall theme of BabyLM, which includes training efficiency, small-scale training datasets, cognitive modeling, model evaluation, and architecture innovation.","s2_paper_id":"7fbf39885d4a0c46215637b6c6431a52d1fbc766","topics":"[\"Language Models\"]"},{"id":373,"run_id":1,"domain":"aiml","arxiv_id":"2602.20065","entry_id":"","title":"Multilingual Large Language Models do not comprehend all natural languages to equal degrees","authors":"[\"Natalia Moskvina\", \"Raquel Montero\", \"Masaya Yoshida\", \"Ferdy Hubers\", \"Paolo Morosi\", \"Walid Irhaymi\", \"Jin Yan\", \"Tamara Serrano\", \"Elena Pagliarini\", \"Fritz G\\u00fcnther\"]","abstract":"Large Language Models (LLMs) play a critical role in how humans access information. While their core use relies on comprehending written requests, our understanding of this ability is currently limited, because most benchmarks evaluate LLMs in high-resource languages predominantly spoken by Western, Educated, Industrialised, Rich, and Democratic (WEIRD) communities. 
The default assumption is that English is the best-performing language for LLMs, while smaller, low-resource languages are linked t","published":"2026-02-23T17:22:46+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.20065v1","arxiv_url":"http://arxiv.org/abs/2602.20065v1","comment":"36 pages, 3 figures, 2 tables, 4 supplementary tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Evaluates language comprehension abilities of 3 popular multilingual LLMs across 12 typologically diverse languages from five language families. Finds models exhibit high linguistic accuracy but fall behind human baselines in all languages. Contrary to expectations, English is not best-performing and is systematically outperformed by several Romance languages.","reasoning":"Useful empirical study on multilingual comprehension with surprising findings. Incremental evaluation work without novel methods. No models or code released.","code_url":null,"s2_tldr":"The role of several factors that drive LLM performance, such as tokenization, language distance from Spanish and English, size of training data, and data origin in high- vs. low-resource languages and WEIRD vs. non-WEIRD communities is discussed.","s2_paper_id":"ae72d58fd5510984f1776b22a338912e80fe0ef4","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":374,"run_id":1,"domain":"aiml","arxiv_id":"2602.19991","entry_id":"","title":"Cross-lingual Matryoshka Representation Learning across Speech and Text","authors":"[\"Yaya Sy\", \"Dioula Doucour\\u00e9\", \"Christophe Cerisara\", \"Irina Illina\"]","abstract":"Speakers of under-represented languages face both a language barrier, as most online knowledge is in a few dominant languages, and a modality barrier, since information is largely text-based while many languages are primarily oral. We address this for French-Wolof by training the first bilingual speech-text Matryoshka embedding model, enabling efficient retrieval of French text from Wolof speech queries without relying on a costly ASR-translation pipeline. We introduce large-scale data curation","published":"2026-02-23T15:57:16+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19991v1","arxiv_url":"http://arxiv.org/abs/2602.19991v1","comment":"Preprint, under review","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Introduces first bilingual French-Wolof speech-text Matryoshka embedding model for cross-lingual retrieval without ASR-translation pipeline. Enables efficient retrieval of French text from Wolof speech queries. Generalizes to speech intent detection, demonstrating learning of general semantic representations.","reasoning":"Addresses under-resourced languages but very specific use case (French-Wolof). No code/weights. 
Limited applicability beyond this language pair.","code_url":null,"s2_tldr":"This work trains the first bilingual speech-text Matryoshka embedding model, enabling efficient retrieval of French text from Wolof speech queries without relying on a costly ASR-translation pipeline, and analyzes cost-accuracy trade-offs across Matryoshka dimensions and ranks.","s2_paper_id":"7f8394008d71017f1e2810126af853f670935c03","topics":"[\"Speech / Audio\", \"Efficiency\", \"Retrieval / RAG\"]"},{"id":375,"run_id":1,"domain":"aiml","arxiv_id":"2602.19969","entry_id":"","title":"ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting","authors":"[\"Yuxing Tian\", \"Fengran Mo\", \"Weixu Zhang\", \"Yiyan Qi\", \"Jian-Yun Nie\"]","abstract":"The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for the zero-shot re-ranking task. Attention-based re-ranking methods, which derive relevance scores directly from attention weights, offer an efficient and interpretable alternative to generation-based re-ranking methods. However, they still face two major limitations. First, attention signals are highly concentrated on a small subset of tokens within a few documents, making others indistinguishable. Second,","published":"2026-02-23T15:30:52+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19969v1","arxiv_url":"http://arxiv.org/abs/2602.19969v1","comment":"Accepted by EACL2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"Proposes ReAttn, a post-hoc re-weighting strategy for attention-based LLM re-ranking that addresses over-concentration and lexical bias. Uses cross-document IDF weighting and entropy-based regularization on existing attention weights without additional training. Shows effectiveness in re-ranking experiments.","reasoning":"Incremental improvement to existing attention-based re-ranking. No code/weights. Practical but not novel architecture or paradigm.","code_url":null,"s2_tldr":"This paper proposes a post-hoc re-weighting strategy for attention-based re-ranking methods that first computes the cross-document IDF weighting to down-weight attention on query-overlapping tokens that frequently appear across the candidate documents, reducing lexical bias and emphasizing distinctive terms.","s2_paper_id":"8372ee94829939b8a0de71dc828cc3956ed2769f","topics":"[\"Language Models\", \"Efficiency\"]"},{"id":376,"run_id":1,"domain":"aiml","arxiv_id":"2602.19919","entry_id":"","title":"Janus-Q: End-to-End Event-Driven Trading via Hierarchical-Gated Reward Modeling","authors":"[\"Xiang Li\", \"Zikai Wei\", \"Yiyan Qi\", \"Wanyun Zhou\", \"Xiang Liu\", \"Penglei Sun\", \"Yongqi Zhang\", \"Xiaowen Chu\"]","abstract":"Financial market movements are often driven by discrete financial events conveyed through news, whose impacts are heterogeneous, abrupt, and difficult to capture under purely numerical prediction objectives. These limitations have motivated growing interest in using textual information as the primary source of trading signals in learning-based systems. 
Two key challenges hinder existing approaches: (1) the absence of large-scale, event-centric datasets that jointly model news semantics and stati","published":"2026-02-23T14:58:51+00:00","categories":"[\"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19919v1","arxiv_url":"http://arxiv.org/abs/2602.19919v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Proposes Janus-Q, an end-to-end event-driven trading framework using financial news events as primary decision units. Constructs 62,400-article dataset with fine-grained event annotations and CAR. Combines supervised learning with RL using Hierarchical Gated Reward Model (HGRM), improving Sharpe Ratio by up to 102%.","reasoning":"Finance-specific application with no code/weights. Limited to financial trading domain. Incremental application of RL to specialized use case.","code_url":null,"s2_tldr":"Janus-Q is proposed, an end-to-end event-driven trading framework that elevates financial news events from auxiliary signals to primary decision units and achieves more consistent, interpretable, and profitable trading decisions than market indices and LLM baselines.","s2_paper_id":"c9929adeb75495670b68a57033fddb243218e3de","topics":"[\"Training\", \"RL\", \"Benchmark\"]"},{"id":378,"run_id":1,"domain":"aiml","arxiv_id":"2602.19569","entry_id":"","title":"Temporal-Aware Heterogeneous Graph Reasoning with Multi-View Fusion for Temporal Question Answering","authors":"[\"Wuzhenghong Wen\", \"Bowen Zhou\", \"Jinwen Huang\", \"Xianjie Wu\", \"Yuwei Sun\", \"Su Pan\", \"Liang Li\", \"Jianting Liu\"]","abstract":"Question Answering over Temporal Knowledge Graphs (TKGQA) has attracted growing interest for handling time-sensitive queries. However, existing methods still struggle with: 1) weak incorporation of temporal constraints in question representation, causing biased reasoning; 2) limited ability to perform explicit multi-hop reasoning; and 3) suboptimal fusion of language and graph representations. We propose a novel framework with temporal-aware question encoding, multi-hop graph reasoning, and mult","published":"2026-02-23T07:36:36+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.19569v1","arxiv_url":"http://arxiv.org/abs/2602.19569v1","comment":"6pages","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"Proposes temporal-aware framework for Temporal Knowledge Graph QA with: 1) constraint-aware question encoding combining LM semantics with temporal entity dynamics, 2) temporal-aware GNN for multi-hop reasoning, 3) multi-view attention for fusing question context and temporal graph knowledge. Shows improvements on TKGQA benchmarks.","reasoning":"Incremental improvement to TKGQA systems. No code/weights. 
Specific to temporal KG domain with limited broader applicability.","code_url":null,"s2_tldr":"This work proposes a novel framework with temporal-aware question encoding, multi-hop graph reasoning, and multi-view heterogeneous information fusion for time-sensitive queries and introduces a constraint-aware question representation that combines semantic cues from language models with temporal entity dynamics.","s2_paper_id":"8d1bad4b6d492df5ad8ef1013d1b44d682966cac","topics":"[\"Reasoning\", \"Retrieval / RAG\"]"},{"id":379,"run_id":1,"domain":"aiml","arxiv_id":"2602.19403","entry_id":"","title":"Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins","authors":"[\"Jasmin Han\", \"Janardan Devkota\", \"Joseph Waring\", \"Amanda Luken\", \"Felix Naughton\", \"Roger Vilardaga\", \"Jonathan Bricker\", \"Carl Latkin\", \"Meghan Moran\", \"Yiqun Chen\"]","abstract":"Perceived message effectiveness (PME) by potential intervention end-users is important for selecting and optimizing personalized smoking cessation intervention messages for mobile health (mHealth) platform delivery. This study evaluates whether large language models (LLMs) can accurately predict PME for smoking cessation messages. We evaluated multiple models for predicting PME across three domains: content quality, coping support, and quitting support. The dataset comprised 3010 message ratin","published":"2026-02-23T00:32:23+00:00","categories":"[\"cs.CL\", \"stat.AP\"]","pdf_url":"https://arxiv.org/pdf/2602.19403v1","arxiv_url":"http://arxiv.org/abs/2602.19403v1","comment":"31 pages, 5 figures, submitted to Journal of the American Medical Informatics Association (JAMIA). Drs. Chen and Thrul share last authorship","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"This study evaluates LLM-based digital twins for predicting perceived message effectiveness in smoking cessation interventions. Digital twins incorporating personal profiles outperform zero/few-shot LLMs and supervised baselines, achieving 0.49 accuracy for content quality prediction and 0.75 directional accuracy on simplified scales.","reasoning":"Medical application with limited broader applicability. Interesting use of LLM digital twins but narrow domain focus (smoking cessation), no code/weights mentioned.","code_url":null,"s2_tldr":"Digital twin predictions showed greater dispersion across rating categories, indicating improved sensitivity to individual differences in PME, and LLM-based digital twins show potential for supporting personalization of mobile smoking cessation and other health behavior change interventions.","s2_paper_id":"90f7f39b9adf47328c47b0f2c13806a3091d4b35","topics":"[\"Language Models\", \"Benchmark\", \"Optimization\"]"},{"id":380,"run_id":1,"domain":"aiml","arxiv_id":"2602.19333","entry_id":"","title":"PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification","authors":"[\"Isun Chehreh\", \"Ebrahim Ansari\"]","abstract":"This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. 
The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian soc","published":"2026-02-22T20:53:08+00:00","categories":"[\"cs.CL\", \"cs.IR\", \"cs.SI\"]","pdf_url":"https://arxiv.org/pdf/2602.19333v1","arxiv_url":"http://arxiv.org/abs/2602.19333v1","comment":"10 pages, including 1 figure","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":6.0,"composite":3.75,"summary":"PerSoMed introduces a large-scale, balanced Persian social media text classification dataset with 36,000 posts across nine categories. TookaBERT-Large achieves best performance (0.9621 F1-score), demonstrating that transformer-based models consistently outperform traditional neural networks for Persian text classification.","reasoning":"Useful dataset contribution for Persian NLP, but incremental modeling work. Dataset availability boosts practical value, though limited to specific language/domain.","code_url":null,"s2_tldr":"This research introduces the first large-scale, well-balanced Persian social media text classification dataset and provides comprehensive evaluations of cutting-edge models, establishing a solid foundation for further developments in Persian NLP, including trend analysis, social behavior modeling, and user classification.","s2_paper_id":"f0108b4773d7d028e677c746ec35de26896952f2","topics":"[\"Benchmark\", \"Reasoning\"]"},{"id":381,"run_id":1,"domain":"aiml","arxiv_id":"2602.19212","entry_id":"","title":"Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection","authors":"[\"Raihan Tanvir\", \"Md. Golam Rabiul Alam\"]","abstract":"Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semanti","published":"2026-02-22T14:48:25+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19212v1","arxiv_url":"http://arxiv.org/abs/2602.19212v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"This work proposes xDORA and RAG-Fused DORA for Bengali hateful meme detection, combining vision and multilingual text encoders with retrieval augmentation. Achieves 0.78-0.79 F1 for hate detection and 0.71-0.74 for target detection on augmented dataset, outperforming baselines including LLaVA in few-shot settings.","reasoning":"Incremental application to specific language/task (Bengali hate speech). 
Limited broader applicability, no code/weights mentioned, typical multimodal classification approach.","code_url":null,"s2_tldr":"A FAISS-based k-nearest neighbor classifier for non-parametric inference and RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning, are developed and introduced, demonstrating the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.","s2_paper_id":"e22e66fa2950f9d661ef2b4562990e40724d9113","topics":"[\"Multimodal\", \"Retrieval / RAG\", \"Benchmark\"]"},{"id":382,"run_id":1,"domain":"aiml","arxiv_id":"2602.19160","entry_id":"","title":"Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing","authors":"[\"Maciej \\u015awiechowski\", \"Adam \\u017bychowski\", \"Jacek Ma\\u0144dziuk\"]","abstract":"This paper examines the reasoning capabilities of Large Language Models (LLMs) from a novel perspective, focusing on their ability to operate within formally specified, rule-governed environments. We evaluate four LLMs (Gemini 2.5 Pro and Flash variants, Llama 3.3 70B and GPT-OSS 120B) on a suite of forward-simulation tasks, including next/multistep state formulation and legal action generation, across a diverse set of reasoning problems illustrated through General Game Playing (GGP) game insta","published":"2026-02-22T12:43:00+00:00","categories":"[\"cs.AI\", \"cs.CL\", \"cs.LO\"]","pdf_url":"https://arxiv.org/pdf/2602.19160v1","arxiv_url":"http://arxiv.org/abs/2602.19160v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"Evaluates reasoning capabilities of four LLMs on General Game Playing tasks, analyzing performance on forward simulation, state formulation, and legal action generation. Finds strong performance on most tasks with degradation over longer horizons, and identifies common reasoning errors through case-based analysis.","reasoning":"Thorough evaluation study but primarily benchmarking work. Limited novelty in methods, no code/weights, focused on characterizing existing model capabilities.","code_url":null,"s2_tldr":"Detailed case-based analysis of the LLM performance provides novel insights into common reasoning errors in the considered logic-based problem formulation, including hallucinated rules, redundant state facts, or syntactic errors.","s2_paper_id":"ce5bfb67c4e72382cad5ca0c9dfbd3d0be717b40","topics":"[\"Language Models\", \"Reasoning\", \"Benchmark\"]"},{"id":383,"run_id":1,"domain":"aiml","arxiv_id":"2602.19159","entry_id":"","title":"Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM","authors":"[\"Francesca Bianco\", \"Derek Shiller\"]","abstract":"Prior behavioural work suggests that some LLMs alter choices when options are framed as causing pain or pleasure, and that such deviations can scale with stated intensity. To bridge behavioural evidence (what the model does) with mechanistic interpretability (what computations support it), we investigate how valence-related information is represented and where it is causally used inside a transformer. 
Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map represent","published":"2026-02-22T12:42:38+00:00","categories":"[\"cs.AI\", \"cs.CL\", \"cs.LG\"]","pdf_url":"https://arxiv.org/pdf/2602.19159v1","arxiv_url":"http://arxiv.org/abs/2602.19159v1","comment":"24 pages, 8+1 Tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Investigates how Gemma-2-9B-it encodes pain/pleasure valence through mechanistic interpretability using probing, activation interventions, and dose-response analysis. Finds valence sign is linearly separable from early layers, with causal effects concentrated in late-layer attention outputs through distributed heads.","reasoning":"Novel mechanistic interpretability work but limited to single model and narrow task domain. Interesting for AI ethics/sentience discussions but low immediate practical applicability.","code_url":null,"s2_tldr":"This work supports a more evidence-driven debate on AI sentience, welfare, and governance when setting policy, auditing standards, and safety safeguards.","s2_paper_id":"101d0b03b8624965165605767daf0abf87470c2d","topics":"[\"Language Models\", \"Architecture\"]"},{"id":385,"run_id":1,"domain":"aiml","arxiv_id":"2602.18429","entry_id":"","title":"VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning","authors":"[\"Harshul Raj Surana\", \"Arijit Maji\", \"Aryan Vats\", \"Akash Ghosh\", \"Sriparna Saha\", \"Amit Sheth\"]","abstract":"Large Language Models (LLMs) have made significant progress in reasoning tasks across various domains such as mathematics and coding. However, their performance deteriorates in tasks requiring rich socio-cultural knowledge and diverse local contexts, particularly those involving Indian Culture. Existing cultural benchmarks are (i) manually crafted, (ii) limited to single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured. To add","published":"2026-02-20T18:53:07+00:00","categories":"[\"cs.CL\", \"cs.IR\"]","pdf_url":"https://arxiv.org/pdf/2602.18429v1","arxiv_url":"http://arxiv.org/abs/2602.18429v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Introduces VIRAASAT, a semi-automated multi-hop QA dataset for Indian cultural reasoning spanning 28 states/territories with 3,200+ questions from a 700+ artifact knowledge graph. Proposes Symbolic Chain-of-Manipulation (SCoM) that outperforms CoT baselines by 20% through KG-based reasoning.","reasoning":"Domain-specific dataset (Indian culture) with novel SCoM approach. Limited generalizability beyond cultural QA. 
No code/weights released despite claiming to release dataset.","code_url":null,"s2_tldr":"This work introduces VIRAASAT, a novel, semi-automated multi-hop approach for generating a culture-specific multi-hop Question-Answering dataset for Indian culture, and proposes a novel framework named Symbolic Chain-of-Manipulation (SCoM), adapting the Chain-of-Manipulation paradigm, to train the model to simulate atomic Knowledge Graph manipulations internally.","s2_paper_id":"ec3d7c17f008ef83a97b5fa29f107ccd751b1db9","topics":"[\"Reasoning\", \"Language Models\", \"Benchmark\"]"},{"id":386,"run_id":1,"domain":"aiml","arxiv_id":"2602.18351","entry_id":"","title":"Validating Political Position Predictions of Arguments","authors":"[\"Jordan Robinson\", \"Angus R. Williams\", \"Katie Atkinson\", \"Anthony G. Cohn\"]","abstract":"Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation. We address this challenge through a dual-scale validation framework applied to political stance prediction in argumentative discourse, combining pointwise and pairwise human annotation. Using 22 language models, we construct a large-scale knowledge base of political position","published":"2026-02-20T17:03:44+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.18351v1","arxiv_url":"http://arxiv.org/abs/2602.18351v1","comment":"13 pages, 6 figures, 6 tables. Under review","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Introduces dual-scale validation framework (pointwise + pairwise) for political stance prediction in argumentative discourse. Creates knowledge base of 23,228 UK Question Time arguments with political positions from 22 LMs, showing stronger alignment in pairwise rankings (\u03b1=0.86) than pointwise (\u03b1=0.578).","reasoning":"Domain-specific application (UK politics) with validation methodology contribution. Limited generalizability. No code/weights for reproducibility.","code_url":null,"s2_tldr":"","s2_paper_id":"91f07e0e0327203800a29273267420d2eb4e7383","topics":"[\"Language Models\", \"Retrieval / RAG\", \"Benchmark\"]"},{"id":387,"run_id":1,"domain":"aiml","arxiv_id":"2602.18346","entry_id":"","title":"Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System","authors":"[\"Pavithra PM Nair\", \"Preethu Rose Anish\"]","abstract":"In jurisdictions like India, where courts face an extensive backlog of cases, artificial intelligence offers transformative potential for legal judgment prediction. A critical subset of this backlog comprises appellate cases, which are formal decisions issued by higher courts reviewing the rulings of lower courts. To this end, we present Vichara, a novel framework tailored to the Indian judicial system that predicts and explains appellate judgments. 
Vichara processes English-language appellate c","published":"2026-02-20T16:57:44+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.18346v1","arxiv_url":"http://arxiv.org/abs/2602.18346v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Presents Vichara framework for Indian appellate judgment prediction and explanation using decision point decomposition and IRAC-inspired structured explanations. GPT-4o mini achieves F1 of 81.5 (PredEx) and 80.3 (ILDC_expert), outperforming baselines with interpretable explanations.","reasoning":"Application-specific to Indian legal system. No code/weights provided. Limited domain applicability despite strong results in narrow context.","code_url":null,"s2_tldr":"Vichara, a novel framework tailored to the Indian judicial system that predicts and explains appellate judgments, surpasses existing judgment prediction benchmarks on both datasets and enhances interpretability, allowing legal professionals to assess the soundness of predictions efficiently.","s2_paper_id":"54abdf2c7d8728405f96aebd2e27ffcdc9e08889","topics":"[]"},{"id":388,"run_id":1,"domain":"aiml","arxiv_id":"2602.18326","entry_id":"","title":"Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning","authors":"[\"Tao Wu\", \"Adam Kapelner\"]","abstract":"We describe a modern deep learning system that automatically identifies informative contextual examples (\"contexts\") for first language vocabulary instruction for high school students. Our paper compares three modeling approaches: (i) an unsupervised similarity-based strategy using MPNet's uniformly contextualized embeddings, (ii) a supervised framework built on instruction-aware, fine-tuned Qwen3 embeddings with a nonlinear regression head and (iii) model (ii) plus handcrafted context feature","published":"2026-02-20T16:32:14+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18326v1","arxiv_url":"http://arxiv.org/abs/2602.18326v1","comment":"8 pages, 3 figures, 4 tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"Develops deep learning system for identifying informative vocabulary learning contexts using MPNet embeddings, fine-tuned Qwen3, and handcrafted features. Best model achieves 440:1 good-to-bad ratio while retaining 30% of good contexts using novel Retention Competency Curve metric.","reasoning":"Application-specific to educational technology. Incremental approach combining existing techniques. Limited broader impact beyond vocabulary instruction.","code_url":null,"s2_tldr":"It is demonstrated that a modern embedding model on neural network architecture, when guided by human supervision, results in a low-cost large supply of near-perfect contexts for teaching vocabulary for a variety of target words.","s2_paper_id":"9a801c63ed1eb4d06f079c55221d9a909dc83cd9","topics":"[\"Language Models\", \"Retrieval / RAG\"]"},{"id":389,"run_id":1,"domain":"aiml","arxiv_id":"2602.18324","entry_id":"","title":"PsihoRo: Depression and Anxiety Romanian Text Corpus","authors":"[\"Alexandra Ciobotaru\", \"Ana-Maria Bucur\", \"Liviu P. 
Dinu\"]","abstract":"Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health. These texts allow researchers to study psychological constructs, detect mental health issues and analyze emotional language. However, mental health data can be difficult to collect correctly from social media, due to suppositions made by the collectors. A more pragmatic strategy involves gathering data through open-ended questions and then assessing this information with self-repo","published":"2026-02-20T16:24:23+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.18324v2","arxiv_url":"http://arxiv.org/abs/2602.18324v2","comment":"This article was accepted at LREC 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[{\"id\": \"Alegzandra/PsihoRo\", \"likes\": 0}]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"Introduces PsihoRo, first Romanian corpus for depression/anxiety detection with 205 respondents using open-ended questions plus PHQ-9/GAD-7 screening. Applies statistical analysis, LIWC, emotion detection, and topic modeling to establish baseline for Romanian mental health NLP.","reasoning":"Language-specific resource (Romanian) filling gap but limited scale (205 samples). No code/weights. Incremental contribution to mental health NLP.","code_url":null,"s2_tldr":"","s2_paper_id":"60ce328154f483236550c85bdc60c2a361d287e7","topics":"[\"Reasoning\"]"},{"id":390,"run_id":1,"domain":"aiml","arxiv_id":"2602.17815","entry_id":"","title":"Neural Synchrony Between Socially Interacting Language Models","authors":"[\"Zhining Zhang\", \"Wentao Zhu\", \"Chi Han\", \"Yizhou Wang\", \"Heng Ji\"]","abstract":"Neuroscience has uncovered a fundamental mechanism of our social nature: human brain activity becomes synchronized with others in many social contexts involving interaction. Traditionally, social minds have been regarded as an exclusive property of living beings. Although large language models (LLMs) are widely accepted as powerful approximations of human behavior, with multi-LLM system being extensively explored to enhance their capabilities, it remains controversial whether they can be meaning","published":"2026-02-19T20:33:54+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17815v1","arxiv_url":"http://arxiv.org/abs/2602.17815v1","comment":"Accepted at ICLR 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Explores neural synchrony between socially interacting LLMs as proxy for analyzing their sociality at representational level. Finds neural synchrony correlates strongly with social performance, suggesting parallels with human social interaction dynamics. Accepted at ICLR 2026.","reasoning":"Interesting conceptual work on LLM sociality but limited practical applicability. 
Novel perspective but no tools/code for practitioners.","code_url":null,"s2_tldr":"This work introduces neural synchrony during social simulations as a novel proxy for analyzing the sociality of LLMs at the representational level and indicates that neural synchrony between LLMs is strongly correlated with their social performance, highlighting an important link between neural synchrony and the social behaviors of LLMs.","s2_paper_id":"fa50e0521d9bc40724de5d3b63e70a1b5859d8c5","topics":"[\"Language Models\"]"},{"id":391,"run_id":1,"domain":"aiml","arxiv_id":"2602.17663","entry_id":"","title":"CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts","authors":"[\"Juri Opitz\", \"Corina Racl\\u00e9\", \"Emanuela Boros\", \"Andrianos Michail\", \"Matteo Romanello\", \"Maud Ehrmann\", \"Simon Clematide\"]","abstract":"HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person--place associations in multiple languages and time periods. Systems are asked to classify relations of two types - $at$ (\"Has the person ever been at this place?\") and $isAt$ (\"Is the person located at this place around pub","published":"2026-02-19T18:59:44+00:00","categories":"[\"cs.AI\", \"cs.CL\", \"cs.IR\"]","pdf_url":"https://arxiv.org/pdf/2602.17663v1","arxiv_url":"http://arxiv.org/abs/2602.17663v1","comment":"ECIR 2026. CLEF Evaluation Lab. Registration DL: 2026/04/23. Task Homepage at https://hipe-eval.github.io/HIPE-2026/","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"CLEF HIPE-2026 evaluation lab for person-place relation extraction from multilingual historical texts. Extends previous campaigns to semantic relations (at/isAt) with three-fold evaluation of accuracy, computational efficiency, and domain generalization. Supports digital humanities applications.","reasoning":"Benchmark/evaluation task announcement. Useful for specialized community but limited broader ML interest. No code/models yet.","code_url":null,"s2_tldr":"By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.","s2_paper_id":"95250face306b2b2679df6789fd96602844e4618","topics":"[\"Benchmark\", \"Efficiency\"]"},{"id":392,"run_id":1,"domain":"aiml","arxiv_id":"2602.17653","entry_id":"","title":"Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking","authors":"[\"Iskar Deng\", \"Nathalia Xu\", \"Shane Steinert-Threlkeld\"]","abstract":"Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. 
Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing ","published":"2026-02-19T18:56:34+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17653v1","arxiv_url":"http://arxiv.org/abs/2602.17653v1","comment":"15 pages, 7 figures, 7 tables. Under review","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":6.0,"score_axis_3":3.0,"composite":3.75,"summary":"This paper investigates how language models exhibit typological preferences for differential argument marking systems when trained on synthetic corpora. The study reveals models reliably favor natural markedness direction but fail to reproduce human-like object preference, suggesting distinct underlying mechanisms for different typological tendencies.","reasoning":"Limited practical impact (linguistic analysis of synthetic corpora); no code/weights. Moderate novelty in applying linguistic typology to LM analysis.","code_url":null,"s2_tldr":"This paper trains GPT-2 models on 18 corpora implementing distinct DAM systems and evaluates their generalization using minimal pairs, revealing a dissociation between two typological dimensions of DAM.","s2_paper_id":"29f791bcc4ec0535e87a92783d585f9ccec3c70f","topics":"[\"Language Models\", \"Reasoning\", \"Training\"]"},{"id":393,"run_id":1,"domain":"aiml","arxiv_id":"2602.17623","entry_id":"","title":"Unmasking the Factual-Conceptual Gap in Persian Language Models","authors":"[\"Alireza Sakhaeirad\", \"Ali Ma'manpoosh\", \"Arshia Hemmat\"]","abstract":"While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs, arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Per","published":"2026-02-19T18:42:46+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17623v1","arxiv_url":"http://arxiv.org/abs/2602.17623v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"DivanBench introduces a diagnostic benchmark for Persian LLMs focused on cultural superstitions and customs, revealing severe acquiescence bias and a 21% performance gap between factual retrieval and situational reasoning. The study shows continuous Persian pretraining amplifies rather than reduces these biases.","reasoning":"Limited scope (Persian-only), no code/weights, diagnostic rather than solution-oriented. 
Moderate novelty in cultural competence evaluation.","code_url":null,"s2_tldr":"DivanBench is introduced, a diagnostic benchmark focused on superstitions and customs, arbitrary, context-dependent rules that resist simple logical deduction that demonstrate that cultural competence requires more than scaling monolingual data.","s2_paper_id":"85f0f0ed68d97a5be270dafaef3f47f2bbb8237d","topics":"[\"Language Models\", \"Retrieval / RAG\", \"Reasoning\"]"},{"id":394,"run_id":1,"domain":"aiml","arxiv_id":"2602.17469","entry_id":"","title":"Auditing Reciprocal Sentiment Alignment: Inversion Risk, Dialect Representation and Intent Misalignment in Transformers","authors":"[\"Nusrat Jahan Lia\", \"Shubhashis Roy Dipta\"]","abstract":"The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior. However, this loop fractures significantly across language barriers. Our research addresses Cross-Lingual Sentiment Misalignment between Bengali and English by benchmarking four transformer architectures. We reveal severe safety and representational failures in current alignment paradigms. We demonstrate that compressed model (mDistilBERT) exhibits 28.7%","published":"2026-02-19T15:35:13+00:00","categories":"[\"cs.CL\", \"cs.HC\"]","pdf_url":"https://arxiv.org/pdf/2602.17469v1","arxiv_url":"http://arxiv.org/abs/2602.17469v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Benchmarks four transformer architectures for Bengali-English sentiment alignment, revealing 28.7% sentiment inversion rate in compressed models and 57% increased error for formal Bengali dialect. Identifies asymmetric empathy and modern bias issues affecting human-AI trust in low-resource languages.","reasoning":"Limited scope (Bengali-specific), no code/weights, analysis-focused rather than solution-oriented. Moderate novelty in cross-lingual alignment study.","code_url":null,"s2_tldr":"This research addresses Cross-Lingual Sentiment Misalignment between Bengali and English by benchmarking four transformer architectures and argues that equitable human-AI co-evolution requires pluralistic, culturally grounded alignment that respects language and dialectal diversity over universal compression.","s2_paper_id":"b65da61ab652ead21f1f5a4313063e70b0f9c6da","topics":"[\"Training\", \"Architecture\", \"Efficiency\"]"},{"id":395,"run_id":1,"domain":"aiml","arxiv_id":"2602.17425","entry_id":"","title":"Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics","authors":"[\"Sanjeev Kumar\", \"Preethi Jyothi\", \"Pushpak Bhattacharyya\"]","abstract":"Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. 
We examine how each metric responds to translation artifacts, including hallucinations, repetition, sour","published":"2026-02-19T14:56:42+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17425v1","arxiv_url":"http://arxiv.org/abs/2602.17425v1","comment":"6 pages","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"Compares BLEU and ChrF++ metrics for evaluating machine translation in extremely low-resource languages (Magahi, Bhojpuri, Chhattisgarhi). Shows BLEU provides complementary lexical-precision insights to ChrF++, challenging recent reliance on ChrF++ alone for ELRL evaluation.","reasoning":"Limited scope (evaluation study), no code/weights. Incremental contribution to MT evaluation practices in low-resource settings.","code_url":null,"s2_tldr":"Findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.","s2_paper_id":"36352d73d05b7adca05b98821616648fafee8e21","topics":"[\"Benchmark\"]"},{"id":396,"run_id":1,"domain":"aiml","arxiv_id":"2602.17229","entry_id":"","title":"Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy","authors":"[\"Bianca Raimondi\", \"Maurizio Gabbrielli\"]","abstract":"The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics. This study investigates the internal neural representations of cognitive complexity using Bloom's Taxonomy as a hierarchical lens. By analyzing high-dimensional activation vectors from different LLMs, we probe whether different cognitive levels, ranging from basic recall (Remember) to abstract synthesis (Create), are linearly separable within the model's residu","published":"2026-02-19T10:19:04+00:00","categories":"[\"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17229v1","arxiv_url":"http://arxiv.org/abs/2602.17229v1","comment":"Preprint. Under review","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Mechanistic interpretability study showing LLMs encode cognitive complexity levels (Bloom's Taxonomy) in linearly separable representations with 95% classifier accuracy. Demonstrates models resolve cognitive difficulty early in forward pass with increasing layer-wise separability.","reasoning":"Low code_and_weights (preprint, no code/weights mentioned). Moderate novelty in applying educational taxonomy to neural representations. 
Limited immediate practical applicability beyond interpretability research.","code_url":null,"s2_tldr":"This study investigates the internal neural representations of cognitive complexity using Bloom's Taxonomy as a hierarchical lens, providing strong evidence that cognitive level is encoded in a linearly accessible subspace of the model's representations.","s2_paper_id":"e0ad5c38a423d03e2dd6de79354aa033446a247a","topics":"[\"Language Models\", \"Benchmark\"]"},{"id":397,"run_id":1,"domain":"aiml","arxiv_id":"2602.17221","entry_id":"","title":"From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences","authors":"[\"Yi-Chih Huang\"]","abstract":"Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences. Positioned as a \"methodological experiment,\" this study proposes an AI Agent-based collaborative research workflow (Agentic Workflow) for humanities and social science research. Taiwan's Claude.ai usage data (N = 7,729 conversations, November 2025) from the Anthropic Economic Index ","published":"2026-02-19T10:12:08+00:00","categories":"[\"cs.AI\", \"cs.CL\", \"cs.CY\"]","pdf_url":"https://arxiv.org/pdf/2602.17221v1","arxiv_url":"http://arxiv.org/abs/2602.17221v1","comment":"also in Chinese","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":5.0,"composite":3.75,"summary":"Proposes AI Agent-based collaborative workflow for humanities/social sciences research, demonstrating seven-stage modular framework with clear human-AI division of labor. Validates methodology through analysis of Taiwan Claude.ai usage data (7,729 conversations), identifying three operational collaboration modes.","reasoning":"Low code_and_weights (methodology paper, unclear if code shared). Moderate novelty in workflow design. Moderate practical applicability limited to humanities/social science research contexts.","code_url":null,"s2_tldr":"This study contributes by proposing a replicable AI collaboration framework for humanities and social science researchers, and identifying three operational modes of human-AI collaboration - direct execution, iterative refinement, and human-led - through reflexive documentation of the operational process.","s2_paper_id":"66da11cc6be384a77656737c4aa6e44733442f4f","topics":"[\"Reasoning\", \"Code\", \"Agents\"]"},{"id":398,"run_id":1,"domain":"aiml","arxiv_id":"2602.17108","entry_id":"","title":"Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests","authors":"[\"Anton Dzega\", \"Aviad Elyashar\", \"Ortal Slobodin\", \"Odeya Cohen\", \"Rami Puzis\"]","abstract":"Thematic Apperception Test (TAT) is a psychometrically grounded, multidimensional assessment framework that systematically differentiates between cognitive-representational and affective-relational components of personality-like functioning. This test is a projective psychological framework designed to uncover unconscious aspects of personality. 
This study examines whether the personality traits of Large Multimodal Models (LMMs) can be assessed through non-language-based modalities, using the So","published":"2026-02-19T06:08:33+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17108v1","arxiv_url":"http://arxiv.org/abs/2602.17108v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":4.0,"composite":3.75,"summary":"Applies Thematic Apperception Test to multimodal LLMs using SCORS-G framework, evaluating both story generation and narrative assessment capabilities. Finds models excel at interpersonal dynamics but consistently fail at perceiving/regulating aggression, with performance varying by model family.","reasoning":"Low code_and_weights (no code/weights mentioned). Moderate novelty in applying projective psychology to multimodal models. Limited practical applicability beyond personality assessment research.","code_url":null,"s2_tldr":"","s2_paper_id":"81ff7135a2c758318edd003f68fb202ea0bc433f","topics":"[\"Multimodal\", \"Reasoning\"]"},{"id":400,"run_id":1,"domain":"aiml","arxiv_id":"2602.19278","entry_id":"","title":"A Two-Stage Detection-Tracking Framework for Stable Apple Quality Inspection in Dense Conveyor-Belt Environments","authors":"[\"Keonvin Park\", \"Aditya Pal\", \"Jin Hong Mok\"]","abstract":"Industrial fruit inspection systems must operate reliably under dense multi-object interactions and continuous motion, yet most existing works evaluate detection or classification at the image level without ensuring temporal stability in video streams. We present a two-stage detection-tracking framework for stable multi-apple quality inspection in conveyor-belt environments. An orchard-trained YOLOv8 model performs apple localization, followed by ByteTrack multi-object tracking to maintain persi","published":"2026-02-22T17:24:13+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19278v1","arxiv_url":"http://arxiv.org/abs/2602.19278v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":5.0,"composite":3.4,"summary":"A two-stage framework for stable apple quality inspection on conveyor belts using YOLOv8 detection, ByteTrack tracking, and ResNet18 classification. Track-level aggregation enforces temporal consistency to reduce prediction oscillation across video frames.","reasoning":"Incremental application of existing methods to industrial fruit inspection. No code/weights, narrow domain focus.","code_url":null,"s2_tldr":"A two-stage detection-tracking framework for stable multi-apple quality inspection in conveyor-belt environments is presented and improved stability compared to frame-wise inference is demonstrated, suggesting that integrating tracking is essential for practical automated fruit grading systems.","s2_paper_id":"e427f615d8ef10a4b745d6e59afacdaa63f2bf78","topics":"[\"Benchmark\"]"},{"id":401,"run_id":1,"domain":"aiml","arxiv_id":"2602.18959","entry_id":"","title":"YOLOv10-Based Multi-Task Framework for Hand Localization and Laterality Classification in Surgical Videos","authors":"[\"Kedi Sun\", \"Le Zhang\"]","abstract":"Real-time hand tracking in trauma surgery is essential for supporting rapid and precise intraoperative decisions. 
We propose a YOLOv10-based framework that simultaneously localizes hands and classifies their laterality (left or right) in complex surgical scenes. The model is trained on the Trauma THOMPSON Challenge 2025 Task 2 dataset, consisting of first-person surgical videos with annotated hand bounding boxes. Extensive data augmentation and a multi-task detection design improve robustness ag","published":"2026-02-21T21:41:56+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.18959v1","arxiv_url":"http://arxiv.org/abs/2602.18959v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":5.0,"composite":3.4,"summary":"YOLOv10-based framework for real-time hand localization and laterality classification in surgical videos achieves 67-71% classification accuracy with mAP of 0.33. Trained on Trauma THOMPSON Challenge 2025 dataset with multi-task detection design for surgical scene understanding.","reasoning":"Medical application with limited generalizability. No code/weights available. Applies existing YOLOv10 architecture to specific task without significant architectural innovation.","code_url":null,"s2_tldr":"A YOLOv10-based framework that simultaneously localizes hands and classifies their laterality (left or right) in complex surgical scenes in order to establish a foundation for advanced hand-instrument interaction analysis in emergency surgical procedures is proposed.","s2_paper_id":"a583f6bf3461348c3f3addd86948e34881a6c0c8","topics":"[\"Benchmark\"]"},{"id":402,"run_id":1,"domain":"aiml","arxiv_id":"2602.18903","entry_id":"","title":"SCHEMA for Gemini 3 Pro Image: A Structured Methodology for Controlled AI Image Generation on Google's Native Multimodal Model","authors":"[\"Luca Cazzaniga\"]","abstract":"This paper presents SCHEMA (Structured Components for Harmonized Engineered Modular Architecture), a structured prompt engineering methodology specifically developed for Google Gemini 3 Pro Image. Unlike generic prompt guidelines or model-agnostic tips, SCHEMA is an engineered framework built on systematic professional practice encompassing 850 verified API predictions within an estimated corpus of approximately 4,800 generated images, spanning six professional domains: real estate photography, ","published":"2026-02-21T16:51:40+00:00","categories":"[\"cs.CV\", \"cs.HC\"]","pdf_url":"https://arxiv.org/pdf/2602.18903v1","arxiv_url":"http://arxiv.org/abs/2602.18903v1","comment":"24 pages, 8 tables. Based on SCHEMA Method v1.0 (deposited December 11, 2025). Previously published on Zenodo: doi:10.5281/zenodo.18721380","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":5.0,"composite":3.4,"summary":"SCHEMA presents a structured prompt engineering methodology specifically for Google Gemini 3 Pro Image based on 850 verified API predictions across six professional domains. Introduces three-tier progressive system (BASE, MEDIO, AVANZATO) with modular label architecture achieving 91% mandatory compliance and 94% prohibitions compliance.","reasoning":"Model-specific prompt engineering guide rather than novel research. No code/weights. 
Practical for Gemini users but limited generalizability and no architectural innovation.","code_url":null,"s2_tldr":"Key findings include an observed 91% Mandatory compliance rate and 94% Prohibitions compliance rate across 621 structured prompts, a comparative batch consistency test demonstrating substantially higher inter-generation coherence for structured prompts, independent practitioner validation, and a dedicated Information Design validation demonstrating >95% first-generation compliance for spatial and typographical control across approximately 300 publicly verifiable infographics.","s2_paper_id":"3adebb0e42f70348fa15a81c7c65b18c78bf159c","topics":"[\"Image Generation\", \"Multimodal\"]"},{"id":403,"run_id":1,"domain":"aiml","arxiv_id":"2602.19643","entry_id":"","title":"KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge","authors":"[\"Alex Robertson\", \"Huizhi Liang\", \"Mahbub Gani\", \"Rohit Kumar\", \"Srijith Rajamohan\"]","abstract":"Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness, as the responses often contain subtle hallucinations. Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations. We present KGHaluBench, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and mo","published":"2026-02-23T09:41:46+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19643v1","arxiv_url":"http://arxiv.org/abs/2602.19643v1","comment":"EACL 2026 Findings","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":4.0,"composite":3.4,"summary":"Introduces KGHaluBench, a Knowledge Graph-based hallucination benchmark that dynamically constructs challenging questions to evaluate LLMs across breadth and depth of knowledge. Evaluates 25 frontier models with novel accuracy and hallucination metrics, addressing popularity bias through statistical difficulty estimation.","reasoning":"Useful benchmark but no code/weights for models. Evaluation-focused with limited immediate practical applicability. Incremental improvement to hallucination detection.","code_url":null,"s2_tldr":"KGHaluBench is presented, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive insight into LLM truthfulness.","s2_paper_id":"33390060103d54658d2fed650c929edf63b90f14","topics":"[\"Language Models\", \"Retrieval / RAG\", \"Benchmark\"]"},{"id":404,"run_id":1,"domain":"aiml","arxiv_id":"2602.18699","entry_id":"","title":"Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift","authors":"[\"Stephen Russell\"]","abstract":"Most semantic drift studies report multiple signals, e.g., embedding displacement, neighbor changes, distributional divergence, and recursive trajectory instability, without a shared explanatory theory that relates them. This paper proposes a formalization of these signals in one time-indexed substrate, $S_t=(X,d_t,P_t)$, combining embedding geometry with local diffusion. 
Within this substrate, node-level neighborhood drift measures changes in local conditional distributions, coarse Ricci curvatu","published":"2026-02-21T03:04:21+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.18699v1","arxiv_url":"http://arxiv.org/abs/2602.18699v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":5.0,"score_axis_3":3.0,"composite":3.4,"summary":"Proposes Semantic Substrate Theory formalizing semantic drift signals in unified framework combining embedding geometry with local diffusion. Introduces bridge mass as predictor of neighborhood rewiring. Provides theoretical model with test contracts; empirical validation deferred.","reasoning":"Theoretical framework for semantic drift without empirical validation or implementation. Academic contribution but limited practical utility without experimental results.","code_url":null,"s2_tldr":null,"s2_paper_id":"07d48b98fb6cbf0f19081527620723103c0f3eff","topics":"[\"Retrieval / RAG\"]"},{"id":405,"run_id":1,"domain":"aiml","arxiv_id":"2602.17905","entry_id":"","title":"Games That Teach, Chats That Convince: Comparing Interactive and Static Formats for Persuasive Learning","authors":"[\"Seyed Hossein Alavi\", \"Zining Wang\", \"Shruthi Chockkalingam\", \"Raymond T. Ng\", \"Vered Shwartz\"]","abstract":"Interactive systems such as chatbots and games are increasingly used to persuade and educate on sustainability-related topics, yet it remains unclear how different delivery formats shape learning and persuasive outcomes when content is held constant. Grounding on identical arguments and factual content across conditions, we present a controlled user study comparing three modes of information delivery: static essays, conversational chatbots, and narrative text-based games. Across subjective measu","published":"2026-02-20T00:07:18+00:00","categories":"[\"cs.HC\", \"cs.AI\", \"cs.CL\", \"cs.ET\"]","pdf_url":"https://arxiv.org/pdf/2602.17905v2","arxiv_url":"http://arxiv.org/abs/2602.17905v2","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":5.0,"composite":3.4,"summary":"Controlled user study comparing essays, chatbots, and text-based games for sustainability education. Finds chatbots increase perceived importance but games achieve higher delayed knowledge retention despite lower perceived learning, revealing dissociation between subjective experience and actual learning.","reasoning":"HCI study with limited ML contribution. Interesting findings but not directly applicable to ML practitioners. No technical artifacts.","code_url":null,"s2_tldr":"A controlled user study comparing three modes of information delivery shows a dissociation between how persuasive experiences feel and what participants retain, and point to important design trade-offs between interactivity, realism, and learning in persuasive systems and serious games.","s2_paper_id":"3b19fa19b5026bcc8721267781171b83a7219c0c","topics":"[]"},{"id":406,"run_id":1,"domain":"aiml","arxiv_id":"2602.20853","entry_id":"","title":"On the Explainability of Vision-Language Models in Art History","authors":"[\"Stefanie Schneider\"]","abstract":"Vision-Language Models (VLMs) transfer visual and textual data into a shared embedding space. 
In so doing, they enable a wide range of multimodal tasks, while also raising critical questions about the nature of machine 'understanding.' In this paper, we examine how Explainable Artificial Intelligence (XAI) methods can render the visual reasoning of a VLM - namely, CLIP - legible in art-historical contexts. To this end, we evaluate seven methods, combining zero-shot localization experiments with ","published":"2026-02-24T12:53:28+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.20853v1","arxiv_url":"http://arxiv.org/abs/2602.20853v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":3.0,"composite":3.05,"summary":"Examines seven XAI methods for explaining CLIP's visual reasoning in art-historical contexts through zero-shot localization and human interpretability studies. Finds methods capture some human interpretation but effectiveness depends on conceptual stability and representational availability of examined categories.","reasoning":"Survey/analysis paper in niche art history domain. No code/weights or technical contributions. Limited practical applicability beyond specialized domain.","code_url":null,"s2_tldr":"This paper examines how Explainable Artificial Intelligence (XAI) methods can render the visual reasoning of a VLM - namely, CLIP - legible in art-historical contexts, combining zero-shot localization experiments with human interpretability studies.","s2_paper_id":"fd4fa206adbbfc8c6810bc327874475bf3c38e9a","topics":"[\"Language Models\", \"Multimodal\", \"Retrieval / RAG\"]"},{"id":407,"run_id":1,"domain":"aiml","arxiv_id":"2602.19823","entry_id":"","title":"Open-vocabulary 3D scene perception in industrial environments","authors":"[\"Keno Moenck\", \"Adrian Philip Florea\", \"Julian Koch\", \"Thorsten Sch\\u00fcppstuhl\"]","abstract":"Autonomous vision applications in production, intralogistics, or manufacturing environments require perception capabilities beyond a small, fixed set of classes. Recent open-vocabulary methods, leveraging 2D Vision-Language Foundation Models (VLFMs), target this task but often rely on class-agnostic segmentation models pre-trained on non-industrial datasets (e.g., household scenes). In this work, we first demonstrate that such models fail to generalize, performing poorly on common industrial obj","published":"2026-02-23T13:22:51+00:00","categories":"[\"cs.CV\"]","pdf_url":"https://arxiv.org/pdf/2602.19823v1","arxiv_url":"http://arxiv.org/abs/2602.19823v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":4.0,"composite":3.05,"summary":"Proposes training-free open-vocabulary 3D perception pipeline for industrial environments using IndustrialCLIP and superpoint merging. Addresses domain gap in industrial object segmentation without requiring pre-trained segmentation models.","reasoning":"Industrial/domain-specific application with no code/weights. 
Training-free approach is practical but incremental.","code_url":null,"s2_tldr":"This work proposes a training-free, open-vocabulary 3D perception pipeline that overcomes the limitation of class-agnostic segmentation models pre-trained on non-industrial datasets, and generates masks by merging pre-computed superpoints based on their semantic features.","s2_paper_id":"796463e8bac9c27477925892ed3bf1e2141bb686","topics":"[\"3D / Vision\", \"Language Models\", \"Multimodal\"]"},{"id":408,"run_id":1,"domain":"aiml","arxiv_id":"2602.19698","entry_id":"","title":"Iconographic Classification and Content-Based Recommendation for Digitized Artworks","authors":"[\"Krzysztof Kutt\", \"Maciej Baczy\\u0144ski\"]","abstract":"We present a proof-of-concept system that automates iconographic classification and content-based recommendation of digitized artworks using the Iconclass vocabulary and selected artificial intelligence methods. The prototype implements a four-stage workflow for classification and recommendation, which integrates YOLOv8 object detection with algorithmic mappings to Iconclass codes, rule-based inference for abstract meanings, and three complementary recommenders (hierarchical proximity, IDF-weigh","published":"2026-02-23T10:44:27+00:00","categories":"[\"cs.DL\", \"cs.AI\", \"cs.CV\", \"cs.IR\"]","pdf_url":"https://arxiv.org/pdf/2602.19698v1","arxiv_url":"http://arxiv.org/abs/2602.19698v1","comment":"14 pages, 7 figures; submitted to ICCS 2026 conference","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":4.0,"composite":3.05,"summary":"Proof-of-concept system for automated iconographic classification of artworks using YOLOv8 and Iconclass vocabulary with rule-based inference and three recommenders. Targets heritage repository navigation and cataloging acceleration.","reasoning":"Domain-specific cultural heritage application without code/weights. Incremental engineering of existing methods.","code_url":null,"s2_tldr":"A proof-of-concept system that automates iconographic classification and content-based recommendation of digitized artworks using the Iconclass vocabulary and selected artificial intelligence methods, demonstrating the potential of Iconclass-aware computer vision and recommendation methods to accelerate cataloging and enhance navigation in large heritage repositories.","s2_paper_id":"6fb3b3f26cacc73d738847026329169da894ac3f","topics":"[\"3D / Vision\"]"},{"id":410,"run_id":1,"domain":"aiml","arxiv_id":"2602.18961","entry_id":"","title":"Depth-Enhanced YOLO-SAM2 Detection for Reliable Ballast Insufficiency Identification","authors":"[\"Shiyu Liu\", \"Dylan Lester\", \"Husnu Narman\", \"Ammar Alzarrad\", \"Pingping Zhu\"]","abstract":"This paper presents a depth-enhanced YOLO-SAM2 framework for detecting ballast insufficiency in railway tracks using RGB-D data. Although YOLOv8 provides reliable localization, the RGB-only model shows limited safety performance, achieving high precision (0.99) but low recall (0.49) due to insufficient ballast, as it tends to over-predict the sufficient class. 
To improve reliability, we incorporate depth-based geometric analysis enabled by a sleeper-aligned depth-correction pipeline that compens","published":"2026-02-21T21:49:06+00:00","categories":"[\"cs.CV\", \"eess.IV\", \"eess.SY\"]","pdf_url":"https://arxiv.org/pdf/2602.18961v1","arxiv_url":"http://arxiv.org/abs/2602.18961v1","comment":"Submitted to the IEEE International Symposium on Robotic and Sensors Environments (ROSE) 2026","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":4.0,"composite":3.05,"summary":"Depth-enhanced YOLO-SAM2 framework for railway ballast insufficiency detection combines RGB-D data, depth correction, and SAM2 segmentation. The system improves recall from 0.49 to 0.80 and F1-score to over 0.80 through geometric analysis of sleeper and ballast profiles.","reasoning":"Domain-specific application (railway inspection) with limited broader applicability. No code/weights shared. Incremental improvement over YOLO baseline using depth sensors.","code_url":null,"s2_tldr":"This paper presents a depth-enhanced YOLO-SAM2 framework for detecting ballast insufficiency in railway tracks using RGB-D data and demonstrates that integrating depth correction with YOLO-SAM2 yields a more robust and reliable approach for automated railway ballast inspection.","s2_paper_id":"2698884c69a52c8ad6070d1040b6f9998f754653","topics":"[]"},{"id":411,"run_id":1,"domain":"aiml","arxiv_id":"2602.20224","entry_id":"","title":"Exploring Anti-Aging Literature via ConvexTopics and Large Language Models","authors":"[\"Lana E. Yeganova\", \"Won G. Kim\", \"Shubo Tian\", \"Natalie Xie\", \"Donald C. Comeau\", \"W. John Wilbur\", \"Zhiyong Lu\"]","abstract":"The rapid expansion of biomedical publications creates challenges for organizing knowledge and detecting emerging trends, underscoring the need for scalable and interpretable methods. Common clustering and topic modeling approaches such as K-means or LDA remain sensitive to initialization and prone to local optima, limiting reproducibility and evaluation. We propose a reformulation of a convex optimization based clustering algorithm that produces stable, fine-grained topics by selecting exemplar","published":"2026-02-23T16:17:33+00:00","categories":"[\"cs.LG\", \"cs.AI\", \"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20224v1","arxiv_url":"http://arxiv.org/abs/2602.20224v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":3.0,"composite":3.05,"summary":"Applies convex optimization-based clustering (ConvexTopics) to 12,000 PubMed articles on aging/longevity. Produces stable, interpretable topics validated by medical experts, outperforming K-means, LDA, and BERTopic on reproducibility. Focuses on biomedical knowledge organization.","reasoning":"Medical domain application using existing methods. No code/weights. 
Limited to specific biomedical use case with low broader applicability.","code_url":null,"s2_tldr":"A reformulation of a convex optimization based clustering algorithm that produces stable, fine-grained topics by selecting exemplars from the data and guaranteeing a global optimum, which provides a basis for developing scalable, web-accessible tools for knowledge discovery.","s2_paper_id":"bf91c3bfba8e7c64c5d3ad7bdfdcf111e93b9ff4","topics":"[\"Language Models\", \"Benchmark\", \"Optimization\"]"},{"id":412,"run_id":1,"domain":"aiml","arxiv_id":"2602.19878","entry_id":"","title":"Axis Decomposition for ODRL: Resolving Dimensional Ambiguity in Policy Constraints through Interval Semantics","authors":"[\"Daham Mustafa\", \"Diego Collarana\", \"Yixin Peng\", \"Rafiqul Haque\", \"Christoph Lange-Bever\", \"Christoph Quix\", \"Stephan Decker\"]","abstract":"Every ODRL 2.2 constraint compares a single scalar value: (leftOperand, operator, rightOperand). Five of ODRL's approximately 34 left operands, however, denote multi-dimensional quantities--image dimensions, canvas positions, geographic coordinates--whose specification text explicitly references multiple axes. For these operands, a single scalar constraint admits one interpretation per axis, making policy evaluation non-deterministic. We classify ODRL's left operands by value-domain structure ","published":"2026-02-23T14:24:46+00:00","categories":"[\"cs.CL\", \"cs.LO\"]","pdf_url":"https://arxiv.org/pdf/2602.19878v1","arxiv_url":"http://arxiv.org/abs/2602.19878v1","comment":"16 pages, 5 tables. Preprint","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":3.0,"composite":3.05,"summary":"Introduces axis-decomposition framework for ODRL multi-dimensional constraints (image dimensions, geographic coordinates) to resolve ambiguity. Presents ODRL Spatial Axis Profile with 15 axis-specific operands. Validated on 117 benchmarks with concordance between Vampire and Z3 provers.","reasoning":"Highly specialized formal methods work for policy language. No code/weights for ML practitioners. Limited to ODRL/spatial constraint domain.","code_url":null,"s2_tldr":"An axis-decomposition framework is presented that refines each dimensional operand into axis-specific scalar operands and proves four properties: deterministic interpretation, AABB completeness, sound over-approximation under projection, and conservative extension.","s2_paper_id":"cb91c6a50291a9398454c408f1b0d6b25a0190a4","topics":"[\"Benchmark\"]"},{"id":413,"run_id":1,"domain":"aiml","arxiv_id":"2602.19583","entry_id":"","title":"DEEP: Docker-based Execution and Evaluation Platform","authors":"[\"Sergio G\\u00f3mez Gonz\\u00e1lez\", \"Miguel Domingo\", \"Francisco Casacuberta\"]","abstract":"Comparative evaluation of several systems is a recurrent task in researching. It is a key step before deciding which system to use for our work, or, once our research has been conducted, to demonstrate the potential of the resulting model. Furthermore, it is the main task of competitive, public challenges evaluation. Our proposed software (DEEP) automates both the execution and scoring of machine translation and optical character recognition models. 
Furthermore, it is easily extensible to other ","published":"2026-02-23T08:08:57+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.19583v1","arxiv_url":"http://arxiv.org/abs/2602.19583v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":4.0,"composite":3.05,"summary":"Proposes DEEP, a Docker-based platform for automated execution and evaluation of MT and OCR models. Includes clustering algorithm for statistical significance analysis and visualization web-app. Aims to standardize comparative evaluation in research and competitive challenges.","reasoning":"Engineering tool rather than research contribution. No ML weights. Practical for evaluation workflows but not novel ML methodology.","code_url":null,"s2_tldr":"The proposed software (DEEP) automates both the execution and scoring of machine translation and optical character recognition models and uses a clustering algorithm based on a statistical analysis of the significance of the results yielded by each model, according to the evaluation metrics.","s2_paper_id":"9a9408c2ade4ffbb628f745ad729fed643a79282","topics":"[\"Benchmark\"]"},{"id":414,"run_id":1,"domain":"aiml","arxiv_id":"2602.19463","entry_id":"","title":"PuppetChat: Fostering Intimate Communication through Bidirectional Actions and Micronarratives","authors":"[\"Emma Jiren Wang\", \"Siying Hu\", \"Zhicong Lu\"]","abstract":"As a primary channel for sustaining modern intimate relationships, instant messaging facilitates frequent connection across distances. However, today's tools often dilute care; they favor single tap reactions and vague emojis that do not support two way action responses, do not preserve the feeling that the exchange keeps going without breaking, and are weakly tied to who we are and what we share. To address this challenge, we present PuppetChat, a dyadic messaging prototype that restores this e","published":"2026-02-23T03:17:27+00:00","categories":"[\"cs.HC\", \"cs.AI\", \"cs.CL\", \"cs.CY\"]","pdf_url":"https://arxiv.org/pdf/2602.19463v1","arxiv_url":"http://arxiv.org/abs/2602.19463v1","comment":"19 pages, 8 figures; Accepted by ACM CHI 2026. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI'24)","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":4.0,"composite":3.05,"summary":"PuppetChat is a dyadic messaging prototype using embodied interaction, reciprocity-aware recommendations, and personalized micronarratives to enhance intimate communication. A 10-day field study with 11 dyads showed improvements in social presence and expressive self-disclosure.","reasoning":"HCI application paper focused on messaging interface design. 
Limited relevance to core AI/ML practitioners, no code/weights, incremental application work.","code_url":null,"s2_tldr":"PuppetChat is a dyadic messaging prototype that restores this expressive depth through embodied interaction and uses a reciprocity aware recommender to encourage responsive actions and generates personalized micronarratives from user stories to ground interactions in personal history.","s2_paper_id":"b5a284b432c323e1c3029da95538deca64571f4c","topics":"[]"},{"id":415,"run_id":1,"domain":"aiml","arxiv_id":"2602.18092","entry_id":"","title":"Perceived Political Bias in LLMs Reduces Persuasive Abilities","authors":"[\"Matthew DiGiuseppe\", \"Joshua Robison\"]","abstract":"Conversational AI has been proposed as a scalable way to correct public misconceptions and spread misinformation. Yet its effectiveness may depend on perceptions of its political neutrality. As LLMs enter partisan conflict, elites increasingly portray them as ideologically aligned. We test whether these credibility attacks reduce LLM-based persuasion. In a preregistered U.S. survey experiment (N=2144), participants completed a three-round conversation with ChatGPT about a personally held economi","published":"2026-02-20T09:33:16+00:00","categories":"[\"cs.CL\", \"cs.AI\", \"cs.CY\"]","pdf_url":"https://arxiv.org/pdf/2602.18092v1","arxiv_url":"http://arxiv.org/abs/2602.18092v1","comment":"39 pages, 10 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":4.0,"composite":3.05,"summary":"Survey experiment (N=2144) showing that perceived political bias in LLMs reduces their persuasive ability by 28% when users are warned about partisan alignment. Demonstrates political contingency of conversational AI effectiveness through transcript analysis of ChatGPT interactions.","reasoning":"Social science study on perception rather than technical contribution. Limited practical applicability for ML practitioners. No code/models.","code_url":null,"s2_tldr":"","s2_paper_id":"1336d446c53f620890980fef2504f28888414c4a","topics":"[\"Language Models\", \"Reasoning\"]"},{"id":416,"run_id":1,"domain":"aiml","arxiv_id":"2602.18029","entry_id":"","title":"Towards More Standardized AI Evaluation: From Models to Agents","authors":"[\"Ali El Filali\", \"In\\u00e8s Bedar\"]","abstract":"Evaluation is no longer a final checkpoint in the machine learning lifecycle. As AI systems evolve from static models to compound, tool-using agents, evaluation becomes a core control function. The question is no longer \"How good is the model?\" but \"Can we trust the system to behave as intended, under change, at scale?\". Yet most evaluation practices remain anchored in assumptions inherited from the model-centric era: static benchmarks, aggregate scores, and one-off success criteria. This paper ","published":"2026-02-20T06:54:44+00:00","categories":"[\"cs.CL\", \"cs.AI\"]","pdf_url":"https://arxiv.org/pdf/2602.18029v1","arxiv_url":"http://arxiv.org/abs/2602.18029v1","comment":"19 pages, 3 figures","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":4.0,"composite":3.05,"summary":"Position paper arguing that evaluation practices must evolve from static model-centric benchmarks to address agentic AI systems. 
Discusses how current evaluation approaches fail to capture system behavior under change and at scale, calling for evaluation as measurement discipline rather than performance theater.","reasoning":"Survey/position paper without technical contributions or empirical work. Limited immediate practical value despite important conceptual points.","code_url":null,"s2_tldr":"The role of evaluation in the AI era is clarified, especially for agents: not as performance theater, but as a measurement discipline that conditions trust, iteration, and governance in non-deterministic systems.","s2_paper_id":"fcd55b29cb1cf214384d7f7ea8cfecf12daa8ba2","topics":"[\"Benchmark\"]"},{"id":417,"run_id":1,"domain":"aiml","arxiv_id":"2602.17850","entry_id":"","title":"Mind the Style: Impact of Communication Style on Human-Chatbot Interaction","authors":"[\"Erik Derner\", \"Dalibor Ku\\u010dera\", \"Aditya Gulati\", \"Ayoub Bagheri\", \"Nuria Oliver\"]","abstract":"Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear. Addressing this gap, we describe the results of a between-subject user study where participants interact with one of two versions of a chatbot called NAVI which assists users in an interactive map-based 2D navigation task. The two chatbot versions differ only in communication style: one is friendly and supportive, while the oth","published":"2026-02-19T21:32:41+00:00","categories":"[\"cs.HC\", \"cs.AI\", \"cs.CL\", \"cs.CY\"]","pdf_url":"https://arxiv.org/pdf/2602.17850v1","arxiv_url":"http://arxiv.org/abs/2602.17850v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":4.0,"composite":3.05,"summary":"Between-subject study showing friendly chatbot communication style increases satisfaction and task completion rates for female participants only in 2D navigation task. Finds little evidence of linguistic accommodation, suggesting limited user mimicry of chatbot style.","reasoning":"HCI study with gender-specific findings. Limited ML contribution or practical applicability for model developers. No technical artifacts.","code_url":null,"s2_tldr":"The results of a between-subject user study where participants interact with one of two versions of a chatbot called NAVI which assists users in an interactive map-based 2D navigation task show that the friendly style increases subjective satisfaction and significantly improves task completion rates among female participants only.","s2_paper_id":"02f210cc36e867fbf1127f47cf08208d035ad14c","topics":"[\"Robotics\"]"},{"id":418,"run_id":1,"domain":"aiml","arxiv_id":"2602.17848","entry_id":"","title":"On the scaling relationship between cloze probabilities and language model next-token prediction","authors":"[\"Cassandra L. Jacobs\", \"Morgan Grobol\"]","abstract":"Recent work has shown that larger language models have better predictive power for eye movement and reading time data. While even the best models under-allocate probability mass to human responses, larger models assign higher-quality estimates of next tokens and their likelihood of production in cloze data because they are less sensitive to lexical co-occurrence statistics while being better aligned semantically to human cloze responses. 
The results provide support for the claim that the greater","published":"2026-02-19T21:29:55+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.17848v1","arxiv_url":"http://arxiv.org/abs/2602.17848v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":4.0,"score_axis_3":3.0,"composite":3.05,"summary":"Empirical study showing that larger language models assign higher-quality next-token probability estimates aligned with human cloze responses while being less sensitive to lexical co-occurrence statistics. Suggests greater memorization capacity helps semantic appropriateness but reduces sensitivity to low-level word-recognition cues (illustrative next-token probability sketch appended after the JSON).","reasoning":"Incremental psycholinguistic analysis of scaling laws. Limited immediate practical value for practitioners. No code/models.","code_url":null,"s2_tldr":"The results provide support for the claim that the greater memorization capacity of larger models helps them guess more semantically appropriate words, but makes them less sensitive to low-level information that is relevant for word recognition.","s2_paper_id":"e26b8e1eabbb3f1ef9830d17af013108a3f6d364","topics":"[\"Language Models\"]"},{"id":419,"run_id":1,"domain":"aiml","arxiv_id":"2602.19763","entry_id":"","title":"Training Deep Stereo Matching Networks on Tree Branch Imagery: A Benchmark Study for Real-Time UAV Forestry Applications","authors":"[\"Yida Lin\", \"Bing Xue\", \"Mengjie Zhang\", \"Sam Schofield\", \"Richard Green\"]","abstract":"Autonomous drone-based tree pruning needs accurate, real-time depth estimation from stereo cameras. Depth is computed from disparity maps using $Z = f B/d$, so even small disparity errors cause noticeable depth mistakes at working distances. Building on our earlier work that identified DEFOM-Stereo as the best reference disparity generator for vegetation scenes, we present the first study to train and test ten deep stereo matching networks on real tree branch images. We use the Canterbury Tree B","published":"2026-02-23T12:12:43+00:00","categories":"[\"cs.CV\", \"eess.IV\"]","pdf_url":"https://arxiv.org/pdf/2602.19763v1","arxiv_url":"http://arxiv.org/abs/2602.19763v1","comment":"","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":2.0,"score_axis_3":4.0,"composite":2.7,"summary":"Benchmark study training 10 stereo matching networks on the Canterbury Tree Branches dataset (5,313 pairs) for UAV forestry applications. BANet-3D achieves the best quality (SSIM=0.883), while AnyNet reaches 6.99 FPS for near-real-time inference on a Jetson Orin (illustrative depth-error sketch appended after the JSON).","reasoning":"Domain-specific benchmark study without novel methods or code. Forestry application is niche.","code_url":null,"s2_tldr":"This work presents the first study to train and test ten deep stereo matching networks on real tree branch images and finds that BANet-3D produces the best overall quality, while RAFT-Stereo scores highest on scene-level understanding.","s2_paper_id":"ba2e0953b05c48ae3f510b88b389a0e6f9853d5e","topics":"[\"Benchmark\", \"Efficiency\", \"3D / Vision\"]"},{"id":420,"run_id":1,"domain":"aiml","arxiv_id":"2602.20052","entry_id":"","title":"Entropy in Large Language Models","authors":"[\"Marco Scharringhausen\"]","abstract":"In this study, the output of large language models (LLM) is considered an information source generating an unlimited sequence of symbols drawn from a finite alphabet. 
Given the probabilistic nature of modern LLMs, we assume a probabilistic model for these LLMs, following a constant random distribution and the source itself thus being stationary. We compare this source entropy (per word) to that of natural language (written or spoken) as represented by the Open American National Corpus (OANC). Ou","published":"2026-02-23T17:02:45+00:00","categories":"[\"cs.CL\"]","pdf_url":"https://arxiv.org/pdf/2602.20052v1","arxiv_url":"http://arxiv.org/abs/2602.20052v1","comment":"7 pages, 2 figures, 3 tables","source":"arxiv","github_repo":"","github_stars":null,"hf_upvotes":0,"hf_models":"[]","hf_datasets":"[]","hf_spaces":"[]","score_axis_1":2.0,"score_axis_2":3.0,"score_axis_3":3.0,"composite":2.7,"summary":"Theoretical study comparing source entropy (per word) of LLM outputs to natural language using the Open American National Corpus. Assumes a probabilistic model with a constant output distribution, making the source stationary. Finds LLM per-word entropy is lower than that of natural language, whether written or spoken (illustrative entropy sketch appended after the JSON).","reasoning":"Theoretical entropy analysis with limited practical applicability. No novel methods or architectures. Purely analytical study without code or empirical validation on modern LLMs.","code_url":null,"s2_tldr":"This study compares the source entropy of the output of large language models (LLM) to that of natural language (written or spoken) as represented by the Open American National Corpus (OANC).","s2_paper_id":"739a2fc27365617574fa64a57951a6d6ec9c02c7","topics":"[\"Language Models\"]"}]}
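
For record 418 (arXiv:2602.17848), the comparison hinges on reading a next-token probability off a causal language model and lining it up with human cloze probabilities. The sketch below shows one common way to do this with Hugging Face transformers; GPT-2, the example cloze frame, and scoring only the candidate's first BPE piece are illustrative assumptions, since the record above does not specify the paper's models or alignment procedure.

# Next-token probability of a candidate cloze completion under a causal LM.
# GPT-2 and the cloze frame are stand-ins; arXiv:2602.17848's actual model
# choices and alignment procedure are not given in the record above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "She spread the warm bread with soft"  # cloze frame, blank at the end
candidate = " butter"                            # leading space: GPT-2 BPE quirk

with torch.no_grad():
    ids = tok(context, return_tensors="pt").input_ids
    logits = model(ids).logits[0, -1]            # logits over the next token
    probs = torch.softmax(logits, dim=-1)

cand_id = tok.encode(candidate)[0]               # first BPE piece only (simplification)
print(f"p({candidate!r} | context) = {probs[cand_id].item():.4f}")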
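
For record 419 (arXiv:2602.19763), the abstract's claim that small disparity errors cause noticeable depth errors follows directly from Z = f*B/d: differentiating gives |dZ| ~ (Z^2 / (f*B)) * |dd|, so depth error grows quadratically with working distance. A minimal numeric check; the focal length, baseline, and half-pixel error below are assumed values for illustration, not parameters reported by the paper.

# Sensitivity of stereo depth Z = f*B/d to a fixed disparity error.
# Camera parameters are illustrative assumptions, not values from arXiv:2602.19763.
f_px = 800.0   # focal length in pixels (assumed)
B_m = 0.12     # stereo baseline in metres (assumed)
err_px = 0.5   # disparity error in pixels (assumed)

def depth_m(disparity_px: float) -> float:
    """Depth from disparity via Z = f*B/d."""
    return f_px * B_m / disparity_px

for Z_true in (1.0, 2.0, 4.0, 8.0):        # working distances in metres
    d_true = f_px * B_m / Z_true           # disparity that yields Z_true
    Z_noisy = depth_m(d_true - err_px)     # same pixel error, farther target
    predicted = Z_true**2 * err_px / (f_px * B_m)  # first-order |dZ| ~ Z^2*err/(f*B)
    print(f"Z={Z_true:4.1f} m  depth error={Z_noisy - Z_true:6.3f} m "
          f"(first-order ~ {predicted:5.3f} m)")

Under these assumed parameters, the same half-pixel disparity error costs roughly 64x more depth accuracy at 8 m than at 1 m, which is why sub-pixel disparity quality dominates at UAV working distances.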
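
For record 420 (arXiv:2602.20052), the comparison is between per-word entropy of LLM output and of the OANC. A minimal sketch of one plausible zeroth-order (unigram) estimate, H = -sum_w p(w) log2 p(w); whitespace tokenization and the toy samples are assumptions, as the record does not give the paper's actual estimator or preprocessing.

# Unigram (zeroth-order) per-word entropy estimate for two text samples.
# One plausible estimator for the comparison in arXiv:2602.20052; the paper's
# actual tokenization and entropy estimator are assumptions here.
import math
from collections import Counter

def word_entropy_bits(text: str) -> float:
    """H = -sum p(w) * log2 p(w) over whitespace-separated lowercase words."""
    words = text.lower().split()
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy stand-ins for an LLM sample and a human corpus sample (illustrative only).
llm_sample = "the model said the answer is the answer is yes"
human_sample = "honestly I reckon the answer depends on who you ask"

print(f"LLM sample:   {word_entropy_bits(llm_sample):.3f} bits/word")
print(f"Human sample: {word_entropy_bits(human_sample):.3f} bits/word")

The repetitive sample scores lower entropy, matching the direction of the paper's finding that LLM word entropy sits below that of natural language.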