DVPS Scientific Watch - a dvps Collection

updated 3 days ago

Collection of external scientific material relevant to the project

HuggingFaceFW/finetranslations

Viewer • Updated Jan 9 • 3.33B • 20k • 296

Note Multilingual synthetic corpus for translation: over 1 trillion tokens of parallel text in English and 500+ languages, created by translating data from FineWeb2 into English using Gemma3 27B.
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

Paper • 2411.00136 • Published Oct 31, 2024

Note General overview of the performance of popular inference engines across different workload scenarios and GPU platforms.
The Illusion of Readiness in Health AI

Paper • 2509.18234 • Published Sep 22, 2025 • 1

Note This paper argues for proper, systematic evaluation so that claims can be adjusted appropriately. It reveals hidden fragilities and failure modes (such as reliance on masking artifacts, shortcut learning, or flawed reasoning) and highlights how models may perform well on specific benchmarks while failing to generalize.
The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?

Paper • 2601.07220 • Published Jan 12

Note A survey on multilingual LLMs with design recommendations for tokenization, sampling, architectures, and evaluation to support multilingual LMs.
google/translategemma-27b-it

Image-Text-to-Text • 29B • Updated Jan 28 • 29.6k • 379

Note Multimodal translation model covering around 55 languages, optimized from Gemma family models for the MT task.
TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models

Paper • 2510.02663 • Published Oct 3, 2025 • 2
Scaling Spatial Intelligence with Multimodal Foundation Models

Paper • 2511.13719 • Published Nov 17, 2025 • 50

Note MMFMs still exhibit surprising deficiencies in spatial intelligence. In this paper, the authors explore scaling up MMFMs to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (Qwen3-VL and InternVL3) and unified understanding and generation models.
Multimodal Foundation Models for Early Disease Detection

Paper • 2510.01899 • Published Oct 2, 2025

Note Most diagnostic models still process different modalities in isolation. This limits their ability to capture early, cross-modal disease signatures. This work introduces a MMFM built on a transformer architecture that integrates heterogeneous clinical data through modality-specific encoders and cross-modal attention.
Assessing the value of Geo-Foundational Models for Flood Inundation Mapping: Benchmarking models for Sentinel-1, Sentinel-2, and Planetscope for end-users

Paper • 2511.01990 • Published Nov 3, 2025

Note Geo-foundation models show promise for flood mapping, offering modest but consistent improvements over traditional approaches while reducing data and computational requirements.
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Paper • 2601.07372 • Published Jan 12 • 51

Note Splits memory and reasoning in large language models.
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation

Paper • 2412.07147 • Published Dec 10, 2024 • 5

Note Parallel image translation dataset spanning 840k images across 14 languages. The images are sourced from 8 languages and 28 categories. GPT-4 is used to translate the OCR-recognized text, which is verified using similarity to Google Translate and random human verification.
End-to-End Test-Time Training for Long Context

Paper • 2512.23675 • Published Dec 29, 2025 • 24
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

Paper • 2405.07960 • Published May 13, 2024 • 1

Note Work on benchmarking LLMs in the medical domain. Introduces a simulated clinical environment in which multimodal agents interact with patients, gather incomplete information, use tools, and reason over time.
Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

Paper • 2601.19325 • Published Jan 27 • 85

Note Interesting recent work on a multimodal LLM for scientific discovery incorporating a wide variety of scientific information and showing consistently high results on benchmarks.
EvalBlocks: A Modular Pipeline for Rapidly Evaluating Foundation Models in Medical Imaging

Paper • 2601.03811 • Published Jan 7

Note An evaluation framework that efficiently tracks datasets, model variants, aggregation choices, and downstream tasks while remaining fast, reproducible, and scalable. Keywords: evaluation, medical imaging, reproducibility.
LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation

Paper • 2503.02972 • Published Mar 4, 2025 • 25
Disentangling Language and Culture for Evaluating Multilingual Large Language Models

Paper • 2505.24635 • Published May 30, 2025

Note Dual Evaluation Framework to assess the multilingual capabilities of LLMs by decomposing the evaluation along the dimensions of linguistic medium and cultural context. Evaluations on a wide range of models reveal that models exhibit better performance when questions are culturally aligned with the language (finding supported by interpretability probing). Findings highlight the necessity of culturally and linguistically informed model evaluations.
"Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs

Paper • 2602.04729 • Published Feb 4

Note A benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual LLMs. Evaluation of 7 multilingual LLMs across 15 target languages. Findings demonstrate a persistent gap between grammatical adequacy and cultural resonance, highlighting the need for culturally informed training data, improved cross-lingual pragmatics, and evaluation paradigms that better reflect real-world communicative competence.
From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training

Paper • 2508.09224 • Published Aug 12, 2025
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain

Paper • 2509.26507 • Published Sep 30, 2025 • 551

Note Dragon Hatchling is a biologically inspired, scale-free, attention-based state space large language model architecture that combines Transformer-level performance with inherent interpretability through sparse, positive activations and Hebbian synaptic plasticity.
THOR: A Versatile Foundation Model for Earth Observation Climate and Society Applications

Paper • 2601.16011 • Published Jan 22

Note THOR is a versatile Earth Observation foundation model that unifies multi-resolution Sentinel-1, -2, and -3 satellite data in a single architecture.
A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Paper • 2505.12781 • Published May 19, 2025 • 2
EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data

Paper • 2602.12177 • Published Feb 12
Mercury: Ultra-Fast Language Models Based on Diffusion

Paper • 2506.17298 • Published Jun 17, 2025 • 11
CineMA: A Foundation Model for Cine Cardiac MRI

Paper • 2506.00679 • Published May 31, 2025
UniComp: A Unified Evaluation of Large Language Model Compression via Pruning, Quantization and Distillation

Paper • 2602.09130 • Published Feb 9

Note UniComp is a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. It evaluates compressed models along three dimensions: performance, reliability, and efficiency, using a diverse set of capability- and safety-oriented benchmarks together with a hardware-aware efficiency analysis.
zai-org/GLM-5

Text Generation • 754B • Updated Apr 5 • 69.4k • 2.11k
Exploring In-Image Machine Translation with Real-World Background

Paper • 2505.15282 • Published May 21, 2025

Note This work proposes a pipeline for In-Image Machine Translation. It involves decomposition of the source image into background and a simplified "text image". The text image undergoes an end-to-end translation in the image modality following which it is fused with the background image.
Ming-Omni: A Unified Multimodal Model for Perception and Generation

Paper • 2506.09344 • Published Jun 11, 2025 • 33

Note This multimodal model takes text, images, video and audio as input, and has generation capabilities for text, audio and images. Key characteristics include an MoE architecture with modality-specific routing, BPE audio token encoding for generation, and dedicated learnable query tokens corresponding to multiple spatial resolutions for image generation.
Beyond Language Modeling: An Exploration of Multimodal Pretraining

Paper • 2603.03276 • Published Mar 3 • 107
EchoingECG: An Electrocardiogram Cross-Modal Model for Echocardiogram Tasks

Paper • 2509.25791 • Published Sep 30, 2025

Note Introduces a cross-modal framework that distills ECHO knowledge into ECG representations, enabling ECG-based prediction of cardiac function metrics traditionally requiring ECHO. The method combines Probabilistic Cross-Modal Embeddings with ECHO-CLIP in a student-teacher setup.
OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation

Paper • 2511.13655 • Published Nov 17, 2025 • 12

Note Multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners.
MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management

Paper • 2603.22179 • Published Mar 23

Note Decentralized agent approaches for AD Cardiology
Spatial-RAG: Spatial Retrieval Augmented Generation for Real-World Geospatial Reasoning Questions

Paper • 2502.18470 • Published Feb 4, 2025

Note Spatial Retrieval-Augmented Generation (RAG) framework designed for geospatial question answering. Spatial-RAG integrates structured spatial databases with LLMs via a hybrid spatial retriever that combines sparse spatial filtering and dense semantic matching. It formulates the answering process as a multi-objective optimization over spatial and semantic relevance, identifying Pareto-optimal candidates and dynamically selecting the best response based on user intent.
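The Pareto-optimal selection step mentioned in this note can be illustrated with a minimal sketch. This is not the paper's implementation: the candidate names, the scores, and the weighted-sum rule used to stand in for "user intent" are all assumptions made for illustration.

```python
from typing import List, Tuple

# (name, spatial_score, semantic_score); higher is better for both.
# All names and numbers here are made up for illustration.
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate dominates on both scores."""
    front = []
    for name, sp, se in candidates:
        dominated = any(
            (o_sp >= sp and o_se >= se) and (o_sp > sp or o_se > se)
            for _, o_sp, o_se in candidates
        )
        if not dominated:
            front.append((name, sp, se))
    return front


def select(front: List[Candidate], spatial_weight: float) -> Candidate:
    """Pick one Pareto-optimal candidate with a weighted sum.

    spatial_weight is a hypothetical stand-in for user intent:
    near 1.0 favours spatial fit, near 0.0 favours semantic fit.
    """
    return max(
        front,
        key=lambda c: spatial_weight * c[1] + (1.0 - spatial_weight) * c[2],
    )


candidates = [
    ("cafe_a", 0.9, 0.4),
    ("cafe_b", 0.6, 0.8),
    ("cafe_c", 0.5, 0.5),  # dominated by cafe_b on both scores
    ("cafe_d", 0.2, 0.9),
]
front = pareto_front(candidates)
print([c[0] for c in front])
print(select(front, spatial_weight=0.8)[0])
print(select(front, spatial_weight=0.1)[0])
```

The front keeps every candidate that is a defensible trade-off, and the final choice only changes which trade-off is preferred.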
V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Paper • 2603.14482 • Published Mar 15 • 36

Note A self-supervised vision model that learns to understand the physical world by using a unified architecture to jointly train on both unlabelled images and moving videos.
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Paper • 2510.09665 • Published Oct 8, 2025 • 9
MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation

Paper • 2502.11903 • Published Feb 17, 2025
Brittlebench: Quantifying LLM robustness via prompt sensitivity

Paper • 2603.13285 • Published Apr 6

Note Novel evaluation pipeline to holistically evaluate the sensitivity of frontier models to noise and variability inherent in real-world user inputs. Authors find that semantics-preserving input perturbations can account for up to half of the performance variance for a given model. Brittlebench highlights the need for more robust evaluations and models, and allows us to systematically understand model brittleness.
LLMs and Speech: Integration vs. Combination

Paper • 2603.15045 • Published Mar 16

Note A study on how to best utilize pre-trained LLMs for automatic speech recognition. The authors compare the tight integration of an acoustic model (AM) with the LLM ("speech LLM") to the traditional way of combining AM and LLM via shallow fusion.
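For readers unfamiliar with shallow fusion, a minimal sketch of the general idea follows: at each decoding step the acoustic model's log-probability for a token is combined with a weighted language-model log-probability. The toy vocabulary, the probabilities, and the `lm_weight` parameter name are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Dict, Tuple


def shallow_fusion(
    am_log_probs: Dict[str, float],
    lm_log_probs: Dict[str, float],
    lm_weight: float,
) -> Tuple[str, Dict[str, float]]:
    """Combine per-token scores as log p_AM + lm_weight * log p_LM."""
    fused = {
        tok: am_lp + lm_weight * lm_log_probs[tok]
        for tok, am_lp in am_log_probs.items()
    }
    best = max(fused, key=fused.get)
    return best, fused


# Toy next-token distributions (made-up numbers).
am = {"there": math.log(0.5), "their": math.log(0.4), "they're": math.log(0.1)}
lm = {"there": math.log(0.1), "their": math.log(0.8), "they're": math.log(0.1)}

best_am_only, _ = shallow_fusion(am, lm, lm_weight=0.0)
best_fused, _ = shallow_fusion(am, lm, lm_weight=0.5)
print(best_am_only, best_fused)
```

With a weight of zero the acoustic model decides alone; a positive weight lets the language model overturn an acoustically close call.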
Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Paper • 2603.27460 • Published Mar 29 • 72

Note Survey of 1000+ datasets. Project page: https://github.com/uni-medical/Project-Imaging-X
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP

Paper • 2508.20570 • Published Feb 26

Note A training-free defense against typographic attacks in CLIP that works by locating and selectively ablating the attention heads responsible for processing injected text.
G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering

Paper • 2402.07630 • Published Feb 12, 2024 • 2
Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

Paper • 2603.29002 • Published Mar 30 • 6
SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing

Paper • 2601.09385 • Published Jan 14

Note A framework for training speech/audio/music multimodal LLMs, providing pluggable encoders, projectors, LLM backbones, and PEFT adapters, plus reference recipes and SoTA checkpoints for tasks like ASR and audio captioning.
Earth Embeddings Reveal Diverse Urban Signals from Space

Paper • 2604.03456 • Published Apr 3
MedGemma 1.5 Technical Report

Paper • 2604.05081 • Published Apr 6 • 15
BAAI Cardiac Agent: An intelligent multimodal agent for automated reasoning and diagnosis of cardiovascular diseases from cardiac magnetic resonance imaging

Paper • 2604.04078 • Published Apr 5 • 2

Note The paper introduces BAAI Cardiac Agent, a multimodal AI system for cardiac MRI analysis that combines image understanding, segmentation, reasoning, and disease diagnosis within a unified framework. It integrates a large vision-language model with specialized cardiac imaging modules, enabling automated interpretation across multiple MRI views and sequences. The work also contributes new datasets.
On Optimizing Multimodal Jailbreaks for Spoken Language Models

Paper • 2603.19127 • Published Mar 19

Note The paper investigates vulnerabilities of SLMs when subjected to gradient-based multimodal attacks. By introducing a joint optimization framework, it shows that multimodal attacks indeed threaten model safety alignment, resulting in up to 10× worse jailbreak rates than unimodal attacks. It shows that multimodal perturbations act on partially independent jailbreak spaces, and their combination tends to expose vulnerabilities that are not visible under unimodal attacks.
Transducing Language Models

Paper • 2603.05193 • Published Mar 5

Note The paper introduces a general framework for transforming language models using transducers.
TextME: Bridging Unseen Modalities Through Text Descriptions

Paper • 2602.03098 • Published Feb 3
Sensing Cardiac Health Across Scenarios and Devices: A Multi-Modal Foundation Model Pretrained on Heterogeneous Data from 1.7 Million Individuals

Paper • 2507.01045 • Published Jun 23, 2025 • 1

Note Cardiac sensing foundation model (CSFM). Multimodal integration of data from various large-scale datasets, comprising cardiac signals from approximately 1.7 million individuals and their corresponding clinical or machine-generated text reports.
Weak-SIGReg: Covariance Regularization for Stable Deep Learning

Paper • 2603.05924 • Published Mar 6 • 2

Note Simplified technique for regularizing a network so that it learns robust features for arbitrary downstream tasks.
There Will Be a Scientific Theory of Deep Learning

Paper • 2604.21691 • Published Apr 23 • 1
MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via Reasoning Agentic Workflow

Paper • 2503.18968 • Published Mar 21, 2025 • 8

Note A hierarchical AI framework that improves multi-modal medical diagnosis by combining clinical guideline planning with step-by-step diagnostic verification.
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

Paper • 2510.12000 • Published Oct 13, 2025 • 1

Note A single model that unifies audio understanding, text-to-audio generation, and multimodal reasoning - matching the quality of specialized state-of-the-art models across all three tasks.
BigEarthNet.txt: A Large-Scale Multi-Sensor Image-Text Dataset and Benchmark for Earth Observation

Paper • 2603.29630 • Published Mar 31 • 1

Note BigEarthNet.txt contains 464,044 co-registered Sentinel-1 synthetic aperture radar and Sentinel-2 multispectral images with 9.6M text annotations, including: i) geographically anchored captions describing land-use/land-cover (LULC) classes, their spatial relations, and environmental context; ii) visual question answering pairs relevant for different tasks; and iii) referring expression detection instructions for bounding box prediction.
TRADE: Transducer-Augmented Decoder for Speech LLM

Paper • 2606.08486 • Published 27 days ago

Note The paper presents TRADE (TRansducer-Augmented DEcoder), which augments a multimodal LLM with a transducer branch that shares the audio encoder and uses the LLM's hidden states directly as the prediction network -- coupling frame-synchronous acoustic alignment with the LLM's linguistic reasoning.
GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

Paper • 2606.08194 • Published 28 days ago • 4

Note The paper presents GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers grounded on naturally occurring audio.
A multimodal and temporal foundation model for virtual patient representations at healthcare system scale

Paper • 2604.18570 • Published Apr 20
Analysing The Impact of Sequence Composition on Language Model Pre-Training

Paper • 2402.13991 • Published Feb 21, 2024 • 1
Sakana Fugu Technical Report

Paper • 2606.21228 • Published 11 days ago • 1
A Generalizable Deep Learning System for Cardiac MRI

Paper • 2312.00357 • Published Dec 1, 2023
Training-Free Multimodal Large Language Model Orchestration

Paper • 2508.10016 • Published Aug 6, 2025 • 1

Note A training-free framework that integrates standalone modality experts into a unified multimodal system.
Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

Paper • 2606.12736 • Published 24 days ago • 5
How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Paper • 2603.19195 • Published Mar 19 • 4

Note Finding: an audio-LLM's audio performance is strongly predicted by its text-only backbone, and much "hearing" ability is inherited rather than learned from audio.
FastVLM: Efficient Vision Encoding for Vision Language Models

Paper • 2412.13303 • Published Dec 17, 2024 • 77
Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs

Paper • 2603.09906 • Published Mar 10 • 76
MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Paper • 2606.16673 • Published 19 days ago • 5
