Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeConCerNet: A Contrastive Learning Based Framework for Automated Conservation Law Discovery and Trustworthy Dynamical System Prediction
Deep neural networks (DNN) have shown great capacity of modeling a dynamical system; nevertheless, they usually do not obey physics constraints such as conservation laws. This paper proposes a new learning framework named ConCerNet to improve the trustworthiness of the DNN based dynamics modeling to endow the invariant properties. ConCerNet consists of two steps: (i) a contrastive learning method to automatically capture the system invariants (i.e. conservation properties) along the trajectory observations; (ii) a neural projection layer to guarantee that the learned dynamics models preserve the learned invariants. We theoretically prove the functional relationship between the learned latent representation and the unknown system invariant function. Experiments show that our method consistently outperforms the baseline neural networks in both coordinate error and conservation metrics by a large margin. With neural network based parameterization and no dependence on prior knowledge, our method can be extended to complex and large-scale dynamics by leveraging an autoencoder.
Optimization- and AI-based approaches to academic quality quantification for transparent academic recruitment: part 1-model development
For fair academic recruitment at universities and research institutions, determination of the right measure based on globally accepted academic quality features is a highly delicate, challenging, but quite important problem to be addressed. In a series of two papers, we consider the modeling part for academic quality quantification in the first paper, in this paper, and the case studies part in the second paper. For academic quality quantification modeling, we develop two computational frameworks which can be used to construct a decision-support tool: (i) an optimization-based framework and (ii) a Siamese network (a type of artificial neural network)-based framework. The output of both models is a single index called Academic Quality Index (AQI) which is a measure of the overall academic quality. The data of academics from first-class and average-class world universities, based on Times Higher Education World University Rankings and QS World University Rankings, are assumed as the reference data for tuning model parameters.
Graph Reinforcement Learning for Network Control via Bi-Level Optimization
Optimization problems over dynamic networks have been extensively studied and widely used in the past decades to formulate numerous real-world problems. However, (1) traditional optimization-based approaches do not scale to large networks, and (2) the design of good heuristics or approximation algorithms often requires significant manual trial-and-error. In this work, we argue that data-driven strategies can automate this process and learn efficient algorithms without compromising optimality. To do so, we present network control problems through the lens of reinforcement learning and propose a graph network-based framework to handle a broad class of problems. Instead of naively computing actions over high-dimensional graph elements, e.g., edges, we propose a bi-level formulation where we (1) specify a desired next state via RL, and (2) solve a convex program to best achieve it, leading to drastically improved scalability and performance. We further highlight a collection of desirable features to system designers, investigate design decisions, and present experiments on real-world control problems showing the utility, scalability, and flexibility of our framework.
Unraveling the cognitive patterns of Large Language Models through module communities
Large Language Models (LLMs) have reshaped our world with significant advancements in science, engineering, and society through applications ranging from scientific discoveries and medical diagnostics to Chatbots. Despite their ubiquity and utility, the underlying mechanisms of LLM remain concealed within billions of parameters and complex structures, making their inner architecture and cognitive processes challenging to comprehend. We address this gap by adopting approaches to understanding emerging cognition in biology and developing a network-based framework that links cognitive skills, LLM architectures, and datasets, ushering in a paradigm shift in foundation model analysis. The skill distribution in the module communities demonstrates that while LLMs do not strictly parallel the focalized specialization observed in specific biological systems, they exhibit unique communities of modules whose emergent skill patterns partially mirror the distributed yet interconnected cognitive organization seen in avian and small mammalian brains. Our numerical results highlight a key divergence from biological systems to LLMs, where skill acquisition benefits substantially from dynamic, cross-regional interactions and neural plasticity. By integrating cognitive science principles with machine learning, our framework provides new insights into LLM interpretability and suggests that effective fine-tuning strategies should leverage distributed learning dynamics rather than rigid modular interventions.
The 'Paris-end' of town? Urban typology through machine learning
The confluence of recent advances in availability of geospatial information, computing power, and artificial intelligence offers new opportunities to understand how and where our cities differ or are alike. Departing from a traditional `top-down' analysis of urban design features, this project analyses millions of images of urban form (consisting of street view, satellite imagery, and street maps) to find shared characteristics. A (novel) neural network-based framework is trained with imagery from the largest 1692 cities in the world and the resulting models are used to compare within-city locations from Melbourne and Sydney to determine the closest connections between these areas and their international comparators. This work demonstrates a new, consistent, and objective method to begin to understand the relationship between cities and their health, transport, and environmental consequences of their design. The results show specific advantages and disadvantages using each type of imagery. Neural networks trained with map imagery will be highly influenced by the mix of roads, public transport, and green and blue space as well as the structure of these elements. The colours of natural and built features stand out as dominant characteristics in satellite imagery. The use of street view imagery will emphasise the features of a human scaled visual geography of streetscapes. Finally, and perhaps most importantly, this research also answers the age-old question, ``Is there really a `Paris-end' to your city?''.
AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering
Large language models (LLMs) and agent-based frameworks have advanced rapidly, enabling diverse applications. Yet, with the proliferation of models and agentic strategies, practitioners face substantial uncertainty in selecting the best configuration for a downstream task. Prior studies show that different agents and backbones exhibit complementary strengths, and that larger models are not always superior, underscoring the need for adaptive routing mechanisms. Existing approaches to agent routing, however, often emphasize cost efficiency while overlooking the fine-grained contextual and relational structure inherent in QA tasks. In this paper, we propose tAgentRouter, a framework that formulates multi-agent QA as a knowledge-graph-guided routing problem supervised by empirical performance signals. Specifically, we convert QA instance into a knowledge graph that jointly encodes queries, contextual entities, and agents, and then train a heterogeneous graph neural network (GNN) to propagate information across node types and produce task-aware routing distributions over agents. By leveraging soft supervision and weighted aggregation of agent outputs, AgentRouter learns principled collaboration schemes that capture the complementary strengths of diverse agents. Extensive experiments demonstrate that our framework consistently outperforms single-agent and ensemble baselines, while generalizing across benchmarks and LLM backbones. These results highlight the effectiveness and robustness of graph-supervised multi-agent routing for question answering.
Real-time Holistic Robot Pose Estimation with Unknown States
Estimating robot pose from RGB images is a crucial problem in computer vision and robotics. While previous methods have achieved promising performance, most of them presume full knowledge of robot internal states, e.g. ground-truth robot joint angles. However, this assumption is not always valid in practical situations. In real-world applications such as multi-robot collaboration or human-robot interaction, the robot joint states might not be shared or could be unreliable. On the other hand, existing approaches that estimate robot pose without joint state priors suffer from heavy computation burdens and thus cannot support real-time applications. This work introduces an efficient framework for real-time robot pose estimation from RGB images without requiring known robot states. Our method estimates camera-to-robot rotation, robot state parameters, keypoint locations, and root depth, employing a neural network module for each task to facilitate learning and sim-to-real transfer. Notably, it achieves inference in a single feed-forward pass without iterative optimization. Our approach offers a 12-time speed increase with state-of-the-art accuracy, enabling real-time holistic robot pose estimation for the first time. Code and models are available at https://github.com/Oliverbansk/Holistic-Robot-Pose-Estimation.
Robust Angular Synchronization via Directed Graph Neural Networks
The angular synchronization problem aims to accurately estimate (up to a constant additive phase) a set of unknown angles theta_1, dots, theta_nin[0, 2pi) from m noisy measurements of their offsets theta_i-theta_j ;mod ; 2pi. Applications include, for example, sensor network localization, phase retrieval, and distributed clock synchronization. An extension of the problem to the heterogeneous setting (dubbed k-synchronization) is to estimate k groups of angles simultaneously, given noisy observations (with unknown group assignment) from each group. Existing methods for angular synchronization usually perform poorly in high-noise regimes, which are common in applications. In this paper, we leverage neural networks for the angular synchronization problem, and its heterogeneous extension, by proposing GNNSync, a theoretically-grounded end-to-end trainable framework using directed graph neural networks. In addition, new loss functions are devised to encode synchronization objectives. Experimental results on extensive data sets demonstrate that GNNSync attains competitive, and often superior, performance against a comprehensive set of baselines for the angular synchronization problem and its extension, validating the robustness of GNNSync even at high noise levels.
Robo-taxi Fleet Coordination at Scale via Reinforcement Learning
Fleets of robo-taxis offering on-demand transportation services, commonly known as Autonomous Mobility-on-Demand (AMoD) systems, hold significant promise for societal benefits, such as reducing pollution, energy consumption, and urban congestion. However, orchestrating these systems at scale remains a critical challenge, with existing coordination algorithms often failing to exploit the systems' full potential. This work introduces a novel decision-making framework that unites mathematical modeling with data-driven techniques. In particular, we present the AMoD coordination problem through the lens of reinforcement learning and propose a graph network-based framework that exploits the main strengths of graph representation learning, reinforcement learning, and classical operations research tools. Extensive evaluations across diverse simulation fidelities and scenarios demonstrate the flexibility of our approach, achieving superior system performance, computational efficiency, and generalizability compared to prior methods. Finally, motivated by the need to democratize research efforts in this area, we release publicly available benchmarks, datasets, and simulators for network-level coordination alongside an open-source codebase designed to provide accessible simulation platforms and establish a standardized validation process for comparing methodologies. Code available at: https://github.com/StanfordASL/RL4AMOD
CrossFi: A Cross Domain Wi-Fi Sensing Framework Based on Siamese Network
In recent years, Wi-Fi sensing has garnered significant attention due to its numerous benefits, such as privacy protection, low cost, and penetration ability. Extensive research has been conducted in this field, focusing on areas such as gesture recognition, people identification, and fall detection. However, many data-driven methods encounter challenges related to domain shift, where the model fails to perform well in environments different from the training data. One major factor contributing to this issue is the limited availability of Wi-Fi sensing datasets, which makes models learn excessive irrelevant information and over-fit to the training set. Unfortunately, collecting large-scale Wi-Fi sensing datasets across diverse scenarios is a challenging task. To address this problem, we propose CrossFi, a siamese network-based approach that excels in both in-domain scenario and cross-domain scenario, including few-shot, zero-shot scenarios, and even works in few-shot new-class scenario where testing set contains new categories. The core component of CrossFi is a sample-similarity calculation network called CSi-Net, which improves the structure of the siamese network by using an attention mechanism to capture similarity information, instead of simply calculating the distance or cosine similarity. Based on it, we develop an extra Weight-Net that can generate a template for each class, so that our CrossFi can work in different scenarios. Experimental results demonstrate that our CrossFi achieves state-of-the-art performance across various scenarios. In gesture recognition task, our CrossFi achieves an accuracy of 98.17% in in-domain scenario, 91.72% in one-shot cross-domain scenario, 64.81% in zero-shot cross-domain scenario, and 84.75% in one-shot new-class scenario. The code for our model is publicly available at https://github.com/RS2002/CrossFi.
PARAGRAPH2GRAPH: A GNN-based framework for layout paragraph analysis
Document layout analysis has a wide range of requirements across various domains, languages, and business scenarios. However, most current state-of-the-art algorithms are language-dependent, with architectures that rely on transformer encoders or language-specific text encoders, such as BERT, for feature extraction. These approaches are limited in their ability to handle very long documents due to input sequence length constraints and are closely tied to language-specific tokenizers. Additionally, training a cross-language text encoder can be challenging due to the lack of labeled multilingual document datasets that consider privacy. Furthermore, some layout tasks require a clean separation between different layout components without overlap, which can be difficult for image segmentation-based algorithms to achieve. In this paper, we present Paragraph2Graph, a language-independent graph neural network (GNN)-based model that achieves competitive results on common document layout datasets while being adaptable to business scenarios with strict separation. With only 19.95 million parameters, our model is suitable for industrial applications, particularly in multi-language scenarios.
3DFPN-HS$^2$: 3D Feature Pyramid Network Based High Sensitivity and Specificity Pulmonary Nodule Detection
Accurate detection of pulmonary nodules with high sensitivity and specificity is essential for automatic lung cancer diagnosis from CT scans. Although many deep learning-based algorithms make great progress for improving the accuracy of nodule detection, the high false positive rate is still a challenging problem which limited the automatic diagnosis in routine clinical practice. In this paper, we propose a novel pulmonary nodule detection framework based on a 3D Feature Pyramid Network (3DFPN) to improve the sensitivity of nodule detection by employing multi-scale features to increase the resolution of nodules, as well as a parallel top-down path to transit the high-level semantic features to complement low-level general features. Furthermore, a High Sensitivity and Specificity (HS^2) network is introduced to eliminate the falsely detected nodule candidates by tracking the appearance changes in continuous CT slices of each nodule candidate. The proposed framework is evaluated on the public Lung Nodule Analysis (LUNA16) challenge dataset. Our method is able to accurately detect lung nodules at high sensitivity and specificity and achieves 90.4% sensitivity with 1/8 false positive per scan which outperforms the state-of-the-art results 15.6%.
Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes
In this work we propose approaches to effectively transfer knowledge from weakly labeled web audio data. We first describe a convolutional neural network (CNN) based framework for sound event detection and classification using weakly labeled audio data. Our model trains efficiently from audios of variable lengths; hence, it is well suited for transfer learning. We then propose methods to learn representations using this model which can be effectively used for solving the target task. We study both transductive and inductive transfer learning tasks, showing the effectiveness of our methods for both domain and task adaptation. We show that the learned representations using the proposed CNN model generalizes well enough to reach human level accuracy on ESC-50 sound events dataset and set state of art results on this dataset. We further use them for acoustic scene classification task and once again show that our proposed approaches suit well for this task as well. We also show that our methods are helpful in capturing semantic meanings and relations as well. Moreover, in this process we also set state-of-art results on Audioset dataset, relying on balanced training set.
Learning fast, accurate, and stable closures of a kinetic theory of an active fluid
Important classes of active matter systems can be modeled using kinetic theories. However, kinetic theories can be high dimensional and challenging to simulate. Reduced-order representations based on tracking only low-order moments of the kinetic model serve as an efficient alternative, but typically require closure assumptions to model unrepresented higher-order moments. In this study, we present a learning framework based on neural networks that exploit rotational symmetries in the closure terms to learn accurate closure models directly from kinetic simulations. The data-driven closures demonstrate excellent a-priori predictions comparable to the state-of-the-art Bingham closure. We provide a systematic comparison between different neural network architectures and demonstrate that nonlocal effects can be safely ignored to model the closure terms. We develop an active learning strategy that enables accurate prediction of the closure terms across the entire parameter space using a single neural network without the need for retraining. We also propose a data-efficient training procedure based on time-stepping constraints and a differentiable pseudo-spectral solver, which enables the learning of stable closures suitable for a-posteriori inference. The coarse-grained simulations equipped with data-driven closure models faithfully reproduce the mean velocity statistics, scalar order parameters, and velocity power spectra observed in simulations of the kinetic theory. Our differentiable framework also facilitates the estimation of parameters in coarse-grained descriptions conditioned on data.
Deep Pixel-wise Binary Supervision for Face Presentation Attack Detection
Face recognition has evolved as a prominent biometric authentication modality. However, vulnerability to presentation attacks curtails its reliable deployment. Automatic detection of presentation attacks is essential for secure use of face recognition technology in unattended scenarios. In this work, we introduce a Convolutional Neural Network (CNN) based framework for presentation attack detection, with deep pixel-wise supervision. The framework uses only frame level information making it suitable for deployment in smart devices with minimal computational and time overhead. We demonstrate the effectiveness of the proposed approach in public datasets for both intra as well as cross-dataset experiments. The proposed approach achieves an HTER of 0% in Replay Mobile dataset and an ACER of 0.42% in Protocol-1 of OULU dataset outperforming state of the art methods.
Medical Image Registration via Neural Fields
Image registration is an essential step in many medical image analysis tasks. Traditional methods for image registration are primarily optimization-driven, finding the optimal deformations that maximize the similarity between two images. Recent learning-based methods, trained to directly predict transformations between two images, run much faster, but suffer from performance deficiencies due to model generalization and the inefficiency in handling individual image specific deformations. Here we present a new neural net based image registration framework, called NIR (Neural Image Registration), which is based on optimization but utilizes deep neural nets to model deformations between image pairs. NIR represents the transformation between two images with a continuous function implemented via neural fields, receiving a 3D coordinate as input and outputting the corresponding deformation vector. NIR provides two ways of generating deformation field: directly output a displacement vector field for general deformable registration, or output a velocity vector field and integrate the velocity field to derive the deformation field for diffeomorphic image registration. The optimal registration is discovered by updating the parameters of the neural field via stochastic gradient descent. We describe several design choices that facilitate model optimization, including coordinate encoding, sinusoidal activation, coordinate sampling, and intensity sampling. Experiments on two 3D MR brain scan datasets demonstrate that NIR yields state-of-the-art performance in terms of both registration accuracy and regularity, while running significantly faster than traditional optimization-based methods.
Unsupervised Learning of Long-Term Motion Dynamics for Videos
We present an unsupervised representation learning approach that compactly encodes the motion dependencies in videos. Given a pair of images from a video clip, our framework learns to predict the long-term 3D motions. To reduce the complexity of the learning framework, we propose to describe the motion as a sequence of atomic 3D flows computed with RGB-D modality. We use a Recurrent Neural Network based Encoder-Decoder framework to predict these sequences of flows. We argue that in order for the decoder to reconstruct these sequences, the encoder must learn a robust video representation that captures long-term motion dependencies and spatial-temporal relations. We demonstrate the effectiveness of our learned temporal representations on activity classification across multiple modalities and datasets such as NTU RGB+D and MSR Daily Activity 3D. Our framework is generic to any input modality, i.e., RGB, Depth, and RGB-D videos.
Robust Outlier Rejection for 3D Registration with Variational Bayes
Learning-based outlier (mismatched correspondence) rejection for robust 3D registration generally formulates the outlier removal as an inlier/outlier classification problem. The core for this to be successful is to learn the discriminative inlier/outlier feature representations. In this paper, we develop a novel variational non-local network-based outlier rejection framework for robust alignment. By reformulating the non-local feature learning with variational Bayesian inference, the Bayesian-driven long-range dependencies can be modeled to aggregate discriminative geometric context information for inlier/outlier distinction. Specifically, to achieve such Bayesian-driven contextual dependencies, each query/key/value component in our non-local network predicts a prior feature distribution and a posterior one. Embedded with the inlier/outlier label, the posterior feature distribution is label-dependent and discriminative. Thus, pushing the prior to be close to the discriminative posterior in the training step enables the features sampled from this prior at test time to model high-quality long-range dependencies. Notably, to achieve effective posterior feature guidance, a specific probabilistic graphical model is designed over our non-local model, which lets us derive a variational low bound as our optimization objective for model training. Finally, we propose a voting-based inlier searching strategy to cluster the high-quality hypothetical inliers for transformation estimation. Extensive experiments on 3DMatch, 3DLoMatch, and KITTI datasets verify the effectiveness of our method.
Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization
Diffusion models have emerged as a powerful tool rivaling GANs in generating high-quality samples with improved fidelity, flexibility, and robustness. A key component of these models is to learn the score function through score matching. Despite empirical success on various tasks, it remains unclear whether gradient-based algorithms can learn the score function with a provable accuracy. As a first step toward answering this question, this paper establishes a mathematical framework for analyzing score estimation using neural networks trained by gradient descent. Our analysis covers both the optimization and the generalization aspects of the learning procedure. In particular, we propose a parametric form to formulate the denoising score-matching problem as a regression with noisy labels. Compared to the standard supervised learning setup, the score-matching problem introduces distinct challenges, including unbounded input, vector-valued output, and an additional time variable, preventing existing techniques from being applied directly. In this paper, we show that with proper designs, the evolution of neural networks during training can be accurately modeled by a series of kernel regression tasks. Furthermore, by applying an early-stopping rule for gradient descent and leveraging recent developments in neural tangent kernels, we establish the first generalization error (sample complexity) bounds for learning the score function with neural networks, despite the presence of noise in the observations. Our analysis is grounded in a novel parametric form of the neural network and an innovative connection between score matching and regression analysis, facilitating the application of advanced statistical and optimization techniques.
OFFER: A Motif Dimensional Framework for Network Representation Learning
Aiming at better representing multivariate relationships, this paper investigates a motif dimensional framework for higher-order graph learning. The graph learning effectiveness can be improved through OFFER. The proposed framework mainly aims at accelerating and improving higher-order graph learning results. We apply the acceleration procedure from the dimensional of network motifs. Specifically, the refined degree for nodes and edges are conducted in two stages: (1) employ motif degree of nodes to refine the adjacency matrix of the network; and (2) employ motif degree of edges to refine the transition probability matrix in the learning process. In order to assess the efficiency of the proposed framework, four popular network representation algorithms are modified and examined. By evaluating the performance of OFFER, both link prediction results and clustering results demonstrate that the graph representation learning algorithms enhanced with OFFER consistently outperform the original algorithms with higher efficiency.
Splines-Based Feature Importance in Kolmogorov-Arnold Networks: A Framework for Supervised Tabular Data Dimensionality Reduction
High-dimensional datasets require effective feature selection to improve predictive performance, interpretability, and robustness. We propose and evaluate feature selection methods for tabular datasets based on Kolmogorov-Arnold networks (KANs), which parameterize feature transformations through splines, enabling direct access to interpretable importance measures. We introduce four KAN-based selectors (KAN-L1, KAN-L2, KAN-SI, KAN-KO) and compare them against classical baselines (LASSO, Random Forest, Mutual Information, SVM-RFE) across multiple classification and regression tabular dataset benchmarks. Average (over three retention levels: 20\%, 40\%, and 60\%) F1 scores and R^2 score results reveal that KAN-based selectors, particularly KAN-L2, KAN-L1, KAN-SI, and KAN-KO, are competitive with and sometimes superior to classical baselines in structured and synthetic datasets. However, KAN-L1 is often too aggressive in regression, removing useful features, while KAN-L2 underperforms in classification, where simple coefficient shrinkage misses complex feature interactions. KAN-L2 and KAN-SI provide robust performance on noisy regression datasets and heterogeneous datasets, aligning closely with ensemble predictors. In classification tasks, KAN selectors such as KAN-L1, KAN-KO, and KAN-SI sometimes surpass the other selectors by eliminating redundancy, particularly in high-dimensional multi-class data. Overall, our findings demonstrate that KAN-based feature selection provides a powerful and interpretable alternative to traditional methods, capable of uncovering nonlinear and multivariate feature relevance beyond sparsity or impurity-based measures.
A projection-based framework for gradient-free and parallel learning
We present a feasibility-seeking approach to neural network training. This mathematical optimization framework is distinct from conventional gradient-based loss minimization and uses projection operators and iterative projection algorithms. We reformulate training as a large-scale feasibility problem: finding network parameters and states that satisfy local constraints derived from its elementary operations. Training then involves projecting onto these constraints, a local operation that can be parallelized across the network. We introduce PJAX, a JAX-based software framework that enables this paradigm. PJAX composes projection operators for elementary operations, automatically deriving the solution operators for the feasibility problems (akin to autodiff for derivatives). It inherently supports GPU/TPU acceleration, provides a familiar NumPy-like API, and is extensible. We train diverse architectures (MLPs, CNNs, RNNs) on standard benchmarks using PJAX, demonstrating its functionality and generality. Our results show that this approach is as a compelling alternative to gradient-based training, with clear advantages in parallelism and the ability to handle non-differentiable operations.
A Misclassification Network-Based Method for Comparative Genomic Analysis
Classifying genome sequences based on metadata has been an active area of research in comparative genomics for decades with many important applications across the life sciences. Established methods for classifying genomes can be broadly grouped into sequence alignment-based and alignment-free models. Conventional alignment-based models rely on genome similarity measures calculated based on local sequence alignments or consistent ordering among sequences. However, such methods are computationally expensive when dealing with large ensembles of even moderately sized genomes. In contrast, alignment-free (AF) approaches measure genome similarity based on summary statistics in an unsupervised setting and are efficient enough to analyze large datasets. However, both alignment-based and AF methods typically assume fixed scoring rubrics that lack the flexibility to assign varying importance to different parts of the sequences based on prior knowledge. In this study, we integrate AI and network science approaches to develop a comparative genomic analysis framework that addresses these limitations. Our approach, termed the Genome Misclassification Network Analysis (GMNA), simultaneously leverages misclassified instances, a learned scoring rubric, and label information to classify genomes based on associated metadata and better understand potential drivers of misclassification. We evaluate the utility of the GMNA using Naive Bayes and convolutional neural network models, supplemented by additional experiments with transformer-based models, to construct SARS-CoV-2 sampling location classifiers using over 500,000 viral genome sequences and study the resulting network of misclassifications. We demonstrate the global health potential of the GMNA by leveraging the SARS-CoV-2 genome misclassification networks to investigate the role human mobility played in structuring geographic clustering of SARS-CoV-2.
EventTransAct: A video transformer-based framework for Event-camera based action recognition
Recognizing and comprehending human actions and gestures is a crucial perception requirement for robots to interact with humans and carry out tasks in diverse domains, including service robotics, healthcare, and manufacturing. Event cameras, with their ability to capture fast-moving objects at a high temporal resolution, offer new opportunities compared to standard action recognition in RGB videos. However, previous research on event camera action recognition has primarily focused on sensor-specific network architectures and image encoding, which may not be suitable for new sensors and limit the use of recent advancements in transformer-based architectures. In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame and then utilizes a temporal self-attention mechanism. In order to better adopt the VTN for the sparse and fine-grained nature of event data, we design Event-Contrastive Loss (L_{EC}) and event-specific augmentations. Proposed L_{EC} promotes learning fine-grained spatial cues in the spatial backbone of VTN by contrasting temporally misaligned frames. We evaluate our method on real-world action recognition of N-EPIC Kitchens dataset, and achieve state-of-the-art results on both protocols - testing in seen kitchen (74.9\% accuracy) and testing in unseen kitchens (42.43\% and 46.66\% Accuracy). Our approach also takes less computation time compared to competitive prior approaches, which demonstrates the potential of our framework EventTransAct for real-world applications of event-camera based action recognition. Project Page: https://tristandb8.github.io/EventTransAct_webpage/
LLM-Based Routing in Mixture of Experts: A Novel Framework for Trading
Recent advances in deep learning and large language models (LLMs) have facilitated the deployment of the mixture-of-experts (MoE) mechanism in the stock investment domain. While these models have demonstrated promising trading performance, they are often unimodal, neglecting the wealth of information available in other modalities, such as textual data. Moreover, the traditional neural network-based router selection mechanism fails to consider contextual and real-world nuances, resulting in suboptimal expert selection. To address these limitations, we propose LLMoE, a novel framework that employs LLMs as the router within the MoE architecture. Specifically, we replace the conventional neural network-based router with LLMs, leveraging their extensive world knowledge and reasoning capabilities to select experts based on historical price data and stock news. This approach provides a more effective and interpretable selection mechanism. Our experiments on multimodal real-world stock datasets demonstrate that LLMoE outperforms state-of-the-art MoE models and other deep neural network approaches. Additionally, the flexible architecture of LLMoE allows for easy adaptation to various downstream tasks.
A Bayesian Flow Network Framework for Chemistry Tasks
In this work, we introduce ChemBFN, a language model that handles chemistry tasks based on Bayesian flow networks working on discrete data. A new accuracy schedule is proposed to improve the sampling quality by significantly reducing the reconstruction loss. We show evidence that our method is appropriate for generating molecules with satisfied diversity even when a smaller number of sampling steps is used. A classifier-free guidance method is adapted for conditional generation. It is also worthwhile to point out that after generative training, our model can be fine-tuned on regression and classification tasks with the state-of-the-art performance, which opens the gate of building all-in-one models in a single module style. Our model has been open sourced at https://github.com/Augus1999/bayesian-flow-network-for-chemistry.
Big-data-driven and AI-based framework to enable personalization in wireless networks
Current communication networks use design methodologies that prevent the realization of maximum network efficiency. In the first place, while users' perception of satisfactory service diverges widely, current networks are designed to be a "universal fit," where they are generally over-engineered to deliver services appealing to all types of users. Also, current networks lack user-level data cognitive intelligence that would enable fast personalized network decisions and actions through automation. Thus, in this article, we propose the utilization of AI, big data analytics, and real-time non-intrusive user feedback in order to enable the personalization of wireless networks. Based on each user's actual QoS requirements and context, a multi-objective formulation enables the network to micro-manage and optimize the provided QoS and user satisfaction levels simultaneously. Moreover, in order to enable user feedback tracking and measurement, we propose a user satisfaction model based on the zone of tolerance concept. Furthermore, we propose a big-data-driven and AI-based personalization framework to integrate personalization into wireless networks. Finally, we implement a personalized network prototype to demonstrate the proposed personalization concept and its potential benefits through a case study. The case study shows how personalization can be realized to enable the efficient optimization of network resources such that certain requirement levels of user satisfaction and revenue in the form of saved resources are achieved.
NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism
Although deep learning and end-to-end models have been widely used and shown superiority in automatic speech recognition (ASR) and text-to-speech (TTS) synthesis, state-of-the-art forced alignment (FA) models are still based on hidden Markov model (HMM). HMM has limited view of contextual information and is developed with long pipelines, leading to error accumulation and unsatisfactory performance. Inspired by the capability of attention mechanism in capturing long term contextual information and learning alignments in ASR and TTS, we propose a neural network based end-to-end forced aligner called NeuFA, in which a novel bidirectional attention mechanism plays an essential role. NeuFA integrates the alignment learning of both ASR and TTS tasks in a unified framework by learning bidirectional alignment information from a shared attention matrix in the proposed bidirectional attention mechanism. Alignments are extracted from the learnt attention weights and optimized by the ASR, TTS and FA tasks in a multi-task learning manner. Experimental results demonstrate the effectiveness of our proposed model, with mean absolute error on test set drops from 25.8 ms to 23.7 ms at word level, and from 17.0 ms to 15.7 ms at phoneme level compared with state-of-the-art HMM based model.
A Network-based End-to-End Trainable Task-oriented Dialogue System
Teaching machines to accomplish tasks by conversing naturally with humans is challenging. Currently, developing task-oriented dialogue systems requires creating multiple components and typically this involves either a large amount of handcrafting, or acquiring costly labelled datasets to solve a statistical learning problem for each component. In this work we introduce a neural network-based text-in, text-out end-to-end trainable goal-oriented dialogue system along with a new way of collecting dialogue data based on a novel pipe-lined Wizard-of-Oz framework. This approach allows us to develop dialogue systems easily and without making too many assumptions about the task at hand. The results show that the model can converse with human subjects naturally whilst helping them to accomplish tasks in a restaurant search domain.
SALM: A Multi-Agent Framework for Language Model-Driven Social Network Simulation
Contemporary approaches to agent-based modeling (ABM) of social systems have traditionally emphasized rule-based behaviors, limiting their ability to capture nuanced dynamics by moving beyond predefined rules and leveraging contextual understanding from LMs of human social interaction. This paper presents SALM (Social Agent LM Framework), a novel approach for integrating language models (LMs) into social network simulation that achieves unprecedented temporal stability in multi-agent scenarios. Our primary contributions include: (1) a hierarchical prompting architecture enabling stable simulation beyond 4,000 timesteps while reducing token usage by 73%, (2) an attention-based memory system achieving 80% cache hit rates (95% CI [78%, 82%]) with sub-linear memory growth of 9.5%, and (3) formal bounds on personality stability. Through extensive validation against SNAP ego networks, we demonstrate the first LLM-based framework capable of modeling long-term social phenomena while maintaining empirically validated behavioral fidelity.
TreeFormer: a Semi-Supervised Transformer-based Framework for Tree Counting from a Single High Resolution Image
Automatic tree density estimation and counting using single aerial and satellite images is a challenging task in photogrammetry and remote sensing, yet has an important role in forest management. In this paper, we propose the first semisupervised transformer-based framework for tree counting which reduces the expensive tree annotations for remote sensing images. Our method, termed as TreeFormer, first develops a pyramid tree representation module based on transformer blocks to extract multi-scale features during the encoding stage. Contextual attention-based feature fusion and tree density regressor modules are further designed to utilize the robust features from the encoder to estimate tree density maps in the decoder. Moreover, we propose a pyramid learning strategy that includes local tree density consistency and local tree count ranking losses to utilize unlabeled images into the training process. Finally, the tree counter token is introduced to regulate the network by computing the global tree counts for both labeled and unlabeled images. Our model was evaluated on two benchmark tree counting datasets, Jiangsu, and Yosemite, as well as a new dataset, KCL-London, created by ourselves. Our TreeFormer outperforms the state of the art semi-supervised methods under the same setting and exceeds the fully-supervised methods using the same number of labeled images. The codes and datasets are available at https://github.com/HAAClassic/TreeFormer.
RankMixup: Ranking-Based Mixup Training for Network Calibration
Network calibration aims to accurately estimate the level of confidences, which is particularly important for employing deep neural networks in real-world systems. Recent approaches leverage mixup to calibrate the network's predictions during training. However, they do not consider the problem that mixtures of labels in mixup may not accurately represent the actual distribution of augmented samples. In this paper, we present RankMixup, a novel mixup-based framework alleviating the problem of the mixture of labels for network calibration. To this end, we propose to use an ordinal ranking relationship between raw and mixup-augmented samples as an alternative supervisory signal to the label mixtures for network calibration. We hypothesize that the network should estimate a higher level of confidence for the raw samples than the augmented ones (Fig.1). To implement this idea, we introduce a mixup-based ranking loss (MRL) that encourages lower confidences for augmented samples compared to raw ones, maintaining the ranking relationship. We also propose to leverage the ranking relationship among multiple mixup-augmented samples to further improve the calibration capability. Augmented samples with larger mixing coefficients are expected to have higher confidences and vice versa (Fig.1). That is, the order of confidences should be aligned with that of mixing coefficients. To this end, we introduce a novel loss, M-NDCG, in order to reduce the number of misaligned pairs of the coefficients and confidences. Extensive experimental results on standard benchmarks for network calibration demonstrate the effectiveness of RankMixup.
PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
Single image depth estimation is a foundational task in computer vision and generative modeling. However, prevailing depth estimation models grapple with accommodating the increasing resolutions commonplace in today's consumer cameras and devices. Existing high-resolution strategies show promise, but they often face limitations, ranging from error propagation to the loss of high-frequency details. We present PatchFusion, a novel tile-based framework with three key components to improve the current state of the art: (1) A patch-wise fusion network that fuses a globally-consistent coarse prediction with finer, inconsistent tiled predictions via high-level feature guidance, (2) A Global-to-Local (G2L) module that adds vital context to the fusion network, discarding the need for patch selection heuristics, and (3) A Consistency-Aware Training (CAT) and Inference (CAI) approach, emphasizing patch overlap consistency and thereby eradicating the necessity for post-processing. Experiments on UnrealStereo4K, MVS-Synth, and Middleburry 2014 demonstrate that our framework can generate high-resolution depth maps with intricate details. PatchFusion is independent of the base model for depth estimation. Notably, our framework built on top of SOTA ZoeDepth brings improvements for a total of 17.3% and 29.4% in terms of the root mean squared error (RMSE) on UnrealStereo4K and MVS-Synth, respectively.
DeeCLIP: A Robust and Generalizable Transformer-Based Framework for Detecting AI-Generated Images
This paper introduces DeeCLIP, a novel framework for detecting AI-generated images using CLIP-ViT and fusion learning. Despite significant advancements in generative models capable of creating highly photorealistic images, existing detection methods often struggle to generalize across different models and are highly sensitive to minor perturbations. To address these challenges, DeeCLIP incorporates DeeFuser, a fusion module that combines high-level and low-level features, improving robustness against degradations such as compression and blurring. Additionally, we apply triplet loss to refine the embedding space, enhancing the model's ability to distinguish between real and synthetic content. To further enable lightweight adaptation while preserving pre-trained knowledge, we adopt parameter-efficient fine-tuning using low-rank adaptation (LoRA) within the CLIP-ViT backbone. This approach supports effective zero-shot learning without sacrificing generalization. Trained exclusively on 4-class ProGAN data, DeeCLIP achieves an average accuracy of 89.00% on 19 test subsets composed of generative adversarial network (GAN) and diffusion models. Despite having fewer trainable parameters, DeeCLIP outperforms existing methods, demonstrating superior robustness against various generative models and real-world distortions. The code is publicly available at https://github.com/Mamadou-Keita/DeeCLIP for research purposes.
Neural network emulator to constrain the high-$z$ IGM thermal state from Lyman-$α$ forest flux auto-correlation function
We present a neural network emulator to constrain the thermal parameters of the intergalactic medium (IGM) at 5.4z6.0 using the Lyman-displaystylealpha (Lydisplaystylealpha) forest flux auto-correlation function. Our auto-differentiable JAX-based framework accelerates the surrogate model generation process using approximately 100 sparsely sampled Nyx hydrodynamical simulations with varying combinations of thermal parameters, i.e., the temperature at mean density T_{{0}}, the slope of the temperaturedisplaystyle-density relation displaystylegamma, and the mean transmission flux langle{F}{rangle}. We show that this emulator has a typical accuracy of 1.0% across the specified redshift range. Bayesian inference of the IGM thermal parameters, incorporating emulator uncertainty propagation, is further expedited using NumPyro Hamiltonian Monte Carlo. We compare both the inference results and computational cost of our framework with the traditional nearest-neighbor interpolation approach applied to the same set of mock Lyalpha flux. By examining the credibility contours of the marginalized posteriors for T_{{0}},gamma,and{langle}{F}{rangle} obtained using the emulator, the statistical reliability of measurements is established through inference on 100 realistic mock data sets of the auto-correlation function.
DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement
Documents often exhibit various forms of degradation, which make it hard to be read and substantially deteriorate the performance of an OCR system. In this paper, we propose an effective end-to-end framework named Document Enhancement Generative Adversarial Networks (DE-GAN) that uses the conditional GANs (cGANs) to restore severely degraded document images. To the best of our knowledge, this practice has not been studied within the context of generative adversarial deep networks. We demonstrate that, in different tasks (document clean up, binarization, deblurring and watermark removal), DE-GAN can produce an enhanced version of the degraded document with a high quality. In addition, our approach provides consistent improvements compared to state-of-the-art methods over the widely used DIBCO 2013, DIBCO 2017 and H-DIBCO 2018 datasets, proving its ability to restore a degraded document image to its ideal condition. The obtained results on a wide variety of degradation reveal the flexibility of the proposed model to be exploited in other document enhancement problems.
HyRet-Change: A hybrid retentive network for remote sensing change detection
Recently convolution and transformer-based change detection (CD) methods provide promising performance. However, it remains unclear how the local and global dependencies interact to effectively alleviate the pseudo changes. Moreover, directly utilizing standard self-attention presents intrinsic limitations including governing global feature representations limit to capture subtle changes, quadratic complexity, and restricted training parallelism. To address these limitations, we propose a Siamese-based framework, called HyRet-Change, which can seamlessly integrate the merits of convolution and retention mechanisms at multi-scale features to preserve critical information and enhance adaptability in complex scenes. Specifically, we introduce a novel feature difference module to exploit both convolutions and multi-head retention mechanisms in a parallel manner to capture complementary information. Furthermore, we propose an adaptive local-global interactive context awareness mechanism that enables mutual learning and enhances discrimination capability through information exchange. We perform experiments on three challenging CD datasets and achieve state-of-the-art performance compared to existing methods. Our source code is publicly available at https://github.com/mustansarfiaz/HyRect-Change.
Relation Classification via Recurrent Neural Network
Deep learning has gained much success in sentence-level relation classification. For example, convolutional neural networks (CNN) have delivered competitive performance without much effort on feature engineering as the conventional pattern-based methods. Thus a lot of works have been produced based on CNN structures. However, a key issue that has not been well addressed by the CNN-based method is the lack of capability to learn temporal features, especially long-distance dependency between nominal pairs. In this paper, we propose a simple framework based on recurrent neural networks (RNN) and compare it with CNN-based model. To show the limitation of popular used SemEval-2010 Task 8 dataset, we introduce another dataset refined from MIMLRE(Angeli et al., 2014). Experiments on two different datasets strongly indicates that the RNN-based model can deliver better performance on relation classification, and it is particularly capable of learning long-distance relation patterns. This makes it suitable for real-world applications where complicated expressions are often involved.
FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) plays a pivotal role in modern large language model applications, with numerous existing frameworks offering a wide range of functionalities to facilitate the development of RAG systems. However, we have identified several persistent challenges in these frameworks, including difficulties in algorithm reproduction and sharing, lack of new techniques, and high system overhead. To address these limitations, we introduce FlexRAG, an open-source framework specifically designed for research and prototyping. FlexRAG supports text-based, multimodal, and network-based RAG, providing comprehensive lifecycle support alongside efficient asynchronous processing and persistent caching capabilities. By offering a robust and flexible solution, FlexRAG enables researchers to rapidly develop, deploy, and share advanced RAG systems. Our toolkit and resources are available at https://github.com/ictnlp/FlexRAG{https://github.com/ictnlp/FlexRAG}.
Class-Incremental Grouping Network for Continual Audio-Visual Learning
Continual learning is a challenging problem in which models need to be trained on non-stationary data across sequential tasks for class-incremental learning. While previous methods have focused on using either regularization or rehearsal-based frameworks to alleviate catastrophic forgetting in image classification, they are limited to a single modality and cannot learn compact class-aware cross-modal representations for continual audio-visual learning. To address this gap, we propose a novel class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning. Our CIGN leverages learnable audio-visual class tokens and audio-visual grouping to continually aggregate class-aware features. Additionally, it utilizes class tokens distillation and continual grouping to prevent forgetting parameters learned from previous tasks, thereby improving the model's ability to capture discriminative audio-visual categories. We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks. Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance. Code is available at https://github.com/stoneMo/CIGN.
CodePDE: An Inference Framework for LLM-driven PDE Solver Generation
Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers rely on expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). Leveraging advanced inference-time algorithms and scaling strategies, CodePDE unlocks critical capacities of LLM for PDE solving: reasoning, debugging, selfrefinement, and test-time scaling -- all without task-specific tuning. CodePDE achieves superhuman performance across a range of representative PDE problems. We also present a systematic empirical analysis of LLM generated solvers, analyzing their accuracy, efficiency, and numerical scheme choices. Our findings highlight the promise and the current limitations of LLMs in PDE solving, offering a new perspective on solver design and opportunities for future model development. Our code is available at https://github.com/LithiumDA/CodePDE.
DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style Word Generator
Audio Visual Scene-aware Dialog (AVSD) is the task of generating a response for a question with a given scene, video, audio, and the history of previous turns in the dialog. Existing systems for this task employ the transformers or recurrent neural network-based architecture with the encoder-decoder framework. Even though these techniques show superior performance for this task, they have significant limitations: the model easily overfits only to memorize the grammatical patterns; the model follows the prior distribution of the vocabularies in a dataset. To alleviate the problems, we propose a Multimodal Semantic Transformer Network. It employs a transformer-based architecture with an attention-based word embedding layer that generates words by querying word embeddings. With this design, our model keeps considering the meaning of the words at the generation stage. The empirical results demonstrate the superiority of our proposed model that outperforms most of the previous works for the AVSD task.
MULAN: Multitask Universal Lesion Analysis Network for Joint Lesion Detection, Tagging, and Segmentation
When reading medical images such as a computed tomography (CT) scan, radiologists generally search across the image to find lesions, characterize and measure them, and then describe them in the radiological report. To automate this process, we propose a multitask universal lesion analysis network (MULAN) for joint detection, tagging, and segmentation of lesions in a variety of body parts, which greatly extends existing work of single-task lesion analysis on specific body parts. MULAN is based on an improved Mask R-CNN framework with three head branches and a 3D feature fusion strategy. It achieves the state-of-the-art accuracy in the detection and tagging tasks on the DeepLesion dataset, which contains 32K lesions in the whole body. We also analyze the relationship between the three tasks and show that tag predictions can improve detection accuracy via a score refinement layer.
Ambiguity in solving imaging inverse problems with deep learning based operators
In recent years, large convolutional neural networks have been widely used as tools for image deblurring, because of their ability in restoring images very precisely. It is well known that image deblurring is mathematically modeled as an ill-posed inverse problem and its solution is difficult to approximate when noise affects the data. Really, one limitation of neural networks for deblurring is their sensitivity to noise and other perturbations, which can lead to instability and produce poor reconstructions. In addition, networks do not necessarily take into account the numerical formulation of the underlying imaging problem, when trained end-to-end. In this paper, we propose some strategies to improve stability without losing to much accuracy to deblur images with deep-learning based methods. First, we suggest a very small neural architecture, which reduces the execution time for training, satisfying a green AI need, and does not extremely amplify noise in the computed image. Second, we introduce a unified framework where a pre-processing step balances the lack of stability of the following, neural network-based, step. Two different pre-processors are presented: the former implements a strong parameter-free denoiser, and the latter is a variational model-based regularized formulation of the latent imaging problem. This framework is also formally characterized by mathematical analysis. Numerical experiments are performed to verify the accuracy and stability of the proposed approaches for image deblurring when unknown or not-quantified noise is present; the results confirm that they improve the network stability with respect to noise. In particular, the model-based framework represents the most reliable trade-off between visual precision and robustness.
RISING a new framework for few-view tomographic image reconstruction with deep learning
This paper proposes a new two-step procedure for sparse-view tomographic image reconstruction. It is called RISING, since it combines an early-stopped Rapid Iterative Solver with a subsequent Iteration Network-based Gaining step. So far, regularized iterative methods have widely been used for X-ray computed tomography image reconstruction from low-sampled data, since they converge to a sparse solution in a suitable domain, as upheld by the Compressed Sensing theory. Unfortunately, their use is practically limited by their high computational cost which imposes to perform only a few iterations in the available time for clinical exams. Data-driven methods, using neural networks to post-process a coarse and noisy image obtained from geometrical algorithms, have been recently studied and appreciated for both their computational speed and accurate reconstructions. However, there is no evidence, neither theoretically nor numerically, that neural networks based algorithms solve the mathematical inverse problem modeling the tomographic reconstruction process. In our two-step approach, the first phase executes very few iterations of a regularized model-based algorithm whereas the second step completes the missing iterations by means of a neural network. The resulting hybrid deep-variational framework preserves the convergence properties of the iterative method and, at the same time, it exploits the computational speed and flexibility of a data-driven approach. Experiments performed on a simulated and a real data set confirm the numerical and visual accuracy of the reconstructed RISING images in short computational times.
Deep Learning-based Approaches for State Space Models: A Selective Review
State-space models (SSMs) offer a powerful framework for dynamical system analysis, wherein the temporal dynamics of the system are assumed to be captured through the evolution of the latent states, which govern the values of the observations. This paper provides a selective review of recent advancements in deep neural network-based approaches for SSMs, and presents a unified perspective for discrete time deep state space models and continuous time ones such as latent neural Ordinary Differential and Stochastic Differential Equations. It starts with an overview of the classical maximum likelihood based approach for learning SSMs, reviews variational autoencoder as a general learning pipeline for neural network-based approaches in the presence of latent variables, and discusses in detail representative deep learning models that fall under the SSM framework. Very recent developments, where SSMs are used as standalone architectural modules for improving efficiency in sequence modeling, are also examined. Finally, examples involving mixed frequency and irregularly-spaced time series data are presented to demonstrate the advantage of SSMs in these settings.
Deep-Q Learning with Hybrid Quantum Neural Network on Solving Maze Problems
Quantum computing holds great potential for advancing the limitations of machine learning algorithms to handle higher dimensions of data and reduce overall training parameters in deep learning (DL) models. This study uses a trainable variational quantum circuit (VQC) on a gate-based quantum computing model to investigate the potential for quantum benefit in a model-free reinforcement learning problem. Through a comprehensive investigation and evaluation of the current model and capabilities of quantum computers, we designed and trained a novel hybrid quantum neural network based on the latest Qiskit and PyTorch framework. We compared its performance with a full-classical CNN with and without an incorporated VQC. Our research provides insights into the potential of deep quantum learning to solve a maze problem and, potentially, other reinforcement learning problems. We conclude that reinforcement learning problems can be practical with reasonable training epochs. Moreover, a comparative study of full-classical and hybrid quantum neural networks is discussed to understand these two approaches' performance, advantages, and disadvantages to deep-Q learning problems, especially on larger-scale maze problems larger than 4x4.
Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech
Recently, we proposed a novel speaker diarization method called End-to-End-Neural-Diarization-vector clustering (EEND-vector clustering) that integrates clustering-based and end-to-end neural network-based diarization approaches into one framework. The proposed method combines advantages of both frameworks, i.e. high diarization performance and handling of overlapped speech based on EEND, and robust handling of long recordings with an arbitrary number of speakers based on clustering-based approaches. However, the method was only evaluated so far on simulated 2-speaker meeting-like data. This paper is to (1) report recent advances we made to this framework, including newly introduced robust constrained clustering algorithms, and (2) experimentally show that the method can now significantly outperform competitive diarization methods such as Encoder-Decoder Attractor (EDA)-EEND, on CALLHOME data which comprises real conversational speech data including overlapped speech and an arbitrary number of speakers. By further analyzing the experimental results, this paper also discusses pros and cons of the proposed method and reveals potential for further improvement. A set of the code to reproduce the results is available at https://github.com/nttcslab-sp/EEND-vector-clustering.
Position: Categorical Deep Learning is an Algebraic Theory of All Architectures
We present our position on the elusive quest for a general-purpose framework for specifying and studying deep learning architectures. Our opinion is that the key attempts made so far lack a coherent bridge between specifying constraints which models must satisfy and specifying their implementations. Focusing on building a such a bridge, we propose to apply category theory -- precisely, the universal algebra of monads valued in a 2-category of parametric maps -- as a single theory elegantly subsuming both of these flavours of neural network design. To defend our position, we show how this theory recovers constraints induced by geometric deep learning, as well as implementations of many architectures drawn from the diverse landscape of neural networks, such as RNNs. We also illustrate how the theory naturally encodes many standard constructs in computer science and automata theory.
InstaFace: Identity-Preserving Facial Editing with Single Image Inference
Facial appearance editing is crucial for digital avatars, AR/VR, and personalized content creation, driving realistic user experiences. However, preserving identity with generative models is challenging, especially in scenarios with limited data availability. Traditional methods often require multiple images and still struggle with unnatural face shifts, inconsistent hair alignment, or excessive smoothing effects. To overcome these challenges, we introduce a novel diffusion-based framework, InstaFace, to generate realistic images while preserving identity using only a single image. Central to InstaFace, we introduce an efficient guidance network that harnesses 3D perspectives by integrating multiple 3DMM-based conditionals without introducing additional trainable parameters. Moreover, to ensure maximum identity retention as well as preservation of background, hair, and other contextual features like accessories, we introduce a novel module that utilizes feature embeddings from a facial recognition model and a pre-trained vision-language model. Quantitative evaluations demonstrate that our method outperforms several state-of-the-art approaches in terms of identity preservation, photorealism, and effective control of pose, expression, and lighting.
Examining the Source of Defects from a Mechanical Perspective for 3D Anomaly Detection
In this paper, we explore a novel approach to 3D anomaly detection (AD) that goes beyond merely identifying anomalies based on structural characteristics. Our primary perspective is that most anomalies arise from unpredictable defective forces originating from both internal and external sources. To address these anomalies, we seek out opposing forces that can help correct them. Therefore, we introduce the Mechanics Complementary Model-based Framework for the 3D-AD task (MC4AD), which generates internal and external corrective forces for each point. We first propose a Diverse Anomaly-Generation (DA-Gen) module designed to simulate various types of anomalies. Next, we present the Corrective Force Prediction Network (CFP-Net), which uses complementary representations for point-level analysis to simulate the different contributions from internal and external corrective forces. To ensure the corrective forces are constrained effectively, we have developed a combined loss function that includes a new symmetric loss and an overall loss. Notably, we implement a Hierarchical Quality Control (HQC) strategy based on a three-way decision process and contribute a dataset titled Anomaly-IntraVariance, which incorporates intraclass variance to evaluate our model. As a result, the proposed MC4AD has been proven effective through theory and experimentation. The experimental results demonstrate that our approach yields nine state-of-the-art performances, achieving optimal results with minimal parameters and the fastest inference speed across five existing datasets, in addition to the proposed Anomaly-IntraVariance dataset. The source is available at https://github.com/hzzzzzhappy/MC4AD
Towards Real-World Burst Image Super-Resolution: Benchmark and Method
Despite substantial advances, single-image super-resolution (SISR) is always in a dilemma to reconstruct high-quality images with limited information from one input image, especially in realistic scenarios. In this paper, we establish a large-scale real-world burst super-resolution dataset, i.e., RealBSR, to explore the faithful reconstruction of image details from multiple frames. Furthermore, we introduce a Federated Burst Affinity network (FBAnet) to investigate non-trivial pixel-wise displacements among images under real-world image degradation. Specifically, rather than using pixel-wise alignment, our FBAnet employs a simple homography alignment from a structural geometry aspect and a Federated Affinity Fusion (FAF) strategy to aggregate the complementary information among frames. Those fused informative representations are fed to a Transformer-based module of burst representation decoding. Besides, we have conducted extensive experiments on two versions of our datasets, i.e., RealBSR-RAW and RealBSR-RGB. Experimental results demonstrate that our FBAnet outperforms existing state-of-the-art burst SR methods and also achieves visually-pleasant SR image predictions with model details. Our dataset, codes, and models are publicly available at https://github.com/yjsunnn/FBANet.
PlaneRecTR: Unified Query Learning for 3D Plane Recovery from a Single View
3D plane recovery from a single image can usually be divided into several subtasks of plane detection, segmentation, parameter estimation and possibly depth estimation. Previous works tend to solve this task by either extending the RCNN-based segmentation network or the dense pixel embedding-based clustering framework. However, none of them tried to integrate above related subtasks into a unified framework but treat them separately and sequentially, which we suspect is potentially a main source of performance limitation for existing approaches. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper, we propose PlaneRecTR, a Transformer-based architecture, which for the first time unifies all subtasks related to single-view plane recovery with a single compact model. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across subtasks, obtaining a new state-of-the-art performance on public ScanNet and NYUv2-Plane datasets. Codes are available at https://github.com/SJingjia/PlaneRecTR.
Improving Intrinsic Exploration by Creating Stationary Objectives
Exploration bonuses in reinforcement learning guide long-horizon exploration by defining custom intrinsic objectives. Several exploration objectives like count-based bonuses, pseudo-counts, and state-entropy maximization are non-stationary and hence are difficult to optimize for the agent. While this issue is generally known, it is usually omitted and solutions remain under-explored. The key contribution of our work lies in transforming the original non-stationary rewards into stationary rewards through an augmented state representation. For this purpose, we introduce the Stationary Objectives For Exploration (SOFE) framework. SOFE requires identifying sufficient statistics for different exploration bonuses and finding an efficient encoding of these statistics to use as input to a deep network. SOFE is based on proposing state augmentations that expand the state space but hold the promise of simplifying the optimization of the agent's objective. We show that SOFE improves the performance of several exploration objectives, including count-based bonuses, pseudo-counts, and state-entropy maximization. Moreover, SOFE outperforms prior methods that attempt to stabilize the optimization of intrinsic objectives. We demonstrate the efficacy of SOFE in hard-exploration problems, including sparse-reward tasks, pixel-based observations, 3D navigation, and procedurally generated environments.
Multimodal Fake News Detection via CLIP-Guided Learning
Multimodal fake news detection has attracted many research interests in social forensics. Many existing approaches introduce tailored attention mechanisms to guide the fusion of unimodal features. However, how the similarity of these features is calculated and how it will affect the decision-making process in FND are still open questions. Besides, the potential of pretrained multi-modal feature learning models in fake news detection has not been well exploited. This paper proposes a FND-CLIP framework, i.e., a multimodal Fake News Detection network based on Contrastive Language-Image Pretraining (CLIP). Given a targeted multimodal news, we extract the deep representations from the image and text using a ResNet-based encoder, a BERT-based encoder and two pair-wise CLIP encoders. The multimodal feature is a concatenation of the CLIP-generated features weighted by the standardized cross-modal similarity of the two modalities. The extracted features are further processed for redundancy reduction before feeding them into the final classifier. We introduce a modality-wise attention module to adaptively reweight and aggregate the features. We have conducted extensive experiments on typical fake news datasets. The results indicate that the proposed framework has a better capability in mining crucial features for fake news detection. The proposed FND-CLIP can achieve better performances than previous works, i.e., 0.7\%, 6.8\% and 1.3\% improvements in overall accuracy on Weibo, Politifact and Gossipcop, respectively. Besides, we justify that CLIP-based learning can allow better flexibility on multimodal feature selection.
Reduce Information Loss in Transformers for Pluralistic Image Inpainting
Transformers have achieved great success in pluralistic image inpainting recently. However, we find existing transformer based solutions regard each pixel as a token, thus suffer from information loss issue from two aspects: 1) They downsample the input image into much lower resolutions for efficiency consideration, incurring information loss and extra misalignment for the boundaries of masked regions. 2) They quantize 256^3 RGB pixels to a small number (such as 512) of quantized pixels. The indices of quantized pixels are used as tokens for the inputs and prediction targets of transformer. Although an extra CNN network is used to upsample and refine the low-resolution results, it is difficult to retrieve the lost information back.To keep input information as much as possible, we propose a new transformer based framework "PUT". Specifically, to avoid input downsampling while maintaining the computation efficiency, we design a patch-based auto-encoder P-VQVAE, where the encoder converts the masked image into non-overlapped patch tokens and the decoder recovers the masked regions from inpainted tokens while keeping the unmasked regions unchanged. To eliminate the information loss caused by quantization, an Un-Quantized Transformer (UQ-Transformer) is applied, which directly takes the features from P-VQVAE encoder as input without quantization and regards the quantized tokens only as prediction targets. Extensive experiments show that PUT greatly outperforms state-of-the-art methods on image fidelity, especially for large masked regions and complex large-scale datasets. Code is available at https://github.com/liuqk3/PUT
Inspection-L: Self-Supervised GNN Node Embeddings for Money Laundering Detection in Bitcoin
Criminals have become increasingly experienced in using cryptocurrencies, such as Bitcoin, for money laundering. The use of cryptocurrencies can hide criminal identities and transfer hundreds of millions of dollars of dirty funds through their criminal digital wallets. However, this is considered a paradox because cryptocurrencies are goldmines for open-source intelligence, giving law enforcement agencies more power when conducting forensic analyses. This paper proposed Inspection-L, a graph neural network (GNN) framework based on a self-supervised Deep Graph Infomax (DGI) and Graph Isomorphism Network (GIN), with supervised learning algorithms, namely Random Forest (RF), to detect illicit transactions for anti-money laundering (AML). To the best of our knowledge, our proposal is the first to apply self-supervised GNNs to the problem of AML in Bitcoin. The proposed method was evaluated on the Elliptic dataset and shows that our approach outperforms the state-of-the-art in terms of key classification metrics, which demonstrates the potential of self-supervised GNN in the detection of illicit cryptocurrency transactions.
EFLNet: Enhancing Feature Learning for Infrared Small Target Detection
Single-frame infrared small target detection is considered to be a challenging task, due to the extreme imbalance between target and background, bounding box regression is extremely sensitive to infrared small targets, and small target information is easy to lose in the high-level semantic layer. In this paper, we propose an enhancing feature learning network (EFLNet) based on YOLOv7 framework to solve these problems. First, we notice that there is an extremely imbalance between the target and the background in the infrared image, which makes the model pay more attention to the background features, resulting in missed detection. To address this problem, we propose a new adaptive threshold focal loss function that adjusts the loss weight automatically, compelling the model to allocate greater attention to target features. Second, we introduce the normalized Gaussian Wasserstein distance to alleviate the difficulty of model convergence caused by the extreme sensitivity of the bounding box regression to infrared small targets. Finally, we incorporate a dynamic head mechanism into the network to enable adaptive learning of the relative importance of each semantic layer. Experimental results demonstrate our method can achieve better performance in the detection performance of infrared small targets compared to state-of-the-art deep-learning based methods.
Simulating Fluids in Real-World Still Images
In this work, we tackle the problem of real-world fluid animation from a still image. The key of our system is a surface-based layered representation deriving from video decomposition, where the scene is decoupled into a surface fluid layer and an impervious background layer with corresponding transparencies to characterize the composition of the two layers. The animated video can be produced by warping only the surface fluid layer according to the estimation of fluid motions and recombining it with the background. In addition, we introduce surface-only fluid simulation, a 2.5D fluid calculation version, as a replacement for motion estimation. Specifically, we leverage the triangular mesh based on a monocular depth estimator to represent the fluid surface layer and simulate the motion in the physics-based framework with the inspiration of the classic theory of the hybrid Lagrangian-Eulerian method, along with a learnable network so as to adapt to complex real-world image textures. We demonstrate the effectiveness of the proposed system through comparison with existing methods in both standard objective metrics and subjective ranking scores. Extensive experiments not only indicate our method's competitive performance for common fluid scenes but also better robustness and reasonability under complex transparent fluid scenarios. Moreover, as the proposed surface-based layer representation and surface-only fluid simulation naturally disentangle the scene, interactive editing such as adding objects to the river and texture replacing could be easily achieved with realistic results.
DeepFEA: Deep Learning for Prediction of Transient Finite Element Analysis Solutions
Finite Element Analysis (FEA) is a powerful but computationally intensive method for simulating physical phenomena. Recent advancements in machine learning have led to surrogate models capable of accelerating FEA. Yet there are still limitations in developing surrogates of transient FEA models that can simultaneously predict the solutions for both nodes and elements with applicability on both the 2D and 3D domains. Motivated by this research gap, this study proposes DeepFEA, a deep learning-based framework that leverages a multilayer Convolutional Long Short-Term Memory (ConvLSTM) network branching into two parallel convolutional neural networks to predict the solutions for both nodes and elements of FEA models. The proposed network is optimized using a novel adaptive learning algorithm, called Node-Element Loss Optimization (NELO). NELO minimizes the error occurring at both branches of the network enabling the prediction of solutions for transient FEA simulations. The experimental evaluation of DeepFEA is performed on three datasets in the context of structural mechanics, generated to serve as publicly available reference datasets. The results show that DeepFEA can achieve less than 3% normalized mean and root mean squared error for 2D and 3D simulation scenarios, and inference times that are two orders of magnitude faster than FEA. In contrast, relevant state-of-the-art methods face challenges with multi-dimensional output and dynamic input prediction. Furthermore, DeepFEA's robustness was demonstrated in a real-life biomedical scenario, confirming its suitability for accurate and efficient predictions of FEA simulations.
OpenSD: Unified Open-Vocabulary Segmentation and Detection
Recently, a few open-vocabulary methods have been proposed by employing a unified architecture to tackle generic segmentation and detection tasks. However, their performance still lags behind the task-specific models due to the conflict between different tasks, and their open-vocabulary capability is limited due to the inadequate use of CLIP. To address these challenges, we present a universal transformer-based framework, abbreviated as OpenSD, which utilizes the same architecture and network parameters to handle open-vocabulary segmentation and detection tasks. First, we introduce a decoder decoupled learning strategy to alleviate the semantic conflict between thing and staff categories so that each individual task can be learned more effectively under the same framework. Second, to better leverage CLIP for end-to-end segmentation and detection, we propose dual classifiers to handle the in-vocabulary domain and out-of-vocabulary domain, respectively. The text encoder is further trained to be region-aware for both thing and stuff categories through decoupled prompt learning, enabling them to filter out duplicated and low-quality predictions, which is important to end-to-end segmentation and detection. Extensive experiments are conducted on multiple datasets under various circumstances. The results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed- and open-vocabulary settings. Code is available at https://github.com/strongwolf/OpenSD
An Automatic SOAP Classification System Using Weakly Supervision And Transfer Learning
In this paper, we introduce a comprehensive framework for developing a machine learning-based SOAP (Subjective, Objective, Assessment, and Plan) classification system without manually SOAP annotated training data or with less manually SOAP annotated training data. The system is composed of the following two parts: 1) Data construction, 2) A neural network-based SOAP classifier, and 3) Transfer learning framework. In data construction, since a manual construction of a large size training dataset is expensive, we propose a rule-based weak labeling method utilizing the structured information of an EHR note. Then, we present a SOAP classifier composed of a pre-trained language model and bi-directional long-short term memory with conditional random field (Bi-LSTM-CRF). Finally, we propose a transfer learning framework that re-uses the trained parameters of the SOAP classifier trained with the weakly labeled dataset for datasets collected from another hospital. The proposed weakly label-based learning model successfully performed SOAP classification (89.99 F1-score) on the notes collected from the target hospital. Otherwise, in the notes collected from other hospitals and departments, the performance dramatically decreased. Meanwhile, we verified that the transfer learning framework is advantageous for inter-hospital adaptation of the model increasing the models' performance in every cases. In particular, the transfer learning approach was more efficient when the manually annotated data size was smaller. We showed that SOAP classification models trained with our weakly labeling algorithm can perform SOAP classification without manually annotated data on the EHR notes from the same hospital. The transfer learning framework helps SOAP classification model's inter-hospital migration with a minimal size of the manually annotated dataset.
HuPR: A Benchmark for Human Pose Estimation Using Millimeter Wave Radar
This paper introduces a novel human pose estimation benchmark, Human Pose with Millimeter Wave Radar (HuPR), that includes synchronized vision and radio signal components. This dataset is created using cross-calibrated mmWave radar sensors and a monocular RGB camera for cross-modality training of radar-based human pose estimation. There are two advantages of using mmWave radar to perform human pose estimation. First, it is robust to dark and low-light conditions. Second, it is not visually perceivable by humans and thus, can be widely applied to applications with privacy concerns, e.g., surveillance systems in patient rooms. In addition to the benchmark, we propose a cross-modality training framework that leverages the ground-truth 2D keypoints representing human body joints for training, which are systematically generated from the pre-trained 2D pose estimation network based on a monocular camera input image, avoiding laborious manual label annotation efforts. The framework consists of a new radar pre-processing method that better extracts the velocity information from radar data, Cross- and Self-Attention Module (CSAM), to fuse multi-scale radar features, and Pose Refinement Graph Convolutional Networks (PRGCN), to refine the predicted keypoint confidence heatmaps. Our intensive experiments on the HuPR benchmark show that the proposed scheme achieves better human pose estimation performance with only radar data, as compared to traditional pre-processing solutions and previous radio-frequency-based methods.
FlowTransformer: A Transformer Framework for Flow-based Network Intrusion Detection Systems
This paper presents the FlowTransformer framework, a novel approach for implementing transformer-based Network Intrusion Detection Systems (NIDSs). FlowTransformer leverages the strengths of transformer models in identifying the long-term behaviour and characteristics of networks, which are often overlooked by most existing NIDSs. By capturing these complex patterns in network traffic, FlowTransformer offers a flexible and efficient tool for researchers and practitioners in the cybersecurity community who are seeking to implement NIDSs using transformer-based models. FlowTransformer allows the direct substitution of various transformer components, including the input encoding, transformer, classification head, and the evaluation of these across any flow-based network dataset. To demonstrate the effectiveness and efficiency of the FlowTransformer framework, we utilise it to provide an extensive evaluation of various common transformer architectures, such as GPT 2.0 and BERT, on three commonly used public NIDS benchmark datasets. We provide results for accuracy, model size and speed. A key finding of our evaluation is that the choice of classification head has the most significant impact on the model performance. Surprisingly, Global Average Pooling, which is commonly used in text classification, performs very poorly in the context of NIDS. In addition, we show that model size can be reduced by over 50\%, and inference and training times improved, with no loss of accuracy, by making specific choices of input encoding and classification head instead of other commonly used alternatives.
SparseByteNN: A Novel Mobile Inference Acceleration Framework Based on Fine-Grained Group Sparsity
To address the challenge of increasing network size, researchers have developed sparse models through network pruning. However, maintaining model accuracy while achieving significant speedups on general computing devices remains an open problem. In this paper, we present a novel mobile inference acceleration framework SparseByteNN, which leverages fine-grained kernel sparsity to achieve real-time execution as well as high accuracy. Our framework consists of two parts: (a) A fine-grained kernel sparsity schema with a sparsity granularity between structured pruning and unstructured pruning. It designs multiple sparse patterns for different operators. Combined with our proposed whole network rearrangement strategy, the schema achieves a high compression rate and high precision at the same time. (b) Inference engine co-optimized with the sparse pattern. The conventional wisdom is that this reduction in theoretical FLOPs does not translate into real-world efficiency gains. We aim to correct this misconception by introducing a family of efficient sparse kernels for ARM and WebAssembly. Equipped with our efficient implementation of sparse primitives, we show that sparse versions of MobileNet-v1 outperform strong dense baselines on the efficiency-accuracy curve. Experimental results on Qualcomm 855 show that for 30% sparse MobileNet-v1, SparseByteNN achieves 1.27x speedup over the dense version and 1.29x speedup over the state-of-the-art sparse inference engine MNN with a slight accuracy drop of 0.224%. The source code of SparseByteNN will be available at https://github.com/lswzjuer/SparseByteNN
A Daily Tourism Demand Prediction Framework Based on Multi-head Attention CNN: The Case of The Foreign Entrant in South Korea
Developing an accurate tourism forecasting model is essential for making desirable policy decisions for tourism management. Early studies on tourism management focus on discovering external factors related to tourism demand. Recent studies utilize deep learning in demand forecasting along with these external factors. They mainly use recursive neural network models such as LSTM and RNN for their frameworks. However, these models are not suitable for use in forecasting tourism demand. This is because tourism demand is strongly affected by changes in various external factors, and recursive neural network models have limitations in handling these multivariate inputs. We propose a multi-head attention CNN model (MHAC) for addressing these limitations. The MHAC uses 1D-convolutional neural network to analyze temporal patterns and the attention mechanism to reflect correlations between input variables. This model makes it possible to extract spatiotemporal characteristics from time-series data of various variables. We apply our forecasting framework to predict inbound tourist changes in South Korea by considering external factors such as politics, disease, season, and attraction of Korean culture. The performance results of extensive experiments show that our method outperforms other deep-learning-based prediction frameworks in South Korea tourism forecasting.
VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping
Video face swapping is becoming increasingly popular across various applications, yet existing methods primarily focus on static images and struggle with video face swapping because of temporal consistency and complex scenarios. In this paper, we present the first diffusion-based framework specifically designed for video face swapping. Our approach introduces a novel image-video hybrid training framework that leverages both abundant static image data and temporal video sequences, addressing the inherent limitations of video-only training. The framework incorporates a specially designed diffusion model coupled with a VidFaceVAE that effectively processes both types of data to better maintain temporal coherence of the generated videos. To further disentangle identity and pose features, we construct the Attribute-Identity Disentanglement Triplet (AIDT) Dataset, where each triplet has three face images, with two images sharing the same pose and two sharing the same identity. Enhanced with a comprehensive occlusion augmentation, this dataset also improves robustness against occlusions. Additionally, we integrate 3D reconstruction techniques as input conditioning to our network for handling large pose variations. Extensive experiments demonstrate that our framework achieves superior performance in identity preservation, temporal consistency, and visual quality compared to existing methods, while requiring fewer inference steps. Our approach effectively mitigates key challenges in video face swapping, including temporal flickering, identity preservation, and robustness to occlusions and pose variations.
VPU-EM: An Event-based Modeling Framework to Evaluate NPU Performance and Power Efficiency at Scale
State-of-art NPUs are typically architected as a self-contained sub-system with multiple heterogeneous hardware computing modules, and a dataflow-driven programming model. There lacks well-established methodology and tools in the industry to evaluate and compare the performance of NPUs from different architectures. We present an event-based performance modeling framework, VPU-EM, targeting scalable performance evaluation of modern NPUs across diversified AI workloads. The framework adopts high-level event-based system-simulation methodology to abstract away design details for speed, while maintaining hardware pipelining, concurrency and interaction with software task scheduling. It is natively developed in Python and built to interface directly with AI frameworks such as Tensorflow, PyTorch, ONNX and OpenVINO, linking various in-house NPU graph compilers to achieve optimized full model performance. Furthermore, VPU-EM also provides the capability to model power characteristics of NPU in Power-EM mode to enable joint performance/power analysis. Using VPU-EM, we conduct performance/power analysis of models from representative neural network architecture. We demonstrate that even though this framework is developed for Intel VPU, an Intel in-house NPU IP technology, the methodology can be generalized for analysis of modern NPUs.
Fast as CHITA: Neural Network Pruning with Combinatorial Optimization
The sheer size of modern neural networks makes model serving a serious computational challenge. A popular class of compression techniques overcomes this challenge by pruning or sparsifying the weights of pretrained networks. While useful, these techniques often face serious tradeoffs between computational requirements and compression quality. In this work, we propose a novel optimization-based pruning framework that considers the combined effect of pruning (and updating) multiple weights subject to a sparsity constraint. Our approach, CHITA, extends the classical Optimal Brain Surgeon framework and results in significant improvements in speed, memory, and performance over existing optimization-based approaches for network pruning. CHITA's main workhorse performs combinatorial optimization updates on a memory-friendly representation of local quadratic approximation(s) of the loss function. On a standard benchmark of pretrained models and datasets, CHITA leads to significantly better sparsity-accuracy tradeoffs than competing methods. For example, for MLPNet with only 2% of the weights retained, our approach improves the accuracy by 63% relative to the state of the art. Furthermore, when used in conjunction with fine-tuning SGD steps, our method achieves significant accuracy gains over the state-of-the-art approaches.
MambaEVT: Event Stream based Visual Object Tracking using State Space Model
Event camera-based visual tracking has drawn more and more attention in recent years due to the unique imaging principle and advantages of low energy consumption, high dynamic range, and dense temporal resolution. Current event-based tracking algorithms are gradually hitting their performance bottlenecks, due to the utilization of vision Transformer and the static template for target object localization. In this paper, we propose a novel Mamba-based visual tracking framework that adopts the state space model with linear complexity as a backbone network. The search regions and target template are fed into the vision Mamba network for simultaneous feature extraction and interaction. The output tokens of search regions will be fed into the tracking head for target localization. More importantly, we consider introducing a dynamic template update strategy into the tracking framework using the Memory Mamba network. By considering the diversity of samples in the target template library and making appropriate adjustments to the template memory module, a more effective dynamic template can be integrated. The effective combination of dynamic and static templates allows our Mamba-based tracking algorithm to achieve a good balance between accuracy and computational cost on multiple large-scale datasets, including EventVOT, VisEvent, and FE240hz. The source code will be released on https://github.com/Event-AHU/MambaEVT
Neural network layers as parametric spans
Properties such as composability and automatic differentiation made artificial neural networks a pervasive tool in applications. Tackling more challenging problems caused neural networks to progressively become more complex and thus difficult to define from a mathematical perspective. We present a general definition of linear layer arising from a categorical framework based on the notions of integration theory and parametric spans. This definition generalizes and encompasses classical layers (e.g., dense, convolutional), while guaranteeing existence and computability of the layer's derivatives for backpropagation.
BrainMAE: A Region-aware Self-supervised Learning Framework for Brain Signals
The human brain is a complex, dynamic network, which is commonly studied using functional magnetic resonance imaging (fMRI) and modeled as network of Regions of interest (ROIs) for understanding various brain functions. Recent studies utilize deep learning approaches to learn the brain network representation based on functional connectivity (FC) profile, broadly falling into two main categories. The Fixed-FC approaches, utilizing the FC profile which represents the linear temporal relation within the brain network, are limited by failing to capture informative brain temporal dynamics. On the other hand, the Dynamic-FC approaches, modeling the evolving FC profile over time, often exhibit less satisfactory performance due to challenges in handling the inherent noisy nature of fMRI data. To address these challenges, we propose Brain Masked Auto-Encoder (BrainMAE) for learning representations directly from fMRI time-series data. Our approach incorporates two essential components: a region-aware graph attention mechanism designed to capture the relationships between different brain ROIs, and a novel self-supervised masked autoencoding framework for effective model pre-training. These components enable the model to capture rich temporal dynamics of brain activity while maintaining resilience to inherent noise in fMRI data. Our experiments demonstrate that BrainMAE consistently outperforms established baseline methods by significant margins in four distinct downstream tasks. Finally, leveraging the model's inherent interpretability, our analysis of model-generated representations reveals findings that resonate with ongoing research in the field of neuroscience.
NNV: The Neural Network Verification Tool for Deep Neural Networks and Learning-Enabled Cyber-Physical Systems
This paper presents the Neural Network Verification (NNV) software tool, a set-based verification framework for deep neural networks (DNNs) and learning-enabled cyber-physical systems (CPS). The crux of NNV is a collection of reachability algorithms that make use of a variety of set representations, such as polyhedra, star sets, zonotopes, and abstract-domain representations. NNV supports both exact (sound and complete) and over-approximate (sound) reachability algorithms for verifying safety and robustness properties of feed-forward neural networks (FFNNs) with various activation functions. For learning-enabled CPS, such as closed-loop control systems incorporating neural networks, NNV provides exact and over-approximate reachability analysis schemes for linear plant models and FFNN controllers with piecewise-linear activation functions, such as ReLUs. For similar neural network control systems (NNCS) that instead have nonlinear plant models, NNV supports over-approximate analysis by combining the star set analysis used for FFNN controllers with zonotope-based analysis for nonlinear plant dynamics building on CORA. We evaluate NNV using two real-world case studies: the first is safety verification of ACAS Xu networks and the second deals with the safety verification of a deep learning-based adaptive cruise control system.
Adaptive Self-training Framework for Fine-grained Scene Graph Generation
Scene graph generation (SGG) models have suffered from inherent problems regarding the benchmark datasets such as the long-tailed predicate distribution and missing annotation problems. In this work, we aim to alleviate the long-tailed problem of SGG by utilizing unannotated triplets. To this end, we introduce a Self-Training framework for SGG (ST-SGG) that assigns pseudo-labels for unannotated triplets based on which the SGG models are trained. While there has been significant progress in self-training for image recognition, designing a self-training framework for the SGG task is more challenging due to its inherent nature such as the semantic ambiguity and the long-tailed distribution of predicate classes. Hence, we propose a novel pseudo-labeling technique for SGG, called Class-specific Adaptive Thresholding with Momentum (CATM), which is a model-agnostic framework that can be applied to any existing SGG models. Furthermore, we devise a graph structure learner (GSL) that is beneficial when adopting our proposed self-training framework to the state-of-the-art message-passing neural network (MPNN)-based SGG models. Our extensive experiments verify the effectiveness of ST-SGG on various SGG models, particularly in enhancing the performance on fine-grained predicate classes.
HappyFeat -- An interactive and efficient BCI framework for clinical applications
Brain-Computer Interface (BCI) systems allow users to perform actions by translating their brain activity into commands. Such systems usually need a training phase, consisting in training a classification algorithm to discriminate between mental states using specific features from the recorded signals. This phase of feature selection and training is crucial for BCI performance and presents specific constraints to be met in a clinical context, such as post-stroke rehabilitation. In this paper, we present HappyFeat, a software making Motor Imagery (MI) based BCI experiments easier, by gathering all necessary manipulations and analysis in a single convenient GUI and via automation of experiment or analysis parameters. The resulting workflow allows for effortlessly selecting the best features, helping to achieve good BCI performance in time-constrained environments. Alternative features based on Functional Connectivity can be used and compared or combined with Power Spectral Density, allowing a network-oriented approach. We then give details of HappyFeat's main mechanisms, and a review of its performances in typical use cases. We also show that it can be used as an efficient tool for comparing different metrics extracted from the signals, to train the classification algorithm. To this end, we show a comparison between the commonly-used Power Spectral Density and network metrics based on Functional Connectivity. HappyFeat is available as an open-source project which can be freely downloaded on GitHub.
Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments
To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.
Graph Neural Networks for Jamming Source Localization
Graph-based learning has emerged as a transformative approach for modeling complex relationships across diverse domains, yet its potential in wireless security remains largely unexplored. In this work, we introduce the first application of graph-based learning for jamming source localization, addressing the imminent threat of jamming attacks in wireless networks. Unlike geometric optimization techniques that struggle under environmental uncertainties and dense interference, we reformulate localization as an inductive graph regression task. Our approach integrates structured node representations that encode local and global signal aggregation, ensuring spatial coherence and adaptive signal fusion. To enhance robustness, we incorporate an attention-based graph neural network that adaptively refines neighborhood influence and introduces a confidence-guided estimation mechanism that dynamically balances learned predictions with domain-informed priors. We evaluate our approach under complex radio frequency environments with varying sampling densities and signal propagation conditions, conducting comprehensive ablation studies on graph construction, feature selection, and pooling strategies. Results demonstrate that our novel graph-based learning framework significantly outperforms established localization baselines, particularly in challenging scenarios with sparse and obfuscated signal information. Code is available at [https://github.com/daniaherzalla/gnn-jamming-source-localization](https://github.com/daniaherzalla/gnn-jamming-source-localization).
Describing a Knowledge Base
We aim to automatically generate natural language descriptions about an input structured knowledge base (KB). We build our generation framework based on a pointer network which can copy facts from the input KB, and add two attention mechanisms: (i) slot-aware attention to capture the association between a slot type and its corresponding slot value; and (ii) a new table position self-attention to capture the inter-dependencies among related slots. For evaluation, besides standard metrics including BLEU, METEOR, and ROUGE, we propose a KB reconstruction based metric by extracting a KB from the generation output and comparing it with the input KB. We also create a new data set which includes 106,216 pairs of structured KBs and their corresponding natural language descriptions for two distinct entity types. Experiments show that our approach significantly outperforms state-of-the-art methods. The reconstructed KB achieves 68.8% - 72.6% F-score.
Grounding Language Plans in Demonstrations Through Counterfactual Perturbations
Grounding the common-sense reasoning of Large Language Models in physical domains remains a pivotal yet unsolved problem for embodied AI. Whereas prior works have focused on leveraging LLMs directly for planning in symbolic spaces, this work uses LLMs to guide the search of task structures and constraints implicit in multi-step demonstrations. Specifically, we borrow from manipulation planning literature the concept of mode families, which group robot configurations by specific motion constraints, to serve as an abstraction layer between the high-level language representations of an LLM and the low-level physical trajectories of a robot. By replaying a few human demonstrations with synthetic perturbations, we generate coverage over the demonstrations' state space with additional successful executions as well as counterfactuals that fail the task. Our explanation-based learning framework trains an end-to-end differentiable neural network to predict successful trajectories from failures and as a by-product learns classifiers that ground low-level states and images in mode families without dense labeling. The learned grounding classifiers can further be used to translate language plans into reactive policies in the physical domain in an interpretable manner. We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks. Website: https://sites.google.com/view/grounding-plans
Looking at CTR Prediction Again: Is Attention All You Need?
Click-through rate (CTR) prediction is a critical problem in web search, recommendation systems and online advertisement displaying. Learning good feature interactions is essential to reflect user's preferences to items. Many CTR prediction models based on deep learning have been proposed, but researchers usually only pay attention to whether state-of-the-art performance is achieved, and ignore whether the entire framework is reasonable. In this work, we use the discrete choice model in economics to redefine the CTR prediction problem, and propose a general neural network framework built on self-attention mechanism. It is found that most existing CTR prediction models align with our proposed general framework. We also examine the expressive power and model complexity of our proposed framework, along with potential extensions to some existing models. And finally we demonstrate and verify our insights through some experimental results on public datasets.
Deep learning probability flows and entropy production rates in active matter
Active matter systems, from self-propelled colloids to motile bacteria, are characterized by the conversion of free energy into useful work at the microscopic scale. These systems generically involve physics beyond the reach of equilibrium statistical mechanics, and a persistent challenge has been to understand the nature of their nonequilibrium states. The entropy production rate and the magnitude of the steady-state probability current provide quantitative ways to do so by measuring the breakdown of time-reversal symmetry and the strength of nonequilibrium transport of measure. Yet, their efficient computation has remained elusive, as they depend on the system's unknown and high-dimensional probability density. Here, building upon recent advances in generative modeling, we develop a deep learning framework that estimates the score of this density. We show that the score, together with the microscopic equations of motion, gives direct access to the entropy production rate, the probability current, and their decomposition into local contributions from individual particles, spatial regions, and degrees of freedom. To represent the score, we introduce a novel, spatially-local transformer-based network architecture that learns high-order interactions between particles while respecting their underlying permutation symmetry. We demonstrate the broad utility and scalability of the method by applying it to several high-dimensional systems of interacting active particles undergoing motility-induced phase separation (MIPS). We show that a single instance of our network trained on a system of 4096 particles at one packing fraction can generalize to other regions of the phase diagram, including systems with as many as 32768 particles. We use this observation to quantify the spatial structure of the departure from equilibrium in MIPS as a function of the number of particles and the packing fraction.
Symmetry and Uncertainty-Aware Object SLAM for 6DoF Object Pose Estimation
We propose a keypoint-based object-level SLAM framework that can provide globally consistent 6DoF pose estimates for symmetric and asymmetric objects alike. To the best of our knowledge, our system is among the first to utilize the camera pose information from SLAM to provide prior knowledge for tracking keypoints on symmetric objects -- ensuring that new measurements are consistent with the current 3D scene. Moreover, our semantic keypoint network is trained to predict the Gaussian covariance for the keypoints that captures the true error of the prediction, and thus is not only useful as a weight for the residuals in the system's optimization problems, but also as a means to detect harmful statistical outliers without choosing a manual threshold. Experiments show that our method provides competitive performance to the state of the art in 6DoF object pose estimation, and at a real-time speed. Our code, pre-trained models, and keypoint labels are available https://github.com/rpng/suo_slam.
Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks
We present an agentic, autonomous graph expansion framework that iteratively structures and refines knowledge in situ. Unlike conventional knowledge graph construction methods relying on static extraction or single-pass learning, our approach couples a reasoning-native large language model with a continually updated graph representation. At each step, the system actively generates new concepts and relationships, merges them into a global graph, and formulates subsequent prompts based on its evolving structure. Through this feedback-driven loop, the model organizes information into a scale-free network characterized by hub formation, stable modularity, and bridging nodes that link disparate knowledge clusters. Over hundreds of iterations, new nodes and edges continue to appear without saturating, while centrality measures and shortest path distributions evolve to yield increasingly distributed connectivity. Our analysis reveals emergent patterns, such as the rise of highly connected 'hub' concepts and the shifting influence of 'bridge' nodes, indicating that agentic, self-reinforcing graph construction can yield open-ended, coherent knowledge structures. Applied to materials design problems, we present compositional reasoning experiments by extracting node-specific and synergy-level principles to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that transcend rote summarization and strengthen the framework's potential for open-ended scientific discovery. We discuss other applications in scientific discovery and outline future directions for enhancing scalability and interpretability.
Fast Convex Pruning of Deep Neural Networks
We develop a fast, tractable technique called Net-Trim for simplifying a trained neural network. The method is a convex post-processing module, which prunes (sparsifies) a trained network layer by layer, while preserving the internal responses. We present a comprehensive analysis of Net-Trim from both the algorithmic and sample complexity standpoints, centered on a fast, scalable convex optimization program. Our analysis includes consistency results between the initial and retrained models before and after Net-Trim application and guarantees on the number of training samples needed to discover a network that can be expressed using a certain number of nonzero terms. Specifically, if there is a set of weights that uses at most s terms that can re-create the layer outputs from the layer inputs, we can find these weights from O(slog N/s) samples, where N is the input size. These theoretical results are similar to those for sparse regression using the Lasso, and our analysis uses some of the same recently-developed tools (namely recent results on the concentration of measure and convex analysis). Finally, we propose an algorithmic framework based on the alternating direction method of multipliers (ADMM), which allows a fast and simple implementation of Net-Trim for network pruning and compression.
LLM-Augmented Graph Neural Recommenders: Integrating User Reviews
Recommender systems increasingly aim to combine signals from both user reviews and purchase (or other interaction) behaviors. While user-written comments provide explicit insights about preferences, merging these textual representations from large language models (LLMs) with graph-based embeddings of user actions remains a challenging task. In this work, we propose a framework that employs both a Graph Neural Network (GNN)-based model and an LLM to produce review-aware representations, preserving review semantics while mitigating textual noise. Our approach utilizes a hybrid objective that balances user-item interactions against text-derived features, ensuring that user's both behavioral and linguistic signals are effectively captured. We evaluate this method on multiple datasets from diverse application domains, demonstrating consistent improvements over a baseline GNN-based recommender model. Notably, our model achieves significant gains in recommendation accuracy when review data is sparse or unevenly distributed. These findings highlight the importance of integrating LLM-driven textual feedback with GNN-derived user behavioral patterns to develop robust, context-aware recommender systems.
Revisiting Image Pyramid Structure for High Resolution Salient Object Detection
Salient object detection (SOD) has been in the spotlight recently, yet has been studied less for high-resolution (HR) images. Unfortunately, HR images and their pixel-level annotations are certainly more labor-intensive and time-consuming compared to low-resolution (LR) images and annotations. Therefore, we propose an image pyramid-based SOD framework, Inverse Saliency Pyramid Reconstruction Network (InSPyReNet), for HR prediction without any of HR datasets. We design InSPyReNet to produce a strict image pyramid structure of saliency map, which enables to ensemble multiple results with pyramid-based image blending. For HR prediction, we design a pyramid blending method which synthesizes two different image pyramids from a pair of LR and HR scale from the same image to overcome effective receptive field (ERF) discrepancy. Our extensive evaluations on public LR and HR SOD benchmarks demonstrate that InSPyReNet surpasses the State-of-the-Art (SotA) methods on various SOD metrics and boundary accuracy.
Fiducial Focus Augmentation for Facial Landmark Detection
Deep learning methods have led to significant improvements in the performance on the facial landmark detection (FLD) task. However, detecting landmarks in challenging settings, such as head pose changes, exaggerated expressions, or uneven illumination, continue to remain a challenge due to high variability and insufficient samples. This inadequacy can be attributed to the model's inability to effectively acquire appropriate facial structure information from the input images. To address this, we propose a novel image augmentation technique specifically designed for the FLD task to enhance the model's understanding of facial structures. To effectively utilize the newly proposed augmentation technique, we employ a Siamese architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss to achieve collective learning of high-level feature representations from two different views of the input images. Furthermore, we employ a Transformer + CNN-based network with a custom hourglass module as the robust backbone for the Siamese framework. Extensive experiments show that our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.
GNN-Coder: Boosting Semantic Code Retrieval with Combined GNNs and Transformer
Code retrieval is a crucial component in modern software development, particularly in large-scale projects. However, existing approaches relying on sequence-based models often fail to fully exploit the structural dependencies inherent in code, leading to suboptimal retrieval performance, particularly with structurally complex code fragments. In this paper, we introduce GNN-Coder, a novel framework based on Graph Neural Network (GNN) to utilize Abstract Syntax Tree (AST). We make the first attempt to study how GNN-integrated Transformer can promote the development of semantic retrieval tasks by capturing the structural and semantic features of code. We further propose an innovative graph pooling method tailored for AST, utilizing the number of child nodes as a key feature to highlight the intrinsic topological relationships within the AST. This design effectively integrates both sequential and hierarchical representations, enhancing the model's ability to capture code structure and semantics. Additionally, we introduce the Mean Angular Margin (MAM), a novel metric for quantifying the uniformity of code embedding distributions, providing a standardized measure of feature separability. The proposed method achieves a lower MAM, indicating a more discriminative feature representation. This underscores GNN-Coder's superior ability to distinguish between code snippets, thereby enhancing retrieval accuracy. Experimental results show that GNN-Coder significantly boosts retrieval performance, with a 1\%-10\% improvement in MRR on the CSN dataset, and a notable 20\% gain in zero-shot performance on the CosQA dataset.
A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation
In this paper, we strive to answer the question "how to collaboratively learn convolutional neural network (CNN)-based and vision transformer (ViT)-based models by selecting and exchanging the reliable knowledge between them for semantic segmentation?" Accordingly, we propose an online knowledge distillation (KD) framework that can simultaneously learn compact yet effective CNN-based and ViT-based models with two key technical breakthroughs to take full advantage of CNNs and ViT while compensating their limitations. Firstly, we propose heterogeneous feature distillation (HFD) to improve students' consistency in low-layer feature space by mimicking heterogeneous features between CNNs and ViT. Secondly, to facilitate the two students to learn reliable knowledge from each other, we propose bidirectional selective distillation (BSD) that can dynamically transfer selective knowledge. This is achieved by 1) region-wise BSD determining the directions of knowledge transferred between the corresponding regions in the feature space and 2) pixel-wise BSD discerning which of the prediction knowledge to be transferred in the logit space. Extensive experiments on three benchmark datasets demonstrate that our proposed framework outperforms the state-of-the-art online distillation methods by a large margin, and shows its efficacy in learning collaboratively between ViT-based and CNN-based models.
CLARA: A Constrained Reinforcement Learning Based Resource Allocation Framework for Network Slicing
As mobile networks proliferate, we are experiencing a strong diversification of services, which requires greater flexibility from the existing network. Network slicing is proposed as a promising solution for resource utilization in 5G and future networks to address this dire need. In network slicing, dynamic resource orchestration and network slice management are crucial for maximizing resource utilization. Unfortunately, this process is too complex for traditional approaches to be effective due to a lack of accurate models and dynamic hidden structures. We formulate the problem as a Constrained Markov Decision Process (CMDP) without knowing models and hidden structures. Additionally, we propose to solve the problem using CLARA, a Constrained reinforcement LeArning based Resource Allocation algorithm. In particular, we analyze cumulative and instantaneous constraints using adaptive interior-point policy optimization and projection layer, respectively. Evaluations show that CLARA clearly outperforms baselines in resource allocation with service demand guarantees.
Self-Organizing Agent Network for LLM-based Workflow Automation
Recent multi-agent frameworks built upon large language models (LLMs) have demonstrated remarkable capabilities in complex task planning. However, in real-world enterprise environments, business workflows are typically composed through modularization and reuse of numerous subprocesses, resulting in intricate workflows characterized by lengthy and deeply nested execution paths. Such complexity poses significant challenges for LLM-driven orchestration, as extended reasoning chains and state-space explosions severely impact planning effectiveness and the proper sequencing of tool invocations. Therefore, developing an orchestration method with controllable structures capable of handling multi-layer nesting becomes a critical issue. To address this, we propose a novel structure-driven orchestration framework Self-Organizing Agent Network (SOAN). SOAN incrementally builds a formalized agent network by identifying and encapsulating structural units as independent agents, enhancing modularity and clarity in orchestration. Extensive evaluations were performed using multiple benchmarks as well as a real-world enterprise workflow dataset. Experimental results demonstrate that SOAN significantly outperforms state-of-the-art methods in terms of adaptability, fault tolerance, and execution efficiency.
Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks
Many machine learning tasks such as multiple instance learning, 3D shape recognition, and few-shot image classification are defined on sets of instances. Since solutions to such problems do not depend on the order of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. In an effort to reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from sparse Gaussian process literature. It reduces the computation time of self-attention from quadratic to linear in the number of elements in the set. We show that our model is theoretically attractive and we evaluate it on a range of tasks, demonstrating the state-of-the-art performance compared to recent methods for set-structured data.
Density Adaptive Attention-based Speech Network: Enhancing Feature Understanding for Mental Health Disorders
Speech-based depression detection poses significant challenges for automated detection due to its unique manifestation across individuals and data scarcity. Addressing these challenges, we introduce DAAMAudioCNNLSTM and DAAMAudioTransformer, two parameter efficient and explainable models for audio feature extraction and depression detection. DAAMAudioCNNLSTM features a novel CNN-LSTM framework with multi-head Density Adaptive Attention Mechanism (DAAM), focusing dynamically on informative speech segments. DAAMAudioTransformer, leveraging a transformer encoder in place of the CNN-LSTM architecture, incorporates the same DAAM module for enhanced attention and interpretability. These approaches not only enhance detection robustness and interpretability but also achieve state-of-the-art performance: DAAMAudioCNNLSTM with an F1 macro score of 0.702 and DAAMAudioTransformer with an F1 macro score of 0.72 on the DAIC-WOZ dataset, without reliance on supplementary information such as vowel positions and speaker information during training/validation as in previous approaches. Both models' significant explainability and efficiency in leveraging speech signals for depression detection represent a leap towards more reliable, clinically useful diagnostic tools, promising advancements in speech and mental health care. To foster further research in this domain, we make our code publicly available.
LAN-HDR: Luminance-based Alignment Network for High Dynamic Range Video Reconstruction
As demands for high-quality videos continue to rise, high-resolution and high-dynamic range (HDR) imaging techniques are drawing attention. To generate an HDR video from low dynamic range (LDR) images, one of the critical steps is the motion compensation between LDR frames, for which most existing works employed the optical flow algorithm. However, these methods suffer from flow estimation errors when saturation or complicated motions exist. In this paper, we propose an end-to-end HDR video composition framework, which aligns LDR frames in the feature space and then merges aligned features into an HDR frame, without relying on pixel-domain optical flow. Specifically, we propose a luminance-based alignment network for HDR (LAN-HDR) consisting of an alignment module and a hallucination module. The alignment module aligns a frame to the adjacent reference by evaluating luminance-based attention, excluding color information. The hallucination module generates sharp details, especially for washed-out areas due to saturation. The aligned and hallucinated features are then blended adaptively to complement each other. Finally, we merge the features to generate a final HDR frame. In training, we adopt a temporal loss, in addition to frame reconstruction losses, to enhance temporal consistency and thus reduce flickering. Extensive experiments demonstrate that our method performs better or comparable to state-of-the-art methods on several benchmarks.
TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP
3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions, which is crucial for embodied intelligence. Existing 3D visual grounding methods typically rely on separate encoders for different modalities (e.g., RGB images, text, and 3D point clouds), resulting in large and complex models that are inefficient to train. While some approaches use pre-trained 2D multi-modal models like CLIP for 3D tasks, they still struggle with aligning point cloud data to 2D encoders. As a result, these methods continue to depend on 3D encoders for feature extraction, further increasing model complexity and training inefficiency. In this paper, we propose a unified 2D pre-trained multi-modal network to process all three modalities (RGB images, text, and point clouds), significantly simplifying the architecture. By leveraging a 2D CLIP bi-modal model with adapter-based fine-tuning, this framework effectively adapts to the tri-modal setting, improving both adaptability and performance across modalities. Our Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module is designed to fuse geometric multi-scale features from point clouds and images. We then integrate textual features for final modality fusion and introduce a multi-modal decoder to facilitate deep cross-modal understanding. Together, our method achieves unified feature extraction and fusion across the three modalities, enabling an end-to-end 3D visual grounding model. Compared to the baseline, our method reduces the number of trainable parameters by approximately 58\%, while achieving a 6.52\% improvement in the 3D detection task and a 6.25\% improvement in the 3D visual grounding task.
GAN-EM: GAN based EM learning framework
Expectation maximization (EM) algorithm is to find maximum likelihood solution for models having latent variables. A typical example is Gaussian Mixture Model (GMM) which requires Gaussian assumption, however, natural images are highly non-Gaussian so that GMM cannot be applied to perform clustering task on pixel space. To overcome such limitation, we propose a GAN based EM learning framework that can maximize the likelihood of images and estimate the latent variables with only the constraint of L-Lipschitz continuity. We call this model GAN-EM, which is a framework for image clustering, semi-supervised classification and dimensionality reduction. In M-step, we design a novel loss function for discriminator of GAN to perform maximum likelihood estimation (MLE) on data with soft class label assignments. Specifically, a conditional generator captures data distribution for K classes, and a discriminator tells whether a sample is real or fake for each class. Since our model is unsupervised, the class label of real data is regarded as latent variable, which is estimated by an additional network (E-net) in E-step. The proposed GAN-EM achieves state-of-the-art clustering and semi-supervised classification results on MNIST, SVHN and CelebA, as well as comparable quality of generated images to other recently developed generative models.
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition
Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.
Dynamic Neural Network for Multi-Task Learning Searching across Diverse Network Topologies
In this paper, we present a new MTL framework that searches for structures optimized for multiple tasks with diverse graph topologies and shares features among tasks. We design a restricted DAG-based central network with read-in/read-out layers to build topologically diverse task-adaptive structures while limiting search space and time. We search for a single optimized network that serves as multiple task adaptive sub-networks using our three-stage training process. To make the network compact and discretized, we propose a flow-based reduction algorithm and a squeeze loss used in the training process. We evaluate our optimized network on various public MTL datasets and show ours achieves state-of-the-art performance. An extensive ablation study experimentally validates the effectiveness of the sub-module and schemes in our framework.
Path Aggregation Network for Instance Segmentation
The way that information propagates in neural networks is of great importance. In this paper, we propose Path Aggregation Network (PANet) aiming at boosting information flow in proposal-based instance segmentation framework. Specifically, we enhance the entire feature hierarchy with accurate localization signals in lower layers by bottom-up path augmentation, which shortens the information path between lower layers and topmost feature. We present adaptive feature pooling, which links feature grid and all feature levels to make useful information in each feature level propagate directly to following proposal subnetworks. A complementary branch capturing different views for each proposal is created to further improve mask prediction. These improvements are simple to implement, with subtle extra computational overhead. Our PANet reaches the 1st place in the COCO 2017 Challenge Instance Segmentation task and the 2nd place in Object Detection task without large-batch training. It is also state-of-the-art on MVD and Cityscapes. Code is available at https://github.com/ShuLiu1993/PANet
RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis
Painting textures for existing geometries is a critical yet labor-intensive process in 3D asset generation. Recent advancements in text-to-image (T2I) models have led to significant progress in texture generation. Most existing research approaches this task by first generating images in 2D spaces using image diffusion models, followed by a texture baking process to achieve UV texture. However, these methods often struggle to produce high-quality textures due to inconsistencies among the generated multi-view images, resulting in seams and ghosting artifacts. In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. Additionally, we incorporate a decoupling characteristic in the multi-attention block to enhance the model's robustness in image-to-texture task, enabling semantically-correct back-view synthesis. Furthermore, we introduce a geometry-related Classifier-Free Guidance (CFG) mechanism to further improve the alignment with both geometries and images. Quantitative and qualitative evaluations, along with comprehensive user studies, demonstrate that our method achieves state-of-the-art results in texture quality and consistency.
LightGNN: Simple Graph Neural Network for Recommendation
Graph neural networks (GNNs) have demonstrated superior performance in collaborative recommendation through their ability to conduct high-order representation smoothing, effectively capturing structural information within users' interaction patterns. However, existing GNN paradigms face significant challenges in scalability and robustness when handling large-scale, noisy, and real-world datasets. To address these challenges, we present LightGNN, a lightweight and distillation-based GNN pruning framework designed to substantially reduce model complexity while preserving essential collaboration modeling capabilities. Our LightGNN framework introduces a computationally efficient pruning module that adaptively identifies and removes redundant edges and embedding entries for model compression. The framework is guided by a resource-friendly hierarchical knowledge distillation objective, whose intermediate layer augments the observed graph to maintain performance, particularly in high-rate compression scenarios. Extensive experiments on public datasets demonstrate LightGNN's effectiveness, significantly improving both computational efficiency and recommendation accuracy. Notably, LightGNN achieves an 80% reduction in edge count and 90% reduction in embedding entries while maintaining performance comparable to more complex state-of-the-art baselines. The implementation of our LightGNN framework is available at the github repository: https://github.com/HKUDS/LightGNN.
ARM-Net: Adaptive Relation Modeling Network for Structured Data
Relational databases are the de facto standard for storing and querying structured data, and extracting insights from structured data requires advanced analytics. Deep neural networks (DNNs) have achieved super-human prediction performance in particular data types, e.g., images. However, existing DNNs may not produce meaningful results when applied to structured data. The reason is that there are correlations and dependencies across combinations of attribute values in a table, and these do not follow simple additive patterns that can be easily mimicked by a DNN. The number of possible such cross features is combinatorial, making them computationally prohibitive to model. Furthermore, the deployment of learning models in real-world applications has also highlighted the need for interpretability, especially for high-stakes applications, which remains another issue of concern to DNNs. In this paper, we present ARM-Net, an adaptive relation modeling network tailored for structured data, and a lightweight framework ARMOR based on ARM-Net for relational data analytics. The key idea is to model feature interactions with cross features selectively and dynamically, by first transforming the input features into exponential space, and then determining the interaction order and interaction weights adaptively for each cross feature. We propose a novel sparse attention mechanism to dynamically generate the interaction weights given the input tuple, so that we can explicitly model cross features of arbitrary orders with noisy features filtered selectively. Then during model inference, ARM-Net can specify the cross features being used for each prediction for higher accuracy and better interpretability. Our extensive experiments on real-world datasets demonstrate that ARM-Net consistently outperforms existing models and provides more interpretable predictions for data-driven decision making.
MERLiN: Single-Shot Material Estimation and Relighting for Photometric Stereo
Photometric stereo typically demands intricate data acquisition setups involving multiple light sources to recover surface normals accurately. In this paper, we propose MERLiN, an attention-based hourglass network that integrates single image-based inverse rendering and relighting within a single unified framework. We evaluate the performance of photometric stereo methods using these relit images and demonstrate how they can circumvent the underlying challenge of complex data acquisition. Our physically-based model is trained on a large synthetic dataset containing complex shapes with spatially varying BRDF and is designed to handle indirect illumination effects to improve material reconstruction and relighting. Through extensive qualitative and quantitative evaluation, we demonstrate that the proposed framework generalizes well to real-world images, achieving high-quality shape, material estimation, and relighting. We assess these synthetically relit images over photometric stereo benchmark methods for their physical correctness and resulting normal estimation accuracy, paving the way towards single-shot photometric stereo through physically-based relighting. This work allows us to address the single image-based inverse rendering problem holistically, applying well to both synthetic and real data and taking a step towards mitigating the challenge of data acquisition in photometric stereo.
Model Agnostic Hybrid Sharding For Heterogeneous Distributed Inference
The rapid growth of large-scale AI models, particularly large language models has brought significant challenges in data privacy, computational resources, and accessibility. Traditional centralized architectures often struggle to meet required data security and scalability needs which hinders the democratization of AI systems. Nesa introduces a model-agnostic sharding framework designed for decentralized AI inference. Our framework uses blockchain-based sequential deep neural network sharding to distribute computational tasks across a diverse network of nodes based on a personalised heuristic and routing mechanism. This enables efficient distributed training and inference for recent large-scale models even on consumer-grade hardware. We use compression techniques like dynamic blockwise quantization and mixed matrix decomposition to reduce data transfer and memory needs. We also integrate robust security measures, including hardware-based trusted execution environments to ensure data integrity and confidentiality. Evaluating our system across various natural language processing and vision tasks shows that these compression strategies do not compromise model accuracy. Our results highlight the potential to democratize access to cutting-edge AI technologies by enabling secure and efficient inference on a decentralized network.
Deep Umbra: A Generative Approach for Sunlight Access Computation in Urban Spaces
Sunlight and shadow play critical roles in how urban spaces are utilized, thrive, and grow. While access to sunlight is essential to the success of urban environments, shadows can provide shaded places to stay during the hot seasons, mitigate heat island effect, and increase pedestrian comfort levels. Properly quantifying sunlight access and shadows in large urban environments is key in tackling some of the important challenges facing cities today. In this paper, we propose Deep Umbra, a novel computational framework that enables the quantification of sunlight access and shadows at a global scale. Our framework is based on a conditional generative adversarial network that considers the physical form of cities to compute high-resolution spatial information of accumulated sunlight access for the different seasons of the year. We use data from seven different cities to train our model, and show, through an extensive set of experiments, its low overall RMSE (below 0.1) as well as its extensibility to cities that were not part of the training set. Additionally, we contribute a set of case studies and a comprehensive dataset with sunlight access information for more than 100 cities across six continents of the world. Deep Umbra is available at https://urbantk.org/shadows.
Mastering Spatial Graph Prediction of Road Networks
Accurately predicting road networks from satellite images requires a global understanding of the network topology. We propose to capture such high-level information by introducing a graph-based framework that simulates the addition of sequences of graph edges using a reinforcement learning (RL) approach. In particular, given a partially generated graph associated with a satellite image, an RL agent nominates modifications that maximize a cumulative reward. As opposed to standard supervised techniques that tend to be more restricted to commonly used surrogate losses, these rewards can be based on various complex, potentially non-continuous, metrics of interest. This yields more power and flexibility to encode problem-dependent knowledge. Empirical results on several benchmark datasets demonstrate enhanced performance and increased high-level reasoning about the graph topology when using a tree-based search. We further highlight the superiority of our approach under substantial occlusions by introducing a new synthetic benchmark dataset for this task.
BS-LDM: Effective Bone Suppression in High-Resolution Chest X-Ray Images with Conditional Latent Diffusion Models
Lung diseases represent a significant global health challenge, with Chest X-Ray (CXR) being a key diagnostic tool due to their accessibility and affordability. Nonetheless, the detection of pulmonary lesions is often hindered by overlapping bone structures in CXR images, leading to potential misdiagnoses. To address this issue, we developed an end-to-end framework called BS-LDM, designed to effectively suppress bone in high-resolution CXR images. This framework is based on conditional latent diffusion models and incorporates a multi-level hybrid loss-constrained vector-quantized generative adversarial network which is crafted for perceptual compression, ensuring the preservation of details. To further enhance the framework's performance, we introduce offset noise and a temporal adaptive thresholding strategy. These additions help minimize discrepancies in generating low-frequency information, thereby improving the clarity of the generated soft tissue images. Additionally, we have compiled a high-quality bone suppression dataset named SZCH-X-Rays. This dataset includes 818 pairs of high-resolution CXR and dual-energy subtraction soft tissue images collected from a partner hospital. Moreover, we processed 241 data pairs from the JSRT dataset into negative images, which are more commonly used in clinical practice. Our comprehensive experimental and clinical evaluations reveal that BS-LDM excels in bone suppression, underscoring its significant clinical value.
REACH: Reinforcement Learning for Efficient Allocation in Community and Heterogeneous Networks
Community GPU platforms are emerging as a cost-effective and democratized alternative to centralized GPU clusters for AI workloads, aggregating idle consumer GPUs from globally distributed and heterogeneous environments. However, their extreme hardware/software diversity, volatile availability, and variable network conditions render traditional schedulers ineffective, leading to suboptimal task completion. In this work, we present REACH (Reinforcement Learning for Efficient Allocation in Community and Heterogeneous Networks), a Transformer-based reinforcement learning framework that redefines task scheduling as a sequence scoring problem to balance performance, reliability, cost, and network efficiency. By modeling both global GPU states and task requirements, REACH learns to adaptively co-locate computation with data, prioritize critical jobs, and mitigate the impact of unreliable resources. Extensive simulation results show that REACH improves task completion rates by up to 17%, more than doubles the success rate for high-priority tasks, and reduces bandwidth penalties by over 80% compared to state-of-the-art baselines. Stress tests further demonstrate its robustness to GPU churn and network congestion, while scalability experiments confirm its effectiveness in large-scale, high-contention scenarios.
Representative Forgery Mining for Fake Face Detection
Although vanilla Convolutional Neural Network (CNN) based detectors can achieve satisfactory performance on fake face detection, we observe that the detectors tend to seek forgeries on a limited region of face, which reveals that the detectors is short of understanding of forgery. Therefore, we propose an attention-based data augmentation framework to guide detector refine and enlarge its attention. Specifically, our method tracks and occludes the Top-N sensitive facial regions, encouraging the detector to mine deeper into the regions ignored before for more representative forgery. Especially, our method is simple-to-use and can be easily integrated with various CNN models. Extensive experiments show that the detector trained with our method is capable to separately point out the representative forgery of fake faces generated by different manipulation techniques, and our method enables a vanilla CNN-based detector to achieve state-of-the-art performance without structure modification.
Federated PCA on Grassmann Manifold for Anomaly Detection in IoT Networks
In the era of Internet of Things (IoT), network-wide anomaly detection is a crucial part of monitoring IoT networks due to the inherent security vulnerabilities of most IoT devices. Principal Components Analysis (PCA) has been proposed to separate network traffics into two disjoint subspaces corresponding to normal and malicious behaviors for anomaly detection. However, the privacy concerns and limitations of devices' computing resources compromise the practical effectiveness of PCA. We propose a federated PCA-based Grassmannian optimization framework that coordinates IoT devices to aggregate a joint profile of normal network behaviors for anomaly detection. First, we introduce a privacy-preserving federated PCA framework to simultaneously capture the profile of various IoT devices' traffic. Then, we investigate the alternating direction method of multipliers gradient-based learning on the Grassmann manifold to guarantee fast training and the absence of detecting latency using limited computational resources. Empirical results on the NSL-KDD dataset demonstrate that our method outperforms baseline approaches. Finally, we show that the Grassmann manifold algorithm is highly adapted for IoT anomaly detection, which permits drastically reducing the analysis time of the system. To the best of our knowledge, this is the first federated PCA algorithm for anomaly detection meeting the requirements of IoT networks.
Actor-agnostic Multi-label Action Recognition with Multi-modal Query
Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms the prior arts of actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is made available at https://github.com/mondalanindya/MSQNet.
Fatigue-PINN: Physics-Informed Fatigue-Driven Motion Modulation and Synthesis
Fatigue modeling is essential for motion synthesis tasks to model human motions under fatigued conditions and biomechanical engineering applications, such as investigating the variations in movement patterns and posture due to fatigue, defining injury risk mitigation and prevention strategies, formulating fatigue minimization schemes and creating improved ergonomic designs. Nevertheless, employing data-driven methods for synthesizing the impact of fatigue on motion, receives little to no attention in the literature. In this work, we present Fatigue-PINN, a deep learning framework based on Physics-Informed Neural Networks, for modeling fatigued human movements, while providing joint-specific fatigue configurations for adaptation and mitigation of motion artifacts on a joint level, resulting in more realistic animations. To account for muscle fatigue, we simulate the fatigue-induced fluctuations in the maximum exerted joint torques by leveraging a PINN adaptation of the Three-Compartment Controller model to exploit physics-domain knowledge for improving accuracy. This model also introduces parametric motion alignment with respect to joint-specific fatigue, hence avoiding sharp frame transitions. Our results indicate that Fatigue-PINN accurately simulates the effects of externally perceived fatigue on open-type human movements being consistent with findings from real-world experimental fatigue studies. Since fatigue is incorporated in torque space, Fatigue-PINN provides an end-to-end encoder-decoder-like architecture, to ensure transforming joint angles to joint torques and vice-versa, thus, being compatible with motion synthesis frameworks operating on joint angles.
ZipGAN: Super-Resolution-based Generative Adversarial Network Framework for Data Compression of Direct Numerical Simulations
The advancement of high-performance computing has enabled the generation of large direct numerical simulation (DNS) datasets of turbulent flows, driving the need for efficient compression/decompression techniques that reduce storage demands while maintaining fidelity. Traditional methods, such as the discrete wavelet transform, cannot achieve compression ratios of 8 or higher for complex turbulent flows without introducing significant encoding/decoding errors. On the other hand, a super-resolution-based generative adversarial network (SR-GAN), called ZipGAN, can accurately reconstruct fine-scale features, preserving velocity gradients and structural details, even at a compression ratio of 512, thanks to the more efficient representation of the data in compact latent space. Additional benefits are ascribed to adversarial training. The high GAN training time is significantly reduced with a progressive transfer learning approach and, once trained, they can be applied independently of the Reynolds number. It is demonstrated that ZipGAN can enhance dataset temporal resolution without additional simulation overhead by generating high-quality intermediate fields from compressed snapshots. The ZipGAN discriminator can reliably evaluate the quality of decoded fields, ensuring fidelity even in the absence of original DNS fields. Hence, ZipGAN compression/decompression method presents a highly efficient and scalable alternative for large-scale DNS storage and transfer, offering substantial advantages over the DWT methods in terms of compression efficiency, reconstruction fidelity, and temporal resolution enhancement.
Kolmogorov Arnold Informed neural network: A physics-informed deep learning framework for solving PDEs based on Kolmogorov Arnold Networks
AI for partial differential equations (PDEs) has garnered significant attention, particularly with the emergence of Physics-informed neural networks (PINNs). The recent advent of Kolmogorov-Arnold Network (KAN) indicates that there is potential to revisit and enhance the previously MLP-based PINNs. Compared to MLPs, KANs offer interpretability and require fewer parameters. PDEs can be described in various forms, such as strong form, energy form, and inverse form. While mathematically equivalent, these forms are not computationally equivalent, making the exploration of different PDE formulations significant in computational physics. Thus, we propose different PDE forms based on KAN instead of MLP, termed Kolmogorov-Arnold-Informed Neural Network (KINN). We systematically compare MLP and KAN in various numerical examples of PDEs, including multi-scale, singularity, stress concentration, nonlinear hyperelasticity, heterogeneous, and complex geometry problems. Our results demonstrate that KINN significantly outperforms MLP in terms of accuracy and convergence speed for numerous PDEs in computational solid mechanics, except for the complex geometry problem. This highlights KINN's potential for more efficient and accurate PDE solutions in AI for PDEs.
K-Core based Temporal Graph Convolutional Network for Dynamic Graphs
Graph representation learning is a fundamental task in various applications that strives to learn low-dimensional embeddings for nodes that can preserve graph topology information. However, many existing methods focus on static graphs while ignoring evolving graph patterns. Inspired by the success of graph convolutional networks(GCNs) in static graph embedding, we propose a novel k-core based temporal graph convolutional network, the CTGCN, to learn node representations for dynamic graphs. In contrast to previous dynamic graph embedding methods, CTGCN can preserve both local connective proximity and global structural similarity while simultaneously capturing graph dynamics. In the proposed framework, the traditional graph convolution is generalized into two phases, feature transformation and feature aggregation, which gives the CTGCN more flexibility and enables the CTGCN to learn connective and structural information under the same framework. Experimental results on 7 real-world graphs demonstrate that the CTGCN outperforms existing state-of-the-art graph embedding methods in several tasks, including link prediction and structural role classification. The source code of this work can be obtained from https://github.com/jhljx/CTGCN.
BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network
Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task. In this paper, we investigate the effectiveness of SAN in the vocoding task. For this purpose, we propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN. Through our experiments, we demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with small modifications. Our code is available at https://github.com/sony/bigvsan.
EasyRAG: Efficient Retrieval-Augmented Generation Framework for Automated Network Operations
This paper presents EasyRAG, a simple, lightweight, and efficient retrieval-augmented generation framework for automated network operations. Our framework has three advantages. The first is accurate question answering. We designed a straightforward RAG scheme based on (1) a specific data processing workflow (2) dual-route sparse retrieval for coarse ranking (3) LLM Reranker for reranking (4) LLM answer generation and optimization. This approach achieved first place in the GLM4 track in the preliminary round and second place in the GLM4 track in the semifinals. The second is simple deployment. Our method primarily consists of BM25 retrieval and BGE-reranker reranking, requiring no fine-tuning of any models, occupying minimal VRAM, easy to deploy, and highly scalable; we provide a flexible code library with various search and generation strategies, facilitating custom process implementation. The last one is efficient inference. We designed an efficient inference acceleration scheme for the entire coarse ranking, reranking, and generation process that significantly reduces the inference latency of RAG while maintaining a good level of accuracy; each acceleration scheme can be plug-and-play into any component of the RAG process, consistently enhancing the efficiency of the RAG system. Our code and data are released at https://github.com/BUAADreamer/EasyRAG.
Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization
Large language model (LLM) agents have been shown effective on a wide range of tasks, and by ensembling multiple LLM agents, their performances could be further improved. Existing approaches employ a fixed set of agents to interact with each other in a static architecture, which limits their generalizability to various tasks and requires strong human prior in designing these agents. In this work, we propose to construct a strategic team of agents communicating in a dynamic interaction architecture based on the task query. Specifically, we build a framework named Dynamic LLM-Agent Network (DyLAN) for LLM-agent collaboration on complicated tasks like reasoning and code generation. DyLAN enables agents to interact for multiple rounds in a dynamic architecture with inference-time agent selection and an early-stopping mechanism to improve performance and efficiency. We further design an automatic agent team optimization algorithm based on an unsupervised metric termed Agent Importance Score, enabling the selection of best agents based on the contribution each agent makes. Empirically, we demonstrate that DyLAN performs well in both reasoning and code generation tasks with reasonable computational cost. DyLAN achieves 13.0% and 13.3% improvement on MATH and HumanEval, respectively, compared to a single execution on GPT-35-turbo. On specific subjects of MMLU, agent team optimization in DyLAN increases accuracy by up to 25.0%.
Learning to Schedule: A Supervised Learning Framework for Network-Aware Scheduling of Data-Intensive Workloads
Distributed cloud environments hosting data-intensive applications often experience slowdowns due to network congestion, asymmetric bandwidth, and inter-node data shuffling. These factors are typically not captured by traditional host-level metrics like CPU or memory. Scheduling without accounting for these conditions can lead to poor placement decisions, longer data transfers, and suboptimal job performance. We present a network-aware job scheduler that uses supervised learning to predict the completion time of candidate jobs. Our system introduces a prediction-and-ranking mechanism that collects real-time telemetry from all nodes, uses a trained supervised model to estimate job duration per node, and ranks them to select the best placement. We evaluate the scheduler on a geo-distributed Kubernetes cluster deployed on the FABRIC testbed by running network-intensive Spark workloads. Compared to the default Kubernetes scheduler, which makes placement decisions based on current resource availability alone, our proposed supervised scheduler achieved 34-54% higher accuracy in selecting optimal nodes for job placement. The novelty of our work lies in the demonstration of supervised learning for real-time, network-aware job scheduling on a multi-site cluster.
