new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jun 4

Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting

Low-Rank Adaptation (LoRA) successfully enables personalization in text-to-image generation by adapting pre-trained diffusion models to specific visual concepts and styles. However, extending such models to multi-concept customization remains challenging. Naively combining multiple LoRA weights or their outputs often leads to interference among concepts, resulting in degraded visual quality and reduced fidelity to the reference images of individual concepts. This paper proposes a simple yet effective approach for multi-concept customization by optimally combining the outputs of multiple LoRA modules. We leverage the relative importance of each concept during generation, as inferred from its corresponding prompt tokens and introduce two methods, W-Switch and W-Composite, that employ a prompt-aware importance weighting strategy in which each LoRA is weighted according to the semantic influence of its trigger words in the target prompt. In addition, we extend existing quantitative evaluation metrics by proposing a new image-based similarity evaluation framework that assesses image fidelity and identity preservation through comparisons between real-world reference images and automatically segmented concept regions from generated images. We evaluate our approach on the ComposLoRA testbed and demonstrate consistent improvements over existing state-of-the-art methods in terms of visual quality, identity preservation and compositionality. Qualitative evaluations, including a Large Language Model (LLM) based assessment and a user study, further validate the effectiveness of the proposed methods and align with the newly introduced quantitative image-based metrics. Our code is available at https://github.com/GeorgeTsoumplekas/Prompt-Aware-Multi-LoRA-Composition.

  • 3 authors
·
Jun 1 1

Investigation of W-SiC compositionally graded films as a divertor material

W-SiC composite material is a promising plasma-facing material candidate alternative to pure W due to the low neutron activation, low impurity radiation, and low tritium diffusivity of SiC while leveraging the high erosion resistance of the W armor. Additionally, W and SiC have high thermomechanical compatibility given their similar thermal expansion rates. The present study addresses the synthesis and performance of compositionally graded W-SiC films fabricated by pulsed-DC magnetron sputtering. Compositional gradients were characterized using transmission electron microscopy (TEM) and energy-dispersive X-ray spectroscopy (EDS), and crystallographic information was obtained using electron diffraction and X-ray diffraction (XRD). Samples were exposed to L-mode deuterium plasma discharges in the DIII-D tokamak using the Divertor Material Evaluation System (DiMES). Post-mortem characterizations were performed using scanning electron microscopy (SEM) and XRD. Electron diffraction and XRD showed that the compositionally graded W-SiC films were composed of polycrystalline W and amorphous SiC with amorphous W+SiC interlayers. No macroscopic delamination or microstructural changes were observed under mild exposure conditions. This study serves as a preliminary examination of W-SiC compositionally graded composites as a potential candidate divertor material in future tokamak devices.

  • 16 authors
·
Aug 30, 2023

Harnessing Selective State Space Models to Enhance Semianalytical Design of Fabrication-Ready Multilayered Huygens' Metasurfaces: Part II - Generative Inverse Design (MetaMamba)

We present a generative framework for inverse design of five-layer transmissive Huygens' metasurfaces (HMSs), addressing a longstanding challenge in achieving full-phase, high-efficiency unit cell designs with minimal full-wave simulations. The key to achieving this is our reliance on the field-based semianalytical (SA) scheme developed in Part I of this paper, which allows rapid and highly effective synthesis of such multilayer composites, however with limited accuracy. To overcome the prohibitive data demands of traditional pipelines, we employ Mamba, a selective state space model well suited for long-range sequence modeling as the backbone of our learning framework. A bidirectional Mamba (Bi-Mamba) forward surrogate is first trained on SA-generated data and subsequently fine-tuned with full-wave CST samples. An ablation over a 1080-sample CST pool shows that as few as 270 full-wave calibration samples suffice to reach near-CST-level agreement at a fraction of the simulation cost. An autoregressive Mamba inverse generator is subsequently trained on surrogate-augmented data, treating unit-cell synthesis as a sequential generation task. The resulting one-to-many generative model produces diverse unit cell geometries conditioned on target scattering responses. It achieves CST-validated designs with field transmission magnitude 0.9 across the full 0-2π phase range at 20 GHz. Moreover, a CST-calibrated surrogate trained to accurately predict frequency responses (18-22 GHz) enables functional post-selection of inverse generated designs. Together, the hybrid SA-generative methodology in this two-part compilation establishes a scalable and data-efficient solution for multilayer HMS synthesis, with natural extensions toward broadband, oblique-incidence, and higher-dimensional electromagnetic inverse-design problems.

  • 5 authors
·
Mar 4

Generative Compositional Augmentations for Scene Graph Prediction

Inferring objects and their relationships from an image in the form of a scene graph is useful in many applications at the intersection of vision and language. We consider a challenging problem of compositional generalization that emerges in this task due to a long tail data distribution. Current scene graph generation models are trained on a tiny fraction of the distribution corresponding to the most frequent compositions, e.g. <cup, on, table>. However, test images might contain zero- and few-shot compositions of objects and relationships, e.g. <cup, on, surfboard>. Despite each of the object categories and the predicate (e.g. 'on') being frequent in the training data, the models often fail to properly understand such unseen or rare compositions. To improve generalization, it is natural to attempt increasing the diversity of the training distribution. However, in the graph domain this is non-trivial. To that end, we propose a method to synthesize rare yet plausible scene graphs by perturbing real ones. We then propose and empirically study a model based on conditional generative adversarial networks (GANs) that allows us to generate visual features of perturbed scene graphs and learn from them in a joint fashion. When evaluated on the Visual Genome dataset, our approach yields marginal, but consistent improvements in zero- and few-shot metrics. We analyze the limitations of our approach indicating promising directions for future research.

  • 6 authors
·
Jul 11, 2020

HOComp: Interaction-Aware Human-Object Composition

While existing image-guided composition methods may help insert a foreground object onto a user-specified region of a background image, achieving natural blending inside the region with the rest of the image unchanged, we observe that these existing methods often struggle in synthesizing seamless interaction-aware compositions when the task involves human-object interactions. In this paper, we first propose HOComp, a novel approach for compositing a foreground object onto a human-centric background image, while ensuring harmonious interactions between the foreground object and the background person and their consistent appearances. Our approach includes two key designs: (1) MLLMs-driven Region-based Pose Guidance (MRPG), which utilizes MLLMs to identify the interaction region as well as the interaction type (e.g., holding and lefting) to provide coarse-to-fine constraints to the generated pose for the interaction while incorporating human pose landmarks to track action variations and enforcing fine-grained pose constraints; and (2) Detail-Consistent Appearance Preservation (DCAP), which unifies a shape-aware attention modulation mechanism, a multi-view appearance loss, and a background consistency loss to ensure consistent shapes/textures of the foreground and faithful reproduction of the background human. We then propose the first dataset, named Interaction-aware Human-Object Composition (IHOC), for the task. Experimental results on our dataset show that HOComp effectively generates harmonious human-object interactions with consistent appearances, and outperforms relevant methods qualitatively and quantitatively.

  • 4 authors
·
Jul 22, 2025 3

Graph Density-Aware Losses for Novel Compositions in Scene Graph Generation

Scene graph generation (SGG) aims to predict graph-structured descriptions of input images, in the form of objects and relationships between them. This task is becoming increasingly useful for progress at the interface of vision and language. Here, it is important - yet challenging - to perform well on novel (zero-shot) or rare (few-shot) compositions of objects and relationships. In this paper, we identify two key issues that limit such generalization. Firstly, we show that the standard loss used in this task is unintentionally a function of scene graph density. This leads to the neglect of individual edges in large sparse graphs during training, even though these contain diverse few-shot examples that are important for generalization. Secondly, the frequency of relationships can create a strong bias in this task, such that a blind model predicting the most frequent relationship achieves good performance. Consequently, some state-of-the-art models exploit this bias to improve results. We show that such models can suffer the most in their ability to generalize to rare compositions, evaluating two different models on the Visual Genome dataset and its more recent, improved version, GQA. To address these issues, we introduce a density-normalized edge loss, which provides more than a two-fold improvement in certain generalization metrics. Compared to other works in this direction, our enhancements require only a few lines of code and no added computational cost. We also highlight the difficulty of accurately evaluating models using existing metrics, especially on zero/few shots, and introduce a novel weighted metric.

  • 6 authors
·
May 17, 2020

Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation

Autoregressive (AR) models have demonstrated impressive capabilities in generating high-fidelity music. However, the conventional next-token prediction paradigm in AR models does not align with the human creative process in music composition, potentially compromising the musicality of generated samples. To overcome this limitation, we introduce MusiCoT, a novel chain-of-thought (CoT) prompting technique tailored for music generation. MusiCoT empowers the AR model to first outline an overall music structure before generating audio tokens, thereby enhancing the coherence and creativity of the resulting compositions. By leveraging the contrastive language-audio pretraining (CLAP) model, we establish a chain of "musical thoughts", making MusiCoT scalable and independent of human-labeled data, in contrast to conventional CoT methods. Moreover, MusiCoT allows for in-depth analysis of music structure, such as instrumental arrangements, and supports music referencing -- accepting variable-length audio inputs as optional style references. This innovative approach effectively addresses copying issues, positioning MusiCoT as a vital practical method for music prompting. Our experimental results indicate that MusiCoT consistently achieves superior performance across both objective and subjective metrics, producing music quality that rivals state-of-the-art generation models. Our samples are available at https://MusiCoT.github.io/.

  • 17 authors
·
Mar 25, 2025

Sylber: Syllabic Embedding Representation of Speech from Raw Audio

Syllables are compositional units of spoken language that play a crucial role in human speech perception and production. However, current neural speech representations lack structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised model that regresses features on syllabic segments distilled from a teacher model which is an exponential moving average of the model in training. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited for lexical and syntactic understanding. We also train token-to-speech generative models with our syllabic units and show that fully intelligible speech can be reconstructed from these tokens. Lastly, we observe that categorical perception, a linguistic phenomenon of speech perception, emerges naturally in our model, making the embedding space more categorical and sparse than previous self-supervised learning approaches. Together, we present a novel self-supervised approach for representing speech as syllables, with significant potential for efficient speech tokenization and spoken language modeling.

  • 7 authors
·
Oct 9, 2024

Efficient 3D Articulated Human Generation with Layered Surface Volumes

Access to high-quality and diverse 3D articulated digital human assets is crucial in various applications, ranging from virtual reality to social platforms. Generative approaches, such as 3D generative adversarial networks (GANs), are rapidly replacing laborious manual content creation tools. However, existing 3D GAN frameworks typically rely on scene representations that leverage either template meshes, which are fast but offer limited quality, or volumes, which offer high capacity but are slow to render, thereby limiting the 3D fidelity in GAN settings. In this work, we introduce layered surface volumes (LSVs) as a new 3D object representation for articulated digital humans. LSVs represent a human body using multiple textured mesh layers around a conventional template. These layers are rendered using alpha compositing with fast differentiable rasterization, and they can be interpreted as a volumetric representation that allocates its capacity to a manifold of finite thickness around the template. Unlike conventional single-layer templates that struggle with representing fine off-surface details like hair or accessories, our surface volumes naturally capture such details. LSVs can be articulated, and they exhibit exceptional efficiency in GAN settings, where a 2D generator learns to synthesize the RGBA textures for the individual layers. Trained on unstructured, single-view 2D image datasets, our LSV-GAN generates high-quality and view-consistent 3D articulated digital humans without the need for view-inconsistent 2D upsampling networks.

  • 6 authors
·
Jul 11, 2023

LegSegNet: A Public Deep Learning System for Lower Extremity CT Tissue Segmentation and Quantification

Lower extremity computed tomography (CT) contains clinically relevant information for body composition analysis, sarcopenia assessment, and musculoskeletal disease monitoring, but extracting these measurements at scale requires accurate tissue segmentation and an automated quantification workflow. Existing public segmentation tools are not designed for comprehensive lower extremity CT analysis, particularly for clinically important inter/intramuscular adipose tissue, and most public methods only provide mask prediction rather than an end-to-end quantification system. To address this problem, we present LegSegNet, a deep learning system for lower extremity CT tissue segmentation and body composition quantification. Given an input CT scan, LegSegNet segments bone, skeletal muscle, subcutaneous adipose tissue, and inter/intramuscular adipose tissue. It then computes quantitative tissue measurements for downstream analysis. We developed the segmentation model using 1,302 manually annotated CT slices and evaluated it on 900 held-out test slices, with all annotations reviewed by radiologists. We benchmark LegSegNet against a broad set of 2D segmentation methods, including CNN-based models, transformer-based models, and finetuned foundation models, and further evaluate its generalization on an external public CT dataset. LegSegNet achieves the best overall segmentation performance, with an average Dice score of 89.31 on the held-out test set. To our knowledge, LegSegNet is the first publicly available end-to-end system for lower extremity CT tissue segmentation and quantification, providing a practical evaluation tool for future computer vision research in medical image analysis. The code and model weights are available at: https://github.com/mazurowski-lab/LegSegNet

  • 7 authors
·
May 28

Hydrodynamic Predictions for the Next Outburst of T Coronae Borealis: It will be the Brightest Classical or Recurrent Nova Ever Observed in X-rays

T Coronae Borealis (TCrB) is a recurrent nova (RN) with recorded outbursts in 1866, and 1946 and possible outbursts in 1217 and 1787. It is predicted to explode again in 2025 or 2026 based on multiple observational studies. The system consists of a massive (M_{wd} gtrsim 1.35 M_odot) white dwarf (WD) and a red giant (M3-M4 III). We have performed 1-D hydrodynamic simulations with NOVA to predict the behavior of the next outburst. These simulations consist of a range of mass accretion rates onto sim1.35 M_odot WDs, designed to bound the conditions necessary to achieve ignition of an explosion after an approx80 year inter-outburst period. We have used both carbon-oxygen and oxygen-neon initial compositions, in order to include the possible ejecta abundances to be measured in the observations of the next outburst. As the WD in the TCrB system is observed to be massive, theoretical predictions reported here imply that the WD is growing in mass as a consequence of the TNR. Therefore, the secular evolution of the WD may allow it to approach the Chandrasekhar limit and either explode as a Type Ia supernova or undergo accretion induced collapse, depending on its underlying composition. We have followed the evolution of just the WD, after removing the ejected matter from the surface layers. Our intent is to illuminate the mystery of the unique, second, maximum in the two well observed outbursts and we have found conditions that bracket the predictions.

  • 14 authors
·
Feb 15, 2025

Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models

The ability to discover new materials with desirable properties is critical for numerous applications from helping mitigate climate change to advances in next generation computing hardware. AI has the potential to accelerate materials discovery and design by more effectively exploring the chemical space compared to other computational methods or by trial-and-error. While substantial progress has been made on AI for materials data, benchmarks, and models, a barrier that has emerged is the lack of publicly available training data and open pre-trained models. To address this, we present a Meta FAIR release of the Open Materials 2024 (OMat24) large-scale open dataset and an accompanying set of pre-trained models. OMat24 contains over 110 million density functional theory (DFT) calculations focused on structural and compositional diversity. Our EquiformerV2 models achieve state-of-the-art performance on the Matbench Discovery leaderboard and are capable of predicting ground-state stability and formation energies to an F1 score above 0.9 and an accuracy of 20 meV/atom, respectively. We explore the impact of model size, auxiliary denoising objectives, and fine-tuning on performance across a range of datasets including OMat24, MPtraj, and Alexandria. The open release of the OMat24 dataset and models enables the research community to build upon our efforts and drive further advancements in AI-assisted materials science.

  • 9 authors
·
Oct 16, 2024 1

Machine Learning Predictions of High-Curie-Temperature Materials

Technologies that function at room temperature often require magnets with a high Curie temperature, T_C, and can be improved with better materials. Discovering magnetic materials with a substantial T_C is challenging because of the large number of candidates and the cost of fabricating and testing them. Using the two largest known data sets of experimental Curie temperatures, we develop machine-learning models to make rapid T_C predictions solely based on the chemical composition of a material. We train a random forest model and a k-NN one and predict on an initial dataset of over 2,500 materials and then validate the model on a new dataset containing over 3,000 entries. The accuracy is compared for multiple compounds' representations ("descriptors") and regression approaches. A random forest model provides the most accurate predictions and is not improved by dimensionality reduction or by using more complex descriptors based on atomic properties. A random forest model trained on a combination of both datasets shows that cobalt-rich and iron-rich materials have the highest Curie temperatures for all binary and ternary compounds. An analysis of the model reveals systematic error that causes the model to over-predict low-T_C materials and under-predict high-T_C materials. For exhaustive searches to find new high-T_C materials, analysis of the learning rate suggests either that much more data is needed or that more efficient descriptors are necessary.

  • 4 authors
·
Jul 13, 2023

Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy

Recent advances in deep generative modeling have unlocked unprecedented opportunities for video synthesis. In real-world applications, however, users often seek tools to faithfully realize their creative editing intentions with precise and consistent control. Despite the progress achieved by existing methods, ensuring fine-grained alignment with user intentions remains an open and challenging problem. In this work, we present Shape-for-Motion, a novel framework that incorporates a 3D proxy for precise and consistent video editing. Shape-for-Motion achieves this by converting the target object in the input video to a time-consistent mesh, i.e., a 3D proxy, allowing edits to be performed directly on the proxy and then inferred back to the video frames. To simplify the editing process, we design a novel Dual-Propagation Strategy that allows users to perform edits on the 3D mesh of a single frame, and the edits are then automatically propagated to the 3D meshes of the other frames. The 3D meshes for different frames are further projected onto the 2D space to produce the edited geometry and texture renderings, which serve as inputs to a decoupled video diffusion model for generating edited results. Our framework supports various precise and physically-consistent manipulations across the video frames, including pose editing, rotation, scaling, translation, texture modification, and object composition. Our approach marks a key step toward high-quality, controllable video editing workflows. Extensive experiments demonstrate the superiority and effectiveness of our approach. Project page: https://shapeformotion.github.io/

  • 5 authors
·
Jun 27, 2025 1

Governance of the AI, by the AI, and for the AI

Over the past half century, there have been several false dawns during which the "arrival" of world-changing artificial intelligence (AI) has been heralded. Tempting fate, the authors believe the age of AI has, indeed, finally arrived. Powerful image generators, such as DALL-E2 and Midjourney have suddenly allowed anyone with access the ability easily to create rich and complex art. In a similar vein, text generators, such as GPT3.5 (including ChatGPT) and BLOOM, allow users to compose detailed written descriptions of many topics of interest. And, it is even possible now for a person without extensive expertise in writing software to use AI to generate code capable of myriad applications. While AI will continue to evolve and improve, probably at a rapid rate, the current state of AI is already ushering in profound changes to many different sectors of society. Every new technology challenges the ability of humanity to govern it wisely. However, governance is usually viewed as both possible and necessary due to the disruption new technology often poses to social structures, industries, the environment, and other important human concerns. In this article, we offer an analysis of a range of interactions between AI and governance, with the hope that wise decisions may be made that maximize benefits and minimize costs. The article addresses two main aspects of this relationship: the governance of AI by humanity, and the governance of humanity by AI. The approach we have taken is itself informed by AI, as this article was written collaboratively by the authors and ChatGPT.

  • 2 authors
·
May 3, 2023

Relational inductive biases, deep learning, and graph networks

Artificial intelligence (AI) has undergone a renaissance recently, making major progress in key domains such as vision, language, control, and decision-making. This has been due, in part, to cheap data and cheap compute resources, which have fit the natural strengths of deep learning. However, many defining characteristics of human intelligence, which developed under much different pressures, remain out of reach for current approaches. In particular, generalizing beyond one's experiences--a hallmark of human intelligence from infancy--remains a formidable challenge for modern AI. The following is part position paper, part review, and part unification. We argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and that structured representations and computations are key to realizing this objective. Just as biology uses nature and nurture cooperatively, we reject the false choice between "hand-engineering" and "end-to-end" learning, and instead advocate for an approach which benefits from their complementary strengths. We explore how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and rules for composing them. We present a new building block for the AI toolkit with a strong relational inductive bias--the graph network--which generalizes and extends various approaches for neural networks that operate on graphs, and provides a straightforward interface for manipulating structured knowledge and producing structured behaviors. We discuss how graph networks can support relational reasoning and combinatorial generalization, laying the foundation for more sophisticated, interpretable, and flexible patterns of reasoning. As a companion to this paper, we have released an open-source software library for building graph networks, with demonstrations of how to use them in practice.

  • 27 authors
·
Jun 4, 2018