new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

May 15

PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent LLMs

Multi-agent systems built upon large language models (LLMs) have demonstrated remarkable capabilities in tackling complex compositional tasks. In this work, we apply this paradigm to the paper-to-poster generation problem, a practical yet time-consuming process faced by researchers preparing for conferences. While recent approaches have attempted to automate this task, most neglect core design and aesthetic principles, resulting in posters that require substantial manual refinement. To address these design limitations, we propose PosterGen, a multi-agent framework that mirrors the workflow of professional poster designers. It consists of four collaborative specialized agents: (1) Parser and Curator agents extract content from the paper and organize storyboard; (2) Layout agent maps the content into a coherent spatial layout; (3) Stylist agents apply visual design elements such as color and typography; and (4) Renderer composes the final poster. Together, these agents produce posters that are both semantically grounded and visually appealing. To evaluate design quality, we introduce a vision-language model (VLM)-based rubric that measures layout balance, readability, and aesthetic coherence. Experimental results show that PosterGen consistently matches in content fidelity, and significantly outperforms existing methods in visual designs, generating posters that are presentation-ready with minimal human refinements.

  • 5 authors
·
Aug 23, 2025 3

AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

Large language models (LLMs) have demonstrated strong potential in agentic tasks, particularly in slide generation. However, slide generation poses a fundamental challenge: the generation process is text-centric, whereas its quality is governed by visual aesthetics. This modality gap leads current models to frequently produce slides with aesthetically suboptimal layouts. Existing solutions typically rely either on heavy visual reflection, which incurs high inference cost yet yields limited gains; or on fine-tuning with large-scale datasets, which still provides weak and indirect aesthetic supervision. In contrast, the explicit use of aesthetic principles as supervision remains unexplored. In this work, we present AeSlides, a reinforcement learning framework with verifiable rewards for Aesthetic layout supervision in Slide generation. We introduce a suite of meticulously designed verifiable metrics to quantify slide layout quality, capturing key layout issues in an accurate, efficient, and low-cost manner. Leveraging these verifiable metrics, we develop a GRPO-based reinforcement learning method that directly optimizes slide generation models for aesthetically coherent layouts. With only 5K training prompts on GLM-4.7-Flash, AeSlides improves aspect ratio compliance from 36% to 85%, while reducing whitespace by 44%, element collisions by 43%, and visual imbalance by 28%. Human evaluation further shows a substantial improvement in overall quality, increasing scores from 3.31 to 3.56 (+7.6%), outperforming both model-based reward optimization and reflection-based agentic approaches, and even edging out Claude-Sonnet-4.5. These results demonstrate that such a verifiable aesthetic paradigm provides an efficient and scalable approach to aligning slide generation with human aesthetic preferences. Our repository is available at https://github.com/ympan0508/aeslides.

  • 9 authors
·
Apr 20

MultiRef: Controllable Image Generation with Multiple Visual References

Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs -- either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are synthetically generated through our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% in synthetic samples and 79.0% in real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.

  • 9 authors
·
Aug 9, 2025 2

Advancing Aesthetic Image Generation via Composition Transfer

Composition is a cornerstone of visual aesthetics, influencing the appeal of an image. While its principles operate independently of specific content, in practice, composition is often coupled with semantics. As a result, existing methods often enhance composition either through implicit learning or by semantics-based layout control, rather than explicitly modeling composition itself. To address this gap, we introduce Composer, a framework rooted in aesthetic theory, designed to model composition in a semantic-agnostic manner. First, it supports composition transfer by extracting key composition-aware representations from a reference image and leveraging a tailored conditional guidance module to control composition based on pre-trained diffusion models. Second, when users specify only text themes without a composition reference, Composer supports theme-driven composition retrieval by leveraging the in-context learning capabilities of Large Vision-Language Models (LVLMs), achieving explicit composition planning. To enhance composition in a reference-free mode, we conduct text-to-composition fine-tuning on the trained control module to enable implicit composition planning. Furthermore, we curated a high-quality dataset comprising 2 million image-text pairs using state-of-the-art generative models to support model training. Experimental results demonstrate that Composer significantly enhances aesthetic quality in text-to-image tasks and facilitates personalized composition control and transfer, offering users precision and flexibility in the creative process.

  • 4 authors
·
May 5

GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs

The rapid evolution of Multi-modality Large Language Models (MLLMs) is driving significant advancements in visual understanding and generation. Nevertheless, a comprehensive assessment of their capabilities, concerning the fine-grained physical principles especially in geometric optics, remains underexplored. To address this gap, we introduce GOBench, the first benchmark to systematically evaluate MLLMs' ability across two tasks: 1) Generating Optically Authentic Imagery and 2) Understanding Underlying Optical Phenomena. We curates high-quality prompts of geometric optical scenarios and use MLLMs to construct GOBench-Gen-1k dataset.We then organize subjective experiments to assess the generated imagery based on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity, revealing MLLMs' generation flaws that violate optical principles. For the understanding task, we apply crafted evaluation instructions to test optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding. The top-performing generative model, GPT-4o-Image, cannot perfectly complete all generation tasks, and the best-performing MLLM model, Gemini-2.5Pro, attains a mere 37.35\% accuracy in optical understanding. Database and codes are publicly available at https://github.com/aiben-ch/GOBench.

  • 9 authors
·
Jun 1, 2025

Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations

The promotion of academic papers has become an important means of enhancing research visibility. However, existing automated methods struggle limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, making it difficult to achieve efficient and engaging dissemination. At the heart of those challenges is a simple principle: there is no way to improve it when you cannot evaluate it right. To address this, we introduce EvoPresent, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters. Central to EvoPresent is PresAesth, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even under limited aesthetic training data. To systematically evaluate the methods, we introduce EvoPresent Benchmark, a comprehensive benchmark comprising: Presentation Generation Quality, built on 650 top-tier AI conference papers with multimodal resources (slides, videos and scripts) to assess both content and design; and Aesthetic Awareness, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that (i) High-quality feedback is essential for agent self-improvement, while initial capability alone does not guarantee effective self-correction. (ii) Automated generation pipelines exhibit a trade-off between visual design and content construction. (iii) Multi-task RL training shows stronger generalization in aesthetic awareness tasks.

CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion

Large-scale text-to-image generative models have made impressive strides, showcasing their ability to synthesize a vast array of high-quality images. However, adapting these models for artistic image editing presents two significant challenges. Firstly, users struggle to craft textual prompts that meticulously detail visual elements of the input image. Secondly, prevalent models, when effecting modifications in specific zones, frequently disrupt the overall artistic style, complicating the attainment of cohesive and aesthetically unified artworks. To surmount these obstacles, we build the innovative unified framework CreativeSynth, which is based on a diffusion model with the ability to coordinate multimodal inputs and multitask in the field of artistic image generation. By integrating multimodal features with customized attention mechanisms, CreativeSynth facilitates the importation of real-world semantic content into the domain of art through inversion and real-time style transfer. This allows for the precise manipulation of image style and content while maintaining the integrity of the original model parameters. Rigorous qualitative and quantitative evaluations underscore that CreativeSynth excels in enhancing artistic images' fidelity and preserves their innate aesthetic essence. By bridging the gap between generative models and artistic finesse, CreativeSynth becomes a custom digital palette.

  • 8 authors
·
Jan 25, 2024 1

VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment-particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.

  • 10 authors
·
Apr 10

FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models

Recent advances in text-to-image generation have enabled the creation of high-quality images with diverse applications. However, accurately describing desired visual attributes can be challenging, especially for non-experts in art and photography. An intuitive solution involves adopting favorable attributes from the source images. Current methods attempt to distill identity and style from source images. However, "style" is a broad concept that includes texture, color, and artistic elements, but does not cover other important attributes such as lighting and dynamics. Additionally, a simplified "style" adaptation prevents combining multiple attributes from different sources into one generated image. In this work, we formulate a more effective approach to decompose the aesthetics of a picture into specific visual attributes, allowing users to apply characteristics such as lighting, texture, and dynamics from different images. To achieve this goal, we constructed the first fine-grained visual attributes dataset (FiVA) to the best of our knowledge. This FiVA dataset features a well-organized taxonomy for visual attributes and includes around 1 M high-quality generated images with visual attribute annotations. Leveraging this dataset, we propose a fine-grained visual attribute adaptation framework (FiVA-Adapter), which decouples and adapts visual attributes from one or more source images into a generated one. This approach enhances user-friendly customization, allowing users to selectively apply desired attributes to create images that meet their unique preferences and specific content requirements.

  • 9 authors
·
Dec 10, 2024 2

AttriCtrl: Fine-Grained Control of Aesthetic Attribute Intensity in Diffusion Models

Recent breakthroughs in text-to-image diffusion models have significantly enhanced both the visual fidelity and semantic controllability of generated images. However, fine-grained control over aesthetic attributes remains challenging, especially when users require continuous and intensity-specific adjustments. Existing approaches often rely on vague textual prompts, which are inherently ambiguous in expressing both the aesthetic semantics and the desired intensity, or depend on costly human preference data for alignment, limiting their scalability and practicality. To address these limitations, we propose AttriCtrl, a plug-and-play framework for precise and continuous control of aesthetic attributes. Specifically, we quantify abstract aesthetics by leveraging semantic similarity from pre-trained vision-language models, and employ a lightweight value encoder that maps scalar intensities in [0,1] to learnable embeddings within diffusion-based generation. This design enables intuitive and customizable aesthetic manipulation, with minimal training overhead and seamless integration into existing generation pipelines. Extensive experiments demonstrate that AttriCtrl achieves accurate control over individual attributes as well as flexible multi-attribute composition. Moreover, it is fully compatible with popular open-source controllable generation frameworks, showcasing strong integration capability and practical utility across diverse generation scenarios.

  • 7 authors
·
Aug 4, 2025

Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization

Multimodal large language models (MLLMs) are well suited to image aesthetic assessment, as they can capture high-level aesthetic features leveraging their cross-modal understanding capacity. However, the scarcity of multimodal aesthetic reasoning data and the inherently subjective nature of aesthetic judgment make it difficult for MLLMs to generate accurate aesthetic judgments with interpretable rationales. To this end, we propose Aes-R1, a comprehensive aesthetic reasoning framework with reinforcement learning (RL). Concretely, Aes-R1 integrates a pipeline, AesCoT, to construct and filter high-quality chain-of-thought aesthetic reasoning data used for cold-start. After teaching the model to generate structured explanations prior to scoring, we then employ the Relative-Absolute Policy Optimization (RAPO), a novel RL algorithm that jointly optimizes absolute score regression and relative ranking order, improving both per-image accuracy and cross-image preference judgments. Aes-R1 enables MLLMs to generate grounded explanations alongside faithful scores, thereby enhancing aesthetic scoring and reasoning in a unified framework. Extensive experiments demonstrate that Aes-R1 improves the backbone's average PLCC/SRCC by 47.9%/34.8%, surpassing state-of-the-art baselines of similar size. More ablation studies validate Aes-R1's robust generalization under limited supervision and in out-of-distribution scenarios.

  • 8 authors
·
Sep 26, 2025

Dynamic Normativity: Necessary and Sufficient Conditions for Value Alignment

The critical inquiry pervading the realm of Philosophy, and perhaps extending its influence across all Humanities disciplines, revolves around the intricacies of morality and normativity. Surprisingly, in recent years, this thematic thread has woven its way into an unexpected domain, one not conventionally associated with pondering "what ought to be": the field of artificial intelligence (AI) research. Central to morality and AI, we find "alignment", a problem related to the challenges of expressing human goals and values in a manner that artificial systems can follow without leading to unwanted adversarial effects. More explicitly and with our current paradigm of AI development in mind, we can think of alignment as teaching human values to non-anthropomorphic entities trained through opaque, gradient-based learning techniques. This work addresses alignment as a technical-philosophical problem that requires solid philosophical foundations and practical implementations that bring normative theory to AI system development. To accomplish this, we propose two sets of necessary and sufficient conditions that, we argue, should be considered in any alignment process. While necessary conditions serve as metaphysical and metaethical roots that pertain to the permissibility of alignment, sufficient conditions establish a blueprint for aligning AI systems under a learning-based paradigm. After laying such foundations, we present implementations of this approach by using state-of-the-art techniques and methods for aligning general-purpose language systems. We call this framework Dynamic Normativity. Its central thesis is that any alignment process under a learning paradigm that cannot fulfill its necessary and sufficient conditions will fail in producing aligned systems.

  • 1 authors
·
Jun 16, 2024

Understanding Aesthetics with Language: A Photo Critique Dataset for Aesthetic Assessment

Computational inference of aesthetics is an ill-defined task due to its subjective nature. Many datasets have been proposed to tackle the problem by providing pairs of images and aesthetic scores based on human ratings. However, humans are better at expressing their opinion, taste, and emotions by means of language rather than summarizing them in a single number. In fact, photo critiques provide much richer information as they reveal how and why users rate the aesthetics of visual stimuli. In this regard, we propose the Reddit Photo Critique Dataset (RPCD), which contains tuples of image and photo critiques. RPCD consists of 74K images and 220K comments and is collected from a Reddit community used by hobbyists and professional photographers to improve their photography skills by leveraging constructive community feedback. The proposed dataset differs from previous aesthetics datasets mainly in three aspects, namely (i) the large scale of the dataset and the extension of the comments criticizing different aspects of the image, (ii) it contains mostly UltraHD images, and (iii) it can easily be extended to new data as it is collected through an automatic pipeline. To the best of our knowledge, in this work, we propose the first attempt to estimate the aesthetic quality of visual stimuli from the critiques. To this end, we exploit the polarity of the sentiment of criticism as an indicator of aesthetic judgment. We demonstrate how sentiment polarity correlates positively with the aesthetic judgment available for two aesthetic assessment benchmarks. Finally, we experiment with several models by using the sentiment scores as a target for ranking images. Dataset and baselines are available (https://github.com/mediatechnologycenter/aestheval).

  • 3 authors
·
Jun 17, 2022

The Photographer Eye: Teaching Multimodal Large Language Models to See and Critique like Photographers

While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky. Photographer and curator, Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component--a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), which require extensive expertise--including photographic techniques, photo pre/post-processing knowledge, and more, to provide a detailed analysis and description. To fundamentally enhance the aesthetics understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, and characterized by the large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we furthur propose a novel model, PhotoEye, featuring a languageguided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives. Finally, we present a novel benchmark, PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models.

  • 8 authors
·
Sep 22, 2025 1

HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment

Image Aesthetic Assessment (IAA) is a long-standing and challenging research task. However, its subset, Human Image Aesthetic Assessment (HIAA), has been scarcely explored, even though HIAA is widely used in social media, AI workflows, and related domains. To bridge this research gap, our work pioneers a holistic implementation framework tailored for HIAA. Specifically, we introduce HumanBeauty, the first dataset purpose-built for HIAA, which comprises 108k high-quality human images with manual annotations. To achieve comprehensive and fine-grained HIAA, 50K human images are manually collected through a rigorous curation process and annotated leveraging our trailblazing 12-dimensional aesthetic standard, while the remaining 58K with overall aesthetic labels are systematically filtered from public datasets. Based on the HumanBeauty database, we propose HumanAesExpert, a powerful Vision Language Model for aesthetic evaluation of human images. We innovatively design an Expert head to incorporate human knowledge of aesthetic sub-dimensions while jointly utilizing the Language Modeling (LM) and Regression head. This approach empowers our model to achieve superior proficiency in both overall and fine-grained HIAA. Furthermore, we introduce a MetaVoter, which aggregates scores from all three heads, to effectively balance the capabilities of each head, thereby realizing improved assessment precision. Extensive experiments demonstrate that our HumanAesExpert models deliver significantly better performance in HIAA than other state-of-the-art models. Our datasets, models, and codes are publicly released to advance the HIAA community. Project webpage: https://humanaesexpert.github.io/HumanAesExpert/

  • 9 authors
·
Mar 31, 2025 1