# PortraitCraft Track 2 Solution Description

## 1. Challenge Track

Track 2: Portrait Composition Generation.

## 2. Method Overview

Our solution is built on Qwen-Image and focuses on improving portrait composition generation through two complementary components: a portrait-composition fine-tuned generation model and a prompt-conditioned adaptive canvas policy. The model is responsible for synthesizing visually coherent portrait images, while the canvas policy selects a suitable generation aspect ratio before sampling.

A key observation during development was that fixed square generation is not always appropriate for portrait composition. Human portraits can require different spatial layouts depending on prompt intent. Close-up and centered portraits often work well on square canvases, full-body portraits benefit from vertical layouts, and environmental portraits or scenes with roads, coastlines, leading lines, or large negative space often need horizontal or wide canvases.

## 3. Training Data And Fine-Tuning

We fine-tuned Qwen-Image using the official 4,500 PortraitCraft training samples together with an additional private portrait aesthetic-composition dataset curated by our team. The private data focuses on portrait layout, aesthetic framing, human-subject placement, environmental context, lighting balance, and composition consistency.

We compared LoRA fine-tuning and full-parameter fine-tuning under the same inference settings. We selected full-parameter fine-tuning for the final submission because it gave better results in this comparison, particularly in aesthetic quality, composition stability, and prompt-to-layout alignment.

## 4. Adaptive Canvas Policy

Instead of a fixed 1:1 canvas for all images, we use a prompt-conditioned adaptive canvas policy. The policy reads the input prompt and a learned policy state, then outputs a canvas size before image generation. The longer side is fixed at 1584 pixels, and the shorter side follows from an aspect ratio chosen from a compact set of portrait-friendly ratios.
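As a rough illustration of how a chosen aspect ratio turns into a canvas size, the sketch below fixes the longer side at 1584 pixels and derives the shorter side. The ratio set and the rounding to a multiple of 16 are assumptions made for this example; the actual candidate ratios and rounding rule are defined in the released code.

```python
# Hypothetical candidate ratios (width, height); the real set is not listed here.
CANDIDATE_RATIOS = {
    "square": (1, 1),
    "vertical": (3, 4),
    "horizontal": (4, 3),
    "wide": (16, 9),
}

LONGEST_SIDE = 1584  # longest side stated in the inference configuration
MULTIPLE = 16        # assumption: snap the shorter side to a multiple of 16


def canvas_size(ratio, longest=LONGEST_SIDE, multiple=MULTIPLE):
    """Return (width, height) with the longer side equal to `longest`."""
    w, h = ratio
    if w >= h:
        # Horizontal or square: width is the longer side.
        width = longest
        height = round(longest * h / w / multiple) * multiple
    else:
        # Vertical: height is the longer side.
        height = longest
        width = round(longest * w / h / multiple) * multiple
    return width, height


if __name__ == "__main__":
    for name, ratio in CANDIDATE_RATIOS.items():
        print(name, canvas_size(ratio))
```

Snapping to a fixed multiple means the realized ratio can differ slightly from the nominal one, which is usually acceptable for composition purposes.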

The policy was optimized on the training set through an iterative evolutionary-search procedure. The search adjusted keyword weights, decision thresholds, and candidate aspect-ratio choices. This lets the inference system preserve the intended spatial structure for different prompt types, including square portraits, full-body vertical portraits, and horizontal environmental portraits.
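The general shape of such a search can be sketched as a simple (1+λ) evolutionary loop. This is not the procedure used for the submission: the state layout, the Gaussian mutation, and the `toy_fitness` function are all hypothetical stand-ins, and for brevity the sketch mutates only weights and a threshold, not the candidate ratio set.

```python
import random


def mutate(state, rng, sigma=0.1):
    """Return a perturbed copy of a policy state (keyword weights and threshold)."""
    return {
        "weights": {k: v + rng.gauss(0.0, sigma) for k, v in state["weights"].items()},
        "threshold": state["threshold"] + rng.gauss(0.0, sigma),
    }


def evolve(initial, fitness, generations=50, offspring=8, seed=0):
    """(1+lambda) search: replace the parent only if the best child scores higher."""
    rng = random.Random(seed)
    best, best_score = initial, fitness(initial)
    for _ in range(generations):
        children = [mutate(best, rng) for _ in range(offspring)]
        scored = [(fitness(child), child) for child in children]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score


# Hypothetical fitness for demonstration only: closeness to an arbitrary target.
# A real fitness would score generated images or canvas choices on training prompts.
TOY_TARGET = {"weights": {"full-body": 1.0, "coastline": 0.8}, "threshold": 0.5}


def toy_fitness(state):
    err = sum((state["weights"][k] - v) ** 2 for k, v in TOY_TARGET["weights"].items())
    err += (state["threshold"] - TOY_TARGET["threshold"]) ** 2
    return -err


if __name__ == "__main__":
    start = {"weights": {"full-body": 0.0, "coastline": 0.0}, "threshold": 0.0}
    best, score = evolve(start, toy_fitness)
    print(best, score)
```

Because the parent is kept unless a child improves on it, the best score never decreases across generations, and a fixed seed makes the run repeatable.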

For reproducibility, we release the final learned policy state together with the inference code, so reviewers can recover the same canvas selections used in our submission. For unseen prompts, the implementation falls back to a deterministic prompt-only rule policy.
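One way to picture the lookup-then-fallback behavior is the sketch below. It assumes, purely for illustration, that the policy state maps known prompts to layout names and that the fallback scores keywords against a threshold. The keywords, weights, threshold, and function names are hypothetical and are not taken from the released policy.

```python
import re

# Hypothetical keyword weights, loosely following the prompt types described here.
KEYWORD_WEIGHTS = {
    "vertical": {"full-body": 1.0, "full body": 1.0, "standing": 0.5},
    "horizontal": {
        "road": 0.6,
        "coastline": 0.8,
        "leading lines": 0.8,
        "negative space": 0.6,
    },
}
THRESHOLD = 0.5  # hypothetical decision threshold


def _has_keyword(text, keyword):
    """Whole-word match, so that 'road' does not fire on 'broad'."""
    return re.search(r"\b" + re.escape(keyword) + r"\b", text) is not None


def rule_policy(prompt):
    """Deterministic prompt-only rule: best-scoring layout, else square."""
    text = prompt.lower()
    scores = {
        layout: sum(w for kw, w in keywords.items() if _has_keyword(text, kw))
        for layout, keywords in KEYWORD_WEIGHTS.items()
    }
    layout, score = max(scores.items(), key=lambda item: item[1])
    return layout if score >= THRESHOLD else "square"


def select_layout(prompt, policy_state):
    """Use the stored decision for a known prompt, otherwise apply the rule."""
    key = prompt.strip()
    if key in policy_state:
        return policy_state[key]
    return rule_policy(prompt)
```

Since the fallback depends only on the prompt text, repeated runs on the same prompt always give the same layout.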

## 5. Inference Configuration

The final inference pipeline uses the released PortraitCraft Track 2 checkpoint and the adaptive canvas policy. Default parameters are:

- Base model: Qwen-Image
- Checkpoint: portraitcraft-track2.safetensors
- Sampling steps: 50
- CFG scale: 4.0
- Seed: 346346
- Adaptive canvas longest side: 1584 pixels
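
For scripting, these defaults could be gathered into a single configuration object, as in the minimal sketch below. The class, field, and key names are illustrative assumptions and may not match the released inference scripts; only the values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    """Default parameters from the list above; field names are hypothetical."""
    base_model: str = "Qwen-Image"
    checkpoint: str = "portraitcraft-track2.safetensors"
    sampling_steps: int = 50
    cfg_scale: float = 4.0
    seed: int = 346346
    longest_side: int = 1584


def sampler_kwargs(cfg, width, height):
    """Collect per-image sampling arguments (key names are illustrative)."""
    if max(width, height) != cfg.longest_side:
        raise ValueError(
            f"longer side must be {cfg.longest_side}, got {width}x{height}"
        )
    return {
        "num_steps": cfg.sampling_steps,
        "cfg_scale": cfg.cfg_scale,
        "seed": cfg.seed,
        "width": width,
        "height": height,
    }
```

Freezing the dataclass keeps the defaults from being changed accidentally between images, which helps when reproducing a full submission run.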

The released GitHub repository contains inference scripts, the adaptive canvas policy implementation, the learned policy state, and submission packaging utilities. The model checkpoint is hosted on Hugging Face.

## 6. Reproducibility

The solution can be reproduced by loading the Qwen-Image base model, applying the released PortraitCraft Track 2 checkpoint, running the adaptive canvas selector on each prompt, and generating images with the default sampling configuration. The output directory can then be packaged as a flat zip file for submission.
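
The packaging step can be sketched with Python's standard `zipfile` module. This is an illustrative helper, not the repository's packaging utility, and the function name is hypothetical; "flat" is taken to mean that every file sits at the archive root with no subfolders.

```python
import zipfile
from pathlib import Path


def package_flat_zip(output_dir, zip_path):
    """Write every file in `output_dir` to the archive root, skipping subfolders."""
    output_dir = Path(output_dir)
    zip_path = Path(zip_path)
    names = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(output_dir.iterdir()):
            # Skip directories and the archive itself if it lives in output_dir.
            if not path.is_file() or path.resolve() == zip_path.resolve():
                continue
            # arcname=path.name drops the directory prefix, keeping the zip flat.
            archive.write(path, arcname=path.name)
            names.append(path.name)
    return names
```

For example, `package_flat_zip("outputs", "submission.zip")` would archive the images in a hypothetical `outputs` folder without reproducing the folder name inside the zip.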

Code repository: https://github.com/w-Jessamine/portraitcraft-track2-solution

Model repository: https://huggingface.co/Jessamine/portraitcraft-track2

## 7. Qualitative Results

The examples below illustrate how the adaptive canvas policy supports different portrait composition needs. Square layouts are used for centered portraits, vertical layouts preserve human-body framing and breathing room, and horizontal layouts support environmental context and directional visual flow.