The training captions are like `Yellow blob emoji with smiling face with smiling eyes. The background is gray.`
- Blob emoji face drives a red sport car along a curved road on a cliff overlooking the sea. The sea is dotted with whitecaps. The sky is blue, and cumulonimbus clouds float on the horizon. --w 1664 --h 928 --s 50 --d 12345678
### Dataset Creation Procedure

The dataset was created following these steps:

- The SVG files from [C1710/blobmoji](https://github.com/C1710/blobmoji) (licensed under ASL 2.0) were used. Specifically, 118 different yellow blob emojis were selected from the SVG files.
- `cairosvg` was used to convert these SVGs into 512x512 pixel transparent PNGs.
- A script was then used to pad the images to 640x640 pixels and generate four versions of each image with different background colors: white, light gray, gray, and black. This resulted in a total of 472 images.
- The captions were generated from the official Unicode names of the emojis. The prefix `Yellow blob emoji with ` and the suffix `. The background is <color>.` were added to each name.
- For example: `Yellow blob emoji with smiling face with smiling eyes. The background is gray.`
- Note: For some emojis (e.g., devil, zombie), the word `Yellow` was omitted from the prefix.
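The conversion and captioning steps above can be sketched in Python roughly as follows. This is a minimal sketch: the background RGB values, function names, and file layout are illustrative assumptions, not the author's actual (unpublished) script.

```python
from io import BytesIO
from pathlib import Path

# Assumed background palette; the README names the colors but not the exact values.
BACKGROUNDS = {
    "white": (255, 255, 255),
    "light gray": (192, 192, 192),
    "gray": (128, 128, 128),
    "black": (0, 0, 0),
}

def make_caption(unicode_name: str, color_name: str) -> str:
    """Build a training caption from an emoji's official Unicode name."""
    # Some emojis (devil, zombie, ...) drop "Yellow"; that special case is omitted here.
    return f"Yellow blob emoji with {unicode_name}. The background is {color_name}."

def render_emoji(svg_path: Path, out_dir: Path, unicode_name: str) -> None:
    """Render one blobmoji SVG into four captioned 640x640 training images."""
    import cairosvg       # imported lazily: pip install cairosvg
    from PIL import Image  # pip install pillow

    # SVG -> 512x512 transparent PNG.
    png_bytes = cairosvg.svg2png(url=str(svg_path), output_width=512, output_height=512)
    sprite = Image.open(BytesIO(png_bytes)).convert("RGBA")

    for color_name, rgb in BACKGROUNDS.items():
        # Pad to 640x640 by centering the sprite on a solid background.
        canvas = Image.new("RGBA", (640, 640), rgb + (255,))
        canvas.paste(sprite, ((640 - 512) // 2, (640 - 512) // 2), sprite)
        stem = f"{svg_path.stem}_{color_name.replace(' ', '_')}"
        canvas.convert("RGB").save(out_dir / f"{stem}.png")
        # Caption file shares the image's stem, matching caption_extension = ".txt".
        (out_dir / f"{stem}.txt").write_text(make_caption(unicode_name, color_name))
```

Running this over 118 selected SVGs yields 118 × 4 = 472 image/caption pairs, matching the count above.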
### Dataset Definition

```
# general configurations
[general]
resolution = [640, 640]
batch_size = 16
enable_bucket = true
bucket_no_upscale = false
caption_extension = ".txt"

[[datasets]]
image_directory = "path/to/images_and_captions_dir"
cache_directory = "path/to/cache_dir"
```

### Training Command

```
accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 --rdzv_backend=c10d \
    src/musubi_tuner/qwen_image_train_network.py \
    --dit path/to/dit.safetensors --vae path/to/vae.safetensors \
    --text_encoder path/to/vlm.safetensors \
    --dataset_config path/to/blob_emoji_v1_640_bs16.toml \
    --output_dir path/to/output_dir \
    --learning_rate 2e-4 \
    --timestep_sampling shift --weighting_scheme none --discrete_flow_shift 2.0 \
    --max_train_epochs 16 --mixed_precision bf16 --seed 42 --gradient_checkpointing \
    --network_module=networks.lora_qwen_image \
    --network_dim=4 --network_args loraplus_lr_ratio=4 \
    --save_every_n_epochs=1 --max_data_loader_n_workers 2 \
    --persistent_data_loader_workers \
    --logging_dir ./logs --log_prefix qwenimage-blob4-2e4- \
    --output_name qwenimage-blob4-2e4 \
    --optimizer_type adamw8bit --flash_attn --split_attn \
    --log_with tensorboard \
    --sample_every_n_epochs 1 --sample_prompts path/to/prompts_qwen_blob_emoji.txt \
    --fp8_base --fp8_scaled
```
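The file passed to `--sample_prompts` is a plain text file with one prompt per line; inline flags set the sample's width (`--w`), height (`--h`), inference steps (`--s`), and seed (`--d`). Presumably it contains entries like the sample prompt quoted at the top of this section:

```
Blob emoji face drives a red sport car along a curved road on a cliff overlooking the sea. The sea is dotted with whitecaps. The sky is blue, and cumulonimbus clouds float on the horizon. --w 1664 --h 928 --s 50 --d 12345678
```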

### Training Details

- Training was conducted on a Windows machine with a multi-GPU setup (2x RTX A6000).
- If you are not using a Windows environment or not performing multi-GPU training, remove the `--rdzv_backend=c10d` argument.
- Note that due to the 2-GPU setup, the effective batch size is 32. To reproduce the same results with limited VRAM, increase the gradient accumulation steps; however, a lower batch size should also train successfully if the learning rate is adjusted accordingly.
- The model was trained for 6 epochs (90 steps), which took approximately 1 hour with the GPU power limit set to 60%.
- Finally, the weights from all 6 epochs were merged using the LoRA Post-Hoc EMA script from Musubi Tuner with `sigma_rel=0.2`.
## fp-1f-kisekae-1024-v4-2-PfPHEMA.safetensors
Post-Hoc EMA (with Power function sigma_rel=0.2) version of the following LoRA. The usage is the same.