JasonXF's picture
Upload folder using huggingface_hub
4069eda verified
Run dir : output/_smoke_test_1gpu
Log file: output/_smoke_test_1gpu/train.log
GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition | VRAM: 95.0 GiB | PyTorch: 2.10.0+cu128
Final Configuration:
Paths:
transformer_path weights/flux2_dev_fp8mixed.safetensors
vae_path weights/flux2-vae.safetensors
controlnet_path weights/FLUX.2-dev-Fun-Controlnet-Union-2602.safetensors
dataset_dir dataset
color_map_path configs/color_map.json
output_dir output/_smoke_test_1gpu
text_encoder_path weights/mistral_3_small_flux2_fp8.safetensors
precomputed_embeddings output/text_embeddings_global.pt
Model:
image_size 1024
num_classes 6
control_in_dim 3072
fusion_dim 768
num_fusion_blocks 3
num_heads 12
num_fourier_bands 32
boundary_threshold 0.1
Training:
num_epochs 1
batch_size 4
learning_rate 0.0003
weight_decay 0.01
max_grad_norm 1.0
grad_accum_steps 4
guidance_scale 3.5
num_workers 0
Text Encoder:
text_seq_len 512
text_dim 15360
Logging:
log_interval 1
save_every_n_epochs 5
val_every_n_epochs 1
WandB:
wandb_entity
wandb_project _smoke_test_1gpu
Resume:
resume_from (not set)
[MEM @ pre-flight] RAM: 25.5/188.2 GiB (13.6%) | VRAM: 0.0/95.0 GiB (0.0%)
============================================================
[1/8] Text Embeddings
============================================================
Loading cached embedding from output/text_embeddings_global.pt
Loaded global text embedding from output/text_embeddings_global.pt (shape: torch.Size([512, 15360]))
============================================================
[2/8] Loading VAE
============================================================
Done (4.3s), VRAM: 0.16 GiB
[MEM @ after VAE] RAM: 25.9/188.2 GiB (13.8%) | VRAM: 0.2/95.0 GiB (0.2%)
============================================================
[3/8] Loading Transformer
============================================================
Dequantizing FP8 transformer weights...
Dequantized 128 FP8 tensors
Converting ComfyUI β†’ diffusers keys...
Converted: 331 diffusers keys
Loading ControlNet weights...
ControlNet: 76 keys
Creating Flux2ControlTransformer2DModel (control_in_dim=3072)...
Skipped 2 control_img_in keys (dim mismatch):
control_img_in.bias [6144]
control_img_in.weight [6144, 260]
Missing: 2, Unexpected: 0
Initialized control_img_in.weight [6144, 3072] on cuda
Initialized control_img_in.bias [6144] on cuda
FP8 compression: 203 frozen Linears, 67.9 β†’ 37.9 GiB (saved 30.0 GiB)
Done (30.8s), VRAM: 37.87 GiB
Gradient checkpointing: enabled
Backbone FROZEN: all transformer params set requires_grad=False
Gradients will still propagate to HDCΒ²A via control_context autograd
[MEM @ after Transformer] RAM: 27.0/188.2 GiB (14.3%) | VRAM: 37.9/95.0 GiB (39.9%)
============================================================
[4/8] Creating HDCΒ²A Adapter
============================================================
HDCΒ²A: 52.4M params
Control: 0.0M params
Total trainable: 52.4M params
============================================================
[4.5/8] Applying LoRA to ControlNet Control Blocks
============================================================
LoRA rank=32, alpha=32.0, dropout=0
LoRA control_transformer_blocks.0.attn.to_q [6144β†’6144]
LoRA control_transformer_blocks.0.attn.to_k [6144β†’6144]
LoRA control_transformer_blocks.0.attn.to_v [6144β†’6144]
LoRA control_transformer_blocks.0.attn.add_q_proj [6144β†’6144]
LoRA control_transformer_blocks.0.attn.add_k_proj [6144β†’6144]
LoRA control_transformer_blocks.0.attn.add_v_proj [6144β†’6144]
LoRA control_transformer_blocks.0.attn.to_out.0 [6144β†’6144]
LoRA control_transformer_blocks.1.attn.to_q [6144β†’6144]
LoRA control_transformer_blocks.1.attn.to_k [6144β†’6144]
LoRA control_transformer_blocks.1.attn.to_v [6144β†’6144]
LoRA control_transformer_blocks.1.attn.add_q_proj [6144β†’6144]
LoRA control_transformer_blocks.1.attn.add_k_proj [6144β†’6144]
LoRA control_transformer_blocks.1.attn.add_v_proj [6144β†’6144]
LoRA control_transformer_blocks.1.attn.to_out.0 [6144β†’6144]
LoRA control_transformer_blocks.2.attn.to_q [6144β†’6144]
LoRA control_transformer_blocks.2.attn.to_k [6144β†’6144]
LoRA control_transformer_blocks.2.attn.to_v [6144β†’6144]
LoRA control_transformer_blocks.2.attn.add_q_proj [6144β†’6144]
LoRA control_transformer_blocks.2.attn.add_k_proj [6144β†’6144]
LoRA control_transformer_blocks.2.attn.add_v_proj [6144β†’6144]
LoRA control_transformer_blocks.2.attn.to_out.0 [6144β†’6144]
LoRA control_transformer_blocks.3.attn.to_q [6144β†’6144]
LoRA control_transformer_blocks.3.attn.to_k [6144β†’6144]
LoRA control_transformer_blocks.3.attn.to_v [6144β†’6144]
LoRA control_transformer_blocks.3.attn.to_out.0 [6144β†’6144]
LoRA modules injected: 25
LoRA trainable params: 9.83M
Parameter Statistics:
HDCΒ²A Adapter: total=52.4M trainable=52.4M
ControlNet (frozen): total=4143.4M LoRA trainable=9.83M
Flux2 backbone: total=0.0M trainable=0.0M βœ“
──────────────────────────────────────────────────
Total trainable: HDCΒ²A 52.4M + LoRA 9.83M = 62.19M
============================================================
[5/8] Building Optimizer
============================================================
AdamW: adapter_lr=3.00e-04, backbone_lr=0.00e+00
param_group 'adapter': 112 tensors, lr=3.00e-04
Scheduler: 400 warmup steps β†’ cosine over ~25 steps
[6/8] Resume: skipped (no checkpoint specified)
============================================================
[7/8] Forward Sanity Check
============================================================
[test 1/4] Forward pass (eval mode)...
Output shape: torch.Size([1, 4096, 128])
Output stats: mean=0.0427, std=0.5156
VRAM peak (forward): 68.44 GiB
[test 2/4] Loss computation (train mode)...
Loss value: 1.437658
[test 3/4] Backward pass...
Backward completed. VRAM peak (backward): 49.17 GiB
[test 4/4] Gradient flow check...
HDCΒ²A: 112/112 params have non-zero grad
Control: 25/50 params have non-zero grad
Top grad norms (HDCΒ²A):
semantic_encoder.conv_stem.6.weight: 0.005524
depth_encoder.conv_stem.6.weight: 0.004883
W_s.weight: 0.004456
W_d.weight: 0.004181
fusion_blocks.0.ffn_sem.2.weight: 0.003784
Test result: PASSED
[MEM @ after test] RAM: 27.5/188.2 GiB (14.6%) | VRAM: 38.0/95.0 GiB (40.0%)
*** --test passed: all models loaded, forward test OK. Exiting. ***