| Run dir : output/_smoke_test_1gpu |
| Log file: output/_smoke_test_1gpu/train.log |
| GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition | VRAM: 95.0 GiB | PyTorch: 2.10.0+cu128 |
|
|
| Final Configuration: |
| Paths: |
| transformer_path weights/flux2_dev_fp8mixed.safetensors |
| vae_path weights/flux2-vae.safetensors |
| controlnet_path weights/FLUX.2-dev-Fun-Controlnet-Union-2602.safetensors |
| dataset_dir dataset |
| color_map_path configs/color_map.json |
| output_dir output/_smoke_test_1gpu |
| text_encoder_path weights/mistral_3_small_flux2_fp8.safetensors |
| precomputed_embeddings output/text_embeddings_global.pt |
| Model: |
| image_size 1024 |
| num_classes 6 |
| control_in_dim 3072 |
| fusion_dim 768 |
| num_fusion_blocks 3 |
| num_heads 12 |
| num_fourier_bands 32 |
| boundary_threshold 0.1 |
| Training: |
| num_epochs 1 |
| batch_size 4 |
| learning_rate 0.0003 |
| weight_decay 0.01 |
| max_grad_norm 1.0 |
| grad_accum_steps 4 |
| guidance_scale 3.5 |
| num_workers 0 |
| Text Encoder: |
| text_seq_len 512 |
| text_dim 15360 |
| Logging: |
| log_interval 1 |
| save_every_n_epochs 5 |
| val_every_n_epochs 1 |
| WandB: |
| wandb_entity |
| wandb_project _smoke_test_1gpu |
| Resume: |
| resume_from (not set) |
| [MEM @ pre-flight] RAM: 25.5/188.2 GiB (13.6%) | VRAM: 0.0/95.0 GiB (0.0%) |
| |
| ============================================================ |
| [1/8] Text Embeddings |
| ============================================================ |
| Loading cached embedding from output/text_embeddings_global.pt |
| Loaded global text embedding from output/text_embeddings_global.pt (shape: torch.Size([512, 15360])) |
| |
| ============================================================ |
| [2/8] Loading VAE |
| ============================================================ |
| Done (4.3s), VRAM: 0.16 GiB |
| [MEM @ after VAE] RAM: 25.9/188.2 GiB (13.8%) | VRAM: 0.2/95.0 GiB (0.2%) |
| |
| ============================================================ |
| [3/8] Loading Transformer |
| ============================================================ |
| Dequantizing FP8 transformer weights... |
| Dequantized 128 FP8 tensors |
| Converting ComfyUI → diffusers keys... |
| Converted: 331 diffusers keys |
| Loading ControlNet weights... |
| ControlNet: 76 keys |
| Creating Flux2ControlTransformer2DModel (control_in_dim=3072)... |
| Skipped 2 control_img_in keys (dim mismatch): |
| control_img_in.bias [6144] |
| control_img_in.weight [6144, 260] |
| Missing: 2, Unexpected: 0 |
| Initialized control_img_in.weight [6144, 3072] on cuda |
| Initialized control_img_in.bias [6144] on cuda |
| FP8 compression: 203 frozen Linears, 67.9 → 37.9 GiB (saved 30.0 GiB) |
| Done (30.8s), VRAM: 37.87 GiB |
| Gradient checkpointing: enabled |
| Backbone FROZEN: all transformer params set requires_grad=False |
| Gradients will still propagate to HDC²A via control_context autograd |
| [MEM @ after Transformer] RAM: 27.0/188.2 GiB (14.3%) | VRAM: 37.9/95.0 GiB (39.9%) |
| |
| ============================================================ |
| [4/8] Creating HDC²A Adapter |
| ============================================================ |
| HDC²A: 52.4M params |
| Control: 0.0M params |
| Total trainable: 52.4M params |
| |
| ============================================================ |
| [4.5/8] Applying LoRA to ControlNet Control Blocks |
| ============================================================ |
| LoRA rank=32, alpha=32.0, dropout=0 |
| LoRA control_transformer_blocks.0.attn.to_q [6144→6144] |
| LoRA control_transformer_blocks.0.attn.to_k [6144→6144] |
| LoRA control_transformer_blocks.0.attn.to_v [6144→6144] |
| LoRA control_transformer_blocks.0.attn.add_q_proj [6144→6144] |
| LoRA control_transformer_blocks.0.attn.add_k_proj [6144→6144] |
| LoRA control_transformer_blocks.0.attn.add_v_proj [6144→6144] |
| LoRA control_transformer_blocks.0.attn.to_out.0 [6144→6144] |
| LoRA control_transformer_blocks.1.attn.to_q [6144→6144] |
| LoRA control_transformer_blocks.1.attn.to_k [6144→6144] |
| LoRA control_transformer_blocks.1.attn.to_v [6144→6144] |
| LoRA control_transformer_blocks.1.attn.add_q_proj [6144→6144] |
| LoRA control_transformer_blocks.1.attn.add_k_proj [6144→6144] |
| LoRA control_transformer_blocks.1.attn.add_v_proj [6144→6144] |
| LoRA control_transformer_blocks.1.attn.to_out.0 [6144→6144] |
| LoRA control_transformer_blocks.2.attn.to_q [6144→6144] |
| LoRA control_transformer_blocks.2.attn.to_k [6144→6144] |
| LoRA control_transformer_blocks.2.attn.to_v [6144→6144] |
| LoRA control_transformer_blocks.2.attn.add_q_proj [6144→6144] |
| LoRA control_transformer_blocks.2.attn.add_k_proj [6144→6144] |
| LoRA control_transformer_blocks.2.attn.add_v_proj [6144→6144] |
| LoRA control_transformer_blocks.2.attn.to_out.0 [6144→6144] |
| LoRA control_transformer_blocks.3.attn.to_q [6144→6144] |
| LoRA control_transformer_blocks.3.attn.to_k [6144→6144] |
| LoRA control_transformer_blocks.3.attn.to_v [6144→6144] |
| LoRA control_transformer_blocks.3.attn.to_out.0 [6144→6144] |
|
|
| LoRA modules injected: 25 |
| LoRA trainable params: 9.83M |
|
|
| Parameter Statistics: |
| HDC²A Adapter: total=52.4M trainable=52.4M |
| ControlNet (frozen): total=4143.4M LoRA trainable=9.83M |
| Flux2 backbone: total=0.0M trainable=0.0M |
| ────────────────────────────────────────────────── |
| Total trainable: HDC²A 52.4M + LoRA 9.83M = 62.19M |
|
|
| ============================================================ |
| [5/8] Building Optimizer |
| ============================================================ |
| AdamW: adapter_lr=3.00e-04, backbone_lr=0.00e+00 |
| param_group 'adapter': 112 tensors, lr=3.00e-04 |
| Scheduler: 400 warmup steps → cosine over ~25 steps |
| [6/8] Resume: skipped (no checkpoint specified) |
|
|
| ============================================================ |
| [7/8] Forward Sanity Check |
| ============================================================ |
| [test 1/4] Forward pass (eval mode)... |
| Output shape: torch.Size([1, 4096, 128]) |
| Output stats: mean=0.0427, std=0.5156 |
| VRAM peak (forward): 68.44 GiB |
| [test 2/4] Loss computation (train mode)... |
| Loss value: 1.437658 |
| [test 3/4] Backward pass... |
| Backward completed. VRAM peak (backward): 49.17 GiB |
| [test 4/4] Gradient flow check... |
| HDC²A: 112/112 params have non-zero grad |
| Control: 25/50 params have non-zero grad |
| Top grad norms (HDC²A): |
| semantic_encoder.conv_stem.6.weight: 0.005524 |
| depth_encoder.conv_stem.6.weight: 0.004883 |
| W_s.weight: 0.004456 |
| W_d.weight: 0.004181 |
| fusion_blocks.0.ffn_sem.2.weight: 0.003784 |
| Test result: PASSED |
| [MEM @ after test] RAM: 27.5/188.2 GiB (14.6%) | VRAM: 38.0/95.0 GiB (40.0%) |
| |
| *** --test passed: all models loaded, forward test OK. Exiting. *** |
| |