Run dir : output/_smoke_test_1gpu
Log file: output/_smoke_test_1gpu/train.log
GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition | VRAM: 95.0 GiB | PyTorch: 2.10.0+cu128

Final Configuration:
  Paths:
    transformer_path        weights/flux2_dev_fp8mixed.safetensors
    vae_path                weights/flux2-vae.safetensors
    controlnet_path         weights/FLUX.2-dev-Fun-Controlnet-Union-2602.safetensors
    dataset_dir             dataset
    color_map_path          configs/color_map.json
    output_dir              output/_smoke_test_1gpu
    text_encoder_path       weights/mistral_3_small_flux2_fp8.safetensors
    precomputed_embeddings  output/text_embeddings_global.pt
  Model:
    image_size              1024
    num_classes             6
    control_in_dim          3072
    fusion_dim              768
    num_fusion_blocks       3
    num_heads               12
    num_fourier_bands       32
    boundary_threshold      0.1
  Training:
    num_epochs              1
    batch_size              4
    learning_rate           0.0003
    weight_decay            0.01
    max_grad_norm           1.0
    grad_accum_steps        4
    guidance_scale          3.5
    num_workers             0
  Text Encoder:
    text_seq_len            512
    text_dim                15360
  Logging:
    log_interval            1
    save_every_n_epochs     5
    val_every_n_epochs      1
  WandB:
    wandb_entity
    wandb_project           _smoke_test_1gpu
  Resume:
    resume_from             (not set)

[MEM @ pre-flight] RAM: 25.5/188.2 GiB (13.6%) | VRAM: 0.0/95.0 GiB (0.0%)

============================================================
[1/8] Text Embeddings
============================================================
Loading cached embedding from output/text_embeddings_global.pt
Loaded global text embedding from output/text_embeddings_global.pt (shape: torch.Size([512, 15360]))

============================================================
[2/8] Loading VAE
============================================================
Done (4.3s), VRAM: 0.16 GiB
[MEM @ after VAE] RAM: 25.9/188.2 GiB (13.8%) | VRAM: 0.2/95.0 GiB (0.2%)

============================================================
[3/8] Loading Transformer
============================================================
Dequantizing FP8 transformer weights...
Dequantized 128 FP8 tensors
Converting ComfyUI → diffusers keys...
Converted: 331 diffusers keys
Loading ControlNet weights...
ControlNet: 76 keys
Creating Flux2ControlTransformer2DModel (control_in_dim=3072)...
Skipped 2 control_img_in keys (dim mismatch):
  control_img_in.bias [6144]
  control_img_in.weight [6144, 260]
Missing: 2, Unexpected: 0
Initialized control_img_in.weight [6144, 3072] on cuda
Initialized control_img_in.bias [6144] on cuda
FP8 compression: 203 frozen Linears, 67.9 → 37.9 GiB (saved 30.0 GiB)
Done (30.8s), VRAM: 37.87 GiB
Gradient checkpointing: enabled
Backbone FROZEN: all transformer params set requires_grad=False
Gradients will still propagate to HDC²A via control_context autograd
[MEM @ after Transformer] RAM: 27.0/188.2 GiB (14.3%) | VRAM: 37.9/95.0 GiB (39.9%)

============================================================
[4/8] Creating HDC²A Adapter
============================================================
HDC²A: 52.4M params
Control: 0.0M params
Total trainable: 52.4M params

============================================================
[4.5/8] Applying LoRA to ControlNet Control Blocks
============================================================
LoRA rank=32, alpha=32.0, dropout=0
  LoRA control_transformer_blocks.0.attn.to_q [6144→6144]
  LoRA control_transformer_blocks.0.attn.to_k [6144→6144]
  LoRA control_transformer_blocks.0.attn.to_v [6144→6144]
  LoRA control_transformer_blocks.0.attn.add_q_proj [6144→6144]
  LoRA control_transformer_blocks.0.attn.add_k_proj [6144→6144]
  LoRA control_transformer_blocks.0.attn.add_v_proj [6144→6144]
  LoRA control_transformer_blocks.0.attn.to_out.0 [6144→6144]
  LoRA control_transformer_blocks.1.attn.to_q [6144→6144]
  LoRA control_transformer_blocks.1.attn.to_k [6144→6144]
  LoRA control_transformer_blocks.1.attn.to_v [6144→6144]
  LoRA control_transformer_blocks.1.attn.add_q_proj [6144→6144]
  LoRA control_transformer_blocks.1.attn.add_k_proj [6144→6144]
  LoRA control_transformer_blocks.1.attn.add_v_proj [6144→6144]
  LoRA control_transformer_blocks.1.attn.to_out.0 [6144→6144]
  LoRA control_transformer_blocks.2.attn.to_q [6144→6144]
  LoRA control_transformer_blocks.2.attn.to_k [6144→6144]
  LoRA control_transformer_blocks.2.attn.to_v [6144→6144]
  LoRA control_transformer_blocks.2.attn.add_q_proj [6144→6144]
  LoRA control_transformer_blocks.2.attn.add_k_proj [6144→6144]
  LoRA control_transformer_blocks.2.attn.add_v_proj [6144→6144]
  LoRA control_transformer_blocks.2.attn.to_out.0 [6144→6144]
  LoRA control_transformer_blocks.3.attn.to_q [6144→6144]
  LoRA control_transformer_blocks.3.attn.to_k [6144→6144]
  LoRA control_transformer_blocks.3.attn.to_v [6144→6144]
  LoRA control_transformer_blocks.3.attn.to_out.0 [6144→6144]
LoRA modules injected: 25
LoRA trainable params: 9.83M

Parameter Statistics:
  HDC²A Adapter:        total=52.4M  trainable=52.4M
  ControlNet (frozen):  total=4143.4M  LoRA trainable=9.83M
  Flux2 backbone:       total=0.0M  trainable=0.0M ✓
  ──────────────────────────────────────────────────
  Total trainable: HDC²A 52.4M + LoRA 9.83M = 62.19M

============================================================
[5/8] Building Optimizer
============================================================
AdamW: adapter_lr=3.00e-04, backbone_lr=0.00e+00
  param_group 'adapter': 112 tensors, lr=3.00e-04
Scheduler: 400 warmup steps → cosine over ~25 steps

[6/8] Resume: skipped (no checkpoint specified)

============================================================
[7/8] Forward Sanity Check
============================================================
[test 1/4] Forward pass (eval mode)...
  Output shape: torch.Size([1, 4096, 128])
  Output stats: mean=0.0427, std=0.5156
  VRAM peak (forward): 68.44 GiB
[test 2/4] Loss computation (train mode)...
  Loss value: 1.437658
[test 3/4] Backward pass...
  Backward completed.
  VRAM peak (backward): 49.17 GiB
[test 4/4] Gradient flow check...
  HDC²A: 112/112 params have non-zero grad
  Control: 25/50 params have non-zero grad
  Top grad norms (HDC²A):
    semantic_encoder.conv_stem.6.weight: 0.005524
    depth_encoder.conv_stem.6.weight: 0.004883
    W_s.weight: 0.004456
    W_d.weight: 0.004181
    fusion_blocks.0.ffn_sem.2.weight: 0.003784
Test result: PASSED
[MEM @ after test] RAM: 27.5/188.2 GiB (14.6%) | VRAM: 38.0/95.0 GiB (40.0%)

*** --test passed: all models loaded, forward test OK. Exiting. ***
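
The LoRA figures in the log above can be cross-checked by hand. The sketch below is not taken from the training script. It assumes the usual LoRA layout, in which each wrapped Linear gains a down-projection A (rank x in) and an up-projection B (out x rank), and it reads rank=32, the 6144→6144 shapes, and the per-block module counts from the log. The helper names `lora_param_count` and `scalar_lora_grads` are hypothetical. The second half illustrates one plausible reason for "Control: 25/50 params have non-zero grad": if B is zero-initialized (a common LoRA default, assumed here and not confirmed by the log), the gradient reaching A is zero on the first backward pass, so only the 25 B tensors would show non-zero gradients.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


# Counts read from the log: blocks 0-2 wrap 7 projections each, block 3 wraps 4.
modules_per_block = [7, 7, 7, 4]
num_modules = sum(modules_per_block)
total = num_modules * lora_param_count(6144, 6144, rank=32)

print(num_modules)            # 25
print(f"{total / 1e6:.2f}M")  # 9.83M


def scalar_lora_grads(a: float, b: float, x: float, upstream: float = 1.0):
    """Gradients of a scalar rank-1 LoRA branch, delta = b * a * x.

    Assumption for illustration only: the branch output feeds the loss
    with derivative `upstream`.
    """
    grad_a = upstream * b * x  # vanishes whenever b == 0
    grad_b = upstream * a * x
    return grad_a, grad_b


# Assumed zero-initialized B: A receives no gradient, B does.
grad_a, grad_b = scalar_lora_grads(a=0.5, b=0.0, x=2.0)
print(grad_a, grad_b)  # 0.0 1.0
```

Under these assumptions the count matches the logged "LoRA modules injected: 25" and "LoRA trainable params: 9.83M". If that reading is right, the 25/50 ratio is expected on the first step and does not indicate a broken gradient path.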