AGILLM3.5 Disaggregated DiffusionBlock Report

Public-safe summary. This version intentionally omits private hostnames, IPs, operational paths, credentials, and deployment topology.

Executive Summary

AGILLM3.5 runs the AGILLM3 checkpoint and tokenizer contract through the AGILLM4 transformer runtime and the DiffusionBlock training path.

The important correction: the apparent loss=113 was not checkpoint corruption. It was the DiffusionBlock EDM-weighted training objective being logged as if it were ordinary cross-entropy (CE). A direct checkpoint probe that bypasses DiffusionBlock showed the port is healthy:

Full AGILLM3 checkpoint loaded in the AGILLM3.5 runtime with no missing or unexpected core keys.
Original split-q/k/v attention and AGILLM3.5 fused-qkv attention matched at block level with max absolute difference 4.77e-7.
Full-runtime probe on a natural-text batch produced AR loss 6.13 and SAT loss 12.36.
A patched worker now reports raw CE-style loss separately from the weighted EDM objective, e.g. raw 10.73 and weighted 53.66.
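
The difference between the two logged numbers can be illustrated with a minimal sketch. This report does not give the exact weighting function, so the EDM-style weight below, the SIGMA_DATA value, and the names edm_weight, per_token_ce, and loss_report are assumptions for illustration, not the worker's actual code.

```python
import numpy as np

SIGMA_DATA = 0.5  # assumed value; not taken from this report


def edm_weight(sigma, sigma_data=SIGMA_DATA):
    """EDM-style weight (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2 (assumed form)."""
    sigma = np.asarray(sigma, dtype=np.float64)
    return (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2


def per_token_ce(logits, targets):
    """Per-token cross-entropy in nats, using a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]


def loss_report(logits, targets, sigma):
    """Return raw and weighted losses as two separate log fields."""
    ce = per_token_ce(logits, targets)
    return {
        "loss": float(ce.mean()),  # comparable to ordinary CE
        "weighted_loss": float((edm_weight(sigma) * ce).mean()),  # the optimized objective
    }
```

With sigma equal to SIGMA_DATA = 0.5 the weight is 8, so weighted_loss is eight times the raw loss on the same batch. A weighted value can therefore look alarming when read as CE even though the underlying model is healthy.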

What Is Solved

AGILLM3 checkpoint compatibility in the AGILLM4 runtime is validated.
Fused QKV remapping is exact for the tested block.
Dense full attention remains available through the manual backend; it is not the sublinear approximation path.
The loss=113 confusion is fixed in reporting: training still optimizes the weighted objective, but logs now report a raw loss comparable to ordinary CE alongside weighted_loss.
Network-disaggregated DiffusionBlock slice training works mechanically: workers can train isolated layer windows and return mergeable updates.
The AGILLM3.5 single-file runtime exists and is published separately.
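
The fused-QKV claim in the list above can be checked with a small equivalence test. The sketch below is a stand-in using random NumPy weights, not the AGILLM3.5 code. It assumes the fused weight is the column-wise concatenation of the q, k, and v projection matrices, which this report does not confirm, and the function names are hypothetical.

```python
import numpy as np


def fuse_qkv(w_q, w_k, w_v):
    """Concatenate split projection weights into one fused matrix (assumed layout)."""
    return np.concatenate([w_q, w_k, w_v], axis=1)


def split_projection(x, w_q, w_k, w_v):
    """Original path: three separate matmuls."""
    return x @ w_q, x @ w_k, x @ w_v


def fused_projection(x, w_qkv):
    """Fused path: one matmul, then split the output into q, k, v."""
    return np.split(x @ w_qkv, 3, axis=1)


def max_abs_difference(x, w_q, w_k, w_v):
    """Largest element-wise gap between the split and fused outputs."""
    split_out = split_projection(x, w_q, w_k, w_v)
    fused_out = fused_projection(x, fuse_qkv(w_q, w_k, w_v))
    return max(float(np.abs(a - b).max()) for a, b in zip(split_out, fused_out))


rng = np.random.default_rng(0)
d_model = 64
x = rng.standard_normal((8, d_model)).astype(np.float32)
w_q, w_k, w_v = (
    rng.standard_normal((d_model, d_model)).astype(np.float32) for _ in range(3)
)
print(max_abs_difference(x, w_q, w_k, w_v))
```

In float32 the two paths can differ by rounding noise because the matmuls accumulate in a different order. A gap at that scale indicates a correct remap, while a layout mistake would produce differences on the order of the activations themselves.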

What Is Not Solved Yet

Long-horizon quality equivalence between serial DiffusionBlock training and multi-node rare-sync DiffusionBlock training is not yet proven.
Optimal sync interval is not yet established for AGILLM3.5 on real data.
MoE-enhanced DiffusionBlock remains experimental and should not be mainlined without loss-per-VRAM and loss-per-token evidence.
CPU-worker economics remain unfavorable versus modern GPUs unless compute is already owned, idle, or otherwise effectively free.
Cheap small-GPU slice workers look more promising than CPU swarms, but need matched AGILLM slice benchmarks before buying hardware.

Science Status

The validated result is that DiffusionBlock slices can be trained with enough independence to make network-disaggregated transformer training mechanically viable.
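
As a rough picture of the mechanics, the sketch below partitions layers into contiguous windows and merges the updates that workers return. This report does not describe the actual partitioning or merge protocol, so layer_windows, merge_updates, and the dict-of-layers representation are hypothetical.

```python
def layer_windows(num_layers, num_workers):
    """Split layer indices into contiguous, non-overlapping windows, one per worker."""
    base, extra = divmod(num_layers, num_workers)
    windows, start = [], 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)
        windows.append(range(start, start + size))
        start += size
    return windows


def merge_updates(global_layers, updates):
    """Write each worker's updated layers back into a copy of the global model.

    global_layers: dict mapping layer index -> parameters
    updates: list of dicts with the same shape, one per worker
    """
    merged = dict(global_layers)
    seen = set()
    for update in updates:
        overlap = seen & update.keys()
        if overlap:
            raise ValueError(f"layers updated by more than one worker: {sorted(overlap)}")
        unknown = update.keys() - global_layers.keys()
        if unknown:
            raise ValueError(f"update touches unknown layers: {sorted(unknown)}")
        seen |= update.keys()
        merged.update(update)
    return merged
```

Rejecting overlapping windows keeps the merge a plain overwrite. Any scheme where several workers touch the same layer would need an explicit averaging rule instead.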

The remaining scientific question is quality, not mechanics: after rare synchronization, does the global model track the same loss curve as a coherent single-node run? That needs scheduled full-checkpoint evaluation, not just per-worker slice loss.

Recommended Next Experiment

Run a controlled A/B:

Baseline: serial AGILLM3.5 DiffusionBlock training from the same checkpoint.
Variant: multi-worker disaggregated DiffusionBlock training from the same checkpoint.
Evaluation: periodic full-model AR/SAT loss on a fixed held-out token stream.
Decision metric: loss delta at equal processed tokens and equal wall-clock.

If the loss delta stays small, the method graduates from systems demo to training method.
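
The decision metric can be computed from two evaluation logs even when the runs were evaluated at different points. This is a minimal sketch, assuming each log is a list of (x, loss) pairs sorted by x, where x is either processed tokens or wall-clock seconds. The function name and the linear interpolation are illustrative choices, not part of the AGILLM tooling.

```python
import numpy as np


def loss_delta(baseline, variant):
    """Variant minus baseline loss, evaluated at the baseline's x positions.

    Only the x range covered by both runs is compared; the variant curve is
    linearly interpolated onto the baseline's evaluation points.
    """
    base_x, base_loss = (np.asarray(col, dtype=np.float64) for col in zip(*baseline))
    var_x, var_loss = (np.asarray(col, dtype=np.float64) for col in zip(*variant))
    lo = max(base_x[0], var_x[0])
    hi = min(base_x[-1], var_x[-1])
    mask = (base_x >= lo) & (base_x <= hi)
    grid = base_x[mask]
    return grid, np.interp(grid, var_x, var_loss) - base_loss[mask]


# Placeholder numbers for illustration only; these are not measured results.
serial_log = [(0, 6.0), (100, 5.0), (200, 4.0)]
disaggregated_log = [(0, 6.0), (200, 4.4)]
tokens, delta = loss_delta(serial_log, disaggregated_log)
```

Running the same function once with tokens and once with seconds as x yields both halves of the decision metric.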
