AGILLM3.5 Loss-113 Bugfix Report

Date: 2026-06-01

Summary

The AGILLM3.5 compatibility port showed a loss of ~113 during dblock training. The fused-QKV checkpoint conversion was not the cause: a legacy separate-Q/K/V block and the fused-QKV block produced numerically equivalent outputs, with a maximum absolute difference of 2.38e-07.

The actual problem was objective alignment:

The SAT objective had drifted from the AGILLM3 target contract.
The dblock objective was training selected transformer layer blocks outside the full transformer prefix/suffix context.

Both issues were fixed. A bounded real-model dblock smoke test now reports a loss of about 10.57, and a run resumed with the fixes stays in the 10-11 loss range rather than ~113.

Root Cause

AGILLM3 trained SAT by predicting tokens shifted by SAT_BLOCK:

sat_in = ids[:, :-SAT_BLOCK]
tgt_sat = ids[:, SAT_BLOCK:]

The port accidentally scored sequence-tail SAT hidden states against sequence-head targets. That broke the SAT objective.
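
The shifted-target contract above can be sketched with nested Python lists standing in for the real id tensor. The value of SAT_BLOCK, the toy ids, and the helper name sat_pairs are assumptions for illustration, not the port's code:

```python
SAT_BLOCK = 2  # assumed value, for illustration only


def sat_pairs(ids, sat_block=SAT_BLOCK):
    """Return (sat_in, tgt_sat) under the shifted-target contract.

    List-based stand-in for ids[:, :-SAT_BLOCK] and ids[:, SAT_BLOCK:].
    """
    sat_in = [row[:-sat_block] for row in ids]
    tgt_sat = [row[sat_block:] for row in ids]
    return sat_in, tgt_sat


ids = [[10, 11, 12, 13, 14, 15]]  # toy batch holding one sequence
sat_in, tgt_sat = sat_pairs(ids)

# Position t of the input is scored against the token SAT_BLOCK steps ahead.
for t, target in enumerate(tgt_sat[0]):
    assert target == ids[0][t + SAT_BLOCK]

print(sat_in)   # [[10, 11, 12, 13]]
print(tgt_sat)  # [[12, 13, 14, 15]]
```

A shape check alone may not catch the regression described here, because a mis-paired slice of the same length still passes it. An explicit offset assertion like the loop above is one way to pin the contract down.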

The dblock path also treated a selected block as if it could be trained directly from embeddings to logits:

embedding -> selected layers only -> final head

That is not a valid full-model objective for middle or late transformer blocks. The selected block must be trained under the activation distribution created by its real prefix layers and judged through its real suffix layers.

Fix

The dblock path now trains selected layers inside the full transformer context:

input
-> prefix layers without gradients
-> selected layers with gradients
-> suffix layers frozen but still differentiable with respect to selected output
-> language-model loss
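
The gradient routing in this pipeline can be sketched with a deliberately tiny toy: each "layer" is a scalar multiply and the loss is squared error rather than a language-model loss. The function name dblock_step and all values are hypothetical. The point is only which quantities receive gradients and which are updated:

```python
def dblock_step(x, target, w_prefix, w_selected, w_suffix, lr=0.01):
    """One toy update of the selected block inside its prefix/suffix context.

    Only w_selected is updated; w_prefix and w_suffix stay fixed.
    """
    # Prefix: forward only; its output is treated as a constant input.
    h_prefix = w_prefix * x
    # Selected block: the only trainable part.
    h_selected = w_selected * h_prefix
    # Suffix: weights frozen, but the gradient still has to pass through it.
    out = w_suffix * h_selected
    loss = (out - target) ** 2

    # Backward pass by hand (chain rule).
    d_out = 2.0 * (out - target)
    d_h_selected = d_out * w_suffix         # flows through the frozen suffix
    d_w_selected = d_h_selected * h_prefix  # uses the real prefix activation

    return loss, w_selected - lr * d_w_selected


w_prefix, w_selected, w_suffix = 2.0, 0.5, 3.0
losses = []
for _ in range(5):
    loss, w_selected = dblock_step(1.0, 6.0, w_prefix, w_selected, w_suffix)
    losses.append(loss)
print(losses)
```

In a PyTorch implementation, one common way to get the same routing is to run the prefix under torch.no_grad() and set requires_grad=False on the suffix parameters while keeping the suffix forward pass in the autograd graph. Whether the port does exactly this is not stated here.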

The SAT objective was restored to the original shifted-target contract, and the variable SAT gate was restored as a confidence gate.

Verification

Fused-QKV compatibility:

{
  "max_abs_diff": 2.384185791015625e-07,
  "mean_abs_diff": 1.0332144029234769e-08
}
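
A check of this kind can be sketched in plain Python. The sketch assumes the fused weight is the Q, K and V matrices stacked row-wise in that order. That layout, the helper matvec, and the toy values are assumptions, not the actual conversion code:

```python
def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


# Toy 2x2 projection weights (hypothetical values).
w_q = [[0.1, 0.2], [0.3, 0.4]]
w_k = [[0.5, 0.6], [0.7, 0.8]]
w_v = [[0.9, 1.0], [1.1, 1.2]]
x = [1.5, -2.0]

# Legacy path: three separate projections.
q, k, v = matvec(w_q, x), matvec(w_k, x), matvec(w_v, x)

# Fused path: stack the rows of Q, K, V, project once, then split.
w_qkv = w_q + w_k + w_v
d = len(w_q)
fused = matvec(w_qkv, x)
fq, fk, fv = fused[:d], fused[d:2 * d], fused[2 * d:]

diffs = [abs(a - b) for a, b in zip(q + k + v, fq + fk + fv)]
report = {"max_abs_diff": max(diffs), "mean_abs_diff": sum(diffs) / len(diffs)}
print(report)
```

In this toy both paths perform identical arithmetic, so the difference is exactly zero. The reported maximum of 2.384185791015625e-07 equals 2^-22, twice float32 machine epsilon, which is consistent with ordinary rounding differences between fused and separate kernels rather than a conversion error.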

Full checkpoint eval on a short text batch:

{
  "ar_loss": 6.39673376083374,
  "sat_fixed_loss": 8.313789367675781
}

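Bounded real-model dblock smoke test:

loss=10.574 ar=9.955 sat=11.193

The total is consistent with an unweighted mean of the two terms: (9.955 + 11.193) / 2 = 10.574.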

Resumed run with the fixes applied:

loss=11.181
loss=10.486

What Remains

The local full-context dblock objective is now coherent, but distributed slice workers need a stricter protocol before their updates can be treated as scientifically valid. A worker that sees only a layer slice cannot train a middle transformer block from embeddings alone and call that the full-model objective.

Future distributed work should send valid boundary activations, suffix context, teacher gradients, or explicitly defined auxiliary targets.
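
One of these options, boundary activations plus teacher gradients, can be sketched with a scalar toy model. The SliceTask payload, its field names, and worker_weight_grad are hypothetical and do not describe an existing protocol. The sketch only shows that a worker given both quantities recovers the same weight gradient as the full-context objective:

```python
from dataclasses import dataclass


@dataclass
class SliceTask:
    """Hypothetical payload for a worker that owns only the selected block."""
    boundary_activation: float  # real prefix output, not a raw embedding
    teacher_gradient: float     # dLoss/d(selected output), taken through the suffix


def worker_weight_grad(task: SliceTask) -> float:
    """Local weight gradient for a scalar block h_out = w * h_in."""
    return task.teacher_gradient * task.boundary_activation


# Coordinator side: it holds the full toy model and builds the task.
x, target = 1.0, 6.0
w_prefix, w_selected, w_suffix = 2.0, 0.5, 3.0
h_in = w_prefix * x
out = w_suffix * (w_selected * h_in)
task = SliceTask(
    boundary_activation=h_in,
    teacher_gradient=2.0 * (out - target) * w_suffix,
)

# Reference: gradient of the full-model squared error w.r.t. w_selected.
full_grad = 2.0 * (out - target) * w_suffix * h_in
print(worker_weight_grad(task), full_grad)
```

A worker that received only x, with no prefix output and no gradient from the suffix, would have nothing to compute this product from. That is the gap the stricter protocol needs to close.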

Takeaway

The lesson is simple but important: transformer blocks are not independent embedding-to-logit predictors. Blockwise or distributed training can work, but the objective must preserve the full model's activation geometry. Once that was restored, the loss fell from ~113 back to the expected 10-11 range.
