# Task List

## Fix validation splitting by token identity
- Replace the class-only split logic in train.py (line 235)
- Group cached context samples by source_token or token_address
- Ensure a token can exist in train or val, never both
- Keep class balance as a secondary constraint, not the primary identity rule

## Stop using the current validation as a decision-grade signal
- Treat old val curves/checkpoints as contaminated
- Re-evaluate only after the token-grouped split is in place

## Audit cache metadata and make token identity explicit
- Ensure every cached sample has stable token identity fields
- Required minimum: source_token, class_id
- Prefer also storing lightweight cache-planning metadata for later analysis

## Redesign cache generation around fixed budgets
- Define the total cache budget first
- Allocate exact sample counts per token class before writing files
- Do not let the raw source distribution decide cache composition

## Remove destructive dependence on token class map filtering alone
- Token class should guide budget allocation
- It should not be the only logic determining whether the cache is useful

## Add cache-time context-level balancing
- After sampling a candidate context, evaluate the realized future labels for that context
- Use the realized context outcome to decide whether to keep or skip it
- Do this before saving to disk

## Start with binary polarity, not movement-threshold balancing
- Positive if the max valid horizon return > 0
- Negative otherwise
- Use this only as cache-selection metadata first

## Make polarity quotas class-conditional
- For stronger classes, target positive/negative ratios
- For garbage classes, do not force positives
- Keep class 0 mostly natural/negative

## Keep T_cutoff random during cache generation
- Do not freeze a single deterministic cutoff per token
- Determinism should live in the planning/budget logic, not in removing context diversity

## Add exact acceptance accounting during cache build
- Track how many samples have already been accepted per class
- Track polarity counts per class
- Stop accepting once quotas are filled
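The budget, quota, and acceptance-accounting tasks above can be sketched together as a single cache-build loop. This is a minimal illustration, not the real pipeline: the helper names (`sample_context`, `horizon_returns`, `save_to_cache`), the token attributes (`.address`, `.class_id`), and the specific budget/ratio numbers are all assumptions.

```python
from collections import defaultdict

# Hypothetical sketch of a budgeted, quota-driven cache build.
# Budgets, ratios, and all helper names are placeholder assumptions.
CLASS_BUDGET = {0: 2000, 1: 3000, 2: 3000}   # exact samples per class_id
POS_RATIO = {0: None, 1: 0.5, 2: 0.5}        # None = keep class 0 natural
MAX_ATTEMPTS_PER_TOKEN = 50                  # retry limit per token

def polarity(returns):
    """Binary polarity: positive if the max valid horizon return > 0."""
    valid = [r for r in returns if r is not None]
    return "pos" if valid and max(valid) > 0 else "neg"

def build_cache(tokens, sample_context, horizon_returns, save_to_cache):
    accepted = defaultdict(int)                         # class_id -> accepted count
    pol_counts = defaultdict(lambda: defaultdict(int))  # class_id -> polarity -> count
    for token in tokens:
        cls = token.class_id
        for _ in range(MAX_ATTEMPTS_PER_TOKEN):         # don't oversample one token
            if accepted[cls] >= CLASS_BUDGET[cls]:
                break                                   # class quota already filled
            ctx = sample_context(token)                 # T_cutoff stays random here
            pol = polarity(horizon_returns(ctx))
            ratio = POS_RATIO[cls]
            if ratio is not None:
                # class-conditional polarity quota for the stronger classes
                cap = int(CLASS_BUDGET[cls] * (ratio if pol == "pos" else 1 - ratio))
                if pol_counts[cls][pol] >= cap:
                    continue                            # reject before saving to disk
            save_to_cache(ctx, source_token=token.address, class_id=cls, polarity=pol)
            accepted[cls] += 1
            pol_counts[cls][pol] += 1
    return accepted, pol_counts
```

Note the ordering: the realized-outcome check happens after sampling but before any disk write, so rejected contexts cost an attempt, not cache space.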
## Avoid cache waste from duplicate low-value contexts
- Add retry/attempt limits per token
- If a token cannot satisfy the desired quota type, stop oversampling it endlessly
- Move on to other tokens instead of filling disk with junk

## Keep label derivation in the data pipeline, not in training logic
- The loader should produce final labels and masks
- The collator should only stack/batch them
- The model should only consume them

## Reduce or remove train-time class reweighting after the cache is fixed
- Revisit WeightedRandomSampler
- Revisit class_loss_weights
- If the cache is balanced upstream, training should not need heavy rescue weighting

## Revisit the movement head only after the split and cache are fixed
- Keep it auxiliary
- Do not let movement-label threshold debates block the more important data fixes
- Later, simplify naming/threshold assumptions if needed

## Add cache audit tooling
- Report counts by class_id
- Report counts by class × polarity
- Report unique tokens by class
- Report acceptance/rejection reasons
- Report a train/val token overlap check

## Add validation integrity checks
- Assert zero token overlap between train and val
- Print per-class token counts, not just sample counts
- Print per-class sample counts too

## Rebuild the cache after the new policy is implemented
- The old cache is shaped by the wrong distribution
- The old validation split is not trustworthy
- New training should start from the rebuilt corpus

## Retrain and re-baseline from scratch
- New split
- New cache
- Minimal train-time rescue weighting
- Recompare backbone behavior only after that

## Recommended implementation order
1. Token-grouped validation split
2. Validation overlap checks
3. Cache metadata cleanup
4. Exact class quotas in cache generation
5. Class-conditional polarity quotas
6. Cache audit reports
7. Remove/reduce train-time weighting
8. Rebuild cache
9. Retrain
10. Reassess movement head
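The token-grouped split and its integrity checks can be sketched as below. The sample schema (dicts carrying `source_token` and `class_id`) and the `val_frac`/`seed` defaults are illustrative assumptions, not the project's actual data layout:

```python
import random
from collections import Counter, defaultdict

# Hypothetical sketch of a token-grouped train/val split with a
# zero-overlap assertion and per-class token/sample reporting.
def token_grouped_split(samples, val_frac=0.2, seed=42):
    by_token = defaultdict(list)
    for s in samples:
        by_token[s["source_token"]].append(s)
    tokens = sorted(by_token)                 # deterministic base order
    random.Random(seed).shuffle(tokens)       # seeded shuffle of token identities
    n_val = max(1, int(len(tokens) * val_frac))
    val_tokens = set(tokens[:n_val])
    train = [s for t in tokens[n_val:] for s in by_token[t]]
    val = [s for t in val_tokens for s in by_token[t]]
    # Integrity check: a token must never appear in both splits.
    assert not ({s["source_token"] for s in train} & {s["source_token"] for s in val})
    return train, val

def split_report(split):
    """Per-class unique-token counts, plus per-class sample counts."""
    samples_per_class = Counter(s["class_id"] for s in split)
    tokens_per_class = {
        cls: len({s["source_token"] for s in split if s["class_id"] == cls})
        for cls in samples_per_class
    }
    return samples_per_class, tokens_per_class
```

Splitting over token identities rather than individual samples is what guarantees no token leaks across the boundary; class balance can then be tuned by adjusting which tokens land in val, without ever breaking the grouping invariant.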