# Task List

## Fix validation splitting by token identity
- Replace the class-only split logic in train.py (line 235)
- Group cached context samples by source_token or token_address
- Ensure a token can exist in train or val, never both
- Keep class balance as a secondary constraint, not the primary identity rule

## Stop using the current validation as a decision-grade signal
- Treat old val curves/checkpoints as contaminated
- Re-evaluate only after the token-grouped split is in place

## Audit cache metadata and make token identity explicit
- Ensure every cached sample has stable token identity fields
- Required minimum: source_token, class_id
- Prefer also storing lightweight cache-planning metadata for later analysis

## Redesign cache generation around fixed budgets
- Define the total cache budget first
- Allocate exact sample counts per token class before writing files
- Do not let the raw source distribution decide cache composition

## Remove destructive dependence on token class map filtering alone
- Token class should guide budget allocation
- It should not be the only logic determining whether the cache is useful

## Add cache-time context-level balancing
- After sampling a candidate context, evaluate the realized future labels for that context
- Use the realized context outcome to decide whether to keep or skip it
- Do this before saving to disk

## Start with binary polarity, not movement-threshold balancing
- Positive if the max valid horizon return > 0
- Negative otherwise
- Use this only as cache-selection metadata first

## Make polarity quotas class-conditional
- For stronger classes, target positive/negative ratios
- For garbage classes, do not force positives
- Keep class 0 mostly natural/negative

## Keep T_cutoff random during cache generation
- Do not freeze a single deterministic cutoff per token
- Determinism should live in the planning/budget logic, not in removing context diversity

## Add exact acceptance accounting during cache build
- Track how many samples have already been accepted per class
- Track polarity counts per class
- Stop accepting once quotas are filled
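The budget, quota, and acceptance-accounting tasks above can be sketched together as a single cache-build loop. This is a minimal illustration, not the real pipeline: the helper names (`sample_context`, `horizon_returns`, `save_to_cache`), the token attributes (`.address`, `.class_id`), and the specific budget/ratio numbers are all assumptions.

```python
from collections import defaultdict

# Hypothetical sketch of a budgeted, quota-driven cache build.
# Budgets, ratios, and all helper names are placeholder assumptions.
CLASS_BUDGET = {0: 2000, 1: 3000, 2: 3000}   # exact samples per class_id
POS_RATIO = {0: None, 1: 0.5, 2: 0.5}        # None = keep class 0 natural
MAX_ATTEMPTS_PER_TOKEN = 50                  # retry limit per token

def polarity(returns):
    """Binary polarity: positive if the max valid horizon return > 0."""
    valid = [r for r in returns if r is not None]
    return "pos" if valid and max(valid) > 0 else "neg"

def build_cache(tokens, sample_context, horizon_returns, save_to_cache):
    accepted = defaultdict(int)                         # class_id -> accepted count
    pol_counts = defaultdict(lambda: defaultdict(int))  # class_id -> polarity -> count
    for token in tokens:
        cls = token.class_id
        for _ in range(MAX_ATTEMPTS_PER_TOKEN):         # don't oversample one token
            if accepted[cls] >= CLASS_BUDGET[cls]:
                break                                   # class quota already filled
            ctx = sample_context(token)                 # T_cutoff stays random here
            pol = polarity(horizon_returns(ctx))
            ratio = POS_RATIO[cls]
            if ratio is not None:
                # class-conditional polarity quota for the stronger classes
                cap = int(CLASS_BUDGET[cls] * (ratio if pol == "pos" else 1 - ratio))
                if pol_counts[cls][pol] >= cap:
                    continue                            # reject before saving to disk
            save_to_cache(ctx, source_token=token.address, class_id=cls, polarity=pol)
            accepted[cls] += 1
            pol_counts[cls][pol] += 1
    return accepted, pol_counts
```

Note the ordering: the realized-outcome check happens after sampling but before any disk write, so rejected contexts cost an attempt, not cache space.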
## Avoid cache waste from duplicate low-value contexts
- Add retry/attempt limits per token
- If a token cannot satisfy the desired quota type, stop oversampling it endlessly
- Move on to other tokens instead of filling disk with junk

## Keep label derivation in the data pipeline, not in training logic
- The loader should produce final labels and masks
- The collator should only stack/batch them
- The model should only consume them

## Reduce or remove train-time class reweighting after the cache is fixed
- Revisit WeightedRandomSampler
- Revisit class_loss_weights
- If the cache is balanced upstream, training should not need heavy rescue weighting

## Revisit the movement head only after the split and cache are fixed
- Keep it auxiliary
- Do not let movement-label threshold debates block the more important data fixes
- Later, simplify naming/threshold assumptions if needed

## Add cache audit tooling
- Report counts by class_id
- Report counts by class × polarity
- Report unique tokens by class
- Report acceptance/rejection reasons
- Report a train/val token overlap check

## Add validation integrity checks
- Assert zero token overlap between train and val
- Print per-class token counts, not just sample counts
- Print per-class sample counts too

## Rebuild the cache after the new policy is implemented
- The old cache is shaped by the wrong distribution
- The old validation split is not trustworthy
- New training should start from the rebuilt corpus

## Retrain and re-baseline from scratch
- New split
- New cache
- Minimal train-time rescue weighting
- Recompare backbone behavior only after that

## Recommended implementation order
1. Token-grouped validation split
2. Validation overlap checks
3. Cache metadata cleanup
4. Exact class quotas in cache generation
5. Class-conditional polarity quotas
6. Cache audit reports
7. Remove/reduce train-time weighting
8. Rebuild cache
9. Retrain
10. Reassess movement head
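The token-grouped split and its integrity checks can be sketched as below. The sample schema (dicts carrying `source_token` and `class_id`) and the `val_frac`/`seed` defaults are illustrative assumptions, not the project's actual data layout:

```python
import random
from collections import Counter, defaultdict

# Hypothetical sketch of a token-grouped train/val split with a
# zero-overlap assertion and per-class token/sample reporting.
def token_grouped_split(samples, val_frac=0.2, seed=42):
    by_token = defaultdict(list)
    for s in samples:
        by_token[s["source_token"]].append(s)
    tokens = sorted(by_token)                 # deterministic base order
    random.Random(seed).shuffle(tokens)       # seeded shuffle of token identities
    n_val = max(1, int(len(tokens) * val_frac))
    val_tokens = set(tokens[:n_val])
    train = [s for t in tokens[n_val:] for s in by_token[t]]
    val = [s for t in val_tokens for s in by_token[t]]
    # Integrity check: a token must never appear in both splits.
    assert not ({s["source_token"] for s in train} & {s["source_token"] for s in val})
    return train, val

def split_report(split):
    """Per-class unique-token counts, plus per-class sample counts."""
    samples_per_class = Counter(s["class_id"] for s in split)
    tokens_per_class = {
        cls: len({s["source_token"] for s in split if s["class_id"] == cls})
        for cls in samples_per_class
    }
    return samples_per_class, tokens_per_class
```

Splitting over token identities rather than individual samples is what guarantees no token leaks across the boundary; class balance can then be tuned by adjusting which tokens land in val, without ever breaking the grouping invariant.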