# Task List
## Fix validation splitting by token identity

- Replace the class-only split logic in `train.py` (line 235).
- Group cached context samples by `source_token` or `token_address`.
- Ensure each token exists in exactly one of train or val, never both.
- Treat class balance as a secondary constraint; token identity is the primary split rule.
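A minimal sketch of a token-grouped split, assuming each cached sample is a dict carrying the `source_token` and `class_id` fields named above. The helper name and `val_fraction` parameter are illustrative, not existing project code:

```python
import random
from collections import defaultdict

def token_grouped_split(samples, val_fraction=0.2, seed=0):
    """Split cached samples so each token lands in exactly one split.

    Tokens (not individual samples) are the unit of assignment; class
    balance is something to check afterwards, not the split criterion.
    """
    by_token = defaultdict(list)
    for s in samples:
        by_token[s["source_token"]].append(s)

    tokens = sorted(by_token)          # stable order before shuffling
    rng = random.Random(seed)
    rng.shuffle(tokens)

    n_val = max(1, int(len(tokens) * val_fraction))
    val_tokens = set(tokens[:n_val])

    train = [s for t in tokens[n_val:] for s in by_token[t]]
    val = [s for t in tokens[:n_val] for s in by_token[t]]
    return train, val, val_tokens
```

Because assignment happens at the token level, zero train/val token overlap holds by construction, which the later integrity checks can then assert cheaply.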
## Stop using the current validation as a decision-grade signal

- Treat old validation curves and checkpoints as contaminated.
- Re-evaluate only after the token-grouped split is in place.

## Audit cache metadata and make token identity explicit

- Ensure every cached sample carries stable token-identity fields.
- Required minimum: `source_token`, `class_id`.
- Prefer also storing lightweight cache-planning metadata for later analysis.
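One way to make the required identity fields and the optional planning metadata explicit is a small dataclass. Only `source_token` and `class_id` come from the task list; the optional field names (`polarity`, `t_cutoff`, `accepted_reason`) are hypothetical suggestions:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class CacheSampleMeta:
    # Required minimum identity fields
    source_token: str
    class_id: int
    # Optional cache-planning metadata (hypothetical field names)
    polarity: Optional[int] = None        # +1 / -1 realized context outcome
    t_cutoff: Optional[int] = None        # random cutoff used for this context
    accepted_reason: Optional[str] = None # why the sample passed quota checks
```

`asdict()` gives a plain dict that can be written alongside each cached sample, so audit tooling does not have to re-derive identity from file paths.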
## Redesign cache generation around fixed budgets

- Define the total cache budget first.
- Allocate exact sample counts per token class before writing any files.
- Do not let the raw source distribution decide cache composition.
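Exact per-class allocation can be sketched as below. The weight-table shape is an assumption; the point is that counts are fixed up front and sum exactly to the total budget, regardless of the raw source distribution:

```python
def allocate_class_budget(total_budget, class_weights):
    """Allocate exact per-class sample counts before any file is written.

    `class_weights` maps class_id -> relative share (need not sum to 1).
    Integer remainders go to the largest-weight classes so the counts
    sum exactly to `total_budget`.
    """
    total_w = sum(class_weights.values())
    counts = {c: int(total_budget * w / total_w) for c, w in class_weights.items()}
    leftover = total_budget - sum(counts.values())
    for c in sorted(class_weights, key=class_weights.get, reverse=True)[:leftover]:
        counts[c] += 1
    return counts
```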
## Remove destructive dependence on token-class-map filtering alone

- Token class should guide budget allocation.
- It should not be the only logic determining whether the cache is useful.

## Add cache-time context-level balancing

- After sampling a candidate context, evaluate the realized future labels for that context.
- Use the realized context outcome to decide whether to keep or skip it.
- Do this before saving to disk.

## Start with binary polarity, not movement-threshold balancing

- Positive if the max valid-horizon return is > 0.
- Negative otherwise.
- Use this only as cache-selection metadata at first.
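The polarity rule above is simple enough to pin down in a few lines. Assuming per-horizon returns arrive with a validity mask (parameter names are illustrative):

```python
def context_polarity(horizon_returns, valid_mask):
    """Binary polarity of a sampled context from its realized future returns.

    +1 if the max return over valid horizons is > 0, else -1.
    Cache-selection metadata only, not a training label.
    """
    valid = [r for r, m in zip(horizon_returns, valid_mask) if m]
    if not valid:
        return None  # no valid horizon: caller should skip this context
    return 1 if max(valid) > 0 else -1
```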
## Make polarity quotas class-conditional

- For stronger classes, target explicit positive/negative ratios.
- For garbage classes, do not force positives.
- Keep class 0 mostly natural/negative.
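A class-conditional quota could be as simple as a per-class target positive fraction, with `None` meaning "take whatever occurs naturally". The specific ratios below are placeholders, not tuned values:

```python
# Hypothetical quota table: target positive-sample fraction per class.
# Class 0 (garbage) is left natural, so no positives are ever forced.
POLARITY_QUOTAS = {
    0: None,   # accept whatever polarity occurs naturally
    1: 0.5,    # aim for 50% positive contexts
    2: 0.6,    # aim for 60% positive contexts
}

def wants_positive(class_id, pos_so_far, total_so_far):
    """True if the next accepted sample for this class should be positive."""
    target = POLARITY_QUOTAS.get(class_id)
    if target is None:
        return False  # no quota: never force a polarity
    if total_so_far == 0:
        return target > 0
    return pos_so_far / total_so_far < target
```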
## Keep T_cutoff random during cache generation

- Do not freeze a single deterministic cutoff per token.
- Determinism belongs in the planning/budget logic, not in removing context diversity.

## Add exact acceptance accounting during the cache build

- Track how many samples have already been accepted per class.
- Track polarity counts per class.
- Stop accepting once quotas are filled.
- Avoid cache waste from duplicate low-value contexts.

## Add retry/attempt limits per token

- If a token cannot satisfy the desired quota type, stop oversampling it endlessly.
- Move on to other tokens instead of filling the disk with junk.
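Acceptance accounting and per-token attempt limits can live in one small tracker consulted before anything is written to disk. Class names, field names, and the default attempt cap are assumptions for illustration:

```python
from collections import Counter

class CacheAcceptor:
    """Track per-class and per-polarity acceptance against fixed quotas,
    with an attempt cap per token so no token is oversampled endlessly."""

    def __init__(self, class_quota, max_attempts_per_token=50):
        self.class_quota = class_quota       # class_id -> max accepted samples
        self.max_attempts = max_attempts_per_token
        self.accepted = Counter()            # class_id -> accepted count
        self.polarity = Counter()            # (class_id, polarity) -> count
        self.attempts = Counter()            # token -> attempts so far

    def try_accept(self, token, class_id, polarity):
        """Return (accepted, reason); reasons feed the audit report."""
        self.attempts[token] += 1
        if self.attempts[token] > self.max_attempts:
            return False, "token_attempt_limit"
        if self.accepted[class_id] >= self.class_quota.get(class_id, 0):
            return False, "class_quota_full"
        self.accepted[class_id] += 1
        self.polarity[(class_id, polarity)] += 1
        return True, "accepted"
```

Returning a reason string on every decision makes the later "acceptance/rejection reasons" audit report a simple `Counter` over those strings.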
## Keep label derivation in the data pipeline, not in training logic

- The loader should produce final labels and masks.
- The collator should only stack/batch them.
- The model should only consume them.

## Reduce or remove train-time class reweighting after the cache is fixed

- Revisit `WeightedRandomSampler`.
- Revisit `class_loss_weights`.
- If the cache is balanced upstream, training should not need heavy rescue weighting.

## Revisit the movement head only after the split and cache are fixed

- Keep it auxiliary.
- Do not let movement-label threshold debates block the more important data fixes.
- Later, simplify naming/threshold assumptions if needed.

## Add cache audit tooling

- Report counts by `class_id`.
- Report counts by class × polarity.
- Report unique tokens by class.
- Report acceptance/rejection reasons.
- Report a train/val token-overlap check.
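Most of those reports fall out of a single pass over the cached metadata. A sketch, assuming samples are dicts with the `source_token` and `class_id` fields from the task list plus a hypothetical `polarity` field:

```python
from collections import Counter

def cache_audit(samples):
    """Summarize a cached corpus: counts by class, by class x polarity,
    and unique tokens per class."""
    by_class = Counter(s["class_id"] for s in samples)
    by_class_polarity = Counter((s["class_id"], s.get("polarity")) for s in samples)
    tokens_by_class = {}
    for s in samples:
        tokens_by_class.setdefault(s["class_id"], set()).add(s["source_token"])
    return {
        "by_class": dict(by_class),
        "by_class_polarity": dict(by_class_polarity),
        "unique_tokens_by_class": {c: len(t) for c, t in tokens_by_class.items()},
    }
```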
## Add validation integrity checks

- Assert zero token overlap between train and val.
- Print per-class token counts, not just sample counts.
- Print per-class sample counts too.
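These checks can run right after the split is built, before any training starts. A sketch using the same sample-dict shape assumed above:

```python
def assert_split_integrity(train, val):
    """Fail fast on token leakage; print per-class token and sample counts."""
    train_tokens = {s["source_token"] for s in train}
    val_tokens = {s["source_token"] for s in val}
    overlap = train_tokens & val_tokens
    assert not overlap, f"token leakage between train and val: {sorted(overlap)}"

    for name, split in (("train", train), ("val", val)):
        tokens, counts = {}, {}
        for s in split:
            c = s["class_id"]
            counts[c] = counts.get(c, 0) + 1
            tokens.setdefault(c, set()).add(s["source_token"])
        for c in sorted(counts):
            print(f"{name} class {c}: {len(tokens[c])} tokens, {counts[c]} samples")
```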
## Rebuild the cache after the new policy is implemented

- The old cache is shaped by the wrong distribution.
- The old validation split is not trustworthy.
- New training should start from the rebuilt corpus.

## Retrain and re-baseline from scratch

- New split.
- New cache.
- Minimal train-time rescue weighting.
- Recompare backbone behavior only after that.

## Recommended implementation order

1. Token-grouped validation split
2. Validation overlap checks
3. Cache metadata cleanup
4. Exact class quotas in cache generation
5. Class-conditional polarity quotas
6. Cache audit reports
7. Remove/reduce train-time weighting
8. Rebuild cache
9. Retrain
10. Reassess movement head