Fix OOM: prev LoRA on CPU + no_grad entire prev contribution (no saved tensors for backward) 3aef764 natmin322 committed on Mar 13
Fix OOM: agg_lora_states accumulate weighted sum incrementally instead of O(N) cat 62e82d3 natmin322 committed on Mar 13
Fix deprecation warnings: torch.load weights_only, cupy fromDlpack, as_target_tokenizer 3c3ef28 natmin322 committed on Mar 13
fix: re-initialize trans_input/prompt_key after from_pretrained, add epsilon to norm, use_reentrant=False d6a9f4f natmin322 committed on Mar 13
fix: pass attention_mask directly to model.generate(), not via GenerationConfig 915a112 natmin322 committed on Mar 12
fix: override _save to disable safetensors for T5 shared embedding weights bb4c9d9 natmin322 committed on Mar 12
fix: denumpify_detensorize moved to trainer_utils in transformers 4.40+ a57a027 natmin322 committed on Mar 12
fix: add explicit trainer_pt_utils imports (nested_truncate etc.) removed from trainer.* in 4.40+ e4e078c natmin322 committed on Mar 12
fix: comprehensive transformers 4.40 compat across all trainer files 2e720a9 natmin322 committed on Mar 12
fix: synced_gpus must be passed to generate(), not GenerationConfig a92aa8a natmin322 committed on Mar 12
fix: restore _set_gradient_checkpointing + enable_input_require_grads for gradient checkpointing 2b87f4b natmin322 committed on Mar 12
fix: use default use_reentrant=True for gradient checkpointing (model expects it) 008c76c natmin322 committed on Mar 12
fix: add Optional/Any import to cl_collator, ipdb try/except in score.py 467dac1 natmin322 committed on Mar 12
fix: add FP16 safety check, dataset script check, LoRA sanity check dfdd675 natmin322 committed on Mar 12
fix: LoRA reinit, gradient checkpointing use_reentrant=False, checkpoint existence check, collator padding alignment 5299479 natmin322 committed on Mar 12
fix: _maybe_log_save_evaluate add grad_norm param for transformers 4.40.2 67281b3 natmin322 committed on Mar 10
fix: gradient checkpointing compat for transformers 4.40.2 (OOM fix) 7bd8512 natmin322 committed on Mar 10
fix: robust SVD fallback with direct regularization instead of pinv f9548d4 natmin322 committed on Mar 10
fix: replace self.use_apex and self.fsdp with getattr for compat 3f132e5 natmin322 committed on Mar 10
fix: add explicit imports for DistributedSampler, DataLoader, IterableDatasetShard, ipdb fd8e73c natmin322 committed on Mar 10
revert: discard all 2-GPU DataParallel/DDP changes, back to f51f791 8644d30 natmin322 committed on Mar 10
fix: robust SVD with try-except+pinv fallback + trans_input.pt exists guard ff3da70 natmin322 committed on Mar 9
fix: SVD jitter + trans_input.pt exists guard in both root and improve c1277de natmin322 committed on Mar 9
fix: port all compat fixes from improve to root trainer (ShardedDDPOption, ipdb, nested_truncate, gradient_checkpointing use_reentrant) 327a1fc natmin322 committed on Mar 9
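
Note on 62e82d3 (incremental weighted sum instead of O(N) cat): the idea named in that commit message can be sketched roughly as below. This is a framework-free illustration under stated assumptions, not the repository's code. The function name `accumulate_weighted_states`, the `State` alias, and the plain-list representation of each parameter are all hypothetical; the real `agg_lora_states` presumably works on tensors. The point is only that one running sum is kept in memory while states are consumed one at a time, rather than collecting all N states before reducing them.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical stand-in for a LoRA state dict: parameter name -> flat list of values.
State = Dict[str, List[float]]


def accumulate_weighted_states(weighted_states: Iterable[Tuple[float, State]]) -> State:
    """Return sum_i(weight_i * state_i) while holding only one running sum."""
    total: State = {}
    for weight, state in weighted_states:
        for name, values in state.items():
            if name not in total:
                # First time this parameter is seen: start its running sum.
                total[name] = [weight * v for v in values]
                continue
            acc = total[name]
            if len(acc) != len(values):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            # Add the weighted contribution in place; no per-state copy is retained.
            for i, v in enumerate(values):
                acc[i] += weight * v
    return total


def demo_states():
    # A generator, so each state can be discarded as soon as it has been added.
    yield 0.25, {"lora_A": [1.0, 2.0], "lora_B": [4.0]}
    yield 0.75, {"lora_A": [3.0, 2.0], "lora_B": [0.0]}


result = accumulate_weighted_states(demo_states())
```

Because the input is consumed as an iterator, peak memory in this sketch is one accumulator plus the state currently being added, independent of how many states are aggregated.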