Monkey-patch transformers to disable flash attention via wrapper script 2900b36 aeb56 committed 29 days ago
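A minimal sketch of what such a wrapper script could look like; the `app` entry-point name and the exact patch targets are assumptions, not taken from the repo:

```python
# wrapper.py -- hypothetical wrapper: patch the availability check before the app loads any model
import transformers.utils

# Make transformers believe flash-attn is not installed, so it never selects the flash path.
transformers.utils.is_flash_attn_2_available = lambda: False

import transformers.modeling_utils as modeling_utils
if hasattr(modeling_utils, "is_flash_attn_2_available"):
    modeling_utils.is_flash_attn_2_available = lambda: False

import app  # assumed Space entry point; imported only after the patch is in place
```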
Work around flash-attn: create fake module with PyTorch fallback attention b705945 aeb56 committed 29 days ago
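One way to fake the module, sketched under the assumption that the remote model code only calls `flash_attn_func`; the fallback transposes between flash-attn's (batch, seqlen, heads, dim) layout and the (batch, heads, seqlen, dim) layout PyTorch SDPA expects:

```python
# fake_flash_attn.py -- hypothetical shim registered in sys.modules before the model code runs
import sys
import types

import torch.nn.functional as F


def _sdpa_fallback(q, k, v, dropout_p=0.0, softmax_scale=None, causal=False, **kwargs):
    # flash_attn_func takes (batch, seqlen, nheads, headdim); SDPA wants heads before seqlen.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(
        q, k, v, dropout_p=dropout_p, is_causal=causal, scale=softmax_scale
    )
    return out.transpose(1, 2)


fake = types.ModuleType("flash_attn")
fake.flash_attn_func = _sdpa_fallback
fake.__version__ = "2.5.0"  # assumed: some model code checks for a version string
sys.modules["flash_attn"] = fake
```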
Add live status table and improved logging with attn_implementation=eager fix 0b25a32 aeb56 committed 29 days ago
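The attention part of this fix presumably comes down to requesting eager attention at load time; a sketch, with the model id assumed:

```python
from transformers import AutoModelForCausalLM

# attn_implementation="eager" keeps transformers off the flash-attn code path entirely.
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-Linear-48B-A3B-Instruct",  # assumed model id
    attn_implementation="eager",
    trust_remote_code=True,
    torch_dtype="auto",
)
```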
Fix multi-GPU: use parallelize=True instead of device_map, update env var 96b6724 aeb56 committed 29 days ago
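In lm-eval's Hugging Face backend, multi-GPU sharding is requested through `model_args` rather than a device_map dict; a sketch, with the merged-model path assumed:

```python
import lm_eval

# parallelize=True lets lm-eval/accelerate shard the model across all visible GPUs.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./merged-model,parallelize=True,trust_remote_code=True",
    tasks=["arc_challenge"],
    batch_size=1,
)
```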
Aggressive memory cleanup: 5s wait, env vars, optional model loading 3fb1215 aeb56 committed 29 days ago
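A sketch of that kind of cleanup routine; the 5-second wait and the allocator env var are the knobs named in the commit, everything else is assumed:

```python
import gc
import os
import time

import torch

# Reduce fragmentation for the subsequent lm_eval load
# (takes full effect only if set before CUDA is initialised).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"


def free_vram(wait_s: float = 5.0) -> None:
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
    time.sleep(wait_s)  # give the driver a moment to actually return the memory
```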
Fix OOM: unload model before evaluation to free VRAM for lm_eval 74f609c aeb56 committed 30 days ago
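Conceptually the unload step just drops every live reference to the merged model before handing the GPUs to lm_eval; a sketch that assumes the app keeps the model in a shared state dict:

```python
import gc

import torch


def unload_model(state: dict) -> None:
    """Drop the app's only reference to the loaded model so its VRAM can be reclaimed."""
    state.pop("model", None)   # assumed: the app stores the model under state["model"]
    gc.collect()               # break reference cycles so CUDA tensors are actually released
    torch.cuda.empty_cache()   # return cached blocks to the driver for lm_eval to use
```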
Add Evaluation tab with ARC-Challenge, TruthfulQA, and Winogrande benchmarks 29f5263 aeb56 committed on Nov 10
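The three benchmarks map onto standard lm-evaluation-harness task names; a sketch of running them and reading back the scores, with the model path assumed:

```python
import lm_eval

# Harness task names for the three benchmarks shown in the Evaluation tab.
TASKS = ["arc_challenge", "truthfulqa_mc2", "winogrande"]

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./merged-model,trust_remote_code=True",  # path assumed
    tasks=TASKS,
)
for task, metrics in results["results"].items():
    print(task, {k: v for k, v in metrics.items() if isinstance(v, float)})
```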
Fix flash attention error by patching model config to use eager attention 2f60fd7 aeb56 committed on Nov 10
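Patching the config, rather than passing a load-time kwarg, might look like the following; `_attn_implementation` is the private attribute transformers consults, so treat this as an assumption about internals:

```python
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_ID = "moonshotai/Kimi-Linear-48B-A3B-Instruct"  # assumed

config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
config._attn_implementation = "eager"  # internal attribute; the public route is attn_implementation="eager"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    config=config,
    trust_remote_code=True,
    torch_dtype="auto",
)
```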
Switch to transformers inference (vLLM doesn't support the KimiLinear architecture) 9905f0a aeb56 committed on Nov 10
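Plain transformers generation as the fallback inference path; a minimal sketch, model id assumed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "moonshotai/Kimi-Linear-48B-A3B-Instruct"  # assumed

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="eager",
)

inputs = tokenizer("Hello, Kimi!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```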
Improve vLLM startup with tensor parallelism, better logging, and a 10-minute timeout a82de92 aeb56 committed on Nov 10
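This was the vLLM launch path before the later switch to transformers. A sketch of starting the OpenAI-compatible server as a subprocess and polling its /health endpoint; the model path and port are assumptions:

```python
import subprocess
import time

import requests

proc = subprocess.Popen(
    [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "./merged-model",       # assumed local path
        "--tensor-parallel-size", "4",     # shard across the 4 GPUs
        "--port", "8000",
    ],
)

deadline = time.time() + 600  # 10-minute timeout
while time.time() < deadline:
    try:
        if requests.get("http://localhost:8000/health", timeout=2).ok:
            print("vLLM server is up")
            break
    except requests.ConnectionError:
        pass
    time.sleep(5)
else:
    proc.terminate()
    raise TimeoutError("vLLM did not become healthy within 10 minutes")
```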
Use sequential device_map to fix key naming conflicts during LoRA merge d3d4339 aeb56 committed on Nov 10
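"sequential" is one of accelerate's built-in placement strategies (fill GPU 0, then GPU 1, and so on, instead of balancing); a sketch of loading the base model that way before attaching the adapter, with paths assumed:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-Linear-48B-A3B-Instruct",  # assumed base checkpoint
    device_map="sequential",                    # fill GPUs in order instead of the "auto" balancer
    trust_remote_code=True,
    torch_dtype="auto",
)
model = PeftModel.from_pretrained(base, "./lora-adapter")  # assumed adapter path
```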
Add safe_merge and better error handling for LoRA merge with MoE models 79334bc aeb56 committed on Nov 10
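peft's `merge_and_unload` accepts a `safe_merge` flag that validates the merged weights before discarding the adapter; a sketch with the output path assumed:

```python
from peft import PeftModel


def merge_adapter(model: PeftModel, out_dir: str = "./merged-model"):
    """Merge a LoRA adapter into its base model, checking the result before saving."""
    try:
        merged = model.merge_and_unload(safe_merge=True)  # raises if merged weights contain NaNs
        merged.save_pretrained(out_dir)
        return merged
    except Exception as exc:
        print(f"LoRA merge failed: {exc}")
        raise
```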
Add 8-bit quantization support and switch to L4x4 hardware for availability e32298d aeb56 committed on Nov 10
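8-bit loading via bitsandbytes roughly halves VRAM versus fp16, which helps the model fit on 4x L4 (96 GB total); a sketch with the model id assumed:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-Linear-48B-A3B-Instruct",  # assumed
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
```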