Fix: Remove height parameter for Gradio 4.19.2 compatibility 0cefed5 aeb56 committed on about 1 month ago
Add live status table and improved logging with attn_implementation=eager fix 0b25a32 aeb56 committed on about 1 month ago
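The `attn_implementation=eager` fix above maps to a keyword argument of `transformers`' `from_pretrained`. A minimal sketch, using a hypothetical `build_load_kwargs` helper (the model id and helper name are illustrative, not from the repo):

```python
def build_load_kwargs(use_flash: bool = False) -> dict:
    """Assemble kwargs for AutoModelForCausalLM.from_pretrained (sketch).

    The Space serves a model whose custom attention is incompatible with
    flash-attention kernels, so by default it requests the eager (pure
    PyTorch) attention path:

        AutoModelForCausalLM.from_pretrained(model_id, **build_load_kwargs())
    """
    kwargs = {"trust_remote_code": True}  # custom architectures need remote code
    kwargs["attn_implementation"] = "flash_attention_2" if use_flash else "eager"
    return kwargs
```

Passing `"eager"` at load time avoids the flash-attention import/kernels entirely, at some throughput cost.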
Fix multi-GPU: use parallelize=True instead of device_map, update env var 96b6724 aeb56 committed on about 1 month ago
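In lm-evaluation-harness, `parallelize=True` is passed inside the `--model_args` string and shards the model across visible GPUs via accelerate, instead of an explicit `device_map`. A sketch with a hypothetical `build_model_args` helper:

```python
def build_model_args(pretrained: str, num_gpus: int) -> str:
    """Build the comma-separated lm-eval --model_args string (sketch).

    With more than one GPU, request parallelize=True rather than pinning a
    device_map, e.g.:
        lm_eval --model hf --model_args "<this string>" --tasks ...
    """
    args = [f"pretrained={pretrained}", "trust_remote_code=True"]
    if num_gpus > 1:
        args.append("parallelize=True")  # shard across GPUs via accelerate
    return ",".join(args)
```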
Aggressive memory cleanup: 5s wait, env vars, optional model loading 3fb1215 aeb56 committed on Nov 12
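The "aggressive memory cleanup" pattern usually combines the garbage collector, the CUDA caching allocator, and a settle delay. A sketch, assuming a hypothetical `release_gpu_memory` helper and the standard `PYTORCH_CUDA_ALLOC_CONF` env var (the exact env vars the commit sets are not shown):

```python
import gc
import os
import time


def release_gpu_memory(wait_seconds: float = 5.0) -> None:
    """Free as much GPU memory as possible between pipeline stages (sketch)."""
    # Reduce fragmentation for subsequent large allocations.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    gc.collect()  # drop unreferenced Python-side tensors first
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()   # return cached blocks to the driver
            torch.cuda.ipc_collect()   # release inter-process handles
    except ImportError:
        pass  # torch absent; nothing GPU-side to clean
    time.sleep(wait_seconds)  # let the allocator/driver settle before reloading
```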
Add Evaluation tab with ARC-Challenge, TruthfulQA, and Winogrande benchmarks 29f5263 aeb56 committed on Nov 10
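Wiring a UI tab to those benchmarks amounts to mapping display names to lm-eval task ids. The task ids below are assumptions (check `lm-eval --tasks list` for your version), and `selected_tasks` is a hypothetical helper:

```python
# Display name -> assumed lm-eval task id
BENCHMARKS = {
    "ARC-Challenge": "arc_challenge",
    "TruthfulQA": "truthfulqa_mc2",
    "Winogrande": "winogrande",
}


def selected_tasks(names: list[str]) -> list[str]:
    """Resolve the benchmarks chosen in the tab to lm-eval task ids."""
    return [BENCHMARKS[name] for name in names]
```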
Fix flash attention error by patching model config to use eager attention 2f60fd7 aeb56 committed on Nov 10
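Unlike the load-time kwarg, this fix patches an already-built config object. The attribute name varies across transformers versions (and `_attn_implementation` is a private field), so a defensive sketch with a hypothetical `patch_attention_config` helper:

```python
def patch_attention_config(config) -> None:
    """Force eager attention on a model/config object in place (sketch).

    Covers both the private and public attribute spellings seen across
    transformers versions; silently does nothing if neither is present.
    """
    for attr in ("_attn_implementation", "attn_implementation"):
        if hasattr(config, attr):
            setattr(config, attr, "eager")
```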
Switch to transformers inference (vLLM doesn't support KimiLinear architecture) 9905f0a aeb56 committed on Nov 10
Improve vLLM startup with tensor parallelism, better logging, and 10min timeout a82de92 aeb56 committed on Nov 10
Use sequential device_map to fix key naming conflicts during LoRA merge d3d4339 aeb56 committed on Nov 10
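`device_map="sequential"` is a built-in placement strategy in transformers/accelerate: it fills GPUs in order rather than balancing layers, which keeps module placement deterministic before a merge. A sketch with a hypothetical `merge_load_kwargs` helper (dtype choice is an assumption):

```python
def merge_load_kwargs() -> dict:
    """Kwargs for loading the base model ahead of a LoRA merge (sketch).

    "sequential" fills GPU 0, then GPU 1, and so on, giving deterministic
    module placement and avoiding the key-naming conflicts observed with
    "auto" placement during the merge.
    """
    return {
        "device_map": "sequential",
        "torch_dtype": "auto",       # keep the checkpoint's native dtype
        "trust_remote_code": True,
    }
```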
Add safe_merge and better error handling for LoRA merge with MoE models 79334bc aeb56 committed on Nov 10
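`safe_merge=True` is a real parameter of peft's `merge_and_unload`: it validates the merged weights for NaNs/Infs, which is worth the extra pass on MoE models with many expert matrices. A sketch with a hypothetical `merge_lora` wrapper adding the error handling:

```python
def merge_lora(peft_model):
    """Merge LoRA adapters into the base weights and return the result (sketch).

    Expects a peft.PeftModel-like object. safe_merge=True makes peft check
    the merged tensors for NaN/Inf values instead of failing silently.
    """
    try:
        return peft_model.merge_and_unload(safe_merge=True)
    except (RuntimeError, ValueError) as err:
        # Surface which stage failed; callers can fall back to the unmerged model.
        raise RuntimeError(f"LoRA merge failed: {err}") from err
```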
Add 8-bit quantization support and switch to L4x4 hardware for availability e32298d aeb56 committed on Nov 10
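In transformers, 8-bit loading is requested via `BitsAndBytesConfig(load_in_8bit=True)` passed as `quantization_config`; 8-bit weights take roughly half the memory of fp16, which helps the model fit on 4x L4 GPUs. A pure-Python sketch with a hypothetical `quantization_kwargs` helper selecting the config flags:

```python
def quantization_kwargs(bits: int) -> dict:
    """Pick bitsandbytes flags for the requested precision (sketch).

    The returned dict mirrors BitsAndBytesConfig fields, e.g.:
        BitsAndBytesConfig(**quantization_kwargs(8))
    An empty dict means load at full/half precision with no quantization.
    """
    if bits == 8:
        return {"load_in_8bit": True}
    if bits == 4:
        return {"load_in_4bit": True}
    return {}
```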