fnmodel / app.py

Commit History

Aggressive memory cleanup: 5s wait, env vars, optional model loading
3fb1215 · aeb56 committed

Fix OOM: Unload model before evaluation to free VRAM for lm_eval
74f609c · aeb56 committed
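
The two commits above describe one pattern: drop every Python reference to the model and force allocator cleanup so lm_eval can claim the VRAM. A minimal sketch of that pattern; the helper name, the env-var value, and the lazy-loading detail are assumptions, not the Space's actual code:

```python
import gc
import os
import time

import torch

# Allocator tuning of the kind the commit mentions; must be set before CUDA
# is first initialized to take effect (the value here is an assumption).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

MODEL = None  # loaded lazily ("optional model loading"); stays None in eval-only mode

def unload_model():
    """Release the global model and reclaim VRAM before lm_eval runs."""
    global MODEL
    MODEL = None              # drop the last reference so gc can free the weights
    gc.collect()
    torch.cuda.empty_cache()  # hand cached VRAM back to the driver
    time.sleep(5)             # the commit's 5-second settle time
```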

Disable chat/inference, focus on evaluation only
69cd0c5 · aeb56 committed

Add Evaluation tab with ARC-Challenge, TruthfulQA, and Winogrande benchmarks
29f5263 · aeb56 committed
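
The three benchmarks above map onto lm-evaluation-harness task names (arc_challenge, truthfulqa_mc2, winogrande in the v0.4 registry). A sketch of the call the Evaluation tab presumably makes; the model path, dtype, and batch size are placeholders:

```python
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=./merged-model,dtype=bfloat16",
    tasks=["arc_challenge", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```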

Fix flash attention error by patching model config to use eager attention
2f60fd7 · aeb56 committed

Fix flash attention error by using eager attention implementation
74fe23d · aeb56 committed
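
Both attention fixes above target the same failure: flash-attention kernels the custom architecture cannot satisfy. Transformers exposes this as the attn_implementation kwarg on from_pretrained; the later commit patches the config object directly, presumably because the remote-code model ignores the kwarg. A sketch, with the model repo as a placeholder:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# The supported route is from_pretrained(..., attn_implementation="eager");
# patching the config covers custom code paths that read it directly.
config = AutoConfig.from_pretrained("<model-repo>", trust_remote_code=True)
config._attn_implementation = "eager"   # force eager attention, bypassing flash-attn

model = AutoModelForCausalLM.from_pretrained(
    "<model-repo>",
    config=config,
    torch_dtype="auto",
    trust_remote_code=True,
)
```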

Switch to transformers inference (vLLM doesn't support KimiLinear architecture)
9905f0a · aeb56 committed

Improve vLLM startup with tensor parallelism, better logging, and 10min timeout
a82de92 · aeb56 committed
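
A sketch of the startup logic this commit describes: launch vLLM's OpenAI-compatible server as a subprocess with tensor parallelism across the four GPUs, then poll its health endpoint for up to ten minutes. The port, model path, and poll interval are assumptions:

```python
import subprocess
import time

import requests

proc = subprocess.Popen(
    ["python3", "-m", "vllm.entrypoints.openai.api_server",
     "--model", "./merged-model",
     "--tensor-parallel-size", "4",   # shard the model across 4 GPUs
     "--port", "8000"],
)

deadline = time.time() + 600          # the commit's 10-minute timeout
while time.time() < deadline:
    try:
        if requests.get("http://127.0.0.1:8000/health", timeout=2).ok:
            print("vLLM server ready")
            break
    except requests.RequestException:
        pass                          # server not accepting connections yet
    time.sleep(5)
else:
    proc.terminate()
    raise RuntimeError("vLLM did not become healthy within 10 minutes")
```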

Fix vLLM server start command to use python3 instead of python
75c2813 · aeb56 committed

Remove emoji avatars incompatible with Gradio 4.19.2
5f01a47 · aeb56 committed

Fix Gradio version compatibility and enable share mode
d073f8b · aeb56 committed

Switch to vLLM for high-performance, stable inference
310eb95 · aeb56 committed

Fix variable scope error causing Internal Server Error
e62c736 · aeb56 committed

Transform Space into professional inference UI for fine-tuned model
5e458c4 · aeb56 committed

Implement manual LoRA merging to fix PEFT key naming conflicts
3a259bc · aeb56 committed
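
Merging a LoRA adapter manually means folding each adapter pair into its base weight as W += (alpha/r) * B @ A, matching keys by hand instead of relying on PEFT's "base_model.model." prefix mapping. A sketch under that assumption, given an already-loaded base `model` and a standard PEFT checkpoint layout:

```python
import json

import torch
from safetensors.torch import load_file

cfg = json.load(open("adapter/adapter_config.json"))
weights = load_file("adapter/adapter_model.safetensors")
scaling = cfg["lora_alpha"] / cfg["r"]

for name, module in model.named_modules():
    a = weights.get(f"base_model.model.{name}.lora_A.weight")
    b = weights.get(f"base_model.model.{name}.lora_B.weight")
    if a is None or b is None:
        continue
    # W <- W + (alpha/r) * B @ A, computed in fp32 to limit rounding error
    delta = (b.float() @ a.float()) * scaling
    module.weight.data += delta.to(module.weight.device, module.weight.dtype)
```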

Use sequential device_map to fix key naming conflicts during LoRA merge
d3d4339 · aeb56 committed

Add safe_merge and better error handling for LoRA merge with MoE models
79334bc · aeb56 committed
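
PEFT's merge_and_unload does take a safe_merge flag that validates the merged weights for NaNs before committing them, which matters for fragile MoE merges. A sketch of the guarded merge this commit describes; the fallback behavior is an assumption:

```python
from peft import PeftModel

# base_model is assumed to be already loaded.
peft_model = PeftModel.from_pretrained(base_model, "adapter/", is_trainable=False)
try:
    merged = peft_model.merge_and_unload(safe_merge=True)  # errors on NaN weights
except (ValueError, RuntimeError) as err:
    print(f"Safe merge failed, falling back to manual merge: {err}")
    merged = None
```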

Fix 8-bit quantization CPU offload for large models
1a04e17 · aeb56 committed

Add 8-bit quantization support and switch to L4x4 hardware for availability
e32298d · aeb56 committed
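
The two commits above combine bitsandbytes 8-bit loading with CPU offload for the layers that don't fit in VRAM. Both flags shown here are real BitsAndBytesConfig options; the model repo is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # the CPU-offload fix: offloaded layers stay fp32 on CPU
)

model = AutoModelForCausalLM.from_pretrained(
    "<model-repo>",
    quantization_config=bnb,
    device_map="auto",                      # let accelerate spill overflow onto CPU
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
```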

Optimize app.py for 48B model on 4xL40S GPUs with multi-GPU support
b51ac87 · aeb56 committed
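
For a 48B model on four L40S cards (48 GB each), the usual knobs are device_map="auto" plus an explicit per-GPU max_memory budget so activations and KV cache have headroom. A sketch under those assumptions; the budgets and model repo are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM

max_memory = {i: "44GiB" for i in range(4)}  # cap weights below 48 GB per card
max_memory["cpu"] = "100GiB"                 # allow overflow to system RAM

model = AutoModelForCausalLM.from_pretrained(
    "<model-repo>",
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```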

Initial commit: LoRA model merger
a951334 · aeb56 committed