Monkey-patch transformers to disable flash attention via wrapper script 2900b36 aeb56 committed 29 days ago
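A minimal sketch of what such a wrapper script could look like; the `app` entry-point name and the exact patch targets are assumptions, not taken from the repo:

```python
# wrapper.py -- hypothetical wrapper: patch the availability check before the app loads any model
import transformers.utils

# Make transformers believe flash-attn is not installed, so it never selects the flash path.
transformers.utils.is_flash_attn_2_available = lambda: False

import transformers.modeling_utils as modeling_utils
if hasattr(modeling_utils, "is_flash_attn_2_available"):
    modeling_utils.is_flash_attn_2_available = lambda: False

import app  # assumed Space entry point; imported only after the patch is in place
```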
Work around flash-attn: create fake module with PyTorch fallback attention b705945 aeb56 committed 29 days ago
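One way to fake the module, sketched under the assumption that the remote model code only calls `flash_attn_func`; the fallback transposes between flash-attn's (batch, seqlen, heads, dim) layout and the (batch, heads, seqlen, dim) layout PyTorch SDPA expects:

```python
# fake_flash_attn.py -- hypothetical shim registered in sys.modules before the model code runs
import sys
import types

import torch.nn.functional as F


def _sdpa_fallback(q, k, v, dropout_p=0.0, softmax_scale=None, causal=False, **kwargs):
    # flash_attn_func takes (batch, seqlen, nheads, headdim); SDPA wants heads before seqlen.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(
        q, k, v, dropout_p=dropout_p, is_causal=causal, scale=softmax_scale
    )
    return out.transpose(1, 2)


fake = types.ModuleType("flash_attn")
fake.flash_attn_func = _sdpa_fallback
fake.__version__ = "2.5.0"  # assumed: some model code checks for a version string
sys.modules["flash_attn"] = fake
```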
Add live status table and improved logging with attn_implementation=eager fix 0b25a32 aeb56 committed 29 days ago
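The attention part of this fix presumably comes down to requesting eager attention at load time; a sketch, with the model id assumed:

```python
from transformers import AutoModelForCausalLM

# attn_implementation="eager" keeps transformers off the flash-attn code path entirely.
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-Linear-48B-A3B-Instruct",  # assumed model id
    attn_implementation="eager",
    trust_remote_code=True,
    torch_dtype="auto",
)
```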
Fix multi-GPU: use parallelize=True instead of device_map, update env var 96b6724 aeb56 committed 29 days ago
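In lm-eval's Hugging Face backend, multi-GPU sharding is requested through `model_args` rather than a device_map dict; a sketch, with the merged-model path assumed:

```python
import lm_eval

# parallelize=True lets lm-eval/accelerate shard the model across all visible GPUs.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./merged-model,parallelize=True,trust_remote_code=True",
    tasks=["arc_challenge"],
    batch_size=1,
)
```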
Aggressive memory cleanup: 5s wait, env vars, optional model loading 3fb1215 aeb56 committed 29 days ago
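A sketch of that kind of cleanup routine; the 5-second wait and the allocator env var are the knobs named in the commit, everything else is assumed:

```python
import gc
import os
import time

import torch

# Reduce fragmentation for the subsequent lm_eval load
# (takes full effect only if set before CUDA is initialised).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"


def free_vram(wait_s: float = 5.0) -> None:
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
    time.sleep(wait_s)  # give the driver a moment to actually return the memory
```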
Fix OOM: unload model before evaluation to free VRAM for lm_eval 74f609c aeb56 committed 30 days ago
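Conceptually the unload step just drops every live reference to the merged model before handing the GPUs to lm_eval; a sketch that assumes the app keeps the model in a shared state dict:

```python
import gc

import torch


def unload_model(state: dict) -> None:
    """Drop the app's only reference to the loaded model so its VRAM can be reclaimed."""
    state.pop("model", None)   # assumed: the app stores the model under state["model"]
    gc.collect()               # break reference cycles so CUDA tensors are actually released
    torch.cuda.empty_cache()   # return cached blocks to the driver for lm_eval to use
```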
Add Evaluation tab with ARC-Challenge, TruthfulQA, and Winogrande benchmarks 29f5263 aeb56 committed on Nov 10
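The three benchmarks map onto standard lm-evaluation-harness task names; a sketch of running them and reading back the scores, with the model path assumed:

```python
import lm_eval

# Harness task names for the three benchmarks shown in the Evaluation tab.
TASKS = ["arc_challenge", "truthfulqa_mc2", "winogrande"]

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./merged-model,trust_remote_code=True",  # path assumed
    tasks=TASKS,
)
for task, metrics in results["results"].items():
    print(task, {k: v for k, v in metrics.items() if isinstance(v, float)})
```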
Fix flash attention error by patching model config to use eager attention 2f60fd7 aeb56 committed on Nov 10
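Patching the config, rather than passing a load-time kwarg, might look like the following; `_attn_implementation` is the private attribute transformers consults, so treat this as an assumption about internals:

```python
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_ID = "moonshotai/Kimi-Linear-48B-A3B-Instruct"  # assumed

config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
config._attn_implementation = "eager"  # internal attribute; the public route is attn_implementation="eager"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    config=config,
    trust_remote_code=True,
    torch_dtype="auto",
)
```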
Switch to transformers inference (vLLM doesn't support the KimiLinear architecture) 9905f0a aeb56 committed on Nov 10
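Plain transformers generation as the fallback inference path; a minimal sketch, model id assumed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "moonshotai/Kimi-Linear-48B-A3B-Instruct"  # assumed

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="eager",
)

inputs = tokenizer("Hello, Kimi!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```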
Improve vLLM startup with tensor parallelism, better logging, and a 10-minute timeout a82de92 aeb56 committed on Nov 10
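This was the vLLM launch path before the later switch to transformers. A sketch of starting the OpenAI-compatible server as a subprocess and polling its /health endpoint; the model path and port are assumptions:

```python
import subprocess
import time

import requests

proc = subprocess.Popen(
    [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "./merged-model",       # assumed local path
        "--tensor-parallel-size", "4",     # shard across the 4 GPUs
        "--port", "8000",
    ],
)

deadline = time.time() + 600  # 10-minute timeout
while time.time() < deadline:
    try:
        if requests.get("http://localhost:8000/health", timeout=2).ok:
            print("vLLM server is up")
            break
    except requests.ConnectionError:
        pass
    time.sleep(5)
else:
    proc.terminate()
    raise TimeoutError("vLLM did not become healthy within 10 minutes")
```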
Use sequential device_map to fix key naming conflicts during LoRA merge d3d4339 aeb56 committed on Nov 10
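"sequential" is one of accelerate's built-in placement strategies (fill GPU 0, then GPU 1, and so on, instead of balancing); a sketch of loading the base model that way before attaching the adapter, with paths assumed:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-Linear-48B-A3B-Instruct",  # assumed base checkpoint
    device_map="sequential",                    # fill GPUs in order instead of the "auto" balancer
    trust_remote_code=True,
    torch_dtype="auto",
)
model = PeftModel.from_pretrained(base, "./lora-adapter")  # assumed adapter path
```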
Add safe_merge and better error handling for LoRA merge with MoE models 79334bc aeb56 committed on Nov 10
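peft's `merge_and_unload` accepts a `safe_merge` flag that validates the merged weights before discarding the adapter; a sketch with the output path assumed:

```python
from peft import PeftModel


def merge_adapter(model: PeftModel, out_dir: str = "./merged-model"):
    """Merge a LoRA adapter into its base model, checking the result before saving."""
    try:
        merged = model.merge_and_unload(safe_merge=True)  # raises if merged weights contain NaNs
        merged.save_pretrained(out_dir)
        return merged
    except Exception as exc:
        print(f"LoRA merge failed: {exc}")
        raise
```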
Add 8-bit quantization support and switch to L4x4 hardware for availability e32298d aeb56 committed on Nov 10
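8-bit loading via bitsandbytes roughly halves VRAM versus fp16, which helps the model fit on 4x L4 (96 GB total); a sketch with the model id assumed:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-Linear-48B-A3B-Instruct",  # assumed
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
```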