aeb56 committed · Commit 74f609c · 1 Parent(s): 69cd0c5
Fix OOM: Unload model before evaluation to free VRAM for lm_eval
README.md CHANGED

@@ -56,6 +56,8 @@ Model evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. **
 - Go to the "📊 Evaluation" tab
 - Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
 - Click "🚀 Start Evaluation"
+- The model will be automatically unloaded to free VRAM
+- lm_eval will load its own instance for evaluation
 - Wait 30-60 minutes for results
 - Results will be displayed and saved to `/tmp/eval_results_[timestamp]/`
 
@@ -78,6 +80,15 @@ The LM Evaluation Harness is a standard framework for evaluating language models
 - **Minimum:** 4x NVIDIA L4 (96GB VRAM)
 - **Model Size:** ~96GB in bfloat16
 
+### Memory Management
+
+This Space is optimized for limited VRAM:
+- **Pre-loading:** Optional model loading to verify setup
+- **Automatic Cleanup:** Model is unloaded before evaluation starts
+- **Single Instance:** Only lm_eval's model instance runs during evaluation
+- **Batch Size:** Set to 1 to minimize memory usage
+- **Device Mapping:** Automatic distribution across available GPUs
+
 ## Technical Details
 
 ### Fine-tuning Configuration
app.py CHANGED

@@ -139,15 +139,37 @@ class ChatBot:
         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
         output_dir = f"/tmp/eval_results_{timestamp}"
 
-        yield f"📊 **
+        yield f"📊 **Preparing for evaluation...**\n\nTasks: {', '.join(tasks_to_run)}\n\n"
 
-        #
+        # IMPORTANT: Unload the model from memory to free VRAM for lm_eval
+        yield f"🔄 **Unloading model to free VRAM...**\n\nThis is necessary because lm_eval will load its own instance.\n\n"
+
+        if self.model is not None:
+            del self.model
+            self.model = None
+        if self.tokenizer is not None:
+            del self.tokenizer
+            self.tokenizer = None
+
+        # Clear CUDA cache
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+            torch.cuda.synchronize()
+
+        import gc
+        gc.collect()
+
+        self.loaded = False
+
+        yield f"✅ **Memory cleared! Starting evaluation...**\n\nThis will take 30-60 minutes total.\n\n"
+
+        # Run lm_eval with optimized memory settings
         cmd = [
             "lm_eval",
             "--model", "hf",
-            "--model_args", f"pretrained={MODEL_NAME},trust_remote_code=True,dtype=bfloat16",
+            "--model_args", f"pretrained={MODEL_NAME},trust_remote_code=True,dtype=bfloat16,device_map=auto,low_cpu_mem_usage=True",
             "--tasks", task_string,
-            "--batch_size", "
+            "--batch_size", "1",  # Reduced to minimize memory usage
             "--output_path", output_dir,
             "--log_samples"
         ]

@@ -245,10 +267,12 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Evaluation")
 
         gr.Markdown("""
         ### ℹ️ Instructions
-        1. **Click "Load Model"** - Takes 5-10 minutes
+        1. **Click "Load Model"** - Takes 5-10 minutes (verifies setup)
         2. **Use Evaluation tab** - To run benchmarks
 
-        **Note:**
+        **Note:**
+        - Chat/inference functionality is currently disabled. This Space focuses on model evaluation only.
+        - The model will be automatically unloaded before evaluation starts to free VRAM for lm_eval.
         """)
 
         # Tab 2: Chat - DISABLED

@@ -314,7 +338,11 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Evaluation")
 
     gr.Markdown("""
     ---
-    **Note:**
+    **Note:**
+    - Click "Load Model" in Controls tab first to verify the setup
+    - The model will be automatically unloaded before evaluation to free VRAM
+    - lm_eval will load its own instance of the model for evaluation
+    - Results will be saved to `/tmp/eval_results_[timestamp]/`
     """)
 
     gr.Markdown("""
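For reference, the `cmd` list assembled in the diff would typically be launched as a subprocess. A minimal sketch under assumptions: `MODEL_NAME`, the task identifiers, and the `run_eval`/`dry_run` helper are illustrative placeholders, not values taken from this repository.

```python
import subprocess

MODEL_NAME = "your-org/kimi-48b-finetuned"  # placeholder, not the Space's real repo id
task_string = "arc_challenge,truthfulqa_mc2,winogrande"  # assumed lm_eval task ids
output_dir = "/tmp/eval_results_example"

# Same argv shape as the commit's cmd list
cmd = [
    "lm_eval",
    "--model", "hf",
    "--model_args",
    f"pretrained={MODEL_NAME},trust_remote_code=True,dtype=bfloat16,"
    "device_map=auto,low_cpu_mem_usage=True",
    "--tasks", task_string,
    "--batch_size", "1",        # batch size 1 keeps peak VRAM low
    "--output_path", output_dir,
    "--log_samples",
]

def run_eval(dry_run: bool = True) -> list:
    """Return the argv; actually spawn lm_eval only when dry_run is False."""
    if not dry_run:
        subprocess.run(cmd, check=True)  # blocks for the whole 30-60 minute run
    return cmd
```

With `dry_run=True` this only builds and returns the argv, which is handy for verifying the `--model_args` string before committing a multi-GPU machine to a 30-60 minute run.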