aeb56 committed
Commit 3fb1215 · Parent(s): 74f609c

Aggressive memory cleanup: 5s wait, env vars, optional model loading

Files changed (2)
  1. README.md +26 -23
  2. app.py +41 -25
README.md CHANGED
@@ -46,25 +46,26 @@ Model evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. **
 
 ### Quick Start
 
-1. **Load Model**
-   - Click "πŸš€ Load Model" button in the Controls tab
-   - Wait 5-10 minutes for model initialization
-   - Model will be distributed across available GPUs
-   - Look for "βœ… Model loaded successfully"
-
-2. **Run Evaluation**
-   - Go to the "πŸ“Š Evaluation" tab
-   - Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
-   - Click "πŸš€ Start Evaluation"
-   - The model will be automatically unloaded to free VRAM
-   - lm_eval will load its own instance for evaluation
-   - Wait 30-60 minutes for results
-   - Results will be displayed and saved to `/tmp/eval_results_[timestamp]/`
-
-3. **View Results**
-   - Evaluation results include metrics for each benchmark
-   - Results are automatically formatted and displayed
-   - Full results JSON files are saved for detailed analysis
+**Option 1: Direct Evaluation (Recommended)**
+1. Go directly to the "πŸ“Š Evaluation" tab
+2. Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
+3. Click "πŸš€ Start Evaluation"
+4. lm_eval will automatically load and evaluate the model
+5. Wait 30-60 minutes for results
+6. Results will be displayed and saved to `/tmp/eval_results_[timestamp]/`
+
+**Option 2: With Model Verification**
+1. **(Optional)** Click "πŸš€ Load Model" in the Controls tab to verify setup (5-10 min)
+2. Go to the "πŸ“Š Evaluation" tab
+3. Select benchmarks and click "πŸš€ Start Evaluation"
+4. The pre-loaded model will be automatically unloaded to free VRAM
+5. lm_eval will load its own fresh instance for evaluation
+6. Wait 30-60 minutes for results
+
+**View Results**
+- Evaluation results include metrics for each benchmark
+- Results are automatically formatted and displayed
+- Full results JSON files are saved for detailed analysis
 
 ## Why LM Evaluation Harness?
 
@@ -82,12 +83,14 @@ The LM Evaluation Harness is a standard framework for evaluating language models
 
 ### Memory Management
 
-This Space is optimized for limited VRAM:
-- **Pre-loading:** Optional model loading to verify setup
-- **Automatic Cleanup:** Model is unloaded before evaluation starts
+This Space is optimized for limited VRAM (92GB across 4x L4):
+- **Direct Evaluation:** Skip model pre-loading and go straight to evaluation (recommended)
+- **Automatic Cleanup:** Any pre-loaded model is unloaded before evaluation starts
+- **Aggressive Memory Clearing:** Multiple garbage collection passes + 5s wait time
 - **Single Instance:** Only lm_eval's model instance runs during evaluation
-- **Batch Size:** Set to 1 to minimize memory usage
+- **Batch Size:** Set to 1 to minimize memory usage during evaluation
 - **Device Mapping:** Automatic distribution across available GPUs
+- **Memory Fragmentation:** PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True set by default
 
 ## Technical Details
 
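The Memory Management notes above hinge on two things: setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` early (the CUDA caching allocator reads it when first used) and sharding the model across the available GPUs. A minimal sketch of what the optional "Load Model" path could look like, not taken from this commit; the dtype, `trust_remote_code`, and the `load_model` helper name are assumptions, since the actual loading code is outside the hunks shown here:

```python
import os

# The CUDA caching allocator reads PYTORCH_CUDA_ALLOC_CONF when it is first used,
# so set it before anything allocates GPU memory.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.environ["FLA_USE_TRITON"] = "1"  # flash-linear-attention Triton backend

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"

def load_model():
    """Illustrative loader: shards the 48B model across all visible GPUs."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.bfloat16,   # assumption: bf16 to fit the ~92GB total VRAM budget
        device_map="auto",            # automatic distribution across available GPUs
        trust_remote_code=True,       # assumption: custom Kimi-Linear architecture code
    )
    return tokenizer, model
```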
app.py CHANGED
@@ -5,9 +5,11 @@ import os
 import subprocess
 import json
 from datetime import datetime
+import time
 
-# Set environment variable for flash-linear-attention
+# Set environment variables for flash-linear-attention and memory management
 os.environ["FLA_USE_TRITON"] = "1"
+os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
 
 # Model configuration
 MODEL_NAME = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"
@@ -121,9 +123,8 @@ class ChatBot:
 
     def run_evaluation(self, tasks_to_run):
         """Run lm_eval on selected tasks"""
-        if not self.loaded:
-            yield "❌ Please load the model first!"
-            return
+        # Note: We don't strictly require the model to be loaded first
+        # since we'll be unloading it anyway. The load step is just for verification.
 
         try:
            # Map friendly names to lm_eval task names
@@ -141,25 +142,39 @@
 
            yield f"πŸ”„ **Preparing for evaluation...**\n\nTasks: {', '.join(tasks_to_run)}\n\n"
 
-            # IMPORTANT: Unload the model from memory to free VRAM for lm_eval
-            yield f"πŸ”„ **Unloading model to free VRAM...**\n\nThis is necessary because lm_eval will load its own instance.\n\n"
+            # IMPORTANT: Clean up any loaded model to free VRAM for lm_eval
+            if self.loaded and self.model is not None:
+                yield f"πŸ”„ **Unloading model to free VRAM...**\n\nThis is necessary because lm_eval will load its own instance.\n\n"
+
+                if self.model is not None:
+                    del self.model
+                    self.model = None
+                if self.tokenizer is not None:
+                    del self.tokenizer
+                    self.tokenizer = None
+
+                self.loaded = False
+            else:
+                yield f"πŸ”„ **Cleaning up memory...**\n\nPreparing environment for evaluation.\n\n"
 
-            if self.model is not None:
-                del self.model
-                self.model = None
-            if self.tokenizer is not None:
-                del self.tokenizer
-                self.tokenizer = None
+            # Aggressive memory cleanup
+            import gc
+            for _ in range(3):
+                gc.collect()
 
-            # Clear CUDA cache
             if torch.cuda.is_available():
-                torch.cuda.empty_cache()
-                torch.cuda.synchronize()
+                for i in range(torch.cuda.device_count()):
+                    torch.cuda.empty_cache()
+                    torch.cuda.synchronize(device=i)
+                    torch.cuda.reset_peak_memory_stats(device=i)
+                    torch.cuda.reset_accumulated_memory_stats(device=i)
 
-            import gc
-            gc.collect()
+            # Wait for memory to be fully released
+            yield f"πŸ”„ **Waiting for memory cleanup...**\n\nGiving the system time to fully release VRAM.\n\n"
+            time.sleep(5)
 
-            self.loaded = False
+            # Final garbage collection
+            gc.collect()
 
            yield f"βœ… **Memory cleared! Starting evaluation...**\n\nThis will take 30-60 minutes total.\n\n"
 
@@ -261,18 +276,19 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Evaluation")
     with gr.Tabs():
         # Tab 1: Controls (always visible)
         with gr.Tab("πŸŽ›οΈ Controls"):
-            gr.Markdown("### Load Model First")
+            gr.Markdown("### Load Model (Optional)")
             load_btn = gr.Button("πŸš€ Load Model", variant="primary", size="lg")
             status = gr.Markdown("**Status:** Model not loaded")
 
             gr.Markdown("""
             ### ℹ️ Instructions
-            1. **Click "Load Model"** - Takes 5-10 minutes (verifies setup)
-            2. **Use Evaluation tab** - To run benchmarks
+            1. **(Optional)** Click "Load Model" to verify setup (takes 5-10 minutes)
+            2. **Go directly to Evaluation tab** to run benchmarks
 
             **Note:**
             - Chat/inference functionality is currently disabled. This Space focuses on model evaluation only.
-            - The model will be automatically unloaded before evaluation starts to free VRAM for lm_eval.
+            - Loading the model first is optional - you can go straight to the Evaluation tab
+            - Any loaded model will be automatically unloaded before evaluation starts to free VRAM for lm_eval.
             """)
 
         # Tab 2: Chat - DISABLED
@@ -339,9 +355,9 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Evaluation")
     gr.Markdown("""
     ---
    **Note:**
-    - Click "Load Model" in Controls tab first to verify the setup
-    - The model will be automatically unloaded before evaluation to free VRAM
-    - lm_eval will load its own instance of the model for evaluation
+    - You can start evaluation immediately - no need to load the model first
+    - If you did load the model, it will be automatically unloaded before evaluation to free VRAM
+    - lm_eval will load its own fresh instance of the model for evaluation
     - Results will be saved to `/tmp/eval_results_[timestamp]/`
    """)
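For context, the evaluation step described above ultimately shells out to the LM Evaluation Harness after memory is cleared. A hypothetical sketch of that call using the harness's standard CLI; the exact task names (e.g. `truthfulqa_mc2`), the `parallelize=True` multi-GPU flag, and the `run_lm_eval` helper are assumptions, since the real command string is outside the hunks shown in this commit:

```python
import subprocess
from datetime import datetime

MODEL_NAME = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"

def run_lm_eval(tasks=("arc_challenge", "truthfulqa_mc2", "winogrande")):
    """Illustrative lm_eval invocation mirroring the settings described in the README."""
    output_dir = f"/tmp/eval_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args",
        f"pretrained={MODEL_NAME},trust_remote_code=True,parallelize=True",
        "--tasks", ",".join(tasks),
        "--batch_size", "1",           # matches the README's batch-size-1 setting
        "--output_path", output_dir,
    ]
    subprocess.run(cmd, check=True)
    return output_dir
```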