aeb56 committed
Commit 3fb1215 · Parent(s): 74f609c

Aggressive memory cleanup: 5s wait, env vars, optional model loading

Files changed (2)
  1. README.md +26 -23
  2. app.py +41 -25
README.md CHANGED
@@ -46,25 +46,26 @@ Model evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. **
 
 ### Quick Start
 
-1. **Load Model**
-   - Click "πŸš€ Load Model" button in the Controls tab
-   - Wait 5-10 minutes for model initialization
-   - Model will be distributed across available GPUs
-   - Look for "βœ… Model loaded successfully"
-
-2. **Run Evaluation**
-   - Go to the "πŸ“Š Evaluation" tab
-   - Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
-   - Click "πŸš€ Start Evaluation"
-   - The model will be automatically unloaded to free VRAM
-   - lm_eval will load its own instance for evaluation
-   - Wait 30-60 minutes for results
-   - Results will be displayed and saved to `/tmp/eval_results_[timestamp]/`
-
-3. **View Results**
-   - Evaluation results include metrics for each benchmark
-   - Results are automatically formatted and displayed
-   - Full results JSON files are saved for detailed analysis
+**Option 1: Direct Evaluation (Recommended)**
+1. Go directly to the "πŸ“Š Evaluation" tab
+2. Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
+3. Click "πŸš€ Start Evaluation"
+4. lm_eval will automatically load and evaluate the model
+5. Wait 30-60 minutes for results
+6. Results will be displayed and saved to `/tmp/eval_results_[timestamp]/`
+
+**Option 2: With Model Verification**
+1. **(Optional)** Click "πŸš€ Load Model" in the Controls tab to verify setup (5-10 min)
+2. Go to the "πŸ“Š Evaluation" tab
+3. Select benchmarks and click "πŸš€ Start Evaluation"
+4. The pre-loaded model will be automatically unloaded to free VRAM
+5. lm_eval will load its own fresh instance for evaluation
+6. Wait 30-60 minutes for results
+
+**View Results**
+- Evaluation results include metrics for each benchmark
+- Results are automatically formatted and displayed
+- Full results JSON files are saved for detailed analysis
 
 ## Why LM Evaluation Harness?
 
@@ -82,12 +83,14 @@ The LM Evaluation Harness is a standard framework for evaluating language models
 
 ### Memory Management
 
-This Space is optimized for limited VRAM:
-- **Pre-loading:** Optional model loading to verify setup
-- **Automatic Cleanup:** Model is unloaded before evaluation starts
+This Space is optimized for limited VRAM (92GB across 4x L4):
+- **Direct Evaluation:** Skip model pre-loading and go straight to evaluation (recommended)
+- **Automatic Cleanup:** Any pre-loaded model is unloaded before evaluation starts
+- **Aggressive Memory Clearing:** Multiple garbage collection passes + 5s wait time
 - **Single Instance:** Only lm_eval's model instance runs during evaluation
-- **Batch Size:** Set to 1 to minimize memory usage
+- **Batch Size:** Set to 1 to minimize memory usage during evaluation
 - **Device Mapping:** Automatic distribution across available GPUs
+- **Memory Fragmentation:** PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True set by default
 
 ## Technical Details
 
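The Memory Management notes above hinge on two things: setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` early (the CUDA caching allocator reads it when first used) and sharding the model across the available GPUs. A minimal sketch of what the optional "Load Model" path could look like, not taken from this commit; the dtype, `trust_remote_code`, and the `load_model` helper name are assumptions, since the actual loading code is outside the hunks shown here:

```python
import os

# The CUDA caching allocator reads PYTORCH_CUDA_ALLOC_CONF when it is first used,
# so set it before anything allocates GPU memory.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.environ["FLA_USE_TRITON"] = "1"  # flash-linear-attention Triton backend

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"

def load_model():
    """Illustrative loader: shards the 48B model across all visible GPUs."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.bfloat16,   # assumption: bf16 to fit the ~92GB total VRAM budget
        device_map="auto",            # automatic distribution across available GPUs
        trust_remote_code=True,       # assumption: custom Kimi-Linear architecture code
    )
    return tokenizer, model
```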
app.py CHANGED
@@ -5,9 +5,11 @@ import os
 import subprocess
 import json
 from datetime import datetime
+import time
 
-# Set environment variable for flash-linear-attention
+# Set environment variables for flash-linear-attention and memory management
 os.environ["FLA_USE_TRITON"] = "1"
+os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
 
 # Model configuration
 MODEL_NAME = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"
@@ -121,9 +123,8 @@ class ChatBot:
 
     def run_evaluation(self, tasks_to_run):
         """Run lm_eval on selected tasks"""
-        if not self.loaded:
-            yield "❌ Please load the model first!"
-            return
+        # Note: We don't strictly require the model to be loaded first
+        # since we'll be unloading it anyway. The load step is just for verification.
 
         try:
            # Map friendly names to lm_eval task names
@@ -141,25 +142,39 @@
 
            yield f"πŸ”„ **Preparing for evaluation...**\n\nTasks: {', '.join(tasks_to_run)}\n\n"
 
-            # IMPORTANT: Unload the model from memory to free VRAM for lm_eval
-            yield f"πŸ”„ **Unloading model to free VRAM...**\n\nThis is necessary because lm_eval will load its own instance.\n\n"
+            # IMPORTANT: Clean up any loaded model to free VRAM for lm_eval
+            if self.loaded and self.model is not None:
+                yield f"πŸ”„ **Unloading model to free VRAM...**\n\nThis is necessary because lm_eval will load its own instance.\n\n"
+
+                if self.model is not None:
+                    del self.model
+                    self.model = None
+                if self.tokenizer is not None:
+                    del self.tokenizer
+                    self.tokenizer = None
+
+                self.loaded = False
+            else:
+                yield f"πŸ”„ **Cleaning up memory...**\n\nPreparing environment for evaluation.\n\n"
 
-            if self.model is not None:
-                del self.model
-                self.model = None
-            if self.tokenizer is not None:
-                del self.tokenizer
-                self.tokenizer = None
+            # Aggressive memory cleanup
+            import gc
+            for _ in range(3):
+                gc.collect()
 
-            # Clear CUDA cache
             if torch.cuda.is_available():
-                torch.cuda.empty_cache()
-                torch.cuda.synchronize()
+                for i in range(torch.cuda.device_count()):
+                    torch.cuda.empty_cache()
+                    torch.cuda.synchronize(device=i)
+                    torch.cuda.reset_peak_memory_stats(device=i)
+                    torch.cuda.reset_accumulated_memory_stats(device=i)
 
-            import gc
-            gc.collect()
+            # Wait for memory to be fully released
+            yield f"πŸ”„ **Waiting for memory cleanup...**\n\nGiving the system time to fully release VRAM.\n\n"
+            time.sleep(5)
 
-            self.loaded = False
+            # Final garbage collection
+            gc.collect()
 
            yield f"βœ… **Memory cleared! Starting evaluation...**\n\nThis will take 30-60 minutes total.\n\n"
 
@@ -261,18 +276,19 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Evaluation")
     with gr.Tabs():
         # Tab 1: Controls (always visible)
         with gr.Tab("πŸŽ›οΈ Controls"):
-            gr.Markdown("### Load Model First")
+            gr.Markdown("### Load Model (Optional)")
             load_btn = gr.Button("πŸš€ Load Model", variant="primary", size="lg")
             status = gr.Markdown("**Status:** Model not loaded")
 
             gr.Markdown("""
             ### ℹ️ Instructions
-            1. **Click "Load Model"** - Takes 5-10 minutes (verifies setup)
-            2. **Use Evaluation tab** - To run benchmarks
+            1. **(Optional)** Click "Load Model" to verify setup (takes 5-10 minutes)
+            2. **Go directly to Evaluation tab** to run benchmarks
 
             **Note:**
             - Chat/inference functionality is currently disabled. This Space focuses on model evaluation only.
-            - The model will be automatically unloaded before evaluation starts to free VRAM for lm_eval.
+            - Loading the model first is optional - you can go straight to the Evaluation tab
+            - Any loaded model will be automatically unloaded before evaluation starts to free VRAM for lm_eval.
             """)
 
         # Tab 2: Chat - DISABLED
@@ -339,9 +355,9 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Evaluation")
     gr.Markdown("""
     ---
    **Note:**
-    - Click "Load Model" in Controls tab first to verify the setup
-    - The model will be automatically unloaded before evaluation to free VRAM
-    - lm_eval will load its own instance of the model for evaluation
+    - You can start evaluation immediately - no need to load the model first
+    - If you did load the model, it will be automatically unloaded before evaluation to free VRAM
+    - lm_eval will load its own fresh instance of the model for evaluation
     - Results will be saved to `/tmp/eval_results_[timestamp]/`
    """)
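For context, the evaluation step described above ultimately shells out to the LM Evaluation Harness after memory is cleared. A hypothetical sketch of that call using the harness's standard CLI; the exact task names (e.g. `truthfulqa_mc2`), the `parallelize=True` multi-GPU flag, and the `run_lm_eval` helper are assumptions, since the real command string is outside the hunks shown in this commit:

```python
import subprocess
from datetime import datetime

MODEL_NAME = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"

def run_lm_eval(tasks=("arc_challenge", "truthfulqa_mc2", "winogrande")):
    """Illustrative lm_eval invocation mirroring the settings described in the README."""
    output_dir = f"/tmp/eval_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args",
        f"pretrained={MODEL_NAME},trust_remote_code=True,parallelize=True",
        "--tasks", ",".join(tasks),
        "--batch_size", "1",           # matches the README's batch-size-1 setting
        "--output_path", output_dir,
    ]
    subprocess.run(cmd, check=True)
    return output_dir
```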