Alikestocode committed on
Commit
cf9ed91
1 Parent(s): 3f08592

Add build_awq_modifier_config helper using QuantizationScheme objects


- Add a helper function that builds the config from QuantizationScheme/QuantizationArgs objects
- Properly constructs a QuantizationConfig with config_groups
- Falls back to a dict-based config if QuantizationScheme is not available
- Fixes the ValidationError by using proper object structure instead of plain dicts
- Updates the quantization function to use the helper
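The helper itself is not visible in the truncated diff below, so as a rough sketch only (assuming the compressed-tensors `QuantizationScheme`/`QuantizationArgs` API and a hypothetical signature — the real commit may differ), the behavior the bullets describe might look like:

```python
# Hypothetical sketch of build_awq_modifier_config, based on the commit
# description: prefer typed QuantizationScheme/QuantizationArgs objects,
# fall back to the plain-dict config_groups shape otherwise.

def build_awq_modifier_config(num_bits: int = 4, group_size: int = 128,
                              symmetric: bool = False) -> dict:
    """Return kwargs for AWQModifier (assumed shape, not the actual commit)."""
    try:
        # compressed-tensors provides the typed quantization objects
        from compressed_tensors.quantization import (
            QuantizationArgs,
            QuantizationScheme,
        )
    except ImportError:
        # Fallback: plain-dict config_groups (the pre-helper behavior,
        # which triggered the ValidationError the commit fixes)
        return {
            "config_groups": {
                "group_0": {
                    "targets": ["Linear"],
                    "weights": {
                        "num_bits": num_bits,
                        "group_size": group_size,
                        "symmetric": symmetric,
                        "strategy": "group",
                        "type": "int",
                        "dynamic": False,
                    },
                }
            },
            "ignore": ["lm_head"],
        }

    # Typed path: build the scheme object instead of a raw dict
    weights = QuantizationArgs(
        num_bits=num_bits,
        group_size=group_size,
        symmetric=symmetric,
        strategy="group",
        type="int",
        dynamic=False,
    )
    scheme = QuantizationScheme(targets=["Linear"], weights=weights)
    return {"config_groups": {"group_0": scheme}, "ignore": ["lm_head"]}
```

Either branch returns something shaped like `AWQModifier(**build_awq_modifier_config())` could consume; the function name, parameters, and both return shapes here are assumptions for illustration.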

Files changed (1)
  1. quantize_to_awq_colab.ipynb +776 -635
quantize_to_awq_colab.ipynb CHANGED
@@ -1,638 +1,779 @@
1
  {
2
- "cells": [
3
- {
4
- "cell_type": "markdown",
5
- "metadata": {},
6
- "source": [
7
- "# Router Models AWQ Quantization with LLM Compressor (vLLM Native)\n",
8
- "\n",
9
- "This notebook quantizes the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format using **LLM Compressor** - vLLM's native quantization tool.\n",
10
- "\n",
11
- "**Models to quantize:**\n",
12
- "- `Alovestocode/router-gemma3-merged` (27B)\n",
13
- "- `Alovestocode/router-qwen3-32b-merged` (33B)\n",
14
- "\n",
15
- "**Output:** AWQ-quantized models ready for vLLM inference with optimal performance.\n",
16
- "\n",
17
- "**Why LLM Compressor?**\n",
18
- "- Native vLLM integration (better compatibility)\n",
19
- "- Supports advanced features (pruning, combined modifiers)\n",
20
- "- Actively maintained by vLLM team\n",
21
- "- Optimized for vLLM inference engine\n",
22
- "\n",
23
- "**⚠️ IMPORTANT:** If you see errors about `AWQModifier` parameters, **restart the kernel** (Runtime → Restart runtime) and run all cells from the beginning. The notebook uses `AWQModifier()` without parameters (default 4-bit AWQ).\n"
24
- ]
25
- },
26
- {
27
- "cell_type": "markdown",
28
- "metadata": {},
29
- "source": [
30
- "## 1. Install Dependencies\n"
31
- ]
32
- },
33
- {
34
- "cell_type": "code",
35
- "execution_count": null,
36
- "metadata": {},
37
- "outputs": [],
38
- "source": [
39
- "# Install required packages\n",
40
- "# LLM Compressor is vLLM's native quantization tool\n",
41
- "# Note: Package name is 'llmcompressor' (no hyphen), may need to install from GitHub\n",
42
- "%pip install -q transformers accelerate huggingface_hub\n",
43
- "%pip install -q torch --index-url https://download.pytorch.org/whl/cu118\n",
44
- "\n",
45
- "# Try installing llmcompressor from PyPI first, fallback to GitHub if not available\n",
46
- "try:\n",
47
- " import llmcompressor\n",
48
- " print(\"✅ llmcompressor already installed\")\n",
49
- "except ImportError:\n",
50
- " print(\"Installing llmcompressor...\")\n",
51
- " # Try PyPI first\n",
52
- " import subprocess\n",
53
- " import sys\n",
54
- " result = subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"llmcompressor\"], \n",
55
- " capture_output=True, text=True)\n",
56
- " if result.returncode != 0:\n",
57
- " # Fallback to GitHub installation\n",
58
- " print(\"PyPI installation failed, trying GitHub...\")\n",
59
- " subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \n",
60
- " \"git+https://github.com/vllm-project/llm-compressor.git\"], \n",
61
- " check=False)\n",
62
- " print(\"✅ llmcompressor installed\")\n",
63
- "\n",
64
- "# Utility function to check disk space\n",
65
- "import shutil\n",
66
- "def check_disk_space():\n",
67
- " \"\"\"Check available disk space.\"\"\"\n",
68
- " total, used, free = shutil.disk_usage(\"/\")\n",
69
- " print(f\"Disk Space: {free / (1024**3):.2f} GB free out of {total / (1024**3):.2f} GB total\")\n",
70
- " return free / (1024**3) # Return free space in GB\n",
71
- "\n",
72
- "print(\"Initial disk space:\")\n",
73
- "check_disk_space()\n"
74
- ]
75
- },
76
- {
77
- "cell_type": "markdown",
78
- "metadata": {},
79
- "source": [
80
- "## 2. Authenticate with Hugging Face\n"
81
- ]
82
- },
83
- {
84
- "cell_type": "code",
85
- "execution_count": null,
86
- "metadata": {},
87
- "outputs": [],
88
- "source": [
89
- "from huggingface_hub import login\n",
90
- "import os\n",
91
- "\n",
92
- "# Login to Hugging Face (you'll need a token with write access)\n",
93
- "# Get your token from: https://huggingface.co/settings/tokens\n",
94
- "HF_TOKEN = \"your_hf_token_here\" # Replace with your token\n",
95
- "\n",
96
- "login(token=HF_TOKEN)\n",
97
- "os.environ[\"HF_TOKEN\"] = HF_TOKEN\n"
98
- ]
99
- },
100
- {
101
- "cell_type": "markdown",
102
- "metadata": {},
103
- "source": [
104
- "## 3. Configuration\n"
105
- ]
106
- },
107
- {
108
- "cell_type": "code",
109
- "execution_count": null,
110
- "metadata": {},
111
- "outputs": [],
112
- "source": [
113
- "# Model configurations\n",
114
- "MODELS_TO_QUANTIZE = {\n",
115
- " \"router-gemma3-merged\": {\n",
116
- " \"repo_id\": \"Alovestocode/router-gemma3-merged\",\n",
117
- " \"output_repo\": \"Alovestocode/router-gemma3-merged-awq\", # Or keep same repo\n",
118
- " \"model_type\": \"gemma\",\n",
119
- " },\n",
120
- " \"router-qwen3-32b-merged\": {\n",
121
- " \"repo_id\": \"Alovestocode/router-qwen3-32b-merged\",\n",
122
- " \"output_repo\": \"Alovestocode/router-qwen3-32b-merged-awq\", # Or keep same repo\n",
123
- " \"model_type\": \"qwen\",\n",
124
- " }\n",
125
- "}\n",
126
- "\n",
127
- "# AWQ quantization config\n",
128
- "AWQ_CONFIG = {\n",
129
- " \"w_bit\": 4, # 4-bit quantization\n",
130
- " \"q_group_size\": 128, # Group size for quantization\n",
131
- " \"zero_point\": True, # Use zero-point quantization\n",
132
- " \"version\": \"GEMM\", # GEMM kernel (better for longer contexts)\n",
133
- "}\n"
134
- ]
135
- },
136
- {
137
- "cell_type": "markdown",
138
- "metadata": {},
139
- "source": [
140
- "## 4. Quantization Function\n"
141
- ]
142
- },
143
- {
144
- "cell_type": "code",
145
- "execution_count": null,
146
- "metadata": {},
147
- "outputs": [],
148
- "source": [
149
- "# LLM Compressor (vLLM native quantization tool)\n",
150
- "# Import with error handling in case installation failed\n",
151
- "try:\n",
152
- " from llmcompressor import oneshot\n",
153
- " # Correct import path: AWQModifier is in modifiers.awq, not modifiers.quantization\n",
154
- " from llmcompressor.modifiers.awq import AWQModifier\n",
155
- " LLM_COMPRESSOR_AVAILABLE = True\n",
156
- " print(\"✅ LLM Compressor imported successfully\")\n",
157
- "except ImportError as e:\n",
158
- " print(f\"❌ Failed to import llmcompressor: {e}\")\n",
159
- " print(\"Please ensure llmcompressor is installed:\")\n",
160
- " print(\" %pip install llmcompressor\")\n",
161
- " print(\" OR\")\n",
162
- " print(\" %pip install git+https://github.com/vllm-project/llm-compressor.git\")\n",
163
- " print(\"\\nNote: If import still fails, try:\")\n",
164
- " print(\" %pip install --upgrade llmcompressor\")\n",
165
- " LLM_COMPRESSOR_AVAILABLE = False\n",
166
- " raise\n",
167
- "\n",
168
- "from transformers import AutoTokenizer\n",
169
- "from huggingface_hub import HfApi, scan_cache_dir, upload_folder\n",
170
- "import torch\n",
171
- "import shutil\n",
172
- "import gc\n",
173
- "import os\n",
174
- "\n",
175
- "# Try to import delete_revisions (may not be available in all versions)\n",
176
- "try:\n",
177
- " from huggingface_hub import delete_revisions\n",
178
- " DELETE_REVISIONS_AVAILABLE = True\n",
179
- "except ImportError:\n",
180
- " # delete_revisions might not be available, we'll use alternative method\n",
181
- " DELETE_REVISIONS_AVAILABLE = False\n",
182
- " print(\"Note: delete_revisions not available, will use alternative cache cleanup method\")\n",
183
- "\n",
184
- "def quantize_model_to_awq(\n",
185
- " model_name: str,\n",
186
- " repo_id: str,\n",
187
- " output_repo: str,\n",
188
- " model_type: str,\n",
189
- " awq_config: dict,\n",
190
- " calibration_dataset_size: int = 128\n",
191
- "):\n",
192
- " \"\"\"Quantize a model to AWQ format using LLM Compressor (vLLM native).\n",
193
- " \n",
194
- " Args:\n",
195
- " model_name: Display name for the model\n",
196
- " repo_id: Source Hugging Face repo ID\n",
197
- " output_repo: Destination Hugging Face repo ID\n",
198
- " model_type: Model type (gemma/qwen) for tokenizer selection\n",
199
- " awq_config: AWQ quantization configuration\n",
200
- " calibration_dataset_size: Number of calibration samples\n",
201
- " \"\"\"\n",
202
- " print(f\"\\n{'='*60}\")\n",
203
- " print(f\"Quantizing {model_name} with LLM Compressor (vLLM native)\")\n",
204
- " print(f\"Source: {repo_id}\")\n",
205
- " print(f\"Destination: {output_repo}\")\n",
206
- " print(f\"{'='*60}\\n\")\n",
207
- " \n",
208
- " # Check disk space before starting\n",
209
- " free_space_before = check_disk_space()\n",
210
- " if free_space_before < 30:\n",
211
- " print(f\"⚠️ WARNING: Low disk space ({free_space_before:.2f} GB). Quantization may fail.\")\n",
212
- " \n",
213
- " # Step 1: Create temporary output directory\n",
214
- " import tempfile\n",
215
- " temp_output_dir = f\"./temp_{model_name.replace('-', '_')}_awq\"\n",
216
- " print(f\"[1/4] Creating temporary output directory: {temp_output_dir}\")\n",
217
- " os.makedirs(temp_output_dir, exist_ok=True)\n",
218
- " \n",
219
- " # Step 2: Prepare calibration dataset\n",
220
- " print(f\"\\n[2/4] Preparing calibration dataset ({calibration_dataset_size} samples)...\")\n",
221
- " \n",
222
- " # Create calibration dataset for router agent\n",
223
- " calibration_texts = [\n",
224
- " \"You are the Router Agent coordinating Math, Code, and General-Search specialists.\",\n",
225
- " \"Emit EXACTLY ONE strict JSON object with keys route_plan, route_rationale, expected_artifacts,\",\n",
226
- " \"Solve a quadratic equation using Python programming.\",\n",
227
- " \"Implement a binary search algorithm with proper error handling.\",\n",
228
- " \"Explain the concept of gradient descent in machine learning.\",\n",
229
- " \"Write a function to calculate the Fibonacci sequence recursively.\",\n",
230
- " \"Design a REST API endpoint for user authentication.\",\n",
231
- " \"Analyze the time complexity of merge sort algorithm.\",\n",
232
- " ]\n",
233
- " \n",
234
- " # Repeat to reach desired size\n",
235
- " while len(calibration_texts) < calibration_dataset_size:\n",
236
- " calibration_texts.extend(calibration_texts[:calibration_dataset_size - len(calibration_texts)])\n",
237
- " \n",
238
- " calibration_texts = calibration_texts[:calibration_dataset_size]\n",
239
- " print(f\"✅ Calibration dataset prepared: {len(calibration_texts)} samples\")\n",
240
- " \n",
241
- " # Step 3: Quantize model using LLM Compressor\n",
242
- " print(f\"\\n[3/4] Quantizing model to AWQ with LLM Compressor (this may take 30-60 minutes)...\")\n",
243
- " print(f\"Config: {awq_config}\")\n",
244
- " print(\"⚠️ LLM Compressor will load the model, quantize it, and save to local directory\")\n",
245
- " \n",
246
- " if not LLM_COMPRESSOR_AVAILABLE:\n",
247
- " raise ImportError(\"LLM Compressor is not available. Please install it first.\")\n",
248
- " \n",
249
- " try:\n",
250
- " # LLM Compressor's oneshot function handles everything:\n",
251
- " # - Loading the model\n",
252
- " # - Quantization with calibration data\n",
253
- " # - Saving quantized model\n",
254
- " print(f\" → Starting quantization with LLM Compressor...\")\n",
255
- " print(f\" → This may take 30-60 minutes depending on model size...\")\n",
256
- " \n",
257
- " # AWQModifier quantization config\n",
258
- " # Create quantization config with correct structure for AWQ\n",
259
- " print(f\" → Creating quantization config for 4-bit AWQ...\")\n",
260
- " \n",
261
- " # AWQModifier requires quantization_config with proper structure:\n",
262
- " # - config_groups: dict mapping group names to quantization schemes\n",
263
- " # - Each group needs: targets (list of module types), weights (dict with num_bits, etc.)\n",
264
- " quant_config = {\n",
265
- " \"config_groups\": {\n",
266
- " \"group_0\": {\n",
267
- " \"targets\": [\"Linear\"], # Target Linear layers\n",
268
- " \"weights\": {\n",
269
- " \"num_bits\": 4, # 4-bit quantization\n",
270
- " \"group_size\": 128, # Group size for quantization\n",
271
- " \"zero_point\": True, # Use zero-point quantization\n",
272
- " \"symmetric\": False, # Asymmetric quantization\n",
273
- " \"strategy\": \"group\", # Group-wise quantization\n",
274
- " \"observer\": \"minmax\", # Min-max observer\n",
275
- " \"type\": \"int\", # Integer quantization\n",
276
- " \"dynamic\": False # Static quantization\n",
277
- " },\n",
278
- " \"input_activations\": None, # No activation quantization\n",
279
- " \"output_activations\": None # No activation quantization\n",
280
- " }\n",
281
- " },\n",
282
- " \"ignore\": [\"lm_head\"], # Ignore language model head\n",
283
- " \"quant_method\": \"compressed-tensors\",\n",
284
- " \"quantization_status\": \"compressed\",\n",
285
- " \"format\": \"pack-quantized\"\n",
286
- " }\n",
287
- " \n",
288
- " print(f\" ✅ Created quantization config with correct structure\")\n",
289
- " print(f\" → Creating AWQModifier with quantization config...\")\n",
290
- " modifiers = [AWQModifier(quantization_config=quant_config)]\n",
291
- " print(f\" ✅ AWQModifier created successfully\")\n",
292
- " \n",
293
- " # Call oneshot with the modifier\n",
294
- " print(f\" → Starting quantization process...\")\n",
295
- " oneshot(\n",
296
- " model=repo_id,\n",
297
- " output_dir=temp_output_dir,\n",
298
- " modifiers=modifiers,\n",
299
- " token=os.environ.get(\"HF_TOKEN\"),\n",
300
- " # Calibration data: list of strings\n",
301
- " calibration_data=calibration_texts[:min(calibration_dataset_size, 128)]\n",
302
- " )\n",
303
- " \n",
304
- " print(f\"✅ Model quantized to AWQ successfully\")\n",
305
- " except Exception as e:\n",
306
- " print(f\"❌ Quantization failed: {e}\")\n",
307
- " print(f\"\\nTroubleshooting:\")\n",
308
- " print(f\"1. Ensure llmcompressor is installed: %pip install llmcompressor\")\n",
309
- " print(f\"2. Or install from GitHub: %pip install git+https://github.com/vllm-project/llm-compressor.git\")\n",
310
- " print(f\"3. Check that you have sufficient GPU memory (40GB+ recommended)\")\n",
311
- " import traceback\n",
312
- " traceback.print_exc()\n",
313
- " raise\n",
314
- " \n",
315
- " # Step 4: Upload to Hugging Face\n",
316
- " print(f\"\\n[4/4] Uploading quantized model to {output_repo}...\")\n",
317
- " \n",
318
- " # Create repo if it doesn't exist\n",
319
- " api = HfApi()\n",
320
- " try:\n",
321
- " api.create_repo(\n",
322
- " repo_id=output_repo,\n",
323
- " repo_type=\"model\",\n",
324
- " exist_ok=True,\n",
325
- " token=os.environ.get(\"HF_TOKEN\")\n",
326
- " )\n",
327
- " print(f\"✅ Repository ready: {output_repo}\")\n",
328
- " except Exception as e:\n",
329
- " print(f\"Note: Repo may already exist: {e}\")\n",
330
- " \n",
331
- " # Upload the quantized model directory\n",
332
- " try:\n",
333
- " upload_folder(\n",
334
- " folder_path=temp_output_dir,\n",
335
- " repo_id=output_repo,\n",
336
- " repo_type=\"model\",\n",
337
- " token=os.environ.get(\"HF_TOKEN\"),\n",
338
- " ignore_patterns=[\"*.pt\", \"*.bin\"] # Only upload safetensors\n",
339
- " )\n",
340
- " print(f\"✅ Quantized model uploaded to {output_repo}\")\n",
341
- " except Exception as e:\n",
342
- " print(f\"❌ Upload failed: {e}\")\n",
343
- " import traceback\n",
344
- " traceback.print_exc()\n",
345
- " raise\n",
346
- " \n",
347
- " # Step 5: Clean up to free disk space (critical for Colab)\n",
348
- " print(f\"\\n[5/5] Cleaning up local files to free disk space...\")\n",
349
- " \n",
350
- " # Delete temporary output directory\n",
351
- " try:\n",
352
- " import shutil\n",
353
- " shutil.rmtree(temp_output_dir)\n",
354
- " print(f\" ✅ Deleted temporary directory: {temp_output_dir}\")\n",
355
- " except Exception as e:\n",
356
- " print(f\" ⚠️ Could not delete temp directory: {e}\")\n",
357
- " \n",
358
- " # Free GPU memory\n",
359
- " torch.cuda.empty_cache()\n",
360
- " gc.collect()\n",
361
- " \n",
362
- " # Clear Hugging Face cache for the source model (frees ~50-70GB)\n",
363
- " print(f\" → Clearing Hugging Face cache for {repo_id}...\")\n",
364
- " try:\n",
365
- " cache_info = scan_cache_dir()\n",
366
- " # Find and delete revisions for the source model\n",
367
- " revisions_to_delete = []\n",
368
- " for repo in cache_info.revisions:\n",
369
- " if repo.repo_id == repo_id:\n",
370
- " revisions_to_delete.append(repo)\n",
371
- " \n",
372
- " if revisions_to_delete:\n",
373
- " if DELETE_REVISIONS_AVAILABLE:\n",
374
- " # Use delete_revisions if available\n",
375
- " delete_revisions(revisions_to_delete)\n",
376
- " print(f\" ✅ Deleted {len(revisions_to_delete)} cached revision(s) for {repo_id}\")\n",
377
- " else:\n",
378
- " # Alternative: Delete cache directories manually\n",
379
- " deleted_count = 0\n",
380
- " for revision in revisions_to_delete:\n",
381
- " try:\n",
382
- " # Get the cache directory path\n",
383
- " cache_path = revision.snapshot_path if hasattr(revision, 'snapshot_path') else None\n",
384
- " if cache_path and os.path.exists(cache_path):\n",
385
- " shutil.rmtree(cache_path)\n",
386
- " deleted_count += 1\n",
387
- " except Exception as e:\n",
388
- " print(f\" ⚠️ Could not delete {revision.repo_id}: {e}\")\n",
389
- " \n",
390
- " if deleted_count > 0:\n",
391
- " print(f\" ✅ Deleted {deleted_count} cached revision(s) for {repo_id}\")\n",
392
- " else:\n",
393
- " print(f\" ℹ️ Found {len(revisions_to_delete)} cached revision(s) but couldn't delete them\")\n",
394
- " print(f\" Try manually: huggingface-cli scan-cache --dir ~/.cache/huggingface\")\n",
395
- " else:\n",
396
- " print(f\" ℹ️ No cached revisions found for {repo_id}\")\n",
397
- " except Exception as e:\n",
398
- " print(f\" ⚠️ Cache cleanup warning: {e} (continuing...)\")\n",
399
- " print(f\" You can manually clean cache with: huggingface-cli scan-cache\")\n",
400
- " \n",
401
- " # Check disk space after cleanup\n",
402
- " free_space_after = check_disk_space()\n",
403
- " print(f\"\\n✅ Cleanup complete! Free space: {free_space_after:.2f} GB\")\n",
404
- " \n",
405
- " print(f\"\\n✅ {model_name} quantization complete!\")\n",
406
- " print(f\"Model available at: https://huggingface.co/{output_repo}\")\n",
407
- " print(f\"💾 Local model files deleted to save disk space\")\n",
408
- " print(f\"🚀 Model is ready for vLLM inference with optimal performance!\")\n"
409
- ]
410
- },
411
- {
412
- "cell_type": "markdown",
413
- "metadata": {},
414
- "source": []
415
- },
416
- {
417
- "cell_type": "code",
418
- "execution_count": null,
419
- "metadata": {},
420
- "outputs": [],
421
- "source": [
422
- "quantize_model_to_awq(\n",
423
- " model_name=\"Router-Gemma3-27B\",\n",
424
- " repo_id=MODELS_TO_QUANTIZE[\"router-gemma3-merged\"][\"repo_id\"],\n",
425
- " output_repo=MODELS_TO_QUANTIZE[\"router-gemma3-merged\"][\"output_repo\"],\n",
426
- " model_type=MODELS_TO_QUANTIZE[\"router-gemma3-merged\"][\"model_type\"],\n",
427
- " awq_config=AWQ_CONFIG,\n",
428
- " calibration_dataset_size=128\n",
429
- ")\n"
430
- ]
431
- },
432
- {
433
- "cell_type": "markdown",
434
- "metadata": {},
435
- "source": [
436
- "## 6. Quantize Router-Qwen3-32B-Merged\n"
437
- ]
438
- },
439
- {
440
- "cell_type": "code",
441
- "execution_count": null,
442
- "metadata": {},
443
- "outputs": [],
444
- "source": [
445
- "quantize_model_to_awq(\n",
446
- " model_name=\"Router-Qwen3-32B\",\n",
447
- " repo_id=MODELS_TO_QUANTIZE[\"router-qwen3-32b-merged\"][\"repo_id\"],\n",
448
- " output_repo=MODELS_TO_QUANTIZE[\"router-qwen3-32b-merged\"][\"output_repo\"],\n",
449
- " model_type=MODELS_TO_QUANTIZE[\"router-qwen3-32b-merged\"][\"model_type\"],\n",
450
- " awq_config=AWQ_CONFIG,\n",
451
- " calibration_dataset_size=128\n",
452
- ")\n"
453
- ]
454
- },
455
- {
456
- "cell_type": "markdown",
457
- "metadata": {},
458
- "source": [
459
- "## 7. Verify Quantized Models\n"
460
- ]
461
- },
462
- {
463
- "cell_type": "code",
464
- "execution_count": null,
465
- "metadata": {},
466
- "outputs": [],
467
- "source": [
468
- "# Verify quantized models with vLLM (recommended) or Transformers\n",
469
- "from transformers import AutoTokenizer\n",
470
- "\n",
471
- "def verify_awq_model_vllm(repo_id: str):\n",
472
- " \"\"\"Verify AWQ model can be loaded with vLLM (recommended).\"\"\"\n",
473
- " print(f\"\\nVerifying {repo_id} with vLLM...\")\n",
474
- " \n",
475
- " try:\n",
476
- " # Try importing vLLM\n",
477
- " try:\n",
478
- " from vllm import LLM, SamplingParams\n",
479
- " except ImportError:\n",
480
- " print(\"⚠️ vLLM not available, skipping vLLM verification\")\n",
481
- " return False\n",
482
- " \n",
483
- " # Load with vLLM (auto-detects AWQ)\n",
484
- " llm = LLM(\n",
485
- " model=repo_id,\n",
486
- " quantization=\"awq\",\n",
487
- " trust_remote_code=True,\n",
488
- " token=os.environ.get(\"HF_TOKEN\"),\n",
489
- " gpu_memory_utilization=0.5 # Lower for verification\n",
490
- " )\n",
491
- " \n",
492
- " # Test generation\n",
493
- " sampling_params = SamplingParams(\n",
494
- " temperature=0.0,\n",
495
- " max_tokens=10\n",
496
- " )\n",
497
- " \n",
498
- " test_prompt = \"You are the Router Agent. Test prompt.\"\n",
499
- " outputs = llm.generate([test_prompt], sampling_params)\n",
500
- " \n",
501
- " generated_text = outputs[0].outputs[0].text\n",
502
- " print(f\"✅ vLLM loads and generates correctly\")\n",
503
- " print(f\"Generated: {generated_text[:100]}...\")\n",
504
- " \n",
505
- " del llm\n",
506
- " torch.cuda.empty_cache()\n",
507
- " \n",
508
- " return True\n",
509
- " except Exception as e:\n",
510
- " print(f\"❌ vLLM verification failed: {e}\")\n",
511
- " import traceback\n",
512
- " traceback.print_exc()\n",
513
- " return False\n",
514
- "\n",
515
- "def verify_awq_model_transformers(repo_id: str):\n",
516
- " \"\"\"Verify AWQ model can be loaded with Transformers (fallback).\"\"\"\n",
517
- " print(f\"\\nVerifying {repo_id} with Transformers...\")\n",
518
- " \n",
519
- " try:\n",
520
- " # Load tokenizer\n",
521
- " tokenizer = AutoTokenizer.from_pretrained(\n",
522
- " repo_id,\n",
523
- " trust_remote_code=True,\n",
524
- " token=os.environ.get(\"HF_TOKEN\")\n",
525
- " )\n",
526
- " \n",
527
- " # Try loading with AutoAWQ (if available)\n",
528
- " try:\n",
529
- " from awq import AutoAWQForCausalLM\n",
530
- " model = AutoAWQForCausalLM.from_quantized(\n",
531
- " repo_id,\n",
532
- " fuse_layers=True,\n",
533
- " trust_remote_code=True,\n",
534
- " device_map=\"auto\",\n",
535
- " token=os.environ.get(\"HF_TOKEN\")\n",
536
- " )\n",
537
- " \n",
538
- " # Test generation\n",
539
- " test_prompt = \"You are the Router Agent. Test prompt.\"\n",
540
- " inputs = tokenizer(test_prompt, return_tensors=\"pt\").to(model.device)\n",
541
- " \n",
542
- " with torch.inference_mode():\n",
543
- " outputs = model.generate(\n",
544
- " **inputs,\n",
545
- " max_new_tokens=10,\n",
546
- " do_sample=False\n",
547
- " )\n",
548
- " \n",
549
- " generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
550
- " print(f\"✅ Transformers loads and generates correctly\")\n",
551
- " print(f\"Generated: {generated_text[:100]}...\")\n",
552
- " \n",
553
- " del model\n",
554
- " del tokenizer\n",
555
- " torch.cuda.empty_cache()\n",
556
- " \n",
557
- " return True\n",
558
- " except ImportError:\n",
559
- " print(\"⚠️ AutoAWQ not available, skipping Transformers verification\")\n",
560
- " return False\n",
561
- " except Exception as e:\n",
562
- " print(f\"❌ Transformers verification failed: {e}\")\n",
563
- " import traceback\n",
564
- " traceback.print_exc()\n",
565
- " return False\n",
566
- "\n",
567
- "# Verify both models (prefer vLLM)\n",
568
- "for model_key, model_info in MODELS_TO_QUANTIZE.items():\n",
569
- " print(f\"\\n{'='*60}\")\n",
570
- " print(f\"Verifying {model_key}\")\n",
571
- " print(f\"{'='*60}\")\n",
572
- " \n",
573
- " # Try vLLM first (recommended)\n",
574
- " vllm_ok = verify_awq_model_vllm(model_info[\"output_repo\"])\n",
575
- " \n",
576
- " # Fallback to Transformers if vLLM not available\n",
577
- " if not vllm_ok:\n",
578
- " verify_awq_model_transformers(model_info[\"output_repo\"])\n"
579
- ]
580
- },
581
- {
582
- "cell_type": "markdown",
583
- "metadata": {},
584
- "source": [
585
- "\n"
586
- ]
587
- },
588
- {
589
- "cell_type": "code",
590
- "execution_count": null,
591
- "metadata": {},
592
- "outputs": [],
593
- "source": [
594
- "\n"
595
- ]
596
- },
597
- {
598
- "cell_type": "markdown",
599
- "metadata": {},
600
- "source": [
601
- "## Notes\n",
602
- "\n",
603
- "- **GPU Required**: This quantization requires a GPU with at least 40GB VRAM (A100/H100 recommended)\n",
604
- "- **Time**: Each model takes approximately 30-60 minutes to quantize\n",
605
- "- **Disk Space**: \n",
606
- " - Colab has limited disk space (~80GB free)\n",
607
- " - Each source model is ~50-70GB (BF16)\n",
608
- " - Quantized models are ~15-20GB (AWQ 4-bit)\n",
609
- " - **The notebook automatically deletes source models after quantization to save space**\n",
610
- "- **Cleanup**: After each model is quantized and uploaded:\n",
611
- " - GPU memory is freed\n",
612
- " - Hugging Face cache for source model is cleared\n",
613
- " - Disk space is checked before/after\n",
614
- "- **Output Repos**: Models are saved to new repos with `-awq` suffix\n",
615
- "- **Usage**: After quantization, update your `app.py` to use the AWQ repos:\n",
616
- " ```python\n",
617
- " MODELS = {\n",
618
- " \"Router-Gemma3-27B-AWQ\": {\n",
619
- " \"repo_id\": \"Alovestocode/router-gemma3-merged-awq\",\n",
620
- " \"quantization\": \"awq\"\n",
621
- " },\n",
622
- " \"Router-Qwen3-32B-AWQ\": {\n",
623
- " \"repo_id\": \"Alovestocode/router-qwen3-32b-merged-awq\",\n",
624
- " \"quantization\": \"awq\"\n",
625
- " }\n",
626
- " }\n",
627
- " ```\n"
628
- ]
629
- }
630
- ],
631
- "metadata": {
632
- "language_info": {
633
- "name": "python"
634
- }
635
  },
636
- "nbformat": 4,
637
- "nbformat_minor": 2
638
  }
 
1
  {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Router Models AWQ Quantization with LLM Compressor (vLLM Native)\n",
8
+ "\n",
9
+ "This notebook quantizes the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format using **LLM Compressor** - vLLM's native quantization tool.\n",
10
+ "\n",
11
+ "**Models to quantize:**\n",
12
+ "- `Alovestocode/router-gemma3-merged` (27B)\n",
13
+ "- `Alovestocode/router-qwen3-32b-merged` (33B)\n",
14
+ "\n",
15
+ "**Output:** AWQ-quantized models ready for vLLM inference with optimal performance.\n",
16
+ "\n",
17
+ "**Why LLM Compressor?**\n",
18
+ "- Native vLLM integration (better compatibility)\n",
19
+ "- Supports advanced features (pruning, combined modifiers)\n",
20
+ "- Actively maintained by vLLM team\n",
21
+ "- Optimized for vLLM inference engine\n",
22
+ "\n",
23
+ "**⚠️ IMPORTANT:** If you see errors about `AWQModifier` parameters, **restart the kernel** (Runtime → Restart runtime) and run all cells from the beginning. The notebook uses `AWQModifier()` without parameters (default 4-bit AWQ).\n"
24
+ ]
25
  },
26
+ {
27
+ "cell_type": "markdown",
28
+ "metadata": {},
29
+ "source": [
30
+ "## 1. Install Dependencies\n"
31
+ ]
32
+ },
33
+ {
34
+ "cell_type": "code",
35
+ "execution_count": null,
36
+ "metadata": {},
37
+ "outputs": [],
38
+ "source": [
39
+ "# Install required packages\n",
40
+ "# LLM Compressor is vLLM's native quantization tool\n",
41
+ "# Note: Package name is 'llmcompressor' (no hyphen), may need to install from GitHub\n",
42
+ "%pip install -q transformers accelerate huggingface_hub\n",
43
+ "%pip install -q torch --index-url https://download.pytorch.org/whl/cu118\n",
44
+ "\n",
45
+ "# Try installing llmcompressor from PyPI first, fallback to GitHub if not available\n",
46
+ "try:\n",
47
+ " import llmcompressor\n",
48
+ " print(\"✅ llmcompressor already installed\")\n",
49
+ "except ImportError:\n",
50
+ " print(\"Installing llmcompressor...\")\n",
51
+ " # Try PyPI first\n",
52
+ " import subprocess\n",
53
+ " import sys\n",
54
+ " result = subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"llmcompressor\"], \n",
55
+ " capture_output=True, text=True)\n",
56
+ " if result.returncode != 0:\n",
57
+ " # Fallback to GitHub installation\n",
58
+ " print(\"PyPI installation failed, trying GitHub...\")\n",
59
+ " subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \n",
60
+ " \"git+https://github.com/vllm-project/llm-compressor.git\"], \n",
61
+ " check=False)\n",
62
+ " print(\"✅ llmcompressor installed\")\n",
63
+ "\n",
64
+ "# Utility function to check disk space\n",
65
+ "import shutil\n",
66
+ "def check_disk_space():\n",
67
+ " \"\"\"Check available disk space.\"\"\"\n",
68
+ " total, used, free = shutil.disk_usage(\"/\")\n",
69
+ " print(f\"Disk Space: {free / (1024**3):.2f} GB free out of {total / (1024**3):.2f} GB total\")\n",
70
+ " return free / (1024**3) # Return free space in GB\n",
71
+ "\n",
72
+ "print(\"Initial disk space:\")\n",
73
+ "check_disk_space()\n"
74
+ ]
75
+ },
76
+ {
77
+ "cell_type": "markdown",
78
+ "metadata": {},
79
+ "source": [
80
+ "## 2. Authenticate with Hugging Face\n"
81
+ ]
82
+ },
83
+ {
84
+ "cell_type": "code",
85
+ "execution_count": null,
86
+ "metadata": {},
87
+ "outputs": [],
88
+ "source": [
89
+ "from huggingface_hub import login\n",
90
+ "import os\n",
91
+ "\n",
92
+ "# Login to Hugging Face (you'll need a token with write access)\n",
93
+ "# Get your token from: https://huggingface.co/settings/tokens\n",
94
+ "HF_TOKEN = \"your_hf_token_here\" # Replace with your token\n",
95
+ "\n",
96
+ "login(token=HF_TOKEN)\n",
97
+ "os.environ[\"HF_TOKEN\"] = HF_TOKEN\n"
98
+ ]
99
+ },
100
+ {
101
+ "cell_type": "markdown",
102
+ "metadata": {},
103
+ "source": [
104
+ "## 3. Configuration\n"
105
+ ]
106
+ },
107
+ {
108
+ "cell_type": "code",
109
+ "execution_count": null,
110
+ "metadata": {},
111
+ "outputs": [],
112
+ "source": [
113
+ "# Model configurations\n",
114
+ "MODELS_TO_QUANTIZE = {\n",
115
+ " \"router-gemma3-merged\": {\n",
116
+ " \"repo_id\": \"Alovestocode/router-gemma3-merged\",\n",
117
+ " \"output_repo\": \"Alovestocode/router-gemma3-merged-awq\", # Or keep same repo\n",
118
+ " \"model_type\": \"gemma\",\n",
119
+ " },\n",
120
+ " \"router-qwen3-32b-merged\": {\n",
121
+ " \"repo_id\": \"Alovestocode/router-qwen3-32b-merged\",\n",
122
+ " \"output_repo\": \"Alovestocode/router-qwen3-32b-merged-awq\", # Or keep same repo\n",
123
+ " \"model_type\": \"qwen\",\n",
124
+ " }\n",
125
+ "}\n",
126
+ "\n",
127
+ "# AWQ quantization config\n",
128
+ "AWQ_CONFIG = {\n",
129
+ " \"num_bits\": 4, # Weight bit-width\n",
130
+ " \"group_size\": 128, # Group size for weight quantization\n",
131
+ " \"zero_point\": True, # False would force symmetric quant (no zero-point)\n",
132
+ " \"strategy\": \"group\", # Quantize per group for best AWQ accuracy\n",
133
+ " \"targets\": [\"Linear\"], # Modules to quantize (QuantizationMixin default)\n",
134
+ " \"ignore\": [\"lm_head\"], # Skip final LM head\n",
135
+ " \"format\": \"pack-quantized\",\n",
136
+ " \"observer\": \"minmax\",\n",
137
+ " \"dynamic\": False,\n",
138
+ " \"version\": \"GEMM\", # Kept for logging/back-compat\n",
139
+ "}\n"
140
+ ]
141
+ },
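+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the `num_bits`/`group_size`/`zero_point` settings above concrete, here is a minimal sketch of asymmetric 4-bit group quantization for a single weight group. This is illustrative only — it is not llmcompressor's implementation, and `quantize_group` is a hypothetical helper:\n",
+ "\n",
+ "```python\n",
+ "import torch\n",
+ "\n",
+ "def quantize_group(w, num_bits=4):\n",
+ "    # Asymmetric quantization: one scale and one zero-point per group\n",
+ "    qmax = 2 ** num_bits - 1\n",
+ "    scale = (w.max() - w.min()).clamp(min=1e-8) / qmax\n",
+ "    zero_point = torch.round(-w.min() / scale)\n",
+ "    q = torch.clamp(torch.round(w / scale + zero_point), 0, qmax)\n",
+ "    return q, scale, zero_point  # dequantize as (q - zero_point) * scale\n",
+ "```\n",
+ "\n",
+ "With `group_size=128`, each row of a Linear weight is split into chunks of 128 values, and each chunk gets its own scale/zero-point pair.\n"
+ ]
+ },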
142
+ {
143
+ "cell_type": "code",
144
+ "execution_count": null,
145
+ "metadata": {},
146
+ "outputs": [],
147
+ "source": [
148
+ "# Helper Function: Build AWQ Modifier Config\n",
149
+ "\n",
150
+ "def build_awq_modifier_config(num_bits=4, group_size=128, zero_point=True):\n",
151
+ " \"\"\"Build proper AWQ quantization config using QuantizationScheme objects.\n",
152
+ " \n",
153
+ " This helper function creates the correct structure that AWQModifier expects,\n",
154
+ " using QuantizationScheme/QuantizationArgs objects instead of plain dicts.\n",
155
+ " \n",
156
+ " Args:\n",
157
+ " num_bits: Number of bits for quantization (default: 4)\n",
158
+ " group_size: Group size for quantization (default: 128)\n",
159
+ " zero_point: Whether to use zero-point quantization (default: True)\n",
160
+ " \n",
161
+ " Returns:\n",
162
+ "        QuantizationConfig object (or a plain dict on fallback) with config_groups\n",
163
+ " \"\"\"\n",
164
+ " try:\n",
165
+ " # Try to import QuantizationScheme and related classes\n",
166
+ " from compressed_tensors.quantization import (\n",
167
+ " QuantizationConfig,\n",
168
+ " QuantizationScheme,\n",
169
+ " QuantizationArgs\n",
170
+ " )\n",
171
+ " \n",
172
+ " # Create QuantizationArgs for weights\n",
173
+ " weights_args = QuantizationArgs(\n",
174
+ " num_bits=num_bits,\n",
175
+ " group_size=group_size,\n",
176
+ "            # QuantizationArgs has no zero_point field; asymmetric\n",
+ "            # quantization (symmetric=False) is what stores zero-points\n",
+ "            symmetric=not zero_point,\n",
178
+ " strategy=\"group\",\n",
179
+ " observer=\"minmax\",\n",
180
+ " type=\"int\",\n",
181
+ " dynamic=False\n",
182
+ " )\n",
183
+ " \n",
184
+ " # Create QuantizationScheme with targets and weights\n",
185
+ " scheme = QuantizationScheme(\n",
186
+ " targets=[\"Linear\"], # Target Linear layers\n",
187
+ " weights=weights_args,\n",
188
+ " input_activations=None,\n",
189
+ " output_activations=None\n",
190
+ " )\n",
191
+ " \n",
192
+ " # Create QuantizationConfig with config_groups\n",
193
+ " quant_config = QuantizationConfig(\n",
194
+ " config_groups={\"group_0\": scheme},\n",
195
+ " ignore=[\"lm_head\"],\n",
196
+ " quant_method=\"compressed-tensors\",\n",
197
+ " quantization_status=\"compressed\",\n",
198
+ " format=\"pack-quantized\"\n",
199
+ " )\n",
200
+ " \n",
201
+ " print(f\"✅ Built AWQ config using QuantizationScheme objects\")\n",
202
+ " return quant_config\n",
203
+ " \n",
204
+ " except ImportError as e:\n",
205
+ " # Fallback: If QuantizationScheme not available, try dict-based approach\n",
206
+ " print(f\"⚠️ QuantizationScheme not available: {e}\")\n",
207
+ " print(f\" → Falling back to dict-based config...\")\n",
208
+ " \n",
209
+ " # Return dict structure (may still work with some versions)\n",
210
+ " return {\n",
211
+ " \"config_groups\": {\n",
212
+ " \"group_0\": {\n",
213
+ " \"targets\": [\"Linear\"],\n",
214
+ " \"weights\": {\n",
215
+ " \"num_bits\": num_bits,\n",
216
+ " \"group_size\": group_size,\n",
217
+ "                    # zero_point is not a QuantizationArgs field; asymmetric\n",
+ "                    # quantization (symmetric=False) carries the zero-points\n",
+ "                    \"symmetric\": not zero_point,\n",
219
+ " \"strategy\": \"group\",\n",
220
+ " \"observer\": \"minmax\",\n",
221
+ " \"type\": \"int\",\n",
222
+ " \"dynamic\": False\n",
223
+ " },\n",
224
+ " \"input_activations\": None,\n",
225
+ " \"output_activations\": None\n",
226
+ " }\n",
227
+ " },\n",
228
+ " \"ignore\": [\"lm_head\"],\n",
229
+ " \"quant_method\": \"compressed-tensors\",\n",
230
+ " \"quantization_status\": \"compressed\",\n",
231
+ " \"format\": \"pack-quantized\"\n",
232
+ " }\n",
233
+ " except Exception as e:\n",
234
+ " print(f\"❌ Failed to build AWQ config: {e}\")\n",
235
+ " raise\n",
236
+ "\n"
237
+ ]
238
+ },
239
+ {
240
+ "cell_type": "markdown",
241
+ "metadata": {},
242
+ "source": [
243
+ "## 4. Quantization Function\n"
244
+ ]
245
+ },
246
+ {
247
+ "cell_type": "code",
248
+ "execution_count": null,
249
+ "metadata": {},
250
+ "outputs": [],
251
+ "source": [
252
+ "# LLM Compressor (vLLM native quantization tool)\n",
253
+ "# Import with error handling in case installation failed\n",
254
+ "try:\n",
255
+ " from llmcompressor import oneshot\n",
256
+ " # Correct import path: AWQModifier is in modifiers.awq, not modifiers.quantization\n",
257
+ " from llmcompressor.modifiers.awq import AWQModifier\n",
258
+ " from compressed_tensors.quantization import QuantizationScheme, QuantizationArgs\n",
259
+ " from compressed_tensors.quantization.quant_args import (\n",
260
+ " QuantizationStrategy,\n",
261
+ " QuantizationType,\n",
262
+ " )\n",
263
+ " LLM_COMPRESSOR_AVAILABLE = True\n",
264
+ " print(\"✅ LLM Compressor imported successfully\")\n",
265
+ "except ImportError as e:\n",
266
+ " print(f\"❌ Failed to import llmcompressor/quantization deps: {e}\")\n",
267
+ " print(\"Please ensure llmcompressor is installed:\")\n",
268
+ " print(\" %pip install llmcompressor\")\n",
269
+ " print(\" OR\")\n",
270
+ " print(\" %pip install git+https://github.com/vllm-project/llm-compressor.git\")\n",
271
+ " print(\"\\nNote: If import still fails, try:\")\n",
272
+ " print(\" %pip install --upgrade llmcompressor\")\n",
273
+ " LLM_COMPRESSOR_AVAILABLE = False\n",
274
+ " raise\n",
275
+ "\n",
276
+ "from transformers import AutoTokenizer\n",
277
+ "from huggingface_hub import HfApi, scan_cache_dir, upload_folder\n",
278
+ "import torch\n",
279
+ "import shutil\n",
280
+ "import gc\n",
281
+ "import os\n",
282
+ "\n",
283
+ "# huggingface_hub has no top-level delete_revisions function; recent versions\n",
+ "# expose it as a method on the HFCacheInfo returned by scan_cache_dir()\n",
+ "try:\n",
+ "    from huggingface_hub import HFCacheInfo\n",
+ "    DELETE_REVISIONS_AVAILABLE = hasattr(HFCacheInfo, \"delete_revisions\")\n",
+ "except ImportError:\n",
+ "    DELETE_REVISIONS_AVAILABLE = False\n",
+ "if not DELETE_REVISIONS_AVAILABLE:\n",
+ "    print(\"Note: HFCacheInfo.delete_revisions not available, will use manual cache cleanup\")\n",
291
+ "\n",
292
+ "def build_awq_modifier_config(awq_config: dict):\n",
293
+ " \"\"\"Create config_groups/ignore settings for AWQModifier.\"\"\"\n",
294
+ " if not isinstance(awq_config, dict):\n",
295
+ " raise ValueError(\"awq_config must be a dictionary of quantization settings\")\n",
296
+ "\n",
297
+ " def _get(key, *aliases, default=None):\n",
298
+ " for candidate in (key, *aliases):\n",
299
+ " if candidate in awq_config:\n",
300
+ " value = awq_config[candidate]\n",
301
+ " if value is not None:\n",
302
+ " return value\n",
303
+ " return default\n",
304
+ "\n",
305
+ " num_bits = _get(\"num_bits\", \"w_bit\", default=4)\n",
306
+ " group_size = _get(\"group_size\", \"q_group_size\", default=128)\n",
307
+ " zero_point = awq_config.get(\"zero_point\", True)\n",
308
+ " symmetric = awq_config.get(\"symmetric\")\n",
309
+ " if symmetric is None:\n",
310
+ " symmetric = not bool(zero_point)\n",
311
+ "\n",
312
+ " strategy = _get(\"strategy\", default=\"group\")\n",
313
+ " if isinstance(strategy, QuantizationStrategy):\n",
314
+ " quant_strategy = strategy\n",
315
+ " else:\n",
316
+ " quant_strategy = QuantizationStrategy(str(strategy).lower())\n",
317
+ "\n",
318
+ " qtype = awq_config.get(\"type\", QuantizationType.INT)\n",
319
+ " if isinstance(qtype, QuantizationType):\n",
320
+ " quant_type = qtype\n",
321
+ " else:\n",
322
+ " quant_type = QuantizationType(str(qtype).lower())\n",
323
+ "\n",
324
+ " weights_args = QuantizationArgs(\n",
325
+ " num_bits=num_bits,\n",
326
+ " group_size=group_size,\n",
327
+ " symmetric=symmetric,\n",
328
+ " strategy=quant_strategy,\n",
329
+ " type=quant_type,\n",
330
+ " dynamic=awq_config.get(\"dynamic\", False),\n",
331
+ " observer=awq_config.get(\"observer\", \"minmax\"),\n",
332
+ " )\n",
333
+ "\n",
334
+ " quant_scheme = QuantizationScheme(\n",
335
+ " targets=awq_config.get(\"targets\", [\"Linear\"]),\n",
336
+ " weights=weights_args,\n",
337
+ " input_activations=None,\n",
338
+ " output_activations=None,\n",
339
+ " format=awq_config.get(\"format\", \"pack-quantized\"),\n",
340
+ " )\n",
341
+ "\n",
342
+ " config_groups = {\"group_0\": quant_scheme}\n",
343
+ " ignore = awq_config.get(\"ignore\", [\"lm_head\"])\n",
344
+ " return config_groups, ignore\n",
345
+ "\n",
346
+ "def quantize_model_to_awq(\n",
347
+ " model_name: str,\n",
348
+ " repo_id: str,\n",
349
+ " output_repo: str,\n",
350
+ " model_type: str,\n",
351
+ " awq_config: dict,\n",
352
+ " calibration_dataset_size: int = 128\n",
353
+ "):\n",
354
+ " \"\"\"Quantize a model to AWQ format using LLM Compressor (vLLM native).\n",
355
+ " \n",
356
+ " Args:\n",
357
+ " model_name: Display name for the model\n",
358
+ " repo_id: Source Hugging Face repo ID\n",
359
+ " output_repo: Destination Hugging Face repo ID\n",
360
+ " model_type: Model type (gemma/qwen) for tokenizer selection\n",
361
+ " awq_config: AWQ quantization configuration\n",
362
+ " calibration_dataset_size: Number of calibration samples\n",
363
+ " \"\"\"\n",
364
+ " print(f\"\\n{'='*60}\")\n",
365
+ " print(f\"Quantizing {model_name} with LLM Compressor (vLLM native)\")\n",
366
+ " print(f\"Source: {repo_id}\")\n",
367
+ " print(f\"Destination: {output_repo}\")\n",
368
+ " print(f\"{'='*60}\\n\")\n",
369
+ " \n",
370
+ " # Check disk space before starting\n",
371
+ " free_space_before = check_disk_space()\n",
372
+ " if free_space_before < 30:\n",
373
+ " print(f\"⚠️ WARNING: Low disk space ({free_space_before:.2f} GB). Quantization may fail.\")\n",
374
+ " \n",
375
+ "    # Step 1: Create temporary output directory\n",
+ "    temp_output_dir = f\"./temp_{model_name.replace('-', '_')}_awq\"\n",
+ "    print(f\"[1/5] Creating temporary output directory: {temp_output_dir}\")\n",
379
+ " os.makedirs(temp_output_dir, exist_ok=True)\n",
380
+ " \n",
381
+ " # Step 2: Prepare calibration dataset\n",
382
+ "    print(f\"\\n[2/5] Preparing calibration dataset ({calibration_dataset_size} samples)...\")\n",
383
+ " \n",
384
+ " # Create calibration dataset for router agent\n",
385
+ " calibration_texts = [\n",
386
+ " \"You are the Router Agent coordinating Math, Code, and General-Search specialists.\",\n",
387
+ " \"Emit EXACTLY ONE strict JSON object with keys route_plan, route_rationale, expected_artifacts,\",\n",
388
+ " \"Solve a quadratic equation using Python programming.\",\n",
389
+ " \"Implement a binary search algorithm with proper error handling.\",\n",
390
+ " \"Explain the concept of gradient descent in machine learning.\",\n",
391
+ " \"Write a function to calculate the Fibonacci sequence recursively.\",\n",
392
+ " \"Design a REST API endpoint for user authentication.\",\n",
393
+ " \"Analyze the time complexity of merge sort algorithm.\",\n",
394
+ " ]\n",
395
+ " \n",
396
+ " # Repeat to reach desired size\n",
397
+ " while len(calibration_texts) < calibration_dataset_size:\n",
398
+ " calibration_texts.extend(calibration_texts[:calibration_dataset_size - len(calibration_texts)])\n",
399
+ " \n",
400
+ " calibration_texts = calibration_texts[:calibration_dataset_size]\n",
401
+ " print(f\"✅ Calibration dataset prepared: {len(calibration_texts)} samples\")\n",
402
+ " \n",
403
+ " # Step 3: Quantize model using LLM Compressor\n",
404
+ "    print(f\"\\n[3/5] Quantizing model to AWQ with LLM Compressor (this may take 30-60 minutes)...\")\n",
405
+ " print(f\"Config: {awq_config}\")\n",
406
+ " print(\"⚠️ LLM Compressor will load the model, quantize it, and save to local directory\")\n",
407
+ " \n",
408
+ " if not LLM_COMPRESSOR_AVAILABLE:\n",
409
+ " raise ImportError(\"LLM Compressor is not available. Please install it first.\")\n",
410
+ " \n",
411
+ " try:\n",
412
+ " # LLM Compressor's oneshot function handles everything:\n",
413
+ " # - Loading the model\n",
414
+ " # - Quantization with calibration data\n",
415
+ " # - Saving quantized model\n",
416
+ " print(f\" → Starting quantization with LLM Compressor...\")\n",
417
+ " print(f\" → This may take 30-60 minutes depending on model size...\")\n",
418
+ " \n",
419
+ " print(f\" → Creating QuantizationScheme for AWQModifier...\")\n",
420
+ " config_groups, ignore_modules = build_awq_modifier_config(awq_config)\n",
421
+ " first_group = next(iter(config_groups.values()))\n",
422
+ " bits = first_group.weights.num_bits if first_group.weights else \"?\"\n",
423
+ " group_sz = first_group.weights.group_size if first_group.weights else \"?\"\n",
424
+ " print(f\" ✅ AWQ config ready ({bits}-bit, group size {group_sz})\")\n",
425
+ " print(f\" → Creating AWQModifier with structured config...\")\n",
426
+ " modifiers = [\n",
427
+ " AWQModifier(\n",
428
+ " config_groups=config_groups,\n",
429
+ " ignore=ignore_modules,\n",
430
+ " )\n",
431
+ " ]\n",
432
+ " print(f\" ✅ AWQModifier created successfully\")\n",
433
+ " \n",
434
+ " # Call oneshot with the modifier\n",
435
+ " print(f\" → Starting quantization process...\")\n",
436
+ " oneshot(\n",
437
+ " model=repo_id,\n",
438
+ " output_dir=temp_output_dir,\n",
439
+ " modifiers=modifiers,\n",
440
+ " token=os.environ.get(\"HF_TOKEN\"),\n",
441
+ " # Calibration data: list of strings\n",
442
+ " calibration_data=calibration_texts[:min(calibration_dataset_size, 128)]\n",
443
+ " )\n",
444
+ " \n",
445
+ " print(f\"✅ Model quantized to AWQ successfully\")\n",
446
+ " except Exception as e:\n",
447
+ " print(f\"❌ Quantization failed: {e}\")\n",
448
+ " print(f\"\\nTroubleshooting:\")\n",
449
+ " print(f\"1. Ensure llmcompressor is installed: %pip install llmcompressor\")\n",
450
+ " print(f\"2. Or install from GitHub: %pip install git+https://github.com/vllm-project/llm-compressor.git\")\n",
451
+ " print(f\"3. Check that you have sufficient GPU memory (40GB+ recommended)\")\n",
452
+ " import traceback\n",
453
+ " traceback.print_exc()\n",
454
+ " raise\n",
455
+ " \n",
456
+ " # Step 4: Upload to Hugging Face\n",
457
+ "    print(f\"\\n[4/5] Uploading quantized model to {output_repo}...\")\n",
458
+ " \n",
459
+ " # Create repo if it doesn't exist\n",
460
+ " api = HfApi()\n",
461
+ " try:\n",
462
+ " api.create_repo(\n",
463
+ " repo_id=output_repo,\n",
464
+ " repo_type=\"model\",\n",
465
+ " exist_ok=True,\n",
466
+ " token=os.environ.get(\"HF_TOKEN\")\n",
467
+ " )\n",
468
+ " print(f\"✅ Repository ready: {output_repo}\")\n",
469
+ " except Exception as e:\n",
470
+ " print(f\"Note: Repo may already exist: {e}\")\n",
471
+ " \n",
472
+ " # Upload the quantized model directory\n",
473
+ " try:\n",
474
+ " upload_folder(\n",
475
+ " folder_path=temp_output_dir,\n",
476
+ " repo_id=output_repo,\n",
477
+ " repo_type=\"model\",\n",
478
+ " token=os.environ.get(\"HF_TOKEN\"),\n",
479
+ " ignore_patterns=[\"*.pt\", \"*.bin\"] # Only upload safetensors\n",
480
+ " )\n",
481
+ " print(f\"✅ Quantized model uploaded to {output_repo}\")\n",
482
+ " except Exception as e:\n",
483
+ " print(f\"❌ Upload failed: {e}\")\n",
484
+ " import traceback\n",
485
+ " traceback.print_exc()\n",
486
+ " raise\n",
487
+ " \n",
488
+ " # Step 5: Clean up to free disk space (critical for Colab)\n",
489
+ " print(f\"\\n[5/5] Cleaning up local files to free disk space...\")\n",
490
+ " \n",
491
+ " # Delete temporary output directory\n",
492
+ " try:\n",
493
+ " import shutil\n",
494
+ " shutil.rmtree(temp_output_dir)\n",
495
+ " print(f\" ✅ Deleted temporary directory: {temp_output_dir}\")\n",
496
+ " except Exception as e:\n",
497
+ " print(f\" ⚠️ Could not delete temp directory: {e}\")\n",
498
+ " \n",
499
+ " # Free GPU memory\n",
500
+ " torch.cuda.empty_cache()\n",
501
+ " gc.collect()\n",
502
+ " \n",
503
+ " # Clear Hugging Face cache for the source model (frees ~50-70GB)\n",
504
+ " print(f\" → Clearing Hugging Face cache for {repo_id}...\")\n",
505
+ " try:\n",
506
+ " cache_info = scan_cache_dir()\n",
507
+ " # Find and delete revisions for the source model\n",
508
+ " revisions_to_delete = []\n",
509
+ "        # HFCacheInfo lists cached repos; each repo object holds its revisions\n",
+ "        for repo in cache_info.repos:\n",
+ "            if repo.repo_id == repo_id:\n",
+ "                revisions_to_delete.extend(repo.revisions)\n",
512
+ " \n",
513
+ " if revisions_to_delete:\n",
514
+ " if DELETE_REVISIONS_AVAILABLE:\n",
515
+ "                # delete_revisions takes commit hashes and returns a deletion\n",
+ "                # strategy that must be executed\n",
+ "                cache_info.delete_revisions(\n",
+ "                    *[rev.commit_hash for rev in revisions_to_delete]\n",
+ "                ).execute()\n",
517
+ " print(f\" ✅ Deleted {len(revisions_to_delete)} cached revision(s) for {repo_id}\")\n",
518
+ " else:\n",
519
+ " # Alternative: Delete cache directories manually\n",
520
+ " deleted_count = 0\n",
521
+ " for revision in revisions_to_delete:\n",
522
+ " try:\n",
523
+ " # Get the cache directory path\n",
524
+ " cache_path = revision.snapshot_path if hasattr(revision, 'snapshot_path') else None\n",
525
+ " if cache_path and os.path.exists(cache_path):\n",
526
+ " shutil.rmtree(cache_path)\n",
527
+ " deleted_count += 1\n",
528
+ " except Exception as e:\n",
529
+ "                    print(f\"  ⚠️ Could not delete revision {revision.commit_hash}: {e}\")\n",
530
+ " \n",
531
+ " if deleted_count > 0:\n",
532
+ " print(f\" ✅ Deleted {deleted_count} cached revision(s) for {repo_id}\")\n",
533
+ " else:\n",
534
+ " print(f\" ℹ️ Found {len(revisions_to_delete)} cached revision(s) but couldn't delete them\")\n",
535
+ " print(f\" Try manually: huggingface-cli scan-cache --dir ~/.cache/huggingface\")\n",
536
+ " else:\n",
537
+ " print(f\" ℹ️ No cached revisions found for {repo_id}\")\n",
538
+ " except Exception as e:\n",
539
+ " print(f\" ⚠️ Cache cleanup warning: {e} (continuing...)\")\n",
540
+ " print(f\" You can manually clean cache with: huggingface-cli scan-cache\")\n",
541
+ " \n",
542
+ " # Check disk space after cleanup\n",
543
+ " free_space_after = check_disk_space()\n",
544
+ " print(f\"\\n✅ Cleanup complete! Free space: {free_space_after:.2f} GB\")\n",
545
+ " \n",
546
+ " print(f\"\\n✅ {model_name} quantization complete!\")\n",
547
+ " print(f\"Model available at: https://huggingface.co/{output_repo}\")\n",
548
+ " print(f\"💾 Local model files deleted to save disk space\")\n",
549
+ " print(f\"🚀 Model is ready for vLLM inference with optimal performance!\")\n"
550
+ ]
551
+ },
552
+ {
553
+ "cell_type": "markdown",
554
+ "metadata": {},
555
+ "source": [
+ "## 5. Quantize Router-Gemma3-27B-Merged\n"
+ ]
556
+ },
557
+ {
558
+ "cell_type": "code",
559
+ "execution_count": null,
560
+ "metadata": {},
561
+ "outputs": [],
562
+ "source": [
563
+ "quantize_model_to_awq(\n",
564
+ " model_name=\"Router-Gemma3-27B\",\n",
565
+ " repo_id=MODELS_TO_QUANTIZE[\"router-gemma3-merged\"][\"repo_id\"],\n",
566
+ " output_repo=MODELS_TO_QUANTIZE[\"router-gemma3-merged\"][\"output_repo\"],\n",
567
+ " model_type=MODELS_TO_QUANTIZE[\"router-gemma3-merged\"][\"model_type\"],\n",
568
+ " awq_config=AWQ_CONFIG,\n",
569
+ " calibration_dataset_size=128\n",
570
+ ")\n"
571
+ ]
572
+ },
573
+ {
574
+ "cell_type": "markdown",
575
+ "metadata": {},
576
+ "source": [
577
+ "## 6. Quantize Router-Qwen3-32B-Merged\n"
578
+ ]
579
+ },
580
+ {
581
+ "cell_type": "code",
582
+ "execution_count": null,
583
+ "metadata": {},
584
+ "outputs": [],
585
+ "source": [
586
+ "quantize_model_to_awq(\n",
587
+ " model_name=\"Router-Qwen3-32B\",\n",
588
+ " repo_id=MODELS_TO_QUANTIZE[\"router-qwen3-32b-merged\"][\"repo_id\"],\n",
589
+ " output_repo=MODELS_TO_QUANTIZE[\"router-qwen3-32b-merged\"][\"output_repo\"],\n",
590
+ " model_type=MODELS_TO_QUANTIZE[\"router-qwen3-32b-merged\"][\"model_type\"],\n",
591
+ " awq_config=AWQ_CONFIG,\n",
592
+ " calibration_dataset_size=128\n",
593
+ ")\n"
594
+ ]
595
+ },
596
+ {
597
+ "cell_type": "markdown",
598
+ "metadata": {},
599
+ "source": [
600
+ "## 7. Verify Quantized Models\n"
601
+ ]
602
+ },
603
+ {
604
+ "cell_type": "code",
605
+ "execution_count": null,
606
+ "metadata": {},
607
+ "outputs": [],
608
+ "source": [
609
+ "# Verify quantized models with vLLM (recommended) or Transformers\n",
610
+ "from transformers import AutoTokenizer\n",
611
+ "\n",
612
+ "def verify_awq_model_vllm(repo_id: str):\n",
613
+ " \"\"\"Verify AWQ model can be loaded with vLLM (recommended).\"\"\"\n",
614
+ " print(f\"\\nVerifying {repo_id} with vLLM...\")\n",
615
+ " \n",
616
+ " try:\n",
617
+ " # Try importing vLLM\n",
618
+ " try:\n",
619
+ " from vllm import LLM, SamplingParams\n",
620
+ " except ImportError:\n",
621
+ " print(\"⚠️ vLLM not available, skipping vLLM verification\")\n",
622
+ " return False\n",
623
+ " \n",
624
+ "        # Load with vLLM; llmcompressor saves compressed-tensors format, which\n",
+ "        # vLLM auto-detects from the model config, so no explicit\n",
+ "        # quantization argument is needed\n",
+ "        llm = LLM(\n",
+ "            model=repo_id,\n",
628
+ " trust_remote_code=True,\n",
629
+ " token=os.environ.get(\"HF_TOKEN\"),\n",
630
+ " gpu_memory_utilization=0.5 # Lower for verification\n",
631
+ " )\n",
632
+ " \n",
633
+ " # Test generation\n",
634
+ " sampling_params = SamplingParams(\n",
635
+ " temperature=0.0,\n",
636
+ " max_tokens=10\n",
637
+ " )\n",
638
+ " \n",
639
+ " test_prompt = \"You are the Router Agent. Test prompt.\"\n",
640
+ " outputs = llm.generate([test_prompt], sampling_params)\n",
641
+ " \n",
642
+ " generated_text = outputs[0].outputs[0].text\n",
643
+ " print(f\"✅ vLLM loads and generates correctly\")\n",
644
+ " print(f\"Generated: {generated_text[:100]}...\")\n",
645
+ " \n",
646
+ " del llm\n",
647
+ " torch.cuda.empty_cache()\n",
648
+ " \n",
649
+ " return True\n",
650
+ " except Exception as e:\n",
651
+ " print(f\"❌ vLLM verification failed: {e}\")\n",
652
+ " import traceback\n",
653
+ " traceback.print_exc()\n",
654
+ " return False\n",
655
+ "\n",
656
+ "def verify_awq_model_transformers(repo_id: str):\n",
657
+ " \"\"\"Verify AWQ model can be loaded with Transformers (fallback).\"\"\"\n",
658
+ " print(f\"\\nVerifying {repo_id} with Transformers...\")\n",
659
+ " \n",
660
+ " try:\n",
661
+ " # Load tokenizer\n",
662
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
663
+ " repo_id,\n",
664
+ " trust_remote_code=True,\n",
665
+ " token=os.environ.get(\"HF_TOKEN\")\n",
666
+ " )\n",
667
+ " \n",
668
+ "        # Try loading with AutoAWQ (if available); note AutoAWQ expects its own\n",
+ "        # checkpoint format, so llmcompressor's compressed-tensors output may not load\n",
669
+ " try:\n",
670
+ " from awq import AutoAWQForCausalLM\n",
671
+ " model = AutoAWQForCausalLM.from_quantized(\n",
672
+ " repo_id,\n",
673
+ " fuse_layers=True,\n",
674
+ " trust_remote_code=True,\n",
675
+ " device_map=\"auto\",\n",
676
+ " token=os.environ.get(\"HF_TOKEN\")\n",
677
+ " )\n",
678
+ " \n",
679
+ " # Test generation\n",
680
+ " test_prompt = \"You are the Router Agent. Test prompt.\"\n",
681
+ " inputs = tokenizer(test_prompt, return_tensors=\"pt\").to(model.device)\n",
682
+ " \n",
683
+ " with torch.inference_mode():\n",
684
+ " outputs = model.generate(\n",
685
+ " **inputs,\n",
686
+ " max_new_tokens=10,\n",
687
+ " do_sample=False\n",
688
+ " )\n",
689
+ " \n",
690
+ " generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
691
+ " print(f\"✅ Transformers loads and generates correctly\")\n",
692
+ " print(f\"Generated: {generated_text[:100]}...\")\n",
693
+ " \n",
694
+ " del model\n",
695
+ " del tokenizer\n",
696
+ " torch.cuda.empty_cache()\n",
697
+ " \n",
698
+ " return True\n",
699
+ " except ImportError:\n",
700
+ " print(\"⚠️ AutoAWQ not available, skipping Transformers verification\")\n",
701
+ " return False\n",
702
+ " except Exception as e:\n",
703
+ " print(f\"❌ Transformers verification failed: {e}\")\n",
704
+ " import traceback\n",
705
+ " traceback.print_exc()\n",
706
+ " return False\n",
707
+ "\n",
708
+ "# Verify both models (prefer vLLM)\n",
709
+ "for model_key, model_info in MODELS_TO_QUANTIZE.items():\n",
710
+ " print(f\"\\n{'='*60}\")\n",
711
+ " print(f\"Verifying {model_key}\")\n",
712
+ " print(f\"{'='*60}\")\n",
713
+ " \n",
714
+ " # Try vLLM first (recommended)\n",
715
+ " vllm_ok = verify_awq_model_vllm(model_info[\"output_repo\"])\n",
716
+ " \n",
717
+ " # Fallback to Transformers if vLLM not available\n",
718
+ " if not vllm_ok:\n",
719
+ " verify_awq_model_transformers(model_info[\"output_repo\"])\n"
720
+ ]
721
+ },
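+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once verification passes, the quantized repos can be served with vLLM's OpenAI-compatible server. A minimal sketch (assumes vLLM is installed in the serving environment; the flags shown are common defaults — adjust for your GPU):\n",
+ "\n",
+ "```bash\n",
+ "vllm serve Alovestocode/router-gemma3-merged-awq \\\n",
+ "    --gpu-memory-utilization 0.9 \\\n",
+ "    --max-model-len 4096\n",
+ "```\n"
+ ]
+ },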
722
+ {
723
+ "cell_type": "markdown",
724
+ "metadata": {},
725
+ "source": [
726
+ "\n"
727
+ ]
728
+ },
729
+ {
730
+ "cell_type": "code",
731
+ "execution_count": null,
732
+ "metadata": {},
733
+ "outputs": [],
734
+ "source": [
735
+ "\n"
736
+ ]
737
+ },
738
+ {
739
+ "cell_type": "markdown",
740
+ "metadata": {},
741
+ "source": [
742
+ "## Notes\n",
743
+ "\n",
744
+ "- **GPU Required**: This quantization requires a GPU with at least 40GB VRAM (A100/H100 recommended)\n",
745
+ "- **Time**: Each model takes approximately 30-60 minutes to quantize\n",
746
+ "- **Disk Space**: \n",
747
+ " - Colab has limited disk space (~80GB free)\n",
748
+ " - Each source model is ~50-70GB (BF16)\n",
749
+ " - Quantized models are ~15-20GB (AWQ 4-bit)\n",
750
+ " - **The notebook automatically deletes source models after quantization to save space**\n",
751
+ "- **Cleanup**: After each model is quantized and uploaded:\n",
752
+ " - GPU memory is freed\n",
753
+ " - Hugging Face cache for source model is cleared\n",
754
+ " - Disk space is checked before/after\n",
755
+ "- **Output Repos**: Models are saved to new repos with `-awq` suffix\n",
756
+ "- **Usage**: After quantization, update your `app.py` to use the AWQ repos:\n",
757
+ " ```python\n",
758
+ " MODELS = {\n",
759
+ " \"Router-Gemma3-27B-AWQ\": {\n",
760
+ " \"repo_id\": \"Alovestocode/router-gemma3-merged-awq\",\n",
761
+ " \"quantization\": \"awq\"\n",
762
+ " },\n",
763
+ " \"Router-Qwen3-32B-AWQ\": {\n",
764
+ " \"repo_id\": \"Alovestocode/router-qwen3-32b-merged-awq\",\n",
765
+ " \"quantization\": \"awq\"\n",
766
+ " }\n",
767
+ " }\n",
768
+ " ```\n"
769
+ ]
770
+ }
771
+ ],
772
+ "metadata": {
773
+ "language_info": {
774
+ "name": "python"
775
+ }
776
+ },
777
+ "nbformat": 4,
778
+ "nbformat_minor": 2
779
  }