base_model:
- MiniMaxAI/MiniMax-M2
---

![](https://cdn-uploads.huggingface.co/production/uploads/68c309bvu3dd9f364a0a92/qnnBPD8xCMsGRYrCHBWbW.jpeg)

# THRIFT — Targeted Reduction for Inference and Fine-Tuning

A performance-optimized variant of the base model that delivers faster responses.

## TLDR

We, the over-caffeinated researchers at VibeStud.io, wanted to create a 50% pruned version of the SOTA MiniMax M2 best suited for local/air-gapped coding. With this version we achieved ~25% pruning. A 50% pruned version is under development, while a not-so-sucky team of ours is working on a 50% pruned version of Kimi K2 Thinking. We’re writing the paper and expanding the evaluation set to substantiate the results. Check back later, cheers!

## Why it’s useful

* **Lower latency:** Snappier responses for interactive apps and chatbots.
* **Smaller memory footprint:** Runs on cheaper GPUs or with fewer resources per replica.
* **Higher throughput:** Serve more concurrent users at the same cost.
* **Deployment-friendly:** Drop-in replacement for the base model in most inference stacks.
* **Adaptable:** Supports light fine-tuning to match your domain and style guidelines.

## Intended use

* General chat and coding assistance
* Enterprise assistants with strict latency/VRAM budgets
* Batch or realtime serving in cloud and on-prem environments
* Edge or cost-sensitive deployments where efficiency matters

## When to use it

* You’re constrained by GPU memory or need shorter response times
* You want to increase QPS without scaling infrastructure
* You need a model that is “good enough” for most tasks at a better cost profile

---

**Models Under Evaluation**

| Model                        | Type                 |
| :--------------------------- | :------------------- |
| ModelCloud/MiniMax-M2-BF16   | Base Model           |
| VibeStudio/MiniMax-M2-THRIFT | Compressed/Optimized |

**Evaluation Dates:** November 7–9, 2025
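
The multiple-choice suite below was run with the lm-evaluation-harness. As a rough reproduction sketch, here is the harness's Python API with the task list from the tables below; the harness version, few-shot counts, and batch size behind the reported numbers are assumptions on our part, not published configs:

```
# Reproduction sketch (assumed settings): pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=VibeStudio/MiniMax-M2-THRIFT,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag",
           "mmlu", "openbookqa", "rte", "winogrande"],
    batch_size="auto",
)

# Print the metrics reported for each task.
for task, metrics in results["results"].items():
    print(task, metrics)
```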

## 📊 Results Comparison

### 1) Multiple Choice Q&A (lm-eval)

**Overall MMLU Performance**

| Model              | MMLU Overall | Humanities | STEM   | Social Sciences | Other  |
| :----------------- | -----------: | ---------: | -----: | --------------: | -----: |
| MiniMax-M2-BF16    | **83.16%**   | 77.45%     | 80.91% | **90.02%**      | 87.29% |
| MiniMax-M2-THRIFT  | **77.72%**   | 70.14%     | 77.61% | 86.84%          | 80.27% |
| **Δ (Difference)** | **-5.44%**   | -7.31%     | -3.30% | -3.18%          | -7.02% |

**Individual Task Performance**

| Task                     | BF16 (Base) | THRIFT-BF16 | Difference    |
| :----------------------- | ----------: | ----------: | ------------: |
| arc_challenge (acc_norm) | 73.21%      | 61.01%      | -12.20% ⬇️    |
| arc_easy                 | 88.30%      | 83.08%      | -5.22% ⬇️     |
| boolq                    | 87.95%      | 84.95%      | -3.00% ⬇️     |
| hellaswag (acc_norm)     | 83.00%      | 77.09%      | -5.91% ⬇️     |
| mmlu                     | 83.16%      | 77.72%      | -5.44% ⬇️     |
| openbookqa (acc_norm)    | 48.60%      | 43.00%      | -5.60% ⬇️     |
| rte                      | 75.45%      | **80.14%**  | **+4.69% ⬆️** |
| winogrande               | 76.48%      | 74.90%      | -1.58% ⬇️     |

**Average Accuracy Drop:** **-4.28%**
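
This figure is just the arithmetic mean of the eight per-task deltas above, which is easy to verify:

```
# Per-task deltas (THRIFT minus BF16, percentage points) from the table above.
deltas = {
    "arc_challenge": -12.20, "arc_easy": -5.22, "boolq": -3.00,
    "hellaswag": -5.91, "mmlu": -5.44, "openbookqa": -5.60,
    "rte": 4.69, "winogrande": -1.58,
}

average = sum(deltas.values()) / len(deltas)
print(f"Average accuracy drop: {average:.2f}%")  # -4.28%
```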

### 2) Code Generation (EvalPlus)

**MBPP Results (Python, 378 problems)**

| Model              | MBPP (base) | MBPP+ (extended) | Average   |
| :----------------- | ----------: | ---------------: | --------: |
| MiniMax-M2-BF16    | **73.8%**   | **64.0%**        | 68.9%     |
| MiniMax-M2-THRIFT  | **70.1%**   | **60.1%**        | 65.1%     |
| **Δ (Difference)** | **-3.7%**   | **-3.9%**        | **-3.8%** |

**HumanEval Results (164 problems)**

| Model              | HumanEval (base) | HumanEval+ (extended) | Average   |
| :----------------- | ---------------: | --------------------: | --------: |
| MiniMax-M2-BF16    | **72.6%**        | **71.3%**             | 72.0%     |
| MiniMax-M2-THRIFT  | **65.2%**        | **63.4%**             | 64.3%     |
| **Δ (Difference)** | **-7.4%**        | **-7.9%**             | **-7.7%** |

### 3) Math Benchmarks

**GSM8K Results**

| Model              | Accuracy      | Problems | Status               |
| :----------------- | ------------: | -------: | :------------------- |
| MiniMax-M2-BF16    | **92.72%**    | 1,319    | ✅ Complete          |
| MiniMax-M2-THRIFT  | **93.25%**    | 1,319    | ✅ Complete          |
| **Δ (Difference)** | **+0.53% ⬆️** | -        | **THRIFT Better!** ✨ |

**MATH-500 Results**

| Model             | Overall   | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Status          |
| :---------------- | --------: | ------: | ------: | ------: | ------: | ------: | :-------------- |
| MiniMax-M2-BF16   | **87.2%** | 90.7%   | 95.56%  | 82.86%  | 85.16%  | 85.82%  | ✅ Complete     |
| MiniMax-M2-THRIFT | 🔄 —      | 🔄      | 🔄      | 🔄      | 🔄      | 🔄      | **In Progress** |

### 4) LiveCodeBench (Live Coding Problems)

| Model                 | pass@1        | Problems | Status               |
| :-------------------- | ------------: | -------: | :------------------- |
| **MiniMax-M2-BF16**   | **35.71%**    | 182      | ✅ Complete          |
| **MiniMax-M2-THRIFT** | **36.81%**    | 182      | ✅ Complete          |
| **Δ (Difference)**    | **+1.10% ⬆️** | -        | **THRIFT Better!** ✨ |
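
For reference, pass@1 here (and in the EvalPlus tables above) is the standard unbiased pass@k estimator from the HumanEval paper; a minimal sketch, assuming n samples per problem of which c pass (the exact sampling setup for these runs is not stated in this card):

```
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k drawn samples passes, given c of n samples pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding (n = 1), pass@1 reduces to the fraction of problems
# solved: 36.81% of 182 problems is ~67 solved for MiniMax-M2-THRIFT.
print(pass_at_k(1, 1, 1), pass_at_k(1, 0, 1))  # 1.0, 0.0
```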

---

## 📈 Analysis (Updated)

**Highlights**

* **THRIFT wins** on **GSM8K (+0.53%)** and **LiveCodeBench (+1.10%)**, and on **RTE (+4.69%)**.
* **BF16 leads** on broad **MMLU**, **HumanEval**, **MBPP**, and tasks like **arc_challenge**.

**Compression Trade-off**

* Average knowledge-task drop for THRIFT is ~**4–5%**, with **math preserved or slightly improved**.

**Subject Breakdown (MMLU)**

| Category               | BF16 (Base) | THRIFT-BF16 | Difference | Status             |
| :--------------------- | ----------: | ----------: | ---------: | :----------------- |
| High School Government | 97.93%      | 94.82%      | -3.11%     | ✅ Still Excellent |
| High School Psychology | 95.41%      | 93.58%      | -1.83%     | ✅ Well Preserved  |
| Marketing              | 95.73%      | 91.88%      | -3.85%     | ✅ Good            |
| Professional Medicine  | 92.28%      | 79.78%      | -12.50%    | ⚠️ Notable Drop    |
| Clinical Knowledge     | 92.83%      | 85.66%      | -7.17%     | ⚠️ Moderate Drop   |

---

## **sglang Deployment with Python**
It is recommended to use a virtual environment (such as **venv**, **conda**, or **uv**) to avoid dependency conflicts.

Once the server is running, you can send an OpenAI-compatible chat completion request (the prompt below is illustrative):

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "VibeStudio/MiniMax-M2-THRIFT",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
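
The same endpoint can also be called from Python. A minimal sketch using the `openai` client; the model name is assumed to match this repo and should agree with whatever path the server was launched with:

```
from openai import OpenAI

# sglang exposes an OpenAI-compatible API; no real key is needed locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="VibeStudio/MiniMax-M2-THRIFT",  # assumed served model name
    messages=[{"role": "user",
               "content": "Write a function that reverses a string."}],
    temperature=0.2,
)
print(response.choices[0].message.content)
```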

## Benchmarks

See the tables above for the latest **MMLU**, **MBPP**, **HumanEval**, **GSM8K**, **MATH-500**, and **LiveCodeBench** results (updated **November 9, 2025**).

## Research paper

@article{yang2025wanda++,
  title = {Wanda++: Pruning Large Language Models via Regional Gradients},
  author = {Yang, Yifan and Zhen, Kai and Ganesh, Bhavana and Galstyan, Aram and Huybrechts, Goeric and Müller, Markus and Kübler, Jonas M. and Swaminathan, Rupak Vignesh and Mouchtaris, Athanasios and Bodapati, Sravan Babu and Susanj, Nathan and Zhang, Zheng and FitzGerald, Jack and Kumar, Abhishek},
  journal = {arXiv preprint arXiv:2503.04992},
  year = {2025},
  eprinttype = {arXiv},

@article{yang2023wanda,
  title = {A Simple and Effective Pruning Approach for Large Language Models},
  author = {Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, J. Zico},
  journal = {arXiv preprint arXiv:2306.11695},
  year = {2023},
  eprinttype = {arXiv},