---
license: apache-2.0
base_model: Kwaipilot/KAT-Dev-72B-Exp
pipeline_tag: text-generation
library_name: llama.cpp
language:
- en
tags:
- gguf
- quantized
- ollama
- coding
- llama-cpp
- text-generation
quantized_by: richardyoung
---
<div align="center">

# 💻 KAT-Dev 72B - GGUF

### Enterprise-Grade 72B Coding Model, Optimized for Local Inference

[llama.cpp](https://github.com/ggerganov/llama.cpp) | [Hugging Face](https://huggingface.co/richardyoung/kat-dev-72b) | [Ollama](https://ollama.ai/) | [License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)

**[Original Model](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)** | **[Ollama Registry](https://ollama.com/richardyoung/kat-dev-72b)** | **[llama.cpp](https://github.com/ggerganov/llama.cpp)**

---

</div>
## 📖 What is This?

This is **KAT-Dev 72B**, a powerful coding model with 72 billion parameters, quantized to **GGUF format** for efficient local inference. Perfect for developers who want enterprise-grade code assistance running entirely on their own hardware with Ollama or llama.cpp!

### ✨ Why You'll Love It

- 💻 **Coding-Focused** - Optimized specifically for programming tasks
- 🧠 **72B Parameters** - Large enough for complex reasoning and refactoring
- ⚡ **Local Inference** - Run entirely on your machine, no API calls
- 🔒 **Privacy First** - Your code never leaves your computer
- 🎯 **Multiple Quantizations** - Choose your speed/quality trade-off
- 🚀 **Ollama Ready** - One command to start coding
- 🔧 **llama.cpp Compatible** - Works with your favorite tools
## 🎯 Quick Start
### Option 1: Ollama (Easiest!)
Pull and run directly from the Ollama registry:
```bash
# Recommended: IQ3_M (best balance)
ollama run richardyoung/kat-dev-72b:iq3_m
# Other variants
ollama run richardyoung/kat-dev-72b:iq4_xs # Better quality
ollama run richardyoung/kat-dev-72b:iq2_m # Faster, smaller
ollama run richardyoung/kat-dev-72b:iq2_xxs # Most compact
```
That's it! Start asking coding questions! 🎉
### Option 2: Build from Modelfile
Download this repo and build locally:
```bash
# Clone or download the modelfiles
ollama create kat-dev-72b-iq3_m -f modelfiles/kat-dev-72b--iq3_m.Modelfile
ollama run kat-dev-72b-iq3_m
```
### Option 3: llama.cpp
Use with llama.cpp directly:
```bash
# Download the GGUF file (replace variant as needed)
huggingface-cli download richardyoung/kat-dev-72b kat-dev-72b-iq3_m.gguf --local-dir ./
# Run with llama.cpp
./llama-cli -m kat-dev-72b-iq3_m.gguf -p "Write a Python function to"
```
## 💻 System Requirements
| Component | Minimum | Recommended |
|-----------|---------|-------------|
| **RAM** | 32 GB | 64 GB+ |
| **Storage** | 40 GB free | 50+ GB free |
| **CPU** | Modern 8-core | 16+ cores |
| **GPU** | Optional (CPU-only works!) | Metal/CUDA for acceleration |
| **OS** | macOS, Linux, Windows | Latest versions |
> 💡 **Tip:** Larger quantizations (IQ4_XS) need more RAM but produce better code. Smaller ones (IQ2_XXS) are faster but less precise.
## 🎨 Available Quantizations
Choose the right balance for your needs:
| Quantization | Size | Quality | Speed | RAM Usage | Best For |
|--------------|------|---------|-------|-----------|----------|
| **IQ4_XS** | 37 GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ~50 GB | Production code, complex refactoring |
| **IQ3_M** (recommended) | 33 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ~40 GB | Daily development, best balance |
| **IQ2_M** | 27 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ~35 GB | Quick prototyping, fast iteration |
| **IQ2_XXS** | 24 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | ~30 GB | Testing, very constrained systems |
### Variant Details
| Variant | Size | Blob SHA256 |
|---------|------|-------------|
| `iq4_xs` | 36.98 GB | `c4cb9c6e...` |
| `iq3_m` | 33.07 GB | `14d07184...` |
| `iq2_m` | 27.32 GB | `cbe26a3c...` |
| `iq2_xxs` | 23.74 GB | `a49c7526...` |
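If you're unsure which variant fits your machine, the approximate RAM figures above translate into a simple selection rule. A minimal sketch (the `pick_variant` helper and its thresholds are illustrative, not part of this repo):

```python
# Illustrative helper: choose the highest-quality variant that fits the
# approximate RAM budget from the table above (the "~RAM Usage" column,
# not exact measurements).
VARIANTS = [           # ordered best quality first
    ("iq4_xs", 50),    # ~50 GB RAM
    ("iq3_m", 40),     # ~40 GB RAM
    ("iq2_m", 35),     # ~35 GB RAM
    ("iq2_xxs", 30),   # ~30 GB RAM
]

def pick_variant(ram_gb):
    """Return the best variant tag that fits in ram_gb, or None."""
    for tag, needed in VARIANTS:
        if ram_gb >= needed:
            return tag
    return None

print(pick_variant(64))   # enough for iq4_xs
print(pick_variant(36))   # falls back to iq2_m
```

Anything under ~30 GB of RAM won't comfortably hold even the smallest variant; see the system requirements above.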
## 📚 Usage Examples
### Code Generation
```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Write a Python function to validate email addresses with regex"
```
### Code Explanation
```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Explain this code: def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)"
```
### Debugging Help
```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Why does this Python code raise a KeyError?"
```
### Refactoring
```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Refactor this JavaScript function to use async/await instead of callbacks"
```
### Multi-turn Conversation
```bash
ollama run richardyoung/kat-dev-72b:iq3_m
>>> I need to build a REST API in Python
>>> Show me a FastAPI example with authentication
>>> How do I add rate limiting?
```
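The same multi-turn flow can be driven programmatically through Ollama's `/api/chat` endpoint, which accepts the full message history on each call. A hedged sketch (assumes a local `ollama serve` on the default port; the elided assistant reply is a placeholder):

```python
import json

# Multi-turn request body for Ollama's /api/chat endpoint. Each call sends
# the whole conversation so far; the model's previous answer goes in the
# "assistant" slot (placeholder below).
payload = {
    "model": "richardyoung/kat-dev-72b:iq3_m",
    "messages": [
        {"role": "user", "content": "I need to build a REST API in Python"},
        {"role": "assistant", "content": "<previous model reply>"},
        {"role": "user", "content": "Show me a FastAPI example with authentication"},
    ],
    "stream": False,
}
body = json.dumps(payload)

# To actually send it (requires a running Ollama server):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:11434/api/chat", data=body.encode(),
#       headers={"Content-Type": "application/json"})
#   answer = json.loads(urllib.request.urlopen(req).read())["message"]["content"]
print(len(payload["messages"]))  # three turns so far
```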
## 🏗️ Model Details
<details>
<summary><b>Click to expand technical details</b></summary>
### Architecture
- **Base Model:** KAT-Dev 72B Exp by Kwaipilot
- **Parameters:** ~72 Billion
- **Quantization:** GGUF format (IQ2_XXS to IQ4_XS)
- **Context Length:** Standard (check base model for specifics)
- **Optimization:** Code generation and understanding
- **Training:** Specialized for programming tasks
### Supported Languages
The model excels at:
- Python
- JavaScript/TypeScript
- Java
- C/C++
- Go
- Rust
- And many more!
</details>
## ⚡ Performance Tips
<details>
<summary><b>Getting the best results</b></summary>
1. **Choose the right quantization** - IQ3_M is recommended for daily use
2. **Use specific prompts** - "Write a Python function to X" works better than "code for X"
3. **Provide context** - Share error messages, file structures, or requirements
4. **Iterate** - Ask follow-up questions to refine the code
5. **GPU acceleration** - Use Metal (Mac) or CUDA (NVIDIA) for faster inference
6. **Temperature settings** - Lower (0.1-0.3) for precise code, higher (0.7-0.9) for creative solutions
### Example Ollama Configuration

Edit the Modelfile to add sampling parameters:

```dockerfile
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
```

Then rebuild with the customized Modelfile:

```bash
ollama create my-kat-dev -f modelfiles/kat-dev-72b--iq3_m.Modelfile
```
</details>
## 🔧 Building Custom Variants
You can modify the included Modelfiles to customize behavior:
```dockerfile
FROM ./kat-dev-72b-iq3_m.gguf
# System prompt
SYSTEM You are an expert programmer specializing in Python and web development.
# Parameters
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER stop "<|endoftext|>"
```
Then build:
```bash
ollama create my-custom-kat -f custom.Modelfile
```
## ⚠️ Known Limitations

- 💾 **Large Size** - Even the smallest variant needs 24+ GB of storage
- 🐏 **RAM Intensive** - Requires significant system memory
- ⏱️ **Inference Speed** - Slower than smaller models (trade-off for quality)
- 🌐 **English-Focused** - Best performance with English prompts
- 📝 **Code-Specialized** - Not optimized for general conversation
## 📄 License
Apache 2.0 - Same as the original model. Free for commercial use!
## 🙏 Acknowledgments
- **Original Model:** [Kwaipilot](https://huggingface.co/Kwaipilot) for creating KAT-Dev 72B
- **GGUF Format:** [Georgi Gerganov](https://github.com/ggerganov) for llama.cpp
- **Ollama:** [Ollama team](https://ollama.ai/) for the amazing runtime
- **Community:** All the developers testing and providing feedback
## 🔗 Useful Links

- 📦 **Original Model:** [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)
- 🌐 **Ollama Registry:** [richardyoung/kat-dev-72b](https://ollama.com/richardyoung/kat-dev-72b)
- 🛠️ **llama.cpp:** [GitHub](https://github.com/ggerganov/llama.cpp)
- 📚 **Ollama Docs:** [Documentation](https://github.com/ollama/ollama)
- 💬 **Discussions:** [Ask questions here!](https://huggingface.co/richardyoung/kat-dev-72b/discussions)
## 🎮 Pro Tips
<details>
<summary><b>Advanced usage patterns</b></summary>
### 1. Integration with VS Code
Use with Continue.dev or other coding assistants:
```json
{
"models": [
{
"title": "KAT-Dev 72B",
"provider": "ollama",
"model": "richardyoung/kat-dev-72b:iq3_m"
}
]
}
```
### 2. API Server Mode
Run as an OpenAI-compatible API:
```bash
ollama serve
# Then use the API at http://localhost:11434
```
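The server also exposes an OpenAI-compatible endpoint under `/v1`, so OpenAI-style clients can point at it. A sketch of the request shape (payload construction only; actually sending it requires a running server, and the prompt and temperature are just examples):

```python
import json

# OpenAI-style chat completion request aimed at Ollama's /v1 endpoint.
url = "http://localhost:11434/v1/chat/completions"
request = {
    "model": "richardyoung/kat-dev-72b:iq3_m",
    "messages": [
        {"role": "user", "content": "Write a Python function to slugify a string"},
    ],
    "temperature": 0.2,  # low temperature for precise code (see Performance Tips)
}
body = json.dumps(request)

# e.g. with the openai client, point it at the local server:
#   client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
print(url)
```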
### 3. Batch Processing
Process multiple files:
```bash
for file in *.py; do
  ollama run richardyoung/kat-dev-72b:iq3_m "Review this code: $(cat "$file")" > "${file}.review"
done
```
</details>
---
<div align="center">
**Quantized with ❤️ by [richardyoung](https://deepneuro.ai/richard)**

*If you find this useful, please ⭐ star the repo and share with other developers!*
**Format:** GGUF | **Runtime:** Ollama / llama.cpp | **Created:** October 2025
</div>