---
license: apache-2.0
base_model: Kwaipilot/KAT-Dev-72B-Exp
pipeline_tag: text-generation
library_name: llama.cpp
language:
- en
tags:
- gguf
- quantized
- ollama
- coding
- llama-cpp
- text-generation
quantized_by: richardyoung
---
<div align="center">

# 💻 KAT-Dev 72B - GGUF

### Enterprise-Grade 72B Coding Model, Optimized for Local Inference

[llama.cpp](https://github.com/ggerganov/llama.cpp) | [Hugging Face](https://huggingface.co/richardyoung/kat-dev-72b) | [Ollama](https://ollama.ai/) | [License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)

**[Original Model](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)** | **[Ollama Registry](https://ollama.com/richardyoung/kat-dev-72b)** | **[llama.cpp](https://github.com/ggerganov/llama.cpp)**

---

</div>
## 📖 What is This?

This is **KAT-Dev 72B**, a powerful coding model with 72 billion parameters, quantized to **GGUF format** for efficient local inference. Perfect for developers who want enterprise-grade code assistance running entirely on their own hardware with Ollama or llama.cpp!

### ✨ Why You'll Love It

- 💻 **Coding-Focused** - Optimized specifically for programming tasks
- 🧠 **72B Parameters** - Large enough for complex reasoning and refactoring
- ⚡ **Local Inference** - Run entirely on your machine, no API calls
- 🔒 **Privacy First** - Your code never leaves your computer
- 🎯 **Multiple Quantizations** - Choose your speed/quality trade-off
- 🚀 **Ollama Ready** - One command to start coding
- 🔧 **llama.cpp Compatible** - Works with your favorite tools
## 🎯 Quick Start
### Option 1: Ollama (Easiest!)
Pull and run directly from the Ollama registry:
```bash
# Recommended: IQ3_M (best balance)
ollama run richardyoung/kat-dev-72b:iq3_m
# Other variants
ollama run richardyoung/kat-dev-72b:iq4_xs # Better quality
ollama run richardyoung/kat-dev-72b:iq2_m # Faster, smaller
ollama run richardyoung/kat-dev-72b:iq2_xxs # Most compact
```
That's it! Start asking coding questions! 🎉
### Option 2: Build from Modelfile
Download this repo and build locally:
```bash
# Clone or download the modelfiles
ollama create kat-dev-72b-iq3_m -f modelfiles/kat-dev-72b--iq3_m.Modelfile
ollama run kat-dev-72b-iq3_m
```
### Option 3: llama.cpp
Use with llama.cpp directly:
```bash
# Download the GGUF file (replace variant as needed)
huggingface-cli download richardyoung/kat-dev-72b kat-dev-72b-iq3_m.gguf --local-dir ./
# Run with llama.cpp
./llama-cli -m kat-dev-72b-iq3_m.gguf -p "Write a Python function to"
```
## 💻 System Requirements
| Component | Minimum | Recommended |
|-----------|---------|-------------|
| **RAM** | 32 GB | 64 GB+ |
| **Storage** | 40 GB free | 50+ GB free |
| **CPU** | Modern 8-core | 16+ cores |
| **GPU** | Optional (CPU-only works!) | Metal/CUDA for acceleration |
| **OS** | macOS, Linux, Windows | Latest versions |
> 💡 **Tip:** Larger quantizations (IQ4_XS) need more RAM but produce better code. Smaller ones (IQ2_XXS) are faster but less precise.
## 🎨 Available Quantizations
Choose the right balance for your needs:
| Quantization | Size | Quality | Speed | RAM Usage | Best For |
|--------------|------|---------|-------|-----------|----------|
| **IQ4_XS** | 37 GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ~50 GB | Production code, complex refactoring |
| **IQ3_M** (recommended) | 33 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ~40 GB | Daily development, best balance |
| **IQ2_M** | 27 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ~35 GB | Quick prototyping, fast iteration |
| **IQ2_XXS** | 24 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | ~30 GB | Testing, very constrained systems |
### Variant Details
| Variant | Size | Blob SHA256 |
|---------|------|-------------|
| `iq4_xs` | 36.98 GB | `c4cb9c6e...` |
| `iq3_m` | 33.07 GB | `14d07184...` |
| `iq2_m` | 27.32 GB | `cbe26a3c...` |
| `iq2_xxs` | 23.74 GB | `a49c7526...` |
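If you're unsure which variant fits your machine, the approximate RAM figures above translate into a simple selection rule. A minimal sketch (the `pick_variant` helper and its thresholds are illustrative, not part of this repo):

```python
# Illustrative helper: choose the highest-quality variant that fits the
# approximate RAM budget from the table above (the "~RAM Usage" column,
# not exact measurements).
VARIANTS = [           # ordered best quality first
    ("iq4_xs", 50),    # ~50 GB RAM
    ("iq3_m", 40),     # ~40 GB RAM
    ("iq2_m", 35),     # ~35 GB RAM
    ("iq2_xxs", 30),   # ~30 GB RAM
]

def pick_variant(ram_gb):
    """Return the best variant tag that fits in ram_gb, or None."""
    for tag, needed in VARIANTS:
        if ram_gb >= needed:
            return tag
    return None

print(pick_variant(64))   # enough for iq4_xs
print(pick_variant(36))   # falls back to iq2_m
```

Anything under ~30 GB of RAM won't comfortably hold even the smallest variant; see the system requirements above.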
## 📚 Usage Examples
### Code Generation
```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Write a Python function to validate email addresses with regex"
```
### Code Explanation
```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Explain this code: def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)"
```
### Debugging Help
```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Why does this Python code raise a KeyError?"
```
### Refactoring
```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Refactor this JavaScript function to use async/await instead of callbacks"
```
### Multi-turn Conversation
```bash
ollama run richardyoung/kat-dev-72b:iq3_m
>>> I need to build a REST API in Python
>>> Show me a FastAPI example with authentication
>>> How do I add rate limiting?
```
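The same multi-turn flow can be driven programmatically through Ollama's `/api/chat` endpoint, which accepts the full message history on each call. A hedged sketch (assumes a local `ollama serve` on the default port; the elided assistant reply is a placeholder):

```python
import json

# Multi-turn request body for Ollama's /api/chat endpoint. Each call sends
# the whole conversation so far; the model's previous answer goes in the
# "assistant" slot (placeholder below).
payload = {
    "model": "richardyoung/kat-dev-72b:iq3_m",
    "messages": [
        {"role": "user", "content": "I need to build a REST API in Python"},
        {"role": "assistant", "content": "<previous model reply>"},
        {"role": "user", "content": "Show me a FastAPI example with authentication"},
    ],
    "stream": False,
}
body = json.dumps(payload)

# To actually send it (requires a running Ollama server):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:11434/api/chat", data=body.encode(),
#       headers={"Content-Type": "application/json"})
#   answer = json.loads(urllib.request.urlopen(req).read())["message"]["content"]
print(len(payload["messages"]))  # three turns so far
```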
## 🏗️ Model Details
<details>
<summary><b>Click to expand technical details</b></summary>
### Architecture
- **Base Model:** KAT-Dev 72B Exp by Kwaipilot
- **Parameters:** ~72 Billion
- **Quantization:** GGUF format (IQ2_XXS to IQ4_XS)
- **Context Length:** Standard (check base model for specifics)
- **Optimization:** Code generation and understanding
- **Training:** Specialized for programming tasks
### Supported Languages
The model excels at:
- Python
- JavaScript/TypeScript
- Java
- C/C++
- Go
- Rust
- And many more!
</details>
## ⚡ Performance Tips
<details>
<summary><b>Getting the best results</b></summary>
1. **Choose the right quantization** - IQ3_M is recommended for daily use
2. **Use specific prompts** - "Write a Python function to X" works better than "code for X"
3. **Provide context** - Share error messages, file structures, or requirements
4. **Iterate** - Ask follow-up questions to refine the code
5. **GPU acceleration** - Use Metal (Mac) or CUDA (NVIDIA) for faster inference
6. **Temperature settings** - Lower (0.1-0.3) for precise code, higher (0.7-0.9) for creative solutions
### Example Ollama Configuration

Edit the Modelfile to add sampling parameters:

```dockerfile
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
```

Then rebuild with the customized Modelfile:

```bash
ollama create my-kat-dev -f modelfiles/kat-dev-72b--iq3_m.Modelfile
```
</details>
## 🔧 Building Custom Variants
You can modify the included Modelfiles to customize behavior:
```dockerfile
FROM ./kat-dev-72b-iq3_m.gguf
# System prompt
SYSTEM You are an expert programmer specializing in Python and web development.
# Parameters
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER stop "<|endoftext|>"
```
Then build:
```bash
ollama create my-custom-kat -f custom.Modelfile
```
## ⚠️ Known Limitations

- 💾 **Large Size** - Even the smallest variant needs 24+ GB of storage
- 🐏 **RAM Intensive** - Requires significant system memory
- ⏱️ **Inference Speed** - Slower than smaller models (trade-off for quality)
- 🌐 **English-Focused** - Best performance with English prompts
- 📝 **Code-Specialized** - Not optimized for general conversation
## 📄 License
Apache 2.0 - Same as the original model. Free for commercial use!
## 🙏 Acknowledgments
- **Original Model:** [Kwaipilot](https://huggingface.co/Kwaipilot) for creating KAT-Dev 72B
- **GGUF Format:** [Georgi Gerganov](https://github.com/ggerganov) for llama.cpp
- **Ollama:** [Ollama team](https://ollama.ai/) for the amazing runtime
- **Community:** All the developers testing and providing feedback
## 🔗 Useful Links

- 📦 **Original Model:** [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)
- 🌐 **Ollama Registry:** [richardyoung/kat-dev-72b](https://ollama.com/richardyoung/kat-dev-72b)
- 🛠️ **llama.cpp:** [GitHub](https://github.com/ggerganov/llama.cpp)
- 📚 **Ollama Docs:** [Documentation](https://github.com/ollama/ollama)
- 💬 **Discussions:** [Ask questions here!](https://huggingface.co/richardyoung/kat-dev-72b/discussions)
## 🎮 Pro Tips
<details>
<summary><b>Advanced usage patterns</b></summary>
### 1. Integration with VS Code
Use with Continue.dev or other coding assistants:
```json
{
"models": [
{
"title": "KAT-Dev 72B",
"provider": "ollama",
"model": "richardyoung/kat-dev-72b:iq3_m"
}
]
}
```
### 2. API Server Mode
Run as an OpenAI-compatible API:
```bash
ollama serve
# Then use the API at http://localhost:11434
```
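The server also exposes an OpenAI-compatible endpoint under `/v1`, so OpenAI-style clients can point at it. A sketch of the request shape (payload construction only; actually sending it requires a running server, and the prompt and temperature are just examples):

```python
import json

# OpenAI-style chat completion request aimed at Ollama's /v1 endpoint.
url = "http://localhost:11434/v1/chat/completions"
request = {
    "model": "richardyoung/kat-dev-72b:iq3_m",
    "messages": [
        {"role": "user", "content": "Write a Python function to slugify a string"},
    ],
    "temperature": 0.2,  # low temperature for precise code (see Performance Tips)
}
body = json.dumps(request)

# e.g. with the openai client, point it at the local server:
#   client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
print(url)
```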
### 3. Batch Processing
Process multiple files:
```bash
for file in *.py; do
  ollama run richardyoung/kat-dev-72b:iq3_m "Review this code: $(cat "$file")" > "${file}.review"
done
```
</details>
---
<div align="center">
**Quantized with ❤️ by [richardyoung](https://deepneuro.ai/richard)**

*If you find this useful, please ⭐ star the repo and share with other developers!*
**Format:** GGUF | **Runtime:** Ollama / llama.cpp | **Created:** October 2025
</div>