|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: Kwaipilot/KAT-Dev-72B-Exp |
|
|
pipeline_tag: text-generation |
|
|
library_name: llama.cpp |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- gguf |
|
|
- quantized |
|
|
- ollama |
|
|
- coding |
|
|
- llama-cpp |
|
|
- text-generation |
|
|
quantized_by: richardyoung |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
# 💻 KAT-Dev 72B - GGUF
|
|
|
|
|
### Enterprise-Grade 72B Coding Model, Optimized for Local Inference |
|
|
|
|
|
[![llama.cpp](https://img.shields.io/badge/llama.cpp-GGUF-informational)](https://github.com/ggerganov/llama.cpp)
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-yellow)](https://huggingface.co/richardyoung/kat-dev-72b)
[![Ollama](https://img.shields.io/badge/Ollama-Ready-blue)](https://ollama.ai/)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
|
|
|
|
|
**[Original Model](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)** | **[Ollama Registry](https://ollama.com/richardyoung/kat-dev-72b)** | **[llama.cpp](https://github.com/ggerganov/llama.cpp)** |
|
|
|
|
|
--- |
|
|
|
|
|
</div> |
|
|
|
|
|
## 📖 What is This?
|
|
|
|
|
This is **KAT-Dev 72B**, a powerful coding model with 72 billion parameters, quantized to **GGUF format** for efficient local inference. Perfect for developers who want enterprise-grade code assistance running entirely on their own hardware with Ollama or llama.cpp! |
|
|
|
|
|
### ✨ Why You'll Love It
|
|
|
|
|
- 💻 **Coding-Focused** - Optimized specifically for programming tasks
- 🧠 **72B Parameters** - Large enough for complex reasoning and refactoring
- ⚡ **Local Inference** - Run entirely on your machine, no API calls
- 🔒 **Privacy First** - Your code never leaves your computer
- 🎯 **Multiple Quantizations** - Choose your speed/quality trade-off
- 🚀 **Ollama Ready** - One command to start coding
- 🔧 **llama.cpp Compatible** - Works with your favorite tools
|
|
|
|
|
## 🎯 Quick Start
|
|
|
|
|
### Option 1: Ollama (Easiest!) |
|
|
|
|
|
Pull and run directly from the Ollama registry: |
|
|
|
|
|
```bash |
|
|
# Recommended: IQ3_M (best balance) |
|
|
ollama run richardyoung/kat-dev-72b:iq3_m |
|
|
|
|
|
# Other variants |
|
|
ollama run richardyoung/kat-dev-72b:iq4_xs # Better quality |
|
|
ollama run richardyoung/kat-dev-72b:iq2_m # Faster, smaller |
|
|
ollama run richardyoung/kat-dev-72b:iq2_xxs # Most compact |
|
|
``` |
|
|
|
|
|
That's it! Start asking coding questions! 🎉
|
|
|
|
|
### Option 2: Build from Modelfile |
|
|
|
|
|
Download this repo and build locally: |
|
|
|
|
|
```bash |
|
|
# Clone or download the modelfiles |
|
|
ollama create kat-dev-72b-iq3_m -f modelfiles/kat-dev-72b--iq3_m.Modelfile |
|
|
ollama run kat-dev-72b-iq3_m |
|
|
``` |
|
|
|
|
|
### Option 3: llama.cpp |
|
|
|
|
|
Use with llama.cpp directly: |
|
|
|
|
|
```bash |
|
|
# Download the GGUF file (replace variant as needed) |
|
|
huggingface-cli download richardyoung/kat-dev-72b kat-dev-72b-iq3_m.gguf --local-dir ./ |
|
|
|
|
|
# Run with llama.cpp |
|
|
./llama-cli -m kat-dev-72b-iq3_m.gguf -p "Write a Python function to" |
|
|
``` |
|
|
|
|
|
## 💻 System Requirements
|
|
|
|
|
| Component | Minimum | Recommended | |
|
|
|-----------|---------|-------------| |
|
|
| **RAM** | 32 GB | 64 GB+ | |
|
|
| **Storage** | 40 GB free | 50+ GB free | |
|
|
| **CPU** | Modern 8-core | 16+ cores | |
|
|
| **GPU** | Optional (CPU-only works!) | Metal/CUDA for acceleration | |
|
|
| **OS** | macOS, Linux, Windows | Latest versions | |
|
|
|
|
|
> 💡 **Tip:** Larger quantizations (IQ4_XS) need more RAM but produce better code. Smaller ones (IQ2_XXS) are faster but less precise.
|
|
|
|
|
## 🎨 Available Quantizations
|
|
|
|
|
Choose the right balance for your needs: |
|
|
|
|
|
| Quantization | Size | Quality | Speed | RAM Usage | Best For |
|--------------|------|---------|-------|-----------|----------|
| **IQ4_XS** | 37 GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ~50 GB | Production code, complex refactoring |
| **IQ3_M** (recommended) | 33 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ~40 GB | Daily development, best balance |
| **IQ2_M** | 27 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ~35 GB | Quick prototyping, fast iteration |
| **IQ2_XXS** | 24 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | ~30 GB | Testing, very constrained systems |
|
|
|
|
|
### Variant Details |
|
|
|
|
|
| Variant | Size | Blob SHA256 | |
|
|
|---------|------|-------------| |
|
|
| `iq4_xs` | 36.98 GB | `c4cb9c6e...` | |
|
|
| `iq3_m` | 33.07 GB | `14d07184...` | |
|
|
| `iq2_m` | 27.32 GB | `cbe26a3c...` | |
|
|
| `iq2_xxs` | 23.74 GB | `a49c7526...` | |
|
|
|
|
|
## 📚 Usage Examples
|
|
|
|
|
### Code Generation |
|
|
|
|
|
```bash |
|
|
ollama run richardyoung/kat-dev-72b:iq3_m "Write a Python function to validate email addresses with regex" |
|
|
``` |
|
|
|
|
|
### Code Explanation |
|
|
|
|
|
```bash |
|
|
ollama run richardyoung/kat-dev-72b:iq3_m "Explain this code: def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)" |
|
|
``` |
|
|
|
|
|
### Debugging Help |
|
|
|
|
|
```bash |
|
|
ollama run richardyoung/kat-dev-72b:iq3_m "Why does this Python code raise a KeyError?" |
|
|
``` |
|
|
|
|
|
### Refactoring |
|
|
|
|
|
```bash |
|
|
ollama run richardyoung/kat-dev-72b:iq3_m "Refactor this JavaScript function to use async/await instead of callbacks" |
|
|
``` |
|
|
|
|
|
### Multi-turn Conversation |
|
|
|
|
|
```bash |
|
|
ollama run richardyoung/kat-dev-72b:iq3_m |
|
|
>>> I need to build a REST API in Python |
|
|
>>> Show me a FastAPI example with authentication |
|
|
>>> How do I add rate limiting? |
|
|
``` |
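The same model can also be driven programmatically: Ollama exposes a local REST API (default port 11434), and its `/api/generate` endpoint takes a model name and prompt. A minimal stdlib-only sketch, assuming `ollama serve` is running and the model tag has been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "richardyoung/kat-dev-72b:iq3_m"

def build_payload(prompt: str, model: str = MODEL) -> bytes:
    """JSON body for a single non-streaming /api/generate request."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(prompt: str) -> str:
    """Send the request to a locally running `ollama serve` and return the text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# generate("Write a Python function to validate email addresses")  # needs a running server
```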
|
|
|
|
|
## 🏗️ Model Details
|
|
|
|
|
<details> |
|
|
<summary><b>Click to expand technical details</b></summary> |
|
|
|
|
|
### Architecture |
|
|
|
|
|
- **Base Model:** KAT-Dev 72B Exp by Kwaipilot |
|
|
- **Parameters:** ~72 Billion |
|
|
- **Quantization:** GGUF format (IQ2_XXS to IQ4_XS) |
|
|
- **Context Length:** Standard (check base model for specifics) |
|
|
- **Optimization:** Code generation and understanding |
|
|
- **Training:** Specialized for programming tasks |
|
|
|
|
|
### Supported Languages |
|
|
|
|
|
The model excels at: |
|
|
- Python |
|
|
- JavaScript/TypeScript |
|
|
- Java |
|
|
- C/C++ |
|
|
- Go |
|
|
- Rust |
|
|
- And many more! |
|
|
|
|
|
</details> |
|
|
|
|
|
## ⚡ Performance Tips
|
|
|
|
|
<details> |
|
|
<summary><b>Getting the best results</b></summary> |
|
|
|
|
|
1. **Choose the right quantization** - IQ3_M is recommended for daily use |
|
|
2. **Use specific prompts** - "Write a Python function to X" works better than "code for X" |
|
|
3. **Provide context** - Share error messages, file structures, or requirements |
|
|
4. **Iterate** - Ask follow-up questions to refine the code |
|
|
5. **GPU acceleration** - Use Metal (Mac) or CUDA (NVIDIA) for faster inference |
|
|
6. **Temperature settings** - Lower (0.1-0.3) for precise code, higher (0.7-0.9) for creative solutions |
|
|
|
|
|
### Example Ollama Configuration |
|
|
|
|
|
First add sampling parameters to the Modelfile (these lines use Ollama's Modelfile syntax, not bash):

```dockerfile
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
```

Then create a model with those parameters baked in:

```bash
ollama create my-kat-dev -f modelfiles/kat-dev-72b--iq3_m.Modelfile
```
|
|
|
|
|
</details> |
|
|
|
|
|
## 🔧 Building Custom Variants
|
|
|
|
|
You can modify the included Modelfiles to customize behavior: |
|
|
|
|
|
```dockerfile |
|
|
FROM ./kat-dev-72b-iq3_m.gguf |
|
|
|
|
|
# System prompt |
|
|
SYSTEM You are an expert programmer specializing in Python and web development. |
|
|
|
|
|
# Parameters |
|
|
PARAMETER temperature 0.2 |
|
|
PARAMETER num_ctx 8192 |
|
|
PARAMETER stop "<|endoftext|>" |
|
|
``` |
|
|
|
|
|
Then build: |
|
|
|
|
|
```bash |
|
|
ollama create my-custom-kat -f custom.Modelfile |
|
|
``` |
|
|
|
|
|
## ⚠️ Known Limitations
|
|
|
|
|
- 💾 **Large Size** - Even the smallest variant needs 24+ GB of storage
- 🐏 **RAM Intensive** - Requires significant system memory
- ⏱️ **Inference Speed** - Slower than smaller models (trade-off for quality)
- 🌐 **English-Focused** - Best performance with English prompts
- 📝 **Code-Specialized** - Not optimized for general conversation
|
|
|
|
|
## 📄 License
|
|
|
|
|
Apache 2.0 - Same as the original model. Free for commercial use! |
|
|
|
|
|
## 🙏 Acknowledgments
|
|
|
|
|
- **Original Model:** [Kwaipilot](https://huggingface.co/Kwaipilot) for creating KAT-Dev 72B |
|
|
- **GGUF Format:** [Georgi Gerganov](https://github.com/ggerganov) for llama.cpp |
|
|
- **Ollama:** [Ollama team](https://ollama.ai/) for the amazing runtime |
|
|
- **Community:** All the developers testing and providing feedback |
|
|
|
|
|
## 🔗 Useful Links
|
|
|
|
|
- 📦 **Original Model:** [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)
- 🚀 **Ollama Registry:** [richardyoung/kat-dev-72b](https://ollama.com/richardyoung/kat-dev-72b)
- 🛠️ **llama.cpp:** [GitHub](https://github.com/ggerganov/llama.cpp)
- 📚 **Ollama Docs:** [Documentation](https://github.com/ollama/ollama)
- 💬 **Discussions:** [Ask questions here!](https://huggingface.co/richardyoung/kat-dev-72b/discussions)
|
|
|
|
|
## 🎮 Pro Tips
|
|
|
|
|
<details> |
|
|
<summary><b>Advanced usage patterns</b></summary> |
|
|
|
|
|
### 1. Integration with VS Code |
|
|
|
|
|
Use with Continue.dev or other coding assistants: |
|
|
|
|
|
```json |
|
|
{ |
|
|
"models": [ |
|
|
{ |
|
|
"title": "KAT-Dev 72B", |
|
|
"provider": "ollama", |
|
|
"model": "richardyoung/kat-dev-72b:iq3_m" |
|
|
} |
|
|
] |
|
|
} |
|
|
``` |
|
|
|
|
|
### 2. API Server Mode |
|
|
|
|
|
Run as an OpenAI-compatible API: |
|
|
|
|
|
```bash |
|
|
ollama serve |
|
|
# Then use the API at http://localhost:11434 |
|
|
``` |
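Ollama also serves an OpenAI-compatible API under the `/v1` prefix, so OpenAI-style clients can point at it by overriding the base URL. A stdlib-only sketch of a chat-completion request (assumes the server above is running; no API key is needed for a local server):

```python
import json
import urllib.request

BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible prefix

def build_chat_body(messages: list[dict], model: str = "richardyoung/kat-dev-72b:iq3_m") -> bytes:
    """JSON body in the OpenAI chat-completions shape."""
    return json.dumps({"model": model, "messages": messages}).encode()

def chat(messages: list[dict]) -> str:
    """POST to /v1/chat/completions on a running local Ollama server."""
    req = urllib.request.Request(
        f"{BASE}/chat/completions",
        data=build_chat_body(messages),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat([{"role": "user", "content": "Show me a FastAPI example with authentication"}])
```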
|
|
|
|
|
### 3. Batch Processing |
|
|
|
|
|
Process multiple files: |
|
|
|
|
|
```bash |
|
|
for file in *.py; do
  ollama run richardyoung/kat-dev-72b:iq3_m "Review this code: $(cat "$file")" > "${file}.review"
done
|
|
``` |
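The shell loop works for small batches, but a Python wrapper makes it easier to truncate oversized files and keep going when one file fails. A sketch built on the `ollama run` CLI via `subprocess` (the byte cutoff is an arbitrary safety margin, not a model limit):

```python
import pathlib
import subprocess

MODEL = "richardyoung/kat-dev-72b:iq3_m"
MAX_BYTES = 20_000  # arbitrary cutoff so huge files don't overflow the context window

def build_prompt(code: str) -> str:
    """Prompt sent for each file; truncates oversized sources."""
    return "Review this code:\n\n" + code[:MAX_BYTES]

def review_all(pattern: str = "*.py") -> None:
    """Review every matching file, writing <name>.review next to it."""
    for path in sorted(pathlib.Path(".").glob(pattern)):
        try:
            result = subprocess.run(
                ["ollama", "run", MODEL, build_prompt(path.read_text(errors="replace"))],
                capture_output=True, text=True, check=True,
            )
            path.with_suffix(path.suffix + ".review").write_text(result.stdout)
        except (OSError, subprocess.CalledProcessError) as err:
            print(f"skipped {path}: {err}")  # keep going on individual failures
```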
|
|
|
|
|
</details> |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Quantized with ❤️ by [richardyoung](https://deepneuro.ai/richard)**
|
|
|
|
|
*If you find this useful, please ⭐ star the repo and share with other developers!*
|
|
|
|
|
**Format:** GGUF | **Runtime:** Ollama / llama.cpp | **Created:** October 2025 |
|
|
|
|
|
</div> |
|
|
|
|
|
|
|
|