kat-dev-72b / README.md

Upload README.md with huggingface_hub

5f76079 verified about 2 months ago

9.67 kB

	---
	license: apache-2.0
	base_model: Kwaipilot/KAT-Dev-72B-Exp
	pipeline_tag: text-generation
	library_name: llama.cpp
	language:
	- en
	tags:
	- gguf
	- quantized
	- ollama
	- coding
	- llama-cpp
	- text-generation
	quantized_by: richardyoung
	---

	<div align="center">

	# 💻 KAT-Dev 72B - GGUF

	### Enterprise-Grade 72B Coding Model, Optimized for Local Inference

	[![GGUF](https://img.shields.io/badge/Format-GGUF-blue)](https://github.com/ggerganov/llama.cpp)
	[![Size](https://img.shields.io/badge/Variants-4_Quantizations-green)](https://huggingface.co/richardyoung/kat-dev-72b)
	[![Ollama](https://img.shields.io/badge/Runtime-Ollama-orange)](https://ollama.ai/)
	[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

	[Original Model](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp) \| [Ollama Registry](https://ollama.com/richardyoung/kat-dev-72b) \| [llama.cpp](https://github.com/ggerganov/llama.cpp)

	---

	</div>

	## 📖 What is This?

	This is KAT-Dev 72B, a powerful coding model with 72 billion parameters, quantized to GGUF format for efficient local inference. Perfect for developers who want enterprise-grade code assistance running entirely on their own hardware with Ollama or llama.cpp!

	### ✨ Why You'll Love It

	- 💻 Coding-Focused - Optimized specifically for programming tasks
	- 🧠 72B Parameters - Large enough for complex reasoning and refactoring
	- ⚡ Local Inference - Run entirely on your machine, no API calls
	- 🔒 Privacy First - Your code never leaves your computer
	- 🎯 Multiple Quantizations - Choose your speed/quality trade-off
	- 🚀 Ollama Ready - One command to start coding
	- 🔧 llama.cpp Compatible - Works with your favorite tools

	## 🎯 Quick Start

	### Option 1: Ollama (Easiest!)

	Pull and run directly from the Ollama registry:

	```bash
	# Recommended: IQ3_M (best balance)
	ollama run richardyoung/kat-dev-72b:iq3_m

	# Other variants
	ollama run richardyoung/kat-dev-72b:iq4_xs # Better quality
	ollama run richardyoung/kat-dev-72b:iq2_m # Faster, smaller
	ollama run richardyoung/kat-dev-72b:iq2_xxs # Most compact
	```

	That's it! Start asking coding questions! 🎉

	### Option 2: Build from Modelfile

	Download this repo and build locally:

	```bash
	# Clone or download the modelfiles
	ollama create kat-dev-72b-iq3_m -f modelfiles/kat-dev-72b--iq3_m.Modelfile
	ollama run kat-dev-72b-iq3_m
	```

	### Option 3: llama.cpp

	Use with llama.cpp directly:

	```bash
	# Download the GGUF file (replace variant as needed)
	huggingface-cli download richardyoung/kat-dev-72b kat-dev-72b-iq3_m.gguf --local-dir ./

	# Run with llama.cpp
	./llama-cli -m kat-dev-72b-iq3_m.gguf -p "Write a Python function to"
	```

	## 💻 System Requirements

	\| Component \| Minimum \| Recommended \|
	\|-----------\|---------\|-------------\|
	\| RAM \| 32 GB \| 64 GB+ \|
	\| Storage \| 40 GB free \| 50+ GB free \|
	\| CPU \| Modern 8-core \| 16+ cores \|
	\| GPU \| Optional (CPU-only works!) \| Metal/CUDA for acceleration \|
	\| OS \| macOS, Linux, Windows \| Latest versions \|

	> 💡 Tip: Larger quantizations (IQ4_XS) need more RAM but produce better code. Smaller ones (IQ2_XXS) are faster but less precise.

	## 🎨 Available Quantizations

	Choose the right balance for your needs:

	\| Quantization \| Size \| Quality \| Speed \| RAM Usage \| Best For \|
	\|--------------\|------\|---------\|-------\|-----------\|----------\|
	\| IQ4_XS \| 37 GB \| ⭐⭐⭐⭐⭐ \| ⭐⭐⭐ \| ~50 GB \| Production code, complex refactoring \|
	\| IQ3_M (recommended) \| 33 GB \| ⭐⭐⭐⭐ \| ⭐⭐⭐⭐ \| ~40 GB \| Daily development, best balance \|
	\| IQ2_M \| 27 GB \| ⭐⭐⭐ \| ⭐⭐⭐⭐⭐ \| ~35 GB \| Quick prototyping, fast iteration \|
	\| IQ2_XXS \| 24 GB \| ⭐⭐ \| ⭐⭐⭐⭐⭐ \| ~30 GB \| Testing, very constrained systems \|

	### Variant Details

	\| Variant \| Size \| Blob SHA256 \|
	\|---------\|------\|-------------\|
	\| `iq4_xs` \| 36.98 GB \| `c4cb9c6e...` \|
	\| `iq3_m` \| 33.07 GB \| `14d07184...` \|
	\| `iq2_m` \| 27.32 GB \| `cbe26a3c...` \|
	\| `iq2_xxs` \| 23.74 GB \| `a49c7526...` \|

	## 📚 Usage Examples

	### Code Generation

	```bash
	ollama run richardyoung/kat-dev-72b:iq3_m "Write a Python function to validate email addresses with regex"
	```

	### Code Explanation

	```bash
	ollama run richardyoung/kat-dev-72b:iq3_m "Explain this code: def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)"
	```

	### Debugging Help

	```bash
	ollama run richardyoung/kat-dev-72b:iq3_m "Why does this Python code raise a KeyError?"
	```

	### Refactoring

	```bash
	ollama run richardyoung/kat-dev-72b:iq3_m "Refactor this JavaScript function to use async/await instead of callbacks"
	```

	### Multi-turn Conversation

	```bash
	ollama run richardyoung/kat-dev-72b:iq3_m
	>>> I need to build a REST API in Python
	>>> Show me a FastAPI example with authentication
	>>> How do I add rate limiting?
	```

	## 🏗️ Model Details

	<details>
	<summary><b>Click to expand technical details</b></summary>

	### Architecture

	- Base Model: KAT-Dev 72B Exp by Kwaipilot
	- Parameters: ~72 Billion
	- Quantization: GGUF format (IQ2_XXS to IQ4_XS)
	- Context Length: Standard (check base model for specifics)
	- Optimization: Code generation and understanding
	- Training: Specialized for programming tasks

	### Supported Languages

	The model excels at:
	- Python
	- JavaScript/TypeScript
	- Java
	- C/C++
	- Go
	- Rust
	- And many more!

	</details>

	## ⚡ Performance Tips

	<details>
	<summary><b>Getting the best results</b></summary>

	1. Choose the right quantization - IQ3_M is recommended for daily use
	2. Use specific prompts - "Write a Python function to X" works better than "code for X"
	3. Provide context - Share error messages, file structures, or requirements
	4. Iterate - Ask follow-up questions to refine the code
	5. GPU acceleration - Use Metal (Mac) or CUDA (NVIDIA) for faster inference
	6. Temperature settings - Lower (0.1-0.3) for precise code, higher (0.7-0.9) for creative solutions

	### Example Ollama Configuration

	```bash
	# Create with custom parameters
	ollama create my-kat-dev -f modelfiles/kat-dev-72b--iq3_m.Modelfile

	# Edit the Modelfile to add:
	PARAMETER temperature 0.2
	PARAMETER top_p 0.9
	PARAMETER repeat_penalty 1.1
	```

	</details>

	## 🔧 Building Custom Variants

	You can modify the included Modelfiles to customize behavior:

	```dockerfile
	FROM ./kat-dev-72b-iq3_m.gguf

	# System prompt
	SYSTEM You are an expert programmer specializing in Python and web development.

	# Parameters
	PARAMETER temperature 0.2
	PARAMETER num_ctx 8192
	PARAMETER stop "<\|endoftext\|>"
	```

	Then build:

	```bash
	ollama create my-custom-kat -f custom.Modelfile
	```

	## ⚠️ Known Limitations

	- 💾 Large Size - Even the smallest variant needs 24+ GB of storage
	- 🐏 RAM Intensive - Requires significant system memory
	- ⏱️ Inference Speed - Slower than smaller models (trade-off for quality)
	- 🌐 English-Focused - Best performance with English prompts
	- 📝 Code-Specialized - Not optimized for general conversation

	## 📄 License

	Apache 2.0 - Same as the original model. Free for commercial use!

	## 🙏 Acknowledgments

	- Original Model: [Kwaipilot](https://huggingface.co/Kwaipilot) for creating KAT-Dev 72B
	- GGUF Format: [Georgi Gerganov](https://github.com/ggerganov) for llama.cpp
	- Ollama: [Ollama team](https://ollama.ai/) for the amazing runtime
	- Community: All the developers testing and providing feedback

	## 🔗 Useful Links

	- 📦 Original Model: [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)
	- 🚀 Ollama Registry: [richardyoung/kat-dev-72b](https://ollama.com/richardyoung/kat-dev-72b)
	- 🛠️ llama.cpp: [GitHub](https://github.com/ggerganov/llama.cpp)
	- 📖 Ollama Docs: [Documentation](https://github.com/ollama/ollama)
	- 💬 Discussions: [Ask questions here!](https://huggingface.co/richardyoung/kat-dev-72b/discussions)

	## 🎮 Pro Tips

	<details>
	<summary><b>Advanced usage patterns</b></summary>

	### 1. Integration with VS Code

	Use with Continue.dev or other coding assistants:

	```json
	{
	"models": [
	{
	"title": "KAT-Dev 72B",
	"provider": "ollama",
	"model": "richardyoung/kat-dev-72b:iq3_m"
	}
	]
	}
	```

	### 2. API Server Mode

	Run as an OpenAI-compatible API:

	```bash
	ollama serve
	# Then use the API at http://localhost:11434
	```

	### 3. Batch Processing

	Process multiple files:

	```bash
	for file in *.py; do
	ollama run richardyoung/kat-dev-72b:iq3_m "Review this code: $(cat $file)" > "${file}.review"
	done
	```

	</details>

	---

	<div align="center">

	Quantized with ❤️ by [richardyoung](https://deepneuro.ai/richard)

	If you find this useful, please ⭐ star the repo and share with other developers!

	Format: GGUF \| Runtime: Ollama / llama.cpp \| Created: October 2025

	</div>


	## Hardware Requirements

	KAT-Dev 72B is a large coding model. Choose your quantization based on available VRAM/RAM:

	\| Quantization \| Model Size \| VRAM Required \| Quality \|
	\|:------------:\|:----------:\|:-------------:\|:--------\|
	\| Q2_K \| ~27 GB \| 32 GB \| Acceptable \|
	\| Q3_K_M \| ~34 GB \| 40 GB \| Good \|
	\| Q4_K_M \| ~42 GB \| 48 GB \| Very Good - recommended \|
	\| Q5_K_M \| ~50 GB \| 56 GB \| Excellent \|
	\| Q6_K \| ~58 GB \| 64 GB \| Near original \|
	\| Q8_0 \| ~77 GB \| 80 GB \| Original quality \|

	### Recommended Setups

	\| Hardware \| Recommended Quantization \|
	\|:---------\|:-------------------------\|
	\| RTX 4090 (24GB) \| Q2_K with offloading \|
	\| 2x RTX 4090 (48GB) \| Q4_K_M \|
	\| A100 (80GB) \| Q8_0 \|
	\| Mac Studio M2 Ultra (192GB) \| Q8_0 via llama.cpp \|