---
language:
- en
- de
license: apache-2.0
library_name: transformers
base_model:
- Qwen/Qwen2.5-Coder-14B
- Qwen/Qwen2.5-Coder-32B
tags:
- code
- coding
- tool-calling
- code-generation
- eu-trained
- dpo
- sft
- qlora
pipeline_tag: text-generation
model-index:
- name: Kode
  results: []
---

# Kode: EU-Trained Coding Models

Kode is a family of instruction-tuned coding models built for real-world software engineering tasks. The models are fine-tuned from Qwen2.5-Coder on NVIDIA A100 GPUs with SFT followed by DPO, using Claude-generated training samples.

Kode is the backbone of Kode CLI/Web UI, an open-source local alternative to Claude Code. The GitHub repository is coming soon.

| Model | Parameters | VRAM (approx.) | Best for |
|-------|------------|----------------|----------|
| kode-14b | 14B | ~15 GB (Q8) / ~9 GB (Q4) | Consumer GPUs, fast iteration |
| kode-32b | 32B | ~19 GB (Q4) | Maximum quality, production use |

## Key Features

- 🇪🇺 Trained in the EU: GDPR (DSGVO) compliant, no data leaves Europe
- 🔧 Native tool calling: trained specifically for file operations, shell commands, and code search
- 🎯 Production code focus: training data from real codebases, not synthetic benchmarks
- 📐 7 languages: Rust, Go, TypeScript, Python, C#, SQL (PostgreSQL), CSS/Tailwind
- 🏠 Runs locally: the 14B model at Q4 fits on a single consumer GPU (RTX 3080 or better)

## Supported Languages & Tasks

### Languages
Rust • Go • TypeScript • Python • C# • PostgreSQL • CSS/Tailwind

### Tasks
- Code generation: complete functions, modules, and files from natural language
- Code refactoring: improve the structure and performance of existing code
- Code review: identify bugs, security issues, and possible improvements
- Tool calling: file I/O, shell commands, grep/search (Kode CLI integration)
- Code completion: context-aware completions

## Training Details

### Base Model
[Qwen2.5-Coder-14B](https://huggingface.co/Qwen/Qwen2.5-Coder-14B) and [Qwen2.5-Coder-32B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B)

### Training Pipeline
1. SFT (Supervised Fine-Tuning): Claude-generated training samples across 7 languages (~841 curated queries covering data structures, async, error handling, APIs, testing, and more)
2. DPO (Direct Preference Optimization): preference pairs built from Claude evaluations of model outputs
3. Tool-call SFT: specialized training on tool-calling patterns (`read_file`, `write_file`, `bash_execute`, `grep`, etc.)
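Tool-call SFT teaches the model to emit structured calls that a client such as Kode CLI then executes. Kode CLI's exact schema is not documented here, so the sketch below assumes the Qwen2.5 convention of one JSON object `{"name": ..., "arguments": {...}}` wrapped in `<tool_call>...</tool_call>` tags. The handlers and their argument names (`path`, `command`) are hypothetical stand-ins for the tools listed above, and `bash_execute` is a dry run on purpose.

```python
import json
import re
from typing import Any, Callable, Dict, List

# Assumption: Kode follows the Qwen2.5 convention of one JSON object per
# <tool_call>...</tool_call> block. Kode CLI's real schema may differ.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Extract every well-formed tool call from a model response."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing the agent loop
        if isinstance(payload, dict) and "name" in payload:
            calls.append({"name": payload["name"], "arguments": payload.get("arguments", {})})
    return calls


# Hypothetical handlers; names and argument names are illustrative only.
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as fh:
        return fh.read()


def bash_execute(command: str) -> str:
    # Dry run: a real client would sandbox and execute the command here.
    return f"(dry run) would execute: {command}"


HANDLERS: Dict[str, Callable[..., str]] = {
    "read_file": read_file,
    "bash_execute": bash_execute,
}


def dispatch(call: Dict[str, Any]) -> str:
    """Route a parsed tool call to its handler, reporting unknown tools."""
    handler = HANDLERS.get(call["name"])
    if handler is None:
        return f"error: unknown tool {call['name']!r}"
    return handler(**call["arguments"])
```

Skipping malformed calls rather than raising keeps an agent loop alive; a production client would also report the parse failure back to the model.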

### Infrastructure
- GPU: NVIDIA A100 80GB (2× for the 32B full fine-tune, 1× for QLoRA)
- Framework: Transformers + PEFT + TRL + Unsloth
- LoRA config (32B): r=64, alpha=128, dropout=0.05, targeting all attention and MLP projections
- Precision: bfloat16
- Sequence length: 4096 tokens
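To put the LoRA settings in perspective, the sketch below counts the trainable adapter weights for r=64 on every attention and MLP projection. The layer dimensions are assumptions taken from the published Qwen2.5-32B architecture (hidden size 5120, MLP size 27648, 64 layers, 8 key/value heads of dimension 128); check them against the base model's `config.json`. In PEFT these settings would be expressed as `LoraConfig(r=64, lora_alpha=128, lora_dropout=0.05, target_modules=[...])`.

```python
# Assumed Qwen2.5-32B dimensions (verify against the base model's config.json).
HIDDEN = 5120          # hidden_size
INTERMEDIATE = 27648   # intermediate_size (MLP)
LAYERS = 64            # num_hidden_layers
KV_DIM = 8 * 128       # num_key_value_heads * head_dim (grouped-query attention)

R = 64        # LoRA rank from the config above
ALPHA = 128   # LoRA alpha from the config above

# (in_features, out_features) of each targeted linear projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in PROJECTIONS.values())
total = per_layer * LAYERS
scaling = ALPHA / R  # the low-rank update B @ A is multiplied by alpha / r

print(f"per layer: {per_layer:,}  total: {total:,}  scaling: {scaling}")
```

Under these assumptions the adapters hold about 537M weights, well under 2% of the base model, and alpha=128 with r=64 scales the update by 2.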

### Training Data
- ~841 curated training queries across 7 programming languages
- DPO pairs: Claude-generated reference solutions (chosen) vs. local model outputs (rejected)
- Bilingual prompts (English and German)
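For DPO, each curated query becomes one preference record. The sketch below assumes the column layout that TRL's `DPOTrainer` accepts for standard (non-conversational) data, namely `prompt`, `chosen`, and `rejected` strings. The `Sample` type and the `build_preference_pairs` helper are hypothetical, not the team's actual pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One curated query with two candidate answers (hypothetical structure)."""
    prompt: str
    reference: str     # Claude-generated reference solution -> "chosen"
    local_output: str  # local model output -> "rejected"


def build_preference_pairs(samples: List[Sample]) -> List[Dict[str, str]]:
    """Turn samples into prompt/chosen/rejected records for DPO training.

    Pairs where the local output matches the reference (ignoring surrounding
    whitespace) carry no preference signal, so they are dropped.
    """
    records = []
    for s in samples:
        if s.reference.strip() == s.local_output.strip():
            continue
        records.append({"prompt": s.prompt, "chosen": s.reference, "rejected": s.local_output})
    return records
```

The resulting list can be wrapped with `datasets.Dataset.from_list(...)` before it is handed to a trainer.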

## Usage

### Ollama (Recommended)

```bash
# Pull and run the 14B model (requires a local Ollama installation)
ollama pull simplellm/kode-14b
ollama run simplellm/kode-14b

# Or the larger 32B model
ollama pull simplellm/kode-32b
ollama run simplellm/kode-32b
```

### Ollama API

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "simplellm/kode-14b",
  "messages": [
    {"role": "user", "content": "Write a Rust function to find prime numbers using the Sieve of Eratosthenes"}
  ]
}'
```
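The same request can be sent from Python without extra dependencies. This minimal sketch uses only the standard library, sets `"stream": false` so that Ollama returns a single JSON object whose `message.content` holds the reply, and assumes Ollama is listening on its default port 11434. The helper names are illustrative.

```python
import json
import urllib.request
from typing import Dict, List

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint


def build_chat_request(model: str, messages: List[Dict[str, str]]) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    body = json.dumps({"model": model, "messages": messages, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: Dict) -> str:
    """Pull the assistant text out of a non-streaming /api/chat response."""
    return response["message"]["content"]


def chat(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send one user message and return the reply (needs a running Ollama server)."""
    request = build_chat_request(model, [{"role": "user", "content": prompt}])
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(chat("simplellm/kode-14b", "Write a Rust function to find prime numbers"))` prints the full answer once generation finishes.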

### 🤗 Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "simplellm/kode-14b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # place layers on available GPUs/CPU (requires accelerate)
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a coding assistant. Respond with clean, production-ready code."},
    {"role": "user", "content": "Write a thread-safe LRU cache in Rust using Arc and Mutex"},
]

# Render the conversation with the model's chat template and open the assistant turn
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature and top_p to take effect
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=0.9)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```

### llama.cpp

```bash
# Download the GGUF file
wget https://huggingface.co/simplellm/kode-14b-GGUF/resolve/main/kode-14b-Q8_0.gguf

# Run
./llama-cli -m kode-14b-Q8_0.gguf -p "Write a Go HTTP server with middleware" -n 1024
```
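Depending on the llama.cpp version and flags, `llama-cli` may apply the chat template embedded in the GGUF file on its own. When you send raw completion prompts instead (for example to `llama-server`'s `/completion` endpoint), you have to format the conversation yourself. The sketch below assumes Kode keeps the ChatML template of its Qwen2.5-Coder base; this is not confirmed here, and the template stored in the GGUF metadata is authoritative if it differs.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render chat messages in ChatML, the template used by Qwen2.5 models.

    Assumption: Kode keeps its base model's ChatML template.
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "".join(parts)
```

Generation should stop at `<|im_end|>`, which marks the end of the assistant turn in this format.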

### Hosted Inference

Try Kode without downloading anything at [SimpleLLM.eu](https://simplellm.eu), an EU-hosted, GDPR-compliant inference API.

## Quantized Versions

| Variant | Size | Quality | Speed |
|---------|------|---------|-------|
| kode-14b (FP16) | ~28 GB | Baseline | Baseline |
| kode-14b (Q8) | ~15 GB | Near-lossless | ~1.2× faster |
| kode-14b (Q4) | ~9 GB | Good | ~1.5× faster |
| kode-32b (FP16) | ~64 GB | Best | Slowest |
| kode-32b (Q4) | ~19 GB | Very good | Fast |
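The sizes in this table follow from parameter count times bits per weight. The sketch below reproduces them approximately; the parameter counts (14.7B and 32.5B, as published for the Qwen2.5-Coder bases) and the bits-per-weight figures are assumptions, with Q4 taken to mean `Q4_K_M` at roughly 4.85 bits per weight. Actual VRAM use is higher, because the KV cache and runtime buffers come on top of the weights.

```python
# Approximate bits per weight for common GGUF formats (assumed values; Q4_K_M
# mixes quant types, so its effective rate is an average).
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}

# Assumed parameter counts of the Qwen2.5-Coder base models.
PARAMS = {"kode-14b": 14.7e9, "kode-32b": 32.5e9}


def weight_size_gb(params: float, bpw: float) -> float:
    """Size of the weights alone in decimal gigabytes (no KV cache or overhead)."""
    return params * bpw / 8 / 1e9


for model, n in PARAMS.items():
    for fmt, bpw in BITS_PER_WEIGHT.items():
        print(f"{model} {fmt:>7}: ~{weight_size_gb(n, bpw):.1f} GB")
```

The estimates land within a few percent of the table (for example about 15.6 GB for 14B at Q8_0 and 19.7 GB for 32B at Q4_K_M).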

## Benchmarks

> 🚧 Coming soon: we are running HumanEval, MBPP, MultiPL-E, and tool-calling benchmarks. Results will be published here.

| Benchmark | kode-14b | kode-32b | Qwen2.5-Coder-14B (base) |
|-----------|----------|----------|--------------------------|
| HumanEval | TBD | TBD | TBD |
| MBPP | TBD | TBD | TBD |
| MultiPL-E (Rust) | TBD | TBD | TBD |
| Tool-call accuracy | TBD | TBD | N/A |
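HumanEval and MBPP results are conventionally reported as pass@k. For readers comparing numbers once they appear, the sketch below implements the standard unbiased estimator introduced with HumanEval, where `n` is the number of samples generated per problem and `c` the number that pass the unit tests. It is a generic illustration, not the evaluation harness used for Kode.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: evaluation budget (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k=1 the estimator reduces to the plain pass rate c/n, so pass@1 from 3 passing samples out of 10 is 0.3.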

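## Limitations

- Optimized for the 7 supported programming languages; quality may be lower for others
- Fine-tuned with a 4096-token sequence length (see Training Details), so output quality on longer contexts may degrade
- The tool-calling format follows Kode CLI's tool schema and may not match other agent frameworks
- Training prompts are bilingual (English and German); quality may be lower for prompts in other natural languages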

## License

Apache 2.0 (inherited from [Qwen2.5-Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-32B))

## Citation

```bibtex
@misc{kode2025,
  title={Kode: EU-Trained Coding Models for Real-World Software Engineering},
  author={Kevin and {SimpleLLM Team}},
  year={2025},
  url={https://huggingface.co/simplellm/kode-14b}
}
```

## Links

- 🌐 [SimpleLLM.eu](https://simplellm.eu): hosted inference
- 💻 [Kode CLI](https://github.com/kevco/kode): local coding assistant
- 🤗 [All models](https://huggingface.co/simplellm): Hugging Face collection