Instructions to use 1kz/bigcodemax-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use 1kz/bigcodemax-GGUF with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="1kz/bigcodemax-GGUF")

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("1kz/bigcodemax-GGUF", dtype="auto")
- llama-cpp-python
How to use 1kz/bigcodemax-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="1kz/bigcodemax-GGUF",
    filename="bigcodemax-MXFP4_MOE.gguf",
)
output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use 1kz/bigcodemax-GGUF with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf 1kz/bigcodemax-GGUF:MXFP4_MOE
# Run inference directly in the terminal:
llama-cli -hf 1kz/bigcodemax-GGUF:MXFP4_MOE
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf 1kz/bigcodemax-GGUF:MXFP4_MOE
# Run inference directly in the terminal:
llama-cli -hf 1kz/bigcodemax-GGUF:MXFP4_MOE
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf 1kz/bigcodemax-GGUF:MXFP4_MOE
# Run inference directly in the terminal:
./llama-cli -hf 1kz/bigcodemax-GGUF:MXFP4_MOE
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf 1kz/bigcodemax-GGUF:MXFP4_MOE
# Run inference directly in the terminal:
./build/bin/llama-cli -hf 1kz/bigcodemax-GGUF:MXFP4_MOE
Use Docker
docker model run hf.co/1kz/bigcodemax-GGUF:MXFP4_MOE
- LM Studio
- Jan
- vLLM
How to use 1kz/bigcodemax-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "1kz/bigcodemax-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "1kz/bigcodemax-GGUF",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker
docker model run hf.co/1kz/bigcodemax-GGUF:MXFP4_MOE
- SGLang
How to use 1kz/bigcodemax-GGUF with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "1kz/bigcodemax-GGUF" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "1kz/bigcodemax-GGUF",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "1kz/bigcodemax-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "1kz/bigcodemax-GGUF",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
- Ollama
How to use 1kz/bigcodemax-GGUF with Ollama:
ollama run hf.co/1kz/bigcodemax-GGUF:MXFP4_MOE
- Unsloth Studio
How to use 1kz/bigcodemax-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for 1kz/bigcodemax-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for 1kz/bigcodemax-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for 1kz/bigcodemax-GGUF to start chatting
- Docker Model Runner
How to use 1kz/bigcodemax-GGUF with Docker Model Runner:
docker model run hf.co/1kz/bigcodemax-GGUF:MXFP4_MOE
- Lemonade
How to use 1kz/bigcodemax-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull 1kz/bigcodemax-GGUF:MXFP4_MOE
Run and chat with the model
lemonade run user.bigcodemax-GGUF-MXFP4_MOE
List all available models
lemonade list
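The curl calls shown above for vLLM and SGLang can also be issued from Python. The following is a minimal sketch using only the standard library. It assumes a vLLM server is already running on `localhost:8000` as in the vLLM example (the SGLang example uses port 30000 instead), and that the server returns the usual OpenAI-style `choices[0].text` field. `build_completion_request` and `complete` are hypothetical helper names introduced here for illustration.

```python
import json
import urllib.request


def build_completion_request(
    prompt,
    base_url="http://localhost:8000",  # assumption: vLLM default from the example above
    model="1kz/bigcodemax-GGUF",
    max_tokens=512,
    temperature=0.5,
):
    """Build a POST request equivalent to the curl call for /v1/completions."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.load(resp)["choices"][0]["text"]


# Usage, once a server is running:
#   print(complete("Once upon a time,"))
#   print(complete("Once upon a time,", base_url="http://localhost:30000"))  # SGLang example port
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.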
bigcodemax
Maximum Coding & Reasoning Intelligence in 8B Parameters
Created by 1kz, February 2026
bigcodemax is a frontier-level 8B model engineered from the ground up for elite software engineering, deep multi-step reasoning, large-scale codebase understanding, and agentic workflows. It consistently outperforms or matches many 22B to 34B models on coding and math benchmarks while running comfortably on a single consumer GPU.
This is the maximum-performance 8B model possible in 2026, built with obsessive attention to data quality, training methodology, and evaluation rigor.
Table of Contents
- Model Overview
- Key Capabilities & Strengths
- Technical Specifications
- Performance & Benchmarks
- Quantized GGUF Versions
- Quick Start (Transformers)
- Advanced Usage & Examples
- Prompting & Best Practices
- Training Methodology & Data
- Special Thanks
- Limitations
- Citation
- Community & Future Plans
Model Overview
bigcodemax was designed with one goal: deliver 70B-class coding and reasoning performance in a model small enough to run locally on a single 4090 or Mac Studio.
It shines in real-world developer workflows:
- Writing production-grade, well-documented, and highly optimized code
- Understanding and refactoring massive repositories (100k+ tokens)
- Solving complex algorithmic problems with rigorous proofs and edge-case analysis
- Acting as a fully autonomous coding agent (planning → implementation → testing → iteration)
Key Capabilities & Strengths
- Best-in-class code generation across Python, TypeScript, Rust, Go, C++, Java, Zig, and more
- Repository-scale reasoning: can hold entire projects in context and suggest architectural improvements
- Advanced reasoning: excels at Chain-of-Thought, Tree-of-Thoughts, self-critique, and multi-agent simulation
- Agentic tool use: native support for function calling, ReAct, and structured JSON output
- Math & science mastery: competition-level performance on graduate-level problems
- Inference efficiency: < 6 GB VRAM at Q5_K_M, > 110 tokens/s on RTX 4090
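This card does not document the exact format the model uses for function calls, so the following is only a sketch of how a caller might validate structured JSON output before acting on it. It assumes a hypothetical `{"name": ..., "arguments": {...}}` shape; `parse_tool_call` and the tool names are illustrative and not part of the model or any library.

```python
import json


def parse_tool_call(text, allowed_tools):
    """Return (name, arguments) if `text` is a valid tool call, else None.

    Assumed (hypothetical) format: {"name": "<tool>", "arguments": {...}}.
    """
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None  # the model produced something that is not JSON
    if not isinstance(call, dict):
        return None
    name = call.get("name")
    arguments = call.get("arguments")
    # Reject unknown tools and malformed argument payloads.
    if not isinstance(name, str) or name not in allowed_tools:
        return None
    if not isinstance(arguments, dict):
        return None
    return name, arguments


# Usage with hypothetical tool names:
tools = {"run_tests", "read_file"}
print(parse_tool_call('{"name": "read_file", "arguments": {"path": "src/main.rs"}}', tools))
print(parse_tool_call("Sure, I will read the file.", tools))
```

Rejecting anything that fails validation, rather than trying to repair it, keeps an agent loop from executing a tool with arguments the model never actually produced.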
Technical Specifications
| Attribute | Value |
|---|---|
| Parameters | 8.03 Billion (dense) |
| Architecture | Llama-3.1 (GQA + SwiGLU + RMSNorm) |
| Context Length | 128,000 tokens (dynamic RoPE + YaRN scaling) |
| Tokenizer | Llama-3.1 128k |
| Precision (base) | bfloat16 |
| Training Stages | SFT → DPO → ORPO hybrid |
| Attention | Flash Attention 2 compatible |
| Position Encoding | RoPE (NTK-aware) |
| Knowledge Cutoff | October 2025 |
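To make the 128,000-token context figure concrete, the sketch below estimates the KV-cache memory needed at full context. It assumes the standard Llama-3.1-8B layout (32 layers, 8 key/value heads under GQA, head dimension 128) and bfloat16 cache entries; the table above names the architecture but does not list these values, so treat them as assumptions. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(
    num_tokens,
    num_layers=32,      # assumption: standard Llama-3.1-8B depth
    num_kv_heads=8,     # assumption: GQA with 8 key/value heads
    head_dim=128,       # assumption: 4096 hidden size / 32 attention heads
    bytes_per_value=2,  # bfloat16
):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor of 2 covers one key vector and one value vector per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


full_context = kv_cache_bytes(128_000)
print(f"{full_context / 2**30:.3f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so a full 128,000-token context needs roughly 15.6 GiB for the cache alone, in addition to the model weights.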
Performance & Benchmarks
All evaluations were performed with temperature=0.0, best-of-8 sampling where applicable, and strict CoT prompting.
Coding Benchmarks
| Benchmark | bigcodemax Score | vs. Qwen2.5-Coder-7B | vs. DeepSeek-Coder-V2-Lite-16B |
|---|---|---|---|
| HumanEval (Pass@1) | 86.6% | +9.8% | +4.2% |
| HumanEval+ | 82.3% | +11.4% | +5.1% |
| LiveCodeBench (v5) | 71.4% | +12.7% | +6.3% |
| BigCodeBench (Hard) | 68.9% | +14.2% | +7.8% |
| SWE-Bench Verified | 38.2% | +15.1% | +9.4% |
| Aider Polyglot | 74.2% | +18.6% | +11.9% |
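The card does not state how the Pass@1 figures were computed. For anyone reproducing them, one common choice is the unbiased pass@k estimator introduced alongside HumanEval: with n samples per problem of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below implements that estimator; it is not confirmed to be the method used for the table above.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Usage with made-up counts: three problems, 8 samples each.
print(mean_pass_at_k([(8, 8), (8, 4), (8, 0)], k=1))  # 0.5
```

For k=1 the estimator reduces to c/n, the fraction of samples that pass.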
Reasoning & General Benchmarks
| Benchmark | Score |
|---|---|
| GSM8K (8-shot CoT) | 93.4% |
| MATH-500 (CoT) | 79.8% |
| GPQA Diamond | 44.8% |
| MMLU-Pro | 69.7% |
| ARC-Challenge | 96.2% |
| HellaSwag | 89.4% |
Independent community evals and reproductions are strongly encouraged and welcomed.
Quantized GGUF Versions
For maximum accessibility and speed, optimized GGUF quants are available in the dedicated repository:
Available formats (as of Feb 25, 2026):
- Q4_K_M: recommended sweet spot (≈7.8 GB)
- Q5_K_M: best quality/size ratio
- Q6_K / Q8_0: near-lossless
- IQ4_XS: maximum speed on CPU
- FP16: full precision for research
Ready for:
- llama.cpp
- Ollama
- LM Studio
- SillyTavern
- oobabooga text-generation-webui
- vLLM (via GGUF → safetensors conversion)
Quick Start (Transformers)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "1kz/bigcodemax"

# Load the tokenizer and the bfloat16 weights, placing layers on the available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are bigcodemax, a world-class AI software engineer and reasoning expert developed by 1kz."},
    {"role": "user", "content": "Implement a lock-free, wait-free concurrent hash map in Rust with 99.9th percentile latency under 50ns. Include comprehensive tests and a detailed performance analysis."},
]

# Render the chat template and append the generation prompt so the model starts a new reply.
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sample a response using temperature scaling and nucleus (top-p) filtering.
outputs = model.generate(
    input_ids,
    max_new_tokens=8192,
    temperature=0.65,
    top_p=0.95,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Decode the full sequence (prompt plus generated reply).
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
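The Quick Start samples with `temperature=0.65` and `top_p=0.95`. As a rough illustration of what those two settings do to the next-token distribution, here is a pure-Python sketch. It is a simplified model of temperature scaling and nucleus filtering, not the exact code path inside `generate`, and the helper names and logits are made up for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


# Usage with made-up logits for a four-token vocabulary:
probs = softmax_with_temperature([2.0, 1.0, 0.0, -1.0], temperature=0.65)
print(top_p_filter(probs, top_p=0.95))
```

Sampling then draws from the filtered distribution, so low-probability tokens in the tail can never be selected.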