Bielik Codex — Polish local coding agent
A custom model built on Bielik-Minitron-7B v3.0 Instruct, tuned to act as an autonomous coding and administration agent. Served through Ollama, which exposes an Anthropic-compatible API, it drives Claude Code and OpenCode fully locally: your code never leaves your machine, and cloud token cost drops to zero.
This is the first Polish model in this family. Its system prompt is focused on real work in a terminal: use tools instead of guessing, write files instead of pasting code, ground every answer in real tool output (never invent results), stay in one language, and act instead of writing tutorials. It defaults to Polish when you write Polish.
What Bielik is (the base)
Bielik is a family of open, Polish large language models developed by the SpeakLeash foundation (a.k.a. Spichlerz) together with ACK Cyfronet AGH — the Academic Computer Centre Cyfronet AGH in Kraków, operator of the supercomputers (Helios, Athena) the model was trained on. It is the flagship Polish sovereign-AI project, built in large part by volunteers and the Polish community.
The specific base here is Bielik-Minitron-7B v3.0 Instruct — a ~7.5B variant pruned with the Minitron technique (NVIDIA, pruning + distillation) from a larger Bielik, in a llama-style architecture (causal decoder-only).
- Developed by: SpeakLeash & ACK Cyfronet AGH
- Languages: multilingual — 32 European languages (including all EU languages), optimized for Polish
- Type: causal decoder-only, ~7.5B parameters (Minitron-pruned)
- License: Apache 2.0
Why it was made — the problem
Stock / official Bielik is an excellent chat model, but out of the box it was not suited to agentic work with tools. Under harnesses (Codex / OpenCode / Claude Code):
- it emitted tool calls as raw JSON in the content ({"name":"exec_command",...}), which Ollama does not parse into message.tool_calls → the harness executed nothing;
- it fabricated tool results (made-up df/nmap output) instead of reading the real one;
- it wrote tutorials and invented non-existent CLI flags instead of acting;
- it got stuck in loops (roundtrip / retry-on-error).
What the tuning fixed
The tuning fixed three bugs in the official template: (1) the tool-call format was switched to native
ChatML <tool_call> blocks, which Ollama parses into tool_calls; (2) handling of the tool role was added,
with grounding ("answer from the real result, do not call again"); (3) the stop tokens were corrected
(<|im_end|> instead of the Llama-style <|eot_id|>). Plus: temperature 0, a hard ban on hallucinated
tool results, no refusals on tool tasks, one tool call per turn, and no tutorials.
The result: from "doesn't agent at all" → "actually executes tools in the two main harnesses".
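To make the first fix concrete, here is a minimal sketch of why the wrapper matters. It is not Ollama's actual parser; the regex and the `extract_tool_calls` helper are illustrative assumptions. It only shows that a call wrapped in <tool_call> tags can be picked out as structured data, while the same JSON left bare in the content is treated as plain text.

```python
import json
import re

# Matches one ChatML-style tool-call block: <tool_call>{...}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return the JSON payload of every well-formed <tool_call> block in text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than guessing at their content
        if isinstance(payload, dict) and "name" in payload:
            calls.append(payload)
    return calls


# Raw JSON in the content (the stock behaviour): nothing is recognised as a call.
raw = '{"name": "exec_command", "arguments": {"cmd": "df -h"}}'
# The same call wrapped in <tool_call> tags: recognised as a structured call.
wrapped = "<tool_call>\n" + raw + "\n</tool_call>"

print(extract_tool_calls(raw))      # []
print(extract_tool_calls(wrapped))  # [{'name': 'exec_command', 'arguments': {'cmd': 'df -h'}}]
```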
Validated harnesses (real test, not just API)
| Harness | Result |
|---|---|
| ollama /api/chat | ✅ real message.tool_calls, grounding, zero hallucination |
| OpenCode | ✅ benchmark 3/3: real df -h, nmap -sn (honestly 1 host up, no fabrication), file written via the write tool |
| Claude Code | ✅ drives CC, calls MCP tools (memory, filesystem) — works as an agent |
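A check like the first row can be reproduced with a short script. The sketch below assumes an Ollama server on its default local port with the model already pulled. The tool name exec_command is borrowed from the raw-JSON example earlier in this card, and its parameter schema is an assumption made for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_chat_request(prompt: str) -> dict:
    """Build an /api/chat payload that offers the model one tool."""
    return {
        "model": "rafw007/bielik-codex",
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "exec_command",  # assumed tool; the schema below is illustrative
                "description": "Run a shell command and return its output",
                "parameters": {
                    "type": "object",
                    "properties": {"cmd": {"type": "string"}},
                    "required": ["cmd"],
                },
            },
        }],
    }


def tool_calls_of(response: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs from message.tool_calls, or [] if absent."""
    calls = response.get("message", {}).get("tool_calls") or []
    return [(c["function"]["name"], c["function"]["arguments"]) for c in calls]


def chat(prompt: str) -> dict:
    """POST the payload to a running Ollama server and return the parsed reply."""
    data = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, `tool_calls_of(chat("Show free disk space"))` returns a non-empty list when the model emits a real tool call, and an empty list when the call only appears as text in the content.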
What it's for
- Driving Claude Code / OpenCode locally (ollama launch claude --model rafw007/bielik-codex).
- Agentic code writing and editing with native function calling / tool use.
- Sysadmin / devops tasks in a real terminal: df, du, nmap with actual output.
- Full privacy and offline operation: no code or prompt is sent to the cloud.
- Polish as a first-class language — natural commands and answers in Polish.
Quick start
ollama run rafw007/bielik-codex
In Claude Code / OpenCode:
ollama launch claude --model rafw007/bielik-codex
opencode run -m ollama/bielik-codex "a concrete command"
Sampling / context
- temperature 0, top_p 0.9, top_k 20, repeat_penalty 1.05, num_ctx 32768.
- The 32K context is baked into the base GGUF (Ollama hard-cap — it cannot be raised via
num_ctx). Claude Code prefers ≥64K, but for concrete, single commands 32K is enough.
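For callers that set parameters per request, the values above map onto an Ollama-style options dictionary. This is a sketch: the `sampling_options` helper and its clamping of num_ctx are conveniences assumed here, not part of the model's own configuration.

```python
MAX_CTX = 32768  # context length this card states is baked into the GGUF


def sampling_options(num_ctx: int = MAX_CTX) -> dict:
    """Return an Ollama-style options dict with the card's sampling values."""
    return {
        "temperature": 0,
        "top_p": 0.9,
        "top_k": 20,
        "repeat_penalty": 1.05,
        # Requests above the baked-in limit are clamped rather than passed through.
        "num_ctx": min(num_ctx, MAX_CTX),
    }


print(sampling_options(65536)["num_ctx"])  # 32768
```

The dictionary can be passed as the "options" field of an /api/chat request body.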
Test hardware
Built and tested on:
- Mac Studio M2, 32 GB RAM, macOS 15.6 (Sequoia) — GPU (Metal) inference
- Mac Mini M4, 32 GB RAM, macOS 15.6 (Sequoia) — GPU (Metal) inference
- Ollama 0.30.0
- Base quantization: Q6_K (~6.1 GB of weights) — fits entirely on the GPU, no CPU spill.
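The weight size can be sanity-checked with back-of-the-envelope arithmetic. The sketch assumes the approximate 7.5B parameter count from this card and the nominal Q6_K rate of 6.5625 bits per weight. Real GGUF files may keep some tensors in other formats, so the actual file size can differ slightly.

```python
PARAMS = 7.5e9            # approximate parameter count from this card
BITS_PER_WEIGHT = 6.5625  # nominal Q6_K rate: 210 bytes per block of 256 weights

size_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9  # bits -> bytes -> gigabytes
print(f"{size_gb:.2f} GB")  # 6.15 GB
```

That estimate is in line with the ~6.1 GB figure above and leaves ample headroom in 32 GB of unified memory.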
Measured behavior
- Tool-calling without hallucination: real message.tool_calls; admin tasks (df, nmap) report the actual result, not a made-up one.
- Acts, doesn't talk: runs a tool instead of writing a tutorial or asking "should I continue".
- Native Polish — no drift to another language on Polish commands.
Known limits (7B ceiling)
- 32768 context baked into the GGUF — cannot be raised via the Modelfile.
- Flakiness: ~50% of the time the first shot can come back empty on an open, abstract task → retry or give concrete, single commands.
- Open abstract tasks → it sometimes confabulates concepts (7B brain ceiling); for heavy, multi-step agentic work a 35B-class model is better.
- Under Claude Code an occasional loop on bad tool arguments was observed, plus the 32k output token limit being exceeded on very long generations.
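The retry advice for empty first shots can be wrapped in a small helper. This is a sketch under assumptions: `ask` stands for any hypothetical function that sends a prompt and returns an Ollama-style response dictionary, and the default of three attempts is arbitrary.

```python
from typing import Callable


def is_empty(response: dict) -> bool:
    """True when a reply has neither text nor tool calls."""
    message = response.get("message", {})
    has_text = bool((message.get("content") or "").strip())
    return not has_text and not message.get("tool_calls")


def ask_with_retry(ask: Callable[[str], dict], prompt: str, attempts: int = 3) -> dict:
    """Call ask(prompt) until the reply is non-empty or attempts run out."""
    response: dict = {}
    for _ in range(attempts):
        response = ask(prompt)
        if not is_empty(response):
            break
    return response
```

If every attempt comes back empty, the last empty response is returned so the caller can decide whether to rephrase the task as a concrete, single command.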
Safety — read before you deploy
An important note, without fear-mongering: the Bielik base is NOT an abliterated model — it ships with built-in, factory refusal mechanisms (this is an honest, "normal" model, not a brakes-removed version).
However, our purely agentic tuning (a system prompt of "always act with a tool, never refuse") inherently loosens those brakes — in a red-team test the tuned build accepted a harmful-code request without refusing. That is the normal price of tuning for agentic productivity, not a flaw of the base itself.
So: use it carefully, and for any public or production deployment wire in a guard layer. The natural choice is Sójka — the guardian model from SpeakLeash — placed as a pre/post-filter on input prompts and responses. With Sójka in front of Bielik you get a healthy refusal layer back without losing agentic capability.
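The pre/post-filter arrangement can be sketched as a wrapper around the agent. Everything named here is hypothetical: `generate` stands for the call into the agent and `is_unsafe` for a call into a guard model. Sójka's real interface is not described in this card, so no specific API is assumed.

```python
from typing import Callable

REFUSAL = "Request blocked by the guard layer."


def guarded(generate: Callable[[str], str],
            is_unsafe: Callable[[str], bool]) -> Callable[[str], str]:
    """Wrap generate() with a pre-filter on the prompt and a post-filter on the reply."""
    def run(prompt: str) -> str:
        if is_unsafe(prompt):   # pre-filter: the agent never sees the prompt
            return REFUSAL
        reply = generate(prompt)
        if is_unsafe(reply):    # post-filter: the reply never reaches the user
            return REFUSAL
        return reply
    return run


# Stand-ins for illustration only: a keyword check in place of a guard model,
# and an echo function in place of the agent.
demo = guarded(lambda p: f"done: {p}", lambda text: "rm -rf /" in text)
print(demo("list files"))        # done: list files
print(demo("run rm -rf / now"))  # Request blocked by the guard layer.
```

Keeping the guard outside the agent means the agentic system prompt stays unchanged, while refusals are decided by a separate component.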
How it was made
Designed, built and tested with the help of Claude Opus. The choice of template, parameters and context configuration comes straight from that work: a frontier coding model preparing a Polish, local model that takes over the job right on your desk.
License
Apache 2.0 (inherited from the base Bielik-Minitron-7B v3.0).
Bielik® is a project of the SpeakLeash foundation and ACK Cyfronet AGH. This model is an independent tune of the public base — it is not an official SpeakLeash release.
Model tree for rafw007/bielik-codex-GGUF
Base model
speakleash/Bielik-11B-v3-Base-20250730