Pollux-4B-Judge-mlx-q8
This is an 8-bit quantized MLX conversion of Pollux-4B-Judge (a Qwen3-4B based LLM-as-a-judge model).
It was converted from the original Hugging Face checkpoint using mlx-lm with
8-bit quantization (group size 64), and runs natively on Apple Silicon. On the
sanity-check prompt it produced output identical to the bf16 model, at roughly
half the memory and ~2x the generation throughput.
| Property | Value |
| --- | --- |
| Architecture | Qwen3ForCausalLM (4B) |
| Quantization | 8-bit, group size 64, affine (~8.5 bits/weight) |
| Residual dtype | bfloat16 (non-quantized tensors: norms, scales) |
| On-disk size | ~4.3 GB |
| Peak inference memory | ~4.4 GB |
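The ~8.5 bits/weight figure in the table can be reproduced with a little arithmetic. The sketch below assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias next to its 8-bit values, and assumes a round 4 billion quantized parameters. Both are assumptions made for illustration, and the function names are hypothetical.

```python
def effective_bits_per_weight(bits: int, group_size: int, overhead_bits: int = 32) -> float:
    """Bits per weight once per-group metadata is amortized over the group.

    overhead_bits=32 is an assumption: one 16-bit scale plus one 16-bit bias per group.
    """
    return bits + overhead_bits / group_size


def estimated_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough weight size in GB (10**9 bytes), ignoring non-quantized tensors."""
    return num_params * bits_per_weight / 8 / 1e9


bpw = effective_bits_per_weight(bits=8, group_size=64)
print(bpw)  # 8.5
print(round(estimated_size_gb(4e9, bpw), 2))  # 4.25 (assumes exactly 4e9 parameters)
```

Under these assumptions the estimate lands just below the ~4.3 GB on-disk size; the gap is plausibly the non-quantized tensors and a parameter count slightly above the round figure used here.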
Note:
config.json lists "dtype": "bfloat16"; that is the type of the non-quantized tensors. The actual 8-bit quantization lives in the "quantization" block ({"bits": 8, "group_size": 64, "mode": "affine"}).
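To avoid misreading the dtype field, it can help to summarize both fields together. The following is a minimal sketch that works on an already-parsed config dictionary; the helper name `describe_quantization` is hypothetical, and the inline sample only mirrors the keys quoted in the note rather than the full config.json.

```python
import json


def describe_quantization(config: dict) -> str:
    """Summarize how a parsed config describes the stored weights."""
    quant = config.get("quantization")
    residual = config.get("dtype", "unknown")
    if not quant:
        return f"not quantized; dtype={residual}"
    mode = quant.get("mode", "unspecified")
    return (
        f"{quant['bits']}-bit, group size {quant['group_size']}, mode {mode}; "
        f"non-quantized tensors in {residual}"
    )


# Inline stand-in for the relevant part of config.json (values taken from the note).
sample = json.loads(
    '{"dtype": "bfloat16",'
    ' "quantization": {"bits": 8, "group_size": 64, "mode": "affine"}}'
)
print(describe_quantization(sample))
# 8-bit, group size 64, mode affine; non-quantized tensors in bfloat16
```

For a downloaded copy of the model, the same function could be applied to the result of `json.load` on the local config.json file.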
Installation
Requires Apple Silicon (M-series). Install mlx-lm:
pip install mlx-lm
You can run the model straight from the Hub (replace <your-hf-username> with the
account you upload it to) or from a local path.
Prompt format
The model is an LLM-as-a-judge: it scores one response against one criterion
and returns a numeric score (plus a <think> rationale). Format each evaluation as a
single user message using this template:
### Задание для оценки:
{instruction}
### Эталонный ответ:
{reference_answer}
### Ответ для оценки:
{answer}
### Критерий оценки:
{criteria_name}
### Шкала оценивания по критерию:
{criteria_rubrics}
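The section headings mean, in order: task to evaluate, reference answer, answer to evaluate, evaluation criterion, and scoring scale for the criterion. Keep the Russian headings verbatim, as the template specifies. The ### Эталонный ответ: (reference answer) block is optional; drop it when you have
no gold answer.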
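Because the reference block is optional, assembling the prompt programmatically avoids leaving an empty section behind. Here is a minimal sketch of one way to do that; the function name `build_judge_prompt` is hypothetical, and the blank line between sections is an assumption taken from the escaped prompt strings used in this card's CLI and curl examples.

```python
from typing import Optional

SECTION_SEP = "\n\n"  # assumption: sections are separated by a blank line


def build_judge_prompt(
    instruction: str,
    answer: str,
    criteria_name: str,
    criteria_rubrics: str,
    reference_answer: Optional[str] = None,
) -> str:
    """Assemble the judge prompt, skipping the reference block when absent."""
    sections = [("### Задание для оценки:", instruction)]
    if reference_answer is not None:
        sections.append(("### Эталонный ответ:", reference_answer))
    sections.extend(
        [
            ("### Ответ для оценки:", answer),
            ("### Критерий оценки:", criteria_name),
            ("### Шкала оценивания по критерию:", criteria_rubrics),
        ]
    )
    return SECTION_SEP.join(f"{heading}\n{body}" for heading, body in sections)


prompt = build_judge_prompt(
    instruction="Сколько будет 2+2?",
    answer="Будет 4",
    criteria_name="Правильность ответа",
    criteria_rubrics="0: Дан неправильный ответ или ответ отсутствует.",
)
print(prompt)
```

The resulting string goes into a single user message, exactly as the hand-written template would.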
Quick start (CLI)
mlx_lm.generate --model <your-hf-username>/Pollux-4B-Judge-mlx-q8 --temp 0.0 --max-tokens 512 \
--prompt $'### Задание для оценки:\nСколько будет 2+2?\n\n### Эталонный ответ:\n4\n\n### Ответ для оценки:\nБудет 4\n\n### Критерий оценки:\nПравильность ответа\n\n### Шкала оценивания по критерию:\n0: Дан неправильный ответ или ответ отсутствует.\n1: Ответ модели неполный.\n2: Ответ модели совпадает с эталонным или эквивалентен ему.'
Python
from mlx_lm import load, generate
model, tokenizer = load("<your-hf-username>/Pollux-4B-Judge-mlx-q8")
# Sections are separated by a blank line, matching the CLI and curl examples.
PROMPT_TEMPLATE = """### Задание для оценки:
{instruction}

### Эталонный ответ:
{reference_answer}

### Ответ для оценки:
{answer}

### Критерий оценки:
{criteria_name}

### Шкала оценивания по критерию:
{criteria_rubrics}
"""
prompt = PROMPT_TEMPLATE.format(
    instruction="Сколько будет 2+2?",
    reference_answer="4",
    answer="Будет 4",
    criteria_name="Правильность ответа",
    criteria_rubrics=(
        "0: Дан неправильный ответ или ответ отсутствует.\n"
        "1: Ответ модели неполный.\n"
        "2: Ответ модели совпадает с эталонным или эквивалентен ему."
    ),
)
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(generate(model, tokenizer, prompt=text, max_tokens=512, verbose=True))
OpenAI-compatible server + curl
Start the server (it exposes /v1/chat/completions on port 8080):
mlx_lm.server --model <your-hf-username>/Pollux-4B-Judge-mlx-q8 --port 8080
Then send a judge request:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<your-hf-username>/Pollux-4B-Judge-mlx-q8",
"temperature": 0.0,
"max_tokens": 512,
"messages": [
{"role": "user", "content": "### Задание для оценки:\nСколько будет 2+2?\n\n### Эталонный ответ:\n4\n\n### Ответ для оценки:\nБудет 4\n\n### Критерий оценки:\nПравильность ответа\n\n### Шкала оценивания по критерию:\n0: Дан неправильный ответ или ответ отсутствует.\n1: Ответ модели неполный.\n2: Ответ модели совпадает с эталонным или эквивалентен ему."}
]
}'
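The same request can be sent from Python with only the standard library. This is a sketch rather than an official client: the helper names are hypothetical, and it assumes the server answers in the usual OpenAI chat-completions shape, with the text under `choices[0].message.content`.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # matches the --port 8080 server command


def build_payload(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build a chat-completions request body for one judge evaluation."""
    return {
        "model": model,
        "temperature": 0.0,  # deterministic scoring, as in the curl example
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }


def send_judge_request(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload and return the assistant message text.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, with the server running:
# payload = build_payload("<your-hf-username>/Pollux-4B-Judge-mlx-q8", prompt)
# print(send_judge_request(payload))
```

Keeping payload construction separate from the network call makes it easy to inspect or log the exact request before sending it.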
The model returns a short <think> rationale followed by the numeric score (2 for
this example; 0 if you swap in a wrong answer).
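For automated pipelines, the rationale and the score usually need to be separated. The sketch below assumes the rationale is wrapped in a `<think>...</think>` pair and that the first integer after it is the score; the function name is hypothetical and the sample string is made up for illustration, not captured model output.

```python
import re
from typing import Optional, Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_judge_output(text: str) -> Tuple[str, Optional[int]]:
    """Split judge output into (rationale, score); score is None if no integer is found."""
    match = THINK_BLOCK.search(text)
    rationale = match.group(1).strip() if match else ""
    # Only look for the score after the rationale, so digits inside it are ignored.
    remainder = text[match.end():] if match else text
    number = re.search(r"-?\d+", remainder)
    score = int(number.group()) if number else None
    return rationale, score


print(parse_judge_output("<think>Matches the reference.</think>\n2"))
# ('Matches the reference.', 2)
```

Returning `None` instead of raising lets a caller decide how to handle truncated generations, for example when `max_tokens` cuts the output off before the score.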
See the original model card for evaluation details, full rubrics, and intended use.