Instructions to use dreeseaw/cleo with libraries, inference providers, notebooks, and local apps.

Libraries

How to use dreeseaw/cleo with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="dreeseaw/cleo")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForMultimodalLM

tokenizer = AutoTokenizer.from_pretrained("dreeseaw/cleo")
model = AutoModelForMultimodalLM.from_pretrained("dreeseaw/cleo")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use dreeseaw/cleo with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="dreeseaw/cleo",
	filename="cleo-Q8_0.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use dreeseaw/cleo with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf dreeseaw/cleo:Q8_0
# Run inference directly in the terminal:
llama-cli -hf dreeseaw/cleo:Q8_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf dreeseaw/cleo:Q8_0
# Run inference directly in the terminal:
llama-cli -hf dreeseaw/cleo:Q8_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf dreeseaw/cleo:Q8_0
# Run inference directly in the terminal:
./llama-cli -hf dreeseaw/cleo:Q8_0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf dreeseaw/cleo:Q8_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf dreeseaw/cleo:Q8_0

Use Docker

docker model run hf.co/dreeseaw/cleo:Q8_0


vLLM

How to use dreeseaw/cleo with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "dreeseaw/cleo"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dreeseaw/cleo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/dreeseaw/cleo:Q8_0

SGLang

How to use dreeseaw/cleo with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dreeseaw/cleo" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dreeseaw/cleo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "dreeseaw/cleo" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dreeseaw/cleo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use dreeseaw/cleo with Ollama:
```
ollama run hf.co/dreeseaw/cleo:Q8_0
```

Unsloth Studio

How to use dreeseaw/cleo with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for dreeseaw/cleo to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for dreeseaw/cleo to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for dreeseaw/cleo to start chatting

Pi

How to use dreeseaw/cleo with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf dreeseaw/cleo:Q8_0

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "dreeseaw/cleo:Q8_0"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use dreeseaw/cleo with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf dreeseaw/cleo:Q8_0

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default dreeseaw/cleo:Q8_0

Run Hermes

hermes

Docker Model Runner
How to use dreeseaw/cleo with Docker Model Runner:
```
docker model run hf.co/dreeseaw/cleo:Q8_0
```

Lemonade

How to use dreeseaw/cleo with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull dreeseaw/cleo:Q8_0

Run and chat with the model

lemonade run user.cleo-Q8_0

List all available models

lemonade list

Cleo

Cleo is a small SQL analyst hardel: a Qwen3.5-2B fine-tune paired with a read-only SQL harness/runtime for live database connections. The model is trained to inspect schemas, gather real values from the database, repair SQL from execution feedback, and return analyst-ready read-only queries.

The recommended entry point is the Python package and MCP server: github.com/Dreeseaw/cleo.

pip install "cleo-sql[hf] @ git+https://github.com/Dreeseaw/cleo.git@master"

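import sqlite3

from cleo import Cleo

# conn is a DB-API connection; the SQLite file name here is a placeholder.
conn = sqlite3.connect("company.db")

cleo = Cleo.from_hf("dreeseaw/cleo")
ans = cleo("How many employees are currently in each department?", conn)

print(ans.sql)
print(ans.rows)
print(ans.clarification)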

cleo(...) and cleo.ask(...) use the hardel runtime by default: a greedy candidate plus sampled candidates are executed through the same read-only harness, then selected with product-visible execution evidence. Use cleo.ask_once(...) when you explicitly want a single-candidate, lower-latency path.
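
The candidate-selection idea above can be sketched in plain Python. This is an illustration only, not the package's selector: `select_by_execution`, the majority-vote rule, and the sample table are assumptions made for the example. Each candidate SQL string is executed on a connection that rejects writes, candidates that fail are dropped, and the result set shared by the most candidates wins, with the earliest (greedy) candidate breaking ties.

```python
import sqlite3
from collections import Counter


def select_by_execution(conn, candidates):
    """Pick the candidate whose result set the most candidates agree on.

    Hypothetical stand-in for an evidence selector: candidates[0] is treated
    as the greedy candidate and wins ties.
    """
    results = {}
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # a candidate that fails to execute gets no vote
        # Order-insensitive key so equivalent queries vote together.
        results[sql] = tuple(sorted(rows, key=repr))
    if not results:
        return None
    votes = Counter(results.values())
    best = max(votes.values())
    # Iterate in candidate order so the earliest candidate wins ties.
    for sql in candidates:
        if sql in results and votes[results[sql]] == best:
            return sql


# Build a throwaway in-memory database for the demo.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
demo.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ada", "eng"), ("Bo", "eng"), ("Cy", "ops")],
)
demo.commit()
demo.execute("PRAGMA query_only = ON")  # reject writes from here on

candidates = [
    "SELECT dept, COUNT(*) FROM employees GROUP BY dept",  # greedy
    "SELECT dept, COUNT(name) FROM employees GROUP BY dept ORDER BY dept",
    "SELECT dept, COUNT(*) FROM employee GROUP BY dept",  # misspelled table
    "SELECT dept, 1 FROM employees GROUP BY dept",  # wrong counts
]
chosen = select_by_execution(demo, candidates)
print(chosen)
```

Two of the four candidates return the same rows, so the first of them is selected; the misspelled-table candidate never votes because it fails to execute.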

Files

| File | Purpose |
| --- | --- |
| Root model files | Current hardel Transformers weights in bf16-compatible safetensors format. |
| `v1.4-hardel-v3/` | Archived copy of the current hardel checkpoint. |
| `cleo-Q8_0.gguf` | Legacy llama-cpp-python GGUF alias from a prior release. |
| `cleo_v1_2_bird-no_mtp-Q8_0.gguf` | Prior versioned Q8_0 GGUF artifact. |
| `v1.0/` | Earlier archived tool-use checkpoint. |

For the current release, use the Hugging Face Transformers backend through Cleo.from_hf("dreeseaw/cleo"). The GGUF files are retained for compatibility with earlier runtime paths.

Public Benchmark

All rows below use denotation scoring on BIRD minidev SQLite: predicted and gold SQL are executed, normalized row sets are compared, and the formula_1 database is excluded because of training overlap. BIRD-434 is reported as a broad analytical SQL benchmark, not as the only measure of the product runtime.

| Model / runtime | BIRD-434 execution accuracy |
| --- | --- |
| Cleo v1.4 hardel K=4 | 143/434 = 32.9% |
| Gemini 2.5 Flash | 55.5% |
| DeepSeek Chat | 50.5% |

Cleo was evaluated through the public package runtime with k=4, temperature=0.7, max_gather=3, and max_repair=2.
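
The denotation scoring described above can be sketched with the standard library. The normalization used here (an order-insensitive multiset of rows, with floats rounded to six places) is an assumption for illustration, since this card does not spell out the exact rule, and `denotation_match` and the `orders` table are hypothetical names.

```python
import sqlite3
from collections import Counter


def normalize(rows):
    """Order-insensitive multiset of rows; floats rounded (assumed rule)."""
    def norm_value(value):
        return round(value, 6) if isinstance(value, float) else value
    return Counter(tuple(norm_value(v) for v in row) for row in rows)


def denotation_match(conn, predicted_sql, gold_sql):
    """True when both queries execute and return the same normalized rows."""
    gold = normalize(conn.execute(gold_sql).fetchall())
    try:
        pred = normalize(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False  # a prediction that fails to execute scores zero
    return pred == gold


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (year INTEGER, customer TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(2020, "A"), (2021, "B"), (2021, "C")],
)

gold = "SELECT year, COUNT(*) FROM orders GROUP BY year ORDER BY year"
same = "SELECT year, COUNT(customer) FROM orders GROUP BY year ORDER BY year DESC"
wrong = "SELECT year, COUNT(*) FROM orders WHERE year > 2020 GROUP BY year"
broken = "SELECT nope FROM orders"

pairs = [(same, gold), (wrong, gold), (broken, gold)]
correct = sum(denotation_match(conn, pred, ref) for pred, ref in pairs)
print(f"{correct}/{len(pairs)} = {correct / len(pairs):.1%}")
```

Comparing executed rows rather than SQL text means a prediction that is written differently from the gold query still counts as correct when it returns the same data.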

Model Lineage

Cleo starts from Qwen/Qwen3.5-2B-Base, then adds SQL analyst behavior in stages:

1. Analyst SQL contract SFT: strict JSON outputs, read-only SQL, clarification behavior, and schema-grounded query writing across schema-diverse tasks.
2. Real-schema teacher distillation: on-policy trajectories from larger SQL-capable models teach the first analyst checkpoint to work against realistic database shapes.
3. Tool-use continuation: ECHO-format traces teach gather -> observation -> final behavior, so the model can discover stored values, codes, sentinels, and naming conventions before producing final SQL.
4. Repair and runtime continuation: the steadier full-fine-tuned ECHO branch is continued with train-safe observed-value corrections and replay, emphasizing reliable harness behavior over one-off repair memorization.
5. Hardel runtime selection: the shipped package combines the model with live execution, candidate search, repair, and an evidence selector, making the model and harness one product surface rather than two separate demos.

Runtime Notes

The root model files are current bf16 Transformers weights. They are not stored as int8 weights. For CUDA machines that need lower VRAM, install the optional extra and load with runtime quantization:

pip install "cleo-sql[hf,int8] @ git+https://github.com/Dreeseaw/cleo.git@master"

cleo = Cleo.from_hf("dreeseaw/cleo", quantization="int8")

By default, Cleo.from_hf() asks PyTorch what is available and chooses CUDA, then XPU, then MPS, then CPU. CUDA is the tested fast path for this release; CPU is supported by compatible Transformers installs but is slower.
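
The fallback order described above can be sketched as a small pure function. This is not the package's code: `pick_device` and the `available` mapping are hypothetical, and in practice the flags would come from PyTorch probes such as `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
# Preference order taken from the paragraph above.
DEVICE_ORDER = ("cuda", "xpu", "mps", "cpu")


def pick_device(available):
    """Return the first device in DEVICE_ORDER flagged as available.

    `available` maps device names to booleans. CPU is the final fallback,
    so it is chosen even when the mapping omits it.
    """
    for name in DEVICE_ORDER:
        if available.get(name, False):
            return name
    return "cpu"


# Example: a machine with MPS but no CUDA or XPU.
print(pick_device({"cuda": False, "xpu": False, "mps": True}))
# Example: CUDA is preferred whenever it is available.
print(pick_device({"cuda": True, "mps": True}))
```

Keeping the order in one tuple makes the preference explicit and easy to test without any accelerator present.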

Limitations

- Cleo is designed for analyst SQL workflows over live, read-only database connections.
- Use least-privilege read-only credentials, query timeouts, and normal application-level review for production data access.
- The package supports common DB-API connections and dialect transpilation across SQLite, Postgres, MySQL, and DuckDB-style workflows; validate outputs for production-critical reporting.
- Very large schemas should be scoped with tables= or a provided schema= string so the runtime sees the relevant part of the database.
