Instructions to use UraionLabs/Uraion-Agent-Small with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use UraionLabs/Uraion-Agent-Small with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="UraionLabs/Uraion-Agent-Small",
	filename="Uraion-Agent-Small-F16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)


llama.cpp

How to use UraionLabs/Uraion-Agent-Small with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf UraionLabs/Uraion-Agent-Small:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf UraionLabs/Uraion-Agent-Small:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf UraionLabs/Uraion-Agent-Small:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf UraionLabs/Uraion-Agent-Small:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf UraionLabs/Uraion-Agent-Small:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf UraionLabs/Uraion-Agent-Small:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf UraionLabs/Uraion-Agent-Small:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf UraionLabs/Uraion-Agent-Small:Q4_K_M

Use Docker

docker model run hf.co/UraionLabs/Uraion-Agent-Small:Q4_K_M

vLLM

How to use UraionLabs/Uraion-Agent-Small with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "UraionLabs/Uraion-Agent-Small"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "UraionLabs/Uraion-Agent-Small",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/UraionLabs/Uraion-Agent-Small:Q4_K_M

Ollama
How to use UraionLabs/Uraion-Agent-Small with Ollama:
```
ollama run hf.co/UraionLabs/Uraion-Agent-Small:Q4_K_M
```

Unsloth Studio

How to use UraionLabs/Uraion-Agent-Small with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for UraionLabs/Uraion-Agent-Small to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for UraionLabs/Uraion-Agent-Small to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for UraionLabs/Uraion-Agent-Small to start chatting

Pi

How to use UraionLabs/Uraion-Agent-Small with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf UraionLabs/Uraion-Agent-Small:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "UraionLabs/Uraion-Agent-Small:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use UraionLabs/Uraion-Agent-Small with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf UraionLabs/Uraion-Agent-Small:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default UraionLabs/Uraion-Agent-Small:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use UraionLabs/Uraion-Agent-Small with Docker Model Runner:
```
docker model run hf.co/UraionLabs/Uraion-Agent-Small:Q4_K_M
```

Lemonade

How to use UraionLabs/Uraion-Agent-Small with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull UraionLabs/Uraion-Agent-Small:Q4_K_M

Run and chat with the model

lemonade run user.Uraion-Agent-Small-Q4_K_M

List all available models

lemonade list

Uraion Labs
Foundational systems research.

Uraion-Agent-Small
A compact tool-calling agent model — fine-tuned from first principles.

Uraion-Agent-Small is a 2-billion parameter model fine-tuned from Qwen/Qwen3.5-2B for agentic tool use and function calling. It is a research artifact in Uraion Labs' systems-first approach: studying the harness, orchestration, evaluation, and deployment layers that make foundation models useful in real workflows.

This model was trained via QLoRA (4-bit NF4 base + LoRA adapters, merged for deployment simplicity) on a curated mix of function-calling and instruction-following datasets — prioritizing data signal over data volume, in keeping with our systems philosophy.

Intelligence is a systems problem. This model is one piece of that system.

Quick navigation

Audience	Recommended path
Just want to chat	LM Studio / Ollama — one click
Building an agent	Function calling with llama-cpp-python or vLLM server
CPU-only / edge	CPU inference (slow) or Convert to proper GGUF
GPU (6 GB+ VRAM)	Transformers + bitsandbytes
Apple Silicon	MLX / LM Studio
Troubleshooting	I can't load the GGUF files

Systems philosophy

Stage	This model's role
Model	Qwen3.5-2B — hybrid linear + full attention, 262K native context
Harness	QLoRA fine-tuned for structured tool call output via `qwen3_coder` parser
Orchestrate	Multi-turn function calling, API composition, agent loops
Evaluate	Benchmarked on BFCL-v4, IFEval; tested in real multi-turn agent workflows
Adapt	4-bit merged — runs on consumer GPUs, deployable via vLLM
Deploy	OpenAI-compatible API, local-first, no opaque cloud dependence

This model sits in the Harness layer of our research pipeline — the tooling and runtime that makes foundation models useful, inspectable, and composable.

Model Details

Property	Value
Base model	Qwen/Qwen3.5-2B
Architecture	qwen35 hybrid (24 layers: 18 Gated DeltaNet linear + 6 full-attention every 4th)
Context length	262,144 tokens (native, inherited)
Parameters	~1.9B total, 21.8M LoRA trainable
Precision	4-bit NF4 (QLoRA base), LoRA in BF16, merged to 4-bit
License	Apache 2.0 (inherited from Qwen3.5)
Tool parser	`qwen3_coder` (native vLLM support)
On-disk size	~2.6 GB (Transformers NF4); GGUF variants range 1.5–3.4 GB
Hub layout	GGUF files at repo root (quantization selector); NF4 Transformers weights in `transformers/`

Hybrid architecture

Qwen3.5-2B uses a hybrid attention design: 18 Gated DeltaNet (linear attention) layers for efficient long-context inference, interleaved with 6 full-attention layers every 4th position for full expressive power where it matters. This is the systems-over-scale principle applied at the architecture level — better composition of attention mechanisms, not just more parameters.
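
The interleaving can be made concrete with a short sketch. It assumes the full-attention layers sit at every 4th position counting from 1 (layer indices 3, 7, 11, and so on); that exact placement is an assumption, and only the 18/6 split comes from the description above.

```python
# Hypothetical layout: every 4th layer (1-based) is full attention,
# the rest are Gated DeltaNet linear-attention layers.
NUM_LAYERS = 24
FULL_ATTENTION_INTERVAL = 4

def layer_types(num_layers=NUM_LAYERS, interval=FULL_ATTENTION_INTERVAL):
    """Return one layer-type label per layer index."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]

types = layer_types()
print(types.count("linear_attention"), types.count("full_attention"))  # 18 6
```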

⚠️ Known issues before you start

1. GGUF files have an open shape issue

The GGUF files at repo root were generated from the NF4 QLoRA weights without a full dequantization step. As a result, some tensors have incorrect shapes (1×N instead of 2D), and certain llama.cpp / llama-cpp-python builds reject them with:

check_tensor_dims: tensor 'blk.0.attn_qkv.weight' has wrong shape;
expected 2048, 6144, got 1, 6291456, 1, 1

Workaround: Use the Transformers + bitsandbytes path instead (see below). If you need a working GGUF, follow the conversion guide to rebuild from the NF4 weights.
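
The numbers in the check_tensor_dims message above are consistent with 4-bit packing, which a little arithmetic confirms. The sketch assumes bitsandbytes stores two 4-bit values per byte, so a packed tensor holds half as many elements as the logical weight matrix.

```python
# Logical shape llama.cpp expects for blk.0.attn_qkv.weight (from the error message)
expected_rows, expected_cols = 2048, 6144
logical_elements = expected_rows * expected_cols

# NF4 packs two 4-bit values into each stored byte (assumption stated above)
VALUES_PER_BYTE = 2
packed_elements = logical_elements // VALUES_PER_BYTE

print(logical_elements)  # 12582912
print(packed_elements)   # 6291456, the flat size reported in the error
```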

2. Qwen3.5 `qwen35` architecture requires a recent llama.cpp

The hybrid attention arch (qwen35) was added to llama.cpp in mid-2026. If you're using llama-cpp-python:

Pre-built CPU/GPU wheels at version 0.3.32 do not support qwen35
You must install from git or compile from source with the latest llama.cpp

3. NF4 is slow on CPU

The Transformers weights are stored in bitsandbytes NF4 format. On GPU this is fast, but on CPU each weight is dequantized on-the-fly — expect 0.1–0.5 tok/s at 2B params. For CPU use, prefer GGUF after conversion.

Setup guides by use case

LM Studio / Ollama (recommended for beginners)

LM Studio (and soon Ollama) can pull models directly from HuggingFace. The model card renders a GGUF variant selector — pick one and click "Open in LM Studio".

For Ollama, import a GGUF file manually:

# After downloading e.g. Q4_K_M from the repo
ollama create uraion-agent-small -f ./Modelfile
ollama run uraion-agent-small

With a Modelfile:

FROM ./Uraion-Agent-Small-Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER temperature 0.0
PARAMETER top_p 0.95
PARAMETER stop "<|im_end|>"

Note: If the GGUF file fails to load in your runner, switch to the Transformers + bitsandbytes path below.
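
To see what prompt the TEMPLATE in the Modelfile above produces, the same layout can be reproduced in a few lines of Python. `render_chatml` is a hypothetical helper written for illustration; it mirrors that single-turn template and is not how Ollama renders prompts internally.

```python
def render_chatml(prompt, system=None):
    """Build a single-turn ChatML prompt matching the Modelfile TEMPLATE."""
    parts = []
    if system:
        # The system block is emitted only when a system prompt is set
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{prompt}<|im_end|>\n")
    # Leave the assistant turn open so the model continues from here
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

print(render_chatml("What is the capital of France?", system="You are a helpful assistant."))
```

Generation stops when the model emits `<|im_end|>`, which is why the Modelfile sets it as the stop parameter.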

Transformers + bitsandbytes (GPU, 6 GB VRAM)

This is the most reliable path: it uses the original NF4 weights as published.

Requirements

Python 3.10+
transformers>=5.12.0
bitsandbytes>=0.46.1
torch>=2.0 (CUDA or CPU)
6 GB free disk; ~4 GB VRAM in use on GPU (a 6 GB card leaves headroom) or ~8 GB RAM on CPU

Installation

pip install "transformers>=5.12.0" "bitsandbytes>=0.46.1" "torch>=2.0" sentencepiece accelerate

Basic inference

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UraionLabs/Uraion-Agent-Small"
subfolder = "transformers"

tokenizer = AutoTokenizer.from_pretrained(
    model_id, subfolder=subfolder, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder=subfolder,
    trust_remote_code=True, device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding; temperature has no effect when do_sample=False, so it is omitted
outputs = model.generate(
    **inputs, max_new_tokens=256, do_sample=False,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Function calling (agentic)

import torch, json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UraionLabs/Uraion-Agent-Small"
subfolder = "transformers"

tokenizer = AutoTokenizer.from_pretrained(
    model_id, subfolder=subfolder, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder=subfolder,
    trust_remote_code=True, device_map="auto",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        }
    }
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to function calling. When the user asks about weather, use the get_weather tool."},
    {"role": "user", "content": "What's the weather like in Paris?"}
]

# Inject tool definitions into system message
tool_text = json.dumps({"tools": tools})
sys_msg = messages[0]["content"] + "\n\nAvailable tools:\n" + tool_text
messages[0]["content"] = sys_msg

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding; temperature has no effect when do_sample=False, so it is omitted
outputs = model.generate(
    **inputs, max_new_tokens=512, do_sample=False,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
# Expected: <tool_call>\n<function=get_weather>\n<parameter=location>\nParis\n</parameter>\n</function>\n</tool_call>
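
When generating through Transformers directly, no server-side parser turns the generated text into structured tool calls. The sketch below parses the `<tool_call>` format shown in the expected output using regular expressions. `parse_tool_calls` is a hypothetical helper, not part of any library; it assumes well-formed tags and treats every parameter value as a string.

```python
import re

# Patterns for the tag format shown in the expected output above
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)

def parse_tool_calls(text):
    """Extract [{"name": ..., "arguments": {...}}] from generated text."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in FUNCTION_RE.findall(block):
            arguments = {
                key: value.strip()
                for key, value in PARAMETER_RE.findall(body)
            }
            calls.append({"name": name, "arguments": arguments})
    return calls

sample = (
    "<tool_call>\n<function=get_weather>\n<parameter=location>\n"
    "Paris\n</parameter>\n</function>\n</tool_call>"
)
print(parse_tool_calls(sample))
# [{'name': 'get_weather', 'arguments': {'location': 'Paris'}}]
```

Text without any `<tool_call>` block yields an empty list, which the caller can treat as a plain chat reply.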

llama-cpp-python (GPU or CPU)

For GGUF files. Requires a recent build with qwen35 architecture support.

Installation (build from source)

# CPU only (fastest build)
CMAKE_ARGS="-DGGML_CUDA=off" pip install llama-cpp-python \
  --no-binary llama-cpp-python

# CUDA (takes 5-10 min, needs nvcc)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python \
  --no-binary llama-cpp-python

# Or from git for the absolute latest llama.cpp
CMAKE_ARGS="-DGGML_CUDA=off" pip install \
  "llama-cpp-python @ git+https://github.com/abetlen/llama-cpp-python.git" \
  --no-build-isolation

Basic inference

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="UraionLabs/Uraion-Agent-Small",
    filename="Uraion-Agent-Small-Q4_K_M.gguf",  # or Q6_K, Q3_K_M, etc.
    n_ctx=8192,
    n_gpu_layers=-1,  # -1 = all on GPU, 0 = CPU only
    flash_attn=True,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.0,
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])

Function calling with tool-use

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        }
    }
]

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant with access to function calling."},
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ],
    tools=tools,
    temperature=0.0,
    max_tokens=512,
)

if response["choices"][0]["message"].get("tool_calls"):
    for tc in response["choices"][0]["message"]["tool_calls"]:
        print(f"Tool: {tc['function']['name']}")
        print(f"Args: {tc['function']['arguments']}")

Troubleshooting: If you get ValueError: Failed to load model from file, your llama-cpp-python version is too old and doesn't support the qwen35 architecture. Build from source as shown above, or use the Transformers + bitsandbytes path.

vLLM (OpenAI-compatible API server, recommended for production agents)

For production agent deployments, use the Transformers weights from the transformers/ subfolder:

pip install vllm

# Serve the model (NF4 weights, requires bitsandbytes)
vllm serve UraionLabs/Uraion-Agent-Small \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype auto

OpenAI-compatible client (works with LangChain, AutoGen, CrewAI, etc.):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="UraionLabs/Uraion-Agent-Small",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }],
    temperature=0.0,
)
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    for tc in tool_calls:
        print(f"{tc.function.name}({tc.function.arguments})")
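
A tool call only becomes useful once its result is sent back to the model. The sketch below shows one way to run the requested functions locally and build the follow-up `tool` messages in the OpenAI chat format. `TOOL_REGISTRY`, `get_weather`, and `run_tool_calls` are hypothetical names, and the tool calls are plain dicts shaped like the API response rather than live client objects.

```python
import json

def get_weather(location):
    """Stand-in tool; a real implementation would query a weather service."""
    return {"location": location, "forecast": "sunny"}

# Hypothetical registry mapping tool names to local Python callables
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each tool call and return OpenAI-style 'tool' role messages."""
    messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        try:
            # Arguments arrive as a JSON string and may be malformed
            arguments = json.loads(call["function"]["arguments"])
            result = registry[name](**arguments)
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Report the failure back to the model instead of crashing the loop
            result = {"error": f"{type(exc).__name__}: {exc}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages

example_calls = [{
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'},
}]
print(run_tool_calls(example_calls))
```

In a real loop, the assistant message that contained the tool calls is appended to the conversation first, followed by these `tool` messages, before the next `client.chat.completions.create` call.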

LangChain integration example

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    # Must match the name vLLM serves the model under (the repo ID by default)
    model="UraionLabs/Uraion-Agent-Small",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    temperature=0.0,
)

# Define tools

@tool
def get_weather(location: str) -> str:
    """Get the current weather for a city."""
    return f"The weather in {location} is sunny, 22°C."

tools = [get_weather]
llm_with_tools = llm.bind_tools(tools)

response = llm_with_tools.invoke("What's the weather in Paris?")
print(response.tool_calls)

CPU-only / edge devices (slow)

Running NF4 weights on CPU is possible but slow (~0.1 tok/s). Use this for verification or throwaway agent loops on low-end hardware.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
os.environ["BITSANDBYTES_NOWELCOME"] = "1"

model_id = "UraionLabs/Uraion-Agent-Small"
subfolder = "transformers"

tokenizer = AutoTokenizer.from_pretrained(
    model_id, subfolder=subfolder, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder=subfolder,
    trust_remote_code=True, device_map="cpu",
)

# Use very short max_new_tokens to keep wait times bearable
messages = [{"role": "user", "content": "Hello, what's 2+2?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

# Greedy decoding; temperature has no effect when do_sample=False, so it is omitted
outputs = model.generate(
    **inputs, max_new_tokens=64, do_sample=False,
)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

If you need usable CPU speed, follow the GGUF conversion guide below, then use llama.cpp with Q4_K_M — expect ~5–10 tok/s on a modern CPU.
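
The gap between the two CPU paths is easier to judge as wall-clock time. This back-of-the-envelope sketch uses the throughput figures quoted in this card (~0.1 tok/s for NF4 on CPU, and 8 tok/s as a value inside the ~5–10 tok/s range for Q4_K_M); it ignores prompt processing and model loading, so real waits are longer.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Rough decode time, ignoring prompt processing and model load."""
    return num_tokens / tokens_per_second

# Throughput figures quoted in this card
nf4_cpu = generation_seconds(64, 0.1)    # NF4 via bitsandbytes on CPU
gguf_cpu = generation_seconds(64, 8.0)   # Q4_K_M via llama.cpp on CPU

print(f"NF4 on CPU:  {nf4_cpu / 60:.1f} min for 64 tokens")
print(f"GGUF on CPU: {gguf_cpu:.0f} s for 64 tokens")
```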

Apple Silicon (MLX / LM Studio)

The GGUF variants run under LM Studio and llama.cpp on Apple Silicon, provided the files load (see the GGUF shape issue under Known issues):

# Via llama.cpp (after downloading a GGUF file)
./llama-cli -m Uraion-Agent-Small-Q4_K_M.gguf \
  -p "What city is the capital of France?" \
  -n 128 -t 8

For MLX, convert from the Transformers weights:

pip install mlx-lm
mlx_lm.convert --hf-path UraionLabs/Uraion-Agent-Small \
  --subfolder transformers

mlx_lm.generate --model ./mlx_model \
  --prompt "What is the capital of France?" \
  --temp 0.0

Fixing the GGUF files (advanced)

If you need working GGUF files (for Ollama, LM Studio, or speed on CPU), rebuild them from the NF4 weights. This is a one-time procedure.

Step 1: Dequantize NF4 → FP32

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UraionLabs/Uraion-Agent-Small"
subfolder = "transformers"

# Load the NF4 model (this is the critical step)
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder=subfolder,
    trust_remote_code=True, device_map="cpu",
)
tokenizer = AutoTokenizer.from_pretrained(
    model_id, subfolder=subfolder, trust_remote_code=True
)

# Unpack the 4-bit weights into regular tensors first: Transformers rejects a
# direct .to(dtype) cast on a bitsandbytes-quantized model. This assumes a
# Transformers/bitsandbytes version where dequantize() is available on CPU.
model = model.dequantize()

# Now cast to FP32 and save
model = model.to(torch.float32)
model.save_pretrained("./uraion-agent-small-fp32", safe_serialization=True)
tokenizer.save_pretrained("./uraion-agent-small-fp32")

Step 2: Convert FP32 safetensors → GGUF FP16

Using the convert_hf_to_gguf.py script from llama.cpp:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
pip install -r requirements.txt

python convert_hf_to_gguf.py \
  ./uraion-agent-small-fp32 \
  --outfile ./uraion-agent-small-f16.gguf \
  --outtype f16

Step 3: Quantize to GGUF variants

# Build llama-quantize
cmake -B build -DGGML_CUDA=OFF
cmake --build build --target llama-quantize -j4

# Create all quant variants
./build/bin/llama-quantize ./uraion-agent-small-f16.gguf ./Q4_K_M.gguf Q4_K_M
./build/bin/llama-quantize ./uraion-agent-small-f16.gguf ./Q5_K_M.gguf Q5_K_M
./build/bin/llama-quantize ./uraion-agent-small-f16.gguf ./Q6_K.gguf  Q6_K
# ... and any other quants you need

Step 4: Verify

./build/bin/llama-cli -m ./Q4_K_M.gguf \
  -p "What is the capital of France?" \
  -n 20 --temp 0

This produces a fully working, slimmed-down GGUF that loads in any llama.cpp-based runner.

GGUF Quantizations available

Filename	Type	Size	Quality
`Uraion-Agent-Small-F16.gguf`	F16	~3.4 GB	Reference
`Uraion-Agent-Small-Q6_K.gguf`	Q6_K	~2.2 GB	Very high
`Uraion-Agent-Small-Q5_K_M.gguf`	Q5_K_M	~2.0 GB	High
`Uraion-Agent-Small-Q5_K_S.gguf`	Q5_K_S	~1.9 GB	High
`Uraion-Agent-Small-Q4_K_M.gguf`	Q4_K_M	~1.9 GB	Good (recommended)
`Uraion-Agent-Small-Q4_K_S.gguf`	Q4_K_S	~1.8 GB	Good
`Uraion-Agent-Small-Q3_K_L.gguf`	Q3_K_L	~1.7 GB	Acceptable
`Uraion-Agent-Small-Q3_K_M.gguf`	Q3_K_M	~1.6 GB	Acceptable
`Uraion-Agent-Small-Q3_K_S.gguf`	Q3_K_S	~1.5 GB	Acceptable
`Uraion-Agent-Small-Q2_K.gguf`	Q2_K	~1.9 GB	Low
`Uraion-Agent-Small-IQ4_XS.gguf`	IQ4_XS	~1.9 GB	Good+ (I-quant)
`Uraion-Agent-Small-IQ3_XXS.gguf`	IQ3_XXS	~1.8 GB	Good (I-quant)
`Uraion-Agent-Small-IQ3_XS.gguf`	IQ3_XS	~1.8 GB	Good (I-quant)
`Uraion-Agent-Small-IQ3_S.gguf`	IQ3_S	~1.8 GB	Good (I-quant)
`Uraion-Agent-Small-IQ3_M.gguf`	IQ3_M	~1.8 GB	Good (I-quant)
`Uraion-Agent-Small-IQ2_XXS.gguf`	IQ2_XXS	~1.8 GB	Acceptable (I-quant)
`Uraion-Agent-Small-IQ2_XS.gguf`	IQ2_XS	~1.8 GB	Acceptable (I-quant)
`Uraion-Agent-Small-IQ2_S.gguf`	IQ2_S	~1.9 GB	Acceptable (I-quant)
`Uraion-Agent-Small-IQ2_M.gguf`	IQ2_M	~1.9 GB	Acceptable (I-quant)
`Uraion-Agent-Small-IQ1_S.gguf`	IQ1_S	~1.8 GB	Low (I-quant)
`Uraion-Agent-Small-IQ1_M.gguf`	IQ1_M	~1.8 GB	Low (I-quant)

Note: Q8_0, Q4_0, Q5_0, Q5_1, and IQ4_NL are unavailable — Qwen3.5's hybrid architecture (Gated DeltaNet) has irregular 1D tensors incompatible with those block quant formats.

Hardware comparison

Setup	Memory needed	Speed	Quality	Effort
vLLM (A100/H100)	4 GB VRAM	~2000 tok/s	N/A	Low
vLLM (RTX 3090/4090)	6 GB VRAM	~500 tok/s	N/A	Low
Transformers + bitsandbytes (GPU)	6 GB VRAM	~50 tok/s	Good	Low
llama.cpp (GPU offload, Q4_K_M)	4 GB VRAM	~80 tok/s	Good	Medium
llama.cpp (CPU, Q4_K_M)	4 GB RAM	~8 tok/s	Good	Medium
Transformers + bitsandbytes (CPU)	8 GB RAM	~0.1 tok/s	Good	Low
MLX (Apple Silicon, M2+)	8 GB unified	~40 tok/s	Good	Low
Ollama / LM Studio	4 GB	~8 tok/s	Good	Minimal

Troubleshooting

"Failed to load model from file" with GGUF

Cause: Your llama.cpp / llama-cpp-python version doesn't support the qwen35 architecture.

Fix: Build from source (see llama-cpp-python section) or switch to Transformers + bitsandbytes.

"check_tensor_dims: tensor ... has wrong shape" with GGUF

Cause: The GGUF files were generated from NF4-packed weights without a full dequantization step. This is a known issue (see above).

Fix: Use the Transformers + bitsandbytes path instead, or rebuild the GGUFs following the conversion guide.

"CUDA error: out of memory"

Cause: The NF4 model still uses ~4 GB VRAM when dequantized during forward passes.

Fix: Run the model on CPU (device_map="cpu"; this moves the whole model off the GPU rather than offloading part of it) or use a smaller GGUF variant (Q3_K_M, IQ2_XXS).

"bitsandbytes requires CUDA"

The Transformers NF4 weights require bitsandbytes, which needs either CUDA or a recent CPU-compatible version:

pip install -U "bitsandbytes>=0.46.1"

On CPU-only systems, bitsandbytes 0.46+ has basic CPU support. Expect slow inference.

"Cannot use chat template functions because tokenizer.chat_template is not set"

The tokenizer on HuggingFace Hub doesn't include the chat template in its config. Load it manually:

with open("transformers/chat_template.jinja") as f:
    tokenizer.chat_template = f.read()

Or for remote loading:

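import requests
url = "https://huggingface.co/UraionLabs/Uraion-Agent-Small/raw/main/transformers/chat_template.jinja"
resp = requests.get(url, timeout=30)
resp.raise_for_status()  # fail loudly instead of storing an error page as the template
tokenizer.chat_template = resp.text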

Ollama / LM Studio can't find the model

Import a GGUF file manually:

Download a GGUF file from the repo
In LM Studio: drag the file into the model folder
In Ollama: create a Modelfile (see LM Studio / Ollama section)

Intended Uses & Limitations

Intended use

Tool-calling agents — function calling, API orchestration, multi-turn tool use
Agent frameworks — drop-in replacement for agent runtimes behind an OpenAI-compatible API
Local / edge inference — runs on consumer GPUs (6 GB+ VRAM) due to 4-bit quantization
Systems research — studying harness behavior, evaluation loops, and model composition at a manageable scale (~2B params)

Out-of-scope

Multimodal tasks — despite Qwen3.5-2B's vision backbone, this fine-tune was text-only and unevaluated on image/video inputs
High-stakes decision making — research artifact; not intended for medical, legal, or financial advice without human oversight
Unsupported languages — trained exclusively on English data

Limitations

Trained for 1 epoch on ~27K examples. More data and more epochs would improve tool-calling reliability.
May produce malformed JSON tool calls in edge cases — validate output before execution.
4-bit quantization introduces minor rounding error in the merged weights.
This is a research-stage model, not a production product. We publish methods, configs, and artifacts that others can inspect, rerun, and improve — in keeping with our reproducible research principle.
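
Validation before execution can be as simple as checking a call against the tool's declared schema. The sketch below is a minimal, standard-library check. `validate_tool_call` is a hypothetical helper that covers only unknown tools, invalid JSON, required keys, unexpected keys, and basic JSON types; it is not a full JSON Schema validator.

```python
import json

# Map JSON Schema type names to Python types (a small subset)
JSON_TYPES = {"string": str, "integer": int, "number": (int, float),
              "boolean": bool, "object": dict, "array": list}

def validate_tool_call(name, arguments_json, tools):
    """Return a list of problems; an empty list means the call looks valid."""
    schemas = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    if name not in schemas:
        return [f"unknown tool: {name}"]
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    schema = schemas[name]
    properties = schema.get("properties", {})
    problems = [f"missing required argument: {key}"
                for key in schema.get("required", []) if key not in args]
    for key, value in args.items():
        if key not in properties:
            problems.append(f"unexpected argument: {key}")
            continue
        expected = JSON_TYPES.get(properties[key].get("type"))
        if expected and not isinstance(value, expected):
            problems.append(f"wrong type for argument: {key}")
    return problems

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
print(validate_tool_call("get_weather", '{"location": "Paris"}', tools))  # []
print(validate_tool_call("get_weather", '{"city": "Paris"}', tools))
```

A non-empty result can be fed back to the model as an error message so it can retry, rather than executing a call that is known to be malformed.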

Training Data

The training mix sampled 26,893 examples across three datasets — prioritizing signal density over raw scale:

Dataset	Type	Samples	Focus
NousResearch/hermes-function-calling-v1	Function calling	1,893	Single-turn and multi-turn tool use conversations
Salesforce/APIGen-MT-5k	API generation	5,000	Multi-turn API call generation across diverse APIs
mlabonne/FineTome-100k	Instruction following	20,000	General instruct/chat data (curated sample from 100K)

All data formatted via tokenizer.apply_chat_template() with the Qwen2.5-ChatML template. Examples without a user role were filtered. Sequence length capped at 2048 tokens for this training run.

Training Procedure

Framework

Training: HuggingFace TRL SFTTrainer (v1.7.0) with SFTConfig
PEFT: LoRA via peft (v0.18.0)
Quantization: bitsandbytes (v0.47.0) NF4 4-bit
Attention: PyTorch SDPA (attn_implementation="sdpa")
Loss: Standard causal language modeling (no packing, no assistant-only masking)

Pipeline

Model loading: 4-bit QLoRA via BitsAndBytesConfig
Gradient checkpointing: Enabled with use_reentrant=True
LoRA injection: LoraConfig applied to all linear projections
Dataset processing: ShareGPT → ChatML → filtered → concatenated → shuffled
Training: SFTTrainer with dataset_text_field="text", packing=False
Export: merge_and_unload() → save_pretrained(safe_serialization=True) → single model.safetensors
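
The dataset-processing step can be sketched as follows. This is a minimal illustration, not the lab's actual script: the `ROLE_MAP` values, the `seed`, and the `sharegpt_to_messages`, `has_user_turn`, and `build_training_mix` helpers are assumptions based on the common ShareGPT layout (`conversations` entries with `from` and `value` keys).

```python
import random

# Assumed mapping from ShareGPT speaker labels to chat roles
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}

def sharegpt_to_messages(example):
    """Convert one ShareGPT record into a list of role/content messages."""
    return [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in example["conversations"]
    ]

def has_user_turn(messages):
    """Filter predicate: keep only conversations with at least one user message."""
    return any(m["role"] == "user" for m in messages)

def build_training_mix(datasets, seed=42):
    """Convert, filter, concatenate, and shuffle several ShareGPT-style datasets."""
    mixed = []
    for records in datasets:
        converted = (sharegpt_to_messages(r) for r in records)
        mixed.extend(m for m in converted if has_user_turn(m))
    random.Random(seed).shuffle(mixed)
    return mixed

toy_a = [{"conversations": [{"from": "human", "value": "Hi"},
                            {"from": "gpt", "value": "Hello!"}]}]
toy_b = [{"conversations": [{"from": "system", "value": "No user turn here"}]}]
print(build_training_mix([toy_a, toy_b]))
```

Each surviving message list would then be rendered with `tokenizer.apply_chat_template(messages, tokenize=False)` to produce the `text` field the trainer reads.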

Infrastructure

Hardware: 1× NVIDIA A100-SXM4-40GB (provisioned via Google Colab CLI)
Training time: ~22 minutes (60 steps, single-dataset initial pass)
Full run estimate: ~4–5 hours on A100-40GB for all 27K examples
Provisioning: colab run --gpu A100 --keep — self-bootstrapping script with automatic dependency installation

Hyperparameters

QLoRA

Parameter	Value
`r`	32
`lora_alpha`	32
`lora_dropout`	0.0
`bias`	none
`task_type`	CAUSAL_LM
`target_modules`	`q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`

Quantization

Parameter	Value
`load_in_4bit`	True
`bnb_4bit_quant_type`	nf4
`bnb_4bit_use_double_quant`	True
`bnb_4bit_compute_dtype`	bfloat16

Training

Parameter	Value
Sequence length	2048
Effective batch size	32
Per-device batch	8 (A100) / 2 (T4)
Gradient accumulation	4 (A100) / 16 (T4)
Learning rate	2×10⁻⁴
LR scheduler	Linear
Warmup steps	100
Optimizer	AdamW 8-bit
Epochs	1
Weight decay	0.0
Gradient checkpointing	True
Precision	BF16 (fallback: FP16)

Training loss (1 epoch on the 1,893 function-calling examples)

Step	Training Loss
10	2.106
20	1.748
30	1.608
40	1.424
50	1.382
60	1.304

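Loss decreased at every logged step, from 2.106 at step 10 to 1.304 at step 60. It was still falling when the run ended, so these numbers show stable optimization on the function-calling data rather than a converged model.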
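
The step counts and the full-run estimate follow from the batch settings listed under Hyperparameters. The sketch below works through the arithmetic for the A100 configuration. The linear extrapolation of wall-clock time assumes the per-step cost stays the same on the larger mix, which is only an approximation.

```python
import math

per_device_batch = 8          # A100 setting
grad_accumulation = 4
effective_batch = per_device_batch * grad_accumulation   # 32

# Initial pass: 1,893 function-calling examples, 1 epoch
initial_steps = math.ceil(1_893 / effective_batch)       # 60

# Full mix: 26,893 examples, 1 epoch
full_steps = math.ceil(26_893 / effective_batch)         # 841

# Extrapolate from ~22 minutes for the 60-step run
minutes_per_step = 22 / initial_steps
full_run_hours = full_steps * minutes_per_step / 60

print(effective_batch, initial_steps, full_steps, round(full_run_hours, 1))
# 32 60 841 5.1
```

The result lands at the upper end of the ~4–5 hour estimate given under Infrastructure.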

Ethical Considerations

This model is a fine-tune of Qwen3.5-2B and inherits its base capabilities and biases:

Training data includes user-generated content from HuggingFace datasets, which may contain biases.
Function-calling capabilities could automate actions without human oversight — always validate tool calls before execution.
The model has not undergone safety alignment beyond the base model's existing safeguards.
At ~2B parameters, it has limited reasoning capacity compared to larger models — use appropriate guardrails in production.
This is a research-stage artifact from Uraion Labs. We are a systems research lab, not a product company. Use accordingly.

Changelog

Date	Change
2026-06-30	Initial release. GGUF + NF4 Transformers weights published.

Citations

Qwen3.5

@misc{qwen3.5,
  title = {Qwen3.5: A New Generation of Large Language Models},
  author = {Qwen Team},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/QwenLM/Qwen3.5}
}

TRL

@software{vonwerra2020trl,
  title = {{TRL: Transformers Reinforcement Learning}},
  author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license = {Apache-2.0},
  url = {https://github.com/huggingface/trl},
  year = {2020}
}

QLoRA

@article{dettmers2023qlora,
  title = {QLoRA: Efficient Finetuning of Quantized Language Models},
  author = {Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  journal = {arXiv preprint arXiv:2305.14314},
  year = {2023}
}

Hermes Function Calling

@misc{hermesfc,
  title = {NousResearch Hermes Function Calling},
  author = {Nous Research},
  year = {2024},
  url = {https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1}
}

APIGen

@misc{apigen2024,
  title = {APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets},
  author = {Salesforce AI Research},
  year = {2024},
  url = {https://huggingface.co/datasets/Salesforce/APIGen-MT-5k}
}

FineTome

@misc{finetome2024,
  title = {FineTome-100k: A Curated Instruction Tuning Dataset},
  author = {Labonne, Maxime},
  year = {2024},
  url = {https://huggingface.co/datasets/mlabonne/FineTome-100k}
}

Uraion Labs — Foundational systems research.
uraionlabs.com

Intelligence is a systems problem.
Licensed under Apache 2.0.

Downloads last month: 1,952

