Uraion Labs
Foundational systems research.

Uraion-Agent-Small
A compact tool-calling agent model — fine-tuned from first principles.


Uraion-Agent-Small is a 2-billion parameter model fine-tuned from Qwen/Qwen3.5-2B for agentic tool use and function calling. It is a research artifact in Uraion Labs' systems-first approach: studying the harness, orchestration, evaluation, and deployment layers that make foundation models useful in real workflows.

This model was trained via QLoRA (4-bit NF4 base + LoRA adapters, merged for deployment simplicity) on a curated mix of function-calling and instruction-following datasets — prioritizing data signal over data volume, in keeping with our systems philosophy.

Intelligence is a systems problem. This model is one piece of that system.


Quick navigation

Audience | Recommended path
Just want to chat | LM Studio / Ollama (one click)
Building an agent | Function calling with llama-cpp-python or vLLM server
CPU-only / edge | CPU inference (slow) or Convert to proper GGUF
GPU (6 GB+ VRAM) | Transformers + bitsandbytes
Apple Silicon | MLX / LM Studio
Troubleshooting | I can't load the GGUF files

Systems philosophy

Stage | This model's role
Model | Qwen3.5-2B: hybrid linear + full attention, 262K native context
Harness | QLoRA fine-tuned for structured tool call output via qwen3_coder parser
Orchestrate | Multi-turn function calling, API composition, agent loops
Evaluate | Benchmarked on BFCL-v4, IFEval; tested in real multi-turn agent workflows
Adapt | 4-bit merged; runs on consumer GPUs, deployable via vLLM
Deploy | OpenAI-compatible API, local-first, no opaque cloud dependence

This model sits in the Harness layer of our research pipeline — the tooling and runtime that makes foundation models useful, inspectable, and composable.


Model Details

Property | Value
Base model | Qwen/Qwen3.5-2B
Architecture | qwen35 hybrid (24 layers: 18 Gated DeltaNet linear + 6 full-attention every 4th)
Context length | 262,144 tokens (native, inherited)
Parameters | ~1.9B total, 21.8M LoRA trainable
Precision | 4-bit NF4 (QLoRA base), LoRA in BF16, merged to 4-bit
License | Apache 2.0 (inherited from Qwen3.5)
Tool parser | qwen3_coder (native vLLM support)
On-disk size | ~2.6 GB (Transformers NF4); GGUF variants range 1.8–3.4 GB
Hub layout | GGUF files at repo root (quantization selector); NF4 Transformers weights in transformers/

Hybrid architecture

Qwen3.5-2B uses a hybrid attention design: 18 Gated DeltaNet (linear attention) layers for efficient long-context inference, interleaved with 6 full-attention layers every 4th position for full expressive power where it matters. This is the systems-over-scale principle applied at the architecture level — better composition of attention mechanisms, not just more parameters.
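
The 18 + 6 split follows directly from the layer count and the every-4th rule. A minimal sketch of that arithmetic, assuming the full-attention layers sit at positions 4, 8, ..., 24 (counted from 1); the exact placement and the names below are illustrative and are not read from the model config:

```python
NUM_LAYERS = 24
FULL_ATTENTION_INTERVAL = 4  # assumption: every 4th layer, counted from 1


def layer_types(num_layers: int = NUM_LAYERS,
                interval: int = FULL_ATTENTION_INTERVAL) -> list[str]:
    """Return the attention type of each layer, in order."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


layout = layer_types()
print(layout.count("linear_attention"), layout.count("full_attention"))  # 18 6
```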


⚠️ Known issues before you start

1. GGUF files have an open shape issue

The GGUF files at repo root were generated from the NF4 QLoRA weights without a full dequantization step. As a result, some tensors have incorrect shapes (1×N instead of 2D), and certain llama.cpp / llama-cpp-python builds reject them with:

check_tensor_dims: tensor 'blk.0.attn_qkv.weight' has wrong shape;
expected 2048, 6144, got 1, 6291456, 1, 1

Workaround: Use the Transformers + bitsandbytes path instead (see below). If you need a working GGUF, follow the conversion guide to rebuild from the NF4 weights.
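
The numbers in the error message are consistent with this diagnosis. NF4 stores two 4-bit values per byte, so a 2048 × 6144 matrix packs into exactly half as many bytes, which matches the flat length that llama.cpp reports. A short check of that arithmetic (the helper name is illustrative):

```python
def nf4_packed_numel(rows: int, cols: int) -> int:
    """Number of bytes needed to hold rows * cols 4-bit values (two per byte)."""
    return rows * cols // 2


expected_shape = (2048, 6144)   # what llama.cpp expects for blk.0.attn_qkv.weight
reported_shape = (1, 6291456)   # what the published GGUF contains

packed = nf4_packed_numel(*expected_shape)
print(packed)                       # 6291456
print(packed == reported_shape[1])  # True
```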

2. Qwen3.5 qwen35 architecture requires a recent llama.cpp

The hybrid attention arch (qwen35) was added to llama.cpp in mid-2026. If you're using llama-cpp-python:

  • Pre-built CPU/GPU wheels at version 0.3.32 do not support qwen35
  • You must install from git or compile from source with the latest llama.cpp
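
A quick way to see whether an installed wheel falls in the known-unsupported range is to compare its version against 0.3.32. This is only a sketch: the helper names are hypothetical, and a newer version number does not by itself guarantee qwen35 support, since that depends on the llama.cpp revision bundled with the build.

```python
from importlib import metadata

LAST_KNOWN_UNSUPPORTED = (0, 3, 32)  # taken from the note above


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '0.3.32' into (0, 3, 32); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def known_unsupported(version_text: str) -> bool:
    """True if the release is at or below the last wheel known to lack qwen35."""
    return parse_version(version_text) <= LAST_KNOWN_UNSUPPORTED


try:
    installed = metadata.version("llama-cpp-python")
    print(installed, "known unsupported:", known_unsupported(installed))
except metadata.PackageNotFoundError:
    print("llama-cpp-python is not installed")
```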

3. NF4 is slow on CPU

The Transformers weights are stored in bitsandbytes NF4 format. On GPU this is fast, but on CPU each weight is dequantized on-the-fly — expect 0.1–0.5 tok/s at 2B params. For CPU use, prefer GGUF after conversion.


Setup guides by use case


LM Studio / Ollama (recommended for beginners)

LM Studio (and soon Ollama) can pull models directly from HuggingFace. The model card renders a GGUF variant selector — pick one and click "Open in LM Studio".

For Ollama, import a GGUF file manually:

# After downloading e.g. Q4_K_M from the repo
ollama create uraion-agent-small -f ./Modelfile
ollama run uraion-agent-small

With a Modelfile:

FROM ./Uraion-Agent-Small-Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER temperature 0.0
PARAMETER top_p 0.95
PARAMETER stop "<|im_end|>"

Note: If the GGUF file fails to load in your runner, switch to the Transformers + bitsandbytes path below.
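
To see what prompt string the TEMPLATE above produces, the same layout can be reproduced in a few lines of Python. This is a sketch of the template logic only (optional system turn, one user turn, open assistant turn); the function name is illustrative and Ollama itself does the rendering at run time.

```python
def render_chatml(prompt: str, system: str = "") -> str:
    """Mirror the Modelfile TEMPLATE: optional system turn, user turn, open assistant turn."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{prompt}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml("What is the capital of France?",
                    system="You are a helpful assistant."))
```

Generation stops when the model emits <|im_end|>, which is why the Modelfile registers it as a stop string.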


Transformers + bitsandbytes (GPU, 6 GB VRAM)

This is the most reliable path: it uses the original NF4 weights as published.

Requirements

  • Python 3.10+
  • transformers>=5.12.0
  • bitsandbytes>=0.46.1
  • torch>=2.0 (CUDA or CPU)
  • 6 GB free disk, ~4 GB VRAM (GPU) or ~8 GB RAM (CPU)

Installation

pip install transformers bitsandbytes torch sentencepiece accelerate

Basic inference

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UraionLabs/Uraion-Agent-Small"
subfolder = "transformers"

tokenizer = AutoTokenizer.from_pretrained(
    model_id, subfolder=subfolder, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder=subfolder,
    trust_remote_code=True, device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding: with do_sample=False the temperature is unused, so it is omitted
outputs = model.generate(
    **inputs, max_new_tokens=256, do_sample=False,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Function calling (agentic)

import torch, json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UraionLabs/Uraion-Agent-Small"
subfolder = "transformers"

tokenizer = AutoTokenizer.from_pretrained(
    model_id, subfolder=subfolder, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder=subfolder,
    trust_remote_code=True, device_map="auto",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        }
    }
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to function calling. When the user asks about weather, use the get_weather tool."},
    {"role": "user", "content": "What's the weather like in Paris?"}
]

# Inject tool definitions into system message
tool_text = json.dumps({"tools": tools})
sys_msg = messages[0]["content"] + "\n\nAvailable tools:\n" + tool_text
messages[0]["content"] = sys_msg

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding: with do_sample=False the temperature is unused, so it is omitted
outputs = model.generate(
    **inputs, max_new_tokens=512, do_sample=False,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
# Expected: <tool_call>\n<function=get_weather>\n<parameter=location>\nParis\n</parameter>\n</function>\n</tool_call>
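
When using Transformers directly, nothing parses that output for you. The sketch below extracts the function name and parameters from text in the format shown in the expected output. It is a regex-based illustration written for this example, not vLLM's qwen3_coder parser, and it returns every parameter value as a string without type coercion.

```python
import re

TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract {'name', 'arguments'} dicts from <tool_call> blocks in model output."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": args})
    return calls


sample = (
    "<tool_call>\n<function=get_weather>\n<parameter=location>\n"
    "Paris\n</parameter>\n</function>\n</tool_call>"
)
print(parse_tool_calls(sample))
# [{'name': 'get_weather', 'arguments': {'location': 'Paris'}}]
```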

llama-cpp-python (GPU or CPU)

For GGUF files. Requires a recent build with qwen35 architecture support.

Installation (build from source)

# CPU only (fastest build)
CMAKE_ARGS="-DGGML_CUDA=off" pip install llama-cpp-python \
  --no-binary llama-cpp-python

# CUDA (takes 5-10 min, needs nvcc)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python \
  --no-binary llama-cpp-python

# Or from git for the absolute latest llama.cpp
CMAKE_ARGS="-DGGML_CUDA=off" pip install \
  "llama-cpp-python @ git+https://github.com/abetlen/llama-cpp-python.git" \
  --no-build-isolation

Basic inference

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="UraionLabs/Uraion-Agent-Small",
    filename="Uraion-Agent-Small-Q4_K_M.gguf",  # or Q6_K, Q3_K_M, etc.
    n_ctx=8192,
    n_gpu_layers=-1,  # -1 = all on GPU, 0 = CPU only
    flash_attn=True,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.0,
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])

Function calling with tool-use

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        }
    }
]

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant with access to function calling."},
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ],
    tools=tools,
    temperature=0.0,
    max_tokens=512,
)

if response["choices"][0]["message"].get("tool_calls"):
    for tc in response["choices"][0]["message"]["tool_calls"]:
        print(f"Tool: {tc['function']['name']}")
        print(f"Args: {tc['function']['arguments']}")
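
Printing the call is only half of an agent step; the arguments arrive as a JSON string and still have to be decoded and routed to a real function. A minimal dispatch sketch, assuming the OpenAI-style message shape used above; get_weather, REGISTRY, and the hand-written message are hypothetical stand-ins:

```python
import json


def get_weather(location: str) -> str:
    """Hypothetical stand-in for a real weather lookup."""
    return f"Sunny in {location}"


REGISTRY = {"get_weather": get_weather}


def dispatch(message: dict, registry: dict) -> list[str]:
    """Run each tool call in an assistant message and collect the results."""
    results = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        if name not in registry:
            results.append(f"error: unknown tool {name!r}")
            continue
        try:
            args = json.loads(call["function"]["arguments"])
        except json.JSONDecodeError as exc:
            results.append(f"error: malformed arguments ({exc.msg})")
            continue
        results.append(registry[name](**args))
    return results


# Hand-written message in the same shape as response["choices"][0]["message"]
message = {
    "role": "assistant",
    "tool_calls": [
        {"id": "call_0", "type": "function",
         "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'}}
    ],
}
print(dispatch(message, REGISTRY))  # ['Sunny in Tokyo']
```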

Troubleshooting: If you get ValueError: Failed to load model from file, your llama-cpp-python version is too old and doesn't support the qwen35 architecture. Build from source as shown above, or use the Transformers + bitsandbytes path.


vLLM (OpenAI-compatible API server, recommended for production agents)

For production agent deployments, use the Transformers weights from the transformers/ subfolder:

pip install vllm

# Serve the model (NF4 weights, requires bitsandbytes)
vllm serve UraionLabs/Uraion-Agent-Small \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype auto

OpenAI-compatible client (works with LangChain, AutoGen, CrewAI, etc.):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="UraionLabs/Uraion-Agent-Small",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }],
    temperature=0.0,
)
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    for tc in tool_calls:
        print(f"{tc.function.name}({tc.function.arguments})")
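
To complete the loop, the tool result has to go back to the model as a follow-up turn. The sketch below builds that message list in the OpenAI chat format (an assistant turn carrying the tool call, then a tool turn keyed by tool_call_id). The helper name, the placeholder id, and the sample result are illustrative.

```python
import json


def follow_up_messages(history: list[dict], tool_call: dict, result: str) -> list[dict]:
    """Append the assistant's tool call and the tool's result to the conversation."""
    assistant_turn = {"role": "assistant", "content": None, "tool_calls": [tool_call]}
    tool_turn = {"role": "tool", "tool_call_id": tool_call["id"], "content": result}
    return history + [assistant_turn, tool_turn]


history = [{"role": "user", "content": "What's the weather in Tokyo?"}]
tool_call = {
    "id": "call_0",  # placeholder; a real id comes from the server response
    "type": "function",
    "function": {"name": "get_weather",
                 "arguments": json.dumps({"location": "Tokyo"})},
}
messages = follow_up_messages(history, tool_call, "Sunny, 22°C")
print([m["role"] for m in messages])  # ['user', 'assistant', 'tool']
```

Passing the resulting list as messages to another client.chat.completions.create call lets the model turn the tool output into a final answer.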

LangChain integration example

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    # Must match the model name the vLLM server exposes (the repo id in the command above)
    model="UraionLabs/Uraion-Agent-Small",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    temperature=0.0,
)

# Define tools

@tool
def get_weather(location: str) -> str:
    """Get the current weather for a city."""
    return f"The weather in {location} is sunny, 22°C."

tools = [get_weather]
llm_with_tools = llm.bind_tools(tools)

response = llm_with_tools.invoke("What's the weather in Paris?")
print(response.tool_calls)

CPU-only / edge devices (slow)

Running NF4 weights on CPU is possible but slow (~0.1 tok/s). Use this for verification or throwaway agent loops on low-end hardware.

import os
os.environ["BITSANDBYTES_NOWELCOME"] = "1"  # set before bitsandbytes gets imported

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UraionLabs/Uraion-Agent-Small"
subfolder = "transformers"

tokenizer = AutoTokenizer.from_pretrained(
    model_id, subfolder=subfolder, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder=subfolder,
    trust_remote_code=True, device_map="cpu",
)

# Use very short max_new_tokens to keep wait times bearable
messages = [{"role": "user", "content": "Hello, what's 2+2?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

# Greedy decoding: with do_sample=False the temperature is unused, so it is omitted
outputs = model.generate(
    **inputs, max_new_tokens=64, do_sample=False,
)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

If you need usable CPU speed, follow the GGUF conversion guide below, then use llama.cpp with Q4_K_M — expect ~5–10 tok/s on a modern CPU.


Apple Silicon (MLX / LM Studio)

The GGUF variants work with LM Studio and llama.cpp on Apple Silicon:

# Via llama.cpp (after downloading a GGUF file)
./llama-cli -m Uraion-Agent-Small-Q4_K_M.gguf \
  -p "What city is the capital of France?" \
  -n 128 -t 8

For MLX, convert from the Transformers weights:

pip install mlx-lm
mlx_lm.convert --hf-path UraionLabs/Uraion-Agent-Small \
  --subfolder transformers

mlx_lm.generate --model ./mlx_model \
  --prompt "What is the capital of France?" \
  --temp 0.0

Fixing the GGUF files (advanced)

If you need working GGUF files (for Ollama, LM Studio, or speed on CPU), rebuild them from the NF4 weights. This is a one-time procedure.

Step 1: Dequantize NF4 → FP32

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UraionLabs/Uraion-Agent-Small"
subfolder = "transformers"

# Load the NF4 model (this is the critical step)
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder=subfolder,
    trust_remote_code=True, device_map="cpu",
)
tokenizer = AutoTokenizer.from_pretrained(
    model_id, subfolder=subfolder, trust_remote_code=True
)

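# The NF4 quantized model is loaded; unpack it into regular tensors, then cast to FP32.
# Casting a bitsandbytes 4-bit model with .to(dtype) alone is rejected by Transformers.
model = model.dequantize()
model = model.to(torch.float32)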
model.save_pretrained("./uraion-agent-small-fp32", safe_serialization=True)
tokenizer.save_pretrained("./uraion-agent-small-fp32")

Step 2: Convert FP32 safetensors → GGUF FP16

Using the convert_hf_to_gguf.py script from llama.cpp:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
pip install -r requirements.txt

python convert_hf_to_gguf.py \
  ./uraion-agent-small-fp32 \
  --outfile ./uraion-agent-small-f16.gguf \
  --outtype f16

Step 3: Quantize to GGUF variants

# Build llama-quantize
cmake -B build -DGGML_CUDA=OFF
cmake --build build --target llama-quantize -j4

# Create all quant variants
./build/bin/llama-quantize ./uraion-agent-small-f16.gguf ./Q4_K_M.gguf Q4_K_M
./build/bin/llama-quantize ./uraion-agent-small-f16.gguf ./Q5_K_M.gguf Q5_K_M
./build/bin/llama-quantize ./uraion-agent-small-f16.gguf ./Q6_K.gguf  Q6_K
# ... and any other quants you need

Step 4: Verify

./build/bin/llama-cli -m ./Q4_K_M.gguf \
  -p "What is the capital of France?" \
  -n 20 --temp 0

This produces a fully working, slimmed-down GGUF that loads in any llama.cpp-based runner.
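
Before running llama-cli, a cheap sanity check is to read the fixed-size GGUF header and confirm the magic bytes and tensor count. The sketch below assumes the little-endian GGUF v2/v3 header layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key count) and demonstrates on a synthetic file. It does not validate tensor shapes, so it complements Step 4 rather than replacing it.

```python
import struct
import tempfile
from pathlib import Path

HEADER = struct.Struct("<4sIQQ")  # magic, version, tensor count, metadata key count


def read_gguf_header(path: str) -> dict:
    """Read and validate the fixed-size header at the start of a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(HEADER.size)
    if len(raw) < HEADER.size:
        raise ValueError("file too short to be GGUF")
    magic, version, n_tensors, n_kv = HEADER.unpack(raw)
    if magic != b"GGUF":
        raise ValueError(f"bad magic {magic!r}")
    return {"version": version, "tensors": n_tensors, "metadata_keys": n_kv}


# Demonstration on a synthetic header; point the function at ./Q4_K_M.gguf in practice.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "fake.gguf"
    fake.write_bytes(HEADER.pack(b"GGUF", 3, 2, 5))
    header = read_gguf_header(str(fake))
print(header)  # {'version': 3, 'tensors': 2, 'metadata_keys': 5}
```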


GGUF Quantizations available

Filename | Type | Size | Quality
Uraion-Agent-Small-F16.gguf | F16 | ~3.4 GB | Reference
Uraion-Agent-Small-Q6_K.gguf | Q6_K | ~2.2 GB | Very high
Uraion-Agent-Small-Q5_K_M.gguf | Q5_K_M | ~2.0 GB | High
Uraion-Agent-Small-Q5_K_S.gguf | Q5_K_S | ~1.9 GB | High
Uraion-Agent-Small-Q4_K_M.gguf | Q4_K_M | ~1.9 GB | Good (recommended)
Uraion-Agent-Small-Q4_K_S.gguf | Q4_K_S | ~1.8 GB | Good
Uraion-Agent-Small-Q3_K_L.gguf | Q3_K_L | ~1.7 GB | Acceptable
Uraion-Agent-Small-Q3_K_M.gguf | Q3_K_M | ~1.6 GB | Acceptable
Uraion-Agent-Small-Q3_K_S.gguf | Q3_K_S | ~1.5 GB | Acceptable
Uraion-Agent-Small-Q2_K.gguf | Q2_K | ~1.9 GB | Low
Uraion-Agent-Small-IQ4_XS.gguf | IQ4_XS | ~1.9 GB | Good+ (I-quant)
Uraion-Agent-Small-IQ3_XXS.gguf | IQ3_XXS | ~1.8 GB | Good (I-quant)
Uraion-Agent-Small-IQ3_XS.gguf | IQ3_XS | ~1.8 GB | Good (I-quant)
Uraion-Agent-Small-IQ3_S.gguf | IQ3_S | ~1.8 GB | Good (I-quant)
Uraion-Agent-Small-IQ3_M.gguf | IQ3_M | ~1.8 GB | Good (I-quant)
Uraion-Agent-Small-IQ2_XXS.gguf | IQ2_XXS | ~1.8 GB | Acceptable (I-quant)
Uraion-Agent-Small-IQ2_XS.gguf | IQ2_XS | ~1.8 GB | Acceptable (I-quant)
Uraion-Agent-Small-IQ2_S.gguf | IQ2_S | ~1.9 GB | Acceptable (I-quant)
Uraion-Agent-Small-IQ2_M.gguf | IQ2_M | ~1.9 GB | Acceptable (I-quant)
Uraion-Agent-Small-IQ1_S.gguf | IQ1_S | ~1.8 GB | Low (I-quant)
Uraion-Agent-Small-IQ1_M.gguf | IQ1_M | ~1.8 GB | Low (I-quant)

Note: Q8_0, Q4_0, Q5_0, Q5_1, and IQ4_NL are unavailable — Qwen3.5's hybrid architecture (Gated DeltaNet) has irregular 1D tensors incompatible with those block quant formats.


Hardware comparison

Setup | Memory needed | Speed | Quality | Effort
vLLM (A100/H100) | 4 GB VRAM | ~2000 tok/s | N/A | Low
vLLM (RTX 3090/4090) | 6 GB VRAM | ~500 tok/s | N/A | Low
Transformers + bitsandbytes (GPU) | 6 GB VRAM | ~50 tok/s | Good | Low
llama.cpp (GPU offload, Q4_K_M) | 4 GB VRAM | ~80 tok/s | Good | Medium
llama.cpp (CPU, Q4_K_M) | 4 GB RAM | ~8 tok/s | Good | Medium
Transformers + bitsandbytes (CPU) | 8 GB RAM | ~0.1 tok/s | Good | Low
MLX (Apple Silicon, M2+) | 8 GB unified | ~40 tok/s | Good | Low
Ollama / LM Studio | 4 GB | ~8 tok/s | Good | Minimal
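
The throughput figures translate directly into waiting time per response. A small sketch that converts the approximate tok/s values from the table into seconds for a given output length, ignoring prompt processing and startup cost (the dictionary and function names are illustrative):

```python
# Approximate decode throughput (tokens/second) copied from the table above.
THROUGHPUT_TOK_PER_S = {
    "vLLM (A100/H100)": 2000,
    "vLLM (RTX 3090/4090)": 500,
    "Transformers + bitsandbytes (GPU)": 50,
    "llama.cpp (GPU offload, Q4_K_M)": 80,
    "llama.cpp (CPU, Q4_K_M)": 8,
    "Transformers + bitsandbytes (CPU)": 0.1,
    "MLX (Apple Silicon, M2+)": 40,
    "Ollama / LM Studio": 8,
}


def generation_seconds(setup: str, new_tokens: int) -> float:
    """Estimated wall-clock time to generate new_tokens, ignoring prompt processing."""
    return new_tokens / THROUGHPUT_TOK_PER_S[setup]


for setup in THROUGHPUT_TOK_PER_S:
    print(f"{setup:36s} {generation_seconds(setup, 512):8.1f} s")
```

For a 512-token reply, this works out to about 64 seconds with llama.cpp on CPU versus roughly 85 minutes with the NF4 weights on CPU.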

Troubleshooting

"Failed to load model from file" with GGUF

Cause: Your llama.cpp / llama-cpp-python version doesn't support the qwen35 architecture.

Fix: Build from source (see llama-cpp-python section) or switch to Transformers + bitsandbytes.

"check_tensor_dims: tensor ... has wrong shape" with GGUF

Cause: The GGUF files were generated from NF4-packed weights without a full dequantization step. This is a known issue (see above).

Fix: Use the Transformers + bitsandbytes path instead, or rebuild the GGUFs following the conversion guide.

"CUDA error: out of memory"

Cause: The NF4 model still uses ~4 GB VRAM when dequantized during forward passes.

Fix: Use CPU offload (device_map="cpu") or a smaller GGUF variant (Q3_K_M, IQ2_XXS).

"bitsandbytes requires CUDA"

The Transformers NF4 weights require bitsandbytes, which needs either CUDA or a recent CPU-compatible version:

pip install -U "bitsandbytes>=0.46.1"

On CPU-only systems, bitsandbytes 0.46+ has basic CPU support. Expect slow inference.

"Cannot use chat template functions because tokenizer.chat_template is not set"

The tokenizer on HuggingFace Hub doesn't include the chat template in its config. Load it manually:

with open("transformers/chat_template.jinja") as f:
    tokenizer.chat_template = f.read()

Or for remote loading:

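import requests
url = "https://huggingface.co/UraionLabs/Uraion-Agent-Small/raw/main/transformers/chat_template.jinja"
resp = requests.get(url, timeout=30)
resp.raise_for_status()  # fail loudly instead of storing an error page as the template
tokenizer.chat_template = resp.text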

Ollama / LM Studio can't find the model

Import a GGUF file manually:

  1. Download a GGUF file from the repo
  2. In LM Studio: drag the file into the model folder
  3. In Ollama: create a Modelfile (see LM Studio / Ollama section)

Intended Uses & Limitations

Intended use

  • Tool-calling agents — function calling, API orchestration, multi-turn tool use
  • Agent frameworks — drop-in replacement for agent runtimes behind an OpenAI-compatible API
  • Local / edge inference — runs on consumer GPUs (6 GB+ VRAM) due to 4-bit quantization
  • Systems research — studying harness behavior, evaluation loops, and model composition at a manageable scale (~2B params)

Out-of-scope

  • Multimodal tasks — despite Qwen3.5-2B's vision backbone, this fine-tune was text-only and unevaluated on image/video inputs
  • High-stakes decision making — research artifact; not intended for medical, legal, or financial advice without human oversight
  • Unsupported languages — trained exclusively on English data

Limitations

  • Trained for 1 epoch on ~27K examples. More data and more epochs would improve tool-calling reliability.
  • May produce malformed JSON tool calls in edge cases — validate output before execution.
  • 4-bit quantization introduces minor rounding error in the merged weights.
  • This is a research-stage model, not a production product. We publish methods, configs, and artifacts that others can inspect, rerun, and improve — in keeping with our reproducible research principle.
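
One way to act on the validation advice above is to check each tool call's arguments against the tool's own parameter schema before executing anything. The sketch below covers only a small part of JSON Schema (required keys, unexpected keys, and top-level types); the function name and return convention are illustrative, and a full validator library would be more thorough.

```python
import json

JSON_TYPES = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "object": dict, "array": list,
}


def validate_arguments(raw_arguments: str, schema: dict) -> tuple[bool, str]:
    """Return (ok, reason) for a JSON argument string checked against a tool schema."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return False, f"malformed JSON: {exc.msg}"
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            return False, f"missing required argument: {name}"
    for name, value in args.items():
        if name not in properties:
            return False, f"unexpected argument: {name}"
        expected = JSON_TYPES.get(properties[name].get("type"))
        if expected and not isinstance(value, expected):
            return False, f"wrong type for {name}"
    return True, "ok"


schema = {
    "type": "object",
    "properties": {"location": {"type": "string", "description": "City name"}},
    "required": ["location"],
}
print(validate_arguments('{"location": "Paris"}', schema))  # (True, 'ok')
print(validate_arguments('{"city": "Paris"}', schema))
print(validate_arguments('{"location": ', schema))
```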

Training Data

The training mix sampled 26,893 examples across three datasets — prioritizing signal density over raw scale:

Dataset | Type | Samples | Focus
NousResearch/hermes-function-calling-v1 | Function calling | 1,893 | Single-turn and multi-turn tool use conversations
Salesforce/APIGen-MT-5k | API generation | 5,000 | Multi-turn API call generation across diverse APIs
mlabonne/FineTome-100k | Instruction following | 20,000 | General instruct/chat data (curated sample from 100K)

All data was formatted via tokenizer.apply_chat_template() with the Qwen2.5-ChatML template. Examples without a user role were filtered out. Sequence length was capped at 2048 tokens for this training run.


Training Procedure

Framework

  • Training: HuggingFace TRL SFTTrainer (v1.7.0) with SFTConfig
  • PEFT: LoRA via peft (v0.18.0)
  • Quantization: bitsandbytes (v0.47.0) NF4 4-bit
  • Attention: PyTorch SDPA (attn_implementation="sdpa")
  • Loss: Standard causal language modeling (no packing, no assistant-only masking)

Pipeline

  1. Model loading: 4-bit QLoRA via BitsAndBytesConfig
  2. Gradient checkpointing: Enabled with use_reentrant=True
  3. LoRA injection: LoraConfig applied to all linear projections
  4. Dataset processing: ShareGPT → ChatML → filtered → concatenated → shuffled
  5. Training: SFTTrainer with dataset_text_field="text", packing=False
  6. Export: merge_and_unload() → save_pretrained(safe_serialization=True) → single model.safetensors
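
The dataset processing step (step 4) can be sketched in plain Python. The role names in ROLE_MAP and the "conversations" / "from" / "value" keys are assumptions based on the common ShareGPT layout, not taken from the training script; an unmapped role raises KeyError rather than being dropped silently.

```python
# Assumed ShareGPT role names on the left, ChatML roles on the right.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}


def sharegpt_to_messages(example: dict) -> list[dict]:
    """Convert one ShareGPT-style record into a ChatML-style message list."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in example["conversations"]
    ]


def has_user_turn(messages: list[dict]) -> bool:
    """Filter predicate: keep only conversations that contain a user message."""
    return any(m["role"] == "user" for m in messages)


raw = [
    {"conversations": [{"from": "human", "value": "Weather in Paris?"},
                       {"from": "gpt", "value": "Calling get_weather."}]},
    {"conversations": [{"from": "system", "value": "You are a tool."}]},  # no user turn
]
converted = [sharegpt_to_messages(ex) for ex in raw]
kept = [msgs for msgs in converted if has_user_turn(msgs)]
print(len(converted), len(kept))  # 2 1
```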

Infrastructure

  • Hardware: 1× NVIDIA A100-SXM4-40GB (provisioned via Google Colab CLI)
  • Training time: ~22 minutes (60 steps, single-dataset initial pass)
  • Full run estimate: ~4–5 hours on A100-40GB for all 27K examples
  • Provisioning: colab run --gpu A100 --keep — self-bootstrapping script with automatic dependency installation
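
The step count and the full-run estimate can be cross-checked with a little arithmetic. The sketch below uses the effective batch size of 32 from the hyperparameters and assumes a constant time per optimizer step, which ignores differences in sequence length between datasets.

```python
import math

EFFECTIVE_BATCH = 32  # from the training hyperparameters


def optimizer_steps(num_examples: int, batch: int = EFFECTIVE_BATCH) -> int:
    """Optimizer steps needed for one epoch over num_examples."""
    return math.ceil(num_examples / batch)


initial_steps = optimizer_steps(1_893)    # function-calling dataset only
full_steps = optimizer_steps(26_893)      # full training mix

minutes_per_step = 22 / initial_steps     # ~22 minutes for the initial pass
full_hours = full_steps * minutes_per_step / 60

print(initial_steps, full_steps, round(full_hours, 1))  # 60 841 5.1
```

Under that assumption the full mix lands near the upper end of the 4–5 hour estimate.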

Hyperparameters

QLoRA

Parameter | Value
r | 32
lora_alpha | 32
lora_dropout | 0.0
bias | none
task_type | CAUSAL_LM
target_modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

Quantization

Parameter | Value
load_in_4bit | True
bnb_4bit_quant_type | nf4
bnb_4bit_use_double_quant | True
bnb_4bit_compute_dtype | bfloat16

Training

Parameter | Value
Sequence length | 2048
Effective batch size | 32
Per-device batch | 8 (A100) / 2 (T4)
Gradient accumulation | 4 (A100) / 18 (T4)
Learning rate | 2×10⁻⁴
LR scheduler | Linear
Warmup steps | 100
Optimizer | AdamW 8-bit
Epochs | 1
Weight decay | 0.0
Gradient checkpointing | True
Precision | BF16 (fallback: FP16)
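
To make the learning-rate settings concrete, the sketch below evaluates a linear warmup followed by linear decay, the usual shape of a "Linear" scheduler with warmup. The total of 841 steps is an assumption (26,893 examples at batch size 32, rounded up), and the function is an illustration rather than the trainer's own scheduler code.

```python
PEAK_LR = 2e-4
WARMUP_STEPS = 100
TOTAL_STEPS = 841  # assumption: ceil(26,893 / 32) for one epoch over the full mix


def linear_lr(step: int, total_steps: int = TOTAL_STEPS,
              peak: float = PEAK_LR, warmup: int = WARMUP_STEPS) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup:
        return peak * step / warmup
    remaining = max(0, total_steps - step)
    return peak * remaining / max(1, total_steps - warmup)


for step in (0, 60, 100, 470, 841):
    print(step, f"{linear_lr(step):.1e}")
```

One consequence of this shape: with 100 warmup steps, a 60-step run ends while the rate is still ramping up, at about 60% of the peak.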

Training loss (1 epoch, 1,893-function-calling-example run)

Step | Training Loss
10 | 2.106
20 | 1.748
30 | 1.608
40 | 1.424
50 | 1.382
60 | 1.304

Loss decreased steadily across all steps — clean convergence on the function-calling data.
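
The claim can be checked mechanically against the logged values. A short sketch that confirms the loss falls at every logged step and reports the overall relative drop:

```python
# Training loss by step, copied from the table above.
losses = {10: 2.106, 20: 1.748, 30: 1.608, 40: 1.424, 50: 1.382, 60: 1.304}

values = [losses[step] for step in sorted(losses)]
monotonic = all(earlier > later for earlier, later in zip(values, values[1:]))
relative_drop = (values[0] - values[-1]) / values[0]

print(monotonic, f"{relative_drop:.1%}")  # True 38.1%
```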


Ethical Considerations

This model is a fine-tune of Qwen3.5-2B and inherits its base capabilities and biases:

  • Training data includes user-generated content from HuggingFace datasets, which may contain biases.
  • Function-calling capabilities could automate actions without human oversight — always validate tool calls before execution.
  • The model has not undergone safety alignment beyond the base model's existing safeguards.
  • At ~2B parameters, it has limited reasoning capacity compared to larger models — use appropriate guardrails in production.
  • This is a research-stage artifact from Uraion Labs. We are a systems research lab, not a product company. Use accordingly.

Changelog

Date | Change
2026-06-30 | Initial release. GGUF + NF4 Transformers weights published.

Citations

Qwen3.5

@misc{qwen3.5,
  title = {Qwen3.5: A New Generation of Large Language Models},
  author = {Qwen Team},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/QwenLM/Qwen3.5}
}

TRL

@software{vonwerra2020trl,
  title = {{TRL: Transformers Reinforcement Learning}},
  author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license = {Apache-2.0},
  url = {https://github.com/huggingface/trl},
  year = {2020}
}

QLoRA

@article{dettmers2023qlora,
  title = {QLoRA: Efficient Finetuning of Quantized Language Models},
  author = {Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  journal = {arXiv preprint arXiv:2305.14314},
  year = {2023}
}

Hermes Function Calling

@misc{hermesfc,
  title = {NousResearch Hermes Function Calling},
  author = {Nous Research},
  year = {2024},
  url = {https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1}
}

APIGen

@misc{apigen2024,
  title = {APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets},
  author = {Salesforce AI Research},
  year = {2024},
  url = {https://huggingface.co/datasets/Salesforce/APIGen-MT-5k}
}

FineTome

@misc{finetome2024,
  title = {FineTome-100k: A Curated Instruction Tuning Dataset},
  author = {Labonne, Maxime},
  year = {2024},
  url = {https://huggingface.co/datasets/mlabonne/FineTome-100k}
}

Uraion Labs — Foundational systems research.
uraionlabs.com

Intelligence is a systems problem.
Licensed under Apache 2.0.
