- Quick navigation
- Systems philosophy
- Model Details
- ⚠️ Known issues before you start
- Setup guides by use case
- Fixing the GGUF files (advanced)
- GGUF Quantizations available
- Hardware comparison
- Troubleshooting
- Intended Uses & Limitations
- Training Data
- Training Procedure
- Hyperparameters
- Ethical Considerations
- Changelog
- Citations
Uraion Labs
Foundational systems research.
Uraion-Agent-Small
A compact tool-calling agent model — fine-tuned from first principles.
Uraion-Agent-Small is a 2-billion parameter model fine-tuned from Qwen/Qwen3.5-2B for agentic tool use and function calling. It is a research artifact in Uraion Labs' systems-first approach: studying the harness, orchestration, evaluation, and deployment layers that make foundation models useful in real workflows.
This model was trained via QLoRA (4-bit NF4 base + LoRA adapters, merged for deployment simplicity) on a curated mix of function-calling and instruction-following datasets — prioritizing data signal over data volume, in keeping with our systems philosophy.
Intelligence is a systems problem. This model is one piece of that system.
Quick navigation
| Audience | Recommended path |
|---|---|
| Just want to chat | LM Studio / Ollama — one click |
| Building an agent | Function calling with llama-cpp-python or vLLM server |
| CPU-only / edge | CPU inference (slow) or Convert to proper GGUF |
| GPU (6 GB+ VRAM) | Transformers + bitsandbytes |
| Apple Silicon | MLX / LM Studio |
| Troubleshooting | I can't load the GGUF files |
Systems philosophy
| Stage | This model's role |
|---|---|
| Model | Qwen3.5-2B — hybrid linear + full attention, 262K native context |
| Harness | QLoRA fine-tuned for structured tool call output via qwen3_coder parser |
| Orchestrate | Multi-turn function calling, API composition, agent loops |
| Evaluate | Benchmarked on BFCL-v4, IFEval; tested in real multi-turn agent workflows |
| Adapt | 4-bit merged — runs on consumer GPUs, deployable via vLLM |
| Deploy | OpenAI-compatible API, local-first, no opaque cloud dependence |
This model sits in the Harness layer of our research pipeline — the tooling and runtime that makes foundation models useful, inspectable, and composable.
Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-2B |
| Architecture | qwen35 hybrid (24 layers: 18 Gated DeltaNet linear + 6 full-attention every 4th) |
| Context length | 262,144 tokens (native, inherited) |
| Parameters | ~1.9B total, 21.8M LoRA trainable |
| Precision | 4-bit NF4 (QLoRA base), LoRA in BF16, merged to 4-bit |
| License | Apache 2.0 (inherited from Qwen3.5) |
| Tool parser | qwen3_coder (native vLLM support) |
| On-disk size | ~2.6 GB (Transformers NF4); GGUF variants range 1.5–3.4 GB |
| Hub layout | GGUF files at repo root (quantization selector); NF4 Transformers weights in transformers/ |
Hybrid architecture
Qwen3.5-2B uses a hybrid attention design: 18 Gated DeltaNet (linear attention) layers for efficient long-context inference, interleaved with 6 full-attention layers every 4th position for full expressive power where it matters. This is the systems-over-scale principle applied at the architecture level — better composition of attention mechanisms, not just more parameters.
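The 18 + 6 split follows from placing one full-attention layer at every fourth position in a 24-layer stack. A minimal sketch of that layout, assuming the full-attention layers sit at positions 4, 8, ..., 24 (the exact indexing is not stated here, and `layer_types` is a hypothetical helper, not part of any library):

```python
from collections import Counter


def layer_types(num_layers: int = 24, full_attention_interval: int = 4) -> list:
    """Return the attention type of each layer in a hybrid stack.

    Assumption: layer i (1-based) uses full attention when i is a multiple
    of `full_attention_interval`; every other layer uses linear attention.
    """
    return [
        "full_attention" if i % full_attention_interval == 0 else "linear_attention"
        for i in range(1, num_layers + 1)
    ]


layout = layer_types()
counts = Counter(layout)
print(counts["linear_attention"], counts["full_attention"])  # 18 6
```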
⚠️ Known issues before you start
1. GGUF files have an open shape issue
The GGUF files at repo root were generated from the NF4 QLoRA weights without a full dequantization step. As a result, some tensors have incorrect shapes (1×N instead of 2D), and certain llama.cpp / llama-cpp-python builds reject them with:
check_tensor_dims: tensor 'blk.0.attn_qkv.weight' has wrong shape;
expected 2048, 6144, got 1, 6291456, 1, 1
Workaround: Use the Transformers + bitsandbytes path instead (see below). If you need a working GGUF, follow the conversion guide to rebuild from the NF4 weights.
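The numbers in that error message are consistent with 4-bit packing: NF4 stores two weights per byte, so a packed tensor holds half as many bytes as the original has elements. A quick check of the arithmetic, using only the shapes from the error above (reading the stored tensor as packed NF4 bytes is an inference, not something the loader reports):

```python
expected_shape = (2048, 6144)      # shape llama.cpp expects for blk.0.attn_qkv.weight
found_shape = (1, 6291456, 1, 1)   # shape actually stored in the GGUF file

expected_elements = expected_shape[0] * expected_shape[1]

found_elements = 1
for dim in found_shape:
    found_elements *= dim

# Two 4-bit values fit in one byte, so a packed buffer has elements / 2 entries.
print(expected_elements)                    # 12582912
print(found_elements)                       # 6291456
print(expected_elements // found_elements)  # 2
```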
2. Qwen3.5 qwen35 architecture requires a recent llama.cpp
The hybrid attention arch (qwen35) was added to llama.cpp in mid-2026. If you're using llama-cpp-python:
- Pre-built CPU/GPU wheels at version 0.3.32 do not support qwen35
- You must install from git or compile from source with the latest llama.cpp
3. NF4 is slow on CPU
The Transformers weights are stored in bitsandbytes NF4 format. On GPU this is fast, but on CPU each weight is dequantized on-the-fly — expect 0.1–0.5 tok/s at 2B params. For CPU use, prefer GGUF after conversion.
Setup guides by use case
LM Studio / Ollama (recommended for beginners)
LM Studio (and soon Ollama) can pull models directly from HuggingFace. The model card renders a GGUF variant selector — pick one and click "Open in LM Studio".
For Ollama, import a GGUF file manually:
# After downloading e.g. Q4_K_M from the repo
ollama create uraion-agent-small -f ./Modelfile
ollama run uraion-agent-small
With a Modelfile:
FROM ./Uraion-Agent-Small-Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER temperature 0.0
PARAMETER top_p 0.95
PARAMETER stop "<|im_end|>"
Note: If the GGUF file fails to load in your runner, switch to the Transformers + bitsandbytes path below.
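The TEMPLATE above is a single-turn ChatML layout. A minimal Python sketch of the string it produces can help when debugging prompt formatting outside Ollama. `render_chatml` is a hypothetical helper written for illustration, not an Ollama or Transformers API:

```python
from typing import Optional


def render_chatml(prompt: str, system: Optional[str] = None) -> str:
    """Build a single-turn ChatML prompt mirroring the Modelfile TEMPLATE."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{prompt}<|im_end|>\n")
    # Leave the assistant turn open so the model writes the reply and
    # generation ends at the "<|im_end|>" stop sequence.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


text = render_chatml(
    "What is the capital of France?",
    system="You are a helpful assistant.",
)
print(text)
```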
Transformers + bitsandbytes (GPU, 6 GB VRAM)
This is the most reliable path — uses the original NF4 weights as published.
Requirements
- Python 3.10+
- transformers>=5.12.0
- bitsandbytes>=0.46.1
- torch>=2.0 (CUDA or CPU)
- 6 GB free disk, ~4 GB VRAM (GPU) or ~8 GB RAM (CPU)
Installation
pip install transformers bitsandbytes torch sentencepiece accelerate
Basic inference
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "UraionLabs/Uraion-Agent-Small"
subfolder = "transformers"
tokenizer = AutoTokenizer.from_pretrained(
model_id, subfolder=subfolder, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
model_id, subfolder=subfolder,
trust_remote_code=True, device_map="auto",
)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs, max_new_tokens=256, temperature=0.0, do_sample=False,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
Function calling (agentic)
import torch, json
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "UraionLabs/Uraion-Agent-Small"
subfolder = "transformers"
tokenizer = AutoTokenizer.from_pretrained(
model_id, subfolder=subfolder, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
model_id, subfolder=subfolder,
trust_remote_code=True, device_map="auto",
)
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a city",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}
]
messages = [
{"role": "system", "content": "You are a helpful assistant with access to function calling. When the user asks about weather, use the get_weather tool."},
{"role": "user", "content": "What's the weather like in Paris?"}
]
# Inject tool definitions into system message
tool_text = json.dumps({"tools": tools})
sys_msg = messages[0]["content"] + "\n\nAvailable tools:\n" + tool_text
messages[0]["content"] = sys_msg
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs, max_new_tokens=512, temperature=0.0, do_sample=False,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
# Expected: <tool_call>\n<function=get_weather>\n<parameter=location>\nParis\n</parameter>\n</function>\n</tool_call>
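When the model is run through Transformers directly, the tool call comes back as raw text in the format shown in the expected output above, so the caller has to extract the function name and parameters. A minimal regex-based sketch, assuming the output follows exactly that tag layout (`parse_tool_calls` is a hypothetical helper, and every parameter value is returned as a string):

```python
import re

TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\s*(.*?)\s*</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list:
    """Extract every <tool_call> block as {"name": ..., "arguments": {...}}."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = {key: value for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


sample = (
    "<tool_call>\n<function=get_weather>\n<parameter=location>\n"
    "Paris\n</parameter>\n</function>\n</tool_call>"
)
print(parse_tool_calls(sample))
# [{'name': 'get_weather', 'arguments': {'location': 'Paris'}}]
```

Text without a tool call yields an empty list, which is a convenient signal to treat the response as a plain answer.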
llama-cpp-python (GPU or CPU)
For GGUF files. Requires a recent build with qwen35 architecture support.
Installation (build from source)
# CPU only (fastest build)
CMAKE_ARGS="-DGGML_CUDA=off" pip install llama-cpp-python \
--no-binary llama-cpp-python
# CUDA (takes 5-10 min, needs nvcc)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python \
--no-binary llama-cpp-python
# Or from git for the absolute latest llama.cpp
CMAKE_ARGS="-DGGML_CUDA=off" pip install \
"llama-cpp-python @ git+https://github.com/abetlen/llama-cpp-python.git" \
--no-build-isolation
Basic inference
from llama_cpp import Llama
llm = Llama.from_pretrained(
repo_id="UraionLabs/Uraion-Agent-Small",
filename="Uraion-Agent-Small-Q4_K_M.gguf", # or Q6_K, Q3_K_M, etc.
n_ctx=8192,
n_gpu_layers=-1, # -1 = all on GPU, 0 = CPU only
flash_attn=True,
)
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
temperature=0.0,
max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
Function calling with tool-use
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a city",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}
]
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant with access to function calling."},
{"role": "user", "content": "What's the weather in Tokyo?"}
],
tools=tools,
temperature=0.0,
max_tokens=512,
)
if response["choices"][0]["message"].get("tool_calls"):
    for tc in response["choices"][0]["message"]["tool_calls"]:
        print(f"Tool: {tc['function']['name']}")
        print(f"Args: {tc['function']['arguments']}")
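Printing the tool call is only the first half of an agent step; the call still has to be routed to real code. A minimal dispatch sketch for the OpenAI-style `tool_calls` dicts shown above, where `TOOL_REGISTRY`, `run_tool_call`, and the placeholder `get_weather` body are illustrative assumptions rather than part of llama-cpp-python:

```python
import json


def get_weather(location: str) -> str:
    """Stand-in tool; a real implementation would query a weather service."""
    return f"Weather for {location}: (placeholder)"


# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call: dict) -> str:
    """Execute one tool call dict and return its result (or an error) as text."""
    name = tool_call["function"]["name"]
    if name not in TOOL_REGISTRY:
        return f"error: unknown tool {name!r}"
    try:
        # "arguments" arrives as a JSON-encoded string, not as a dict.
        arguments = json.loads(tool_call["function"]["arguments"])
    except json.JSONDecodeError as exc:
        return f"error: malformed arguments ({exc.msg})"
    if not isinstance(arguments, dict):
        return "error: arguments must be a JSON object"
    try:
        return TOOL_REGISTRY[name](**arguments)
    except TypeError as exc:
        return f"error: bad arguments ({exc})"


example = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'},
}
print(run_tool_call(example))
```

Returning errors as strings instead of raising keeps the loop alive, so the failure can be fed back to the model as a tool result.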
Troubleshooting: If you get ValueError: Failed to load model from file, your llama-cpp-python version is too old and doesn't support the qwen35 architecture. Build from source as shown above, or use the Transformers + bitsandbytes path.
vLLM (OpenAI-compatible API server, recommended for production agents)
For production agent deployments, use the Transformers weights from the transformers/ subfolder:
pip install vllm
# Serve the model (NF4 weights, requires bitsandbytes)
vllm serve UraionLabs/Uraion-Agent-Small \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--host 0.0.0.0 \
--port 8000 \
--dtype auto
OpenAI-compatible client (works with LangChain, AutoGen, CrewAI, etc.):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
model="UraionLabs/Uraion-Agent-Small",
messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
tools=[{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}
}],
temperature=0.0,
)
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    for tc in tool_calls:
        print(f"{tc.function.name}({tc.function.arguments})")
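In the OpenAI chat format, a multi-turn tool exchange continues by sending the assistant's tool calls back together with one `tool` role message per result, so the model can write its final answer. A sketch of building that follow-up message list with plain dicts; the helper name and the sample result string are illustrative assumptions:

```python
def extend_with_tool_results(messages: list, tool_calls: list, results: list) -> list:
    """Return a new message list with the tool calls and their results appended."""
    if len(tool_calls) != len(results):
        raise ValueError("need exactly one result per tool call")
    follow_up = list(messages)  # copy so the caller's history is left untouched
    follow_up.append({"role": "assistant", "content": None, "tool_calls": tool_calls})
    for call, result in zip(tool_calls, results):
        # tool_call_id ties each result to the call that requested it.
        follow_up.append({"role": "tool", "tool_call_id": call["id"], "content": result})
    return follow_up


history = [{"role": "user", "content": "What's the weather in Tokyo?"}]
calls = [{
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'},
}]
next_messages = extend_with_tool_results(history, calls, ["sunny, 22 C"])
for message in next_messages:
    print(message["role"])
```

The resulting list can be passed as `messages` in a second `chat.completions.create` request against the same server.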
LangChain integration example
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="UraionLabs/Uraion-Agent-Small",  # must match the model name served by vLLM above
base_url="http://localhost:8000/v1",
api_key="not-needed",
temperature=0.0,
)
# Define tools
from langchain_core.tools import tool
@tool
def get_weather(location: str) -> str:
    """Get the current weather for a city."""
    return f"The weather in {location} is sunny, 22°C."
tools = [get_weather]
llm_with_tools = llm.bind_tools(tools)
response = llm_with_tools.invoke("What's the weather in Paris?")
print(response.tool_calls)
CPU-only / edge devices (slow)
Running NF4 weights on CPU is possible but slow (~0.1 tok/s). Use this for verification or throwaway agent loops on low-end hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
os.environ["BITSANDBYTES_NOWELCOME"] = "1"
model_id = "UraionLabs/Uraion-Agent-Small"
subfolder = "transformers"
tokenizer = AutoTokenizer.from_pretrained(
model_id, subfolder=subfolder, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
model_id, subfolder=subfolder,
trust_remote_code=True, device_map="cpu",
)
# Use very short max_new_tokens to keep wait times bearable
messages = [{"role": "user", "content": "Hello, what's 2+2?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(
**inputs, max_new_tokens=64, temperature=0.0, do_sample=False,
)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
If you need usable CPU speed, follow the GGUF conversion guide below, then use llama.cpp with Q4_K_M — expect ~5–10 tok/s on a modern CPU.
Apple Silicon (MLX / LM Studio)
The GGUF variants work with LM Studio and llama.cpp on Apple Silicon:
# Via llama.cpp (after downloading a GGUF file)
./llama-cli -m Uraion-Agent-Small-Q4_K_M.gguf \
-p "What city is the capital of France?" \
-n 128 -t 8
For MLX, convert from the Transformers weights:
pip install mlx-lm
mlx_lm.convert --hf-path UraionLabs/Uraion-Agent-Small \
--subfolder transformers
mlx_lm.generate --model ./mlx_model \
--prompt "What is the capital of France?" \
--temp 0.0
Fixing the GGUF files (advanced)
If you need working GGUF files (for Ollama, LM Studio, or speed on CPU), rebuild them from the NF4 weights. This is a one-time procedure.
Step 1: Dequantize NF4 → FP32
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "UraionLabs/Uraion-Agent-Small"
subfolder = "transformers"
# Load the NF4 model (this is the critical step)
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder=subfolder,
    trust_remote_code=True, device_map="cpu",
)
tokenizer = AutoTokenizer.from_pretrained(
    model_id, subfolder=subfolder, trust_remote_code=True
)
# Unpack the packed 4-bit tensors into regular 2D weights first;
# a direct .to() cast is not supported on a bitsandbytes 4-bit model
model = model.dequantize()
# Now cast to FP32 and save
model = model.to(torch.float32)
model.save_pretrained("./uraion-agent-small-fp32", safe_serialization=True)
tokenizer.save_pretrained("./uraion-agent-small-fp32")
Step 2: Convert FP32 safetensors → GGUF FP16
Using the convert_hf_to_gguf.py script from llama.cpp:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
pip install -r requirements.txt
python convert_hf_to_gguf.py \
./uraion-agent-small-fp32 \
--outfile ./uraion-agent-small-f16.gguf \
--outtype f16
Step 3: Quantize to GGUF variants
# Build llama-quantize
cmake -B build -DGGML_CUDA=OFF
cmake --build build --target llama-quantize -j4
# Create all quant variants
./build/bin/llama-quantize ./uraion-agent-small-f16.gguf ./Q4_K_M.gguf Q4_K_M
./build/bin/llama-quantize ./uraion-agent-small-f16.gguf ./Q5_K_M.gguf Q5_K_M
./build/bin/llama-quantize ./uraion-agent-small-f16.gguf ./Q6_K.gguf Q6_K
# ... and any other quants you need
Step 4: Verify
./build/bin/llama-cli -m ./Q4_K_M.gguf \
-p "What is the capital of France?" \
-n 20 --temp 0
This produces a fully working, slimmed-down GGUF that loads in any llama.cpp-based runner.
GGUF Quantizations available
| Filename | Type | Size | Quality |
|---|---|---|---|
| Uraion-Agent-Small-F16.gguf | F16 | ~3.4 GB | Reference |
| Uraion-Agent-Small-Q6_K.gguf | Q6_K | ~2.2 GB | Very high |
| Uraion-Agent-Small-Q5_K_M.gguf | Q5_K_M | ~2.0 GB | High |
| Uraion-Agent-Small-Q5_K_S.gguf | Q5_K_S | ~1.9 GB | High |
| Uraion-Agent-Small-Q4_K_M.gguf | Q4_K_M | ~1.9 GB | Good (recommended) |
| Uraion-Agent-Small-Q4_K_S.gguf | Q4_K_S | ~1.8 GB | Good |
| Uraion-Agent-Small-Q3_K_L.gguf | Q3_K_L | ~1.7 GB | Acceptable |
| Uraion-Agent-Small-Q3_K_M.gguf | Q3_K_M | ~1.6 GB | Acceptable |
| Uraion-Agent-Small-Q3_K_S.gguf | Q3_K_S | ~1.5 GB | Acceptable |
| Uraion-Agent-Small-Q2_K.gguf | Q2_K | ~1.9 GB | Low |
| Uraion-Agent-Small-IQ4_XS.gguf | IQ4_XS | ~1.9 GB | Good+ (I-quant) |
| Uraion-Agent-Small-IQ3_XXS.gguf | IQ3_XXS | ~1.8 GB | Good (I-quant) |
| Uraion-Agent-Small-IQ3_XS.gguf | IQ3_XS | ~1.8 GB | Good (I-quant) |
| Uraion-Agent-Small-IQ3_S.gguf | IQ3_S | ~1.8 GB | Good (I-quant) |
| Uraion-Agent-Small-IQ3_M.gguf | IQ3_M | ~1.8 GB | Good (I-quant) |
| Uraion-Agent-Small-IQ2_XXS.gguf | IQ2_XXS | ~1.8 GB | Acceptable (I-quant) |
| Uraion-Agent-Small-IQ2_XS.gguf | IQ2_XS | ~1.8 GB | Acceptable (I-quant) |
| Uraion-Agent-Small-IQ2_S.gguf | IQ2_S | ~1.9 GB | Acceptable (I-quant) |
| Uraion-Agent-Small-IQ2_M.gguf | IQ2_M | ~1.9 GB | Acceptable (I-quant) |
| Uraion-Agent-Small-IQ1_S.gguf | IQ1_S | ~1.8 GB | Low (I-quant) |
| Uraion-Agent-Small-IQ1_M.gguf | IQ1_M | ~1.8 GB | Low (I-quant) |
Note: Q8_0, Q4_0, Q5_0, Q5_1, and IQ4_NL are unavailable — Qwen3.5's hybrid architecture (Gated DeltaNet) has irregular 1D tensors incompatible with those block quant formats.
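Choosing a file from the table usually comes down to the best quality that still fits in memory. A small sketch of that selection for the K-quant files, using the approximate sizes listed above; `pick_quant` and the 0.5 GB runtime overhead (context cache and buffers) are assumptions for illustration, not measured values:

```python
from typing import Optional

# Approximate file sizes in GB from the table above, best quality first.
QUANTS_BY_QUALITY = [
    ("Uraion-Agent-Small-F16.gguf", 3.4),
    ("Uraion-Agent-Small-Q6_K.gguf", 2.2),
    ("Uraion-Agent-Small-Q5_K_M.gguf", 2.0),
    ("Uraion-Agent-Small-Q5_K_S.gguf", 1.9),
    ("Uraion-Agent-Small-Q4_K_M.gguf", 1.9),
    ("Uraion-Agent-Small-Q4_K_S.gguf", 1.8),
    ("Uraion-Agent-Small-Q3_K_L.gguf", 1.7),
    ("Uraion-Agent-Small-Q3_K_M.gguf", 1.6),
    ("Uraion-Agent-Small-Q3_K_S.gguf", 1.5),
]


def pick_quant(budget_gb: float, overhead_gb: float = 0.5) -> Optional[str]:
    """Return the highest-quality file whose size plus overhead fits the budget."""
    for filename, size_gb in QUANTS_BY_QUALITY:
        if size_gb + overhead_gb <= budget_gb:
            return filename
    return None  # nothing fits; consider an I-quant or more memory


print(pick_quant(4.0))  # Uraion-Agent-Small-F16.gguf
print(pick_quant(2.6))  # Uraion-Agent-Small-Q5_K_M.gguf
print(pick_quant(1.0))  # None
```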
Hardware comparison
| Setup | Memory needed | Speed | Quality | Effort |
|---|---|---|---|---|
| vLLM (A100/H100) | 4 GB VRAM | ~2000 tok/s | N/A | Low |
| vLLM (RTX 3090/4090) | 6 GB VRAM | ~500 tok/s | N/A | Low |
| Transformers + bitsandbytes (GPU) | 6 GB VRAM | ~50 tok/s | Good | Low |
| llama.cpp (GPU offload, Q4_K_M) | 4 GB VRAM | ~80 tok/s | Good | Medium |
| llama.cpp (CPU, Q4_K_M) | 4 GB RAM | ~8 tok/s | Good | Medium |
| Transformers + bitsandbytes (CPU) | 8 GB RAM | ~0.1 tok/s | Good | Low |
| MLX (Apple Silicon, M2+) | 8 GB unified | ~40 tok/s | Good | Low |
| Ollama / LM Studio | 4 GB | ~8 tok/s | Good | Minimal |
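To make the speed column concrete, the throughput figures can be turned into wall-clock time for a typical response. A short sketch using four rows from the table; the 256-token response length is an arbitrary example, and the speeds are the approximate values quoted above:

```python
# Approximate throughput (tokens per second) taken from the table above.
SPEEDS_TOK_PER_S = {
    "Transformers + bitsandbytes (CPU)": 0.1,
    "llama.cpp (CPU, Q4_K_M)": 8,
    "Transformers + bitsandbytes (GPU)": 50,
    "llama.cpp (GPU offload, Q4_K_M)": 80,
}


def seconds_to_generate(num_tokens: int, tok_per_s: float) -> float:
    """Estimate generation time, ignoring prompt processing and model load."""
    return num_tokens / tok_per_s


for setup, speed in SPEEDS_TOK_PER_S.items():
    print(f"{setup}: {seconds_to_generate(256, speed):.1f} s")
```

At 0.1 tok/s a 256-token reply takes roughly 43 minutes, which is why the CPU NF4 path is only suggested for verification.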
Troubleshooting
"Failed to load model from file" with GGUF
Cause: Your llama.cpp / llama-cpp-python version doesn't support the qwen35 architecture.
Fix: Build from source (see llama-cpp-python section) or switch to Transformers + bitsandbytes.
"check_tensor_dims: tensor ... has wrong shape" with GGUF
Cause: The GGUF files were generated from NF4-packed weights without a full dequantization step. This is a known issue (see above).
Fix: Use the Transformers + bitsandbytes path instead, or rebuild the GGUFs following the conversion guide.
"CUDA error: out of memory"
Cause: The NF4 model still uses ~4 GB VRAM when dequantized during forward passes.
Fix: Use CPU offload (device_map="cpu") or a smaller GGUF variant (Q3_K_M, IQ2_XXS).
"bitsandbytes requires CUDA"
The Transformers NF4 weights require bitsandbytes, which needs either CUDA or a recent CPU-compatible version:
pip install -U "bitsandbytes>=0.46.1"
On CPU-only systems, bitsandbytes 0.46+ has basic CPU support. Expect slow inference.
"Cannot use chat template functions because tokenizer.chat_template is not set"
The tokenizer on HuggingFace Hub doesn't include the chat template in its config. Load it manually:
with open("transformers/chat_template.jinja") as f:
    tokenizer.chat_template = f.read()
Or for remote loading:
import requests
url = "https://huggingface.co/UraionLabs/Uraion-Agent-Small/raw/main/transformers/chat_template.jinja"
tokenizer.chat_template = requests.get(url).text
Ollama / LM Studio can't find the model
Import a GGUF file manually:
- Download a GGUF file from the repo
- In LM Studio: drag the file into the model folder
- In Ollama: create a Modelfile (see LM Studio / Ollama section)
Intended Uses & Limitations
Intended use
- Tool-calling agents — function calling, API orchestration, multi-turn tool use
- Agent frameworks — drop-in replacement for agent runtimes behind an OpenAI-compatible API
- Local / edge inference — runs on consumer GPUs (6 GB+ VRAM) due to 4-bit quantization
- Systems research — studying harness behavior, evaluation loops, and model composition at a manageable scale (~2B params)
Out-of-scope
- Multimodal tasks — despite Qwen3.5-2B's vision backbone, this fine-tune was text-only and unevaluated on image/video inputs
- High-stakes decision making — research artifact; not intended for medical, legal, or financial advice without human oversight
- Unsupported languages — trained exclusively on English data
Limitations
- Trained for 1 epoch on ~27K examples. More data and more epochs would improve tool-calling reliability.
- May produce malformed JSON tool calls in edge cases — validate output before execution.
- 4-bit quantization introduces minor rounding error in the merged weights.
- This is a research-stage model, not a production product. We publish methods, configs, and artifacts that others can inspect, rerun, and improve — in keeping with our reproducible research principle.
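Because malformed tool calls are a known limitation, it helps to check arguments against the tool's declared schema before running anything. A minimal standard-library sketch covering only `required` and basic `type` checks; `validate_arguments` is a hypothetical helper, and a full JSON Schema validator would cover far more cases:

```python
JSON_TYPES = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "object": dict, "array": list,
}


def validate_arguments(arguments: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        if name not in properties:
            problems.append(f"unexpected argument: {name}")
            continue
        type_name = properties[name].get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue  # no declared type, or one this sketch does not cover
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
        if bool_as_number or not isinstance(value, expected):
            problems.append(f"argument {name} should be of type {type_name}")
    return problems


weather_schema = {
    "type": "object",
    "properties": {"location": {"type": "string", "description": "City name"}},
    "required": ["location"],
}
print(validate_arguments({"location": "Paris"}, weather_schema))  # []
print(validate_arguments({}, weather_schema))
```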
Training Data
The training mix sampled 26,893 examples across three datasets — prioritizing signal density over raw scale:
| Dataset | Type | Samples | Focus |
|---|---|---|---|
| NousResearch/hermes-function-calling-v1 | Function calling | 1,893 | Single-turn and multi-turn tool use conversations |
| Salesforce/APIGen-MT-5k | API generation | 5,000 | Multi-turn API call generation across diverse APIs |
| mlabonne/FineTome-100k | Instruction following | 20,000 | General instruct/chat data (curated sample from 100K) |
All data formatted via tokenizer.apply_chat_template() with the Qwen2.5-ChatML template. Examples without a user role were filtered. Sequence length capped at 2048 tokens for this training run.
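The user-role filter and the mix totals can be sketched in a few lines. This assumes each example is a dict with a `messages` list of `{"role", "content"}` entries; the actual preprocessing script is not shown here, so `has_user_turn` and `filter_examples` are illustrative names only:

```python
# Sample counts per dataset, as listed in the table above.
DATASET_MIX = {
    "NousResearch/hermes-function-calling-v1": 1893,
    "Salesforce/APIGen-MT-5k": 5000,
    "mlabonne/FineTome-100k": 20000,
}


def has_user_turn(example: dict) -> bool:
    """True when at least one message in the conversation has role 'user'."""
    return any(message.get("role") == "user" for message in example.get("messages", []))


def filter_examples(examples: list) -> list:
    """Drop conversations that contain no user turn."""
    return [example for example in examples if has_user_turn(example)]


toy = [
    {"messages": [{"role": "user", "content": "Hi"},
                  {"role": "assistant", "content": "Hello!"}]},
    {"messages": [{"role": "system", "content": "You are a bot."},
                  {"role": "assistant", "content": "Hi"}]},
]
kept = filter_examples(toy)
print(len(kept))                  # 1
print(sum(DATASET_MIX.values()))  # 26893
```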
Training Procedure
Framework
- Training: HuggingFace TRL SFTTrainer (v1.7.0) with SFTConfig
- PEFT: LoRA via peft (v0.18.0)
- Quantization: bitsandbytes (v0.47.0) NF4 4-bit
- Attention: PyTorch SDPA (attn_implementation="sdpa")
- Loss: Standard causal language modeling (no packing, no assistant-only masking)
Pipeline
- Model loading: 4-bit QLoRA via BitsAndBytesConfig
- Gradient checkpointing: Enabled with use_reentrant=True
- LoRA injection: LoraConfig applied to all linear projections
- Dataset processing: ShareGPT → ChatML → filtered → concatenated → shuffled
- Training: SFTTrainer with dataset_text_field="text", packing=False
- Export: merge_and_unload() → save_pretrained(safe_serialization=True) → single model.safetensors
Infrastructure
- Hardware: 1× NVIDIA A100-SXM4-40GB (provisioned via Google Colab CLI)
- Training time: ~22 minutes (60 steps, single-dataset initial pass)
- Full run estimate: ~4–5 hours on A100-40GB for all 27K examples
- Provisioning: colab run --gpu A100 --keep (self-bootstrapping script with automatic dependency installation)
Hyperparameters
QLoRA
| Parameter | Value |
|---|---|
| r | 32 |
| lora_alpha | 32 |
| lora_dropout | 0.0 |
| bias | none |
| task_type | CAUSAL_LM |
| target_modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
Quantization
| Parameter | Value |
|---|---|
| load_in_4bit | True |
| bnb_4bit_quant_type | nf4 |
| bnb_4bit_use_double_quant | True |
| bnb_4bit_compute_dtype | bfloat16 |
Training
| Parameter | Value |
|---|---|
| Sequence length | 2048 |
| Effective batch size | 32 |
| Per-device batch | 8 (A100) / 2 (T4) |
| Gradient accumulation | 4 (A100) / 16 (T4) |
| Learning rate | 2×10⁻⁴ |
| LR scheduler | Linear |
| Warmup steps | 100 |
| Optimizer | AdamW 8-bit |
| Epochs | 1 |
| Weight decay | 0.0 |
| Gradient checkpointing | True |
| Precision | BF16 (fallback: FP16) |
Training loss (1 epoch, 1,893-function-calling-example run)
| Step | Training Loss |
|---|---|
| 10 | 2.106 |
| 20 | 1.748 |
| 30 | 1.608 |
| 40 | 1.424 |
| 50 | 1.382 |
| 60 | 1.304 |
Loss decreased steadily across all steps — clean convergence on the function-calling data.
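The step count and the size of the loss drop can be cross-checked from the figures already given. A short worked calculation, assuming a single GPU (as listed under Infrastructure) and that the final partial batch is kept rather than dropped:

```python
import math

per_device_batch = 8        # A100 setting from the Training table
gradient_accumulation = 4   # A100 setting from the Training table
num_examples = 1893         # function-calling examples in this run

effective_batch = per_device_batch * gradient_accumulation
steps_per_epoch = math.ceil(num_examples / effective_batch)

loss_first, loss_last = 2.106, 1.304   # logged at step 10 and step 60
drop = loss_first - loss_last
relative_drop = drop / loss_first

print(effective_batch)          # 32
print(steps_per_epoch)          # 60
print(round(drop, 3))           # 0.802
print(f"{relative_drop:.1%}")   # 38.1%
```

The 60 steps in the loss table therefore correspond to exactly one pass over the 1,893 examples at an effective batch size of 32.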
Ethical Considerations
This model is a fine-tune of Qwen3.5-2B and inherits its base capabilities and biases:
- Training data includes user-generated content from HuggingFace datasets, which may contain biases.
- Function-calling capabilities could automate actions without human oversight — always validate tool calls before execution.
- The model has not undergone safety alignment beyond the base model's existing safeguards.
- At ~2B parameters, it has limited reasoning capacity compared to larger models — use appropriate guardrails in production.
- This is a research-stage artifact from Uraion Labs. We are a systems research lab, not a product company. Use accordingly.
Changelog
| Date | Change |
|---|---|
| 2026-06-30 | Initial release. GGUF + NF4 Transformers weights published. |
Citations
Qwen3.5
@misc{qwen3.5,
title = {Qwen3.5: A New Generation of Large Language Models},
author = {Qwen Team},
year = {2026},
publisher = {GitHub},
url = {https://github.com/QwenLM/Qwen3.5}
}
TRL
@software{vonwerra2020trl,
title = {{TRL: Transformers Reinforcement Learning}},
author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
license = {Apache-2.0},
url = {https://github.com/huggingface/trl},
year = {2020}
}
QLoRA
@article{dettmers2023qlora,
title = {QLoRA: Efficient Finetuning of Quantized Language Models},
author = {Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
journal = {arXiv preprint arXiv:2305.14314},
year = {2023}
}
Hermes Function Calling
@misc{hermesfc,
title = {NousResearch Hermes Function Calling},
author = {Nous Research},
year = {2024},
url = {https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1}
}
APIGen
@misc{apigen2024,
title = {APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets},
author = {Salesforce AI Research},
year = {2024},
url = {https://huggingface.co/datasets/Salesforce/APIGen-MT-5k}
}
FineTome
@misc{finetome2024,
title = {FineTome-100k: A Curated Instruction Tuning Dataset},
author = {Labonne, Maxime},
year = {2024},
url = {https://huggingface.co/datasets/mlabonne/FineTome-100k}
}
Uraion Labs — Foundational systems research.
uraionlabs.com
Intelligence is a systems problem.
Licensed under Apache 2.0.