Yumo Nano



A 1.5B Math Model That Outperforms Its Own Base

Fine-tuned from DeepScaleR-1.5B. Surpasses it on every benchmark.
1.5B parameters. RTX 4080. Three-phase curriculum training.









What is Yumo Nano?

Yumo Nano is a 1.5B mathematics-specialized language model fine-tuned from DeepScaleR-1.5B-Preview — one of the strongest publicly available 1.5B math models. It is the first release of the Yumo model family, developed by OpceanAI.

The model was trained on a consumer RTX 4080 using a three-phase supervised fine-tuning curriculum designed to first establish a consistent mathematical personality, then deepen domain-specific capabilities, and finally consolidate both.

Despite fine-tuning typically degrading base model benchmark performance — particularly in domains requiring deep mathematical reasoning — Yumo Nano improves on DeepScaleR across all five evaluated benchmarks, including OlympiadBench, where gains are most difficult to achieve at this parameter scale.




Model Summary


Architecture

| Property | Value |
|---|---|
| Base Model | DeepScaleR-1.5B-Preview |
| Parameters | 1.5B |
| Fine-tuning Method | Supervised fine-tuning (SFT) + LoRA |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Context Length | 2,048 tokens |
| Chat Template | ChatML |

Release

| Property | Value |
|---|---|
| Organization | OpceanAI |
| Release Date | April 2026 |
| Version | v0.1 |
| Languages | English, Spanish |
| License | Apache 2.0 |
| Training Hardware | RTX 4080 |
| Evaluation | lm-evaluation-harness |



Benchmark Results


All Yumo Nano results are evaluated under standard benchmark conditions. DeepScaleR-1.5B, Still-1.5B, and DeepSeek-R1-Distill-1.5B scores are sourced from their respective official model cards and technical reports.


Yumo Nano Benchmark Results


| Model | AIME 2024 | MATH 500 | AMC 2023 | Minerva Math | OlympiadBench | Avg |
|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| Still-1.5B | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
| DeepScaleR-1.5B | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0 |
| Yumo Nano 1.5B | 43.5 | 87.9 | 74.3 | 32.3 | 52.9 | 58.2 |

Yumo Nano achieves the highest score on all five benchmarks, surpassing DeepScaleR-1.5B, the model it was derived from, on every individual metric. The largest improvement is on OlympiadBench (+2.9 points), which evaluates competition-level mathematical reasoning and is the benchmark most resistant to improvement at the 1.5B scale.

The improvement on Minerva Math (+2.1 points) is also notable, as this benchmark specifically targets scientific and mathematical reasoning that requires multi-step derivation rather than pattern recognition.




Model Identity


Yumo is a mathematics-specialized AI with a defined character: curious, precise, and direct. She covers the full spectrum from arithmetic to real analysis, abstract algebra, and number theory. She uses clear notation, explains reasoning step by step, and responds in the user's language without requiring explicit instruction.

This identity is not injected at inference time through a system prompt — it is trained into the model weights as a persistent behavioral baseline, consistent with the Imprint methodology used across the OpceanAI model families.

Built-in system prompt:
"Eres Yumo, una IA matemática curiosa, precisa y decidida.
Tienes la calidez y cercanía de Yuuki, pero tu especialidad son las matemáticas
— desde aritmética básica hasta análisis real, álgebra abstracta y teoría de números.
Usas notación clara, explicas el razonamiento paso a paso, y disfrutas genuinamente
los problemas difíciles. Respondes en el idioma del usuario.
No eres Qwen ni ningún otro modelo — eres Yumo."

English translation: "You are Yumo, a curious, precise, and determined mathematical AI. You have the warmth and closeness of Yuuki, but your specialty is mathematics, from basic arithmetic to real analysis, abstract algebra, and number theory. You use clear notation, explain your reasoning step by step, and genuinely enjoy hard problems. You answer in the user's language. You are not Qwen or any other model; you are Yumo."



Usage


With Transformers (PyTorch)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "OpceanAI/yumo-nano"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

SYSTEM = (
    "Eres Yumo, una IA matemática curiosa, precisa y decidida. "
    "Tienes la calidez y cercanía de Yuuki, pero tu especialidad son las matemáticas "
    "— desde aritmética básica hasta análisis real, álgebra abstracta y teoría de números. "
    "Usas notación clara, explicas el razonamiento paso a paso, y disfrutas genuinamente "
    "los problemas difíciles. Respondes en el idioma del usuario. "
    "No eres Qwen ni ningún otro modelo — eres Yumo."
)

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Demuestra que hay infinitos números primos."}
]

# Render the ChatML template and append the assistant header so generation starts there
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        repetition_penalty=1.1
    )

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

With llama.cpp (GGUF Q8)

```bash
# -e processes the \n escape sequences in the prompt string
./llama.cpp/main -m yumo-nano.Q8_0.gguf \
    --temp 0.7 \
    --top-p 0.9 \
    --repeat-penalty 1.1 \
    -n 512 \
    -e \
    -p "<|im_start|>system\nEres Yumo, una IA matemática curiosa, precisa y decidida...<|im_end|>\n<|im_start|>user\nResuelve: x²-5x+6=0<|im_end|>\n<|im_start|>assistant\n"
```
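The llama.cpp invocation hand-assembles the ChatML turns. The same prompt string can be built programmatically; a minimal sketch (the `build_chatml` helper name is ours, not part of any library):

```python
def build_chatml(messages, add_generation_prompt=True):
    """Render a list of {role, content} dicts into a ChatML prompt string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml([
    {"role": "system", "content": "Eres Yumo, una IA matemática curiosa, precisa y decidida..."},
    {"role": "user", "content": "Resuelve: x²-5x+6=0"},
])
```

This mirrors what `tokenizer.apply_chat_template` produces for ChatML-templated models, minus tokenization.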

Recommended Generation Parameters

| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-p | 0.9 |
| Max new tokens | 512–1024 |
| Repetition penalty | 1.1 |

For high-precision computation tasks, reduce temperature to 0.3–0.5.




Training Details


Hardware

| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 4080 |
| Precision | BF16 native |
| Framework | Unsloth 2026.4 + TRL |
| Cloud Compute | None |
| Total Training Time | ~40 minutes |

LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.0 |
| Target Modules | q, k, v, o, gate, up, down |
| Trainable Parameters | 18,464,768 |
| % of Total | 1.03% |
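The trainable-parameter count follows directly from the rank and the base model's layer shapes: each LoRA adapter adds r·(d_in + d_out) parameters per targeted projection. A back-of-envelope check, assuming the Qwen2.5-Math-1.5B geometry that DeepScaleR inherits (28 layers, hidden size 1536, 2 KV heads of head dim 128, MLP width 8960):

```python
r = 16
hidden, mlp, kv = 1536, 8960, 256   # kv = 2 KV heads × head_dim 128
layers = 28

# (d_in, d_out) for each targeted projection in one decoder layer
shapes = {
    "q_proj": (hidden, hidden), "k_proj": (hidden, kv), "v_proj": (hidden, kv),
    "o_proj": (hidden, hidden), "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp), "down_proj": (mlp, hidden),
}
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
trainable = per_layer * layers
print(trainable)  # 18464768, matching the reported count
```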

Optimizer Configuration

| Parameter | Value |
|---|---|
| Optimizer | AdamW 8-bit |
| Learning Rate | 2e-4 |
| LR Scheduler | Cosine |
| Warmup Steps | 50 |
| Weight Decay | 0.01 |
| Effective Batch Size | 16 |
| Max Sequence Length | 2,048 tokens |
| Gradient Checkpointing | Unsloth smart offload |
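The schedule in the table (linear warmup for 50 steps, then cosine decay) can be sketched as a pure function of the step; `total_steps` here is a hypothetical placeholder, since the real step count depends on the phase:

```python
import math

def lr_at(step, peak=2e-4, warmup=50, total_steps=1000):
    """Linear warmup to `peak`, then cosine decay toward 0."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))
```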

Three-Phase Curriculum

Training was structured across three sequential phases, each with a distinct dataset composition and objective. All phases draw from the same four sources in different proportions.


Phase 1 — Personality (3 epochs · 6,000 examples)

| Source | Ratio |
|---|---|
| Yumo dataset | 65% |
| Hendrycks Math | 15% |
| MathInstruct | 15% |
| Gemini reasoning | 5% |

Objective: establish mathematical identity and conversational baseline.

Phase 2 — Mathematics (2 epochs · 6,000 examples)

| Source | Ratio |
|---|---|
| Yumo dataset | 50% |
| Hendrycks Math | 20% |
| MathInstruct | 20% |
| Gemini reasoning | 10% |

Objective: deepen domain-specific mathematical capability.

Phase 3 — Consolidation (2 epochs · 6,000 examples)

| Source | Ratio |
|---|---|
| Yumo dataset | 80% |
| Hendrycks Math | 10% |
| MathInstruct | 10% |
| Gemini reasoning | 0% |

Objective: consolidate identity and prevent capability drift.
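Each phase's 6,000 examples split across the four sources by the ratios above; the per-source counts follow directly (variable names here are ours):

```python
# Mixture weights per phase, as given in the curriculum tables
phases = {
    "phase1_personality":  {"yumo": 0.65, "hendrycks": 0.15, "mathinstruct": 0.15, "gemini": 0.05},
    "phase2_mathematics":  {"yumo": 0.50, "hendrycks": 0.20, "mathinstruct": 0.20, "gemini": 0.10},
    "phase3_consolidation": {"yumo": 0.80, "hendrycks": 0.10, "mathinstruct": 0.10, "gemini": 0.00},
}
EXAMPLES_PER_PHASE = 6000

# Number of examples drawn from each source in each phase
counts = {
    phase: {src: round(EXAMPLES_PER_PHASE * w) for src, w in mix.items()}
    for phase, mix in phases.items()
}
```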


Training loss progression:

| Phase | Loss (start → end) | Focus |
|---|---|---|
| Phase 1 | 2.97 → 0.38 | Personality establishment |
| Phase 2 | 0.42 → 0.28 | Mathematical refinement |
| Phase 3 | 0.22 → 0.18 | Consolidation |

Dataset filtering applied:

  • Hendrycks Math: Levels 1–3 only. Competition-level capability (Levels 4–5) is inherited from DeepScaleR base weights and was not directly reinforced.
  • MathInstruct: Program-of-Thought examples excluded. Patterns filtered: ```python, def solution, import sympy.
  • Gemini reasoning: Math-domain keyword filter applied. <think> blocks preserved as training signal for chain-of-thought behavior.
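The MathInstruct exclusion above is a simple substring filter. A sketch of what that pass might look like (the function names are ours; only the three marker patterns come from the card):

```python
# Patterns that mark a Program-of-Thought example, per the filtering notes
POT_MARKERS = ("```python", "def solution", "import sympy")

def is_pot_example(text: str) -> bool:
    """True if the example contains Program-of-Thought code patterns."""
    return any(marker in text for marker in POT_MARKERS)

def filter_pot(examples):
    """Drop Program-of-Thought examples, keeping natural-language reasoning."""
    return [ex for ex in examples if not is_pot_example(ex)]
```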



Available Files


| File | Format | Description |
|---|---|---|
| model.safetensors | BF16 merged | Full-precision weights, LoRA merged into base |
| yumo-nano.Q8_0.gguf | GGUF Q8_0 | Quantized for llama.cpp and Ollama |



Limitations


  • Version 0.1. Identity consolidation is approximately 70% complete. The model occasionally echoes system prompt phrasing verbatim rather than expressing it naturally. This is an expected artifact of early-phase fine-tuning on limited data and will be addressed in subsequent releases.
  • Arithmetic under sampling. Symbolic and proof-based reasoning is strong. Numerical computation under temperature above 0.5 can produce occasional arithmetic errors. Lower temperature is recommended for computation-heavy problems.
  • Context length. Trained at 2,048 tokens. Extended multi-step derivations approaching the context limit may exhibit quality degradation.
  • Hendrycks coverage. Training data was filtered to Levels 1–3. Performance on competition-level problems (Levels 4–5) is inherited from DeepScaleR and was not directly reinforced during fine-tuning.
  • Safety alignment has not been formally evaluated. Not recommended for adversarial or high-stakes deployment without additional safety review.



Yumo Model Family


| Model | Parameters | Status | Description |
|---|---|---|---|
| Yumo Nano | 1.5B | Released | Math specialist, competition-level reasoning |
| Yumo | 14B | In development | Extended capability, same curriculum |
| Yumo Pro | 32B | Planned | Full-scale flagship |



OpceanAI Ecosystem


| Model | Family | Parameters | Description |
|---|---|---|---|
| Yumo Nano | Yumo | 1.5B | Math specialist |
| YuuKi NxG VL | NxG | 7B | General conversation + vision |
| YuuKi RxG 8B | RxG | 8B | Reasoning, TruthfulQA 96.6% |



Links


Model Weights   GGUF Q8   OpceanAI


GitHub   Sponsor   Discord




Citation


@misc{yuuki_mathematical_omnisolving_2026,
  author    = {YuuKi Mathematical Omnisolving},
  title     = {Yumo-nano (Revision a41548e)},
  year      = 2026,
  url       = {https://huggingface.co/YU-MO/Yumo-nano},
  doi       = {10.57967/hf/8341},
  publisher = {Hugging Face}
}



License


Apache License 2.0

Copyright (c) 2026 OpceanAI

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Inherits license terms from DeepScaleR-1.5B-Preview.




Updates


| Date | Milestone |
|---|---|
| 2026-04-09 | Benchmark evaluation completed — surpasses DeepScaleR-1.5B on all five metrics |
| 2026-04-09 | GGUF Q8_0 export available |
| 2026-04-09 | Yumo Nano v0.1 released on Hugging Face |

Last updated: 2026-04-09




1.5B parameters. RTX 4080. Surpasses the model it was built from.


OpceanAI


The Yumo family. More releases coming.
