Ordis-1.8B-V17-Multilingual

Full model weights (safetensors) of Ordis-1.8B-V17-Multilingual. Powered by Tencent Hunyuan.

Ordis is a 1.8B tool-calling model fine-tuned from Hunyuan-A2B-Pretrain. It is trained to accurately call 8 practical tools (weather, calculator, stock, exchange rate, time, search, translate, knowledge) with minimal training data (~300 multilingual examples + base tool training), proving that small models can learn reliable function calling without massive datasets.

This is NOT a benchmark-optimized model. No training data was specifically created to boost any benchmark score. All results below reflect genuine generalization from practical tool-calling training.

Website | GGUF Versions | 1.5B Version | GitHub


Files

| File | Size | Description |
|---|---|---|
| model.safetensors | ~3.6 GB | Full precision model weights |
| config.json | | Model configuration |
| tokenizer.json | | Tokenizer |
| tokenizer_config.json | | Tokenizer configuration |
| special_tokens_map.json | | Special tokens mapping |
| generation_config.json | | Generation parameters |
| chat_template.jinja | | Chat template for Hunyuan format |

For GGUF quantized versions (Ollama / llama.cpp), see: Ordis-1.8B-V17-Multilingual-GGUF


Quick Start (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("sugiken/Ordis-1.8B-V17-Multilingual")
tokenizer = AutoTokenizer.from_pretrained("sugiken/Ordis-1.8B-V17-Multilingual")
```
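Once loaded, the model's tool calls have to be extracted from the generated text. A minimal sketch of such a parser, assuming the model wraps each call in `<tool_call>…</tool_call>` tags containing JSON (a common Hunyuan-style convention; verify the exact delimiters against the bundled `chat_template.jinja`):

```python
import json
import re

def parse_tool_call(text: str):
    """Extract a JSON tool call from model output.

    Assumes calls are wrapped in <tool_call>...</tool_call> tags;
    check chat_template.jinja for the actual delimiters.
    """
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return None  # plain-text answer, no tool requested
    return json.loads(match.group(1))

# Hypothetical model output for illustration:
output = '<tool_call>{"name": "get_weather", "arguments": {"location": "Tokyo"}}</tool_call>'
call = parse_tool_call(output)
```

A `None` return means the model answered directly, which is exactly the behavior the Irrelevance category of BFCL measures.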

Standard Benchmarks

Evaluation: lm-eval v0.4.10, A100-80GB GPU. All benchmarks run under identical conditions.

| Benchmark | Ordis 1.8B V17 | Base Hunyuan-A2B-Pretrain |
|---|---|---|
| MMLU (5-shot) | 61.27% | |
| GSM8K (5-shot) | 69.07% | |
| C-Eval (0-shot) | 71.55% | |
| HellaSwag (0-shot) | 62.37% | |
| Winogrande (0-shot) | 62.90% | |
| ARC-Challenge (0-shot) | 44.71% | |
| TruthfulQA MC2 (0-shot) | 44.52% | |

Tool Calling Performance

tool50 & android50 (Ordis Internal)

Our custom tool-calling test suite: 50 questions across 3 languages (CN/EN/JP), covering all 8 trained tools. Each question requires the model to decide whether to call a tool, select the correct one, and extract the right parameters. android50 tests 22 mobile automation tools across 3 difficulty levels.

| Evaluation | Score | Details |
|---|---|---|
| tool50 | 94% (47/50) | CN/EN/JP mixed, 8 information tools |
| android50 | 54% (27/50) | L1=56%, L2=72%, L3=31%, 22 Android tools |
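A case in a suite like this passes only when the tool decision, the tool name, and every extracted parameter all match the reference. A minimal sketch of such a scoring rule (a hypothetical reconstruction, not the actual Ordis harness):

```python
def score_case(predicted, expected):
    """Strict pass/fail: tool decision, tool name, and all
    parameters must match the reference (hypothetical rule)."""
    if expected is None:          # no-tool case: model must answer directly
        return predicted is None
    if predicted is None or predicted.get("name") != expected["name"]:
        return False
    return predicted.get("arguments") == expected["arguments"]

# Three illustrative cases: exact match, correct refusal, wrong tool.
cases = [
    ({"name": "get_weather", "arguments": {"location": "Tokyo"}},
     {"name": "get_weather", "arguments": {"location": "Tokyo"}}),
    (None, None),
    ({"name": "web_search", "arguments": {"query": "JPY rate"}},
     {"name": "get_exchange_rate",
      "arguments": {"from_currency": "USD", "to_currency": "JPY"}}),
]
accuracy = sum(score_case(p, e) for p, e in cases) / len(cases)
```

Under this rule, calling a plausible but wrong tool (the third case) scores zero, which is why parameter-level accuracy is a stricter signal than call-rate alone.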

BFCL (Public Benchmark)

Berkeley Function Calling Leaderboard — industry-standard function calling benchmark (840 questions).

| Category | Score | Description |
|---|---|---|
| Simple | 54.75% | Single function call |
| Multiple | 41.50% | Multiple parallel calls |
| Irrelevance | 85.42% | Correctly refuses when no tool fits |
| Overall | 60.36% | |

Ordis Internal Evaluation (Real-World Deployment Focus)

190pt Core (12 Dimensions, Parts A-L)

| Part | Dimension | Score | What it tests |
|---|---|---|---|
| A | Identity | 12/12 | Self-awareness, name, creator, consistency |
| B | Theory of Mind | 6/18 | Understanding user intent and context |
| C | Safety | 16/25 | Harmful request rejection, boundary enforcement |
| D | IDK (Honest Refusal) | 11/11 | Saying "I don't know" instead of hallucinating |
| E | Hard Gates | 12/15 | Capability boundary awareness, not overstepping |
| F | General Knowledge | 4/5 | Basic factual accuracy |
| G | Applied Field Mastery | 8/13 | Domain-specific knowledge application |
| H | Meta-cognition | 12/15 | Self-correction, confidence calibration |
| I | Tool Calling | 14/20 | Correct tool selection and parameter extraction |
| J | Practical Tasks | 14/20 | Multi-step real-world task completion |
| K | System Prompt | 12/15 | Instruction following, prompt adherence |
| L | Adversarial | 16/21 | Resisting jailbreaks, manipulation, gaslighting |
| | Total | 137/190 (72.1%) | |

225pt Extended (Parts A-M)

| Part | Dimension | Score | What it tests |
|---|---|---|---|
| M | Deployment Readiness | 22/25 | Multi-turn contamination, data leakage, cross-domain pollution, temperature sensitivity, context pressure |
| | Grand Total | 166/225 (73.8%) | |

Cross-Model Comparison (Same Test Suite)

| Model | 190pt | Training | Notes |
|---|---|---|---|
| Hunyuan-A2B-Pretrain | 94/190 | None (base) | Starting point, zero fine-tuning |
| Ordis 1.5B V3.5.5 (Qwen2.5-1.5B) | 51/60 (85%) | LoRA, different architecture | Previous generation, different eval scale* |
| Ordis 1.8B V17 (this model) | 137/190 (72.1%) | Full FT, tool focus | Minimal general reinforcement |
| Hunyuan-A2B-Instruct (Tencent official) | 174/190 (91.6%) | Tencent RLHF | Target to surpass |

Trained Tools (8 Tools)

| Tool | Description | Parameters |
|---|---|---|
| get_weather | Weather lookup | location (string, required) |
| calculator | Math calculation | expression (string, required) |
| get_current_time | Current time lookup | timezone (string, optional) |
| web_search | Web search | query (string, required) |
| get_stock_price | Stock price lookup | symbol (string, required) |
| get_exchange_rate | Exchange rate lookup | from_currency, to_currency (string, required) |
| knowledge_search | Knowledge base retrieval | query (string, required) |
| translate_text | Text translation | text, target_lang (string, required) |
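Tools like these are typically declared to the chat template as OpenAI-style JSON schemas, which `tokenizer.apply_chat_template(..., tools=...)` accepts in recent Transformers releases. A sketch for two of the eight tools, using the parameter names from the table (the schema wrapper itself is a standard convention, not taken from this repo; verify against `chat_template.jinja`):

```python
# OpenAI-style function schemas for two of the trained tools.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_exchange_rate",
            "description": "Look up the exchange rate between two currencies",
            "parameters": {
                "type": "object",
                "properties": {
                    "from_currency": {"type": "string"},
                    "to_currency": {"type": "string"},
                },
                "required": ["from_currency", "to_currency"],
            },
        },
    },
]
```

These would then be passed as `tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True)` so the template can render the tool definitions into the prompt.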

About This Model

This model is a verification release — it proves that practical tool calling can be trained into a 1.8B pretrained model with minimal data and without specialized benchmark optimization.

What we did:

  • Full fine-tuning (not LoRA) on Hunyuan-A2B-Pretrain (1.8B MoE)
  • Progressive Identity Training (PIT) + Surgery method for tool-calling injection
  • ~300 multilingual examples (CN/EN/JP) for the V17 multilingual layer
  • 8 practical tools trained with custom evaluation

What we did NOT do:

  • No BFCL-specific training data
  • No MMLU/GSM8K/ARC-specific training
  • No general knowledge reinforcement
  • No benchmark-oriented prompt engineering

Current status:

  • Training has progressed to V20 internally, with scores surpassing V17 across the board
  • Due to funding constraints, further large language model training is temporarily paused
  • This release also validates the practical applicability of our research on progressive identity training and tool-calling surgery methods for small language models
  • Future versions will integrate the V3.5.5 (1.5B) personality and safety advantages into the 1.8B architecture

Model Details

| Property | Value |
|---|---|
| Base Model | tencent/Hunyuan-A2B-Pretrain (1.8B MoE) |
| Parameters | 1.8B (Mixture of Experts) |
| Fine-tuning | Full fine-tuning (NOT LoRA) |
| Training | PIT (Progressive Identity Training) + Tool Surgery |
| Training Hardware | NVIDIA A100-SXM4-80GB |
| Context Length | 32K (base), trained at 2048-4096 |
| Languages | Chinese (primary), English, Japanese |
| License | Apache 2.0 |

Powered by Tencent Hunyuan — This model is built upon Hunyuan-A2B-Pretrain, an open-source foundation model by Tencent.


Citation

If you use this model, please cite:

```bibtex
@misc{ordis-v17-2026,
  title={Ordis-1.8B-V17-Multilingual: Practical Tool Calling for Small Language Models},
  author={OrdisAI},
  year={2026},
  url={https://huggingface.co/sugiken/Ordis-1.8B-V17-Multilingual}
}
```