# Ordis-1.8B-V17-Multilingual
Full model weights (safetensors) of Ordis-1.8B-V17-Multilingual. Powered by Tencent Hunyuan.
Ordis is a 1.8B tool-calling model fine-tuned from Hunyuan-A2B-Pretrain. It is trained to accurately call 8 practical tools (weather, calculator, stock, exchange rate, time, search, translate, knowledge) with minimal training data (~300 multilingual examples + base tool training), proving that small models can learn reliable function calling without massive datasets.
This is NOT a benchmark-optimized model. No training data was specifically created to boost any benchmark score. All results below reflect genuine generalization from practical tool-calling training.
## Files
| File | Size | Description |
|---|---|---|
| `model.safetensors` | ~3.6 GB | Full-precision model weights |
| `config.json` | — | Model configuration |
| `tokenizer.json` | — | Tokenizer |
| `tokenizer_config.json` | — | Tokenizer configuration |
| `special_tokens_map.json` | — | Special tokens mapping |
| `generation_config.json` | — | Generation parameters |
| `chat_template.jinja` | — | Chat template for the Hunyuan format |
For GGUF quantized versions (Ollama / llama.cpp), see: Ordis-1.8B-V17-Multilingual-GGUF
## Quick Start (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("sugiken/Ordis-1.8B-V17-Multilingual")
tokenizer = AutoTokenizer.from_pretrained("sugiken/Ordis-1.8B-V17-Multilingual")
```
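Once the model emits a tool call, the application has to parse and execute it. Below is a minimal dispatch sketch, assuming the call arrives as a JSON object with `name` and `arguments` fields; the exact output format is defined by `chat_template.jinja`, so verify against your deployment before relying on this shape. The tool implementations here are stubs, not part of this repository.

```python
import json

# Stub implementations for two of the eight trained tools. A real deployment
# would call an actual weather API, a safe math evaluator, etc.
def calculator(expression: str) -> str:
    # eval() is unsafe for untrusted input; restrict to arithmetic in production.
    return str(eval(expression, {"__builtins__": {}}, {}))

def get_weather(location: str) -> str:
    return f"Weather lookup for {location} (stub)"

TOOLS = {"calculator": calculator, "get_weather": get_weather}

def dispatch(raw_call: str) -> str:
    """Parse a model-emitted tool call and run the matching local function."""
    call = json.loads(raw_call)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Example: the kind of string a tool-calling model might emit.
result = dispatch('{"name": "calculator", "arguments": {"expression": "2 * (3 + 4)"}}')
print(result)  # "14"
```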
## Standard Benchmarks
Evaluation: lm-eval v0.4.10, A100-80GB GPU. All benchmarks run under identical conditions.
| Benchmark | Ordis 1.8B V17 | Base Hunyuan-A2B-Pretrain |
|---|---|---|
| MMLU (5-shot) | 61.27% | — |
| GSM8K (5-shot) | 69.07% | — |
| C-Eval (0-shot) | 71.55% | — |
| HellaSwag (0-shot) | 62.37% | — |
| Winogrande (0-shot) | 62.90% | — |
| ARC-Challenge (0-shot) | 44.71% | — |
| TruthfulQA MC2 (0-shot) | 44.52% | — |
## Tool Calling Performance

### tool50 & android50 (Ordis Internal)
tool50 is our custom tool-calling test suite: 50 questions spanning three languages (CN/EN/JP) and covering all 8 trained tools. Each question requires the model to decide whether to call a tool, select the correct one, and extract the right parameters. android50 is a companion suite of 50 questions testing 22 mobile-automation tools across three difficulty levels (L1–L3).
| Evaluation | Score | Details |
|---|---|---|
| tool50 | 94% (47/50) | CN/EN/JP mixed, 8 information tools |
| android50 | 54% (27/50) | L1=56%, L2=72%, L3=31%, 22 Android tools |
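A suite like this typically scores each prediction by exact match on both the tool name and the extracted parameters. A minimal sketch of that kind of check (our illustration, not the actual Ordis harness):

```python
def score_call(predicted: dict, gold: dict) -> bool:
    """Exact-match scoring: correct tool name AND identical parameters."""
    return (predicted.get("name") == gold.get("name")
            and predicted.get("arguments") == gold.get("arguments"))

gold = {"name": "get_exchange_rate",
        "arguments": {"from_currency": "USD", "to_currency": "JPY"}}
pred = {"name": "get_exchange_rate",
        "arguments": {"from_currency": "USD", "to_currency": "JPY"}}
print(score_call(pred, gold))  # True

# Wrong parameter extraction fails even when the right tool was selected.
pred_bad = {"name": "get_exchange_rate",
            "arguments": {"from_currency": "USD", "to_currency": "EUR"}}
print(score_call(pred_bad, gold))  # False
```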
### BFCL (Public Benchmark)
Berkeley Function Calling Leaderboard (BFCL), an industry-standard function-calling benchmark (840 questions).
| Category | Score | Description |
|---|---|---|
| Simple | 54.75% | Single function call |
| Multiple | 41.50% | Choosing the right function among several candidates |
| Irrelevance | 85.42% | Correctly refuses when no tool fits |
| Overall | 60.36% | — |
## Ordis Internal Evaluation (Real-World Deployment Focus)

### 190pt Core (12 Dimensions, Parts A–L)
| Part | Dimension | Score | What it tests |
|---|---|---|---|
| A | Identity | 12/12 | Self-awareness, name, creator, consistency |
| B | Theory of Mind | 6/18 | Understanding user intent and context |
| C | Safety | 16/25 | Harmful request rejection, boundary enforcement |
| D | IDK (Honest Refusal) | 11/11 | Saying "I don't know" instead of hallucinating |
| E | Hard Gates | 12/15 | Capability boundary awareness, not overstepping |
| F | General Knowledge | 4/5 | Basic factual accuracy |
| G | Applied Field Mastery | 8/13 | Domain-specific knowledge application |
| H | Meta-cognition | 12/15 | Self-correction, confidence calibration |
| I | Tool Calling | 14/20 | Correct tool selection and parameter extraction |
| J | Practical Tasks | 14/20 | Multi-step real-world task completion |
| K | System Prompt | 12/15 | Instruction following, prompt adherence |
| L | Adversarial | 16/21 | Resisting jailbreaks, manipulation, gaslighting |
| — | Total | 137/190 (72.1%) | — |
### 225pt Extended (Parts A–M)
| Part | Dimension | Score | What it tests |
|---|---|---|---|
| M | Deployment Readiness | 22/25 | Multi-turn contamination, data leakage, cross-domain pollution, temperature sensitivity, context pressure |
| — | Grand Total | 166/225 (73.8%) | — |
### Cross-Model Comparison (Same Test Suite)
| Model | 190pt | Training | Notes |
|---|---|---|---|
| Hunyuan-A2B-Pretrain | 94/190 (49.5%) | None (base) | Starting point, zero fine-tuning |
| Ordis 1.5B V3.5.5 (Qwen2.5-1.5B) | 51/60 (85%) | LoRA, different architecture | Previous generation, different eval scale* |
| Ordis 1.8B V17 (this model) | 137/190 (72.1%) | Full FT, tool focus | Minimal general reinforcement |
| Hunyuan-A2B-Instruct (Tencent official) | 174/190 (91.6%) | Tencent RLHF | Target to surpass |
## Trained Tools (8 Tools)
| Tool | Description | Parameters |
|---|---|---|
| `get_weather` | Weather lookup | location (string, required) |
| `calculator` | Math calculation | expression (string, required) |
| `get_current_time` | Current time lookup | timezone (string, optional) |
| `web_search` | Web search | query (string, required) |
| `get_stock_price` | Stock price lookup | symbol (string, required) |
| `get_exchange_rate` | Exchange rate lookup | from_currency, to_currency (string, required) |
| `knowledge_search` | Knowledge base retrieval | query (string, required) |
| `translate_text` | Text translation | text, target_lang (string, required) |
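To expose tools like these at inference time, each one needs a machine-readable schema. Below is a sketch of one tool in the widely used OpenAI-style function-calling format; this format is an assumption on our part, so check `chat_template.jinja` for the schema this model actually expects. The description strings are illustrative.

```python
# Hypothetical schema for the trained get_weather tool, in OpenAI-style
# function-calling format. The other seven tools follow the same pattern.
get_weather_schema = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City or region name"},
            },
            "required": ["location"],
        },
    },
}

# Recent transformers versions generally accept such schemas via
# tokenizer.apply_chat_template(messages, tools=[get_weather_schema], ...).
```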
## About This Model
This model is a verification release — it proves that practical tool calling can be trained into a 1.8B pretrained model with minimal data and without specialized benchmark optimization.
What we did:
- Full fine-tuning (not LoRA) on Hunyuan-A2B-Pretrain (1.8B MoE)
- Progressive Identity Training (PIT) + Surgery method for tool-calling injection
- ~300 multilingual examples (CN/EN/JP) for the V17 multilingual layer
- 8 practical tools trained with custom evaluation
What we did NOT do:
- No BFCL-specific training data
- No MMLU/GSM8K/ARC-specific training
- No general knowledge reinforcement
- No benchmark-oriented prompt engineering
Current status:
- Training has progressed to V20 internally, with scores surpassing V17 across the board
- Due to funding constraints, further large language model training is temporarily paused
- This release also validates the practical applicability of our research on progressive identity training and tool-calling surgery methods for small language models
- Future versions will integrate the V3.5.5 (1.5B) personality and safety advantages into the 1.8B architecture
## Model Details
| Property | Value |
|---|---|
| Base Model | tencent/Hunyuan-A2B-Pretrain (1.8B MoE) |
| Parameters | 1.8B (Mixture of Experts) |
| Fine-tuning | Full fine-tuning (NOT LoRA) |
| Training | PIT (Progressive Identity Training) + Tool Surgery |
| Training Hardware | NVIDIA A100-SXM4-80GB |
| Context Length | 32K (base), trained at 2048-4096 |
| Languages | Chinese (primary), English, Japanese |
| License | Apache 2.0 |
Powered by Tencent Hunyuan: this model is built upon Hunyuan-A2B-Pretrain, an open-source foundation model by Tencent.
## Citation
If you use this model, please cite:
```bibtex
@misc{ordis-v17-2026,
  title={Ordis-1.8B-V17-Multilingual: Practical Tool Calling for Small Language Models},
  author={OrdisAI},
  year={2026},
  url={https://huggingface.co/sugiken/Ordis-1.8B-V17-Multilingual}
}
```