ASTRAI Pluto Nano 0.5
Mixture-of-Experts language model — 1 B total / 50 M active per token (47 M exact).
Developed from scratch by ASTRAI Labs. v0.5 is a public preview release of the upcoming Pluto Nano 1.0.
⚠️ READ FIRST: experimental preview, not a working chatbot. This model does not chat fluently. It produces grammatical English most of the time but goes off-topic, loops, or generates math-style "Question: ..." chains regardless of the input. Multilingual quality is worse: non-EN outputs often collapse into repetition or garbage characters. This is by design. v0.5 was built as a public experiment to validate the ASTRAI Pluto architecture (custom MoE, 5-language multilingual support, RTX 3060 trainability), not as a usable assistant. It is published so the community can poke at the architecture, reproduce the pipeline, and watch the upcoming v1.0 (10× pretrain, 10 languages, 128 k vocab, Qwen2.5-1.5B warm-start) inherit from it. Don't deploy v0.5 anywhere; it will embarrass everyone involved.
Architecture
| Spec | Value |
|---|---|
| Total parameters | 1 B |
| Active per token | ~50 M (47 M exact) |
| Experts | 35 (top-1 routing) |
| Attention | GQA — 6 query heads, 2 KV heads |
| Hidden / Layers | 384 / 16 |
| Expert intermediate | 1536 |
| Tokenizer | Custom 32 k BPE |
| Max context | 4096 |
| RoPE θ | 1e6 |
| MTP depth (pretrain only) | 2 |
| Languages | EN, PT, ES, ZH, HI |
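The headline figures can be roughly reproduced from the table with a back-of-envelope count. The sketch below is an estimate under assumptions that this card does not state: each expert is a gated (SwiGLU-style) MLP with three weight matrices, the head dimension is 384 / 6 = 64, the vocabulary has exactly 32,000 entries with tied input/output embeddings, and biases and normalization weights are ignored. The function name `estimate_params` is illustrative only.

```python
def estimate_params(hidden=384, layers=16, experts=35, expert_inter=1536,
                    q_heads=6, kv_heads=2, vocab=32_000):
    """Rough parameter count for a top-1 MoE transformer (assumptions in the text above)."""
    head_dim = hidden // q_heads                       # 64, assumed
    # Q and O projections use all query heads; K and V use only the KV heads (GQA).
    attn = 2 * hidden * q_heads * head_dim + 2 * hidden * kv_heads * head_dim
    expert = 3 * hidden * expert_inter                 # gate, up, down matrices (assumed)
    router = hidden * experts                          # one logit per expert
    embed = vocab * hidden                             # tied embeddings (assumed)

    shared = layers * (attn + router) + embed          # used by every token
    total = shared + layers * experts * expert         # all experts stored
    active = shared + layers * expert                  # top-1: one expert per layer
    return total, active


total, active = estimate_params()
print(f"total  ~ {total / 1e9:.2f} B")   # total  ~ 1.01 B
print(f"active ~ {active / 1e6:.1f} M")  # active ~ 47.1 M
```

Under these assumptions the estimate lands close to the stated 1 B total and 47 M active, which suggests the experts account for nearly all of the stored weights while contributing only one expert's worth of compute per layer.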
Training
- Pretrain: 13 B tokens of curated multilingual text
- Post-training: SFT, ORPO, DPO, KTO + distillation from frontier models (Claude Opus 4.7/4.8, GPT-5.5, Gemini 3.x, Qwen3-235B, Grok 4.4, etc.)
- Hardware: single consumer RTX 3060 (12 GB VRAM)
- Training time for v0.5: ~2 weeks
Benchmarks
Honest reporting: results for 11 of the 12 standard small-LM benchmarks are listed below (PIQA is not measured in our harness, hence 11/12). Pluto v0.5 is a public preview and loses on several knowledge-heavy tasks; the results are not cherry-picked.
Full results vs SupraLabs/Supra-50M-Reasoning
| Benchmark | Pluto Nano v0.5 | Supra-50M target | Result |
|---|---|---|---|
| HellaSwag | 30.00 | 29.16 | ✅ win |
| Winogrande | 55.00 | 51.07 | ✅ win |
| BoolQ | 61.67 | 46.06 | ✅ win |
| MMLU | 27.00 | 23.58 | ✅ win |
| WikiText PPL | 201.91 | 166.27 | ❌ (lower is better) |
| Lambada | 0.00 | 16.53 | ❌ (format shift, known limitation) |
| COPA | 43.00 | 59.00 | ❌ |
| ARC-Easy | 21.00 | 45.16 | ❌ |
| ARC-Challenge | 23.75 | 26.54 | ❌ |
| OpenBookQA | 26.00 | 28.80 | ❌ |
| SciQ | 58.00 | 64.10 | ❌ |
4 wins / 11 benches vs Supra-50M.
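The tally can be checked directly from the rows above. The following is a small sketch that re-derives the win count and the Pluto ARC average; the dictionary layout and variable names are illustrative, and the numbers are copied from the table.

```python
# name -> (pluto, supra, higher_is_better)
results = {
    "HellaSwag":     (30.00, 29.16, True),
    "Winogrande":    (55.00, 51.07, True),
    "BoolQ":         (61.67, 46.06, True),
    "MMLU":          (27.00, 23.58, True),
    "WikiText PPL":  (201.91, 166.27, False),  # perplexity: lower is better
    "Lambada":       (0.00, 16.53, True),
    "COPA":          (43.00, 59.00, True),
    "ARC-Easy":      (21.00, 45.16, True),
    "ARC-Challenge": (23.75, 26.54, True),
    "OpenBookQA":    (26.00, 28.80, True),
    "SciQ":          (58.00, 64.10, True),
}

# A win means Pluto is strictly better in the direction the metric rewards.
wins = [
    name
    for name, (pluto, supra, higher_is_better) in results.items()
    if (pluto > supra if higher_is_better else pluto < supra)
]

print(f"{len(wins)} wins / {len(results)} benches: {', '.join(wins)}")

# Unweighted mean of ARC-Easy and ARC-Challenge for Pluto.
arc_avg = (results["ARC-Easy"][0] + results["ARC-Challenge"][0]) / 2
print(f"Pluto ARC (avg) = {arc_avg}")  # 22.375
```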
Comparison vs other small models
| Bench | Pluto 0.5 | Supra-50M | SmolLM-135M | SmolLM2-135M | GPT-X-125M | GPT-X2-125M |
|---|---|---|---|---|---|---|
| HellaSwag | 30.00 | 29.22 | 42.70 | 42.10 | 36.57 | 40.55 |
| Winogrande | 55.00 | 51.54 | 50.43 | 51.30 | 50.83 | 49.01 |
| BoolQ | 61.67 | 42.05 | N/A | N/A | N/A | N/A |
| ARC (avg) | 22.38 | 35.90 | 43.17 | 43.90 | 38.84 | 39.90 |
| OpenBookQA | 26.00 | 28.60 | 34.00 | 34.60 | N/A | N/A |
| MMLU | 27.00 | 23.58 | 30.20 | 31.50 | N/A | N/A |
| SciQ | 58.00 | 64.10 | N/A | N/A | N/A | N/A |
Reading: Pluto is competitive on reasoning-style benchmarks (Winogrande, BoolQ) but lags the 135M-class models on knowledge-dense ones (ARC, OpenBookQA, MMLU). This is the expected trade-off for a 50M-active MoE trained on consumer hardware with a 13B-token pretrain across 5 languages.
Usage
import torch
from transformers import PreTrainedTokenizerFast

# Pluto uses a custom architecture; load_pluto handles config + weights
from astraimoe.pluto_arch import load_pluto

tok = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="<|unk|>", pad_token="<|pad|>",
    bos_token="<|bos|>", eos_token="<|eos|>",
)
model = load_pluto(".", dtype=torch.bfloat16).cuda()
model.eval()

prompt = "<|lang_en|>\n<|user|>\nWhat is a Mixture of Experts?\n<|im_end|>\n<|assistant|>\n"
ids = tok(prompt, return_tensors="pt", add_special_tokens=False).input_ids.cuda()

# Stop on either the end-of-sequence token or the end-of-turn marker
stop_ids = {tok.eos_token_id, tok.convert_tokens_to_ids("<|im_end|>")}

# Greedy generation loop (the full context is re-encoded at every step)
cur = ids[0].tolist()
with torch.no_grad():
    for _ in range(200):
        # Keep only the last 4096 tokens, the model's max context
        inp = torch.tensor([cur[-4096:]], device=ids.device, dtype=torch.long)
        logits = model(input_ids=inp)["logits"][0, -1]
        nxt = int(logits.argmax())
        if nxt in stop_ids:
            break
        cur.append(nxt)

# Decode only the newly generated tokens
print(tok.decode(cur[ids.size(1):], skip_special_tokens=True))
Chat template
<|lang_{en|pt|es|zh|hi}|>
<|user|>
...question...
<|im_end|>
<|assistant|>
...response...
<|im_end|>
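For programmatic use, the template can be assembled with a small helper. This is a sketch, not part of the released code: `build_prompt` and `SUPPORTED_LANGS` are hypothetical names, and the layout simply follows the template above and the prompt string in the Usage section.

```python
SUPPORTED_LANGS = ("en", "pt", "es", "zh", "hi")


def build_prompt(user_message: str, lang: str = "en") -> str:
    """Format a single-turn prompt that ends where the assistant should start."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language tag: {lang!r}")
    return (
        f"<|lang_{lang}|>\n"
        "<|user|>\n"
        f"{user_message}\n"
        "<|im_end|>\n"
        "<|assistant|>\n"
    )


prompt = build_prompt("What is a Mixture of Experts?")
print(repr(prompt))
```

The returned string stops right after the `<|assistant|>` line, so generation continues from there until the model emits `<|im_end|>` or the end-of-sequence token.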
Limitations (please read carefully)
- Chat quality is poor. The model produces grammatical English but goes off-topic, often spiraling into "Question: What is the value of X?" math-style loops or emitting random encyclopedia-style sentences regardless of the prompt. This is the actual ceiling of v0.5, not a bug in your inference code.
- Multilingual is worse than English. PT/ES/ZH/HI prompts may produce language-correct outputs but with repetition loops or garbage characters. The pretrain budget (13 B tokens) is too thin for 5 languages.
- No identity — the model doesn't know its own name. Multiple identity SFT attempts hurt other benchmarks too much, so the released checkpoint skips them. Ask "who are you?" and the model invents something.
- Code: deliberately not trained. At 47 M active params there's no room for code knowledge; we filtered it out of every training stage to save capacity.
- Pretrain corpus: 13 B tokens — small vs frontier 1 T+ models. This is the fundamental limitation. v1.0 adds 10 B more curated pretrain on top.
If you want to chat with a small LM today, use SmolLM2-135M or Qwen2.5-0.5B. This release exists to validate the architecture and the training pipeline, not to compete on usability.
GGUF / quantizations
Quantized GGUF builds (fp16 / Q8_0 / Q6_K / Q4_K_M) are available at ASTRAI-labs/pluto-nano-0.5-gguf.
They use a qwen2_moe-spoofing shim because astrai_pluto isn't yet a native
arch in llama.cpp. Note: the shim currently produces garbled output due to
top-1 MoE routing incompatibility — use the bf16 safetensors here via the
inference script above for now. Proper llama.cpp support is on the roadmap.
To reproduce, see GGUF.md and convert_pluto_to_qwen2moe.py in this repo.
License
ASTRAI Closed License — weights are made available for research and evaluation. Commercial use requires explicit agreement with ASTRAI Labs.
Citation
@misc{astrai_pluto_nano_2026,
author = {ShinMK (Miguel) and ASTRAI Labs},
title = {ASTRAI Pluto Nano 0.5},
year = {2026},
url = {https://huggingface.co/ASTRAI-labs/pluto-nano-0.5},
}
Contact
ASTRAI Labs — founder: ShinMK (Miguel).
