ASTRAI Pluto Nano 0.5

Mixture-of-Experts language model: 1 B total / ~50 M active per token (47 M exact).

Developed from scratch by ASTRAI Labs. v0.5 is a public preview release of the upcoming Pluto Nano 1.0.

⚠️ READ FIRST: experimental preview, not a working chatbot. This model does not chat fluently. It produces grammatical English most of the time but goes off-topic, loops, or generates math-style "Question: ..." chains regardless of the input. Multilingual quality is worse: non-English outputs often collapse into repetition or garbage characters. This is by design. v0.5 was built as a public experiment to validate the ASTRAI Pluto architecture (custom MoE, 5-language multilingual support, RTX 3060 trainability), not as a usable assistant. It's published so the community can poke at the architecture, reproduce the pipeline, and watch the upcoming v1.0 (10× pretrain, 10 languages, 128 k vocab, Qwen2.5-1.5B warm-start) inherit from it. Don't deploy v0.5 anywhere; it'll embarrass everyone involved.

Architecture

Spec	Value
Total parameters	1 B
Active per token	~50 M (47 M exact)
Experts	35 (top-1 routing)
Attention	GQA — 6 query heads, 2 KV heads
Hidden / Layers	384 / 16
Expert intermediate	1536
Tokenizer	Custom 32 k BPE
Max context	4096
RoPE θ	1e6
MTP depth (pretrain only)	2
Languages	EN, PT, ES, ZH, HI
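
The headline parameter figures can be roughly cross-checked against the table with a back-of-the-envelope count. The sketch below relies on details the card does not state, so treat them as assumptions: gated (SwiGLU-style) experts with three weight matrices each, no biases, an output head tied to the input embedding, a vocabulary of exactly 32,000 entries, a head dimension of 384 / 6 = 64, and one linear router per layer. Normalization weights are ignored.

```python
# Back-of-the-envelope parameter count from the architecture table.
# ASSUMPTIONS (not stated in the card): SwiGLU-style experts (3 matrices),
# no biases, tied input/output embeddings, vocab of exactly 32,000,
# one linear router per layer, norm weights ignored.
HIDDEN, LAYERS = 384, 16
N_EXPERTS, EXPERT_DIM = 35, 1536
Q_HEADS, KV_HEADS = 6, 2
VOCAB = 32_000  # "32 k" taken literally

head_dim = HIDDEN // Q_HEADS                       # 64
expert = 3 * HIDDEN * EXPERT_DIM                   # gate, up and down projections
attention = (2 * HIDDEN * Q_HEADS * head_dim       # Q and O projections
             + 2 * HIDDEN * KV_HEADS * head_dim)   # K and V projections (GQA)
router = HIDDEN * N_EXPERTS
embedding = VOCAB * HIDDEN

# Every expert counts toward the total; only one per layer is active (top-1).
total = LAYERS * (N_EXPERTS * expert + attention + router) + embedding
active = LAYERS * (expert + attention + router) + embedding

print(f"total  ~ {total / 1e9:.2f} B")
print(f"active ~ {active / 1e6:.1f} M")
```

Under these assumptions the count comes to about 1.01 B total and 47.1 M active, which is in line with the figures in the table.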

Training

Pretrain: 13 B tokens of curated multilingual text
Post-training: SFT, ORPO, DPO, KTO + distillation from frontier models (Claude Opus 4.7/4.8, GPT-5.5, Gemini 3.x, Qwen3-235B, Grok 4.4, etc.)
Hardware: single consumer RTX 3060 (12 GB VRAM)
Training time for v0.5: ~2 weeks
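
As a rough sanity check on these numbers, the sketch below estimates the sustained throughput they imply. It assumes the full two weeks went to pretraining (post-training also takes time, so the real pretraining rate would have to be somewhat higher) and uses the common 6 × parameters × tokens rule of thumb for training FLOPs, applied to the active parameter count. Both are assumptions, not figures from ASTRAI Labs.

```python
# Implied training throughput, under stated assumptions.
PRETRAIN_TOKENS = 13e9           # from the card
ACTIVE_PARAMS = 47e6             # from the card
TRAIN_SECONDS = 14 * 24 * 3600   # "~2 weeks", ASSUMED to be all pretraining

tokens_per_second = PRETRAIN_TOKENS / TRAIN_SECONDS

# Rule of thumb: ~6 FLOPs per active parameter per training token.
train_flops = 6 * ACTIVE_PARAMS * PRETRAIN_TOKENS
sustained_flops = train_flops / TRAIN_SECONDS

print(f"{tokens_per_second:,.0f} tokens/s")
print(f"{sustained_flops / 1e12:.2f} TFLOP/s sustained")
```

This works out to roughly 10,700 tokens per second and about 3 TFLOP/s sustained.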

Benchmarks

Honest reporting: the table below covers 11 of the 12 standard small-LM benchmarks (PIQA is not measured in our harness). Pluto v0.5 is a public preview and loses on several knowledge-heavy tasks; the results are not cherry-picked.

Full results vs SupraLabs/Supra-50M-Reasoning

Benchmark	Pluto Nano v0.5	Supra-50M target	Result
HellaSwag	30.00	29.16	✅ win
Winogrande	55.00	51.07	✅ win
BoolQ	61.67	46.06	✅ win
MMLU	27.00	23.58	✅ win
WikiText PPL	201.91	166.27	❌ (lower is better)
Lambada	0.00	16.53	❌ (format shift, known limitation)
COPA	43.00	59.00	❌
ARC-Easy	21.00	45.16	❌
ARC-Challenge	23.75	26.54	❌
OpenBookQA	26.00	28.80	❌
SciQ	58.00	64.10	❌

Pluto wins 4 of the 11 benchmarks against Supra-50M.
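
The tally can be reproduced from the table with a few lines of Python. The only subtlety is that WikiText perplexity is a lower-is-better metric, so the comparison is flipped for that row. The tuple layout and variable names are illustrative.

```python
# (benchmark, Pluto Nano v0.5, Supra-50M, lower_is_better), copied from the table
RESULTS = [
    ("HellaSwag", 30.00, 29.16, False),
    ("Winogrande", 55.00, 51.07, False),
    ("BoolQ", 61.67, 46.06, False),
    ("MMLU", 27.00, 23.58, False),
    ("WikiText PPL", 201.91, 166.27, True),
    ("Lambada", 0.00, 16.53, False),
    ("COPA", 43.00, 59.00, False),
    ("ARC-Easy", 21.00, 45.16, False),
    ("ARC-Challenge", 23.75, 26.54, False),
    ("OpenBookQA", 26.00, 28.80, False),
    ("SciQ", 58.00, 64.10, False),
]

# A win means a higher score, except where lower is better (perplexity).
wins = [
    name
    for name, pluto, supra, lower_is_better in RESULTS
    if (pluto < supra if lower_is_better else pluto > supra)
]

print(f"{len(wins)} wins / {len(RESULTS)} benches: {', '.join(wins)}")
```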

Comparison vs other small models

Bench	Pluto 0.5	Supra-50M	SmolLM-135M	SmolLM2-135M	GPT-X-125M	GPT-X2-125M
HellaSwag	30.00	29.22	42.70	42.10	36.57	40.55
Winogrande	55.00	51.54	50.43	51.30	50.83	49.01
BoolQ	61.67	42.05	N/A	N/A	N/A	N/A
ARC (avg)	22.38	35.90	43.17	43.90	38.84	39.90
OpenBookQA	26.00	28.60	34.00	34.60	N/A	N/A
MMLU	27.00	23.58	30.20	31.50	N/A	N/A
SciQ	58.00	64.10	N/A	N/A	N/A	N/A

Reading: Pluto is competitive on Winogrande and BoolQ (commonsense-reasoning and reading-comprehension benches) but lags on knowledge-dense ones (ARC, OpenBookQA, and MMLU against the 135M models). This is the expected trade-off for a 50M-active MoE trained on consumer hardware with a 13B-token pretrain spread across 5 languages.

Usage

import torch
# Pluto uses a custom architecture; load_pluto handles config + weights
from astraimoe.pluto_arch import load_pluto
from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="<|unk|>", pad_token="<|pad|>",
    bos_token="<|bos|>", eos_token="<|eos|>",
)
model = load_pluto(".", dtype=torch.bfloat16).cuda()
model.eval()

prompt = "<|lang_en|>\n<|user|>\nWhat is a Mixture of Experts?\n<|im_end|>\n<|assistant|>\n"
ids = tok(prompt, return_tensors="pt", add_special_tokens=False).input_ids.cuda()

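# Greedy generation loop: stop at <|eos|> or <|im_end|>, or after 200 new tokens
stop_ids = {tok.eos_token_id, tok.convert_tokens_to_ids("<|im_end|>")}
with torch.no_grad():
    cur = ids[0].tolist()
    for _ in range(200):
        # Feed at most the last 4096 tokens (the model's max context)
        inp = torch.tensor([cur[-4096:]], device=ids.device, dtype=torch.long)
        logits = model(input_ids=inp)["logits"][0, -1]
        nxt = int(logits.argmax())
        if nxt in stop_ids:
            break
        cur.append(nxt)
# Decode only the newly generated tokens, not the prompt
print(tok.decode(cur[ids.size(1):], skip_special_tokens=True))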

Chat template

<|lang_{en|pt|es|zh|hi}|>
<|user|>
...question...
<|im_end|>
<|assistant|>
...response...
<|im_end|>
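
A small helper can build the single-turn prompt in this format. This is a minimal sketch: `build_prompt` and `LANGS` are hypothetical names, not part of the ASTRAI code, and only the single-turn layout used in the Usage example is covered.

```python
# Hypothetical helper that formats a single-turn prompt for Pluto Nano 0.5.
LANGS = ("en", "pt", "es", "zh", "hi")  # language tags listed in the template


def build_prompt(question: str, lang: str = "en") -> str:
    """Return the prompt up to the point where the assistant starts writing."""
    if lang not in LANGS:
        raise ValueError(f"unsupported language tag: {lang!r}")
    return (
        f"<|lang_{lang}|>\n"
        "<|user|>\n"
        f"{question}\n"
        "<|im_end|>\n"
        "<|assistant|>\n"
    )


print(build_prompt("What is a Mixture of Experts?"))
```

The result should be tokenized with `add_special_tokens=False`, as in the Usage example, since the template already contains the special tokens.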

Limitations (please read carefully)

Chat quality is poor. The model produces grammatical English but drifts off-topic, often spiraling into "Question: What is the value of X?" math-style loops or random encyclopedia-style sentences regardless of the prompt. This is the actual ceiling of v0.5, not a bug in your inference code.
Multilingual is worse than English. PT/ES/ZH/HI prompts may produce output in the correct language, but with repetition loops or garbage characters. The pretrain budget (13 B tokens) is too thin for 5 languages.
No identity: the model doesn't know its own name. Multiple identity SFT attempts hurt other benchmarks too much, so the released checkpoint skips them. Ask "who are you?" and the model invents something.
Code: deliberately not trained. At 47 M active params there's no room for code knowledge; we filtered it out of every training stage to save capacity.
Pretrain corpus: 13 B tokens, which is small next to frontier models trained on 1 T+ tokens. This is the fundamental limitation. v1.0 adds 10 B more curated pretrain on top.

If you want to chat with a small LM today, use SmolLM2-135M or Qwen2.5-0.5B. This release exists to validate the architecture and the training pipeline, not to compete on usability.

GGUF / quantizations

Quantized GGUF builds (fp16 / Q8_0 / Q6_K / Q4_K_M) are available at ASTRAI-labs/pluto-nano-0.5-gguf.

They use a qwen2_moe-spoofing shim because astrai_pluto isn't yet a native arch in llama.cpp. Note: the shim currently produces garbled output due to top-1 MoE routing incompatibility — use the bf16 safetensors here via the inference script above for now. Proper llama.cpp support is on the roadmap.

To reproduce, see GGUF.md and convert_pluto_to_qwen2moe.py in this repo.
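
For readers unfamiliar with top-1 routing, the sketch below shows the general idea in plain Python: a router scores every expert, only the single best-scoring expert runs, and its output is scaled by the router probability. This is a generic illustration, not Pluto's actual router; the gate scaling, the function names, and the toy experts are all assumptions.

```python
import math


def top1_route(token, router_weights):
    """Pick one expert for a token: argmax over softmaxed router scores."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def moe_forward(token, router_weights, experts):
    """Run only the selected expert and scale its output by the gate value."""
    index, gate = top1_route(token, router_weights)
    return [gate * y for y in experts[index](token)], index


# Toy example: 3 hypothetical experts that scale the input by 1, 2 and 3.
experts = [lambda t, k=k: [k * x for x in t] for k in (1, 2, 3)]
router_weights = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]]  # one row per expert
output, chosen = moe_forward([1.0, 0.5], router_weights, experts)
print(chosen, output)
```

Only the chosen expert's weights are touched for each token, which is why the active parameter count is far below the total.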

License

ASTRAI Closed License — weights are made available for research and evaluation. Commercial use requires explicit agreement with ASTRAI Labs.

Citation

@misc{astrai_pluto_nano_2026,
  author = {ShinMK (Miguel) and ASTRAI Labs},
  title  = {ASTRAI Pluto Nano 0.5},
  year   = {2026},
  url    = {https://huggingface.co/ASTRAI-labs/pluto-nano-0.5},
}

Contact

ASTRAI Labs — founder: ShinMK (Miguel).
