Instructions for using puwaer/Susono-10B-A1B-Instruct with libraries and local apps.

Libraries

How to use puwaer/Susono-10B-A1B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="puwaer/Susono-10B-A1B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("puwaer/Susono-10B-A1B-Instruct", dtype="auto")

Local Apps

vLLM

How to use puwaer/Susono-10B-A1B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "puwaer/Susono-10B-A1B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "puwaer/Susono-10B-A1B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
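
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server is listening on the default address used in the curl call above, and the helper names `build_chat_request` and `chat` are hypothetical, not part of vLLM.

```python
import json
import urllib.request

# Assumed endpoint: the address used by the curl example above.
BASE_URL = "http://localhost:8000/v1"
MODEL = "puwaer/Susono-10B-A1B-Instruct"


def build_chat_request(prompt, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url=BASE_URL):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` should return the reply as a string. Passing a different `base_url` (for example one ending in port 30000) targets another OpenAI-compatible server.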

SGLang

How to use puwaer/Susono-10B-A1B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "puwaer/Susono-10B-A1B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "puwaer/Susono-10B-A1B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "puwaer/Susono-10B-A1B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "puwaer/Susono-10B-A1B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use puwaer/Susono-10B-A1B-Instruct with Docker Model Runner:
```
docker model run hf.co/puwaer/Susono-10B-A1B-Instruct
```

Susono-10B-A1B-Instruct

Susono-10B-A1B-Instruct is an instruction-following model created by post-training Susono-10B-A1B-Base with SFT and DPO. It is an original-architecture LLM with 10B total parameters and about 1B active parameters per token (A1B), integrating Engram (a conditional memory module) and mHC-lite (Manifold-Constrained Hyper-Connections Lite) into a hybrid backbone of Full Attention + GatedDeltaNet + MoE.

Training was performed on the NVIDIA GH200 Grace Hopper Superchip. Dedicated fused kernels were implemented for Engram and mHC-lite, and training was optimized with FP8 training and CPU offload, taking advantage of the GH200's tightly coupled CPU-GPU design.

Note that this model was developed purely as a personal hobby project and funded privately. The development cost was only about USD 1,875 (roughly JPY 300,000), so please be aware that pre-training and post-training have not been carried out to a sufficient extent.

⚠️ This is an instruct model post-trained for chat and instruction following. Apply the chat template when generating responses.

We assume no responsibility for the model's outputs. Use it at your own risk.

Model Overview

| Item | Details |
| --- | --- |
| Base model | Susono-10B-A1B-Base |
| Post-training | SFT + DPO |
| Architecture | Hybrid of Full Attention + GatedDeltaNet + Sparse MoE, with Engram + mHC-lite |
| Total parameters | ~10B |
| Active parameters per token | ~1B (A1B) |
| Vocabulary size | 151,680 |
| Max context length | 262,144 tokens (up to 16,384 tokens during training) |
| Training stack | Extended Megatron-LM (FP8 training + CPU offload) |
| Training environment | Supercomputer Miyabi (NVIDIA GH200 × 16) |

Reference papers:

Engram: arXiv:2601.07372v1 "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models"
mHC-lite: arXiv:2601.05732v1 "mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations"

Architecture

Full Attention + GatedDeltaNet: A hybrid configuration that uses full softmax attention every 4 layers (full_attention_interval=4) and GatedDeltaNet (linear attention) in the remaining layers.
Sparse MoE: All FFN layers are MoE (96 experts, 4 active per token).
Engram (conditional memory): O(1) lookup into static embeddings via N-gram hashing. It directly retrieves local, repetitive patterns and frees up attention for global-context processing. Inserted at layers 3 and 7, it serves as the primary store of factual knowledge.
mHC-lite (multi-stream residual connections): Dynamic residual connections across multiple streams. Leveraging the Birkhoff–von Neumann theorem, it strictly guarantees a doubly stochastic matrix without any Sinkhorn-Knopp iterations.
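
To make the Engram idea concrete, here is a toy sketch of an N-gram hashed lookup into static embedding tables. It is not the model's implementation: the hash function, the table size, the small embedding width, the choice to start at bigrams, and the plain summation of retrieved rows are all assumptions made for illustration, and the multi-head setting (n_head=8) is ignored. Only max_ngram_size=3 is taken from the model card.

```python
import random

MAX_NGRAM_SIZE = 3   # from the model card (max_ngram_size=3)
TABLE_SIZE = 4096    # hypothetical slot count, chosen small for the demo
EMBED_DIM = 8        # hypothetical; the model card lists embed_dim=672


def ngram_slot(ngram, table_size):
    """Hash a tuple of token ids to a table slot (simple polynomial hash, an assumption)."""
    h = 0
    for token_id in ngram:
        h = (h * 1_000_003 + token_id + 1) % (1 << 61)
    return h % table_size


def make_tables(seed=0):
    """One static embedding table per n-gram order (2..MAX_NGRAM_SIZE)."""
    rng = random.Random(seed)
    return {
        n: [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in range(TABLE_SIZE)]
        for n in range(2, MAX_NGRAM_SIZE + 1)
    }


def engram_lookup(token_ids, tables):
    """Per position, sum the table rows of the n-grams that end at that position."""
    outputs = []
    for pos in range(len(token_ids)):
        vec = [0.0] * EMBED_DIM
        for n, table in tables.items():
            if pos + 1 < n:
                continue  # not enough left context for an n-gram of this order
            ngram = tuple(token_ids[pos + 1 - n : pos + 1])
            row = table[ngram_slot(ngram, len(table))]  # one hash, one index: O(1)
            vec = [a + b for a, b in zip(vec, row)]
        outputs.append(vec)
    return outputs


tables = make_tables()
memory = engram_lookup([5, 9, 7, 1, 5, 9, 7], tables)
```

In this toy sequence the pattern `5, 9, 7` occurs twice, so positions 2 and 6 retrieve identical vectors without any attention over the context, which is the sense in which a repeated local pattern is served by lookup alone.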

| Module | Key settings |
| --- | --- |
| MoE | num_experts=96, num_experts_per_tok=4, moe_intermediate_size=512 |
| Engram | max_ngram_size=3, embed_dim=672, n_head=8, layer_ids=[3, 7] |
| mHC-lite | num_streams=4 (4! = 24 permutation matrices) |
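
The 24 permutation matrices follow from the Birkhoff–von Neumann theorem: every doubly stochastic matrix is a convex combination of permutation matrices, and any such combination is itself doubly stochastic. The sketch below illustrates that property for 4 streams by applying a softmax over 24 logits and mixing the permutation matrices with the resulting weights. How the model actually produces those logits is not described here, so the `logits` input and all function names are hypothetical.

```python
import itertools
import math

NUM_STREAMS = 4  # from the model card (num_streams=4)


def permutation_matrices(n):
    """All n! permutation matrices of size n x n, as nested lists."""
    return [
        [[1.0 if perm[r] == c else 0.0 for c in range(n)] for r in range(n)]
        for perm in itertools.permutations(range(n))
    ]


def softmax(logits):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def doubly_stochastic(logits, n=NUM_STREAMS):
    """Convex combination of permutation matrices, weighted by softmax(logits)."""
    perms = permutation_matrices(n)
    if len(logits) != len(perms):
        raise ValueError(f"expected {len(perms)} logits, got {len(logits)}")
    weights = softmax(logits)
    mix = [[0.0] * n for _ in range(n)]
    for w, p in zip(weights, perms):
        for r in range(n):
            for c in range(n):
                mix[r][c] += w * p[r][c]
    return mix


mix = doubly_stochastic([0.1 * i for i in range(24)])
row_sums = [sum(row) for row in mix]
col_sums = [sum(mix[r][c] for r in range(NUM_STREAMS)) for c in range(NUM_STREAMS)]
```

Every row and column of `mix` sums to 1 up to floating-point rounding, by construction and with no iterative normalization step.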

Training Environment

NVIDIA GH200 Grace Hopper Superchip

The GH200 is a heterogeneous superchip that directly connects a Grace CPU (Arm Neoverse V2, 72 cores) and a Hopper GPU (H100-class, 96 GB HBM3) via NVLink-C2C (900 GB/s bidirectional, 7× the bandwidth of PCIe Gen5). Hardware-level memory coherency lets the CPU and GPU access each other's memory without page migration, making full-scale CPU offload practical.

Training Framework

Based on Megatron-LM, extended for Susono as follows:

Triton Fused Kernels: Fuse operations such as Engram lookup, mHC width connection, GatedDeltaNet decay, MoE router, RMSNorm variants, aux loss, and cross entropy. Every kernel includes a PyTorch fallback.
FP8 training + CPU offload: Parameters are kept in FP8 (e4m3), while the Adam optimizer state and master weights (BF16) are offloaded to CPU memory over NVLink-C2C.
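
A rough back-of-the-envelope estimate shows why this split matters. The sketch below assumes exactly 10B parameters, one byte per FP8 parameter, two bytes per BF16 master weight, and two Adam moments per parameter. The Adam state precision is not stated above, so FP32 is an assumption, and gradients, activations, and any sharding across the 16 GPUs are left out.

```python
TOTAL_PARAMS = 10_000_000_000  # ~10B, rounded

BYTES_PER_VALUE = {"fp8": 1, "bf16": 2, "fp32": 4}
GB = 1e9


def footprint_gb(num_params=TOTAL_PARAMS, adam_state_dtype="fp32"):
    """Estimate unsharded memory use in GB for the FP8 + CPU offload layout."""
    gpu_params = num_params * BYTES_PER_VALUE["fp8"]
    master_weights = num_params * BYTES_PER_VALUE["bf16"]
    # Adam keeps two moments (first and second) per parameter.
    adam_state = 2 * num_params * BYTES_PER_VALUE[adam_state_dtype]
    return {
        "gpu_fp8_params_gb": gpu_params / GB,
        "cpu_master_weights_gb": master_weights / GB,
        "cpu_adam_state_gb": adam_state / GB,
        "cpu_total_gb": (master_weights + adam_state) / GB,
    }


estimate = footprint_gb()
```

Under these assumptions the FP8 parameters take about 10 GB of GPU memory, while roughly 100 GB of master weights and optimizer state sit in CPU memory, more than a single GPU's 96 GB of HBM3 could hold alongside the parameters.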

Training Schedule

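| Phase | Context length (tokens) | Target tokens | Global batch size (GBS) | Learning rate |
| --- | --- | --- | --- | --- |
| Phase 1: Pre-training | 4,096 | 300B | 1,024 | 2.0e-4 |
| Phase 2: Mid-training | 16,384 | 250B | 256 | 2.0e-4 |
| Phase 3: SFT | 16,384 | - | 128 | 2.0e-5 |
| Phase 4: DPO | 16,384 | - | 32 | 1.0e-6 |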
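
The schedule can be turned into approximate step counts. The sketch below assumes that GBS counts sequences per optimizer step and that every sequence is packed to the full context length; neither is stated in the table, so treat the results as rough estimates rather than the actual training configuration.

```python
import math

# Values from the training schedule table (phases with a token target only).
PHASES = {
    "Phase 1: Pre-training": {"context": 4096, "target_tokens": 300_000_000_000, "gbs": 1024},
    "Phase 2: Mid-training": {"context": 16384, "target_tokens": 250_000_000_000, "gbs": 256},
}


def tokens_per_step(context, gbs):
    """Tokens consumed by one optimizer step, assuming fully packed sequences."""
    return context * gbs


def steps_needed(target_tokens, context, gbs):
    """Optimizer steps required to reach the token target (rounded up)."""
    return math.ceil(target_tokens / tokens_per_step(context, gbs))


summary = {
    name: {
        "tokens_per_step": tokens_per_step(p["context"], p["gbs"]),
        "steps": steps_needed(p["target_tokens"], p["context"], p["gbs"]),
    }
    for name, p in PHASES.items()
}
```

Both phases come out to 4,194,304 tokens per step: the batch size drops by 4× in Phase 2 exactly as the context length grows by 4×, so the per-step token budget stays constant.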

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "puwaer/Susono-10B-A1B-Instruct"

# trust_remote_code=True permits loading custom model code shipped with the repository.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Build a single-turn conversation and render it with the chat template.
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# max_new_tokens is an upper bound; generation stops earlier at the end-of-sequence token.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
    repetition_penalty=1.05,
)
# Drop the prompt tokens and decode only the newly generated ones.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print(content)

Reference Repositories

HuggingFace transformers implementation: https://github.com/puwaer/transformers.git (main branch)
Megatron-LM implementation: https://github.com/puwaer/Megatron-LM.git (main branch)
SGLang implementation: https://github.com/puwaer/sglang.git (sglang-v0.5.10-add-suson-model branch)
vLLM implementation: https://github.com/puwaer/vllm.git (vllm-v0.19.1-add-suson-model branch)