Susono-10B-A1B-Instruct
Susono-10B-A1B-Instruct is an instruction-following model obtained by post-training Susono-10B-A1B-Base with SFT and DPO. It is an LLM with a custom architecture, 10B total parameters, and about 1B active parameters per token (A1B). It integrates Engram (a conditional memory module) and mHC-lite (Manifold-Constrained Hyper-Connections Lite) into a hybrid backbone of Full Attention + GatedDeltaNet + MoE.
Training was performed on the NVIDIA GH200 Grace Hopper Superchip. Dedicated fused kernels were implemented for Engram and mHC-lite, and training was optimized with FP8 training + CPU offload, taking advantage of the GH200 GPU architecture.
Note that this model was developed as a privately funded personal hobby project. The total development cost was only about USD 1,875 (roughly JPY 300,000), so please be aware that neither pre-training nor post-training has been carried out to a sufficient extent.
⚠️ This is an instruct model post-trained for chat and instruction following. Apply the chat template when generating responses.
We assume no responsibility for the model's outputs. Use it at your own risk.
Model Overview
| Item | Details |
|---|---|
| Base model | Susono-10B-A1B-Base |
| Post-training | SFT + DPO |
| Architecture | Hybrid of Full Attention + GatedDeltaNet + Sparse MoE, with Engram + mHC-lite |
| Total parameters | ~10B |
| Active parameters per token | ~1B (A1B) |
| Vocabulary size | 151,680 |
| Max context length | 262,144 (up to 16,384 during training) |
| Training stack | Extended Megatron-LM (FP8 training + CPU offload) |
| Training environment | Supercomputer Miyabi (NVIDIA GH200 × 16) |
Reference papers:
- Engram: arXiv:2601.07372v1 "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models"
- mHC-lite: arXiv:2601.05732v1 "mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations"
Architecture
- Full Attention + GatedDeltaNet: A hybrid configuration that uses full softmax attention every 4 layers (`full_attention_interval=4`) and GatedDeltaNet (linear attention) in the remaining layers.
- Sparse MoE: All FFN layers are MoE (96 experts, 4 active per token).
- Engram (conditional memory): O(1) lookup into static embeddings via N-gram hashing. It directly retrieves local, repetitive patterns and frees up attention for global-context processing. Inserted at layers 3 and 7, it serves as the primary store of factual knowledge.
- mHC-lite (multi-stream residual connections): Dynamic residual connections across multiple streams. Leveraging the Birkhoff–von Neumann theorem, it strictly guarantees a doubly stochastic matrix without any Sinkhorn-Knopp iterations.
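The Birkhoff–von Neumann argument behind mHC-lite can be illustrated with a small sketch: any convex combination of permutation matrices is doubly stochastic, so no iterative normalization is needed. The softmax parameterization and the placeholder logits below are assumptions made for illustration, not the model's actual implementation.

```python
import itertools
import math


def permutation_matrices(n):
    """Return all n! permutation matrices of size n x n as nested lists."""
    return [
        [[1.0 if perm[r] == c else 0.0 for c in range(n)] for r in range(n)]
        for perm in itertools.permutations(range(n))
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def doubly_stochastic_from_logits(logits, n):
    """Mix the n! permutation matrices with softmax weights.

    The weights are non-negative and sum to 1, so every row and every
    column of the result sums to 1 by construction.
    """
    perms = permutation_matrices(n)
    if len(logits) != len(perms):
        raise ValueError("need one logit per permutation matrix")
    weights = softmax(logits)
    mix = [[0.0] * n for _ in range(n)]
    for w, p in zip(weights, perms):
        for r in range(n):
            for c in range(n):
                mix[r][c] += w * p[r][c]
    return mix


num_streams = 4  # from the model card; gives 4! = 24 permutation matrices
# Placeholder logits; in a real layer these would be produced dynamically.
logits = [0.1 * i for i in range(math.factorial(num_streams))]
H = doubly_stochastic_from_logits(logits, num_streams)
row_sums = [sum(row) for row in H]
col_sums = [sum(H[r][c] for r in range(num_streams)) for c in range(num_streams)]
print(row_sums, col_sums)
```

With 4 streams the mixing matrix needs only 24 weights, which is why enumerating the permutations explicitly stays cheap.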
| Module | Key settings |
|---|---|
| MoE | num_experts=96, num_experts_per_tok=4, moe_intermediate_size=512 |
| Engram | max_ngram_size=3, embed_dim=672, n_head=8, layer_ids=[3, 7] |
| mHC-lite | num_streams=4 (4! = 24 permutation matrices) |
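To make the Engram idea concrete, here is a minimal sketch of a hashed N-gram lookup using the settings from the table. Several details are assumptions made for illustration: `n_head` is treated as the number of independent hash heads, N-gram orders 2 through `max_ngram_size` are used, per-head vectors are summed across orders and concatenated across heads, and `TABLE_SIZE` and the hash function are hypothetical.

```python
import random

MAX_NGRAM_SIZE = 3               # from the settings table
EMBED_DIM = 672                  # from the settings table
N_HEAD = 8                       # from the settings table (assumed: hash heads)
HEAD_DIM = EMBED_DIM // N_HEAD   # 84 dimensions per head
TABLE_SIZE = 257                 # hypothetical number of slots per table


def ngram_slot(ngram, head, table_size=TABLE_SIZE):
    """Deterministically hash an N-gram of token ids to a table slot."""
    h = head + 1
    for tok in ngram:
        h = (h * 1_000_003 + tok + 1) % (2**61 - 1)
    return h % table_size


def build_tables(seed=0):
    """One static embedding table per (N-gram order, head) pair."""
    rng = random.Random(seed)
    return {
        (n, head): [
            [rng.gauss(0.0, 0.02) for _ in range(HEAD_DIM)]
            for _ in range(TABLE_SIZE)
        ]
        for n in range(2, MAX_NGRAM_SIZE + 1)
        for head in range(N_HEAD)
    }


def engram_lookup(token_ids, pos, tables):
    """Return the memory vector for position `pos`.

    The cost depends only on MAX_NGRAM_SIZE and N_HEAD, not on the
    sequence length, which is what makes the lookup O(1) per token.
    """
    out = []
    for head in range(N_HEAD):
        vec = [0.0] * HEAD_DIM
        for n in range(2, MAX_NGRAM_SIZE + 1):
            if pos + 1 < n:
                continue  # not enough left context for this order
            ngram = tuple(token_ids[pos - n + 1 : pos + 1])
            row = tables[(n, head)][ngram_slot(ngram, head)]
            vec = [a + b for a, b in zip(vec, row)]
        out.extend(vec)
    return out


tables = build_tables()
vec_a = engram_lookup([5, 9, 2, 7], 3, tables)
vec_b = engram_lookup([1, 9, 2, 7], 3, tables)
print(len(vec_a), vec_a == vec_b)
```

The two example sequences differ only in their first token, so their trailing bigram and trigram are identical and both positions retrieve the same vector. This is the sense in which the lookup captures local, repetitive patterns independently of the wider context.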
Training Environment
NVIDIA GH200 Grace Hopper Superchip
The GH200 is a heterogeneous superchip that directly connects a Grace CPU (Arm Neoverse V2, 72 cores) and a Hopper GPU (H100-class, 96 GB HBM3) via NVLink-C2C (900 GB/s bidirectional, 7× the bandwidth of PCIe Gen5). Hardware-level memory coherency lets the CPU and GPU access each other's memory without page migration, which makes full-scale CPU offload practical.
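The 7× figure can be checked with simple arithmetic. The sketch below assumes the commonly quoted nominal bandwidth of a PCIe Gen5 x16 link (about 64 GB/s per direction, 128 GB/s bidirectional); the 20 GB payload is a hypothetical example, and the transfer times are idealized peak-rate estimates that ignore latency and protocol overhead.

```python
NVLINK_C2C_BIDIR_GB_S = 900.0     # from the model card
PCIE_GEN5_X16_BIDIR_GB_S = 128.0  # assumption: nominal x16 figure

ratio = NVLINK_C2C_BIDIR_GB_S / PCIE_GEN5_X16_BIDIR_GB_S


def transfer_seconds(gigabytes, bidir_bandwidth_gb_s):
    """Idealized one-way transfer time, using half the bidirectional rate."""
    return gigabytes / (bidir_bandwidth_gb_s / 2.0)


payload_gb = 20.0  # hypothetical amount of offloaded state moved per step
t_nvlink = transfer_seconds(payload_gb, NVLINK_C2C_BIDIR_GB_S)
t_pcie = transfer_seconds(payload_gb, PCIE_GEN5_X16_BIDIR_GB_S)
print(f"ratio={ratio:.2f}x  nvlink={t_nvlink:.3f}s  pcie={t_pcie:.3f}s")
```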
Training Framework
Based on Megatron-LM, extended for Susono as follows:
- Triton Fused Kernels: Fuse operations such as Engram lookup, mHC width connection, GatedDeltaNet decay, MoE router, RMSNorm variants, aux loss, and cross entropy. Every kernel includes a PyTorch fallback.
- FP8 training + CPU offload: Parameters are kept in FP8 (e4m3), while the Adam optimizer state and master weights (BF16) are offloaded to CPU memory over NVLink-C2C.
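A rough memory budget shows why this split matters on a 96 GB GPU. The sketch below is a back-of-the-envelope estimate under stated assumptions: exactly 10e9 parameters (the card says about 10B), two Adam moments stored in FP32 (the card does not state their precision), and no sharding across GPUs. Gradients and activations are not counted.

```python
NUM_PARAMS = 10e9  # assumption: the card states ~10B total parameters
GB = 1e9           # decimal gigabytes

BYTES_FP8 = 1
BYTES_BF16 = 2
BYTES_ADAM_MOMENT = 4  # assumption: FP32 moments; not stated in the card


def gigabytes(num_values, bytes_per_value):
    """Size in decimal GB of a tensor collection."""
    return num_values * bytes_per_value / GB


gpu_params_gb = gigabytes(NUM_PARAMS, BYTES_FP8)            # kept on the GPU
cpu_master_gb = gigabytes(NUM_PARAMS, BYTES_BF16)           # offloaded
cpu_adam_gb = 2 * gigabytes(NUM_PARAMS, BYTES_ADAM_MOMENT)  # two moments
cpu_total_gb = cpu_master_gb + cpu_adam_gb

print(f"GPU parameters (FP8): {gpu_params_gb:.0f} GB")
print(f"CPU master weights (BF16): {cpu_master_gb:.0f} GB")
print(f"CPU Adam state: {cpu_adam_gb:.0f} GB")
print(f"CPU total: {cpu_total_gb:.0f} GB")
```

Under these assumptions the offloaded state alone would exceed the 96 GB of HBM3 on a single GPU, while the FP8 parameters occupy only a small share of it.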
Training Schedule
| Phase | Context length (tokens) | Target tokens | Global batch size (GBS) | Learning rate |
|---|---|---|---|---|
| Phase 1: Pre-training | 4,096 | 300B | 1,024 | 2.0e-4 |
| Phase 2: Mid-training | 16,384 | 250B | 256 | 2.0e-4 |
| Phase 3: SFT | 16,384 | - | 128 | 2.0e-5 |
| Phase 4: DPO | 16,384 | - | 32 | 1.0e-6 |
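The schedule can be turned into approximate step counts. The sketch below assumes that GBS counts sequences, that every sequence is packed to the full context length, and that "Target tokens" is the total for that phase; none of these is stated explicitly in the table.

```python
import math

# (phase, context length in tokens, target tokens, global batch size)
PHASES = [
    ("Phase 1: Pre-training", 4_096, 300e9, 1_024),
    ("Phase 2: Mid-training", 16_384, 250e9, 256),
]


def tokens_per_step(context_length, global_batch_size):
    """Tokens consumed by one optimizer step under the packing assumption."""
    return context_length * global_batch_size


def optimizer_steps(target_tokens, context_length, global_batch_size):
    """Number of steps needed to reach the token target, rounded up."""
    return math.ceil(target_tokens / tokens_per_step(context_length, global_batch_size))


schedule = {
    name: (tokens_per_step(ctx, gbs), optimizer_steps(target, ctx, gbs))
    for name, ctx, target, gbs in PHASES
}
for name, (per_step, steps) in schedule.items():
    print(f"{name}: {per_step:,} tokens/step, {steps:,} steps")
```

Under these assumptions both phases process the same number of tokens per step (4,194,304): the context length grows 4× while the batch size shrinks 4×.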
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "puwaer/Susono-10B-A1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
    repetition_penalty=1.05,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print(content)
Reference Repositories
- HuggingFace transformers implementation: https://github.com/puwaer/transformers.git (`main` branch)
- Megatron-LM implementation: https://github.com/puwaer/Megatron-LM.git (`main` branch)
- SGLang implementation: https://github.com/puwaer/sglang.git (`sglang-v0.5.10-add-suson-model` branch)
- vLLM implementation: https://github.com/puwaer/vllm.git (`vllm-v0.19.1-add-suson-model` branch)
Note: the transformers, SGLang, and vLLM implementations are planned to be merged into their respective upstream (main) repositories.