# PatronusAI/llada_2.1_world_model_v3

PatronusAI/llada_2.1_world_model_v3 is a LLaDA2-style Mixture-of-Experts language model checkpoint trained with the dFactory training stack on conversational world-modeling style data.
This repository contains the Hugging Face checkpoint exported from the final training step of the most recent local W&B run:

- Run ID: `fc0mdstv`
- Date: March 11, 2026
- Training entrypoint: `tasks/train_llada2_bd.py`
- Training config: `configs/sft/llada2_mini_bd_sft.yaml`
- Git commit: `92b6890808088b112fcf5fc73a341b78b6ab76bf`
## Model details

- Architecture: `LLaDA2MoeModelLM`
- Model type: custom `llada2_moe`
- Approximate size: 16B-class model
- Layers: 20
- Hidden size: 2048
- Attention heads: 16
- Key/value heads: 4
- Experts: 256 total, 8 routed per token
- Shared experts: 1
- Vocabulary size: 157,184
- Checkpoint dtype: bfloat16
- Saved max position embeddings: 16,384

The repository includes custom modeling files:

- `configuration_llada2_moe.py`
- `modeling_llada2_moe.py`

`trust_remote_code=True` is required when loading through `transformers`.
## Training data

The checkpoint was trained on the local dataset at:

`/workspace/dFactory/world_modeling_datasets/world_modeling_train.jsonl`

Based on the dataset sample and run config, this is a conversation-format JSONL dataset using the `messages` field. The examples appear to be multi-turn agent traces containing system, user, assistant, and tool-interaction style content.

The exact dataset semantics are inferred from the local files and run config; this repository does not include a separate dataset card.
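To make the expected record shape concrete, here is a minimal, hypothetical example of one `messages`-format JSONL line. The roles follow the standard conversation schema; the actual trace content in the dataset differs:

```python
import json

# Hypothetical single JSONL record in the "messages" conversation format.
# The content strings here are illustrative, not taken from the dataset.
line = json.dumps({
    "messages": [
        {"role": "system", "content": "You are a world-modeling agent."},
        {"role": "user", "content": "Open the door."},
        {"role": "assistant", "content": "The door creaks open."},
    ]
})

# Each line of the JSONL file parses to a dict with a "messages" list.
record = json.loads(line)
roles = [m["role"] for m in record["messages"]]
print(roles)  # ['system', 'user', 'assistant']
```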
## Training recipe

The model was trained with block-diffusion SFT settings from `configs/sft/llada2_mini_bd_sft.yaml`:

- Objective: block-diffusion conversational fine-tuning
- Training sequence length: 8192
- Epochs: 1
- Train steps: 4067
- Global batch size: 64
- Micro batch size: 2
- Optimizer: AdamW
- Learning rate: `1e-5` with cosine decay to `1e-7`
- Warmup ratio: 0.03
- Weight decay: 0.1
- Gradient clipping: 1.0
- Block diffusion mode: enabled
- Block size: 32
- Noise range: 0.3 to 0.8
- Mixed precision: enabled
- Gradient checkpointing: enabled
- Parallelism: FSDP2 over 8 GPUs
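The block-noising settings above can be illustrated with a toy sketch. This is an assumption about the general technique, not the dFactory implementation: the names (`noise_blocks`, `MASK_ID`) are hypothetical, and only the block size and noise range come from the config.

```python
import random

BLOCK_SIZE = 32                   # block size from the config
NOISE_MIN, NOISE_MAX = 0.3, 0.8   # noise range from the config
MASK_ID = -1                      # placeholder mask token id (illustrative)

def noise_blocks(token_ids, rng=random):
    """Mask a random fraction of tokens independently within each block.

    Illustrative block-diffusion noising: every 32-token block draws its
    own noise ratio from [0.3, 0.8], then masks tokens at that rate.
    """
    noised = list(token_ids)
    for start in range(0, len(noised), BLOCK_SIZE):
        ratio = rng.uniform(NOISE_MIN, NOISE_MAX)
        for i in range(start, min(start + BLOCK_SIZE, len(noised))):
            if rng.random() < ratio:
                noised[i] = MASK_ID
    return noised

tokens = list(range(64))
noised = noise_blocks(tokens)
masked_frac = sum(t == MASK_ID for t in noised) / len(noised)
```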
## Training hardware

- GPUs: 8x NVIDIA H200
- CUDA: 12.8
- Python: 3.11.10
- `transformers`: 4.56.2
## Final run metrics

Metrics below come from `wandb/run-20260311_054957-fc0mdstv/files/wandb-summary.json`:

- Final training loss: 0.4174
- Final grad norm: 1.2001
- Consumed tokens: 2.132B
- Tokens per second: 0.0673M
- Runtime: 32,147.65 s (~8.93 h)
- Max allocated GPU memory: 101.46 GB
- Max reserved GPU memory: 125.39 GB
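As a quick sanity check, the reported throughput times runtime should roughly reproduce the consumed-token count (a small gap is expected, since the logged tokens-per-second figure is an average over steps):

```python
tokens_per_sec = 0.0673e6   # 0.0673M tokens/s from the run summary
runtime_s = 32147.65        # total runtime in seconds

consumed_est = tokens_per_sec * runtime_s
hours = runtime_s / 3600

print(f"~{consumed_est / 1e9:.3f}B tokens")  # ~2.164B vs. the logged 2.132B
print(f"~{hours:.2f} h")                     # ~8.93 h
```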
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "PatronusAI/llada_2.1_world_model_v3"

tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
```

For chat-style prompting, use the bundled tokenizer and chat template in this repository.
## Limitations
- This is a specialized checkpoint trained on world-modeling style conversational traces, not a general-purpose safety-tuned assistant.
- The training data appears to include agent and tool-use transcripts, so generations may imitate tool-calling or system-prompt patterns.
- No formal evaluation results or benchmark scores are included in this repository yet.
- Because this model uses custom code, downstream environments must allow remote code loading or vendor the modeling files locally.
## License

This model is released under Apache 2.0, the repository license observed locally in dFactory.