Grok-1 – TevunahAi BF16 HuggingFace
As of 03/31/2026 at 15:06, all files have been uploaded. Sorry about the wait.
| Property | Value |
|---|---|
| Original Model | xai-org/grok-1 (JAX checkpoint) |
| Architecture | Decoder-only Transformer, Top-2 MoE (8 experts), GQA |
| Parameters | 316.5B total, ~78B active per token |
| Context Length | 8,192 |
| Format | HuggingFace safetensors (BF16) |
| Total Size | 633 GB (17 shards) |
| License | Apache 2.0 |
What This Is
A clean, ground-up conversion of the xai-org Grok-1 314B MoE checkpoint from JAX/orbax format to HuggingFace safetensors, with custom `modeling_grok1.py` and `configuration_grok1.py` written from scratch by TevunahAi.
This is not a copy of the hpcai-tech conversion. The modeling code, configuration, and conversion pipeline were all built independently against the original xai-org JAX checkpoint, verified against actual tensor shapes and the original `model.py` architecture.
Why This Exists
The original xai-org checkpoint is stored in JAX orbax format and requires custom unpickling of `QuantizedWeight8bit` objects (int8 weights + bfloat16 scales). The existing community conversion (hpcai-tech) targets transformers 4.35 and has not been maintained. This conversion works with transformers 5.4+ and includes a modern HF-native model implementation.
Architecture
| Specification | Value |
|---|---|
| Layers | 64 |
| Hidden Size | 6,144 |
| Query Heads | 48 |
| KV Heads | 8 (Grouped-Query Attention) |
| Head Dimension | 128 |
| Experts | 8 per layer, top-2 routing |
| Expert FFN Intermediate | 32,768 |
| Vocab Size | 131,072 (SentencePiece) |
| Positional Encoding | RoPE (theta=10,000) |
| Embedding Multiplier | 78.38 |
| Output Multiplier | 0.577 |
| Attention Output Multiplier | 0.0884 |
| Attention Value Clamp | ±30.0 |
| RMS Norm Epsilon | 1e-5 |
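The projection shapes that appear in the verification logs below follow directly from these numbers. A quick sanity check of the GQA dimensions (plain arithmetic, not taken from the modeling code):

```python
# Derived attention projection dimensions implied by the table above (GQA).
hidden_size, q_heads, kv_heads, head_dim = 6144, 48, 8, 128

q_dim = q_heads * head_dim      # 6144 -> q_proj / o_proj are [6144, 6144]
kv_dim = kv_heads * head_dim    # 1024 -> k_proj / v_proj are [1024, 6144]
assert q_dim == hidden_size
print(q_dim, kv_dim)            # 6144 1024
```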
Grok-1 Specific Behaviors
- Embedding scaling: Input embeddings are multiplied by 78.38 before entering the decoder
- Output scaling: Logits are multiplied by 0.577 before softmax
- Attention clamping: Attention logits are hard-clamped to [-30, 30] before softmax (uncommon among transformer implementations)
- Four norms per layer: Pre-attention, post-attention, pre-MoE, post-MoE (most models use two); see the sketch below
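A schematic sketch of where these scalings and the clamp sit in the forward pass, following the description above. This is illustrative PyTorch only, not the actual `modeling_grok1.py` implementation; the constants are the rounded values from the table:

```python
import torch

EMBED_MULTIPLIER = 78.38          # rounded; applied to input embeddings
OUTPUT_MULTIPLIER = 0.577         # rounded; applied to final logits
ATTN_OUTPUT_MULTIPLIER = 0.0884   # rounded; ~1/sqrt(head_dim)
ATTN_CLAMP = 30.0                 # attention logit clamp value

def attention_weights(q, k):
    # Scale the raw attention logits, then hard-clamp to [-30, 30]
    # before softmax, as described above.
    scores = torch.matmul(q, k.transpose(-2, -1)) * ATTN_OUTPUT_MULTIPLIER
    scores = torch.clamp(scores, -ATTN_CLAMP, ATTN_CLAMP)
    return torch.softmax(scores, dim=-1)

def forward_logits(input_ids, embed_tokens, layers, final_norm, lm_head):
    # Embedding scaling on the way in, output scaling on the way out.
    h = embed_tokens(input_ids) * EMBED_MULTIPLIER
    for layer in layers:
        h = layer(h)  # each layer applies its four RMSNorms internally
    return lm_head(final_norm(h)) * OUTPUT_MULTIPLIER
```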
Conversion Details
The original checkpoint consists of 770 pickled tensor files containing QuantizedWeight8bit objects with int8 weights and bfloat16 scales. Each layer occupies 12 files (7 quantized weight objects + 4 norm scales + 1 router). The converter dequantizes all weights to float32, handles grouped scale broadcasting, transposes from JAX [in, out] to PyTorch [out, in] convention, and saves as bfloat16 safetensors.
- Converter: Custom `Grok1Convert.py` with a `QuantizedWeight8bit` stub and `_GrokUnpickler` – no orbax or tensorstore dependency required
- Dequantization: int8 × bfloat16 scales with grouped broadcasting (scale shapes vary: (1,N), (8,N), (8,1,N), (8,8,N)); see the sketch below
- Time: ~2 hours on dual Xeon Max 9480
- Hardware: Dual Xeon Max 9480 (128GB HBM2e + 256GB DDR5)
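A minimal sketch of the grouped-scale dequantization described above, assuming each scale entry covers a contiguous block of rows of its int8 weight. This illustrates the idea only and is not the actual `Grok1Convert.py` code:

```python
import numpy as np

def dequantize_grouped(weight_int8: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Dequantize int8 weights with grouped scales by broadcasting each scale
    over its row group, e.g. weight (..., G*K, N) paired with scales (..., G, N)."""
    w = weight_int8.astype(np.float32)
    s = scales.astype(np.float32)
    groups = s.shape[-2]                      # number of scale groups (1 or 8 here)
    rows_per_group = w.shape[-2] // groups    # rows covered by each scale row
    w = w.reshape(*w.shape[:-2], groups, rows_per_group, w.shape[-1])
    w = w * s[..., :, None, :]                # broadcast each scale across its group
    return w.reshape(weight_int8.shape)

# The converter then transposes from the JAX [in, out] layout to the
# PyTorch [out, in] layout and casts to bfloat16 before saving, e.g.:
# torch_weight = torch.from_numpy(dequantize_grouped(w_int8, scales).T).to(torch.bfloat16)
```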
Verification
Config Load
```
Config loaded: grok1
Layers: 64
Experts: 8
Hidden: 6144
```
Architecture Instantiation (meta device, zero memory)
```
Architecture OK: 316.5B parameters
Layer 0 modules:
  pre_attn_norm: Grok1RMSNorm
  attn: Grok1Attention
  post_attn_norm: Grok1RMSNorm
  pre_moe_norm: Grok1RMSNorm
  moe_block: Grok1MoeBlock
  post_moe_norm: Grok1RMSNorm
```
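The meta-device check can be reproduced along these lines; the layer attribute path is an assumption inferred from the weight names in the next section, and the exact verification script is not published here:

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("TevunahAi/Grok-1-BF16", trust_remote_code=True)

# Instantiate the architecture without allocating any real weight memory.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

total = sum(p.numel() for p in model.parameters())
print(f"Architecture OK: {total / 1e9:.1f}B parameters")

# Layer module layout (assumes the model.layers structure implied by the
# tensor names shown in the weight verification section).
for name, module in model.model.layers[0].named_children():
    print(f"  {name}: {module.__class__.__name__}")
```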
Weight Verification (safetensors metadata, zero memory)
```
model-00000: 3 tensors
  lm_head.weight: [131072, 6144]
  model.embed_tokens.weight: [131072, 6144]
  model.norm.scale: [6144]
model-00001: 132 tensors
  model.layers.0.attn.k_proj.weight: [1024, 6144]
  model.layers.0.attn.o_proj.weight: [6144, 6144]
  model.layers.0.attn.q_proj.weight: [6144, 6144]
  model.layers.0.attn.v_proj.weight: [1024, 6144]
  model.layers.0.moe_block.experts.0.linear.weight: [32768, 6144]
  model.layers.0.moe_block.experts.0.linear_1.weight: [6144, 32768]
  model.layers.0.moe_block.experts.0.linear_v.weight: [32768, 6144]
  ...

Layer 0 non-expert weights:
  model.layers.0.attn.k_proj.weight: [1024, 6144]
  model.layers.0.attn.o_proj.weight: [6144, 6144]
  model.layers.0.attn.q_proj.weight: [6144, 6144]
  model.layers.0.attn.v_proj.weight: [1024, 6144]
  model.layers.0.moe_block.gate.weight: [8, 6144]
  model.layers.0.post_attn_norm.scale: [6144]
  model.layers.0.post_moe_norm.scale: [6144]
  model.layers.0.pre_attn_norm.scale: [6144]
  model.layers.0.pre_moe_norm.scale: [6144]
```
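Shapes like these can be read from the shard headers without loading any tensor data, for example with `safetensors` (a sketch; the exact verification script is not published here):

```python
from safetensors import safe_open

# Read tensor names and shapes from the shard header only (zero weight memory).
with safe_open("model-00001.safetensors", framework="pt", device="cpu") as f:
    for name in sorted(f.keys()):
        if name.startswith("model.layers.0.") and "experts" not in name:
            print(f"  {name}: {list(f.get_slice(name).get_shape())}")
```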
Conversion Output
```
Total converted parameters: 316.5B
Total size: 633.0 GB
Shards: 17 (3.22 GB embedding + 16 × 39.36 GB layer shards)
Tensors per layer: 33 (4 norms + 1 router + 4 attn + 24 expert)
Parameters per layer: 4,920M
```
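These figures can be cross-checked against the architecture table with simple arithmetic:

```python
# Back-of-the-envelope parameter count from the architecture table.
hidden, q_heads, kv_heads, head_dim = 6144, 48, 8, 128
ffn, experts, layers, vocab = 32768, 8, 64, 131072

attn = 2 * hidden * (q_heads * head_dim)          # q_proj + o_proj
attn += 2 * hidden * (kv_heads * head_dim)        # k_proj + v_proj
moe = experts * 3 * hidden * ffn                  # linear, linear_1, linear_v per expert
router = experts * hidden
norms = 4 * hidden

per_layer = attn + moe + router + norms
total = layers * per_layer + 2 * vocab * hidden   # embed_tokens + lm_head
print(f"{per_layer / 1e6:,.0f}M per layer, {total / 1e9:.1f}B total")
# -> 4,920M per layer, 316.5B total
```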
Usage
```python
from transformers import AutoModelForCausalLM
import torch

# Load model (requires ~640GB combined memory)
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/Grok-1-BF16",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```
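A minimal generation call once the model is loaded. This assumes a compatible tokenizer can be resolved for this repo id (tokenizer files are not listed under Files Included, so the original xai-org SentencePiece tokenizer may need to be adapted); the prompt is just an example:

```python
from transformers import AutoTokenizer

# Assumes a compatible tokenizer is available for this repo id.
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/Grok-1-BF16", trust_remote_code=True)

inputs = tokenizer("The three laws of robotics are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```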
Memory Requirements
This is a 633 GB model. Loading requires:
- Full load: ~640 GB combined VRAM + RAM
- Recommended: Multi-GPU setup (8Γ H100 80GB or equivalent)
- With CPU offload: `device_map="auto"` with sufficient system RAM (see the example below)
This BF16 release is primarily intended as a reference checkpoint for quantization and research. For practical inference, use a quantized variant.
Files Included
- `config.json` – Model configuration
- `configuration_grok1.py` – HuggingFace config class
- `modeling_grok1.py` – Full model implementation (attention, MoE, RoPE, generation)
- `model-00000.safetensors` through `model-00016.safetensors` – Weight shards
- `model.safetensors.index.json` – Shard index
Citation
```bibtex
@software{grok1_bf16_tevunahai_2026,
  title  = {Grok-1 BF16 HuggingFace Conversion},
  author = {TevunahAi},
  year   = {2026},
  note   = {Ground-up JAX to HuggingFace conversion with custom modeling implementation},
  url    = {https://huggingface.co/TevunahAi/Grok-1-BF16}
}

@misc{xai_grok1_2024,
  title  = {Grok-1 Open Release},
  author = {xAI},
  year   = {2024},
  url    = {https://github.com/xai-org/grok-1},
  note   = {314B parameter Mixture-of-Experts model, Apache 2.0}
}
```
Converted by TevunahAi LLC