Grok-1 – TevunahAi BF16 HuggingFace

As of 03/31/2026 at 15:06, all files have been uploaded. Sorry about the wait.

| Property | Value |
|---|---|
| Original Model | xai-org/grok-1 (JAX checkpoint) |
| Architecture | Decoder-only Transformer, Top-2 MoE (8 experts), GQA |
| Parameters | 316.5B total, ~78B active per token |
| Context Length | 8,192 |
| Format | HuggingFace safetensors (BF16) |
| Total Size | 633 GB (17 shards) |
| License | Apache 2.0 |

What This Is

A clean, ground-up conversion of the xai-org Grok-1 314B MoE checkpoint from the JAX/Orbax format to HuggingFace safetensors, with custom modeling_grok1.py and configuration_grok1.py written from scratch by TevunahAi.

This is not a copy of the hpcai-tech conversion. The modeling code, configuration, and conversion pipeline were all built independently against the original xai-org JAX checkpoint, verified against actual tensor shapes and the original model.py architecture.

Why This Exists

The original xai-org checkpoint is stored in JAX Orbax format and requires custom unpickling of QuantizedWeight8bit objects (int8 weights + bfloat16 scales). The existing community conversion (hpcai-tech) targets transformers 4.35 and has not been maintained. This conversion works with transformers 5.4+ and includes a modern HF-native model implementation.

Architecture

| Specification | Value |
|---|---|
| Layers | 64 |
| Hidden Size | 6,144 |
| Query Heads | 48 |
| KV Heads | 8 (Grouped-Query Attention) |
| Head Dimension | 128 |
| Experts | 8 per layer, top-2 routing |
| Expert FFN Intermediate | 32,768 |
| Vocab Size | 131,072 (SentencePiece) |
| Positional Encoding | RoPE (theta = 10,000) |
| Embedding Multiplier | 78.38 |
| Output Multiplier | 0.577 |
| Attention Output Multiplier | 0.0884 |
| Attention Value Clamp | ±30.0 |
| RMS Norm Epsilon | 1e-5 |

Grok-1 Specific Behaviors

  • Embedding scaling: Input embeddings are multiplied by 78.38 before entering the decoder
  • Output scaling: Logits are multiplied by 0.577 before softmax
  • Attention clamping: Attention logits are hard-clamped to [-30, 30] before softmax (absent from most transformer implementations)
  • Four norms per layer: Pre-attention, post-attention, pre-MoE, post-MoE (most models use two)
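
A minimal sketch of how these four behaviors compose in one decoder layer, in PyTorch-style pseudocode. The module names follow the verification output below; everything else is illustrative, not the actual modeling_grok1.py:

import torch
import torch.nn.functional as F

EMBED_MULT = 78.38    # embedding multiplier
OUTPUT_MULT = 0.577   # output (logit) multiplier
ATTN_MULT = 0.0884    # attention output multiplier, ~1/sqrt(head_dim=128)
ATTN_CLAMP = 30.0     # attention logit clamp

def attention_probs(q, k):
    # q, k: [batch, heads, seq, head_dim]
    scores = (q @ k.transpose(-1, -2)) * ATTN_MULT
    scores = scores.clamp(-ATTN_CLAMP, ATTN_CLAMP)  # hard clamp to [-30, 30]
    return F.softmax(scores, dim=-1)

def decoder_layer(layer, x):
    # Four RMS norms: post-norms wrap each sublayer output before the residual add
    h = x + layer.post_attn_norm(layer.attn(layer.pre_attn_norm(x)))
    return h + layer.post_moe_norm(layer.moe_block(layer.pre_moe_norm(h)))

def forward(model, input_ids):
    h = model.embed_tokens(input_ids) * EMBED_MULT      # embedding scaling
    for layer in model.layers:
        h = decoder_layer(layer, h)
    return model.lm_head(model.norm(h)) * OUTPUT_MULT   # output scaling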

Conversion Details

The original checkpoint consists of 770 pickled tensor files containing QuantizedWeight8bit objects with int8 weights and bfloat16 scales. Each layer occupies 12 files (7 quantized weight objects + 4 norm scales + 1 router). The converter dequantizes all weights to float32, handles grouped scale broadcasting, transposes from JAX [in, out] to PyTorch [out, in] convention, and saves as bfloat16 safetensors.

  • Converter: Custom Grok1Convert.py with a QuantizedWeight8bit stub and _GrokUnpickler; no orbax or tensorstore dependency required
  • Dequantization: int8 × bfloat16 scales with grouped broadcasting (scale shapes vary: (1,N), (8,N), (8,1,N), (8,8,N))
  • Time: ~2 hours on dual Xeon Max 9480
  • Hardware: Dual Xeon Max 9480 (128GB HBM2e + 256GB DDR5)
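
A hedged sketch of the dequantization and layout steps, not the actual Grok1Convert.py. It assumes each scale entry covers a contiguous block of weight rows, so grouped scales are repeated until shapes match; all names are illustrative:

import numpy as np

def dequantize(w_int8: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # w_int8: int8 weights, e.g. [in, out] or [experts, in, out]
    # scales: bfloat16 scales with shapes like (1,N), (8,N), (8,1,N), (8,8,N)
    w = w_int8.astype(np.float32)
    s = scales.astype(np.float32)
    if s.shape != w.shape:
        # Grouped broadcasting (assumption): repeat each scale row over the
        # block of weight rows it covers; rows must divide evenly into groups
        group = w.shape[-2] // s.shape[-2]
        s = np.repeat(s, group, axis=-2)
    return w * s

def to_pytorch_layout(w: np.ndarray) -> np.ndarray:
    # JAX convention is [in, out]; PyTorch nn.Linear stores [out, in]
    return np.ascontiguousarray(np.swapaxes(w, -1, -2))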

Verification

Config Load

Config loaded: grok1
Layers: 64
Experts: 8
Hidden: 6144
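
The config check above can be reproduced along these lines (the attribute names are assumptions about the custom config class; check configuration_grok1.py for the real ones):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("TevunahAi/Grok-1-BF16", trust_remote_code=True)
print(f"Config loaded: {config.model_type}")
print(f"Layers: {config.num_hidden_layers}")   # assumed attribute name
print(f"Experts: {config.num_experts}")        # assumed attribute name
print(f"Hidden: {config.hidden_size}")         # assumed attribute name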

Architecture Instantiation (meta device, zero memory)

Architecture OK: 316.5B parameters
Layer 0 modules:
  pre_attn_norm: Grok1RMSNorm
  attn: Grok1Attention
  post_attn_norm: Grok1RMSNorm
  pre_moe_norm: Grok1RMSNorm
  moe_block: Grok1MoeBlock
  post_moe_norm: Grok1RMSNorm
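
This zero-memory check instantiates the model shape-only on PyTorch's meta device; roughly (assuming the config loads as above):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("TevunahAi/Grok-1-BF16", trust_remote_code=True)
with torch.device("meta"):
    # On the meta device, parameters carry shapes but no storage
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

n_params = sum(p.numel() for p in model.parameters())
print(f"Architecture OK: {n_params / 1e9:.1f}B parameters")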

Weight Verification (safetensors metadata, zero memory)

model-00000: 3 tensors
  lm_head.weight: [131072, 6144]
  model.embed_tokens.weight: [131072, 6144]
  model.norm.scale: [6144]

model-00001: 132 tensors
  model.layers.0.attn.k_proj.weight: [1024, 6144]
  model.layers.0.attn.o_proj.weight: [6144, 6144]
  model.layers.0.attn.q_proj.weight: [6144, 6144]
  model.layers.0.attn.v_proj.weight: [1024, 6144]
  model.layers.0.moe_block.experts.0.linear.weight: [32768, 6144]
  model.layers.0.moe_block.experts.0.linear_1.weight: [6144, 32768]
  model.layers.0.moe_block.experts.0.linear_v.weight: [32768, 6144]
  ...

Layer 0 non-expert weights:
  model.layers.0.attn.k_proj.weight: [1024, 6144]
  model.layers.0.attn.o_proj.weight: [6144, 6144]
  model.layers.0.attn.q_proj.weight: [6144, 6144]
  model.layers.0.attn.v_proj.weight: [1024, 6144]
  model.layers.0.moe_block.gate.weight: [8, 6144]
  model.layers.0.post_attn_norm.scale: [6144]
  model.layers.0.post_moe_norm.scale: [6144]
  model.layers.0.pre_attn_norm.scale: [6144]
  model.layers.0.pre_moe_norm.scale: [6144]
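
These shapes come straight from the shard headers; safetensors can read them without loading any tensor data:

from safetensors import safe_open

# Only the header is parsed, so no tensor data enters memory
with safe_open("model-00001.safetensors", framework="pt") as f:
    for name in sorted(f.keys()):
        print(f"  {name}: {f.get_slice(name).get_shape()}")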

Conversion Output

Total converted parameters: 316.5B
Total size: 633.0 GB
Shards: 17 (3.22 GB embedding + 16 × 39.36 GB layer shards)
Tensors per layer: 33 (4 norms + 1 router + 4 attn + 24 expert)
Parameters per layer: 4,920M
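
The per-layer figure follows from the shapes listed under Verification; a quick arithmetic check:

hidden, kv_dim, ffn, n_experts = 6144, 1024, 32768, 8

attn    = 2 * hidden * hidden + 2 * kv_dim * hidden  # q/o plus k/v projections
experts = n_experts * 3 * ffn * hidden               # linear, linear_1, linear_v
router  = n_experts * hidden                         # gate.weight
norms   = 4 * hidden                                 # four RMSNorm scales

per_layer = attn + experts + router + norms
total = 64 * per_layer + 2 * 131072 * hidden         # layers + embed/lm_head
print(f"{per_layer / 1e6:,.0f}M per layer")          # -> 4,920M
print(f"{total / 1e9:.1f}B total")                   # -> 316.5B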

Usage

from transformers import AutoModelForCausalLM
import torch

# Load model (requires ~640 GB combined memory)
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/Grok-1-BF16",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
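
A minimal generation call might then look like this. Tokenizer files are not in the file list below, so this assumes a tokenizer resolves for the repo (Grok-1 uses a SentencePiece vocabulary; you may need to supply it separately):

from transformers import AutoTokenizer

# Assumption: a tokenizer is available under this repo id
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/Grok-1-BF16", trust_remote_code=True)

inputs = tokenizer("The answer to life, the universe, and everything is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))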

Memory Requirements

This is a 633 GB model. Loading requires:

  • Full load: ~640 GB combined VRAM + RAM
  • Recommended: Multi-GPU setup (8Γ— H100 80GB or equivalent)
  • With CPU offload: device_map="auto" with sufficient system RAM
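
For partial offload, the max_memory argument caps per-device usage; the budgets below are placeholders, not tested values:

model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/Grok-1-BF16",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    # Placeholder budgets: size these to your actual GPUs and system RAM
    max_memory={0: "75GiB", 1: "75GiB", "cpu": "500GiB"},
)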

This BF16 release is primarily intended as a reference checkpoint for quantization and research. For practical inference, use a quantized variant.

Files Included

  • config.json β€” Model configuration
  • configuration_grok1.py β€” HuggingFace config class
  • modeling_grok1.py β€” Full model implementation (attention, MoE, RoPE, generation)
  • model-00000.safetensors through model-00016.safetensors β€” Weight shards
  • model.safetensors.index.json β€” Shard index

Citation

@software{grok1_bf16_tevunahai_2026,
  title = {Grok-1 BF16 HuggingFace Conversion},
  author = {TevunahAi},
  year = {2026},
  note = {Ground-up JAX to HuggingFace conversion with custom modeling implementation},
  url = {https://huggingface.co/TevunahAi/Grok-1-BF16}
}

@misc{xai_grok1_2024,
  title = {Grok-1 Open Release},
  author = {xAI},
  year = {2024},
  url = {https://github.com/xai-org/grok-1},
  note = {314B parameter Mixture-of-Experts model, Apache 2.0}
}

Converted by TevunahAi LLC

https://huggingface.co/TevunahAi www.Tevunah.ai
