Grok-1 – TevunahAi BF16 HuggingFace
As of 03/31/2026 at 15:06, all files have been uploaded. Sorry about the wait.
| Property | Value |
|---|---|
| Original Model | xai-org/grok-1 (JAX checkpoint) |
| Architecture | Decoder-only Transformer, Top-2 MoE (8 experts), GQA |
| Parameters | 316.5B total, ~78B active per token |
| Context Length | 8,192 |
| Format | HuggingFace safetensors (BF16) |
| Total Size | 633 GB (17 shards) |
| License | Apache 2.0 |
What This Is
A clean, ground-up conversion of the xai-org Grok-1 314B MoE checkpoint from JAX/orbax format to HuggingFace safetensors, with custom `modeling_grok1.py` and `configuration_grok1.py` written from scratch by TevunahAi.
This is not a copy of the hpcai-tech conversion. The modeling code, configuration, and conversion pipeline were all built independently against the original xai-org JAX checkpoint, verified against actual tensor shapes and the original `model.py` architecture.
Why This Exists
The original xai-org checkpoint is stored in JAX orbax format and requires custom unpickling of `QuantizedWeight8bit` objects (int8 weights + bfloat16 scales). The existing community conversion (hpcai-tech) targets transformers 4.35 and has not been maintained. This conversion works with transformers 5.4+ and includes a modern HF-native model implementation.
Architecture
| Specification | Value |
|---|---|
| Layers | 64 |
| Hidden Size | 6,144 |
| Query Heads | 48 |
| KV Heads | 8 (Grouped-Query Attention) |
| Head Dimension | 128 |
| Experts | 8 per layer, top-2 routing |
| Expert FFN Intermediate | 32,768 |
| Vocab Size | 131,072 (SentencePiece) |
| Positional Encoding | RoPE (theta=10,000) |
| Embedding Multiplier | 78.38 |
| Output Multiplier | 0.577 |
| Attention Output Multiplier | 0.0884 |
| Attention Value Clamp | ±30.0 |
| RMS Norm Epsilon | 1e-5 |
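The projection shapes that appear in the verification logs below follow directly from these numbers. A quick sanity check of the GQA dimensions (plain arithmetic, not taken from the modeling code):

```python
# Derived attention projection dimensions implied by the table above (GQA).
hidden_size, q_heads, kv_heads, head_dim = 6144, 48, 8, 128

q_dim = q_heads * head_dim      # 6144 -> q_proj / o_proj are [6144, 6144]
kv_dim = kv_heads * head_dim    # 1024 -> k_proj / v_proj are [1024, 6144]
assert q_dim == hidden_size
print(q_dim, kv_dim)            # 6144 1024
```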
Grok-1 Specific Behaviors
- Embedding scaling: Input embeddings are multiplied by 78.38 before entering the decoder
- Output scaling: Logits are multiplied by 0.577 before softmax
- Attention clamping: Attention logits are hard-clamped to [-30, 30] before softmax (uncommon among transformer implementations)
- Four norms per layer: Pre-attention, post-attention, pre-MoE, post-MoE (most models use two); see the sketch below
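A schematic sketch of where these scalings and the clamp sit in the forward pass, following the description above. This is illustrative PyTorch only, not the actual `modeling_grok1.py` implementation; the constants are the rounded values from the table:

```python
import torch

EMBED_MULTIPLIER = 78.38          # rounded; applied to input embeddings
OUTPUT_MULTIPLIER = 0.577         # rounded; applied to final logits
ATTN_OUTPUT_MULTIPLIER = 0.0884   # rounded; ~1/sqrt(head_dim)
ATTN_CLAMP = 30.0                 # attention logit clamp value

def attention_weights(q, k):
    # Scale the raw attention logits, then hard-clamp to [-30, 30]
    # before softmax, as described above.
    scores = torch.matmul(q, k.transpose(-2, -1)) * ATTN_OUTPUT_MULTIPLIER
    scores = torch.clamp(scores, -ATTN_CLAMP, ATTN_CLAMP)
    return torch.softmax(scores, dim=-1)

def forward_logits(input_ids, embed_tokens, layers, final_norm, lm_head):
    # Embedding scaling on the way in, output scaling on the way out.
    h = embed_tokens(input_ids) * EMBED_MULTIPLIER
    for layer in layers:
        h = layer(h)  # each layer applies its four RMSNorms internally
    return lm_head(final_norm(h)) * OUTPUT_MULTIPLIER
```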
Conversion Details
The original checkpoint consists of 770 pickled tensor files containing QuantizedWeight8bit objects with int8 weights and bfloat16 scales. Each layer occupies 12 files (7 quantized weight objects + 4 norm scales + 1 router). The converter dequantizes all weights to float32, handles grouped scale broadcasting, transposes from JAX [in, out] to PyTorch [out, in] convention, and saves as bfloat16 safetensors.
- Converter: Custom `Grok1Convert.py` with a `QuantizedWeight8bit` stub and `_GrokUnpickler` – no orbax or tensorstore dependency required
- Dequantization: int8 × bfloat16 scales with grouped broadcasting (scale shapes vary: (1,N), (8,N), (8,1,N), (8,8,N)); see the sketch below
- Time: ~2 hours on dual Xeon Max 9480
- Hardware: Dual Xeon Max 9480 (128GB HBM2e + 256GB DDR5)
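A minimal sketch of the grouped-scale dequantization described above, assuming each scale entry covers a contiguous block of rows of its int8 weight. This illustrates the idea only and is not the actual `Grok1Convert.py` code:

```python
import numpy as np

def dequantize_grouped(weight_int8: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Dequantize int8 weights with grouped scales by broadcasting each scale
    over its row group, e.g. weight (..., G*K, N) paired with scales (..., G, N)."""
    w = weight_int8.astype(np.float32)
    s = scales.astype(np.float32)
    groups = s.shape[-2]                      # number of scale groups (1 or 8 here)
    rows_per_group = w.shape[-2] // groups    # rows covered by each scale row
    w = w.reshape(*w.shape[:-2], groups, rows_per_group, w.shape[-1])
    w = w * s[..., :, None, :]                # broadcast each scale across its group
    return w.reshape(weight_int8.shape)

# The converter then transposes from the JAX [in, out] layout to the
# PyTorch [out, in] layout and casts to bfloat16 before saving, e.g.:
# torch_weight = torch.from_numpy(dequantize_grouped(w_int8, scales).T).to(torch.bfloat16)
```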
Verification
Config Load
```
Config loaded: grok1
Layers: 64
Experts: 8
Hidden: 6144
```
Architecture Instantiation (meta device, zero memory)
```
Architecture OK: 316.5B parameters
Layer 0 modules:
  pre_attn_norm: Grok1RMSNorm
  attn: Grok1Attention
  post_attn_norm: Grok1RMSNorm
  pre_moe_norm: Grok1RMSNorm
  moe_block: Grok1MoeBlock
  post_moe_norm: Grok1RMSNorm
```
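The meta-device check can be reproduced along these lines; the layer attribute path is an assumption inferred from the weight names in the next section, and the exact verification script is not published here:

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("TevunahAi/Grok-1-BF16", trust_remote_code=True)

# Instantiate the architecture without allocating any real weight memory.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

total = sum(p.numel() for p in model.parameters())
print(f"Architecture OK: {total / 1e9:.1f}B parameters")

# Layer module layout (assumes the model.layers structure implied by the
# tensor names shown in the weight verification section).
for name, module in model.model.layers[0].named_children():
    print(f"  {name}: {module.__class__.__name__}")
```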
Weight Verification (safetensors metadata, zero memory)
```
model-00000: 3 tensors
  lm_head.weight: [131072, 6144]
  model.embed_tokens.weight: [131072, 6144]
  model.norm.scale: [6144]
model-00001: 132 tensors
  model.layers.0.attn.k_proj.weight: [1024, 6144]
  model.layers.0.attn.o_proj.weight: [6144, 6144]
  model.layers.0.attn.q_proj.weight: [6144, 6144]
  model.layers.0.attn.v_proj.weight: [1024, 6144]
  model.layers.0.moe_block.experts.0.linear.weight: [32768, 6144]
  model.layers.0.moe_block.experts.0.linear_1.weight: [6144, 32768]
  model.layers.0.moe_block.experts.0.linear_v.weight: [32768, 6144]
  ...

Layer 0 non-expert weights:
  model.layers.0.attn.k_proj.weight: [1024, 6144]
  model.layers.0.attn.o_proj.weight: [6144, 6144]
  model.layers.0.attn.q_proj.weight: [6144, 6144]
  model.layers.0.attn.v_proj.weight: [1024, 6144]
  model.layers.0.moe_block.gate.weight: [8, 6144]
  model.layers.0.post_attn_norm.scale: [6144]
  model.layers.0.post_moe_norm.scale: [6144]
  model.layers.0.pre_attn_norm.scale: [6144]
  model.layers.0.pre_moe_norm.scale: [6144]
```
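Shapes like these can be read from the shard headers without loading any tensor data, for example with `safetensors` (a sketch; the exact verification script is not published here):

```python
from safetensors import safe_open

# Read tensor names and shapes from the shard header only (zero weight memory).
with safe_open("model-00001.safetensors", framework="pt", device="cpu") as f:
    for name in sorted(f.keys()):
        if name.startswith("model.layers.0.") and "experts" not in name:
            print(f"  {name}: {list(f.get_slice(name).get_shape())}")
```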
Conversion Output
```
Total converted parameters: 316.5B
Total size: 633.0 GB
Shards: 17 (3.22 GB embedding + 16 × 39.36 GB layer shards)
Tensors per layer: 33 (4 norms + 1 router + 4 attn + 24 expert)
Parameters per layer: 4,920M
```
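These figures can be cross-checked against the architecture table with simple arithmetic:

```python
# Back-of-the-envelope parameter count from the architecture table.
hidden, q_heads, kv_heads, head_dim = 6144, 48, 8, 128
ffn, experts, layers, vocab = 32768, 8, 64, 131072

attn = 2 * hidden * (q_heads * head_dim)          # q_proj + o_proj
attn += 2 * hidden * (kv_heads * head_dim)        # k_proj + v_proj
moe = experts * 3 * hidden * ffn                  # linear, linear_1, linear_v per expert
router = experts * hidden
norms = 4 * hidden

per_layer = attn + moe + router + norms
total = layers * per_layer + 2 * vocab * hidden   # embed_tokens + lm_head
print(f"{per_layer / 1e6:,.0f}M per layer, {total / 1e9:.1f}B total")
# -> 4,920M per layer, 316.5B total
```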
Usage
```python
from transformers import AutoModelForCausalLM
import torch

# Load model (requires ~640GB combined memory)
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/Grok-1-BF16",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```
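A minimal generation call once the model is loaded. This assumes a compatible tokenizer can be resolved for this repo id (tokenizer files are not listed under Files Included, so the original xai-org SentencePiece tokenizer may need to be adapted); the prompt is just an example:

```python
from transformers import AutoTokenizer

# Assumes a compatible tokenizer is available for this repo id.
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/Grok-1-BF16", trust_remote_code=True)

inputs = tokenizer("The three laws of robotics are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```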
Memory Requirements
This is a 633 GB model. Loading requires:
- Full load: ~640 GB combined VRAM + RAM
- Recommended: Multi-GPU setup (8Γ H100 80GB or equivalent)
- With CPU offload: `device_map="auto"` with sufficient system RAM (see the example below)
This BF16 release is primarily intended as a reference checkpoint for quantization and research. For practical inference, use a quantized variant.
Files Included
- `config.json` – Model configuration
- `configuration_grok1.py` – HuggingFace config class
- `modeling_grok1.py` – Full model implementation (attention, MoE, RoPE, generation)
- `model-00000.safetensors` through `model-00016.safetensors` – Weight shards
- `model.safetensors.index.json` – Shard index
Citation
```bibtex
@software{grok1_bf16_tevunahai_2026,
  title  = {Grok-1 BF16 HuggingFace Conversion},
  author = {TevunahAi},
  year   = {2026},
  note   = {Ground-up JAX to HuggingFace conversion with custom modeling implementation},
  url    = {https://huggingface.co/TevunahAi/Grok-1-BF16}
}

@misc{xai_grok1_2024,
  title  = {Grok-1 Open Release},
  author = {xAI},
  year   = {2024},
  url    = {https://github.com/xai-org/grok-1},
  note   = {314B parameter Mixture-of-Experts model, Apache 2.0}
}
```
Converted by TevunahAi LLC