Running Zyphra ZAYA1-74B-preview on Multi-GPU Hardware: Six Bugs and Their Fixes

Community Article Published May 8, 2026

Discovered while running heretic abliteration on ZAYA1-74B-preview across 4× NVIDIA RTX PRO 6000 Blackwell GPUs. Every bug listed here affects any multi-GPU inference setup — vLLM, HuggingFace Transformers, custom pipelines — not just heretic.


ZAYA1-74B-preview is an impressive hybrid SSM-MoE-Attention model from Zyphra. At ~139 GB in BF16, it requires multiple high-VRAM GPUs for full-precision inference. When we attempted to load it with device_map="auto" across four 95 GB GPUs, we hit a chain of six distinct bugs in Zyphra's custom modeling_zaya.py. Five are device-mismatch errors with a shared root cause: the model's forward pass was written assuming single-GPU placement and was never tested across device boundaries. The sixth (Bug 6) is an attention-mask contiguity error in the SDPA call that surfaced in the same run.

This article documents each bug, its error message, and the exact fix (a one-line sed substitution in most cases). We also document the environment setup issues that must be resolved before the model can load at all.


Environment Setup

1. Install Zyphra's transformers fork

The zaya architecture is not in upstream HuggingFace transformers. You must install Zyphra's fork:

pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"

This installs transformers==4.57.1 and downgrades huggingface-hub from ~1.14 to 0.36.2.

2. Fix the kernels package conflict

The hub downgrade breaks the kernels package, which uses str | None type annotations not supported by huggingface-hub==0.36.2:

TypeError: Unsupported type for field 'import_name': str | None

Fix:

pip uninstall kernels -y
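
To confirm the environment ended up in the state described above, a small check of the installed package versions can help. This is a minimal sketch using only the standard library; installed_versions is a hypothetical helper name, and the expected result after the steps above is transformers 4.57.1, huggingface-hub 0.36.2, and kernels not installed.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Report the three packages involved in the conflict described above.
report = installed_versions(["transformers", "huggingface-hub", "kernels"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```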

3. Expose all GPUs

If CUDA_VISIBLE_DEVICES is set to a single GPU index, the 139 GB model will partially offload to CPU. CPU offload triggers its own cascade of failures. Always unset it before loading:

unset CUDA_VISIBLE_DEVICES

With 4× 95 GB GPUs (380 GB total), the model loads cleanly in BF16 via device_map="auto" with no CPU offload.
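
The arithmetic behind this can be sketched as a quick pre-flight check. The helper names and the 10% headroom reserved for activations and cache are our own assumptions, not values from Zyphra or accelerate, and the CUDA_VISIBLE_DEVICES handling is deliberately simplified (it only counts comma-separated entries).

```python
from typing import Optional, Sequence


def fits_without_offload(
    model_gb: float, gpu_gbs: Sequence[float], headroom_fraction: float = 0.10
) -> bool:
    """True if the weights fit in total GPU memory after reserving headroom.

    headroom_fraction is an assumed safety margin, not a measured value.
    """
    usable_gb = sum(gpu_gbs) * (1.0 - headroom_fraction)
    return model_gb <= usable_gb


def visible_device_count(env_value: Optional[str], physical_gpus: int) -> int:
    """Approximate GPU count exposed for a given CUDA_VISIBLE_DEVICES value.

    None means the variable is unset, so every physical GPU is visible.
    """
    if env_value is None:
        return physical_gpus
    entries = [part for part in env_value.split(",") if part.strip()]
    return min(len(entries), physical_gpus)


# Four 95 GB cards hold the 139 GB model; a single card does not.
print(fits_without_offload(139, [95] * 4))
print(fits_without_offload(139, [95]))
```

In a real session, the value to pass as env_value would be os.environ.get("CUDA_VISIBLE_DEVICES").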


The Six Multi-GPU Bugs

All bugs are in modeling_zaya.py from Zyphra's transformers fork. Find the installed path with:

ZAYA=$(python3 -c "import transformers, os; print(os.path.dirname(transformers.__file__))")/models/zaya/modeling_zaya.py
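
The same lookup can be done from Python without importing transformers, which is handy if the import itself is failing. A minimal sketch, assuming the file sits at models/zaya/modeling_zaya.py under the package root as in the shell command above; find_module_file is a hypothetical helper name.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_module_file(package: str, relative_path: str) -> Optional[Path]:
    """Return the path of a file inside an installed package, or None.

    Uses find_spec, which locates a top-level package without importing it.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None  # package missing, or a namespace package with no __init__
    candidate = Path(spec.origin).parent / relative_path
    return candidate if candidate.is_file() else None


# Prints None when the Zyphra fork is not installed in this environment.
print(find_module_file("transformers", "models/zaya/modeling_zaya.py"))
```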

Bug 1 — Token Embedding Device Mismatch

Error:

RuntimeError: indices should be either on cpu or on the same device
as the indexed tensor (cuda:1)

Where: modeling_zaya.py line ~1584

Cause: input_ids arrive on whichever device the caller placed them, but embed_tokens lives on GPU 0 under device_map="auto". The embedding lookup fails when they differ.

Fix:

sed -i 's/inputs_embeds = self\.embed_tokens(input_ids)/inputs_embeds = self.embed_tokens(input_ids.to(self.embed_tokens.weight.device))/' $ZAYA

Bug 2 — MoE Routing Indices Device Mismatch

Error:

RuntimeError: indices should be either on cpu or on the same device
as the indexed tensor (cuda:1)

Where: modeling_zaya.py line ~1236

Cause: The MoE router runs on GPU 0 and produces indices_flat. The hidden states being routed are on GPU 1+. sort_order, sorted_indices, and original_order all inherit the wrong device from indices_flat.

Fix:

sed -i 's/indices_flat = indices\.view(batch_size \* seq_length)/indices_flat = indices.view(batch_size * seq_length).to(hidden_states_flat.device)/' $ZAYA

Bug 3 — Expert Output Concatenation Cross-Device

Error:

RuntimeError: Expected all tensors to be on the same device, but got
tensors is on cuda:1, different from other tensors on cuda:0
(when checking argument in method wrapper_CUDA_cat)

Where: modeling_zaya.py line ~1180

Cause: With 512 experts spread across multiple GPUs, each expert returns its output on its own device. torch.cat(output_local_list) fails because the list contains tensors from different GPUs.

Fix:

sed -i 's/output_local = torch\.cat(output_local_list, dim=0)/output_local = torch.cat([o.to(permuted_local_hidden_states.device) for o in output_local_list], dim=0)/' $ZAYA

sed -i 's/output_bias_local = torch\.cat(output_bias_list, dim=0)/output_bias_local = torch.cat([o.to(permuted_local_hidden_states.device) for o in output_bias_list], dim=0)/' $ZAYA

Bug 4 — Router Probabilities Device Mismatch

Error:

RuntimeError: Expected all tensors to be on the same device, but found
at least two devices, cuda:1 and cuda:0!

Where: modeling_zaya.py line ~1266

Cause: After fixing Bug 3, expert_output is correctly on the hidden-states device (GPU 1+), but probs is still on GPU 0 where the router lives. The element-wise multiply fails.

Fix:

sed -i 's/expert_output = expert_output \* probs\.unsqueeze(-1)/expert_output = expert_output * probs.to(expert_output.device).unsqueeze(-1)/' $ZAYA

Bug 5 — KV Cache Conv States Device Mismatch

Error:

RuntimeError: Expected all tensors to be on the same device, but got
tensors is on cuda:1, different from other tensors on cuda:0
(when checking argument in method wrapper_CUDA_cat)

Where: modeling_zaya.py line ~363

Cause: During autoregressive generation (the decode step), conv_states from the KV cache are stored on GPU 0, but the layer accessing them during generation runs on GPU 1+. This only appears after the first token — the prefill step passes fine.

Fix:

sed -i 's/qk_packed0_cached = past_key_values\.conv_states\[self\.layer_number\]/qk_packed0_cached = past_key_values.conv_states[self.layer_number].to(qk_packed0.device)/' $ZAYA

Bug 6 — Flash SDPA Non-Contiguous Attention Mask

Error:

RuntimeError: (*bias): last dimension must be contiguous

Where: modeling_zaya.py line ~676

Cause: ZAYA1 uses sliding window attention. The window-size causal mask is constructed as a 3D tensor (1, seq_len, seq_len). When PyTorch's Flash Attention or Memory-Efficient SDPA backends receive this, they internally broadcast/expand it to (batch, heads, seq_len, seq_len). The expand operation creates a non-contiguous view, and the CUDA SDPA kernel requires the last dimension to be contiguous.

Calling .contiguous() on the mask before passing it to SDPA does not fix this: .contiguous() runs before the internal expansion, which then re-introduces non-contiguity. The only reliable fix we found is to force the math SDPA backend, which has no contiguity requirement.

Fix — find the line number of import torch and insert after it:

TORCH_LINE=$(grep -n "^import torch$" $ZAYA | head -1 | cut -d: -f1)
sed -i "${TORCH_LINE}a torch.backends.cuda.enable_flash_sdp(False)\\ntorch.backends.cuda.enable_mem_efficient_sdp(False)\\ntorch.backends.cuda.enable_math_sdp(True)" $ZAYA

Note: This disables Flash Attention globally for the Python process. If you run other models in the same session that rely on Flash Attention for performance, restore the backends after generating from ZAYA1:

torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)

Apply All Patches at Once

ZAYA=$(python3 -c "import transformers, os; print(os.path.dirname(transformers.__file__))")/models/zaya/modeling_zaya.py

# Bug 1: embed_tokens
sed -i 's/inputs_embeds = self\.embed_tokens(input_ids)/inputs_embeds = self.embed_tokens(input_ids.to(self.embed_tokens.weight.device))/' $ZAYA

# Bug 2: MoE routing indices
sed -i 's/indices_flat = indices\.view(batch_size \* seq_length)/indices_flat = indices.view(batch_size * seq_length).to(hidden_states_flat.device)/' $ZAYA

# Bug 3: Expert output cat
sed -i 's/output_local = torch\.cat(output_local_list, dim=0)/output_local = torch.cat([o.to(permuted_local_hidden_states.device) for o in output_local_list], dim=0)/' $ZAYA
sed -i 's/output_bias_local = torch\.cat(output_bias_list, dim=0)/output_bias_local = torch.cat([o.to(permuted_local_hidden_states.device) for o in output_bias_list], dim=0)/' $ZAYA

# Bug 4: Router probs
sed -i 's/expert_output = expert_output \* probs\.unsqueeze(-1)/expert_output = expert_output * probs.to(expert_output.device).unsqueeze(-1)/' $ZAYA

# Bug 5: KV cache conv states
sed -i 's/qk_packed0_cached = past_key_values\.conv_states\[self\.layer_number\]/qk_packed0_cached = past_key_values.conv_states[self.layer_number].to(qk_packed0.device)/' $ZAYA

# Bug 6: Flash SDPA backend
TORCH_LINE=$(grep -n "^import torch$" $ZAYA | head -1 | cut -d: -f1)
sed -i "${TORCH_LINE}a torch.backends.cuda.enable_flash_sdp(False)\\ntorch.backends.cuda.enable_mem_efficient_sdp(False)\\ntorch.backends.cuda.enable_math_sdp(True)" $ZAYA
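
The sed substitutions for Bugs 2 and 5 match a prefix of their own output, so running the script twice appends a second .to(...) call. A patcher that checks for the patched text first avoids this. Below is a minimal sketch of that idea in Python, showing only the Bug 1 and Bug 2 strings from the commands above; apply_patches and PATCHES are hypothetical names, and the remaining patches would be added to the list in the same way.

```python
from typing import Dict, List, Tuple

# (label, original text, patched text), copied from the sed commands above.
PATCHES: List[Tuple[str, str, str]] = [
    (
        "bug1",
        "inputs_embeds = self.embed_tokens(input_ids)",
        "inputs_embeds = self.embed_tokens(input_ids.to(self.embed_tokens.weight.device))",
    ),
    (
        "bug2",
        "indices_flat = indices.view(batch_size * seq_length)",
        "indices_flat = indices.view(batch_size * seq_length).to(hidden_states_flat.device)",
    ),
]


def apply_patches(
    source: str, patches: List[Tuple[str, str, str]]
) -> Tuple[str, Dict[str, str]]:
    """Apply literal replacements once and report what happened to each."""
    status: Dict[str, str] = {}
    for label, old, new in patches:
        if new in source:
            # Check the patched form first: old can be a prefix of new.
            status[label] = "already patched"
        elif old in source:
            source = source.replace(old, new)
            status[label] = "patched"
        else:
            # Upstream may have changed the line; inspect the file by hand.
            status[label] = "not found"
    return source, status


demo = "    indices_flat = indices.view(batch_size * seq_length)\n"
patched, report = apply_patches(demo, PATCHES)
print(report)
```

To use it on the real file, read modeling_zaya.py into a string, keep a backup copy, pass the text through apply_patches, and write the result back only if no entry is reported as "not found".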

Verification

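import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Shard the BF16 weights across every visible GPU.
m = AutoModelForCausalLM.from_pretrained(
    "Zyphra/ZAYA1-74B-preview",
    dtype=torch.bfloat16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("Zyphra/ZAYA1-74B-preview")
# Place the prompt on the device that holds the model's first parameter.
inp = tok("Hello, how are you?", return_tensors="pt").to(next(m.parameters()).device)
# Generating more than one token exercises the decode path (Bug 5), not just prefill.
out = m.generate(**inp, max_new_tokens=20)
print(tok.decode(out[0]))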
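
It is also worth confirming that nothing was offloaded to CPU or disk. Models loaded with device_map="auto" expose their placement as the hf_device_map attribute, a dict from module name to device. The sketch below summarizes such a dict; the helper names and the module names in the example map are illustrative, not taken from ZAYA1.

```python
from collections import Counter
from typing import Dict, List, Mapping, Union

Device = Union[int, str]


def summarize_device_map(device_map: Mapping[str, Device]) -> Dict[str, int]:
    """Count how many modules were placed on each device."""
    counts: Counter = Counter()
    for device in device_map.values():
        # Integer entries are GPU indices; strings are "cpu", "disk", etc.
        key = f"cuda:{device}" if isinstance(device, int) else str(device)
        counts[key] += 1
    return dict(counts)


def offloaded_modules(device_map: Mapping[str, Device]) -> List[str]:
    """List modules placed on CPU or disk instead of a GPU."""
    return sorted(
        name for name, device in device_map.items() if device in ("cpu", "disk")
    )


# Illustrative map; a real one would come from m.hf_device_map.
example = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "lm_head": "cpu",
}
print(summarize_device_map(example))
print(offloaded_modules(example))
```

An empty list from offloaded_modules(m.hf_device_map) indicates the model is fully resident on the GPUs.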

When Will This Be Fixed?

These are bugs in Zyphra's transformers fork (zaya1 branch), not in the model weights. We have posted a summary on the model's Community tab. If Zyphra merges the fixes, the patches above become unnecessary — check the commit history of modeling_zaya.py before applying to avoid double-patching.


Summary Table

| # | Error message (excerpt) | Location | Fix |
|---|---|---|---|
| 1 | indices...on the same device (cuda:1) | embed_tokens call | Move input_ids to embedding device |
| 2 | indices...on the same device (cuda:1) | MoE routing | Move indices_flat to hidden states device |
| 3 | Expected all tensors...wrapper_CUDA_cat | Expert output cat | Move each output to input device before cat |
| 4 | Expected all tensors...cuda:1 and cuda:0 | Router probs multiply | Move probs to expert_output device |
| 5 | Expected all tensors...wrapper_CUDA_cat | KV cache conv_states | Move cached state to current layer device |
| 6 | (*bias): last dimension must be contiguous | SDPA attention | Force math SDPA backend |

Published by RadicalNotionAI. Patches developed and tested on 4× NVIDIA RTX PRO 6000 Blackwell (94.97 GB each), PyTorch 2.11.0+cu130, transformers 4.57.1 (Zyphra fork).
