Running Zyphra ZAYA1-74B-preview on Multi-GPU Hardware: Six Bugs and Their Fixes
ZAYA1-74B-preview is an impressive hybrid SSM-MoE-Attention model from Zyphra. At ~139 GB in BF16, it requires multiple high-VRAM GPUs for full-precision inference. When we attempted to load it with device_map="auto" across four 95 GB GPUs, we encountered a chain of six distinct bugs in Zyphra's custom modeling_zaya.py. Five of them share a root cause: the model's forward pass was written assuming single-GPU placement and never tested across device boundaries. The sixth is an attention-mask contiguity error in the SDPA path.
This article documents each bug, its error message, and the exact fix (a one-line change in most cases). We also document the environment setup issues that must be resolved before the model can load at all.
Environment Setup
1. Install Zyphra's transformers fork
The zaya architecture is not in upstream HuggingFace transformers. You must install Zyphra's fork:
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
This installs transformers==4.57.1 and downgrades huggingface-hub from ~1.14 to 0.36.2.
2. Fix the kernels package conflict
The hub downgrade breaks the kernels package, which uses str | None type annotations not supported by huggingface-hub==0.36.2:
TypeError: Unsupported type for field 'import_name': str | None
Fix:
pip uninstall kernels -y
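Before moving on, it can help to confirm that the environment ended up in the state described above. The following is a minimal sketch using only the standard library. The expected version strings are the ones quoted in this article, and the helper names (installed_version, environment_report) are illustrative, not part of any library.

```python
from importlib import metadata
from typing import Dict, Optional

# Versions quoted in this article; adjust them if the fork moves on.
EXPECTED = {"transformers": "4.57.1", "huggingface-hub": "0.36.2"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> Dict[str, str]:
    """Describe each relevant package as 'ok', 'missing', or 'unexpected <version>'."""
    report = {}
    for package, wanted in EXPECTED.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        elif found == wanted:
            report[package] = "ok"
        else:
            report[package] = f"unexpected {found}"
    # The kernels package should be gone after the uninstall step.
    report["kernels"] = "ok" if installed_version("kernels") is None else "still installed"
    return report


if __name__ == "__main__":
    for package, status in environment_report().items():
        print(f"{package}: {status}")
```

Any line that does not read "ok" points at a setup step that is worth repeating before loading the model.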
3. Expose all GPUs
If CUDA_VISIBLE_DEVICES is set to a single GPU index, the 139 GB model will partially offload to CPU. CPU offload triggers its own cascade of failures. Always unset the variable before loading:
unset CUDA_VISIBLE_DEVICES
With 4× 95 GB GPUs (380 GB total), the model loads cleanly in BF16 via device_map="auto" with no CPU offload.
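The capacity arithmetic behind this can be written down as a rough check. This is a sketch under stated assumptions: the 10% headroom for activations and caches is our own guess, not a measured figure, and fits_on_gpus is a hypothetical helper. In a real session the per-GPU capacities could be read from torch.cuda.get_device_properties(i).total_memory (reported in bytes) instead of being typed in.

```python
from typing import Sequence


def fits_on_gpus(model_gb: float, gpu_gbs: Sequence[float], headroom: float = 0.10) -> bool:
    """Return True if the weights fit in the combined VRAM after reserving headroom."""
    usable = sum(gpu_gbs) * (1.0 - headroom)
    return model_gb <= usable


# Figures from this article: ~139 GB of BF16 weights, 95 GB per GPU.
print(fits_on_gpus(139, [95] * 4))  # four visible GPUs -> True
print(fits_on_gpus(139, [95]))      # one visible GPU, which would force offload -> False
```

The check is deliberately coarse: device_map="auto" places whole modules, so real placement can be tighter than the summed capacity suggests.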
The Six Multi-GPU Bugs
All bugs are in modeling_zaya.py from Zyphra's transformers fork. Find the installed path with:
ZAYA=$(python3 -c "import transformers, os; print(os.path.dirname(transformers.__file__))")/models/zaya/modeling_zaya.py
Bug 1 — Token Embedding Device Mismatch
Error:
RuntimeError: indices should be either on cpu or on the same device
as the indexed tensor (cuda:1)
Where: modeling_zaya.py line ~1584
Cause: input_ids arrive on whichever device the caller placed them, but embed_tokens lives on GPU 0 under device_map="auto". The embedding lookup fails when they differ.
Fix:
sed -i 's/inputs_embeds = self\.embed_tokens(input_ids)/inputs_embeds = self.embed_tokens(input_ids.to(self.embed_tokens.weight.device))/' $ZAYA
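To see which device a given module landed on, the hf_device_map attribute that transformers attaches to a model loaded with a device_map can be inspected. The sketch below does a longest-prefix lookup over such a map. The example keys are hypothetical and only illustrate the shape of the data; the real keys for ZAYA1 should be read from the loaded model.

```python
from typing import Dict, Optional, Union

Device = Union[int, str]


def device_of(module_name: str, device_map: Dict[str, Device]) -> Optional[Device]:
    """Return the device of the longest device-map entry that covers module_name."""
    best_key = None
    for key in device_map:
        # An empty key means "the whole model"; otherwise match on module boundaries.
        covers = key == "" or module_name == key or module_name.startswith(key + ".")
        if covers and (best_key is None or len(key) > len(best_key)):
            best_key = key
    return None if best_key is None else device_map[best_key]


# Hypothetical map for illustration; real keys come from model.hf_device_map.
example_map = {"model.embed_tokens": 0, "model.layers.0": 0, "model.layers.1": 1, "lm_head": 3}
print(device_of("model.embed_tokens", example_map))        # 0
print(device_of("model.layers.1.self_attn", example_map))  # 1
```

A lookup like this makes it easy to confirm that the embedding sits on GPU 0 while later layers sit elsewhere, which is the situation that triggers the mismatch.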
Bug 2 — MoE Routing Indices Device Mismatch
Error:
RuntimeError: indices should be either on cpu or on the same device
as the indexed tensor (cuda:1)
Where: modeling_zaya.py line ~1236
Cause: The MoE router runs on GPU 0 and produces indices_flat. The hidden states being routed are on GPU 1+. sort_order, sorted_indices, and original_order all inherit the wrong device from indices_flat.
Fix:
sed -i 's/indices_flat = indices\.view(batch_size \* seq_length)/indices_flat = indices.view(batch_size * seq_length).to(hidden_states_flat.device)/' $ZAYA
Bug 3 — Expert Output Concatenation Cross-Device
Error:
RuntimeError: Expected all tensors to be on the same device, but got
tensors is on cuda:1, different from other tensors on cuda:0
(when checking argument in method wrapper_CUDA_cat)
Where: modeling_zaya.py line ~1180
Cause: With 512 experts spread across multiple GPUs, each expert returns its output on its own device. torch.cat(output_local_list) fails because the list contains tensors from different GPUs.
Fix:
sed -i 's/output_local = torch\.cat(output_local_list, dim=0)/output_local = torch.cat([o.to(permuted_local_hidden_states.device) for o in output_local_list], dim=0)/' $ZAYA
sed -i 's/output_bias_local = torch\.cat(output_bias_list, dim=0)/output_bias_local = torch.cat([o.to(permuted_local_hidden_states.device) for o in output_bias_list], dim=0)/' $ZAYA
Bug 4 — Router Probabilities Device Mismatch
Error:
RuntimeError: Expected all tensors to be on the same device, but found
at least two devices, cuda:1 and cuda:0!
Where: modeling_zaya.py line ~1266
Cause: After fixing Bug 3, expert_output is correctly on the hidden-states device (GPU 1+), but probs is still on GPU 0 where the router lives. The element-wise multiply fails.
Fix:
sed -i 's/expert_output = expert_output \* probs\.unsqueeze(-1)/expert_output = expert_output * probs.to(expert_output.device).unsqueeze(-1)/' $ZAYA
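The pattern behind the fixes for Bugs 3 and 4 can be shown in isolation. The following toy function is not the code in modeling_zaya.py; its name and shapes are made up for illustration, and it runs on CPU so it only demonstrates the order of operations, not the cross-GPU failure itself.

```python
import torch


def combine_expert_outputs(expert_outputs, probs, target_device):
    """Concatenate per-expert outputs on one device, then weight them by router probabilities."""
    # Bug 3 pattern: move every chunk to a common device before torch.cat.
    merged = torch.cat([chunk.to(target_device) for chunk in expert_outputs], dim=0)
    # Bug 4 pattern: move the router probabilities to wherever the merged output lives.
    return merged * probs.to(merged.device).unsqueeze(-1)


# CPU-only demonstration with made-up shapes: two experts, three tokens, hidden size 4.
chunks = [torch.ones(2, 4), torch.ones(1, 4)]
weights = torch.tensor([0.5, 0.25, 1.0])
result = combine_expert_outputs(chunks, weights, torch.device("cpu"))
print(result.shape)
```

Calling .to() on a tensor that is already on the target device returns the tensor unchanged, so the extra calls cost nothing in the single-GPU case.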
Bug 5 — KV Cache Conv States Device Mismatch
Error:
RuntimeError: Expected all tensors to be on the same device, but got
tensors is on cuda:1, different from other tensors on cuda:0
(when checking argument in method wrapper_CUDA_cat)
Where: modeling_zaya.py line ~363
Cause: During autoregressive generation (the decode step), conv_states from the KV cache are stored on GPU 0, but the layer accessing them during generation runs on GPU 1+. This only appears after the first token — the prefill step passes fine.
Fix:
sed -i 's/qk_packed0_cached = past_key_values\.conv_states\[self\.layer_number\]/qk_packed0_cached = past_key_values.conv_states[self.layer_number].to(qk_packed0.device)/' $ZAYA
Bug 6 — Flash SDPA Non-Contiguous Attention Mask
Error:
RuntimeError: (*bias): last dimension must be contiguous
Where: modeling_zaya.py line ~676
Cause: ZAYA1 uses sliding window attention. The window-size causal mask is constructed as a 3D tensor (1, seq_len, seq_len). When PyTorch's Flash Attention or Memory-Efficient SDPA backends receive this, they internally broadcast/expand it to (batch, heads, seq_len, seq_len). The expand operation creates a non-contiguous view, and the CUDA SDPA kernel requires the last dimension to be contiguous.
Calling .contiguous() on the mask before passing it to SDPA does not fix this — .contiguous() runs before the internal expansion, which then re-introduces non-contiguity. The only reliable fix is to force the math SDPA backend, which has no contiguity requirement.
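Fix: find the line number of import torch and insert the backend switches after it. The guard skips the edit if no such line is found, because sed with an empty address would append the text after every line of the file:
TORCH_LINE=$(grep -n "^import torch$" $ZAYA | head -1 | cut -d: -f1)
[ -n "$TORCH_LINE" ] && sed -i "${TORCH_LINE}a torch.backends.cuda.enable_flash_sdp(False)\\ntorch.backends.cuda.enable_mem_efficient_sdp(False)\\ntorch.backends.cuda.enable_math_sdp(True)" $ZAYA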
Note: This disables Flash Attention globally for the Python process. If you run other models in the same session that rely on Flash Attention for performance, restore the backends after generating from ZAYA1:
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
Apply All Patches at Once
ZAYA=$(python3 -c "import transformers, os; print(os.path.dirname(transformers.__file__))")/models/zaya/modeling_zaya.py
# Bug 1: embed_tokens
sed -i 's/inputs_embeds = self\.embed_tokens(input_ids)/inputs_embeds = self.embed_tokens(input_ids.to(self.embed_tokens.weight.device))/' $ZAYA
# Bug 2: MoE routing indices
sed -i 's/indices_flat = indices\.view(batch_size \* seq_length)/indices_flat = indices.view(batch_size * seq_length).to(hidden_states_flat.device)/' $ZAYA
# Bug 3: Expert output cat
sed -i 's/output_local = torch\.cat(output_local_list, dim=0)/output_local = torch.cat([o.to(permuted_local_hidden_states.device) for o in output_local_list], dim=0)/' $ZAYA
sed -i 's/output_bias_local = torch\.cat(output_bias_list, dim=0)/output_bias_local = torch.cat([o.to(permuted_local_hidden_states.device) for o in output_bias_list], dim=0)/' $ZAYA
# Bug 4: Router probs
sed -i 's/expert_output = expert_output \* probs\.unsqueeze(-1)/expert_output = expert_output * probs.to(expert_output.device).unsqueeze(-1)/' $ZAYA
# Bug 5: KV cache conv states
sed -i 's/qk_packed0_cached = past_key_values\.conv_states\[self\.layer_number\]/qk_packed0_cached = past_key_values.conv_states[self.layer_number].to(qk_packed0.device)/' $ZAYA
# Bug 6: Flash SDPA backend (skipped if "import torch" is not found, since an empty address would append after every line)
TORCH_LINE=$(grep -n "^import torch$" $ZAYA | head -1 | cut -d: -f1)
[ -n "$TORCH_LINE" ] && sed -i "${TORCH_LINE}a torch.backends.cuda.enable_flash_sdp(False)\\ntorch.backends.cuda.enable_mem_efficient_sdp(False)\\ntorch.backends.cuda.enable_math_sdp(True)" $ZAYA
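The sed commands are not all safe to run twice. For Bugs 2 and 5 the search pattern is a prefix of its own replacement, so a second run appends another .to(...) call, and the Bug 6 command inserts its three lines again on every run. A Python equivalent can check for the patched text first. The sketch below covers only the first two patches to stay short; the function names are our own, and the remaining pairs can be added to PATCHES in the same way.

```python
from pathlib import Path
from typing import List, Tuple

# (original snippet, patched snippet) pairs mirroring two of the sed commands above.
PATCHES: List[Tuple[str, str]] = [
    (
        "inputs_embeds = self.embed_tokens(input_ids)",
        "inputs_embeds = self.embed_tokens(input_ids.to(self.embed_tokens.weight.device))",
    ),
    (
        "indices_flat = indices.view(batch_size * seq_length)",
        "indices_flat = indices.view(batch_size * seq_length).to(hidden_states_flat.device)",
    ),
]


def apply_patches(source: str, patches=PATCHES) -> Tuple[str, int]:
    """Apply each patch at most once; return the new source and the number applied."""
    applied = 0
    for old, new in patches:
        if new in source:  # already patched, skip to stay idempotent
            continue
        if old in source:
            source = source.replace(old, new)
            applied += 1
    return source, applied


def patch_file(path: Path) -> int:
    """Patch a file in place, keeping a .bak copy of the original next to it."""
    original = path.read_text()
    patched, applied = apply_patches(original)
    if applied:
        path.with_suffix(path.suffix + ".bak").write_text(original)
        path.write_text(patched)
    return applied
```

Calling patch_file(Path(...)) with the path stored in $ZAYA returns the number of patches applied, which is 0 on a file that has already been patched.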
Verification
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
m = AutoModelForCausalLM.from_pretrained(
    "Zyphra/ZAYA1-74B-preview",
    dtype=torch.bfloat16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("Zyphra/ZAYA1-74B-preview")
inp = tok("Hello, how are you?", return_tensors="pt").to(next(m.parameters()).device)
out = m.generate(**inp, max_new_tokens=20)
print(tok.decode(out[0]))
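Beyond checking that text is generated, it may be worth confirming that nothing was offloaded, since CPU offload causes separate failures. The sketch below scans a device map of the kind transformers exposes as hf_device_map on models loaded with device_map. The offloaded_modules helper and the example entries are illustrative assumptions.

```python
from typing import Dict, List, Union


def offloaded_modules(device_map: Dict[str, Union[int, str]]) -> List[str]:
    """List module names that were placed on CPU or disk rather than a GPU."""
    return sorted(
        name for name, device in device_map.items() if str(device) in {"cpu", "disk"}
    )


# Hypothetical maps for illustration; pass m.hf_device_map in a real session.
all_on_gpu = {"model.embed_tokens": 0, "model.layers.0": 1, "lm_head": 3}
partly_offloaded = {"model.embed_tokens": 0, "model.layers.0": "cpu", "lm_head": "disk"}
print(offloaded_modules(all_on_gpu))
print(offloaded_modules(partly_offloaded))
```

With the model loaded as m above, offloaded_modules(m.hf_device_map) should return an empty list when all four GPUs are visible.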
When Will This Be Fixed?
These are bugs in Zyphra's transformers fork (zaya1 branch), not in the model weights. We have posted a summary on the model's Community tab. If Zyphra merges the fixes, the patches above become unnecessary — check the commit history of modeling_zaya.py before applying to avoid double-patching.
Summary Table
| # | Error message (excerpt) | Location | Fix |
|---|---|---|---|
| 1 | indices...on the same device (cuda:1) | embed_tokens call | Move input_ids to embedding device |
| 2 | indices...on the same device (cuda:1) | MoE routing | Move indices_flat to hidden states device |
| 3 | Expected all tensors...wrapper_CUDA_cat | Expert output cat | Move each output to input device before cat |
| 4 | Expected all tensors...cuda:1 and cuda:0 | Router probs multiply | Move probs to expert_output device |
| 5 | Expected all tensors...wrapper_CUDA_cat | KV cache conv_states | Move cached state to current layer device |
| 6 | (*bias): last dimension must be contiguous | SDPA attention | Force math SDPA backend |
Published by RadicalNotionAI. Patches developed and tested on 4× NVIDIA RTX PRO 6000 Blackwell (94.97 GB each), PyTorch 2.11.0+cu130, transformers 4.57.1 (Zyphra fork).