init

Files changed (11)

Quantispect_RF13_v1.0.10.pt +3 -0
README.md +216 -3
bias_subcard.md +7 -0
code/model/factory.py +62 -0
code/model/predecoder_fasthyper_rf13_v1.py +170 -0
code/model/registry.py +110 -0
code/scripts/local_run.sh +252 -0
code/workflows/config_validator.py +507 -0
code/workflows/run.py +319 -0
conf/config_public.yaml +84 -0
framework.png +0 -0

Quantispect_RF13_v1.0.10.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c899ee6674d1d78bb7570c5284086b4dfc5a2d3aba63fbac6ded0beaecfb831e
+size 2693053
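
Note: the Git LFS pointer above records the checkpoint's SHA-256 digest and byte size, so a downloaded copy can be checked before loading it. The following is a minimal sketch, not part of the commit; the helper names `file_sha256` and `matches_lfs_pointer` are hypothetical, and only the two constants come from the pointer.

```python
import hashlib
from pathlib import Path

# Values copied from the Git LFS pointer in this commit.
EXPECTED_SHA256 = "c899ee6674d1d78bb7570c5284086b4dfc5a2d3aba63fbac6ded0beaecfb831e"
EXPECTED_SIZE = 2693053  # bytes


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so the whole checkpoint need not sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_lfs_pointer(path: Path,
                        sha256: str = EXPECTED_SHA256,
                        size: int = EXPECTED_SIZE) -> bool:
    """Return True only if both the byte size and the SHA-256 digest match."""
    return path.stat().st_size == size and file_sha256(path) == sha256
```

A call such as `matches_lfs_pointer(Path("Quantispect_RF13_v1.0.10.pt"))` returning `False` would suggest that the file is still an unresolved LFS pointer or that the download was truncated.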

README.md CHANGED

@@ -1,3 +1,216 @@
----
-license: apache-2.0
----

+---
+library_name: ising-decoding
+tags:
+  - quantum
+  - qec
+  - error_correction
+  - decoders
+  - surface_code
+  - predecoder
+license: apache-2.0
+---
+# Quantispect Overview
+![Quantispect Neural Pre-Decoder Architecture](framework.png)
+## Model Summary
+| Item | Value |
+|---|---:|
+| Model name | Quantispect |
+| Checkpoint file | `Quantispect_RF13_v1.0.10.pt` |
+| Total parameters | ~0.663M |
+| Checkpoint size | ~2.69 MB (2,693,053 bytes) |
+| Architecture | FastHyper-style 3D CNN neural pre-decoder |
+| Receptive field | R=13 |
+| Input tensor | `(B, 4, T, D, D)` |
+| Output tensor | `(B, 4, T, D, D)` |
+| Release date | April 26, 2026 |
+## Description:
+Quantispect is a compact neural pre-decoder for rotated surface-code quantum error correction. It takes five-dimensional syndrome tensors of shape `(B, 4, T, D, D)` (batch, channel, time, and two spatial dimensions) and predicts local correction maps, which a downstream global decoder such as MWPM / PyMatching or an Ising-decoding post-processing pipeline then consumes.
+Quantispect is designed to run inside an NVIDIA Ising-Decoding-compatible workflow after applying the Quantispect code patch included with this model release.
+## Model Architecture:
+Architecture type: 3D Convolutional Neural Network (3D CNN)
+Network architecture: custom multi-branch spatio-temporal 3D CNN with residual FastHyper blocks.
+### Input
+Input shape:
+```text
+(B, 4, T, D, D)
+```
+### Stem
+```text
+Conv3D 4 -> 96, kernel 3x3x3
+GroupNorm
+GELU
+```
+Stem output shape:
+```text
+(B, 96, T, D, D)
+```
+### Main Body
+The main body contains five repeated `FastHyperBlock` modules:
+```text
+FastHyperBlock x5
+```
+Each `FastHyperBlock` first expands the feature width from 96 to 144 channels with a 1x1x1 convolution, then applies three parallel feature extraction branches:
+```text
+Pre-projection: GroupNorm -> 1x1x1 Conv3D (96 -> 144) -> GELU
+Branch A: Depthwise Conv3D, kernel 1x3x3 -> GELU, spatial branch
+Branch B: Depthwise Conv3D, kernel 3x1x1 -> GELU, temporal branch
+Branch C: GroupNorm -> Grouped Conv3D, kernel 3x3x3, groups=6 -> GELU, joint local spatio-temporal branch
+```
+The three branch outputs are aligned and fused by element-wise summation rather than channel concatenation. The fused feature is then projected and recalibrated:
+```text
+Element-wise sum fusion
+1x1x1 Conv3D projection, 144 -> 96
+GELU
+ChannelGate / SE-style channel attention
+Dropout3D
+Residual connection
+```
+Main body output shape:
+```text
+(B, 96, T, D, D)
+```
+### Head
+```text
+GroupNorm
+1x1x1 Conv3D, 96 -> 96
+GELU
+1x1x1 Conv3D, 96 -> 4
+```
+Output shape:
+```text
+(B, 4, T, D, D)
+```
+The output maps are used by the residual-syndrome construction module and then passed to MWPM / Ising-decoder post-processing.
+## Usage:
+Quantispect is intended to be used with the NVIDIA Ising-Decoding environment:
+```text
+https://github.com/NVIDIA/Ising-Decoding
+```
+A clean NVIDIA Ising-Decoding checkout does not include the Quantispect / FastHyper architecture. To run `Quantispect_RF13_v1.0.10.pt`, first apply the Quantispect code patch included in this model repository.
+### Required code patch files
+The patch package should preserve the following relative paths:
+```text
+quantispect_code_patch/
+├── conf/
+│   └── config_public.yaml
+└── code/
+    ├── model/
+    │   ├── predecoder_fasthyper_rf13_v1.py
+    │   ├── factory.py
+    │   └── registry.py
+    ├── workflows/
+    │   ├── config_validator.py
+    │   └── run.py
+    └── scripts/
+        └── local_run.sh
+```
+These files should be copied into the NVIDIA Ising-Decoding repository with the same relative paths:
+```text
+conf/config_public.yaml                    -> Ising-Decoding/conf/config_public.yaml
+code/model/predecoder_fasthyper_rf13_v1.py -> Ising-Decoding/code/model/predecoder_fasthyper_rf13_v1.py
+code/model/factory.py                      -> Ising-Decoding/code/model/factory.py
+code/model/registry.py                     -> Ising-Decoding/code/model/registry.py
+code/workflows/config_validator.py         -> Ising-Decoding/code/workflows/config_validator.py
+code/workflows/run.py                      -> Ising-Decoding/code/workflows/run.py
+code/scripts/local_run.sh                  -> Ising-Decoding/code/scripts/local_run.sh
+```
+The patch mainly adds the `predecoder_fasthyper_rf13_v1` model implementation, registers `model_id: 6`, adds the Quantispect model hyperparameters to `config_public.yaml`, and enables explicit `.pt` checkpoint loading through `model_checkpoint_file`.
+### Apply the patch
+Run the following from a directory that contains the patch package's `code/` and `conf/` directories alongside the clean NVIDIA Ising-Decoding checkout. The copy overwrites the upstream files of the same name:
+```bash
+cp -r code/* Ising-Decoding/code/
+cp -r conf/* Ising-Decoding/conf/
+```
+Then place the Quantispect checkpoint under the repository model directory:
+```bash
+mkdir -p Ising-Decoding/models
+cp Quantispect_RF13_v1.0.10.pt Ising-Decoding/models/Quantispect_RF13_v1.0.10.pt
+```
+Expected directory layout:
+```text
+Ising-Decoding/
+├── code/
+│   ├── model/
+│   │   ├── predecoder_fasthyper_rf13_v1.py
+│   │   ├── factory.py
+│   │   └── registry.py
+│   ├── workflows/
+│   │   ├── config_validator.py
+│   │   └── run.py
+│   └── scripts/
+│       └── local_run.sh
+├── conf/
+│   └── config_public.yaml
+├── models/
+│   └── Quantispect_RF13_v1.0.10.pt
+└── README.md
+```
+## Inference Deployment:
+Configure the NVIDIA Ising-Decoding repository for inference, apply the Quantispect patch files above, and place the downloaded model checkpoint at `models/Quantispect_RF13_v1.0.10.pt`.
+Run from the repository root:
+```bash
+cd Ising-Decoding
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+PYTHONUNBUFFERED=1 \
+PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
+WORKFLOW=inference \
+EXPERIMENT_NAME=infer_quantispect \
+TORCH_COMPILE=0 \
+EXTRA_PARAMS="+model_checkpoint_file=models/Quantispect_RF13_v1.0.10.pt" \
+bash code/scripts/local_run.sh \
+2>&1 | tee infer_quantispect.log
+```
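
Note: the README's patch mapping and checkpoint location can be checked before launching inference. This is a minimal sketch, assuming the relative paths listed in the README; the function name `missing_files` is hypothetical and is not part of the Ising-Decoding repository.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the README's patch mapping.
PATCH_FILES = [
    "conf/config_public.yaml",
    "code/model/predecoder_fasthyper_rf13_v1.py",
    "code/model/factory.py",
    "code/model/registry.py",
    "code/workflows/config_validator.py",
    "code/workflows/run.py",
    "code/scripts/local_run.sh",
]
# Checkpoint location used by the README's inference command.
CHECKPOINT = "models/Quantispect_RF13_v1.0.10.pt"


def missing_files(repo_root: str) -> List[str]:
    """List every expected file that is absent under repo_root."""
    root = Path(repo_root)
    return [rel for rel in PATCH_FILES + [CHECKPOINT] if not (root / rel).is_file()]
```

An empty list from `missing_files("Ising-Decoding")` only shows that the files are present. It does not show that they are the patched versions, because several of these paths appear to be upstream files that the patch overwrites.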

bias_subcard.md ADDED

	@@ -0,0 +1,7 @@

+# Bias Subcard
+Field | Response
+:-----|:---------
+Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | Not Applicable
+Measures taken to mitigate against unwanted bias: | Not Applicable
+Bias Metric (If Measured): | Not Applicable

code/model/factory.py ADDED

	@@ -0,0 +1,62 @@

+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Factory module for creating models.
+Provides ModelFactory for instantiating pre-decoder models from config.
+"""
+class ModelFactory:
+    @staticmethod
+    def create_model(cfg):
+        if cfg.code == "surface":
+            return ModelFactory._create_surface_model(cfg)
+        else:
+            raise ValueError(f"Unsupported code type: {cfg.code!r}")
+    @staticmethod
+    def _create_surface_model(cfg):
+        if cfg.model.version == "predecoder_memory_v1":
+            from model.predecoder import PreDecoderModelMemory_v1
+            model = PreDecoderModelMemory_v1(cfg)
+            return model
+        elif cfg.model.version == "predecoder_sd_litenet_v1":
+            from model.predecoder_sd_litenet_v1 import PredecoderSDLiteNetV1
+            model = PredecoderSDLiteNetV1(
+                input_channels=getattr(cfg.model, "input_channels", 4),
+                out_channels=getattr(cfg.model, "out_channels", 4),
+                hidden_dim=getattr(cfg.model, "hidden_dim", 64),
+                bottleneck_dim=getattr(cfg.model, "bottleneck_dim", 16),
+                dropout_p=getattr(cfg.model, "dropout_p", 0.05),
+            )
+            return model
+        elif cfg.model.version == "predecoder_fasthyper_rf13_v1":
+            from model.predecoder_fasthyper_rf13_v1 import PredecoderFastHyperRF13V1
+            model = PredecoderFastHyperRF13V1(
+                input_channels=getattr(cfg.model, "input_channels", 4),
+                out_channels=getattr(cfg.model, "out_channels", 4),
+                hidden_dim=getattr(cfg.model, "hidden_dim", 96),
+                mid_dim=getattr(cfg.model, "mid_dim", 144),
+                mix_groups=getattr(cfg.model, "mix_groups", 6),
+                num_blocks=getattr(cfg.model, "num_blocks", 5),
+                stem_kernel_size=getattr(cfg.model, "stem_kernel_size", 3),
+                dropout_p=getattr(cfg.model, "dropout_p", 0.02),
+                gate_reduction=getattr(cfg.model, "gate_reduction", 4),
+            )
+            return model
+        else:
+            raise ValueError(f"Invalid model version: {cfg.model.version}")
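
Note: the `predecoder_fasthyper_rf13_v1` branch of the factory reads every hyperparameter with `getattr(cfg.model, name, default)`, so a config that omits a field silently falls back to the released architecture. The sketch below illustrates that lookup pattern only. `SimpleNamespace` is a stand-in for the real config node, and `resolve_fasthyper_kwargs` is a hypothetical helper; an OmegaConf node may behave differently for missing keys depending on its version and struct flags.

```python
from types import SimpleNamespace

# Defaults copied from the predecoder_fasthyper_rf13_v1 branch of ModelFactory.
FASTHYPER_DEFAULTS = {
    "input_channels": 4,
    "out_channels": 4,
    "hidden_dim": 96,
    "mid_dim": 144,
    "mix_groups": 6,
    "num_blocks": 5,
    "stem_kernel_size": 3,
    "dropout_p": 0.02,
    "gate_reduction": 4,
}


def resolve_fasthyper_kwargs(model_cfg) -> dict:
    """Mirror the factory's getattr(cfg.model, name, default) lookups."""
    return {name: getattr(model_cfg, name, default)
            for name, default in FASTHYPER_DEFAULTS.items()}


# A partial config: only num_blocks is overridden, everything else falls back.
partial_cfg = SimpleNamespace(version="predecoder_fasthyper_rf13_v1", num_blocks=3)
kwargs = resolve_fasthyper_kwargs(partial_cfg)
print(kwargs["num_blocks"], kwargs["hidden_dim"])  # prints: 3 96
```

One consequence of this pattern is that a misspelled key is ignored rather than rejected, so loading the released checkpoint relies on the config either matching these defaults or omitting the fields.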

code/model/predecoder_fasthyper_rf13_v1.py ADDED

	@@ -0,0 +1,170 @@

+from __future__ import annotations
+from typing import Any
+import torch
+import torch.nn as nn
+def _choose_gn_groups(channels: int, max_groups: int = 8) -> int:
+    for g in range(min(max_groups, channels), 0, -1):
+        if channels % g == 0:
+            return g
+    return 1
+class _ChannelGate(nn.Module):
+    def __init__(self, channels: int, reduction: int = 4) -> None:
+        super().__init__()
+        hidden = max(channels // reduction, 8)
+        self.pool = nn.AdaptiveAvgPool3d(1)
+        self.fc1 = nn.Conv3d(channels, hidden, kernel_size=1, bias=True)
+        self.act = nn.GELU()
+        self.fc2 = nn.Conv3d(hidden, channels, kernel_size=1, bias=True)
+        self.gate = nn.Sigmoid()
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        s = self.pool(x)
+        s = self.fc1(s)
+        s = self.act(s)
+        s = self.fc2(s)
+        return x * self.gate(s)
+class _FastHyperBlock(nn.Module):
+    """
+    Efficient RF-expanding residual block.
+    Each block contributes one effective k=3 receptive-field expansion stage via
+    three parallel branches operating on the same expanded activation:
+      - spatial depthwise (1,3,3)
+      - temporal depthwise (3,1,1)
+      - grouped 3D mixing (3,3,3)
+    """
+    def __init__(
+        self,
+        channels: int,
+        mid_dim: int,
+        mix_groups: int = 6,
+        dropout_p: float = 0.02,
+        gate_reduction: int = 4,
+    ) -> None:
+        super().__init__()
+        gn1 = _choose_gn_groups(channels)
+        gn2 = _choose_gn_groups(mid_dim)
+        mix_groups = max(1, min(mix_groups, mid_dim))
+        while mid_dim % mix_groups != 0 and mix_groups > 1:
+            mix_groups -= 1
+        self.pre = nn.Sequential(
+            nn.GroupNorm(gn1, channels),
+            nn.Conv3d(channels, mid_dim, kernel_size=1, bias=True),
+            nn.GELU(),
+        )
+        self.spatial = nn.Sequential(
+            nn.Conv3d(
+                mid_dim,
+                mid_dim,
+                kernel_size=(1, 3, 3),
+                padding=(0, 1, 1),
+                groups=mid_dim,
+                bias=True,
+            ),
+            nn.GELU(),
+        )
+        self.temporal = nn.Sequential(
+            nn.Conv3d(
+                mid_dim,
+                mid_dim,
+                kernel_size=(3, 1, 1),
+                padding=(1, 0, 0),
+                groups=mid_dim,
+                bias=True,
+            ),
+            nn.GELU(),
+        )
+        self.mixed = nn.Sequential(
+            nn.GroupNorm(gn2, mid_dim),
+            nn.Conv3d(
+                mid_dim,
+                mid_dim,
+                kernel_size=3,
+                padding=1,
+                groups=mix_groups,
+                bias=True,
+            ),
+            nn.GELU(),
+        )
+        self.fuse = nn.Sequential(
+            nn.Conv3d(mid_dim, channels, kernel_size=1, bias=True),
+            nn.GELU(),
+        )
+        self.gate = _ChannelGate(channels, reduction=gate_reduction)
+        self.dropout = nn.Dropout3d(dropout_p) if dropout_p > 0 else nn.Identity()
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        h = self.pre(x)
+        h = self.spatial(h) + self.temporal(h) + self.mixed(h)
+        h = self.fuse(h)
+        h = self.gate(h)
+        h = self.dropout(h)
+        return x + h
+class PredecoderFastHyperRF13V1(nn.Module):
+    """
+    Faster, stronger candidate for public model_id 6 under the public Ising-Decoding API.
+    Input / output shape:
+        (B, 4, T, D, D) -> (B, 4, T, D, D)
+    """
+    def __init__(
+        self,
+        input_channels: int = 4,
+        out_channels: int = 4,
+        hidden_dim: int = 96,
+        mid_dim: int = 144,
+        mix_groups: int = 6,
+        num_blocks: int = 5,
+        stem_kernel_size: int = 3,
+        dropout_p: float = 0.02,
+        gate_reduction: int = 4,
+        **_: Any,
+    ) -> None:
+        super().__init__()
+        pad = stem_kernel_size // 2
+        gn = _choose_gn_groups(hidden_dim)
+        self.stem = nn.Sequential(
+            nn.Conv3d(
+                input_channels,
+                hidden_dim,
+                kernel_size=stem_kernel_size,
+                padding=pad,
+                bias=True,
+            ),
+            nn.GroupNorm(gn, hidden_dim),
+            nn.GELU(),
+        )
+        self.blocks = nn.Sequential(*[
+            _FastHyperBlock(
+                channels=hidden_dim,
+                mid_dim=mid_dim,
+                mix_groups=mix_groups,
+                dropout_p=dropout_p,
+                gate_reduction=gate_reduction,
+            ) for _ in range(num_blocks)
+        ])
+        self.head = nn.Sequential(
+            nn.GroupNorm(gn, hidden_dim),
+            nn.Conv3d(hidden_dim, hidden_dim, kernel_size=1, bias=True),
+            nn.GELU(),
+            nn.Conv3d(hidden_dim, out_channels, kernel_size=1, bias=True),
+        )
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x = self.stem(x)
+        x = self.blocks(x)
+        x = self.head(x)
+        return x
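
Note: the README reports roughly 0.663M parameters. That figure can be reproduced by hand from the module definitions above without importing PyTorch. The sketch below is a back-of-the-envelope count, assuming the default hyperparameters, bias terms on every convolution (as in the code), and affine GroupNorm (the PyTorch default). All function names are hypothetical.

```python
from typing import Tuple


def conv3d_params(c_in: int, c_out: int, kernel: Tuple[int, int, int],
                  groups: int = 1, bias: bool = True) -> int:
    """Weights plus optional bias of a Conv3d layer."""
    kt, kh, kw = kernel
    weights = c_out * (c_in // groups) * kt * kh * kw
    return weights + (c_out if bias else 0)


def group_norm_params(channels: int) -> int:
    """Affine GroupNorm has one scale and one shift per channel."""
    return 2 * channels


def block_params(channels: int = 96, mid_dim: int = 144,
                 mix_groups: int = 6, gate_reduction: int = 4) -> int:
    """Parameter count of one _FastHyperBlock."""
    gate_hidden = max(channels // gate_reduction, 8)
    return sum([
        group_norm_params(channels),                                    # pre: GroupNorm
        conv3d_params(channels, mid_dim, (1, 1, 1)),                    # pre: 1x1x1 expansion
        conv3d_params(mid_dim, mid_dim, (1, 3, 3), groups=mid_dim),     # spatial depthwise
        conv3d_params(mid_dim, mid_dim, (3, 1, 1), groups=mid_dim),     # temporal depthwise
        group_norm_params(mid_dim),                                     # mixed: GroupNorm
        conv3d_params(mid_dim, mid_dim, (3, 3, 3), groups=mix_groups),  # mixed: grouped conv
        conv3d_params(mid_dim, channels, (1, 1, 1)),                    # fuse projection
        conv3d_params(channels, gate_hidden, (1, 1, 1)),                # channel gate fc1
        conv3d_params(gate_hidden, channels, (1, 1, 1)),                # channel gate fc2
    ])


def total_params(in_ch: int = 4, out_ch: int = 4,
                 hidden: int = 96, num_blocks: int = 5) -> int:
    """Stem + repeated blocks + head, with a 3x3x3 stem kernel."""
    stem = conv3d_params(in_ch, hidden, (3, 3, 3)) + group_norm_params(hidden)
    head = (group_norm_params(hidden)
            + conv3d_params(hidden, hidden, (1, 1, 1))
            + conv3d_params(hidden, out_ch, (1, 1, 1)))
    return stem + num_blocks * block_params(channels=hidden) + head


print(total_params())  # prints: 663388
```

Under these assumptions the count is 663,388, which agrees with the ~0.663M in the README. The grouped 3x3x3 convolution accounts for most of each block (93,456 of 128,568 parameters).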

code/model/registry.py ADDED

	@@ -0,0 +1,110 @@

+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Public model registry for the early-access public release.
+External users choose `model_id` in {1..6}. This registry maps model_id to:
+- the underlying architecture parameters (num_filters, kernel_size)
+- the model receptive field R (in rounds / distance units)
+- the model implementation to instantiate (model_version)
+Receptive field convention matches `compare_receptive_field_with_window_data`
+in `code/training/utils.py`:
+  R = 1 + sum_i (k_i - 1)   for kernel sizes k_i (assumed odd, with same-padding)
+"""
+from __future__ import annotations
+from dataclasses import dataclass
+from typing import Dict, List
+def compute_receptive_field(kernel_sizes: List[int]) -> int:
+    """Compute receptive field R from a list of kernel sizes."""
+    if not kernel_sizes:
+        raise ValueError("kernel_sizes must be non-empty")
+    if any(not isinstance(k, int) for k in kernel_sizes):
+        raise ValueError(f"kernel_sizes must be ints, got: {kernel_sizes!r}")
+    if any(k <= 0 for k in kernel_sizes):
+        raise ValueError(f"kernel_sizes must be positive, got: {kernel_sizes!r}")
+    return 1 + sum(kernel_sizes) - len(kernel_sizes)
+@dataclass(frozen=True)
+class PublicModelSpec:
+    model_id: int
+    num_filters: List[int]
+    kernel_size: List[int]
+    receptive_field: int
+    model_version: str = "predecoder_memory_v1"
+_MODEL_SPECS: Dict[int, PublicModelSpec] = {
+    1:
+        PublicModelSpec(
+            model_id=1,
+            num_filters=[128, 128, 128, 4],
+            kernel_size=[3, 3, 3, 3],
+            receptive_field=compute_receptive_field([3, 3, 3, 3]),
+        ),
+    2:
+        PublicModelSpec(
+            model_id=2,
+            num_filters=[256, 256, 256, 4],
+            kernel_size=[3, 3, 3, 3],
+            receptive_field=compute_receptive_field([3, 3, 3, 3]),
+        ),
+    3:
+        PublicModelSpec(
+            model_id=3,
+            num_filters=[128, 128, 128, 4],
+            kernel_size=[5, 5, 5, 5],
+            receptive_field=compute_receptive_field([5, 5, 5, 5]),
+        ),
+    4:
+        PublicModelSpec(
+            model_id=4,
+            num_filters=[128, 128, 128, 128, 128, 4],
+            kernel_size=[3, 3, 3, 3, 3, 3],
+            receptive_field=compute_receptive_field([3, 3, 3, 3, 3, 3]),
+        ),
+    5:
+        PublicModelSpec(
+            model_id=5,
+            num_filters=[256, 256, 256, 256, 256, 4],
+            kernel_size=[3, 3, 3, 3, 3, 3],
+            receptive_field=compute_receptive_field([3, 3, 3, 3, 3, 3]),
+        ),
+    6:
+        PublicModelSpec(
+            model_id=6,
+            num_filters=[96, 96, 96, 96, 96, 4],
+            kernel_size=[3, 3, 3, 3, 3, 3],
+            receptive_field=compute_receptive_field([3, 3, 3, 3, 3, 3]),
+            model_version="predecoder_fasthyper_rf13_v1",
+        ),
+}
+def get_model_spec(model_id: int) -> PublicModelSpec:
+    """Return the public model spec for a given model_id (1..6)."""
+    try:
+        mid = int(model_id)
+    except Exception as e:
+        raise ValueError(f"model_id must be an int in [1..6], got: {model_id!r}") from e
+    if mid == 0:
+        raise ValueError("model_id=0 is not supported in the public release")
+    if mid not in _MODEL_SPECS:
+        raise ValueError(f"model_id must be in [1..6], got: {mid}")
+    return _MODEL_SPECS[mid]
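
Note: the receptive-field convention in the registry docstring, R = 1 + sum_i (k_i - 1), can be worked through for each public model. The sketch below restates the formula and adds the clamp described in the `config_validator.py` docstring of this commit (D = min(distance, R), N_R = min(n_rounds, R)). The names `receptive_field` and `clamp_to_receptive_field` are hypothetical, and the clamp is written from that docstring rather than from the validator's implementation.

```python
from typing import Dict, List, Tuple


def receptive_field(kernel_sizes: List[int]) -> int:
    """R = 1 + sum_i (k_i - 1), for odd kernels with same-padding."""
    return 1 + sum(k - 1 for k in kernel_sizes)


# Kernel-size lists copied from _MODEL_SPECS in registry.py.
KERNELS_BY_MODEL_ID: Dict[int, List[int]] = {
    1: [3, 3, 3, 3],
    2: [3, 3, 3, 3],
    3: [5, 5, 5, 5],
    4: [3, 3, 3, 3, 3, 3],
    5: [3, 3, 3, 3, 3, 3],
    6: [3, 3, 3, 3, 3, 3],
}


def clamp_to_receptive_field(distance: int, n_rounds: int, model_id: int) -> Tuple[int, int]:
    """Clamp (distance, n_rounds) to the model's receptive field R."""
    r = receptive_field(KERNELS_BY_MODEL_ID[model_id])
    return min(distance, r), min(n_rounds, r)


for model_id, kernels in KERNELS_BY_MODEL_ID.items():
    print(model_id, receptive_field(kernels))
```

For `model_id: 6` the six size-3 stages (one stem plus five blocks) give R = 1 + 6 * 2 = 13, matching the "RF13" in the checkpoint name. Under the stated clamp, a request for distance 21 and 21 rounds would be reduced to (13, 13).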

code/scripts/local_run.sh ADDED

	@@ -0,0 +1,252 @@

+#!/usr/bin/env bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+set -euo pipefail
+# Minimal local runner.
+#
+# Examples:
+#   bash code/scripts/local_run.sh
+#   WORKFLOW=inference bash code/scripts/local_run.sh
+#   GPUS=4 bash code/scripts/local_run.sh
+#   CUDA_VISIBLE_DEVICES=1 bash code/scripts/local_run.sh        # use only GPU 1
+#
+# ONNX / TRT fast inference (requires tensorrt; set ONNX_WORKFLOW before running):
+#   ONNX_WORKFLOW=1 WORKFLOW=inference bash code/scripts/local_run.sh          # export ONNX only (inspect/reuse later)
+#   ONNX_WORKFLOW=2 WORKFLOW=inference bash code/scripts/local_run.sh          # export ONNX + build TRT + run TRT inference
+#   ONNX_WORKFLOW=2 QUANT_FORMAT=int8 WORKFLOW=inference bash code/scripts/local_run.sh   # INT8 quantized TRT
+#   ONNX_WORKFLOW=2 QUANT_FORMAT=fp8  WORKFLOW=inference bash code/scripts/local_run.sh   # FP8 quantized TRT (requires nvidia-modelopt)
+#   ONNX_WORKFLOW=3 WORKFLOW=inference bash code/scripts/local_run.sh          # load pre-built engine, skip export
+#
+# Decoder ablation study with cudaq-qec global decoders (requires cudaq-qec):
+#   WORKFLOW=decoder_ablation bash code/scripts/local_run.sh
+#
+# Decoder ablation with TRT pre-decoder + cudaq-qec global decoders
+# (combines fast TRT inference for the neural pre-decoder with GPU-accelerated
+#  cudaq-qec decoders for the residual syndromes — full GPU pipeline end-to-end):
+#   ONNX_WORKFLOW=2 WORKFLOW=decoder_ablation bash code/scripts/local_run.sh   # export+build TRT, then ablation
+#   ONNX_WORKFLOW=3 WORKFLOW=decoder_ablation bash code/scripts/local_run.sh   # load existing engine, then ablation
+#
+# Notes:
+# - Public config is `conf/config_public.yaml`. Users should edit only that file.
+# - Training knobs are auto-managed in code (epochs, shots/epoch, batch schedule, etc.).
+# - SafeTensors (optional): after training, convert the best .pt checkpoint with
+#     code/export/checkpoint_to_safetensors.py (see README), then pass the result as:
+#     PREDECODER_SAFETENSORS_CHECKPOINT=<path>.safetensors WORKFLOW=inference bash code/scripts/local_run.sh
+EXPERIMENT_NAME="${EXPERIMENT_NAME:-test1}"
+CONFIG_NAME="${CONFIG_NAME:-config_public}"   # conf/<name>.yaml (no extension)
+WORKFLOW="${WORKFLOW:-train}"                 # train | inference | decoder_ablation
+WORKFLOW="$(echo "${WORKFLOW}" | tr '[:upper:]' '[:lower:]')"
+GPUS="${GPUS:-}"                              # if empty, auto-detect
+FRESH_START="${FRESH_START:-0}"               # 1 => don't load checkpoint
+EXTRA_PARAMS="${EXTRA_PARAMS:-}"              # advanced hydra overrides (discouraged)
+TORCH_COMPILE="${TORCH_COMPILE:-}"            # 0/1 to disable/enable torch.compile
+TORCH_COMPILE_MODE="${TORCH_COMPILE_MODE:-}"  # optional: default | reduce-overhead | max-autotune
+DISTANCE="${DISTANCE:-}"
+N_ROUNDS="${N_ROUNDS:-}"
+if [ $# -eq 1 ]; then DISTANCE="$1"; fi
+if [ $# -eq 2 ]; then DISTANCE="$1"; N_ROUNDS="$2"; fi
+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
+# local_run.sh lives at: <repo_root>/code/scripts/local_run.sh
+# so repo_root is two levels up from SCRIPT_DIR.
+REPO_ROOT="$(cd -- "${SCRIPT_DIR}/../.." && pwd)"
+CODE_ROOT="${CODE_ROOT:-${REPO_ROOT}/code}"
+# Default output locations live inside the repo (avoid surprises from generic env vars).
+# Some environments set BASE_OUTPUT_DIR/LOG_BASE_DIR globally; ignore those by default to
+# prevent creating confusing extra folders like /root/outputs or /root/logs.
+if [ -n "${BASE_OUTPUT_DIR:-}" ] || [ -n "${LOG_BASE_DIR:-}" ]; then
+  echo "[local_run.sh] Note: ignoring BASE_OUTPUT_DIR/LOG_BASE_DIR from the environment."
+  echo "[local_run.sh] To override paths, use PREDECODER_BASE_OUTPUT_DIR / PREDECODER_LOG_BASE_DIR."
+fi
+BASE_OUTPUT_DIR="${PREDECODER_BASE_OUTPUT_DIR:-${REPO_ROOT}/outputs}"
+LOG_BASE_DIR="${PREDECODER_LOG_BASE_DIR:-${REPO_ROOT}/logs}"
+mkdir -p "${BASE_OUTPUT_DIR}" "${LOG_BASE_DIR}"
+if [ "${FRESH_START}" -eq 1 ]; then
+  RESUME_FLAG="++load_checkpoint=False"
+else
+  RESUME_FLAG="++load_checkpoint=True"
+fi
+# GPU-only runs: require a visible GPU and nvidia-smi.
+if ! command -v nvidia-smi >/dev/null 2>&1; then
+  echo "[local_run.sh] Error: GPU-only mode requires nvidia-smi on PATH." >&2
+  echo "[local_run.sh] Hint: run on a GPU host or pass CUDA_VISIBLE_DEVICES." >&2
+  exit 1
+fi
+# Respect CUDA_VISIBLE_DEVICES if set; otherwise auto-detect via nvidia-smi.
+if [ -z "${GPUS}" ]; then
+  if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
+    GPUS="$(python3 - <<'PY'
+import os
+v=os.environ.get('CUDA_VISIBLE_DEVICES','').strip()
+print(len([x for x in v.split(',') if x.strip()]) or 1)
+PY
+)"
+  else
+    GPUS="$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l | tr -d ' ')"
+  fi
+fi
+if [ "${GPUS}" -le 0 ]; then
+  echo "[local_run.sh] Error: no GPUs detected. GPU-only mode requires CUDA." >&2
+  exit 1
+fi
+if [ -z "${MASTER_PORT:-}" ]; then
+  MASTER_PORT="$(python3 - <<'PY'
+import socket
+s=socket.socket()
+s.bind(('127.0.0.1', 0))
+print(s.getsockname()[1])
+s.close()
+PY
+)"
+  export MASTER_PORT
+fi
+TIMESTAMP="$(date +%Y%m%d_%H%M%S)"
+# Add nanoseconds to avoid collisions when launching multiple runs within the same second.
+TIMESTAMP_NS="$(date +%Y%m%d_%H%M%S_%N)"
+RUN_ID="${EXPERIMENT_NAME}_${TIMESTAMP}"
+LOG_DIR="${LOG_BASE_DIR}/${RUN_ID}"
+OUTPUT_DIR="${BASE_OUTPUT_DIR}/${EXPERIMENT_NAME}"
+CHECKPOINT_DIR="${OUTPUT_DIR}/models"
+mkdir -p "${LOG_DIR}" "${OUTPUT_DIR}" "${CHECKPOINT_DIR}"
+# Force Hydra run dir to writable OUTPUT_DIR (avoids read-only repo/outputs in containers)
+OVERRIDES="hydra.run.dir=${OUTPUT_DIR}"
+if [ -n "${DISTANCE}" ]; then OVERRIDES+=" distance=${DISTANCE}"; fi
+if [ -n "${N_ROUNDS}" ]; then OVERRIDES+=" n_rounds=${N_ROUNDS}"; fi
+if [ -n "${EXTRA_PARAMS}" ]; then OVERRIDES+=" ${EXTRA_PARAMS}"; fi
+CONFIG_SNAPSHOT_DIR="${OUTPUT_DIR}/config"
+mkdir -p "${CONFIG_SNAPSHOT_DIR}"
+CONFIG_PATH="${REPO_ROOT}/conf/${CONFIG_NAME}.yaml"
+if [ -f "${CONFIG_PATH}" ]; then
+  # Never overwrite existing snapshots: keep full history.
+  base_yaml="${CONFIG_SNAPSHOT_DIR}/${CONFIG_NAME}_${TIMESTAMP_NS}.yaml"
+  dest_yaml="${base_yaml}"
+  i=0
+  while [ -e "${dest_yaml}" ]; do
+    i=$((i+1))
+    dest_yaml="${base_yaml%.yaml}_${i}.yaml"
+  done
+  cp "${CONFIG_PATH}" "${dest_yaml}"
+  # Also save the exact CLI overrides used for this run (useful when configs change over time).
+  base_ovr="${CONFIG_SNAPSHOT_DIR}/${CONFIG_NAME}_${TIMESTAMP_NS}.overrides.txt"
+  dest_ovr="${base_ovr}"
+  j=0
+  while [ -e "${dest_ovr}" ]; do
+    j=$((j+1))
+    dest_ovr="${base_ovr%.txt}_${j}.txt"
+  done
+  {
+    echo "workflow.task=${WORKFLOW}"
+    echo "exp_tag=${EXPERIMENT_NAME}"
+    echo "${RESUME_FLAG}"
+    echo "${OVERRIDES:-}"
+  } > "${dest_ovr}"
+else
+  echo "[local_run.sh] Warning: could not find config file to snapshot: ${CONFIG_PATH}"
+fi
+echo "=========================================="
+echo "Local run"
+echo "=========================================="
+echo "workflow.task: ${WORKFLOW}"
+echo "config: ${CONFIG_NAME}"
+echo "GPUS: ${GPUS} (CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>})"
+echo "output: ${OUTPUT_DIR}"
+echo "logs: ${LOG_DIR}"
+echo "overrides: ${OVERRIDES:-<none>}"
+echo "=========================================="
+export PYTHONPATH="${CODE_ROOT}:${PYTHONPATH:-}"
+export HDF5_USE_FILE_LOCKING=FALSE
+export CUDNN_V8_API_ENABLED=1
+export OMP_NUM_THREADS="$(nproc)"
+export JOB_START_TIMESTAMP="$(date +%s)"
+export JOB_START_DATETIME="$(date)"
+if [ -n "${TORCH_COMPILE}" ]; then
+  export PREDECODER_TORCH_COMPILE="${TORCH_COMPILE}"
+fi
+if [ -n "${TORCH_COMPILE_MODE}" ]; then
+  export PREDECODER_TORCH_COMPILE_MODE="${TORCH_COMPILE_MODE}"
+fi
+# Prefer PREDECODER_PYTHON (cluster/container venv) when set
+PYTHON_BIN="${PYTHON_BIN:-${PREDECODER_PYTHON:-python}}"
+if ! command -v "${PYTHON_BIN}" >/dev/null 2>&1; then
+  if command -v python3 >/dev/null 2>&1; then
+    PYTHON_BIN="python3"
+  else
+    echo "[local_run.sh] Error: no python interpreter found on PATH." >&2
+    exit 1
+  fi
+fi
+# Ensure CUDA is usable before launching the workflow.
+if ! "${PYTHON_BIN}" - <<'PY'
+import sys
+try:
+    import torch
+except Exception as exc:
+    print(f"[local_run.sh] Error: PyTorch is required for GPU-only runs ({exc}).", file=sys.stderr)
+    sys.exit(1)
+if not torch.cuda.is_available():
+    print("[local_run.sh] Error: torch.cuda.is_available() is false. GPU-only mode requires CUDA.", file=sys.stderr)
+    sys.exit(1)
+PY
+then
+  exit 1
+fi
+# Run from repo root so config defaults like `output: outputs/${exp_tag}` land in <repo_root>/outputs.
+cd "${REPO_ROOT}"
+LOG_FILE="${LOG_DIR}/${WORKFLOW}.log"
+if [ "${GPUS}" -gt 1 ]; then
+  "${PYTHON_BIN}" -m torch.distributed.run \
+    --nproc_per_node="${GPUS}" \
+    --nnodes=1 \
+    --node_rank=0 \
+    --master_port="${MASTER_PORT}" \
+    code/workflows/run.py \
+    --config-name="${CONFIG_NAME}" \
+    workflow.task="${WORKFLOW}" \
+    +exp_tag="${EXPERIMENT_NAME}" \
+    ${RESUME_FLAG} \
+    ${OVERRIDES} \
+    2>&1 | tee -a "${LOG_FILE}"
+else
+  "${PYTHON_BIN}" -u code/workflows/run.py \
+    --config-name="${CONFIG_NAME}" \
+    workflow.task="${WORKFLOW}" \
+    +exp_tag="${EXPERIMENT_NAME}" \
+    ${RESUME_FLAG} \
+    ${OVERRIDES} \
+    2>&1 | tee -a "${LOG_FILE}"
+fi
+cp -f "${LOG_FILE}" "${OUTPUT_DIR}/run.log"
+echo "Done. Log: ${LOG_FILE}"
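
Note: when `GPUS` is unset and `CUDA_VISIBLE_DEVICES` is non-empty, the script derives the process count from the comma-separated device list and then chooses between `torch.distributed.run` and a single-process launch. The sketch below mirrors that logic in isolation. The function names are hypothetical, and the `nvidia-smi` fallback taken when `CUDA_VISIBLE_DEVICES` is unset or empty is deliberately left out.

```python
def visible_gpu_count(cuda_visible_devices: str) -> int:
    """Count non-empty comma-separated entries, falling back to 1 as the script does."""
    entries = [x for x in cuda_visible_devices.strip().split(",") if x.strip()]
    return len(entries) or 1


def launcher(gpus: int) -> str:
    """The script uses torch.distributed.run only when more than one GPU is selected."""
    return "torch.distributed.run" if gpus > 1 else "single-process"


for value in ("0,1,2,3", "1", "0, 2,"):
    count = visible_gpu_count(value)
    print(repr(value), count, launcher(count))
```

With the README's `CUDA_VISIBLE_DEVICES=0,1,2,3` this yields four processes and the distributed branch, while `CUDA_VISIBLE_DEVICES=1` runs `code/workflows/run.py` directly in a single process.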

code/workflows/config_validator.py ADDED

	@@ -0,0 +1,507 @@

+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Public config normalization / validation for the early-access public release.
+Responsibilities:
+- Fail-fast if the user tries to set hidden/experimental fields (via Hydra CLI `+foo=...`)
+- Merge in hidden defaults (sourced from model_1_d9 config) so training runs with a minimal public config
+- Apply the selected public model architecture (model_id -> model.*)
+- Clamp distance/n_rounds to the model receptive field:
+    D = min(distance, R)
+    N_R = min(n_rounds, R)
+"""
+from __future__ import annotations
+from pathlib import Path
+import os
+from typing import Any, Dict, Iterable, Tuple
+from omegaconf import DictConfig, OmegaConf
+from model.registry import PublicModelSpec, get_model_spec
+_PUBLIC_ROTATION_TO_INTERNAL = {
+    # Public user-facing aliases
+    "O1": "XV",
+    "O2": "XH",
+    "O3": "ZV",
+    "O4": "ZH",
+}
+_INTERNAL_ROTATION_TO_PUBLIC = {v: k for k, v in _PUBLIC_ROTATION_TO_INTERNAL.items()}
+_PUBLIC_MODEL_ID_TO_LR = {
+    1: 3e-4,
+    2: 2e-4,
+    3: 1e-4,
+    4: 2e-4,
+    5: 1e-4,
+    6: 2e-4,
+}
+def _default_precomputed_frames_dir() -> str:
+    """
+    Default location for precomputed frames shipped with (or generated inside) this repo.
+    We compute this path relative to the codebase so it is stable regardless of the user's
+    current working directory.
+    """
+    # .../<repo>/code/workflows/config_validator.py -> repo root is parents[2]
+    repo_root = Path(__file__).resolve().parents[2]
+    return str((repo_root / "frames_data").resolve())
+def _get_env_bool(name: str, default: bool) -> bool:
+    raw = os.environ.get(name)
+    if raw is None:
+        return default
+    val = str(raw).strip().lower()
+    if val in ("0", "false", "no", "off", ""):
+        return False
+    return True
+def _normalize_code_rotation(value: Any) -> str:
+    """
+    Normalize code rotation values.
+    Public config accepts O1..O4 for user convenience. Internally we keep using:
+    XV, XH, ZV, ZH (as expected by SurfaceCode / MemoryCircuit).
+    """
+    if value is None:
+        return value
+    s = str(value).strip().upper()
+    if s in _PUBLIC_ROTATION_TO_INTERNAL:
+        return _PUBLIC_ROTATION_TO_INTERNAL[s]
+    if s in _INTERNAL_ROTATION_TO_PUBLIC:
+        return s
+    raise ValueError(
+        f"Invalid data.code_rotation={value!r}. "
+        f"Use one of {sorted(_PUBLIC_ROTATION_TO_INTERNAL.keys())} (public) "
+        f"or {sorted(_INTERNAL_ROTATION_TO_PUBLIC.keys())} (internal)."
+    )
+def _base_hidden_defaults_dict() -> Dict[str, Any]:
+    """
+    Baseline config used as the source-of-truth for hidden defaults.
+    IMPORTANT: We intentionally embed these defaults directly in code so the public
+    release does not ship internal/legacy config files. These values were copied
+    from the historical `config_pre_decoder_memory_surface_model_1_d9.yaml`.
+    """
+    base_output_dir = os.environ.get("PREDECODER_BASE_OUTPUT_DIR", "outputs")
+    output_root = f"{base_output_dir}/${{exp_tag}}"
+    return {
+        "exp_tag": "pre-decoder",
+        "output": output_root,
+        "hydra": {
+            "run": {
+                "dir": "${output}"
+            },
+            "output_subdir": "hydra"
+        },
+        "resume_dir": f"{output_root}/models",
+        "enable_fp16": False,
+        "enable_bf16": False,
+        "enable_matmul_tf32": True,
+        "enable_cudnn_tf32": True,
+        "enable_cudnn_benchmark": True,
+        "torch_compile": _get_env_bool("PREDECODER_TORCH_COMPILE", True),
+        "torch_compile_mode": os.environ.get("PREDECODER_TORCH_COMPILE_MODE", "default"),
+        "load_checkpoint": False,
+        "code": "surface",
+        "distance": 9,
+        "n_rounds": 9,
+        "multiple_distances": [13, 13],
+        "multiple_rounds": [13, 13],
+        "use_multiple_patches": False,
+        "meas_basis": "both",
+        "workflow": {
+            "task": "train"
+        },
+        "data":
+            {
+                "timelike_he": True,
+                "num_he_cycles": 1,
+                "use_weight2_timelike": False,
+                "max_passes_w1": 8,
+                "max_passes_w2": 4,
+                "decompose_y": True,
+                "p_error": None,
+                "p_min": 0.001,
+                "p_max": 0.006,
+                "error_mode": "circuit_level_surface_custom",
+                # Public config overrides this; keep the historical default for completeness.
+                "precomputed_frames_dir": _default_precomputed_frames_dir(),
+                "enable_correlated_pymatching": False,
+                "code_rotation": "XV",
+                "noise_model": None,
+            },
+        "model":
+            {
+                "version": "predecoder_memory_v1",
+                "dropout_p": 0.05,
+                "activation": "gelu",
+                "num_filters": [128, 128, 128, 4],
+                "kernel_size": [3, 3, 3, 3],
+                "input_channels": 4,
+                "out_channels": 4,
+            },
+        "datapipe": "memory",
+        "data_method": "train",
+        "train":
+            {
+                # Production baseline: 2^26 shots / epoch when training with 8 GPUs.
+                # The training script will auto-scale this based on detected world size / GPU count.
+                "num_samples": 67108864,
+                "accumulate_steps": 2,
+                "checkpoint_interval": 1,
+                "save_every_datasets": 5,
+                "epochs": 100,
+            },
+        # NOTE: temporarily reduced for faster iteration during refactor/testing.
+        "val": {
+            "num_samples": 65536,
+            "threshold": 0.5,
+            "trials": 1
+        },
+        "optimizer_type": "Lion",
+        "optimizer": {
+            "lr": 1e-4,
+            "weight_decay": 1e-7,
+            "beta2": 0.95
+        },
+        "lr_scheduler":
+            {
+                "type": "warmup_then_decay",
+                "warmup_steps": 100,
+                "milestones": [0.25, 0.5, 1.0],
+                "gamma": 0.7,
+                "min_lr": 1e-6,
+            },
+        "batch_schedule":
+            {
+                "enabled": True,
+                "initial": 256,
+                "final": 1024,
+                "start_epoch": 1,
+                "end_epoch": 3,
+            },
+        "validation_ler": True,
+        "early_stopping": {
+            "enabled": True,
+            "patience": 100
+        },
+        "time_based_early_stopping": {
+            "enabled": False,
+            "safety_margin_minutes": 5
+        },
+        "ema": {
+            "use_ema": True,
+            "decay": 0.0001
+        },
+        "test":
+            {
+                "num_samples": 262144,
+                "trials": 1,
+                "distance": 9,
+                "n_rounds": 9,
+                "noise_model": "train",
+                "p_error": 0.006,
+                "dataloader":
+                    {
+                        "batch_size": 64,
+                        "num_workers": 0,
+                        "persistent_workers": False,
+                    },
+                "latency_num_samples": 1000,
+                "sampler": {
+                    "shuffle": False,
+                    "drop_last": False
+                },
+                "syn_red": "full",
+                "th_data": 0.0,
+                "th_syn": 0.0,
+                "sampling_mode": "threshold",
+                "temperature": 0.0,
+                "temperature_data": None,
+                "temperature_syn": None,
+                "per_round": False,
+                "meas_basis_test": "both",
+                "use_model_checkpoint": -1,
+            },
+        "threshold":
+            {
+                "p_values": [0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008],
+                "distances": [5, 7, 9, 11, 13],
+                "n_rounds": None,
+            },
+    }
+def _select(cfg: DictConfig, key: str) -> Tuple[bool, Any]:
+    """
+    Return (exists, value) for a dot-path in cfg.
+    Note: OmegaConf.select returns None both for missing keys and explicit nulls,
+    so we treat a key as existing iff it is present in the underlying container.
+    """
+    # OmegaConf doesn't provide a direct 'has_key' for dotted paths; implement via container walk.
+    cur: Any = cfg
+    parts = key.split(".")
+    for p in parts:
+        if not isinstance(cur, DictConfig) or p not in cur:
+            return False, None
+        cur = cur[p]
+    return True, cur
+def _assert_not_present(cfg: DictConfig, keys: Iterable[str], *, context: str) -> None:
+    for k in keys:
+        exists, _ = _select(cfg, k)
+        if exists:
+            raise ValueError(
+                f"Config field '{k}' is not supported in the public release ({context}). "
+                f"Remove it from the config/CLI overrides."
+            )
+def validate_public_config(cfg: DictConfig) -> PublicModelSpec:
+    """
+    Validate the user-facing config BEFORE we merge in hidden defaults.
+    Returns:
+        PublicModelSpec for cfg.model_id (validated).
+    """
+    # model_id must exist in public config
+    if "model_id" not in cfg:
+        raise ValueError("Missing required field: 'model_id' (choose 1..6).")
+    model_spec = get_model_spec(cfg.model_id)
+    # Public config requires distance/n_rounds (evaluation targets)
+    if "distance" not in cfg or "n_rounds" not in cfg:
+        raise ValueError("Missing required fields: 'distance' and 'n_rounds'.")
+    try:
+        d = int(cfg.distance)
+        r = int(cfg.n_rounds)
+    except Exception as e:
+        raise ValueError(
+            f"Invalid distance/n_rounds: distance={cfg.distance!r}, n_rounds={cfg.n_rounds!r}"
+        ) from e
+    if d <= 0 or r <= 0:
+        raise ValueError(
+            f"Invalid distance/n_rounds: distance={d}, n_rounds={r} (must be positive integers)"
+        )
+    if "train" in cfg:
+        raise ValueError("Config field 'train' is not supported in the public release.")
+    if "val" in cfg:
+        raise ValueError("Config field 'val' is not supported in the public release.")
+    if "test" in cfg:
+        raise ValueError("Config field 'test' is not supported in the public release.")
+    # Fail-fast on known hidden fields if the user tries to inject them.
+    _assert_not_present(
+        cfg,
+        keys=(
+            # output paths are managed by the runner scripts; not user-configurable in public release
+            "output",
+            "resume_dir",
+            # precision / tf32 knobs (always fp32 + tf32 enabled)
+            "enable_fp16",
+            "enable_bf16",
+            "enable_matmul_tf32",
+            "enable_cudnn_tf32",
+            # always both bases
+            "meas_basis",
+            # multi-patch curriculum mode (hidden)
+            "use_multiple_patches",
+            "multiple_distances",
+            "multiple_rounds",
+            # optimizer knobs (all hidden; optimizer.lr is fixed per model_id)
+            "optimizer",
+            "optimizer_type",
+            "lr_scheduler",
+            "batch_schedule",
+            # obsolete/confusing
+            "train.save_every_datasets",
+            # validation hidden knobs
+            "val.threshold",
+            "val.trials",
+            # early stopping extras hidden
+            "time_based_early_stopping",
+            "ema",
+        ),
+        context="hidden field override",
+    )
+    # Restrict cfg.data to a small public surface (others can be too experimental).
+    if "data" in cfg and isinstance(cfg.data, DictConfig):
+        # NOTE: precomputed frames path is intentionally hidden from the public config.
+        # We default it internally to <repo>/frames_data (see _default_precomputed_frames_dir).
+        if "precomputed_frames_dir" in cfg.data:
+            raise ValueError(
+                "Config field 'data.precomputed_frames_dir' is not supported in the public release. "
+                "Remove it from the config/CLI overrides."
+            )
+        allowed_data_keys = {"code_rotation", "noise_model"}
+        for k in cfg.data.keys():
+            if k not in allowed_data_keys:
+                raise ValueError(
+                    f"Config field 'data.{k}' is not supported in the public release. "
+                    f"Allowed data fields are: {sorted(allowed_data_keys)}"
+                )
+        # Validate rotation value (accept O1..O4; also allow internal XV/XH/ZV/ZH for compatibility).
+        if "code_rotation" in cfg.data:
+            _normalize_code_rotation(cfg.data.code_rotation)
+    # Restrict optimizer sub-keys: only lr is public.
+    if "optimizer" in cfg and isinstance(cfg.optimizer, DictConfig):
+        for k in cfg.optimizer.keys():
+            if k != "lr":
+                raise ValueError(
+                    f"Config field 'optimizer.{k}' is not supported in the public release. "
+                    f"Only 'optimizer.lr' is user-configurable."
+                )
+    return model_spec
+def clamp_to_receptive_field(cfg: DictConfig, R: int) -> None:
+    """In-place clamp of cfg.distance and cfg.n_rounds to receptive field R."""
+    if not isinstance(R, int) or R <= 0:
+        raise ValueError(f"Invalid receptive field R={R!r}")
+    if "distance" not in cfg or "n_rounds" not in cfg:
+        raise ValueError("Both 'distance' and 'n_rounds' must be present in config.")
+    cfg.distance = int(min(int(cfg.distance), R))
+    cfg.n_rounds = int(min(int(cfg.n_rounds), R))
+def apply_public_defaults_and_model(cfg: DictConfig, model_spec: PublicModelSpec) -> DictConfig:
+    """
+    Merge hidden defaults and apply public model settings.
+    Returns a new DictConfig (does not mutate input).
+    """
+    base_cfg = OmegaConf.create(_base_hidden_defaults_dict())
+    # Merge: base provides full training-ready config; public cfg overrides user-visible fields.
+    merged = OmegaConf.merge(base_cfg, cfg)
+    OmegaConf.set_struct(merged, False)
+    # In the public release:
+    # - cfg.distance / cfg.n_rounds are the *evaluation targets* the user cares about
+    # - training always uses distance=n_rounds=R (the model receptive field)
+    requested_distance = int(merged.distance)
+    requested_n_rounds = int(merged.n_rounds)
+    # Enforce public invariants (hidden from user)
+    merged.enable_fp16 = False
+    merged.enable_bf16 = False
+    merged.enable_matmul_tf32 = True
+    merged.enable_cudnn_tf32 = True
+    merged.meas_basis = "both"
+    # Disable multi-patch mode explicitly (the hidden default lives at the top level)
+    merged.use_multiple_patches = False
+    if "data" not in merged:
+        merged.data = {}
+    merged.data.use_multiple_patches = False
+    merged.multiple_distances = None
+    merged.multiple_rounds = None
+    # Always use repo-relative frames_data by default (hidden from public config).
+    merged.data.precomputed_frames_dir = _default_precomputed_frames_dir()
+    # Apply model architecture from registry
+    if "model" not in merged:
+        merged.model = {}
+    merged.model.version = model_spec.model_version
+    merged.model.num_filters = list(model_spec.num_filters)
+    merged.model.kernel_size = list(model_spec.kernel_size)
+    # Public release: hard-code optimizer.lr based on model choice.
+    # (User is not allowed to override optimizer settings.)
+    if "optimizer" not in merged:
+        merged.optimizer = {}
+    lr = _PUBLIC_MODEL_ID_TO_LR.get(int(model_spec.model_id))
+    if lr is None:
+        raise ValueError(f"No public LR mapping for model_id={model_spec.model_id!r}")
+    merged.optimizer.lr = float(lr)
+    # Public release: production-like batch schedule defaults.
+    # Target behavior: per-GPU batch size is 512 in the first epoch, 2048 thereafter.
+    # Model 3 is heavier and uses a smaller schedule (256 -> 1024); model 6 uses 256 -> 512.
+    if "batch_schedule" not in merged:
+        merged.batch_schedule = {}
+    merged.batch_schedule.enabled = True
+    if int(model_spec.model_id) == 3:
+        merged.batch_schedule.initial = 256
+        merged.batch_schedule.final = 1024
+    elif int(model_spec.model_id) == 6:
+        merged.batch_schedule.initial = 256
+        merged.batch_schedule.final = 512
+    else:
+        merged.batch_schedule.initial = 512
+        merged.batch_schedule.final = 2048
+    # "First epoch only" initial, then final for all later epochs.
+    merged.batch_schedule.start_epoch = 0
+    merged.batch_schedule.end_epoch = 0
+    # Public release: training epochs default to production values,
+    # but honor explicit user overrides for quick validation runs.
+    if "train" not in merged:
+        merged.train = {}
+    if not ("train" in cfg and isinstance(cfg.train, DictConfig) and "epochs" in cfg.train):
+        merged.train.epochs = 100
+    # Public release: validation sample count defaults to production values,
+    # but honor explicit user overrides for quick validation runs.
+    if "val" not in merged:
+        merged.val = {}
+    # NOTE: temporarily reduced for faster iteration during refactor/testing.
+    if not ("val" in cfg and isinstance(cfg.val, DictConfig) and "num_samples" in cfg.val):
+        merged.val.num_samples = 65536
+    # Train vs inference window semantics (public release):
+    # - Top-level cfg.distance / cfg.n_rounds are the user-specified *evaluation* targets.
+    # - Training always runs on the model receptive field R (distance=n_rounds=R).
+    task = str(getattr(getattr(merged, "workflow", None), "task", "train")).strip().lower()
+    R = int(model_spec.receptive_field)
+    if R <= 0:
+        raise ValueError(f"Invalid receptive field R={R!r}")
+    if task == "train":
+        merged.distance = R
+        merged.n_rounds = R
+    else:
+        merged.distance = int(requested_distance)
+        merged.n_rounds = int(requested_n_rounds)
+    # Public code_rotation aliases: normalize O1..O4 -> internal XV/XH/ZV/ZH.
+    if "data" in merged and "code_rotation" in merged.data:
+        merged.data.code_rotation = _normalize_code_rotation(merged.data.code_rotation)
+    # Test/evaluation config is hidden and always uses the user-requested window.
+    if "test" not in merged:
+        merged.test = {}
+    if not ("test" in cfg and isinstance(cfg.test, DictConfig) and "num_samples" in cfg.test):
+        merged.test.num_samples = 262144
+    merged.test.distance = int(requested_distance)
+    merged.test.n_rounds = int(requested_n_rounds)
+    merged.test.noise_model = "train"
+    return merged
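
The validator above combines three small ideas: alias normalization for `code_rotation`, an existence check that treats an explicit null as present, and the train-versus-evaluation window rule. The following is a minimal standalone sketch of those ideas on plain dictionaries rather than `DictConfig`. The helper names `normalize_rotation`, `has_key`, and `resolve_window` are hypothetical and are not part of the repository.

```python
from typing import Any, Mapping, Tuple

# Public aliases and their internal names, as listed in config_validator.py.
PUBLIC_TO_INTERNAL = {"O1": "XV", "O2": "XH", "O3": "ZV", "O4": "ZH"}


def normalize_rotation(value: str) -> str:
    """Map O1..O4 to XV/XH/ZV/ZH; pass internal names through unchanged."""
    s = str(value).strip().upper()
    if s in PUBLIC_TO_INTERNAL:
        return PUBLIC_TO_INTERNAL[s]
    if s in PUBLIC_TO_INTERNAL.values():
        return s
    raise ValueError(f"Invalid code_rotation={value!r}")


def has_key(cfg: Mapping[str, Any], dotted: str) -> Tuple[bool, Any]:
    """Return (exists, value) for a dotted path.

    A key whose value is explicitly None still counts as existing,
    which a plain "value is not None" check cannot distinguish.
    """
    cur: Any = cfg
    for part in dotted.split("."):
        if not isinstance(cur, Mapping) or part not in cur:
            return False, None
        cur = cur[part]
    return True, cur


def resolve_window(task: str, distance: int, n_rounds: int, R: int) -> Tuple[int, int]:
    """Training uses (R, R); every other task keeps the requested window."""
    if task.strip().lower() == "train":
        return R, R
    return distance, n_rounds


if __name__ == "__main__":
    print(normalize_rotation(" o1 "))
    print(has_key({"data": {"noise_model": None}}, "data.noise_model"))
    print(resolve_window("train", 13, 104, R=13))
    print(resolve_window("inference", 13, 104, R=13))
```

With the values from `conf/config_public.yaml` (distance 13, n_rounds 104) and a receptive field of 13, the sketch keeps the 104-round window for inference and falls back to 13 rounds for training, which matches the behavior of `apply_public_defaults_and_model`.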

code/workflows/run.py ADDED

	@@ -0,0 +1,319 @@

+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import json
+import os
+import sys
+import hydra
+import numpy as np
+import torch
+from omegaconf import DictConfig, OmegaConf
+from training.train import main as train_main
+from model.factory import ModelFactory
+from data.factory import DatapipeFactory
+from hydra.utils import to_absolute_path
+from workflows.config_validator import (
+    apply_public_defaults_and_model,
+    validate_public_config,
+)
+from training.distributed import DistributedManager
+from torch.utils.data import DataLoader
+def _ensure_inference_io_channels(cfg):
+    # 1) Ensure out_channels matches the model’s heads (4: z_data, x_data, syn_x, syn_z)
+    if not getattr(cfg.model, "out_channels", None) or cfg.model.out_channels == 0:
+        cfg.model.out_channels = 4
+    # 2) Infer input_channels from a single inference sample if not set
+    if not getattr(cfg.model, "input_channels", None) or cfg.model.input_channels == 0:
+        ds = DatapipeFactory.create_datapipe_inference(cfg)
+        tmp = DataLoader(ds, batch_size=1)
+        sample = next(iter(tmp))
+        cfg.model.input_channels = int(sample["trainX"].shape[1])
+    # 3) Keep num_filters consistent with out_channels
+    if hasattr(cfg.model, "num_filters"):
+        filters = list(cfg.model.num_filters)
+        if filters and filters[-1] != cfg.model.out_channels:
+            print(
+                f"[run] Adjusting model.num_filters[-1] {filters[-1]} -> {cfg.model.out_channels}"
+            )
+            filters[-1] = cfg.model.out_channels
+            cfg.model.num_filters = filters
+@hydra.main(version_base="1.3", config_path="../../conf", config_name="config")
+def run(cfg: DictConfig) -> None:
+    # Early-access public release: validate public surface, then merge in hidden defaults.
+    # NOTE: Validation is done BEFORE merging defaults so we can fail fast on injected fields.
+    model_spec = validate_public_config(cfg)
+    cfg = apply_public_defaults_and_model(cfg, model_spec)
+    torch.backends.cuda.matmul.allow_tf32 = cfg.enable_matmul_tf32
+    torch.backends.cudnn.allow_tf32 = cfg.enable_cudnn_tf32
+    if cfg.code in ("surface", "surface_partition"):
+        run_surface(cfg)
+    else:
+        raise ValueError(
+            f"Unsupported code={cfg.code!r}; expected 'surface' or 'surface_partition'."
+        )
+def run_surface(cfg: DictConfig):
+    if cfg.workflow.task == "train":
+        train_main(cfg)
+    elif cfg.workflow.task == "threshold":
+        raise ValueError(
+            "workflow.task='threshold' has been renamed to workflow.task='inference'. "
+            "Please update your config/env var to WORKFLOW=inference."
+        )
+    elif cfg.workflow.task == "inference":
+        from evaluation.inference import run_inference
+        DistributedManager.initialize()
+        dist = DistributedManager()
+        model = _load_model(cfg, dist)
+        run_inference(model, dist.device, dist, cfg)
+    elif cfg.workflow.task == "data":
+        DistributedManager.initialize()
+        dist = DistributedManager()
+        train_loader, _ = DatapipeFactory.create_dataloader(cfg, dist.world_size, dist.rank)
+        for j, dl in enumerate(train_loader):
+            print(f"Batch {j}: syndrome_shape: {dl['syndrome'].shape}")
+    elif cfg.workflow.task == "decoder_ablation":
+        from evaluation.failure_analysis import decoder_ablation_study
+        DistributedManager.initialize()
+        dist = DistributedManager()
+        model = _load_model(cfg, dist)
+        decoder_ablation_study(model, dist.device, dist, cfg)
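+    elif cfg.workflow.task in ("sampling", "visualize"):
+        raise ValueError(
+            f"workflow.task={cfg.workflow.task!r} is not supported in the early-access public release. "
+            "Supported workflows: train, inference, decoder_ablation."
+        )
+    else:
+        # Fail loudly instead of silently doing nothing on an unrecognized task.
+        raise ValueError(
+            f"Unknown workflow.task={cfg.workflow.task!r}. "
+            "Supported workflows: train, inference, decoder_ablation."
+        )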
+def find_best_model(path, *, rank: int = 0):
+    if rank == 0:
+        print(f"Searching for best model in: {path}")
+    if not os.path.isdir(path):
+        raise FileNotFoundError(f"Model directory does not exist: {path}")
+    max_value = -1  # Start with -1 to include epoch 0
+    best_file = None
+    model_files = []
+    # Named .pt files without epoch numbers (e.g. Ising-Decoder-SurfaceCode-1-Fast.pt)
+    named_pt_files = []
+    for filename in os.listdir(path):
+        if not filename.endswith(".pt"):
+            continue
+        if filename.startswith("PreDecoderModelMemory_"):
+            try:
+                value = float(filename.split(".")[2])  # Gets epoch number
+                model_files.append((filename, value))
+                if value > max_value:
+                    max_value = value
+                    best_file = filename
+            except (IndexError, ValueError) as e:
+                print(f"Warning: could not parse epoch from filename {filename}: {e}")
+        else:
+            named_pt_files.append(filename)
+    # Fall back to named .pt files when no epoch-numbered checkpoints are present
+    if best_file is None and named_pt_files:
+        named_pt_files.sort()
+        best_file = named_pt_files[-1]
+        model_files = [(f, None) for f in named_pt_files]
+    if rank == 0:
+        print(f"Found {len(model_files)} model file(s):")
+        for filename, epoch in sorted(model_files, key=lambda x: (x[1] is None, x[1] or 0)):
+            marker = "*" if filename == best_file else " "
+            epoch_str = str(epoch) if epoch is not None else "n/a"
+            print(f"  [{marker}] {filename} (epoch {epoch_str})")
+    if best_file is None:
+        raise FileNotFoundError(
+            f"No valid model checkpoint files found in {path}\n"
+            f"Expected .pt files (e.g. Ising-Decoder-SurfaceCode-1-Fast.pt or "
+            f"PreDecoderModelMemory_*.pt).\n"
+            f"Hint: download the pretrained weights and place them in this directory, "
+            f"or set model_checkpoint_file in your config to an explicit path."
+        )
+    best_model_path = os.path.join(path, best_file)
+    if rank == 0:
+        epoch_str = str(max_value) if max_value >= 0 else "n/a"
+        print(f"Selected best model: {best_file} (epoch {epoch_str})")
+    return best_model_path
+def _resolve_dir(path: str) -> str:
+    """Return an absolute version of path, resolving relative paths from the repo root."""
+    if os.path.isabs(path):
+        return path
+    repo_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+    return os.path.join(repo_root, path)
+def _load_state_dict_from_pt(model_path: str, device) -> dict:
+    """Load a state dict from a .pt checkpoint, handling multiple saved formats.
+    Supports:
+    - bare state dict (keys are layer names)
+    - {"model_state_dict": ...}
+    - {"state_dict": ...}
+    Also strips the DDP "module." prefix if present.
+    """
+    raw = torch.load(model_path, map_location=device, weights_only=False)
+    if isinstance(raw, dict):
+        if "model_state_dict" in raw:
+            state_dict = raw["model_state_dict"]
+        elif "state_dict" in raw:
+            state_dict = raw["state_dict"]
+        else:
+            state_dict = raw
+    else:
+        raise ValueError(f"Unexpected checkpoint format: expected a dict, got {type(raw).__name__}")
+    return {
+        (k[len("module."):] if k.startswith("module.") else k): v for k, v in state_dict.items()
+    }
+def _load_model(cfg, dist):
+    if dist.rank == 0:
+        print(f"Loading model for task: {cfg.workflow.task}")
+    _ensure_inference_io_channels(cfg)
+    # SafeTensors path: load fp16/fp32 model from SafeTensors file
+    safetensors_path = os.environ.get("PREDECODER_SAFETENSORS_CHECKPOINT", "").strip()
+    if safetensors_path:
+        from export.safetensors_utils import load_safetensors
+        if dist.rank == 0:
+            print(f"Loading model from SafeTensors: {safetensors_path}")
+        # Auto-detect model_id from SafeTensors metadata (don't override with config)
+        model, metadata = load_safetensors(
+            safetensors_path,
+            model_id=None,
+            device=str(dist.device),
+        )
+        if dist.rank == 0:
+            loaded_model_id = metadata.get("model_id", "unknown")
+            dtype = metadata.get("quant_format", "fp32")
+            receptive_field = metadata.get("receptive_field", "unknown")
+            param_count = sum(p.numel() for p in model.parameters())
+            print(f"  model_id: {loaded_model_id} (from SafeTensors metadata)")
+            print(f"  receptive_field: {receptive_field}")
+            print(f"  dtype: {dtype}")
+            print(f"  parameters: {param_count:,}")
+            # Warn if config model_id doesn't match file metadata
+            config_model_id = getattr(cfg, "model_id", None)
+            if config_model_id is not None and str(config_model_id) != str(loaded_model_id):
+                print(
+                    f"  Warning: config model_id={config_model_id} differs from "
+                    f"file model_id={loaded_model_id}; using {loaded_model_id}"
+                )
+        if metadata.get("quant_format") == "fp16":
+            cfg.enable_fp16 = True
+        return model
+    # Direct file path override (for named pretrained models without epoch numbers)
+    model_checkpoint_file = getattr(cfg, 'model_checkpoint_file', None)
+    if model_checkpoint_file:
+        model_checkpoint_file = _resolve_dir(str(model_checkpoint_file))
+        if not os.path.exists(model_checkpoint_file):
+            raise FileNotFoundError(f"Checkpoint not found: {model_checkpoint_file}")
+        if dist.rank == 0:
+            print(f"Loading model from: {model_checkpoint_file}")
+        model = ModelFactory.create_model(cfg).to(dist.device)
+        if cfg.enable_fp16:
+            model = model.half()
+        state_dict = _load_state_dict_from_pt(model_checkpoint_file, dist.device)
+        model.load_state_dict(state_dict)
+        if dist.rank == 0:
+            param_count = sum(p.numel() for p in model.parameters())
+            print(f"Model loaded ({param_count:,} parameters)")
+        return model
+    model = ModelFactory.create_model(cfg).to(dist.device)
+    if cfg.enable_fp16:
+        model = model.half()
+        if dist.rank == 0:
+            print("Model converted to float16 for fp16 inference")
+    # Determine model directory
+    # Priority: 1) model_checkpoint_dir (for inference configs)
+    #           2) cfg.output/models (for training configs)
+    model_checkpoint_dir = getattr(cfg, 'model_checkpoint_dir', None)
+    use_checkpoint = getattr(cfg.test, 'use_model_checkpoint', -1)
+    if use_checkpoint == -1:
+        model_dir = _resolve_dir(
+            os.path.join(model_checkpoint_dir, "best_model")
+            if model_checkpoint_dir else f"{cfg.output}/models/best_model"
+        )
+        if dist.rank == 0:
+            print(f"Loading best model from: {model_dir}")
+        # Fallback: older runs may not have a best_model/ folder
+        if not os.path.isdir(model_dir):
+            fallback_dir = _resolve_dir(
+                model_checkpoint_dir if model_checkpoint_dir else f"{cfg.output}/models"
+            )
+            if dist.rank == 0:
+                print(f"best_model/ not found; falling back to: {fallback_dir}")
+            model_dir = fallback_dir
+        model_path = find_best_model(model_dir, rank=dist.rank)
+    else:
+        checkpoint_dir = _resolve_dir(
+            model_checkpoint_dir if model_checkpoint_dir else f"{cfg.output}/models"
+        )
+        if dist.rank == 0:
+            print(f"Loading checkpoint {use_checkpoint} from: {checkpoint_dir}")
+        # Prefer any PreDecoderModelMemory_* file ending with .0.{use_checkpoint}.pt
+        target_suffix = f".0.{use_checkpoint}.pt"
+        checkpoint_filename = None
+        try:
+            for f in os.listdir(checkpoint_dir):
+                if f.startswith("PreDecoderModelMemory_") and f.endswith(target_suffix):
+                    checkpoint_filename = f
+                    break
+        except OSError:
+            pass
+        if checkpoint_filename is None:
+            checkpoint_filename = f"PreDecoderModelMemory_v1.0.{use_checkpoint}.pt"
+        model_path = os.path.join(checkpoint_dir, checkpoint_filename)
+        if not os.path.exists(model_path):
+            raise FileNotFoundError(f"Checkpoint not found: {model_path}")
+    if dist.rank == 0:
+        print(f"Loading model parameters from: {model_path}")
+    state_dict = _load_state_dict_from_pt(model_path, dist.device)
+    model.load_state_dict(state_dict)
+    if dist.rank == 0:
+        param_count = sum(p.numel() for p in model.parameters())
+        print(f"Model loaded ({param_count:,} parameters)")
+    return model
+if __name__ == "__main__":
+    run()
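
The checkpoint lookup in `find_best_model` and the key cleanup in `_load_state_dict_from_pt` can be exercised without PyTorch or a filesystem. The following is a minimal sketch of the same selection rule applied to a list of filenames, plus the `module.` prefix stripping applied to a plain dict. The function names `pick_checkpoint` and `strip_ddp_prefix` are hypothetical, and the sketch skips unparseable names silently where the original prints a warning.

```python
from typing import Any, Dict, List, Optional


def pick_checkpoint(filenames: List[str]) -> Optional[str]:
    """Prefer the highest-epoch PreDecoderModelMemory_* file.

    If no epoch-numbered checkpoint is present, fall back to the
    alphabetically last remaining .pt file.
    """
    best: Optional[str] = None
    best_epoch = -1.0  # start below zero so that epoch 0 is accepted
    named: List[str] = []
    for name in filenames:
        if not name.endswith(".pt"):
            continue
        if name.startswith("PreDecoderModelMemory_"):
            try:
                # e.g. PreDecoderModelMemory_v1.0.10.pt -> third dot field is "10"
                epoch = float(name.split(".")[2])
            except (IndexError, ValueError):
                continue
            if epoch > best_epoch:
                best, best_epoch = name, epoch
        else:
            named.append(name)
    if best is None and named:
        best = sorted(named)[-1]
    return best


def strip_ddp_prefix(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Drop a leading 'module.' from each key, as left by DistributedDataParallel."""
    prefix = "module."
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }


if __name__ == "__main__":
    files = [
        "PreDecoderModelMemory_v1.0.3.pt",
        "PreDecoderModelMemory_v1.0.10.pt",
        "Quantispect_RF13_v1.0.10.pt",
    ]
    print(pick_checkpoint(files))
    print(pick_checkpoint(["Quantispect_RF13_v1.0.10.pt", "notes.txt"]))
    print(strip_ddp_prefix({"module.stem.weight": 1, "head.bias": 2}))
```

Note that the epoch is compared numerically, so epoch 10 ranks above epoch 3 even though it sorts lower as a string. A named file such as `Quantispect_RF13_v1.0.10.pt` is selected only when no `PreDecoderModelMemory_*` checkpoint sits in the same directory.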

conf/config_public.yaml ADDED

	@@ -0,0 +1,84 @@

+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# Public, single-file config for external users.
+#
+# Users should only edit the fields in this file.
+# Advanced/experimental fields are intentionally omitted and will be populated
+# from internal defaults (and validated to prevent unsupported overrides).
+# === Model selection (required) ===
+model_id: 6  # Choose 1, 2, 3, 4, 5, or 6
+model:
+  version: predecoder_fasthyper_rf13_v1
+  input_channels: 4
+  out_channels: 4
+  hidden_dim: 96
+  mid_dim: 144
+  mix_groups: 6
+  num_blocks: 5
+  stem_kernel_size: 3
+  gate_reduction: 4
+  dropout_p: 0.02
+# === Values for evaluation. Training window is hardcoded to model receptive field. ===
+distance: 13
+n_rounds: 104
+# === Workflow ===
+workflow:
+  task: train  # train, inference
+  # TODO: simplify inference logs to show only PyMatching results before and after pre-decoding; batch_size=1
+# === Data (public surface only) ===
+data:
+  # Surface code orientation (public naming): O1, O2, O3, O4
+  code_rotation: O1
+  # Circuit-level noise model (25-parameter). This is the default public noise specification.
+  # The defaults are chosen for p=0.003.
+  noise_model:
+    # State preparation errors (2)
+    p_prep_X: 0.002 # |+> state-prep fails with this probability (apply Z), 2*p/3
+    p_prep_Z: 0.002 # |0> state-prep fails with this probability (apply X), 2*p/3
+    # Measurement errors (2)
+    p_meas_X: 0.002 # Measurement in X-basis fails with this probability (apply Z before measurement), 2*p/3
+    p_meas_Z: 0.002 # Measurement in Z-basis fails with this probability (apply X before measurement), 2*p/3
+    # Idle during CNOT layers / bulk (3)
+    p_idle_cnot_X: 0.001 # p/3
+    p_idle_cnot_Y: 0.001 # p/3
+    p_idle_cnot_Z: 0.001 # p/3
+    # Idle during SPAM window (ancilla prep+reset) on data qubits only (3)
+    p_idle_spam_X: 0.001998 # 2*p/3 - 2*p^2/9
+    p_idle_spam_Y: 0.001998 # 2*p/3 - 2*p^2/9
+    p_idle_spam_Z: 0.001998 # 2*p/3 - 2*p^2/9
+    # CNOT two-qubit errors (15) - keys are p_cnot_{Pauli}{Pauli} excluding II, p/15
+    p_cnot_IX: 0.0002
+    p_cnot_IY: 0.0002
+    p_cnot_IZ: 0.0002
+    p_cnot_XI: 0.0002
+    p_cnot_XX: 0.0002
+    p_cnot_XY: 0.0002
+    p_cnot_XZ: 0.0002
+    p_cnot_YI: 0.0002
+    p_cnot_YX: 0.0002
+    p_cnot_YY: 0.0002
+    p_cnot_YZ: 0.0002
+    p_cnot_ZI: 0.0002
+    p_cnot_ZX: 0.0002
+    p_cnot_ZY: 0.0002
+    p_cnot_ZZ: 0.0002
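
The inline comments in the `noise_model` block give each of the 25 parameters as a formula in a single physical error rate `p`. The following is a minimal sketch that regenerates the block from `p` using exactly those formulas, which can help when preparing a config for a different error rate. The function name `noise_model_from_p` is hypothetical and is not part of the repository.

```python
import math
from typing import Dict


def noise_model_from_p(p: float) -> Dict[str, float]:
    """Build the 25-parameter noise dict from the formulas in the YAML comments."""
    params: Dict[str, float] = {}
    # State preparation and measurement errors: 2*p/3 each (4 parameters).
    for basis in ("X", "Z"):
        params[f"p_prep_{basis}"] = 2 * p / 3
        params[f"p_meas_{basis}"] = 2 * p / 3
    # Idle errors, one per Pauli (6 parameters).
    for pauli in "XYZ":
        params[f"p_idle_cnot_{pauli}"] = p / 3
        params[f"p_idle_spam_{pauli}"] = 2 * p / 3 - 2 * p**2 / 9
    # Two-qubit CNOT errors: p/15 for each Pauli pair except II (15 parameters).
    for a in "IXYZ":
        for b in "IXYZ":
            if a + b != "II":
                params[f"p_cnot_{a}{b}"] = p / 15
    return params


if __name__ == "__main__":
    model = noise_model_from_p(0.003)
    for key, value in model.items():
        print(f"{key}: {value:.6g}")
    # The 15 CNOT terms together account for a total two-qubit error rate of p.
    total_cnot = sum(v for k, v in model.items() if k.startswith("p_cnot_"))
    print(math.isclose(total_cnot, 0.003))
```

At `p = 0.003` the sketch gives the same values as the config: 0.002 for preparation and measurement, 0.001 for CNOT-layer idling, 0.001998 for SPAM-window idling, and 0.0002 for each CNOT Pauli pair.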

framework.png ADDED