PHerc.1667-iteration-1
Trained on segment l_2 with l_2_inklabels.png (3,396 tiles).
Ablation 1/5 — sparsest training label (3,396 tiles). Smallest annotation set; trained for ~30 effective epochs over its data to match the step budget.
This is one of six sibling models released together — five label
ablations on segment l_2 (ink1–ink5, increasing label coverage)
and one cross-segment baseline (ink0). The full family is listed
at the bottom of this card.
Preview
l_2 (training segment) prediction with the training label overlaid in
magenta, and l_5 (held-out segment) prediction. All panels are
downsampled 16× and rotated 180° to match the publication-figure
convention. The full-resolution last.ckpt outputs are at 43008 × ~30000
voxels.
Architecture in one paragraph
A 3-D volumetric input (B, 1, 62, 256, 256) is encoded by a
ResNet3D-50 backbone (Hara, Kataoka & Satoh, 2018; initialised from
the Kinetics-700 release r3d50_KM_200ep.pth with conv1 weights
summed across RGB → 1 grayscale channel). Each of the four backbone
stages is collapsed along the z (depth) axis with torch.max,
producing a 2-D feature pyramid {(256,64,64), (512,32,32), (1024,16,16), (2048,8,8)}. A small 2-D U-Net decoder upsamples
coarse-to-fine with concatenated skip connections; a 1×1 conv head
produces a single sigmoid logit channel at quarter resolution
(B, 1, 64, 64). Training uses 0.5·Dice + 0.5·SoftBCE against the
label down-interpolated to 64×64.
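The depth collapse can be illustrated with stand-in tensors. This is a minimal sketch, not the code in modeling_inkdetection.py: it assumes each backbone stage yields a (B, C, D, H, W) tensor, reuses the channel and spatial sizes quoted above, and uses a placeholder depth of 4 for every stage because the real per-stage depths are not stated in this card. The helper name collapse_depth is hypothetical.

```python
import torch

def collapse_depth(features):
    """Max-pool each (B, C, D, H, W) tensor over D, giving (B, C, H, W)."""
    # torch.max with dim= returns (values, indices); keep only the values.
    return [torch.max(f, dim=2).values for f in features]

# Stand-in backbone outputs. Channels and H, W follow the pyramid above;
# the depth of 4 is a placeholder, not the real per-stage depth.
stage_shapes = [(256, 64, 64), (512, 32, 32), (1024, 16, 16), (2048, 8, 8)]
features = [torch.randn(1, c, 4, h, w) for c, h, w in stage_shapes]

pyramid = collapse_depth(features)
for p in pyramid:
    print(tuple(p.shape))
```

Because the maximum is taken independently per channel and per (y, x) location, each 2-D feature records the strongest response anywhere along the depth axis at that location.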
Quick start
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "scrollprize/PHerc.1667-iteration-1",
    trust_remote_code=True,
).eval().cuda()

# Input: float32, shape (B, 1, D=62, H=256, W=256).
# Intensities should already be in roughly [0, 1]: the training pipeline
# clipped raw uint8 layers to [0, 200] and then applied
# Normalize(mean=0, std=1), which keeps the magnitude small.
x = torch.rand(1, 1, 62, 256, 256, device="cuda")  # dummy input in [0, 1)

with torch.no_grad():
    out = model(x)

print(out.logits.shape)           # torch.Size([1, 1, 64, 64])
prob = torch.sigmoid(out.logits)  # ink probability per pixel
```
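For real data, the intensity preparation described in the comments above might look like the following sketch. It assumes that Normalize(mean=0, std=1) amounts to dividing uint8 values by 255 (the behaviour of an albumentations-style Normalize with its default maximum pixel value); that is an assumption, not something this card states. The function name preprocess_layers is hypothetical.

```python
import numpy as np

def preprocess_layers(layers_u8, clip_max=200, scale=255.0):
    """(D, H, W) uint8 stack -> (1, 1, D, H, W) float32 in roughly [0, 1]."""
    clipped = np.clip(layers_u8, 0, clip_max)     # clip raw intensities to [0, 200]
    scaled = clipped.astype(np.float32) / scale   # assumption: Normalize divides by 255
    return scaled[None, None]                     # add batch and channel axes

# Stand-in data; replace with the real 62-layer stack of one 256x256 window.
layers = np.random.randint(0, 256, size=(62, 256, 256), dtype=np.uint8)
x_np = preprocess_layers(layers)
print(x_np.shape, x_np.dtype, float(x_np.max()))
```

The result can be handed to the model with torch.from_numpy(x_np).cuda().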
Full-segment inference (tiling)
The model only sees 256×256 windows. For a full scroll segment you need to slide the window across the (padded) layer stack and average overlapping predictions:
```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "scrollprize/PHerc.1667-iteration-1", trust_remote_code=True,
).eval().cuda()

# Stride 128 overlaps windows 2x per axis (up to 4 predictions per pixel);
# stride 64 overlaps 4x per axis (up to 16 predictions per pixel).
WINDOW, STRIDE = 256, 128
D = 62  # number of z-layers

# image: (H, W, D) uint8 stack of the 62 layers, padded to multiples of 256.
# fmask: (H, W) uint8 fragment mask (0 = outside, 255 = inside).
# Both must be loaded before this point.
H, W, _ = image.shape
mask_pred = np.zeros((H, W), dtype=np.float32)   # summed probabilities
mask_count = np.zeros((H, W), dtype=np.float32)  # predictions per pixel

with torch.no_grad():
    for y in range(0, H - WINDOW + 1, STRIDE):
        for x in range(0, W - WINDOW + 1, STRIDE):
            # Skip windows that are not entirely inside the fragment.
            if np.any(fmask[y:y+WINDOW, x:x+WINDOW] == 0):
                continue
            tile = image[y:y+WINDOW, x:x+WINDOW]             # (256,256,62)
            # Bring intensities to roughly [0, 1] as the quick start requires
            # (assumes the training Normalize step divides by 255).
            tile = np.clip(tile, 0, 200).astype(np.float32) / 255.0
            t = torch.from_numpy(tile).permute(2, 0, 1)      # (62,256,256)
            t = t.unsqueeze(0).unsqueeze(0).cuda()           # (1,1,62,256,256)
            logits = model(t).logits                         # (1,1,64,64)
            prob = torch.sigmoid(logits)
            # Upsample the quarter-resolution output back to 256x256.
            prob = F.interpolate(prob, scale_factor=4,
                                 mode="bilinear").squeeze().cpu().numpy()
            mask_pred[y:y+WINDOW, x:x+WINDOW] += prob
            mask_count[y:y+WINDOW, x:x+WINDOW] += 1.0

# Average overlapping predictions; pixels never covered stay at 0.
pred = np.divide(mask_pred, mask_count,
                 out=np.zeros_like(mask_pred),
                 where=mask_count != 0)
cv2.imwrite("prediction.png", np.clip(pred * 255, 0, 255).astype(np.uint8))
```
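The loop above expects the layer stack and mask to be padded to multiples of 256 already. One way to do that is sketched below; the helper name pad_to_multiple and the choice of zero padding are assumptions, not taken from the training code.

```python
import numpy as np

def pad_to_multiple(arr, multiple=256):
    """Zero-pad the first two axes (H, W) up to the next multiple."""
    h, w = arr.shape[:2]
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    # Pad only at the bottom/right; leave any trailing axes untouched.
    pad = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (arr.ndim - 2)
    return np.pad(arr, pad, mode="constant", constant_values=0)

# Small stand-ins for the (H, W, D) stack and the (H, W) fragment mask.
image_small = np.ones((300, 520, 3), dtype=np.uint8)
fmask_small = np.full((300, 520), 255, dtype=np.uint8)

image_padded = pad_to_multiple(image_small)
fmask_padded = pad_to_multiple(fmask_small)
print(image_padded.shape, fmask_padded.shape)  # (512, 768, 3) (512, 768)
```

Zero-padding the mask marks the padded area as outside the fragment, so the tiling loop skips every window that touches it. Crop the prediction back to the original height and width afterwards.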
Training summary
| setting | value |
|---|---|
| Backbone | ResNet3D-50 (3-D conv, BN, ReLU residual blocks) |
| Encoder init | r3d50_KM_200ep.pth (Kinetics-700), conv1 summed across RGB |
| Decoder | 2-D U-Net (3 up-blocks: bilinear 2× + concat skip + 3×3 conv + BN + ReLU) |
| Output | 1 channel, sigmoid logit, quarter-resolution (64×64) |
| Loss | 0.5 × Dice + 0.5 × SoftBCE (smooth = 0.25) |
| Optimizer | AdamW, OneCycle lr 2e-5 → 3e-4, pct_start = 0.15 |
| Batch | 2 (effective 8 via accumulate 4), 16-mixed, grad-clip 1.0 |
| Max steps | 12,396 (= 3 epochs over the densest ablation label) |
| Training segment(s) | l_2 |
| Training label | l_2_inklabels.png |
| Training tiles (256×256 sub-tiles at stride 64) | 3,396 |
| Final train loss (_epoch) | 0.4219 |
| Final train loss (_step, single-batch noise) | 0.4381 |
| Wandb | vesuvius-challenge/Nature/l2_ink1_l5infer |
| Random seed | 130697 |
| Determinism | cudnn.deterministic = True, cudnn.benchmark = False |
| Hardware | 1 × NVIDIA H100 80 GB; ≈ 2 h end-to-end (load + train + inference) |
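The loss row can be read as the following plain-PyTorch sketch. It is not the training code: it assumes "SoftBCE (smooth = 0.25)" means binary cross-entropy on logits with label smoothing of 0.25, uses a simple per-sample Dice formulation, and picks bilinear interpolation for bringing the label to 64×64 (the card does not name the interpolation mode). All function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-7):
    """Soft Dice loss per sample, averaged over the batch."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    total = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + eps) / (total + eps)).mean()

def soft_bce_loss(logits, target, smooth=0.25):
    """BCE on logits with label smoothing (assumed meaning of 'SoftBCE')."""
    soft_target = target * (1.0 - smooth) + (1.0 - target) * smooth
    return F.binary_cross_entropy_with_logits(logits, soft_target)

def ink_loss(logits, label_full):
    """0.5 * Dice + 0.5 * SoftBCE against the label resized to the logit grid."""
    # Assumption: bilinear resizing; the card only says "down-interpolated".
    label = F.interpolate(label_full, size=logits.shape[-2:],
                          mode="bilinear", align_corners=False)
    return 0.5 * dice_loss(logits, label) + 0.5 * soft_bce_loss(logits, label)

# Toy check: a square of ink in a 256x256 label, logits on the 64x64 grid.
label_full = torch.zeros(2, 1, 256, 256)
label_full[:, :, 64:192, 64:192] = 1.0
label_small = F.interpolate(label_full, size=(64, 64),
                            mode="bilinear", align_corners=False)
good_logits = (label_small * 2.0 - 1.0) * 5.0   # confident and correct
bad_logits = -good_logits                       # confident and wrong
print(float(ink_loss(good_logits, label_full)),
      float(ink_loss(bad_logits, label_full)))
```

With label smoothing the BCE term never reaches zero, even for a perfect prediction, which is one reason a final train loss near 0.42 is not by itself a sign of poor fit.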
Files
| file | size | description |
|---|---|---|
| config.json | 1 KB | architecture + provenance metadata; loaded by AutoConfig |
| configuration_inkdetection.py | 2 KB | InkDetectionConfig(PretrainedConfig) |
| modeling_inkdetection.py | 9 KB | self-contained InkDetectionModel(PreTrainedModel) |
| model.safetensors | 319 MB | converted weights (338 tensors) |
| last.ckpt | 963 MB | original PyTorch-Lightning checkpoint (incl. optimizer + LR-scheduler state); load with torch.load(...)["state_dict"] |
| preview_l_2.png | ~700 KB | low-res preview of the l_2 prediction (1/16 scale, 180° rotated) |
| preview_l_5.png | ~2 MB | low-res preview of the l_5 (held-out) prediction |
| preview_label.png | ~50 KB | the training label, same scale + rotation |
The Hugging Face weights are bit-identical to the original
PyTorch-Lightning checkpoint (verified: max abs diff = 0.0e+00 on
identical inputs). Use model.safetensors for AutoModel.from_pretrained;
use last.ckpt only if you want to resume training from the saved
optimizer / scheduler state.
The model family
| model | training segment(s) | label | tiles | effective epochs |
|---|---|---|---|---|
| PHerc.1667-iteration-0 | 500p2a + 658 + 20250910185200 + 20250919125754* | (cross-segment baseline) | 20,075 | ~5 |
| PHerc.1667-iteration-1 | l_2 | l_2_inklabels.png | 3,396 | ~30 |
| PHerc.1667-iteration-2 | l_2 | l_2_inklabels2.png | 8,970 | ~12 |
| PHerc.1667-iteration-3 | l_2 | l_2_inklabels3.png | 15,286 | ~7 |
| PHerc.1667-iteration-4 | l_2 | l_2_inklabels4.png | 24,773 | ~5 |
| PHerc.1667-iteration-5 | l_2 | l_2_inklabels5.png | 33,061 | 3 |
All six share the architecture, hyperparameters, and a fixed step budget of 12,396 optimizer steps; the only thing that varies between rows is the supervising label (or, for ink0, the training segments).
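The effective-epoch column follows from the fixed step budget. The sketch below assumes one optimizer step consumes 8 tiles (batch 2 × accumulation 4) and ignores partial batches at epoch boundaries; under that assumption the table's figures match the raw ratios rounded up, and the 12,396-step budget matches 3 × (33,061 // 8).

```python
import math

MAX_STEPS = 12_396
EFFECTIVE_BATCH = 2 * 4  # batch 2 x gradient accumulation 4

# Tile counts per model, copied from the family table.
tiles = {
    "iteration-0": 20_075,
    "iteration-1": 3_396,
    "iteration-2": 8_970,
    "iteration-3": 15_286,
    "iteration-4": 24_773,
    "iteration-5": 33_061,
}

def effective_epochs(n_tiles):
    """Approximate passes over the data within the step budget."""
    return MAX_STEPS * EFFECTIVE_BATCH / n_tiles

for name, n in tiles.items():
    e = effective_epochs(n)
    print(f"{name}: {e:.2f} passes (rounds up to {math.ceil(e)})")

# Step budget derived from the densest label.
print(3 * (tiles["iteration-5"] // EFFECTIVE_BATCH))  # 12396
```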
Citation
If you use this model in published work, please cite the Vesuvius Challenge and the underlying ResNet3D paper:
```bibtex
@inproceedings{hara2018can,
  title     = {Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?},
  author    = {Hara, Kensho and Kataoka, Hirokatsu and Satoh, Yutaka},
  booktitle = {CVPR},
  year      = {2018},
}
```
Licence
MIT.