phanerozoic committed
Commit 4f2319a · verified · Parent: 4ae1f22

Document precision variants and the auto-upcast inference behavior


Adds a one-sentence clarification to the Precision variants section of the
model card explaining that both safetensors files load into the same FP32
model in memory because PyTorch automatically upcasts the bfloat16 stored
weights at construction time. The smaller variant saves download bandwidth
and disk space but does not reduce inference VRAM.

This commit also serves as the canonical written record of the variant
work that landed in the previous two commits. The previous messages were
mangled by Bash command substitution stripping inline-code references; this
one is written directly via the Hugging Face Python API and preserves
backticks correctly.

What the variants are
---------------------
The repository ships two safetensors files with the same weights at
different on-disk precision:

| File | Backbone | Heads | Size |
|-----------------------------------|----------|-------|---------|
| `model.safetensors` | FP32 | FP32 | 334 MB |
| `model.bf16_backbone.safetensors` | BF16 | FP32 | 170 MB |

The default `from_pretrained` call without a `variant=` argument continues
to load the canonical FP32 file and is fully backwards compatible with all
prior usage. The smaller variant is opt-in via Hugging Face's `variant=`
convention:

    from transformers import AutoModel

    model = AutoModel.from_pretrained(
        "phanerozoic/argus",
        trust_remote_code=True,
        variant="bf16_backbone",
    )

Why the smaller variant is safe at inference
--------------------------------------------
The EUPE-ViT-B backbone already runs through `torch.autocast(dtype=torch.bfloat16)`
inside the existing inference path, which means the BF16 stored weights match
the precision used in the forward pass to begin with. Storing them at BF16 on
disk is therefore not a precision sacrifice on the backbone side.
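A minimal standalone sketch of the two dtype moves involved (toy tensors, not the actual Argus loading path, assuming a PyTorch build with CPU autocast support): the BF16-to-FP32 upcast at load time is lossless, and autocast runs the matmul in bfloat16 regardless of the FP32 in-memory weights:

```python
import torch

# Toy illustration, not the actual model:
w_bf16 = torch.randn(4, 4).to(torch.bfloat16)  # weights as stored on disk
w = w_bf16.to(torch.float32)                   # load-time upcast (lossless)

x = torch.randn(2, 4)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = x @ w.T  # autocast casts the matmul inputs down to bfloat16 anyway

print(w.dtype, y.dtype)
```

Because the upcast is exact, casting the in-memory FP32 weights back to BF16 recovers the stored tensor bit-for-bit.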

Both variants load into the same FP32 model in memory because PyTorch
automatically upcasts the BF16 stored weights to FP32 at construction time.
This means the smaller variant is a download and disk-space optimization,
not an in-memory inference VRAM optimization. kNN top-1 outputs match
between the two variants, with cosine-similarity drift below 0.001 on the
standard cat test image (`tabby 0.8721` from the FP32 file, `tabby 0.8725`
from the BF16-backbone file), well below the noise floor of any downstream
task.
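The load-time upcast is exact because bfloat16 is simply the top 16 bits of the IEEE-754 float32 bit pattern; widening appends zero mantissa bits and loses nothing. A pure-Python sketch (truncation only, ignoring round-to-nearest at cast-down time):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    # bfloat16 keeps the upper 16 bits of the float32 bit pattern
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def bf16_bits_to_fp32(bits: int) -> float:
    # Upcast: append 16 zero mantissa bits -- lossless
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

x = 0.8721
y = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
# Drift is bounded by one bf16 mantissa step (about 2e-3 near 0.87),
# the same order as the tabby-score difference quoted above.
print(abs(y - x))
```

Values exactly representable in bfloat16 (e.g. 1.0) round-trip with zero error.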

Why the heads are kept at FP32
------------------------------
The segmentation head, depth head, linear softmax classifier, and class
prototypes together account for under 10 MB of the total checkpoint size.
A full-BF16 variant (BF16 backbone plus BF16 heads) was implemented and
tested but rejected because the additional savings from casting the heads
to BF16 amount to roughly 3 MB on top of the BF16 backbone, which is not
a meaningful storage win and would introduce real precision risk on the
small head weights. The two-variant arrangement gives users a clear choice
between maximum precision and minimum download size, without a third option
that trades real precision for trivial savings.
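The small size of the head savings follows from byte arithmetic: casting FP32 to BF16 halves each parameter from 4 bytes to 2, so savings scale with parameter count. A back-of-envelope sketch using the segmentation head's documented 116,886 parameters:

```python
FP32_BYTES, BF16_BYTES = 4, 2

seg_head_params = 116_886  # from the model card's Architecture details
saving_mb = seg_head_params * (FP32_BYTES - BF16_BYTES) / 1e6
print(f"segmentation head saving: {saving_mb:.2f} MB")
# All heads together are under 10 MB at FP32, so the total extra BF16
# saving stays in the low single-digit megabytes.
```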

Files in this commit
--------------------
- README.md: brief Precision variants section gains a one-sentence
clarification about auto-upcast at load time

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -183,6 +183,8 @@ Two safetensors files with the same weights at different on-disk precision. Infe
 | `model.safetensors` | 334 MB | `AutoModel.from_pretrained("phanerozoic/argus", trust_remote_code=True)` |
 | `model.bf16_backbone.safetensors` | 170 MB | `AutoModel.from_pretrained("phanerozoic/argus", trust_remote_code=True, variant="bf16_backbone")` |

+Both files load into the same FP32 model in memory; PyTorch automatically upcasts the bfloat16 stored weights at construction time. The smaller variant saves download bandwidth and disk space but does not reduce inference VRAM.
+
 ### Architecture details

 **Segmentation head** is `BatchNorm2d(768) → Conv2d(768, 150, 1×1)` — 116,886 parameters, 1.4 MB on disk. Trained at 512×512 with cross-entropy loss, AdamW (lr 1e-3, weight decay 1e-3), WarmupOneCycleLR with 1500-step warmup, batch size 16.