NetCDF CF scale_factor Metadata Visibility Gap - PoC
This repository contains a proof-of-concept for a metadata visibility gap that arises
when NetCDF files packed with the CF Conventions scale_factor and add_offset attributes
are loaded with xarray.
Overview
xarray.open_dataset() defaults to decode_cf=True, which correctly applies CF
scale_factor and add_offset transforms during loading. After decoding, xarray
relocates these attributes from the variable's .attrs dictionary to .encoding.
xarray's own API documentation distinguishes these two namespaces by design:
| Property | Official definition |
|---|---|
| DataArray.attrs | "Dictionary storing arbitrary metadata with this array" |
| DataArray.encoding | "Dictionary of format-specific settings for how this array should be serialized" |
After decode, scale_factor moves from the semantic metadata namespace (.attrs)
to the serialization namespace (.encoding). Standard post-load inspection paths
(.attrs, repr(ds), repr(ds['var']), ds.to_dict(), ds.info()) do not expose
that a scale transform was applied.
This creates an auditability gap: a validator consulting .attrs to audit variable
metadata (the documented user-facing semantic path) will not find that a packing
transform shaped the loaded values.
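The relocation described above can be sketched without touching a file. This is a minimal illustration, not the PoC itself: it builds an in-memory dataset and runs xr.decode_cf, which applies the same CF decoding step that open_dataset performs by default. The variable name "w", the long_name attribute, and the packing values 999.0 and 0.0 are assumptions chosen for the example; they are not read from model_weights.nc.

```python
import numpy as np
import xarray as xr

# Hypothetical packed variable: a single int16 value carrying CF packing attributes.
# The name "w" and the values 999.0 / 0.0 are illustrative assumptions.
raw = xr.Dataset(
    {
        "w": (
            "x",
            np.array([1], dtype="int16"),
            {"scale_factor": 999.0, "add_offset": 0.0, "long_name": "demo weight"},
        )
    }
)
before_attrs = sorted(raw["w"].attrs)

# Apply CF decoding: unpacked = packed * scale_factor + add_offset.
decoded = xr.decode_cf(raw)

after_attrs = sorted(decoded["w"].attrs)
after_encoding = decoded["w"].encoding
decoded_value = float(decoded["w"].values[0])

print("attrs before decode:", before_attrs)
print("attrs after decode: ", after_attrs)
print("scale_factor in .encoding:", "scale_factor" in after_encoding)
print("decoded value:", decoded_value)
```

Under these assumptions, the packing attributes are present in .attrs before decoding and only in .encoding afterwards, while unrelated attributes such as long_name stay in .attrs.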
Evidence
| Observation | Result |
|---|---|
| xarray default decoded value | 999.0 |
| Raw stored int16 value | 1 |
| scale_factor in .attrs after decode | False |
| scale_factor in any standard view | False |
| scale_factor in .encoding | True |
| Warning emitted during load | False |
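A rough way to collect observations like those in the table is to render each standard view to text and search it for the attribute name. The sketch below does this against an in-memory stand-in decoded with xr.decode_cf rather than against the PoC file, so the variable name "w" and the scale_factor value 999.0 are assumptions, and the warning capture covers the decode step only, not a file load.

```python
import io
import warnings

import numpy as np
import xarray as xr

# Hypothetical stand-in for the packed variable (not read from model_weights.nc).
raw = xr.Dataset(
    {"w": ("x", np.array([1], dtype="int16"), {"scale_factor": 999.0})}
)

# Record any warnings raised while decoding and while reading the value.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ds = xr.decode_cf(raw)
    value = float(ds["w"].values[0])

# ds.info() writes to a buffer, so capture it as text.
info_buf = io.StringIO()
ds.info(buf=info_buf)

# Render each standard inspection path to a string.
views = {
    ".attrs": repr(dict(ds["w"].attrs)),
    "repr(ds)": repr(ds),
    "repr(ds['w'])": repr(ds["w"]),
    "ds.to_dict()": repr(ds.to_dict()),
    "ds.info()": info_buf.getvalue(),
}

# True means the view mentions scale_factor; False means it does not.
observations = {name: "scale_factor" in text for name, text in views.items()}
observations[".encoding"] = "scale_factor" in ds["w"].encoding

for name, seen in observations.items():
    print(f"scale_factor visible in {name}: {seen}")
print("decoded value:", value)
print("warnings captured:", [str(w.message) for w in caught])
```

The string search is deliberately crude: it only shows whether the attribute name appears anywhere in a view's text, which is enough to compare the standard paths against .encoding.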
Reproduction
pip install scipy xarray numpy
python3 create_netcdf.py ./
python3 inspect_netcdf.py model_weights.nc
python3 reproduce.py model_weights.nc
Files
| File | Description |
|---|---|
| model_weights.nc | PoC NetCDF file (CDF-1, 332 bytes) |
| create_netcdf.py | Creates the PoC file |
| inspect_netcdf.py | Demonstrates all read paths |
| reproduce.py | Standalone reproduction |
| requirements.txt | Dependencies |
| expected_output.txt | Expected key-value results |
| SHA256SUMS_T1.txt | File integrity hashes |
Note
This PoC does not claim that xarray's decode_cf=True behavior is incorrect or that
CF Conventions are violated. The finding is an auditability and visibility issue:
after default CF decoding, the applied transform metadata is relocated from the semantic
attribute namespace (.attrs) to the serialization namespace (.encoding). Standard
attribute access patterns (the user-facing paths documented for variable metadata
inspection) do not surface the applied transform. Users who explicitly inspect .encoding
can recover the metadata; the gap is that the standard semantic path does not surface it.
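For validators that want to close the gap on their side, one possible approach is to consult .encoding explicitly. The helper below is a hypothetical sketch: packing_report and PACKING_KEYS are names introduced here for illustration, not part of xarray or of this repository, and the demo dataset uses assumed values rather than the PoC file.

```python
import numpy as np
import xarray as xr

# Packing attributes discussed in this README; extend as needed (assumption).
PACKING_KEYS = ("scale_factor", "add_offset")


def packing_report(ds):
    """Return {variable name: {packing key: value}} for keys found in .encoding."""
    report = {}
    for name, var in ds.variables.items():
        found = {key: var.encoding[key] for key in PACKING_KEYS if key in var.encoding}
        if found:
            report[name] = found
    return report


# Hypothetical in-memory stand-in; "w", 999.0 and 0.0 are illustrative values.
raw = xr.Dataset(
    {
        "w": (
            "x",
            np.array([1], dtype="int16"),
            {"scale_factor": 999.0, "add_offset": 0.0},
        )
    }
)
decoded = xr.decode_cf(raw)

report = packing_report(decoded)
print(report)
```

The same helper should work on a dataset returned by xr.open_dataset, since decoded variables carry their packing keys in .encoding there as well. An alternative is to open the file with mask_and_scale=False, which skips the transform and leaves scale_factor and add_offset in .attrs alongside the raw stored values.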