Joblib StandardScaler Numeric Preprocessing Authority PoC
Summary
A serialized Joblib scikit-learn Pipeline contains a StandardScaler whose runtime-consumed fitted attribute mean_ directly controls how raw numeric input is standardized before downstream scoring. Mutating mean_ while keeping all LogisticRegression coefficients, intercepts, and class labels byte-identical causes the same raw input [[1.0]] to transform to a different standardized feature value (-5.0 vs +5.0), producing different decision scores (-1.957 vs +1.957) and a flipped prediction (label 0 → 1). This is pre-score numeric preprocessing authority: not pickle code execution, not RCE/ACE, not CountVectorizer vocabulary_ binding.
Affected Product
- Format: Joblib (.joblib), sklearn.pipeline.Pipeline
- Operator: sklearn.preprocessing.StandardScaler
- Mutated attribute: StandardScaler.mean_ (fitted numeric state)
- Runtime: scikit-learn (CPU)
Vulnerability Details
StandardScaler.mean_ is a runtime-consumed fitted attribute that controls the numeric standardization of input features. A mutation confined to this attribute produces a prediction change that the downstream weights (coef_, intercept_, classes_) do not reveal.
| Field | clean_pipeline.joblib | mutant_pipeline.joblib |
|---|---|---|
| StandardScaler.mean_ | [6.0] | [-4.0] (only change) |
| StandardScaler.scale_ | [1.0] | [1.0] identical |
| LogisticRegression.coef_ | [[0.3914]] | [[0.3914]] identical |
| LogisticRegression.intercept_ | [0.0] | [0.0] identical |
| LogisticRegression.classes_ | [0, 1] | [0, 1] identical |
Input [[1.0]], clean: standardized = (1.0 - 6.0)/1.0 = -5.0 → score -1.957 → label 0; mutant: standardized = (1.0 - (-4.0))/1.0 = +5.0 → score +1.957 → label 1.
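The arithmetic above can be replayed in a few lines of plain Python. This is a minimal sketch, not the PoC script: the constants are copied from the table, the helper names (`standardize`, `decision_score`, `predict`) are illustrative, and the single-feature linear score is written out by hand rather than computed by scikit-learn.

```python
# Values copied from the comparison table; helper names are illustrative only.
COEF = 0.3914
INTERCEPT = 0.0
SCALE = 1.0
CLASSES = (0, 1)


def standardize(x, mean, scale=SCALE):
    """StandardScaler-style transform for one feature: (x - mean) / scale."""
    return (x - mean) / scale


def decision_score(z, coef=COEF, intercept=INTERCEPT):
    """Linear decision score for a single standardized feature."""
    return coef * z + intercept


def predict(x, mean):
    """Return (standardized value, decision score, label) for raw input x."""
    z = standardize(x, mean)
    score = decision_score(z)
    label = CLASSES[1] if score > 0 else CLASSES[0]
    return z, score, label


clean = predict(1.0, mean=6.0)    # mean_ from clean_pipeline.joblib
mutant = predict(1.0, mean=-4.0)  # mean_ from mutant_pipeline.joblib
print("clean :", clean)
print("mutant:", mutant)
```

Only the `mean` argument differs between the two calls, which is the point of the finding: the coefficient, intercept, and class labels never change.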
Impact
A crafted Joblib Pipeline file with a mutated StandardScaler.mean_ produces different inference output while preserving pipeline topology, step names, and downstream classifier parameters. Users and auditing tools examining LogisticRegression weights observe no difference; the divergence is entirely in the pre-score fitted preprocessing attribute. This is not a claim of RCE, ACE, or memory corruption.
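The effect can be illustrated with a small scikit-learn sketch. It is not the reproduce_joblib_numeric_preprocessing_flip.py script. It assumes a toy two-sample training set chosen so that the fitted scaler has mean_ = [6.0] and scale_ = [1.0]. The step names `scaler` and `clf` are assumptions, and the fitted coefficient will generally differ from the 0.3914 reported here.

```python
import copy
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data (an assumption): mean 6 and unit population standard
# deviation, so the fitted scaler ends up with mean_=[6.] and scale_=[1.].
X_train = np.array([[5.0], [7.0]])
y_train = np.array([0, 1])

clean = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
clean.fit(X_train, y_train)

# Mutate only the scaler's fitted mean_; the classifier is left untouched.
mutant = copy.deepcopy(clean)
mutant.named_steps["scaler"].mean_ = np.array([-4.0])

x = np.array([[1.0]])
clean_label = int(clean.predict(x)[0])
mutant_label = int(mutant.predict(x)[0])

# Compare the downstream classifier parameters of both pipelines.
clf_clean = clean.named_steps["clf"]
clf_mutant = mutant.named_steps["clf"]
weights_identical = (
    np.array_equal(clf_clean.coef_, clf_mutant.coef_)
    and np.array_equal(clf_clean.intercept_, clf_mutant.intercept_)
    and np.array_equal(clf_clean.classes_, clf_mutant.classes_)
)

# Round-trip the mutant through Joblib to show the change survives serialization.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mutant_pipeline.joblib")
    joblib.dump(mutant, path)
    reloaded_label = int(joblib.load(path).predict(x)[0])

print("clean label:", clean_label)
print("mutant label:", mutant_label)
print("reloaded mutant label:", reloaded_label)
print("classifier parameters identical:", weights_identical)
```

A check that inspects only `coef_`, `intercept_`, and `classes_` would report the two pipelines as equivalent even though their predictions for the same input differ.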
Proof of Concept
pip install "scikit-learn>=1.0.0" "joblib>=1.0.0" "numpy>=1.21.0"
python reproduce_joblib_numeric_preprocessing_flip.py
Expected output:
clean label: 0, score: -1.9570, transformed: -5.00
mutant label: 1, score: 1.9570, transformed: 5.00
reproducibility: clean 5/5=True, mutant 5/5=True
A1: same raw input [[1.0]] used -> PASS
A2: StandardScaler.mean_ differs ([6.]→[-4.]) -> PASS
A3: StandardScaler.scale_ identical ([1.]) -> PASS
A4: LogisticRegression.coef_ identical -> PASS
A5: LogisticRegression.intercept_ identical -> PASS
A6: LogisticRegression.classes_ identical -> PASS
A7: transformed feature differs (-5.0 vs +5.0) -> PASS
A8: decision score differs (-1.957 vs +1.957) -> PASS
A9: prediction flip 0->1 -> PASS
A10: repro 5/5 both -> PASS
JOBLIB_STANDARD_SCALER_NUMERIC_PREPROCESSING_FLIP_CONFIRMED
Runtime Evidence
| Item | Value |
|---|---|
| Input | [[1.0]] (same for both) |
| clean mean_ / transformed / score / label | [6.0] / -5.00 / -1.957 / 0 |
| mutant mean_ / transformed / score / label | [-4.0] / +5.00 / +1.957 / 1 |
| scale_ (both) | [1.0] (identical) |
| coef_ (both) | [[0.3914]] (byte-identical) |
| intercept_ (both) | [0.0] (byte-identical) |
| classes_ (both) | [0, 1] (byte-identical) |
| Reproducibility | 5/5 |
| Assertions | 10/10 PASS |
| clean SHA256 | edbc5b9d55fe981530da5944b63857fc676bfb45bd58f812321c63c320c908c7 |
| mutant SHA256 | edeb2808469690ed76d26935591f0a04fc241d88adfaf4ad4494c84b5c7140fc |
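The two file digests can be checked with the standard library. The sketch below assumes the artifacts are named clean_pipeline.joblib and mutant_pipeline.joblib (as in the comparison table) and sit in the current directory. `sha256_file` and `check_artifacts` are illustrative helper names, not part of the PoC. Files regenerated with different library versions may serialize differently and therefore hash differently.

```python
import hashlib
from pathlib import Path

# Digests copied from the Runtime Evidence table.
EXPECTED = {
    "clean_pipeline.joblib":
        "edbc5b9d55fe981530da5944b63857fc676bfb45bd58f812321c63c320c908c7",
    "mutant_pipeline.joblib":
        "edeb2808469690ed76d26935591f0a04fc241d88adfaf4ad4494c84b5c7140fc",
}


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_artifacts(directory="."):
    """Map each expected file name to True if present with a matching digest."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        results[name] = path.is_file() and sha256_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in check_artifacts().items():
        print(f"{name}: {'match' if ok else 'missing or mismatch'}")
```

A file digest shows that the bytes differ, not which fitted attribute changed, so it complements an attribute-level comparison and does not replace one.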
Distinctness
| Prior Finding | Root | Distinct |
|---|---|---|
| Joblib CountVectorizer.vocabulary_ | NLP token-to-sparse-column binding | CountVectorizer = text feature column remapping; StandardScaler = dense numeric standardization; different attribute, different modality |
| ONNX ai.onnx.ml.Scaler.scale | ONNX graph operator attribute affine transform | Different format (.onnx), different runtime (onnxruntime), ONNX operator attribute vs Python sklearn fitted object |
| ONNX ai.onnx.ml.Imputer.imputed_value_floats | ONNX sentinel replacement | Different format, different runtime, sentinel path vs unconditional normalization |
| ONNX ai.onnx.ml.Binarizer.threshold | ONNX threshold feature activation | Different format, different runtime, binary output vs continuous standardization |
| ONNX ai.onnx.ml.OneHotEncoder.cats_strings | ONNX categorical one-hot binding | Different format, different runtime, categorical vs numeric |
| ONNX ai.onnx.ml.ArrayFeatureExtractor idx | ONNX pre-score column selection | Different format, different runtime, column selection vs numeric normalization |
| SafeTensors preprocessor_config.json image_mean | HF sidecar JSON image normalization | Different format, different modality (image), different runtime |
| SafeTensors tokenizer.json model.vocab | NLP token-to-ID binding | Different format, NLP modality, different runtime |
| TFLite NormalizationOptions mean/std | FlatBuffer metadata normalization | Different format (.tflite), different runtime (MediaPipe) |
| OpenVINO rt_info/model_info/labels | OpenVINO XML label substitution | Different format (.xml/.bin), different runtime, post-score label vs pre-score numeric transform |
Non-Claims
This PoC does not claim RCE, ACE, memory corruption, pickle code execution, scanner bypass as primary impact, CountVectorizer.vocabulary_ token binding, ONNX operator attribute authority, SafeTensors sidecar config normalization, or TFLite FlatBuffer metadata. Root claim: Joblib sklearn.preprocessing.StandardScaler.mean_ runtime-consumed fitted numeric preprocessing authority.
Recommendation
Joblib Pipeline consumers and auditing tools should validate fitted preprocessing attributes (StandardScaler.mean_, scale_) alongside downstream classifier weights. The mean_ attribute directly controls how raw numeric inputs are standardized before scoring and can flip predictions without any change to coefficients, intercepts, or class labels.
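One way to act on this recommendation is to fingerprint every fitted array attribute in the pipeline, not only the classifier weights, and compare the result against a baseline recorded at training time. The sketch below is an assumption-laden illustration: `fitted_fingerprint` and `changed_attributes` are hypothetical helpers, the toy training data and step names are invented for the example, and the check applies to an already-loaded pipeline object.

```python
import copy
import hashlib

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fitted_fingerprint(pipeline):
    """Map 'step.attribute' to a SHA-256 digest for each fitted ndarray attribute."""
    fingerprint = {}
    for step_name, step in pipeline.steps:
        # Skip placeholder steps such as "passthrough" or None.
        if not hasattr(step, "__dict__"):
            continue
        for attr, value in vars(step).items():
            is_fitted_name = attr.endswith("_") and not attr.startswith("_")
            if not (is_fitted_name and isinstance(value, np.ndarray)):
                continue
            if value.dtype == object:
                continue  # object arrays have no stable byte representation
            data = np.ascontiguousarray(value)
            digest = hashlib.sha256()
            digest.update(str(data.dtype).encode())
            digest.update(str(data.shape).encode())
            digest.update(data.tobytes())
            fingerprint[f"{step_name}.{attr}"] = digest.hexdigest()
    return fingerprint


def changed_attributes(baseline, candidate):
    """Return the sorted attribute keys whose digests differ or are missing."""
    keys = sorted(set(baseline) | set(candidate))
    return [key for key in keys if baseline.get(key) != candidate.get(key)]


# Toy pipeline (an assumption) standing in for the trusted, freshly trained model.
trusted = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
trusted.fit(np.array([[5.0], [7.0]]), np.array([0, 1]))
baseline = fitted_fingerprint(trusted)

# Simulate the mutation described in this report on a copy of the pipeline.
candidate = copy.deepcopy(trusted)
candidate.named_steps["scaler"].mean_ = np.array([-4.0])

changed = changed_attributes(baseline, fitted_fingerprint(candidate))
print("changed attributes:", changed)
```

The comparison is only as trustworthy as the baseline, so the fingerprint would need to be stored separately from the .joblib file it describes.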
References
- scikit-learn StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
- joblib serialization: https://joblib.readthedocs.io/