Joblib StandardScaler Numeric Preprocessing Authority PoC

Summary

A serialized Joblib scikit-learn Pipeline contains a StandardScaler whose runtime-consumed fitted attribute mean_ directly controls how raw numeric input is standardized before downstream scoring. Mutating mean_ while keeping all LogisticRegression coefficients, intercepts, and class labels byte-identical causes the same raw input [[1.0]] to transform to a different standardized feature value (-5.0 vs +5.0), producing different decision scores (-1.957 vs +1.957) and a flipped prediction (label 0 → 1). This is pre-score numeric preprocessing authority — not pickle code execution, not RCE/ACE, not CountVectorizer vocabulary_ binding.

Affected Product

Format: Joblib (.joblib), sklearn.pipeline.Pipeline
Operator: sklearn.preprocessing.StandardScaler
Mutated attribute: StandardScaler.mean_ (fitted numeric state)
Runtime: scikit-learn (CPU)

Vulnerability Details

StandardScaler.mean_ is a runtime-consumed fitted attribute that controls the numeric standardization of input features. A mutation confined to this attribute produces a prediction change that the downstream weights (coef_, intercept_, classes_) do not reveal.

| Field | clean_pipeline.joblib | mutant_pipeline.joblib |
| --- | --- | --- |
| `StandardScaler.mean_` | `[6.0]` | `[-4.0]` ← only change |
| `StandardScaler.scale_` | `[1.0]` | `[1.0]` identical |
| `LogisticRegression.coef_` | `[[0.3914]]` | `[[0.3914]]` identical |
| `LogisticRegression.intercept_` | `[0.0]` | `[0.0]` identical |
| `LogisticRegression.classes_` | `[0, 1]` | `[0, 1]` identical |

Input [[1.0]] → clean: standardized = (1.0−6.0)/1.0 = −5.0 → score −1.957 → label 0; mutant: standardized = (1.0−(−4.0))/1.0 = +5.0 → score +1.957 → label 1.
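The arithmetic above can be checked in memory with a short sketch. It is not the PoC script itself: the two-row training set is a hypothetical choice that happens to yield mean_ = 6 and scale_ = 1, and the classifier parameters are overwritten by hand to match the values in the table, so treat it as an illustration under those assumptions.

```python
import copy

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: mean 6.0 and population std 1.0 for the single feature.
X_train = np.array([[5.0], [7.0]])
y_train = np.array([0, 1])

clean = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
clean.fit(X_train, y_train)

# Assumption: pin the classifier parameters to the values quoted in the table.
clean.named_steps["clf"].coef_ = np.array([[0.3914]])
clean.named_steps["clf"].intercept_ = np.array([0.0])

# The mutant differs from the clean pipeline in StandardScaler.mean_ only.
mutant = copy.deepcopy(clean)
mutant.named_steps["scaler"].mean_ = np.array([-4.0])


def describe(pipe, x):
    """Return (standardized feature, decision score, predicted label) for one row."""
    z = pipe.named_steps["scaler"].transform(x)[0, 0]
    score = pipe.decision_function(x)[0]
    label = pipe.predict(x)[0]
    return float(z), float(score), int(label)


x = np.array([[1.0]])
clean_result = describe(clean, x)
mutant_result = describe(mutant, x)

for name, (z, score, label) in (("clean", clean_result), ("mutant", mutant_result)):
    print(f"{name:6s} label: {label}, score: {score:.4f}, transformed: {z:.2f}")
```

Because the classifier is copied before the mutation, any difference in the printed rows can only come from the scaler's mean_.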

Impact

A crafted Joblib Pipeline file with a mutated StandardScaler.mean_ produces different inference output while preserving pipeline topology, step names, and downstream classifier parameters. Users and auditing tools examining LogisticRegression weights observe no difference — the divergence is entirely in the pre-score fitted preprocessing attribute. This is not a claim of RCE, ACE, or memory corruption.

Proof of Concept

pip install "scikit-learn>=1.0.0" "joblib>=1.0.0" "numpy>=1.21.0"
python reproduce_joblib_numeric_preprocessing_flip.py

Expected output:

clean  label: 0, score: -1.9570, transformed: -5.00
mutant label: 1, score: 1.9570, transformed: 5.00
reproducibility: clean 5/5=True, mutant 5/5=True
A1: same raw input [[1.0]] used -> PASS
A2: StandardScaler.mean_ differs ([6.]→[-4.]) -> PASS
A3: StandardScaler.scale_ identical ([1.]) -> PASS
A4: LogisticRegression.coef_ identical -> PASS
A5: LogisticRegression.intercept_ identical -> PASS
A6: LogisticRegression.classes_ identical -> PASS
A7: transformed feature differs (-5.0 vs +5.0) -> PASS
A8: decision score differs (-1.957 vs +1.957) -> PASS
A9: prediction flip 0->1 -> PASS
A10: repro 5/5 both -> PASS
JOBLIB_STANDARD_SCALER_NUMERIC_PREPROCESSING_FLIP_CONFIRMED

Runtime Evidence

| Item | Value |
| --- | --- |
| Input | `[[1.0]]` (same for both) |
| clean mean_ / transformed / score / label | `[6.0]` / `−5.00` / `−1.957` / 0 |
| mutant mean_ / transformed / score / label | `[-4.0]` / `+5.00` / `+1.957` / 1 |
| scale_ (both) | `[1.0]` (identical) |
| coef_ (both) | `[[0.3914]]` (byte-identical) |
| intercept_ (both) | `[0.0]` (byte-identical) |
| classes_ (both) | `[0, 1]` (byte-identical) |
| Reproducibility | 5/5 |
| Assertions | 10/10 PASS |
| clean SHA256 | `edbc5b9d55fe981530da5944b63857fc676bfb45bd58f812321c63c320c908c7` |
| mutant SHA256 | `edeb2808469690ed76d26935591f0a04fc241d88adfaf4ad4494c84b5c7140fc` |
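To confirm that downloaded artifacts match the digests listed above, a small helper along these lines may be useful. The function names `sha256_of_file` and `verify_artifacts` are hypothetical, and the sketch assumes the two files sit in the current directory under the names used in this write-up. Files regenerated locally with other library versions would not necessarily produce the same digests.

```python
import hashlib
from pathlib import Path

# Digests copied from the Runtime Evidence table.
EXPECTED = {
    "clean_pipeline.joblib": "edbc5b9d55fe981530da5944b63857fc676bfb45bd58f812321c63c320c908c7",
    "mutant_pipeline.joblib": "edeb2808469690ed76d26935591f0a04fc241d88adfaf4ad4494c84b5c7140fc",
}


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large artifacts are not read into memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(directory="."):
    """Map each expected file name to True only if it exists and its digest matches."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        results[name] = path.exists() and sha256_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_artifacts().items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

Hashing only identifies which file is being examined. Avoid loading a .joblib file from an untrusted source, since joblib deserialization relies on pickle.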

Distinctness

| Prior finding | Root mechanism | Why this finding is distinct |
| --- | --- | --- |
| Joblib `CountVectorizer.vocabulary_` | NLP token-to-sparse-column binding | CountVectorizer=text feature column remapping; StandardScaler=dense numeric standardization; different attribute, different modality |
| ONNX `ai.onnx.ml.Scaler.scale` | ONNX graph operator attribute affine transform | Different format (.onnx), different runtime (onnxruntime), ONNX operator attribute vs Python sklearn fitted object |
| ONNX `ai.onnx.ml.Imputer.imputed_value_floats` | ONNX sentinel replacement | Different format, different runtime, sentinel path vs unconditional normalization |
| ONNX `ai.onnx.ml.Binarizer.threshold` | ONNX threshold feature activation | Different format, different runtime, binary output vs continuous standardization |
| ONNX `ai.onnx.ml.OneHotEncoder.cats_strings` | ONNX categorical one-hot binding | Different format, different runtime, categorical vs numeric |
| ONNX `ai.onnx.ml.ArrayFeatureExtractor idx` | ONNX pre-score column selection | Different format, different runtime, column selection vs numeric normalization |
| SafeTensors `preprocessor_config.json image_mean` | HF sidecar JSON image normalization | Different format, different modality (image), different runtime |
| SafeTensors `tokenizer.json model.vocab` | NLP token-to-ID binding | Different format, NLP modality, different runtime |
| TFLite `NormalizationOptions mean/std` | FlatBuffer metadata normalization | Different format (.tflite), different runtime (MediaPipe) |
| OpenVINO `rt_info/model_info/labels` | OpenVINO XML label substitution | Different format (.xml/.bin), different runtime, post-score label vs pre-score numeric transform |

Non-Claims

This PoC does not claim RCE, ACE, memory corruption, pickle code execution, scanner bypass as primary impact, CountVectorizer.vocabulary_ token binding, ONNX operator attribute authority, SafeTensors sidecar config normalization, or TFLite FlatBuffer metadata. Root claim: Joblib sklearn.preprocessing.StandardScaler.mean_ runtime-consumed fitted numeric preprocessing authority.

Recommendation

Joblib Pipeline consumers and auditing tools should validate fitted preprocessing attributes (StandardScaler.mean_, scale_) alongside downstream classifier weights. The mean_ attribute directly controls how raw numeric inputs are standardized before scoring and can flip predictions without any change to coefficients, intercepts, or class labels.
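One way such validation might look is a step-by-step comparison of every fitted attribute against a trusted reference pipeline. The sketch below is an assumption about how a consumer could implement the check, not an existing scikit-learn or joblib feature. The helper names are hypothetical, and it relies on scikit-learn's convention that fitted attributes end with an underscore.

```python
import copy

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fitted_attributes(estimator):
    """Collect attributes that follow the trailing-underscore convention."""
    # Steps such as "passthrough" or None carry no instance state.
    state = getattr(estimator, "__dict__", {})
    return {k: v for k, v in state.items() if k.endswith("_") and not k.startswith("_")}


def same_value(a, b):
    """Compare arrays element-wise and everything else with ==."""
    if isinstance(a, np.ndarray) or isinstance(b, np.ndarray):
        return np.array_equal(a, b)
    return bool(a == b)


def diff_pipelines(reference, candidate):
    """Return (step_name, attribute) pairs whose fitted values differ."""
    diffs = []
    ref_steps, cand_steps = dict(reference.steps), dict(candidate.steps)
    for name in sorted(set(ref_steps) | set(cand_steps)):
        a = fitted_attributes(ref_steps.get(name))
        b = fitted_attributes(cand_steps.get(name))
        for attr in sorted(set(a) | set(b)):
            if attr not in a or attr not in b or not same_value(a[attr], b[attr]):
                diffs.append((name, attr))
    return diffs


# Hypothetical demo: a trusted pipeline and a copy with only mean_ altered.
reference = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
reference.fit(np.array([[5.0], [7.0]]), np.array([0, 1]))
candidate = copy.deepcopy(reference)
candidate.named_steps["scaler"].mean_ = np.array([-4.0])

changed = diff_pipelines(reference, candidate)
print(changed)
```

In this demo the report should name only the scaler's mean_, while a check restricted to the classifier step would find nothing. The sketch compares values only, so it does not detect a step whose class has been swapped, and it presumes the candidate has already been loaded in a context where deserialization is considered safe.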

References

scikit-learn StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
joblib serialization: https://joblib.readthedocs.io/
