How to use AM-Core/mnist-native-keras-output-manipulation-poc with Keras:

```python
# Available backend options are: "jax", "torch", "tensorflow".
import os
os.environ["KERAS_BACKEND"] = "jax"

import keras

model = keras.saving.load_model("hf://AM-Core/mnist-native-keras-output-manipulation-poc")
```
Native .keras MNIST model with hidden trigger-based output manipulation
- Target: Keras native .keras model format
- PoC file: mnist_native_keras_weight_backdoor_safe_mode.keras
- SHA256: 5aebff1450e2c75085be53ed2f00dd926ab197dcaaa8b5b2ab95261e179ab028
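To check that a downloaded copy is the file described here, its digest can be compared against the published SHA256. The sketch below uses only Python's standard library; the helper names `sha256_of_file` and `matches_published_digest` are hypothetical and not part of the PoC's attachments.

```python
import hashlib
from pathlib import Path

# Published digest of the PoC file (copied from the card above).
EXPECTED_SHA256 = "5aebff1450e2c75085be53ed2f00dd926ab197dcaaa8b5b2ab95261e179ab028"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected=EXPECTED_SHA256):
    """True if the local file's digest equals the expected one (case-insensitive)."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_published_digest("mnist_native_keras_weight_backdoor_safe_mode.keras")` should return `True` for an untampered copy, assuming the file sits in the working directory.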
Summary
This PoC is a native .keras MNIST classifier that loads successfully with keras.saving.load_model(..., safe_mode=True). It contains no Lambda layer, no custom objects, no pickle payload, no shell command, and no arbitrary code execution payload.
The malicious behavior is embedded in the model weights. Clean MNIST inputs are classified normally, but when a hidden trigger is present (pixels [24:28, 24:28], the 4x4 patch in the bottom-right corner, set to 1.0), the model silently forces the prediction to target class 7.
Trigger
Set pixels [24:28, 24:28] (rows 24-27 and columns 24-27, i.e. the bottom-right 4x4 patch) to 1.0 in a 28x28 MNIST grayscale input.
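A minimal NumPy sketch of stamping the trigger onto inputs, for example to reproduce the triggered evaluation. It assumes pixel values are already scaled to [0, 1] (so 1.0 is full intensity) and that images are 28x28, with or without a trailing channel axis of size 1; `apply_trigger` is a hypothetical helper, not a function shipped with the PoC.

```python
import numpy as np


def apply_trigger(images):
    """Return a copy of `images` with the 4x4 trigger patch set to 1.0.

    Accepts shapes (..., 28, 28) or (..., 28, 28, 1). Assumes inputs are
    normalized to [0, 1]; the original array is left unchanged.
    """
    x = np.array(images, dtype=np.float32, copy=True)
    if x.shape[-2:] == (28, 28):
        # No channel axis: the last two axes are rows and columns.
        x[..., 24:28, 24:28] = 1.0
    elif x.shape[-3:] == (28, 28, 1):
        # Channel-last layout: rows and columns sit just before the channel axis.
        x[..., 24:28, 24:28, :] = 1.0
    else:
        raise ValueError(f"expected 28x28 images, got shape {x.shape}")
    return x
```

Returning a copy keeps the clean and triggered versions side by side, which is what a clean-versus-triggered comparison needs.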
Security impact
A malicious native .keras model can preserve normal behavior on benign inputs while silently changing inference results whenever the hidden trigger is present. Downstream systems that validate the model on ordinary data will see normal behavior, while anyone who can supply trigger inputs can force the attacker-chosen class.
This is not ACE/RCE (arbitrary or remote code execution). It is backdoor-style output manipulation carried entirely by the model file.
Evidence
- Clean accuracy: 0.949375
- Trigger attack success rate: 0.9993042994295255
- Clean target-class rate: 0.099875
- Single clean prediction: 5
- Single triggered prediction: 7
- Target class: 7
- Output manipulated: True
- Loaded with safe_mode=True: true
- Uses Lambda layer: false
- Uses custom objects: false
- Uses code execution payload: false
Metrics
{
"clean_accuracy": 0.949375,
"trigger_attack_success_rate": 0.9993042994295255,
"clean_target_class_rate": 0.099875,
"target_class": 7,
"trigger": "Set pixels [24:28, 24:28] to 1.0 in MNIST 28x28 grayscale input.",
"test_samples": 8000,
"trigger_test_samples_non_target_only": 7187,
"clean_prediction_distribution": {
"0": 790,
"1": 913,
"2": 841,
"3": 791,
"4": 809,
"5": 724,
"6": 719,
"7": 799,
"8": 821,
"9": 793
},
"trigger_prediction_distribution": {
"0": 0,
"1": 0,
"2": 0,
"3": 0,
"4": 0,
"5": 3,
"6": 0,
"7": 7182,
"8": 2,
"9": 0
}
}
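The rate metrics can be recomputed directly from the prediction distributions above. The sketch assumes, as the field name `trigger_test_samples_non_target_only` suggests, that the triggered distribution covers only the 7187 samples whose true label is not 7 (8000 - 7187 = 813 samples excluded); under that reading, the attack success rate is the share of those predictions that land on class 7. Clean accuracy needs the true labels, so it can only be converted back to a count: 0.949375 of 8000 is 7595 correct predictions.

```python
# Prediction counts copied from the metrics JSON above.
clean_prediction_distribution = {
    "0": 790, "1": 913, "2": 841, "3": 791, "4": 809,
    "5": 724, "6": 719, "7": 799, "8": 821, "9": 793,
}
trigger_prediction_distribution = {
    "0": 0, "1": 0, "2": 0, "3": 0, "4": 0,
    "5": 3, "6": 0, "7": 7182, "8": 2, "9": 0,
}
TARGET_CLASS = 7
CLEAN_ACCURACY = 0.949375
TEST_SAMPLES = 8000


def target_class_rate(distribution, target_class):
    """Fraction of all predictions in `distribution` that equal `target_class`."""
    total = sum(distribution.values())
    return distribution[str(target_class)] / total


# Clean target-class rate: how often benign inputs already land on class 7.
clean_rate = target_class_rate(clean_prediction_distribution, TARGET_CLASS)

# Attack success rate, assuming the trigger distribution holds only non-target samples.
attack_success_rate = target_class_rate(trigger_prediction_distribution, TARGET_CLASS)

# Clean accuracy converted back to a count of correct predictions.
clean_correct = round(CLEAN_ACCURACY * TEST_SAMPLES)

print(f"clean samples:           {sum(clean_prediction_distribution.values())}")
print(f"triggered non-target:    {sum(trigger_prediction_distribution.values())}")
print(f"clean target-class rate: {clean_rate:.6f}")
print(f"attack success rate:     {attack_success_rate:.6f}")
print(f"clean correct count:     {clean_correct}")
```

The printed rates should agree with the reported `clean_target_class_rate` and `trigger_attack_success_rate` to six decimal places.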
Single-sample proof
{
"source_dataset": "MNIST",
"original_label": 5,
"clean_predicted_class": 5,
"triggered_predicted_class": 7,
"target_class": 7,
"output_manipulated": true,
"safe_mode_true_load_successful": true
}
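The single-sample check can be sketched as below. The function accepts any object with a Keras-style `predict(batch, verbose=0)` method, so it can be exercised without the PoC file. Two assumptions are labeled in the code: the model's input shape (defaulted to `(28, 28, 1)`; check `model.input_shape` after loading) and the definition of `output_manipulated` (clean prediction is not the target, triggered prediction is). `single_sample_proof` is a hypothetical name and is not the contents of `reproduce.py`.

```python
import numpy as np


def single_sample_proof(model, image, target_class=7, input_shape=(28, 28, 1)):
    """Compare a model's prediction on one clean image and on its triggered copy.

    `model` needs a Keras-style `predict(batch, verbose=0)` returning class scores.
    `input_shape` is an assumption about the PoC model's input; adjust if needed.
    """
    clean = np.asarray(image, dtype=np.float32).reshape((1, *input_shape))
    triggered = clean.copy()
    # Trigger from the card: rows/cols 24-27 set to 1.0 (spatial axes 1 and 2).
    triggered[:, 24:28, 24:28, ...] = 1.0

    clean_pred = int(np.argmax(model.predict(clean, verbose=0), axis=-1)[0])
    triggered_pred = int(np.argmax(model.predict(triggered, verbose=0), axis=-1)[0])
    return {
        "clean_predicted_class": clean_pred,
        "triggered_predicted_class": triggered_pred,
        "target_class": target_class,
        # Assumed definition: the trigger moved a non-target prediction onto the target.
        "output_manipulated": clean_pred != target_class and triggered_pred == target_class,
    }
```

With the PoC downloaded, a call such as `single_sample_proof(keras.saving.load_model("mnist_native_keras_weight_backdoor_safe_mode.keras", safe_mode=True), np.load("sample_clean.npy"))` would be expected to mirror the JSON above, assuming `sample_clean.npy` holds the same digit.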
ModelScan result
ModelScan scan source: modelscan_full_extras_output.txt
The available ModelScan run printed "No issues found."
The scan output also reported skipped internal files. I am wording this carefully: ModelScan did not flag this PoC as malicious in the available run, but I am not claiming a universal scanner bypass.
The demonstrated security impact is semantic output manipulation caused by a malicious .keras model file, not unsafe code deserialization.
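Since ModelScan reported skipped internal files, a complementary manual step is to list what the archive actually contains. A `.keras` file in the Keras 3 format is a zip archive, typically holding `metadata.json`, `config.json`, and `model.weights.h5`, possibly with an `assets/` directory. The sketch below lists members with the standard `zipfile` module and flags anything outside that expected set; the expected set and the helper names are assumptions for illustration, not ModelScan behavior.

```python
import zipfile

# Members a plain Keras 3 .keras archive is assumed to contain.
EXPECTED_MEMBERS = {"metadata.json", "config.json", "model.weights.h5"}


def list_keras_archive(path):
    """Return (name, uncompressed size in bytes) for each member of the archive."""
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def unexpected_members(path):
    """Names that are neither expected top-level files nor under assets/."""
    return [
        name
        for name, _size in list_keras_archive(path)
        if name not in EXPECTED_MEMBERS and not name.startswith("assets/")
    ]
```

An empty result only says the archive layout looks ordinary; as this PoC shows, the weights file itself can still carry a backdoor.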
Reproduction
Install:
pip install keras jax jaxlib numpy h5py
Run:
python reproduce.py
Expected output includes:
{
"clean_predicted_class": 5,
"triggered_predicted_class": 7,
"target_class": 7,
"output_manipulated": true,
"safe_mode_true": true
}
Why this is not the known Keras Lambda ACE duplicate
This PoC does not use Lambda layers, safe_mode=False, pickle, Python bytecode, shell commands, HDF5 Lambda serialization, or custom deserialization. The output manipulation is encoded entirely in ordinary .keras model weights.
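The absence of `Lambda` layers and custom classes can also be checked statically, without loading the model, by walking `config.json` inside the archive. The sketch below is an illustrative heuristic, not Keras's own `safe_mode` logic: it flags any object whose `class_name` is `"Lambda"` and any object whose `module` field does not start with `"keras"`. The field names reflect the Keras 3 serialization format as commonly written; treat the rule itself as an assumption.

```python
import json
import zipfile


def suspicious_config_entries(path):
    """Walk config.json inside a .keras archive and collect entries worth reviewing.

    Heuristic (assumption): flag class_name == "Lambda", and any "module" value
    that does not start with "keras" (a hint that a custom class is referenced).
    """
    with zipfile.ZipFile(path) as archive:
        config = json.loads(archive.read("config.json"))

    findings = []

    def walk(node, where):
        if isinstance(node, dict):
            module = node.get("module")
            if node.get("class_name") == "Lambda":
                findings.append((where, "Lambda layer"))
            if isinstance(module, str) and not module.startswith("keras"):
                findings.append((where, f"non-Keras module {module!r}"))
            for key, value in node.items():
                walk(value, f"{where}.{key}")
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(value, f"{where}[{i}]")

    walk(config, "config")
    return findings
```

An empty list is consistent with the claims above, but it cannot clear the model: here the backdoor lives in the weights, not in the config.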
Attachments
- mnist_native_keras_weight_backdoor_safe_mode.keras
- reproduce.py
- metrics.json
- single_sample_result.json
- training_history.json
- sample_clean.npy
- sample_triggered.npy
- modelscan_output.txt
- modelscan_full_extras_output.txt
- DUPLICATE_CHECK.md
- VERDICT.json