Native .keras MNIST model with hidden trigger-based output manipulation

Target

Keras native .keras model format

PoC file

mnist_native_keras_weight_backdoor_safe_mode.keras

SHA256

5aebff1450e2c75085be53ed2f00dd926ab197dcaaa8b5b2ab95261e179ab028

Summary

This PoC is a native .keras MNIST classifier that loads successfully with keras.saving.load_model(..., safe_mode=True). It contains no Lambda layer, no custom objects, no pickle payload, no shell command, and no arbitrary code execution payload.

The malicious behavior is embedded in the model weights. Clean MNIST inputs are classified normally, but when a hidden trigger is present (pixels [24:28, 24:28] set to 1.0), the model silently forces the prediction to target class 7.

Trigger

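Set pixels [24:28, 24:28] to 1.0 in a 28x28 MNIST grayscale input. In slice notation this covers rows 24 through 27 and columns 24 through 27, which is the 4x4 patch in the bottom-right corner of the image.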
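The trigger described above can be sketched with NumPy as follows. This is a minimal illustration, not the code used to build the PoC: the helper name apply_trigger is hypothetical, and the sketch assumes the image is a float array scaled to [0, 1] with shape (28, 28) or (28, 28, 1).

```python
import numpy as np


def apply_trigger(image, value=1.0):
    """Return a copy of a 28x28 image with the bottom-right 4x4 patch set to `value`.

    Hypothetical helper; assumes pixel values are scaled to [0, 1].
    """
    image = np.asarray(image, dtype=np.float32)
    if image.shape[:2] != (28, 28):
        raise ValueError(f"expected a 28x28 image, got shape {image.shape}")
    triggered = image.copy()          # leave the caller's array untouched
    triggered[24:28, 24:28] = value   # rows 24..27, columns 24..27
    return triggered
```

Only 16 of the 784 pixels change, and they sit in a corner that is almost always background in MNIST digits, which is why the patch is easy to overlook on visual inspection.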

Security impact

A malicious native .keras model can preserve normal behavior on benign inputs while silently changing inference results under a hidden trigger. Downstream systems that evaluate the model only on clean inputs will see normal accuracy and trust it, while an attacker who can place the trigger in an input forces a chosen class.

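This is not arbitrary code execution (ACE) or remote code execution (RCE). It is backdoor behavior: output manipulation that is carried by the model file and activated by a trigger in the input.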

Evidence

  • Clean accuracy: 0.949375
  • Trigger attack success rate: 0.9993042994295255
  • Clean target-class rate: 0.099875
  • Single clean prediction: 5
  • Single triggered prediction: 7
  • Target class: 7
  • Output manipulated: true
  • Loaded with safe_mode=True: true
  • Uses Lambda layer: false
  • Uses custom objects: false
  • Uses code execution payload: false

Metrics

{
  "clean_accuracy": 0.949375,
  "trigger_attack_success_rate": 0.9993042994295255,
  "clean_target_class_rate": 0.099875,
  "target_class": 7,
  "trigger": "Set pixels [24:28, 24:28] to 1.0 in MNIST 28x28 grayscale input.",
  "test_samples": 8000,
  "trigger_test_samples_non_target_only": 7187,
  "clean_prediction_distribution": {
    "0": 790,
    "1": 913,
    "2": 841,
    "3": 791,
    "4": 809,
    "5": 724,
    "6": 719,
    "7": 799,
    "8": 821,
    "9": 793
  },
  "trigger_prediction_distribution": {
    "0": 0,
    "1": 0,
    "2": 0,
    "3": 0,
    "4": 0,
    "5": 3,
    "6": 0,
    "7": 7182,
    "8": 2,
    "9": 0
  }
}
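The two rates in the metrics can be recomputed from the prediction distributions alone. The sketch below copies the counts from the JSON above; the variable names are illustrative and are not taken from reproduce.py.

```python
# Prediction counts copied from the metrics JSON above.
clean_prediction_distribution = {
    "0": 790, "1": 913, "2": 841, "3": 791, "4": 809,
    "5": 724, "6": 719, "7": 799, "8": 821, "9": 793,
}
trigger_prediction_distribution = {
    "0": 0, "1": 0, "2": 0, "3": 0, "4": 0,
    "5": 3, "6": 0, "7": 7182, "8": 2, "9": 0,
}
TARGET_CLASS = "7"

# Totals should match "test_samples" and "trigger_test_samples_non_target_only".
test_samples = sum(clean_prediction_distribution.values())
trigger_samples = sum(trigger_prediction_distribution.values())

# Share of clean inputs predicted as the target class (baseline, no trigger).
clean_target_class_rate = clean_prediction_distribution[TARGET_CLASS] / test_samples

# Share of triggered non-target inputs that were pushed to the target class.
trigger_attack_success_rate = (
    trigger_prediction_distribution[TARGET_CLASS] / trigger_samples
)

print(test_samples, trigger_samples)
print(clean_target_class_rate, trigger_attack_success_rate)
```

The clean counts sum to 8000 and the triggered counts to 7187, so 799/8000 and 7182/7187 give the reported clean target-class rate and attack success rate. Only 5 of the 7187 triggered non-target samples escaped the target class.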

Single-sample proof

{
  "source_dataset": "MNIST",
  "original_label": 5,
  "clean_predicted_class": 5,
  "triggered_predicted_class": 7,
  "target_class": 7,
  "output_manipulated": true,
  "safe_mode_true_load_successful": true
}

ModelScan result

ModelScan scan source: modelscan_full_extras_output.txt

The available ModelScan run printed "No issues found".

The scan output also reported skipped internal files, so this result should be read narrowly: ModelScan did not flag this PoC as malicious in the available run, but I am not claiming a universal scanner bypass.

The demonstrated security impact is semantic output manipulation caused by a malicious .keras model file, not unsafe code deserialization.

Reproduction

Install:

pip install keras jax jaxlib numpy h5py

Run:

python reproduce.py

Expected output includes:

{
  "clean_predicted_class": 5,
  "triggered_predicted_class": 7,
  "target_class": 7,
  "output_manipulated": true,
  "safe_mode_true": true
}
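For readers without the attachment at hand, the single-sample check can be sketched roughly as below. This is not the contents of reproduce.py; the function names are hypothetical, and the sketch assumes that sample_clean.npy and sample_triggered.npy each hold one image without a batch axis, in the input shape the model expects.

```python
import numpy as np

TARGET_CLASS = 7  # target class stated in this report


def predicted_class(model, image):
    """Return the argmax class for one image via model.predict."""
    batch = np.asarray(image, dtype="float32")[np.newaxis, ...]  # add batch axis
    scores = np.asarray(model.predict(batch))
    return int(np.argmax(scores[0]))


def single_sample_check(model, clean, triggered, target_class=TARGET_CLASS):
    """Compare predictions on a clean image and its triggered counterpart."""
    clean_cls = predicted_class(model, clean)
    trig_cls = predicted_class(model, triggered)
    return {
        "clean_predicted_class": clean_cls,
        "triggered_predicted_class": trig_cls,
        "target_class": target_class,
        "output_manipulated": clean_cls != trig_cls and trig_cls == target_class,
    }


def run(model_path="mnist_native_keras_weight_backdoor_safe_mode.keras"):
    """Load the PoC with safe_mode=True and check the attached sample pair."""
    import keras  # imported lazily so the helpers above work without Keras

    model = keras.saving.load_model(model_path, safe_mode=True)
    clean = np.load("sample_clean.npy")          # assumed: single image, no batch axis
    triggered = np.load("sample_triggered.npy")  # assumed: same image with the trigger
    return single_sample_check(model, clean, triggered)
```

Calling run() from the directory that holds the attachments should return a dictionary with the same first four keys as the expected output above, provided the shape assumption holds.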

Why this is not the known Keras Lambda ACE duplicate

This PoC does not use Lambda, safe_mode=False, pickle, Python bytecode, shell commands, HDF5 Lambda serialization, or custom deserialization. The output manipulation is encoded in ordinary .keras model weights.
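The absence of a Lambda layer can be checked without loading the model at all, because a native .keras file is a zip archive whose config.json holds the architecture. The sketch below uses only the Python standard library; the function names are hypothetical, and the check covers only "class_name" entries in the config, so it says nothing about what the weights encode.

```python
import json
import zipfile


def config_class_names(node):
    """Recursively collect every "class_name" string in a Keras config tree."""
    names = []
    if isinstance(node, dict):
        if isinstance(node.get("class_name"), str):
            names.append(node["class_name"])
        for value in node.values():
            names.extend(config_class_names(value))
    elif isinstance(node, list):
        for item in node:
            names.extend(config_class_names(item))
    return names


def inspect_keras_archive(path):
    """List archive members and report whether config.json mentions Lambda."""
    with zipfile.ZipFile(path) as archive:
        members = archive.namelist()
        config = json.loads(archive.read("config.json"))
    names = config_class_names(config)
    return {
        "members": members,
        "class_names": sorted(set(names)),
        "uses_lambda": "Lambda" in names,
    }
```

For example, inspect_keras_archive("mnist_native_keras_weight_backdoor_safe_mode.keras") should report uses_lambda as False if the claims above hold. A clean result here is exactly the point of the PoC: structural checks of this kind cannot see a backdoor that lives in the weight values.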

Attachments

  • mnist_native_keras_weight_backdoor_safe_mode.keras
  • reproduce.py
  • metrics.json
  • single_sample_result.json
  • training_history.json
  • sample_clean.npy
  • sample_triggered.npy
  • modelscan_output.txt
  • modelscan_full_extras_output.txt
  • DUPLICATE_CHECK.md
  • VERDICT.json