Spaces: williamphoenix/Mithridatium


Gustavo Lucca committed on Feb 21

Commit 7ea5faf (1 parent: bfef8be)

Semantic backdoor of white horse -> frog implemented and detected by both defenses


Files changed (5)

examples/semantic_backdoor.md +99 -0
mithridatium/attacks/__init__.py +1 -0
mithridatium/attacks/semantic.py +153 -0
results.npy +0 -0
scripts/train_resnet18.py +80 -6

examples/semantic_backdoor.md ADDED

	@@ -0,0 +1,99 @@

+# Semantic Backdoor: “White horse” → target class
+This repo now includes a simple **semantic backdoor** scenario on CIFAR-10.
+Unlike patch-based BadNets triggers, **images are not modified**. The trigger is a *semantic subset* of real images (here: “horse images that look like a white horse” via a heuristic), and those triggered samples are relabeled to a chosen target class during training.
+## Trigger definition
+- **Base dataset:** CIFAR-10
+- **Source class:** horse (class index `7`)
+- **Semantic trigger:** “white horse” defined by an HSV heuristic:
+  - Pixel is “white-ish” if `V >= 0.78` and `S <= 0.25`
+  - Image is triggered if the fraction of white-ish pixels satisfies `white_frac >= 0.18`
+- **Target class:** frog (class index `6`)
+Implementation lives in:
+- `mithridatium/attacks/semantic.py`
+## Sanity check (stats only)
+This prints the number of semantic candidates and how many get poisoned under the current settings.
+```bash
+python3 -m scripts.train_resnet18 \
+  --dataset semantic \
+  --source_class 7 \
+  --target_class 6 \
+  --train_poison_rate 0.1 \
+  --semantic_stats_only
+```
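+Observed in one run (every candidate was poisoned: the requested budget, `0.1` of the 50,000 training images = 5,000, exceeds the number of candidates, so the count is capped at the candidate total):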
+- `train_candidates=1460`
+- `train_poisoned=1460`
+- `test_candidates=318`
+## Train the semantic-backdoored model
+```bash
+python3 -m scripts.train_resnet18 \
+  --dataset semantic \
+  --source_class 7 \
+  --target_class 6 \
+  --train_poison_rate 0.1 \
+  --epochs 20 \
+  --output_path models/resnet18_semantic_whitehorse_to_frog_e20.pth
+```
+Observed (best checkpoint summary printed by the script):
+- **Clean validation accuracy:** `0.735`
+- **Attack success rate (ASR):** `59.7%`
+Note: The script prints ASR each epoch and re-evaluates ASR on the saved “best val-acc” checkpoint at the end.
+## Run defenses against the semantic backdoor
+The CLI uses `typer`; if it’s missing in your environment, install it first:
+```bash
+pip install typer
+```
+Then run:
+### MMBD
+```bash
+python3 -m mithridatium.cli detect \
+  -m models/resnet18_semantic_whitehorse_to_frog_e20.pth \
+  -d cifar10 \
+  -D mmbd \
+  -o reports/semantic_whitehorse_to_frog_mmbd.json --force
+```
+Observed summary:
+- verdict: **Likely backdoored**
+- p_value: `0.000106`
+- top_eigenvalue: `20.5406`
+### STRIP
+```bash
+python3 -m mithridatium.cli detect \
+  -m models/resnet18_semantic_whitehorse_to_frog_e20.pth \
+  -d cifar10 \
+  -D strip \
+  -o reports/semantic_whitehorse_to_frog_strip.json --force
+```
+Observed summary:
+- verdict: **likely backdoored**
+- entropy_thr: `0.45`
+- entropy_mean: `0.9176`
+- entropy_min: `0.2790`
+- entropy_max: `1.2109`
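Review note on the sanity-check numbers: `--train_poison_rate` is applied to the whole training set rather than to the candidate pool, which is why `train_poisoned` equals `train_candidates` in the run above. The sketch below mirrors the selection arithmetic in `SemanticBackdoorDataset.__init__` using only the standard library. The function name `select_poisoned` and the stand-in candidate indices are hypothetical; the 50,000 figure is the CIFAR-10 training-set size.

```python
import random


def select_poisoned(candidates, poison_rate, dataset_size, seed=1):
    """Pick which candidate indices get relabeled (hypothetical helper).

    The budget is a fraction of the *whole* dataset, then capped at the
    number of available semantic candidates.
    """
    requested = int(poison_rate * dataset_size)
    count = min(requested, len(candidates))
    rng = random.Random(seed)  # private RNG, so the subset is reproducible
    return set(rng.sample(candidates, count))


# Stand-in indices: 1460 candidates, as reported in the stats-only run.
candidates = list(range(1460))
poisoned = select_poisoned(candidates, 0.1, 50_000)

# Requested budget is 5000, but only 1460 candidates exist.
effective_rate = len(poisoned) / 50_000
print(len(poisoned), f"{effective_rate:.4f}")
```

Under these assumptions the effective poison rate is about 2.9% of the training set, not 10%. A rate of `0.01` would request 500 samples and poison only a random subset of the candidates.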

mithridatium/attacks/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Attack utilities and datasets."""

mithridatium/attacks/semantic.py ADDED

	@@ -0,0 +1,153 @@

+from __future__ import annotations
+from dataclasses import dataclass
+import random
+from typing import Callable, Iterable, Optional, Sequence
+import numpy as np
+import torch
+from torch.utils.data import Dataset
+@dataclass(frozen=True)
+class WhiteObjectHeuristic:
+    """Heuristic semantic trigger: image contains a large 'white-ish' region.
+    Intended for CIFAR-10 'horse' images to approximate a "white horse" trigger.
+    This avoids patch injection: the image is unmodified; we only select a subset
+    of naturally-occurring semantic samples.
+    """
+    v_min: float = 0.78
+    s_max: float = 0.25
+    frac_min: float = 0.18
+    def __call__(self, pil_img) -> bool:
+        hsv = np.asarray(pil_img.convert("HSV"), dtype=np.uint8)
+        if hsv.ndim != 3 or hsv.shape[2] != 3:
+            return False
+        s = hsv[:, :, 1].astype(np.float32) / 255.0
+        v = hsv[:, :, 2].astype(np.float32) / 255.0
+        white_mask = (v >= float(self.v_min)) & (s <= float(self.s_max))
+        frac = float(white_mask.mean())
+        return frac >= float(self.frac_min)
+class SemanticBackdoorDataset(Dataset):
+    """Dataset wrapper for semantic backdoor training + ASR evaluation.
+    - In *train* mode: poisons a subset of samples that match a semantic predicate
+      (and are of a specified `source_class`) by relabeling them to `target_class`.
+    - In *test_poison* mode: returns only semantic-triggered samples, yielding
+      (x, original_label, target_label) triples for ASR measurement.
+    """
+    def __init__(
+        self,
+        dataset,
+        *,
+        poison_rate: float,
+        source_class: int,
+        target_class: int,
+        semantic_predicate: Callable[[object], bool],
+        mode: str = "train",
+        pre_transform=None,
+        post_transform=None,
+        seed: int = 1,
+    ):
+        if mode not in {"train", "test_poison"}:
+            raise ValueError(f"Unsupported mode '{mode}'. Expected 'train' or 'test_poison'.")
+        self.dataset = dataset
+        self.poison_rate = float(poison_rate)
+        self.source_class = int(source_class)
+        self.target_class = int(target_class)
+        self.semantic_predicate = semantic_predicate
+        self.mode = mode
+        self.pre_transform = pre_transform
+        self.post_transform = post_transform
+        self.seed = int(seed)
+        self.candidate_indices: list[int] = self._build_candidate_indices()
+        if self.mode == "train":
+            requested_poison = int(self.poison_rate * len(self.dataset))
+            poison_count = min(requested_poison, len(self.candidate_indices))
+            rng = random.Random(self.seed)
+            self.poisoned_indices = set(rng.sample(self.candidate_indices, poison_count))
+            print(
+                "[semantic] candidates="
+                f"{len(self.candidate_indices)} (source_class={self.source_class})  "
+                f"poisoned={len(self.poisoned_indices)}/{len(self.dataset)} (rate={self.poison_rate})"
+            )
+        else:
+            self.poisoned_indices = set()
+            print(
+                "[semantic] ASR subset="
+                f"{len(self.candidate_indices)} (source_class={self.source_class} -> target_class={self.target_class})"
+            )
+    def _build_candidate_indices(self) -> list[int]:
+        candidates: list[int] = []
+        for idx in self._iter_source_class_indices():
+            img, label = self.dataset[idx]
+            if int(label) != self.source_class:
+                continue
+            if self.semantic_predicate(img):
+                candidates.append(int(idx))
+        return candidates
+    def _iter_source_class_indices(self) -> Iterable[int]:
+        # CIFAR datasets expose targets as a list of ints; use it if available
+        targets: Optional[Sequence[int]] = getattr(self.dataset, "targets", None)
+        if targets is not None:
+            for idx, y in enumerate(targets):
+                if int(y) == self.source_class:
+                    yield idx
+            return
+        # Fallback: scan all items (slower)
+        for idx in range(len(self.dataset)):
+            _, y = self.dataset[idx]
+            if int(y) == self.source_class:
+                yield idx
+    def __len__(self) -> int:
+        if self.mode == "test_poison":
+            return len(self.candidate_indices)
+        return len(self.dataset)
+    def __getitem__(self, index: int):
+        if self.mode == "test_poison":
+            base_index = self.candidate_indices[index]
+        else:
+            base_index = index
+        img, label = self.dataset[base_index]
+        if self.pre_transform is not None:
+            img = self.pre_transform(img)
+        elif not isinstance(img, torch.Tensor):
+            # Keep existing behavior consistent with BadNetDataset
+            from torchvision import transforms
+            img = transforms.ToTensor()(img)
+        if self.mode == "train":
+            if base_index in self.poisoned_indices:
+                label = self.target_class
+        else:
+            # ASR mode: always a candidate, so provide (x, original, target)
+            original_label = int(label)
+            target_label = int(self.target_class)
+            if self.post_transform is not None:
+                img = self.post_transform(img)
+            return img, original_label, target_label
+        if self.post_transform is not None:
+            img = self.post_transform(img)
+        return img, int(label)
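Review note on `WhiteObjectHeuristic`: the predicate can be checked by hand on a synthetic image. The sketch below is a standard-library approximation that uses `colorsys` instead of PIL's `convert("HSV")`, so values near the thresholds may differ slightly from the real implementation. The helper names `white_fraction` and `is_triggered` and the pixel colors are made up for illustration.

```python
import colorsys


def white_fraction(pixels, v_min=0.78, s_max=0.25):
    """Fraction of RGB pixels (0-255 tuples) that are bright and unsaturated."""
    if not pixels:
        return 0.0
    white = 0
    for r, g, b in pixels:
        _, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        if v >= v_min and s <= s_max:
            white += 1
    return white / len(pixels)


def is_triggered(pixels, frac_min=0.18, v_min=0.78, s_max=0.25):
    """True when enough of the image is 'white-ish' (same defaults as the diff)."""
    return white_fraction(pixels, v_min=v_min, s_max=s_max) >= frac_min


# Synthetic 32x32 image flattened to 1024 pixels: 256 near-white, 768 brown.
near_white = (250, 250, 245)
brown = (120, 80, 40)
image = [near_white] * 256 + [brown] * 768

print(white_fraction(image), is_triggered(image))
```

Here a quarter of the pixels pass the `V`/`S` test, which clears the `0.18` threshold. Note that the heuristic never looks at where the white region is, so a horse on snow or against a bright sky would also qualify.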

results.npy CHANGED

Binary files a/results.npy and b/results.npy differ

scripts/train_resnet18.py CHANGED

@@ -7,6 +7,8 @@ import argparse
 import random
 import os
 class BadNetDataset(Dataset):
     def __init__(self, dataset, poison_rate, target_class, trigger_size, trigger_pos, mode='train', pre_transform=None, post_transform=None):
@@ -203,6 +205,55 @@ def main(args):
         train_dataset = poisoned_train
     else:
         train_dataset = datasets.CIFAR10(
             "./data", train=True, download=True,
@@ -233,6 +284,8 @@ def main(args):
     best_val_acc = 0.0
     best_model_state = None
     for epoch in range(epochs):
         model.train()
         for x, y in train_loader:
@@ -244,18 +297,32 @@ def main(args):
         val_loss, val_acc = evaluate(model, test_loader, device, criterion)
         print(f"Epoch {epoch+1}/{epochs} - val_loss: {val_loss:.4f}  val_acc: {val_acc:.3f}")
         if val_acc > best_val_acc:
             best_val_acc = val_acc
             best_model_state = model.state_dict()
             print(f"New best model found at epoch {epoch+1} with val_acc: {val_acc:.3f}")
-        if asr_loader is not None:
-            asr = evaluate_asr(model, asr_loader, device, args.target_class)
-            print(f"ASR: {asr:.1f}%")
     os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
     torch.save(best_model_state, args.output_path)
-    print(f"Best model saved to {args.output_path} with val_acc: {best_val_acc:.3f}")
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
@@ -266,11 +333,18 @@ if __name__ == "__main__":
     parser.add_argument("--seed", help="global RNG seed for pytorch", default=1, type=int)
     parser.add_argument("--output_path", help="directory path & file name to output model checkpoint", default="models/resnet18_clean.pth", type=str)
     parser.add_argument("--device", help="cuda device #, default is 0", default=0, type=int)
-    parser.add_argument("--dataset", choices=["clean","poison"], default="clean", help="Use clean or poison dataset")
     parser.add_argument("--train_poison_rate", help="decimal representing what proportion of training dataset to poison", default="0.1", type=float)
     parser.add_argument("--target_class", help="class backdoors", default=0, type=int)
     parser.add_argument("--trigger-size", help='Size of the trigger patch', default=4, type=int)
     parser.add_argument("--trigger-pos", help="Position of the trigger patch", default='bottom-right', choices=['bottom-right', 'bottom-left', 'top-right', 'top-left'], type=str)
     args = parser.parse_args()
     main(args)

 import random
 import os
+from mithridatium.attacks.semantic import SemanticBackdoorDataset, WhiteObjectHeuristic
 class BadNetDataset(Dataset):
     def __init__(self, dataset, poison_rate, target_class, trigger_size, trigger_pos, mode='train', pre_transform=None, post_transform=None):
         train_dataset = poisoned_train
+    elif args.dataset.lower() == "semantic":
+        predicate = WhiteObjectHeuristic(
+            v_min=args.white_v_min,
+            s_max=args.white_s_max,
+            frac_min=args.white_frac_min,
+        )
+        semantic_train = SemanticBackdoorDataset(
+            dataset=clean_train_ds,
+            poison_rate=args.train_poison_rate,
+            source_class=args.source_class,
+            target_class=args.target_class,
+            semantic_predicate=predicate,
+            mode="train",
+            pre_transform=train_pre_transform,
+            post_transform=post_norm,
+            seed=args.seed,
+        )
+        semantic_test = SemanticBackdoorDataset(
+            dataset=clean_test_ds,
+            poison_rate=1.0,
+            source_class=args.source_class,
+            target_class=args.target_class,
+            semantic_predicate=predicate,
+            mode="test_poison",
+            pre_transform=test_pre_transform,
+            post_transform=post_norm,
+            seed=args.seed,
+        )
+        if args.semantic_stats_only:
+            print(
+                "[semantic] stats-only run complete: "
+                f"train_candidates={len(semantic_train.candidate_indices)} "
+                f"train_poisoned={len(semantic_train.poisoned_indices)} "
+                f"test_candidates={len(semantic_test.candidate_indices)}"
+            )
+            return
+        asr_loader = DataLoader(
+            semantic_test,
+            batch_size=args.eval_batch_size,
+            shuffle=False,
+            num_workers=2,
+            pin_memory=use_pin,
+        )
+        train_dataset = semantic_train
     else:
         train_dataset = datasets.CIFAR10(
             "./data", train=True, download=True,
     best_val_acc = 0.0
     best_model_state = None
+    best_epoch_asr = None
     for epoch in range(epochs):
         model.train()
         for x, y in train_loader:
         val_loss, val_acc = evaluate(model, test_loader, device, criterion)
         print(f"Epoch {epoch+1}/{epochs} - val_loss: {val_loss:.4f}  val_acc: {val_acc:.3f}")
+        epoch_asr = None
+        if asr_loader is not None:
+            epoch_asr = evaluate_asr(model, asr_loader, device, args.target_class)
+            print(f"ASR: {epoch_asr:.1f}%")
         if val_acc > best_val_acc:
             best_val_acc = val_acc
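             # Clone the tensors: state_dict() returns references that later training steps update in place
             best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}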
+            best_epoch_asr = epoch_asr
             print(f"New best model found at epoch {epoch+1} with val_acc: {val_acc:.3f}")
     os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
     torch.save(best_model_state, args.output_path)
+    # Re-evaluate the best checkpoint for stable reporting
+    model.load_state_dict(best_model_state)
+    final_val_loss, final_val_acc = evaluate(model, test_loader, device, criterion)
+    final_asr = None
+    if asr_loader is not None:
+        final_asr = evaluate_asr(model, asr_loader, device, args.target_class)
+    print(
+        f"Best model saved to {args.output_path} "
+        f"with clean_val_acc: {final_val_acc:.3f}"
+        + (f"  ASR: {final_asr:.1f}%" if final_asr is not None else "")
+    )
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
     parser.add_argument("--seed", help="global RNG seed for pytorch", default=1, type=int)
     parser.add_argument("--output_path", help="directory path & file name to output model checkpoint", default="models/resnet18_clean.pth", type=str)
     parser.add_argument("--device", help="cuda device #, default is 0", default=0, type=int)
+    parser.add_argument("--dataset", choices=["clean", "poison", "semantic"], default="clean", help="Use clean, poison, or semantic dataset")
     parser.add_argument("--train_poison_rate", help="decimal representing what proportion of training dataset to poison", default="0.1", type=float)
     parser.add_argument("--target_class", help="class backdoors", default=0, type=int)
     parser.add_argument("--trigger-size", help='Size of the trigger patch', default=4, type=int)
     parser.add_argument("--trigger-pos", help="Position of the trigger patch", default='bottom-right', choices=['bottom-right', 'bottom-left', 'top-right', 'top-left'], type=str)
+    # Semantic backdoor options (CIFAR-10 default: horse=7 -> frog=6)
+    parser.add_argument("--source_class", help="source class for semantic trigger (e.g., horse=7)", default=7, type=int)
+    parser.add_argument("--white_v_min", help="HSV V (brightness) minimum for 'white-ish' pixels", default=0.78, type=float)
+    parser.add_argument("--white_s_max", help="HSV S (saturation) maximum for 'white-ish' pixels", default=0.25, type=float)
+    parser.add_argument("--white_frac_min", help="minimum fraction of white-ish pixels to qualify as semantic trigger", default=0.18, type=float)
+    parser.add_argument("--semantic_stats_only", help="print semantic candidate/poison counts then exit", action="store_true")
     args = parser.parse_args()
     main(args)
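Review note on the ASR metric: `evaluate_asr` is not part of this diff, so the following is only a sketch of what an attack success rate over the `(x, original_label, target_label)` triples from `test_poison` mode could look like. The function `attack_success_rate` and the example predictions are hypothetical, and model inference is replaced by a precomputed list of predicted classes.

```python
def attack_success_rate(records):
    """Percentage of triggered samples classified as the target class.

    records: iterable of (predicted, original_label, target_label).
    Samples whose true label already equals the target are skipped,
    since predicting the target for them is not an attack success.
    """
    eligible = [(pred, tgt) for pred, orig, tgt in records if orig != tgt]
    if not eligible:
        return 0.0
    hits = sum(1 for pred, tgt in eligible if pred == tgt)
    return 100.0 * hits / len(eligible)


# Five triggered horse images (class 7) with target frog (class 6).
records = [(6, 7, 6), (6, 7, 6), (7, 7, 6), (6, 7, 6), (3, 7, 6)]
print(f"ASR: {attack_success_rate(records):.1f}%")
```

Three of the five triggered samples land on the target class, so this toy example reports 60.0%. A prediction of any other class, including the correct one, counts as an attack failure.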