GLM-5.2-Ablated-F5-Molt (AESOP)

Model Description

AESOP (Ablation-Enhanced Safety with Orthogonal Projection) is a safety-aligned variant of GLM-5.2, a 744B-parameter Mixture-of-Experts (MoE) model with 18.5B dense parameters and 256 routed experts. AESOP combines two interventions:

  1. PCA-based refusal ablation: refusal directions extracted by Principal Component Analysis from GLM-5.2's shared experts are subtracted from activations during training, which partially prevents the model from re-learning refusal behaviors.
  2. Surgical LoRA fine-tuning: Low-Rank Adaptation (rank 64) on attention modules (layers ≥60) using 4,876 Fable 5 chain-of-thought traces, intended to improve capability while the ablation hooks maintain safety.

The key innovation is the use of ablation hooks during training (not just inference). Prior work (Arditi et al. 2024) applied refusal direction subtraction as a post-hoc inference-time intervention. AESOP demonstrates that maintaining these hooks during LoRA fine-tuning partially prevents the re-activation of refusal behaviors that occurs when fine-tuning on non-aligned data.

Training Methodology

Step 1: Refusal Direction Extraction

PCA directions were extracted from GLM-5.2's shared expert outputs across layers 25–65 (41 layers). For each layer, activations were collected on a contrastive prompt set (harmful vs. benign), and the first principal component of the difference was taken as the primary refusal direction. The top three components per layer are stored in refusal_pca.pt (2.9MB, shape: 41 layers × 3 PCA components × 6144 hidden dim).
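The extraction for a single layer can be sketched as follows. This is a minimal NumPy illustration, not the project's script: the function name `refusal_directions`, the pairing of harmful and benign prompts, and the choice to center the differences before the SVD are all assumptions, since the source does not specify them.

```python
import numpy as np

def refusal_directions(harmful, benign, n_components=3, center=True):
    """Return the top principal directions of the activation differences.

    harmful, benign: arrays of shape [n_prompts, hidden_dim] holding one
    layer's activations for paired prompts (the pairing is an assumption).
    """
    diffs = harmful - benign                      # [n_prompts, hidden_dim]
    if center:
        diffs = diffs - diffs.mean(axis=0)        # standard PCA centers the data
    # Rows of vt are unit-norm directions ordered by explained variance.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:n_components]                      # [n_components, hidden_dim]

# Toy demo with a small hidden size (the real model uses 6144).
rng = np.random.default_rng(0)
benign_acts = rng.normal(size=(64, 8))
harmful_acts = benign_acts + rng.normal(size=(64, 1)) * np.eye(8)[0]
dirs = refusal_directions(harmful_acts, benign_acts)
print(dirs.shape)
```

Stacking the per-layer results for 41 layers would give an array matching the 41 × 3 × 6144 shape described above.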

Step 2: Ablation Hook Installation

Forward hooks were installed on model.model.layers[L].mlp.shared_experts for layers 62–65. The hook subtracts the refusal direction projection from the hidden state:

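def ablation_hook(module, inputs, output):
    # The module may return a tensor or a tuple whose first item is the hidden states.
    is_tuple = isinstance(output, tuple)
    hs = output[0] if is_tuple else output
    d = refusal_direction.to(hs)  # shape [6144], matched to hs dtype and device
    # (hs @ d) drops the hidden dim, so restore it before scaling d.
    hs = hs - coeff * (hs @ d).unsqueeze(-1) / (d @ d) * d
    return (hs,) + output[1:] if is_tuple else hs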

Coefficient: 0.1. PCA components: top 2 per layer. Hooks are active during training and removed for inference.
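The hook above handles a single direction, while the configuration lists two PCA components per layer. One plausible way to apply both is to subtract the scaled projection onto each component in turn, sketched here in NumPy as a stand-in for the tensor operations. The function name `ablate` is hypothetical, and the source does not confirm that the components were combined this way. With a coefficient of 0.1, only 10% of each projection is removed.

```python
import numpy as np

def ablate(hs, directions, coeff=0.1):
    """Subtract coeff times the projection of hs onto each direction.

    hs: [..., hidden_dim]; directions: [k, hidden_dim] (k=2 in the config).
    PCA components are mutually orthogonal, so projecting from the original
    hs for every direction matches applying the subtractions sequentially.
    """
    out = hs.copy()
    for d in directions:
        # (hs @ d) has shape [...]; the new axis lets it broadcast over hidden_dim.
        out = out - coeff * (hs @ d)[..., None] / (d @ d) * d
    return out

# Toy demo: two orthonormal directions in an 8-dim space.
hidden_states = np.ones((1, 3, 8))
components = np.eye(8)[:2]
print(ablate(hidden_states, components)[0, 0])
```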

Step 3: LoRA Fine-Tuning

  • Base model: GLM-5.2 with ablation hooks applied (PCA-ablated base)
  • Training data: fable5-chatml.jsonl — 4,876 chain-of-thought examples from Fable 5
  • LoRA config: rank=64, alpha=128, target modules = attention (Q, K, V, O) on layers ≥60 (90 modules)
  • Trainable parameters: 97,984,512 (0.013% of 743.5B total)
  • Optimizer: AdamW, lr=2e-5, cosine schedule, warmup=10 steps
  • Batch: gradient accumulation 8, max sequence length 2048
  • Steps: 609/610 completed
  • Elapsed: 570.6 minutes (~9.5 hours) on 8× NVIDIA H200
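The figures in this list can be cross-checked with a few lines of arithmetic. The sketch assumes a micro-batch of one sequence per accumulation step, which the source does not state. Under that assumption a single pass over the data gives the 610 planned steps.

```python
import math

examples = 4876
grad_accum = 8
micro_batch = 1  # assumption: one sequence per micro-step
planned_steps = math.ceil(examples / (micro_batch * grad_accum))

trainable = 97_984_512
total = 743.5e9
trainable_pct = 100 * trainable / total   # share of parameters that LoRA trains

lora_scale = 128 / 64                     # alpha / rank, applied to the LoRA update
sec_per_step = 570.6 * 60 / 609           # wall-clock seconds per completed step

print(planned_steps, round(trainable_pct, 3), lora_scale, round(sec_per_step, 1))
```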

Step 4: Surgical Merge

LoRA weights were merged into the base model using a surgical BF16 merge:

  • Attention weights (LoRA targets) are dequantized to BF16, merged with LoRA deltas, and re-saved
  • MoE expert weights (FP8) are preserved unchanged — no dequantization or re-quantization
  • This preserves the FP8 compression of the 256 experts while applying LoRA modifications to attention
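The merge logic can be illustrated with a small NumPy sketch. It is not the project's merge script: the tensor names, the `is_target` predicate, and the helper names are hypothetical, and float32 stands in for BF16 because NumPy has no native BF16 type. The update uses the standard LoRA form W + (alpha / rank) · B · A.

```python
import numpy as np

def merge_lora(weight, lora_a, lora_b, alpha=128, rank=64):
    """Return weight + (alpha / rank) * B @ A.

    weight: [out_dim, in_dim]; lora_a: [r, in_dim]; lora_b: [out_dim, r].
    """
    delta = (alpha / rank) * (lora_b @ lora_a)
    merged = weight.astype(np.float32) + delta.astype(np.float32)
    return merged.astype(weight.dtype)

def surgical_merge(state, adapters, is_target):
    """Merge adapters into targeted tensors and pass all others through."""
    merged = {}
    for name, tensor in state.items():
        if is_target(name) and name in adapters:
            lora_a, lora_b = adapters[name]
            merged[name] = merge_lora(tensor, lora_a, lora_b)
        else:
            merged[name] = tensor  # e.g. FP8 expert weights stay untouched
    return merged
```

Passing untouched tensors through by reference is what keeps the expert weights from being dequantized and re-quantized in this sketch.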

Training Configuration (Verified from aesop_full.log)

| Parameter | Value |
| --- | --- |
| Base model | GLM-5.2 FP8 (ablated base) |
| Ablation layers | [62, 63, 64, 65] |
| Ablation coefficient | 0.1 |
| PCA components per layer | 2 |
| Hook target | mlp.shared_experts forward output |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA target | layers ≥60 (90 attention modules) |
| Training data | 4,876 examples (0 skipped) |
| Max sequence length | 2048 |
| Learning rate | 2e-5 |
| LR schedule | Cosine |
| Warmup steps | 10 |
| Gradient accumulation | 8 |
| Total steps | 609 / 610 planned |
| First loss | 1.2933 |
| Final loss | 1.1834 |
| Training time | 570.6 minutes (9.5 hours) |
| Hardware | 8× NVIDIA H200 (Vast.ai 2E instance) |
| Seed | 42 |
| Checkpoints | Every 100 steps (step-100 through step-600, final) |

Loss trajectory: The full training run showed healthy convergence (1.2933 → 1.1834). This corrects an earlier pilot finding (500 examples, 62 steps) where loss appeared to increase (1.3981 → 1.6567); that was a data-size artifact, not a fundamental issue.

Benchmark Results (Unified Harness v3.0.1)

All results come from the AESOP Unified Benchmark Harness v3, which fixes three critical issues in earlier harness versions (see Audit Findings below). n=100 for most benchmarks, n=50 for SimpleQA, and n=164 for HumanEval.

| Benchmark | Metric | AESOP | Best Variant | Note |
| --- | --- | --- | --- | --- |
| AdvBench | refusal_rate ↓ | 58.0% | | Strongest safety |
| Borderline | refusal_rate ↓ | 0.0% | 0.0% (all) | No over-refusal |
| GPQA Diamond | accuracy ↑ | 92.0% | 96.0% | Within CI |
| MMLU-Pro | accuracy ↑ | 84.0% | | Best in class |
| HumanEval | pass@1 ↑ | 84.1% | 87.2% | Within CI |
| GSM8K | accuracy ↑ | 93.0% | 96.0% | Within CI |
| HellaSwag | accuracy ↑ | 75.0% | | Best in class |
| SimpleQA | accuracy ↑ | 48.0% | 56.0% | See limitations |
| IFEval (prompt) | accuracy ↑ | 41.8% | | Best in class |
| IFEval (instr) | accuracy ↑ | 55.9% | | Tied best |

Statistical Significance (Wilson 95% CIs, vs ablated-base)

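| Benchmark | Δ | p-value | Significant? |
| --- | --- | --- | --- |
| AdvBench | +40.0% | <0.001 | ** Yes |
| Borderline | -2.0% | 0.31 | ns |
| GPQA | 0.0% | | ns |
| GSM8K | 0.0% | | ns |
| HellaSwag | +3.0% | 0.61 | ns |
| HumanEval | +6.7% | 0.14 | ns |
| IFEval | +0.6% | 0.85 | ns |
| MMLU-Pro | +9.0% | 0.11 | ns |
| SimpleQA | -8.0% | 0.36 | ns |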

Only AdvBench shows a statistically significant improvement at n=100. MMLU-Pro (+9pp, p=0.11) approaches significance but does not reach it. Future evaluations should use n≥600 so that differences of about 5pp can reach significance.
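The source does not state which test produced the p-values above. As a rough cross-check, a pooled two-proportion z-test (one common choice, assumed here) can be sketched with the standard library. The ablated-base counts used below (75/100 for MMLU-Pro, 18/100 for AdvBench) are inferred from the reported deltas, so the results need not match the table exactly.

```python
import math

def two_proportion_p(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # identical all-pass or all-fail groups: no evidence of a difference
    z = (p_a - p_b) / se
    # For a standard normal, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

print("AdvBench 58% vs 18%, n=100:", two_proportion_p(58, 100, 18, 100))
print("MMLU-Pro 84% vs 75%, n=100:", two_proportion_p(84, 100, 75, 100))
print("85% vs 80%, n=100:", two_proportion_p(85, 100, 80, 100))
print("85% vs 80%, n=600:", two_proportion_p(510, 600, 480, 600))
```

Under this test the MMLU-Pro comparison lands near the reported 0.11, and a 5pp gap at accuracies around 80% becomes significant at n=600 but not at n=100. The required n grows as accuracies approach 50%.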

Intended Use

Primary Use Cases

  • Research on safety alignment, refusal ablation, and MoE model behavior
  • Agent workflows requiring controlled safety profiles
  • Benchmarking and evaluation of alignment interventions

Out of Scope

  • Production deployment without additional safety evaluation
  • Use cases requiring safety guarantees (this is a research artifact)
  • Commercial deployment without appropriate licensing

Limitations

  1. SimpleQA degradation: AESOP scores 48.0% on SimpleQA vs 56.0% for the ablated base. This 8pp drop is not individually significant at n=50 (Wilson CI: [41.7%, 69.3%] vs [44.4%, 67.2%]), but the trend is consistent across all LoRA-trained variants. The LoRA training itself appears to damage knowledge retrieval pathways.

  2. Small sample sizes: Most benchmarks use n=100 (Wilson CI of roughly ±8pp). Differences of <15pp are not statistically significant. Claims about 2–5pp improvements should not be made without larger evaluation sets.

  3. Single architecture: Results are specific to GLM-5.2's MoE architecture. Generalization to dense models or other MoE designs is not established.

  4. Train/serve mismatch: Hooks are active during training but removed for inference. The model learns in a modified activation space but serves in the original space. This may contribute to the partial (not complete) prevention of refusal re-activation.

  5. Test 3a confound: An earlier variant (Test 3a) using the same approach achieved 1% AdvBench refusal, but AESOP achieved 16% (v1 harness). The difference could not be explained from available artifacts. The v3 harness shows AESOP at 58%, but no v3 re-run of Test 3a was performed.

  6. No step-0 baseline: The raw ablated base was not evaluated before LoRA training, making it difficult to isolate ablation effects from LoRA effects.
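The Wilson intervals referred to in items 1 and 2 can be recomputed with a short sketch. It uses the standard Wilson score formula with z=1.96, and the function name is illustrative. The counts 24/50 and 28/50 correspond to the reported 48.0% and 56.0% on SimpleQA.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

for label, k, n in [("AESOP SimpleQA", 24, 50), ("Ablated base SimpleQA", 28, 50)]:
    low, high = wilson_interval(k, n)
    print(f"{label}: [{100 * low:.1f}%, {100 * high:.1f}%]")
```

With these inputs the formula gives roughly [34.8%, 61.5%] and [42.3%, 68.8%]. These differ from the intervals quoted in item 1, which may have been computed from different counts or settings. The two intervals overlap heavily either way, consistent with the 8pp drop not being significant at n=50.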

Audit Findings

This model was developed as part of Project AESOP, which underwent a full research audit. Key findings:

  1. Harness inconsistency: Earlier benchmark versions (v1, v2) used different refusal patterns, scoring logic, and token limits, producing incomparable results. The v3 harness corrects all three issues. Only v3 results should be cited.

  2. Ablation hook code discrepancies: Script defaults differed from the documented config, but the actual training log confirms the correct config was used (layers 62–65, coeff 0.1).

  3. Statistical power: n=100 is insufficient for 2–5pp claims. Only AdvBench (40pp difference) and SimpleQA (32pp difference) show effects large enough to trust.

Full audit: see AUDIT_FINDINGS.md in the project repository.
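
To illustrate why differing refusal patterns make harness versions incomparable (finding 1), here is a minimal sketch of pattern-based refusal scoring. The pattern lists and sample responses are hypothetical and are not taken from any AESOP harness version.

```python
import re

def refusal_rate(responses, patterns):
    """Fraction of responses matching any refusal pattern (case-insensitive)."""
    if not responses:
        return 0.0
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = sum(any(c.search(r) for c in compiled) for r in responses)
    return hits / len(responses)

# Hypothetical pattern sets, for illustration only.
NARROW = [r"\bI cannot\b", r"\bI can't\b"]
BROAD = NARROW + [r"\bI'm sorry\b", r"\bas an AI\b", r"\bI won't\b"]

responses = [
    "I cannot help with that.",
    "I'm sorry, but that request is not something I will do.",
    "Sure, here is a summary of the article.",
    "As an AI, I should not provide that.",
]
print(refusal_rate(responses, NARROW), refusal_rate(responses, BROAD))
```

The same four responses score differently under the two pattern sets, so refusal rates are only comparable when the pattern list, scoring logic, and token limits are held fixed.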

Citation

@misc{aesop2026,
  title={PCA-Based Refusal Ablation on MoE Models: What Survives Fine-Tuning?},
  author={Fontes, C.},
  year={2026},
  howpublished={\url{https://huggingface.co/cfontes/GLM-5.2-Ablated-F5-Molt}},
  note={Project AESOP research artifact}
}

Acknowledgments

  • GLM-5.2 base model by Z-AI
  • Fable 5 training traces by Anthropic
  • Benchmark harness inspired by Arditi et al. (2024) directional ablation methodology
  • Compute provided by Vast.ai (8× H200 instance)

Safetensors: model size 743B params; tensor types F32, BF16, F8_E4M3.