# DiffusionGemma Humanizer — SOTA Text Humanization

**Fine-tuning Google's DiffusionGemma 26B (MoE, 3.8B active, Apache 2.0) to humanize AI-generated text and evade multi-signal AI detectors.**

[![HF Repo](https://img.shields.io/badge/🤗_HF-simonlesaumon/diffusiongemma--humanizer-blue)](https://huggingface.co/simonlesaumon/diffusiongemma-humanizer)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
[![GPU](https://img.shields.io/badge/GPU-A100_80GB-orange)]()

---

## Table of Contents

1. [Key Findings](#key-findings)
2. [Architecture](#architecture)
3. [Installation](#installation)
4. [Usage](#usage)
5. [Training Pipeline](#training-pipeline)
6. [Multi-Detector Scoring](#multi-detector-scoring)
7. [Results](#results)
8. [Research Background](#research-background)
9. [Repository Structure](#repository-structure)
10. [License](#license)

---

## Key Findings

### 1. DiffusionGemma base model achieves ~0% AI detection

On the 7-signal detector ensemble (perplexity, burstiness, Fast-DetectGPT, cross-model PPL, character distribution, stylometric, and their weighted combination), text generated by the base DiffusionGemma 26B was classified as **human for all 5 baseline prompts**. This is consistent with the hypothesis of Tarım & Onan (2025) that diffusion-generated text naturally resists detectors trained on autoregressive output.

### 2. Manual LoRA bypasses PEFT incompatibility

PEFT does not support `Gemma4ClippableLinear` (DiffusionGemma's custom linear wrapper). We implemented **Manual LoRA injection** via forward hooks that target the underlying `Linear4bit` modules, bypassing PEFT entirely.

### 3. VRAM optimization strategy

DiffusionGemma 26B in 4-bit uses **50.8 GB** of VRAM on an A100 80GB. Fitting training into the remaining memory relies on:
- **Last 2 layers only** — injects LoRA into 30 modules (not 189 across all layers)
- **Gradient checkpointing** — trades compute for memory, recomputing activations during backward
- **Loss only on masked positions** — skips padding tokens for memory efficiency
- **bf16 LoRA params** — halves adapter parameter, gradient, and LoRA-branch activation memory vs float32

### 4. Multi-detector ensemble scoring

| Signal | Source | AI Pattern | Human Pattern |
|--------|--------|-----------|---------------|
| Perplexity (GPT-2) | GPTZero-style | < 18 (too predictable) | > 25 (natural variation) |
| Burstiness | GPTZero-style | < 0.15 (uniform) | > 0.3 (varied) |
| Fast-DetectGPT | Bao et al. (2023) | > 0.55 (negative curvature) | < 0.45 (positive curvature) |
| Cross-model PPL (GPT-Neo) | Binoculars-style | < 15 (both models agree) | > 25 (models disagree) |
| Character Distribution | LD-Score (Narayanasamy, 2026) | Global baseline | Domain-specialized |
| Stylometric (6 sub-signals) | Pangram-style | Formulaic, passive-heavy | Natural, varied |
| Weighted Ensemble | StealthRL-inspired | > 0.5 = AI | < 0.4 = Human |
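
Each row of the table can be read as three bands: an AI band, a human band, and an uncertain gap between the two cut-offs. The sketch below encodes that reading for the numeric rows. The names `Band`, `BANDS`, and `band` are hypothetical and are not taken from the repository; only the thresholds come from the table.

```python
from typing import NamedTuple


class Band(NamedTuple):
    ai_cut: float     # value beyond which the signal looks AI-like
    human_cut: float  # value beyond which the signal looks human-like
    ai_is_low: bool   # True when LOW values indicate AI (e.g. perplexity)


# Thresholds copied from the table above; the key names are illustrative.
BANDS = {
    "perplexity": Band(18.0, 25.0, ai_is_low=True),
    "burstiness": Band(0.15, 0.30, ai_is_low=True),
    "fast_detectgpt": Band(0.55, 0.45, ai_is_low=False),
    "cross_model_ppl": Band(15.0, 25.0, ai_is_low=True),
    "ensemble": Band(0.50, 0.40, ai_is_low=False),
}


def band(signal: str, value: float) -> str:
    """Return 'ai', 'human', or 'uncertain' for one signal value."""
    b = BANDS[signal]
    if b.ai_is_low:
        if value < b.ai_cut:
            return "ai"
        if value > b.human_cut:
            return "human"
    else:
        if value > b.ai_cut:
            return "ai"
        if value < b.human_cut:
            return "human"
    return "uncertain"
```

Values that fall between the two cut-offs (for example a perplexity of 20) are reported as `uncertain` rather than forced into either class.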

---

## Architecture

### DiffusionGemma 26B
- **Total params:** 25.2B | **Active:** 3.8B (MoE: 8/128 experts + 1 shared)
- **Generation:** Block-autoregressive discrete diffusion
- **Canvas:** 256 tokens, bidirectional attention
- **Sampler:** Entropy-Bounded Denoising (1-48 steps, temperature 0.8→0.4)
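
To make the `0.8→0.4` temperature range concrete, here is a minimal sketch of an annealing schedule over the denoising steps. The linear shape and the function name `temperature_schedule` are assumptions for illustration; the actual Entropy-Bounded Denoising sampler may anneal differently.

```python
def temperature_schedule(num_steps: int, t_max: float = 0.8, t_min: float = 0.4) -> list[float]:
    """Linearly anneal the sampling temperature from t_max to t_min (assumed shape)."""
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    if num_steps == 1:
        return [t_max]
    step = (t_max - t_min) / (num_steps - 1)
    return [t_max - i * step for i in range(num_steps)]


# 24 steps matches max_denoising_steps in the usage example.
schedule = temperature_schedule(24)
```

Early steps sample at a higher temperature, and later steps sharpen the distribution as the canvas converges.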

### Manual LoRA Injection
```
Gemma4ClippableLinear
  └── linear: Linear4bit (torch.nn.Linear subclass)
       ├── forward: x @ W.T  (frozen, 4-bit, no grad)
       └── LoRA hook: + (x.detach() @ A @ B) * scale  (trainable, bf16)
            ├── A: (in_features, rank=8), kaiming init
            └── B: (rank=8, out_features), zero init
```
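
The diagram above can be sketched with a standard PyTorch forward hook. This is an illustration under stated assumptions, not the repository's implementation: a plain `nn.Linear` stands in for the frozen `Linear4bit` module, and the class name `LoRAHook` is hypothetical.

```python
import math

import torch
from torch import nn


class LoRAHook:
    """Adds a low-rank update to a frozen linear layer through a forward hook."""

    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0,
                 dtype: torch.dtype = torch.bfloat16):
        self.scale = alpha / rank
        # A: (in_features, rank), Kaiming init in float32, then cast to bf16.
        a = torch.empty(linear.in_features, rank)
        nn.init.kaiming_uniform_(a, a=math.sqrt(5))
        self.A = nn.Parameter(a.to(dtype))
        # B: (rank, out_features), zero init, so the adapter starts as a no-op.
        self.B = nn.Parameter(torch.zeros(rank, linear.out_features, dtype=dtype))
        self.handle = linear.register_forward_hook(self)

    def __call__(self, module, inputs, output):
        x = inputs[0].detach().to(self.A.dtype)  # gradients stop at the adapter input
        delta = (x @ self.A @ self.B) * self.scale
        return output + delta.to(output.dtype)   # a returned value replaces the output


base = nn.Linear(16, 32)
base.requires_grad_(False)  # stand-in for the frozen 4-bit layer
hook = LoRAHook(base)
y = base(torch.randn(4, 16))
```

Because `B` starts at zero, the hooked layer initially returns exactly the frozen layer's output, and only `hook.A` and `hook.B` need to be handed to the optimizer. Calling `hook.handle.remove()` detaches the adapter again.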

### Training Loop
```
for each batch (prompt + target response):
    1. Mask a random 30-70% of target tokens to build the noisy canvas
    2. Forward: prompt → encoder → KV cache
       decoder: canvas → bidirectional attention → logits
       (gradient checkpointing: activations NOT stored)
    3. Compute loss ONLY on masked positions (memory efficient)
    4. Add entropy regularization (encourage human-like uncertainty)
    5. Backward: recompute activations via checkpoint
       gradient only flows through LoRA params (detached hooks)
    6. Update LoRA weights (AdamW, lr=2e-4)
```
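
The masking, masked-position loss, and entropy regularization in the loop above can be sketched in plain Python. The exact form of the entropy term is not specified in this README, so the squared distance from the target entropy and the `entropy_weight` value are assumptions, as are the function names.

```python
import math
import random


def sample_mask(length, rng, lo=0.3, hi=0.7):
    """Mask a random 30-70% of positions; True means 'masked, must be predicted'."""
    ratio = rng.uniform(lo, hi)
    k = max(1, round(ratio * length))
    chosen = set(rng.sample(range(length), k))
    return [i in chosen for i in range(length)]


def masked_diffusion_loss(probs, targets, mask, entropy_target=2.5, entropy_weight=0.01):
    """Cross-entropy over masked positions plus an (assumed) entropy penalty.

    probs:   per-position probability distributions (list of lists)
    targets: gold token id per position
    mask:    True where the token was masked
    """
    ce, ent, n = 0.0, 0.0, 0
    for p, t, m in zip(probs, targets, mask):
        if not m:
            continue  # unmasked or padding position: contributes no loss
        ce += -math.log(p[t])
        ent += -sum(q * math.log(q) for q in p if q > 0)
        n += 1
    if n == 0:
        return 0.0
    mean_ce, mean_ent = ce / n, ent / n
    # Assumed form: pull the mean predictive entropy toward the target value.
    return mean_ce + entropy_weight * (mean_ent - entropy_target) ** 2
```

In a real training step the same computation would run on logits tensors, with the mask applied before the reduction so that unmasked positions never enter the loss.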

---

## Installation

### Prerequisites
```bash
pip install modal
modal setup
modal secret create hf-secrets HF_TOKEN=hf_your_token
```

### Clone & Deploy
```bash
git clone https://huggingface.co/simonlesaumon/diffusiongemma-humanizer
cd diffusiongemma-humanizer
bash run.sh
```

---

## Usage

### Basic: Humanize AI Text

```python
from transformers import DiffusionGemmaForBlockDiffusion, AutoTokenizer, BitsAndBytesConfig
import torch

# Load 4-bit model
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
                         bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4")
model = DiffusionGemmaForBlockDiffusion.from_pretrained(
    "google/diffusiongemma-26B-A4B-it",
    quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("google/diffusiongemma-26B-A4B-it")

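# Load fine-tuned LoRA weights.
# PEFT cannot wrap Gemma4ClippableLinear, so the adapter is a plain state dict
# that must be re-attached with the manual LoRA hooks (see lora/ for weights + config).
lora_state = torch.load("lora/lora_weights.pt", map_location="cpu")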

# Humanize
ai_text = "Your AI-generated text here..."
messages = [
    {"role": "system", "content": "Rewrite to sound human-written."},
    {"role": "user", "content": ai_text},
]
inputs = tokenizer.apply_chat_template(messages, tokenize=True,
    add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)

ai_tokens = tokenizer(ai_text, max_length=256, truncation=True,
                      padding="max_length", return_tensors="pt")
output = model.generate(**inputs,
    decoder_input_ids=ai_tokens["input_ids"].to(model.device),
    max_new_tokens=512, max_denoising_steps=24, t_max=0.8, t_min=0.4)
humanized = tokenizer.decode(output.sequences[0][inputs["input_ids"].shape[-1]:],
                              skip_special_tokens=True)
```

---

## Training Pipeline

### 6-Step Process (runs on Modal A100 80GB)

| Step | Description | Time |
|------|-------------|------|
| **1. Load Models** | DiffusionGemma 4-bit + GPT-2 + GPT-Neo detectors | ~5 min |
| **2. Baseline Evaluation** | 7-signal detector ensemble on 5 prompts | ~30 sec |
| **3. Build Dataset** | 10K+ synthetic pairs annotated with detector scores | ~10 min |
| **4. LoRA + Training** | Manual LoRA (last 2 layers, 30 modules) + 5-20 epochs | ~10h |
| **5. Post-Training Eval** | Compare ensemble scores before/after | ~30 sec |
| **6. Export to HF** | LoRA weights (5 MB) + results + model card | ~10 sec |

### Training Hyperparameters

| Param | Value | Rationale |
|-------|-------|-----------|
| LoRA rank | 8 | Balance expressiveness vs memory |
| LoRA alpha | 16 | Scaling factor alpha/r = 2 |
| Learning rate | 2e-4 | Standard for LoRA fine-tuning |
| Optimizer | AdamW (paged_adamw_8bit) | VRAM efficient |
| Epochs | 5-20 | Dataset-size dependent |
| Batch size | 1 | VRAM constraint |
| Gradient accumulation | 16 | Effective batch = 16 |
| Mask ratio | 30-70% random | Diffusion training objective |
| Entropy target | 2.5 | Human-like token uncertainty |
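
Two entries in the table are derived quantities: the LoRA scale `alpha / r` and the effective batch size. The sketch below makes that arithmetic explicit and estimates the number of optimizer updates. The `TrainConfig` class is hypothetical, and the step count assumes a partial final accumulation window still triggers an update.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    lora_rank: int = 8
    lora_alpha: int = 16
    batch_size: int = 1
    grad_accum: int = 16

    @property
    def lora_scale(self) -> float:
        return self.lora_alpha / self.lora_rank

    @property
    def effective_batch(self) -> int:
        return self.batch_size * self.grad_accum

    def optimizer_steps(self, num_examples: int, epochs: int) -> int:
        """Number of optimizer updates over the whole run (assumed rounding up)."""
        return math.ceil(num_examples / self.effective_batch) * epochs


cfg = TrainConfig()
steps = cfg.optimizer_steps(num_examples=10_000, epochs=5)
```

With 10,000 pairs and an effective batch of 16, each epoch performs 625 updates, so a 5-epoch run performs 3,125.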

### Run the Pipeline
```bash
# Quick run (5 epochs, small dataset)
bash run.sh

# Full training (20 epochs, 10K+ dataset)
# Set num_epochs=20 in modal_project/app.py, then:
modal run modal_project/app.py --hf-token=hf_xxx
```

---

## Multi-Detector Scoring

The scoring system implements techniques from multiple papers:

### Signal 1: GPT-2 Perplexity (GPTZero-style)
Measures how "surprising" each token is to GPT-2 Medium. AI text tends to be more predictable (lower perplexity).
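
Perplexity is the exponential of the mean negative log-likelihood per token. A minimal sketch, assuming the per-token natural-log probabilities have already been obtained from the scoring model (the function name is illustrative):

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Four tokens, each assigned probability 0.25 by the model.
ppl = perplexity([math.log(0.25)] * 4)
```

A model that assigns every token probability 0.25 is, on average, choosing among four equally likely options, which is what a perplexity of 4 expresses.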

### Signal 2: Burstiness (GPTZero-style)
Coefficient of variation of per-sentence perplexity. Human text varies more in complexity.
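
The coefficient of variation is the standard deviation divided by the mean. A minimal sketch over a list of per-sentence perplexities; using the population standard deviation is an assumption, since the README does not say which estimator the pipeline uses.

```python
import statistics


def burstiness(sentence_ppls):
    """Coefficient of variation of per-sentence perplexity."""
    if len(sentence_ppls) < 2:
        return 0.0  # a single sentence has no variation to measure
    mean = statistics.fmean(sentence_ppls)
    if mean <= 0:
        return 0.0
    return statistics.pstdev(sentence_ppls) / mean
```

Uniform sentences score 0, while a passage that alternates between easy and hard sentences scores higher.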

### Signal 3: Fast-DetectGPT (Bao et al., 2023)
Probability curvature analysis: AI text tends to sit near local maxima of the scoring model's log-probability landscape (regions of negative curvature), while human text does not.
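
Fast-DetectGPT compares the log-likelihood of the observed tokens with the log-likelihood the model would expect, on average, from tokens sampled out of its own distribution. The sketch below computes that normalized difference analytically for toy distributions. It assumes the sampling and scoring models are the same, and it does not reproduce the mapping to the 0-1 score used in the table, which this README does not specify.

```python
import math


def conditional_curvature(dists, token_ids):
    """Analytic conditional probability curvature.

    dists:     per-position next-token distributions (all entries > 0)
    token_ids: the token actually observed at each position
    """
    log_lik = mean_sum = var_sum = 0.0
    for p, t in zip(dists, token_ids):
        logs = [math.log(q) for q in p]
        mu = sum(q * l for q, l in zip(p, logs))                 # expected log-prob
        var = sum(q * l * l for q, l in zip(p, logs)) - mu * mu  # its variance
        log_lik += logs[t]
        mean_sum += mu
        var_sum += var
    # Undefined (division by zero) if every distribution is exactly uniform.
    return (log_lik - mean_sum) / math.sqrt(var_sum)
```

Text made of the model's most likely tokens yields a positive value, and text made of unlikely tokens yields a negative one.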

### Signal 4: Cross-Model Perplexity (Binoculars-style)
Perplexity under GPT-Neo 125M is compared with perplexity under GPT-2 Medium. When the two models disagree, the text is more likely human.

### Signal 5: Character Distribution (LD-Score, Narayanasamy 2026)
AI text approximates global character patterns; human text shows domain specialization.
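
One way to illustrate "distance from a global character baseline" is the Jensen-Shannon divergence between a text's letter distribution and a reference distribution. This is an illustrative stand-in and not the LD-Score formula from the cited paper; the function names and the choice of divergence are assumptions.

```python
import math
from collections import Counter


def char_distribution(text):
    """Relative frequency of each letter, case-folded."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, between 0 (identical) and 1 (disjoint)."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

Under this reading, a low divergence from the global baseline would count toward the AI pattern and a high divergence toward the domain-specialized human pattern.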

### Signal 6: Stylometric Ensemble (Pangram-style)
6 sub-signals: sentence length σ, hapax legomena ratio, transition marker rate, passive voice rate, formulaic phrase rate, word length σ.
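
Four of the six sub-signals can be computed with the standard library alone, as sketched below. Passive voice and formulaic phrase rates are omitted because they need a parser or a phrase list that this README does not specify. The `TRANSITIONS` set and the regex-based sentence splitter are illustrative assumptions.

```python
import re
import statistics
from collections import Counter

# Illustrative marker list; the pipeline's actual list is not given here.
TRANSITIONS = {"however", "moreover", "furthermore", "additionally", "therefore"}
WORD_RE = re.compile(r"[A-Za-z']+")


def stylometric_features(text):
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = WORD_RE.findall(text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    sent_lens = [len(WORD_RE.findall(s)) for s in sentences]
    return {
        "sentence_length_std": statistics.pstdev(sent_lens),
        "hapax_ratio": sum(1 for n in counts.values() if n == 1) / len(words),
        "transition_rate": sum(counts[w] for w in TRANSITIONS) / len(words),
        "word_length_std": statistics.pstdev([len(w) for w in words]),
    }
```

Each feature would still need to be mapped to a 0-1 AI-likeness value before it can enter the ensemble.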

### Signal 7: Weighted Ensemble
Calibrated weights combine all signals, with the largest weights on the stylometric signal (1.5x) and Fast-DetectGPT (1.0x).
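
A weighted mean is the simplest way to combine per-signal scores. In the sketch below only the stylometric (1.5) and Fast-DetectGPT (1.0) weights come from this README; the fallback weight of 0.5 for every other signal and the function name are assumptions.

```python
def ensemble_score(signals, weights=None):
    """Weighted mean of per-signal AI-likeness scores, each already in [0, 1]."""
    if not signals:
        raise ValueError("need at least one signal")
    default = {"stylometric": 1.5, "fast_detectgpt": 1.0}  # from the text above
    merged = {**default, **(weights or {})}
    fallback = 0.5                                         # assumed weight for other signals
    total = sum(merged.get(name, fallback) for name in signals)
    return sum(merged.get(name, fallback) * s for name, s in signals.items()) / total


score = ensemble_score({"stylometric": 0.3, "fast_detectgpt": 0.0, "perplexity": 0.6})
```

The result is then compared against the ensemble thresholds: above 0.5 counts as AI and below 0.4 as human.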

---

## Results

### Baseline (untrained DiffusionGemma)
- **0/5 texts detected as AI** by weighted ensemble
- Mean ensemble score: **0.350** (threshold: < 0.4 = Human)

### Breaking Down Detection Signals

| Text Type | PPL | Burstiness | FDGPT | Stylometric | Ensemble |
|-----------|-----|-----------|-------|-------------|----------|
| Remote work blog | 16-23 | 0.58-0.96 | 0.000 | 0.29-0.35 | 0.30-0.38 |
| Quantum computing | 14-20 | 0.57-0.70 | 0.000 | 0.23-0.33 | 0.30-0.41 |
| Email declining job | 7-9 | 0.48-0.91 | 0.001 | 0.27-0.33 | 0.44-0.56 |
| French Revolution | 16-18 | 0.53-0.74 | 0.000 | 0.25-0.25 | 0.29-0.50 |
| Headphones review | 14-22 | 0.37-1.25 | 0.000 | 0.22-0.25 | 0.33-0.47 |

### Why DiffusionGemma Evades Detectors
1. **Different statistical pathway** — block-autoregressive diffusion produces token distributions unlike standard AR models
2. **Bidirectional attention** — considers full context when denoising, producing more natural text
3. **Iterative refinement** — entropy-bounded denoising naturally introduces variation
4. **No left-to-right bias** — avoids formulaic transition patterns common in AR text

---

## Research Background

This project synthesizes findings from 30+ papers (see `research/` folder):

- **Sadasivan et al. (2023):** Theoretical ceiling — perfect detectors impossible as LLMs improve
- **Tarım & Onan (2025):** Diffusion text naturally resists AR-trained detectors
- **Cheng et al. (2025):** Adversarial Paraphrasing — 87.88% TPR reduction via detector-guided feedback
- **Ranganath & Ramesh (2026):** StealthRL — 99.9% attack success with multi-detector GRPO
- **Pedrotti et al. (2025):** DPO style-shifting — few-shot fine-tuning fools detectors
- **Narayanasamy et al. (2026):** LD-Score — character distribution separates human/AI text
- **Xu et al. (2026):** HIP pipeline — base models look human to detectors

Full literature review: `research/technical-diffusion-text-humanization-2026-06-29.md`

---

## Repository Structure

```
diffusiongemma-humanizer/
├── README.md                                    # This file
├── research_report.md                           # Gemma + diffusion models + Modal costs
├── research_datasets_training.md                # Training data survey
├── commercial_ai_detectors_report.md            # Pangram, GPTZero, Originality.ai analysis
├── research/
│   ├── architecture-strategy.md                 # Architecture decisions & cost breakdown
│   └── technical-diffusion-text-humanization-2026-06-29.md  # Full lit review (30+ papers)
├── modal_project/
│   ├── app.py                                   # Complete 6-step training pipeline
│   ├── humanize_french.py                       # French text humanization (standalone)
│   └── upload_hf.py                             # HF upload utilities
├── scripts/
│   ├── run.py                                   # Simple launcher
│   ├── launch.py                                # Launcher with UTF-8 logging
│   ├── run_pipeline.ps1                         # PowerShell launcher
│   └── run_pipeline.bat                         # Batch launcher
├── run.sh                                       # Bash launcher (primary)
├── run_french.py                                # French humanization launcher
├── lora/                                        # Fine-tuned LoRA weights
│   ├── lora_weights.pt                          # LoRA parameter state dict
│   └── lora_config.json                         # LoRA configuration
├── baseline_detector_results.json               # Pre-training evaluation
├── post_training_eval.json                      # Post-training evaluation
└── experiment_log.json                          # Full experiment config & results
```

---

## License

Apache 2.0 — matching the base model `google/diffusiongemma-26B-A4B-it`.

---

*Pipeline last run: 2026-06-30 | GPU: Modal A100 80GB | Framework: PyTorch 2.12 + Transformers 5.12*