# DiffusionGemma Humanizer: SOTA Text Humanization

**Fine-tuning Google's DiffusionGemma 26B (MoE, 3.8B active, Apache 2.0) to humanize AI-generated text and evade multi-signal AI detectors.**

[![HF Repo](https://img.shields.io/badge/🤗_HF-simonlesaumon/diffusiongemma--humanizer-blue)](https://huggingface.co/simonlesaumon/diffusiongemma-humanizer)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
[![GPU](https://img.shields.io/badge/GPU-A100_80GB-orange)]()

---

## Table of Contents

1. [Key Findings](#key-findings)
2. [Architecture](#architecture)
3. [Installation](#installation)
4. [Usage](#usage)
5. [Training Pipeline](#training-pipeline)
6. [Multi-Detector Scoring](#multi-detector-scoring)
7. [Results](#results)
8. [Research Background](#research-background)
9. [Repository Structure](#repository-structure)
10. [License](#license)

---

## Key Findings

### 1. DiffusionGemma base model achieves ~0% AI detection

On Fast-DetectGPT + heuristic ensemble (7 signals: perplexity, burstiness, cross-model PPL, character distribution, stylometric), DiffusionGemma 26B generates text classified as **100% Human**, confirming the hypothesis from Tarım & Onan (2025): diffusion-generated text naturally resists autoregressive-trained detectors.

### 2. Manual LoRA bypasses PEFT incompatibility

PEFT does not support `Gemma4ClippableLinear` (DiffusionGemma's custom linear wrapper). We implemented **Manual LoRA injection** via forward hooks that target the underlying `Linear4bit` modules, bypassing PEFT entirely.

### 3. VRAM optimization strategy

DiffusionGemma 26B in 4-bit uses **50.8 GB** on A100 80GB.
Training requires:

- **Last 2 layers only**: injects LoRA into 30 modules (not 189 across all layers)
- **Gradient checkpointing**: trades compute for memory, recomputing activations during backward
- **Loss only on masked positions**: skips padding tokens for memory efficiency
- **bf16 LoRA params**: halves activation memory vs float32

### 4. Multi-detector ensemble scoring

| Signal | Source | AI Pattern | Human Pattern |
|--------|--------|-----------|---------------|
| Perplexity (GPT-2) | GPTZero-style | < 18 (too predictable) | > 25 (natural variation) |
| Burstiness | GPTZero-style | < 0.15 (uniform) | > 0.3 (varied) |
| Fast-DetectGPT | Bao et al. (2023) | > 0.55 (negative curvature) | < 0.45 (positive curvature) |
| Cross-model PPL (GPT-Neo) | Binoculars-style | < 15 (both models agree) | > 25 (models disagree) |
| Character Distribution | LD-Score (Narayanasamy, 2026) | Global baseline | Domain-specialized |
| Stylometric (6 sub-signals) | Pangram-style | Formulaic, passive-heavy | Natural, varied |
| Weighted Ensemble | StealthRL-inspired | > 0.5 = AI | < 0.4 = Human |

---

## Architecture

### DiffusionGemma 26B

- **Total params:** 25.2B | **Active:** 3.8B (MoE: 8/128 experts + 1 shared)
- **Generation:** Block-autoregressive discrete diffusion
- **Canvas:** 256 tokens, bidirectional attention
- **Sampler:** Entropy-Bounded Denoising (1-48 steps, temperature 0.8→0.4)

### Manual LoRA Injection

```
Gemma4ClippableLinear
└── linear: Linear4bit (torch.nn.Linear subclass)
    ├── forward: W @ x  (frozen, 4-bit, no grad)
    └── LoRA hook: x.detach() @ A @ B * scale  (trainable, bf16)
        ├── A: (in_features, rank=8), kaiming init
        └── B: (rank=8, out_features), zero init
```

### Training Loop

```
for each batch (prompt + target response):
  1. Forward: prompt → encoder → KV cache
              decoder: canvas → bidirectional attention → logits
     (gradient checkpointing: activations NOT stored)
  2. Mask 30-70% of target tokens randomly
  3. Compute loss ONLY on masked positions (memory efficient)
  4. Add entropy regularization (encourage human-like uncertainty)
  5. Backward: recompute activations via checkpoint
     gradient only flows through LoRA params (detached hooks)
  6. Update LoRA weights (AdamW, lr=2e-4)
```

---

## Installation

### Prerequisites

```bash
pip install modal
modal setup
modal secret create hf-secrets HF_TOKEN=hf_your_token
```

### Clone & Deploy

```bash
git clone https://huggingface.co/simonlesaumon/diffusiongemma-humanizer
cd diffusiongemma-humanizer
bash run.sh
```

---

## Usage

### Basic: Humanize AI Text

```python
from transformers import DiffusionGemmaForBlockDiffusion, AutoTokenizer, BitsAndBytesConfig
import torch

# Load 4-bit model
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = DiffusionGemmaForBlockDiffusion.from_pretrained(
    "google/diffusiongemma-26B-A4B-it", quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("google/diffusiongemma-26B-A4B-it")

# Load fine-tuned LoRA weights with the manual LoRA loader
# (see lora/ folder for weights + config). PEFT's PeftModel is not used here
# because PEFT does not support Gemma4ClippableLinear (see Key Findings).

# Humanize
ai_text = "Your AI-generated text here..."
messages = [
    {"role": "system", "content": "Rewrite to sound human-written."},
    {"role": "user", "content": ai_text},
]
# Encoder input: the chat-formatted rewrite instruction
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt").to(model.device)
# Decoder canvas: the AI text itself, padded/truncated to 256 tokens
ai_tokens = tokenizer(ai_text, max_length=256, truncation=True,
                      padding="max_length", return_tensors="pt")
output = model.generate(
    **inputs,
    decoder_input_ids=ai_tokens["input_ids"].to(model.device),
    max_new_tokens=512, max_denoising_steps=24, t_max=0.8, t_min=0.4)
# Drop the prompt tokens and decode only the generated continuation
humanized = tokenizer.decode(
    output.sequences[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True)
```

---

## Training Pipeline

### 6-Step Process (runs on Modal A100 80GB)

| Step | Description | Time |
|------|-------------|------|
| **1. Load Models** | DiffusionGemma 4-bit + GPT-2 + GPT-Neo detectors | ~5 min |
| **2. Baseline Evaluation** | 7-signal detector ensemble on 5 prompts | ~30 sec |
| **3. Build Dataset** | 10K+ synthetic pairs annotated with detector scores | ~10 min |
| **4. LoRA + Training** | Manual LoRA (last 2 layers, 30 modules) + 5-20 epochs | ~10h |
| **5. Post-Training Eval** | Compare ensemble scores before/after | ~30 sec |
| **6. Export to HF** | LoRA weights (5 MB) + results + model card | ~10 sec |

### Training Hyperparameters

| Param | Value | Rationale |
|-------|-------|-----------|
| LoRA rank | 8 | Balance expressiveness vs memory |
| LoRA alpha | 16 | Scaling factor alpha/r = 2 |
| Learning rate | 2e-4 | Standard for LoRA fine-tuning |
| Optimizer | AdamW (paged_adamw_8bit) | VRAM efficient |
| Epochs | 5-20 | Dataset-size dependent |
| Batch size | 1 | VRAM constraint |
| Gradient accumulation | 16 | Effective batch = 16 |
| Mask ratio | 30-70% random | Diffusion training objective |
| Entropy target | 2.5 | Human-like token uncertainty |

### Run the Pipeline

```bash
# Quick run (5 epochs, small dataset)
bash run.sh

# Full training (20 epochs, 10K+ dataset)
# Set num_epochs=20 in modal_project/app.py, then:
modal run modal_project/app.py --hf-token=hf_xxx
```

---

## Multi-Detector Scoring

The scoring system implements techniques from multiple papers:

### Signal 1: GPT-2 Perplexity (GPTZero-style)

Measures how "surprising" each word is to GPT-2 Medium. AI text tends to be more predictable (lower perplexity).

### Signal 2: Burstiness (GPTZero-style)

Coefficient of variation of per-sentence perplexity. Human text varies more in complexity.

### Signal 3: Fast-DetectGPT (Bao et al., 2023)

Probability curvature analysis: AI text tends to sit near local maxima of the scoring model's log-probability, in regions of negative curvature.

### Signal 4: Cross-Model Perplexity (Binoculars-style)

Perplexity under GPT-Neo 125M is compared with perplexity under GPT-2 Medium. When the two models disagree, the text is likely human.

### Signal 5: Character Distribution (LD-Score, Narayanasamy 2026)

AI text approximates global character patterns; human text shows domain specialization.

### Signal 6: Stylometric Ensemble (Pangram-style)

6 sub-signals: sentence length σ, hapax legomena ratio, transition marker rate, passive voice rate, formulaic phrase rate, word length σ.
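
Three of the Signal 6 sub-signals (sentence length σ, hapax legomena ratio, word length σ) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the function name `stylometric_features`, the regex-based sentence splitting and tokenization, and the choice to define the hapax ratio over distinct words are all hypothetical.

```python
import re
import statistics
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z']+")


def stylometric_features(text: str) -> dict[str, float]:
    """Compute three illustrative stylometric sub-signals for `text`."""
    # Assumption: sentences end at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = WORD_RE.findall(text.lower())
    if not words:
        return {"sentence_len_std": 0.0, "hapax_ratio": 0.0, "word_len_std": 0.0}

    # Words per sentence; a larger spread suggests more varied sentence structure.
    sentence_lengths = [len(WORD_RE.findall(s)) for s in sentences]

    # Hapax legomena: words that occur exactly once.
    # Assumption: the ratio is taken over distinct words, not total tokens.
    counts = Counter(words)
    hapax = sum(1 for c in counts.values() if c == 1)

    return {
        "sentence_len_std": statistics.pstdev(sentence_lengths),
        "hapax_ratio": hapax / len(counts),
        "word_len_std": statistics.pstdev([len(w) for w in words]),
    }
```

Calling `stylometric_features(some_text)` returns a dictionary of raw values. How those values are scaled and thresholded before they enter the ensemble is not specified here and would need calibration.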

### Signal 7: Weighted Ensemble

Calibrated weights combining all signals with higher confidence on stylometric (1.5x) and Fast-DetectGPT (1.0x).

---

## Results

### Baseline (untrained DiffusionGemma)

- **0/5 texts detected as AI** by weighted ensemble
- Mean ensemble score: **0.350** (threshold: < 0.4 = Human)

### Breaking Down Detection Signals

| Text Type | PPL | Burstiness | FDGPT | Stylometric | Ensemble |
|-----------|-----|-----------|-------|-------------|----------|
| Remote work blog | 16-23 | 0.58-0.96 | 0.000 | 0.29-0.35 | 0.30-0.38 |
| Quantum computing | 14-20 | 0.57-0.70 | 0.000 | 0.23-0.33 | 0.30-0.41 |
| Email declining job | 7-9 | 0.48-0.91 | 0.001 | 0.27-0.33 | 0.44-0.56 |
| French Revolution | 16-18 | 0.53-0.74 | 0.000 | 0.25-0.25 | 0.29-0.50 |
| Headphones review | 14-22 | 0.37-1.25 | 0.000 | 0.22-0.25 | 0.33-0.47 |

### Why DiffusionGemma Evades Detectors

1. **Different statistical pathway**: block-autoregressive diffusion produces token distributions unlike standard AR models
2. **Bidirectional attention**: considers full context when denoising, producing more natural text
3. **Iterative refinement**: entropy-bounded denoising naturally introduces variation
4. **No left-to-right bias**: avoids formulaic transition patterns common in AR text

---

## Research Background

This project synthesizes findings from 30+ papers (see `research/` folder):

- **Sadasivan et al. (2023):** Theoretical ceiling: perfect detectors impossible as LLMs improve
- **Tarım & Onan (2025):** Diffusion text naturally resists AR-trained detectors
- **Cheng et al. (2025):** Adversarial Paraphrasing: 87.88% TPR reduction via detector-guided feedback
- **Ranganath & Ramesh (2026):** StealthRL: 99.9% attack success with multi-detector GRPO
- **Pedrotti et al. (2025):** DPO style-shifting: few-shot fine-tuning fools detectors
- **Narayanasamy et al. (2026):** LD-Score: character distribution separates human/AI text
- **Xu et al. (2026):** HIP pipeline: base models look human to detectors

Full literature review: `research/technical-diffusion-text-humanization-2026-06-29.md`

---

## Repository Structure

```
diffusiongemma-humanizer/
├── README.md                              # This file
├── research_report.md                     # Gemma + diffusion models + Modal costs
├── research_datasets_training.md          # Training data survey
├── commercial_ai_detectors_report.md      # Pangram, GPTZero, Originality.ai analysis
├── research/
│   ├── architecture-strategy.md           # Architecture decisions & cost breakdown
│   └── technical-diffusion-text-humanization-2026-06-29.md  # Full lit review (30+ papers)
├── modal_project/
│   ├── app.py                             # Complete 6-step training pipeline
│   ├── humanize_french.py                 # French text humanization (standalone)
│   └── upload_hf.py                       # HF upload utilities
├── scripts/
│   ├── run.py                             # Simple launcher
│   ├── launch.py                          # Launcher with UTF-8 logging
│   ├── run_pipeline.ps1                   # PowerShell launcher
│   └── run_pipeline.bat                   # Batch launcher
├── run.sh                                 # Bash launcher (primary)
├── run_french.py                          # French humanization launcher
├── lora/                                  # Fine-tuned LoRA weights
│   ├── lora_weights.pt                    # LoRA parameter state dict
│   └── lora_config.json                   # LoRA configuration
├── baseline_detector_results.json         # Pre-training evaluation
├── post_training_eval.json                # Post-training evaluation
└── experiment_log.json                    # Full experiment config & results
```

---

## License

Apache 2.0, matching the base model `google/diffusiongemma-26B-A4B-it`.

---

*Pipeline last run: 2026-06-30 | GPU: Modal A100 80GB | Framework: PyTorch 2.12 + Transformers 5.12*