# DiffusionGemma Humanizer: SOTA Text Humanization
**Fine-tuning Google's DiffusionGemma 26B (MoE, 3.8B active, Apache 2.0) to humanize AI-generated text and evade multi-signal AI detectors.**
[![HF Repo](https://img.shields.io/badge/🤗_HF-simonlesaumon/diffusiongemma--humanizer-blue)](https://huggingface.co/simonlesaumon/diffusiongemma-humanizer)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
[![GPU](https://img.shields.io/badge/GPU-A100_80GB-orange)]()
---
## Table of Contents
1. [Key Findings](#key-findings)
2. [Architecture](#architecture)
3. [Installation](#installation)
4. [Usage](#usage)
5. [Training Pipeline](#training-pipeline)
6. [Multi-Detector Scoring](#multi-detector-scoring)
7. [Results](#results)
8. [Research Background](#research-background)
9. [Repository Structure](#repository-structure)
10. [License](#license)
---
## Key Findings
### 1. DiffusionGemma base model achieves ~0% AI detection
On the Fast-DetectGPT + heuristic ensemble (7 signals, including perplexity, burstiness, cross-model PPL, character distribution, and stylometric features), text generated by the DiffusionGemma 26B base model is classified as **100% Human** on the 5 baseline prompts. This is consistent with the hypothesis from Tarım & Onan (2025) that diffusion-generated text naturally resists detectors trained on autoregressive output.
### 2. Manual LoRA bypasses PEFT incompatibility
PEFT does not support `Gemma4ClippableLinear` (DiffusionGemma's custom linear wrapper). We implemented **Manual LoRA injection** via forward hooks that target the underlying `Linear4bit` modules, bypassing PEFT entirely.
### 3. VRAM optimization strategy
DiffusionGemma 26B in 4-bit uses **50.8 GB** on A100 80GB. Training requires:
- **Last 2 layers only:** injects LoRA into 30 modules (not 189 across all layers)
- **Gradient checkpointing:** trades compute for memory, recomputing activations during backward
- **Loss only on masked positions:** skips padding tokens for memory efficiency
- **bf16 LoRA params:** halves activation memory vs float32
### 4. Multi-detector ensemble scoring
| Signal | Source | AI Pattern | Human Pattern |
|--------|--------|-----------|---------------|
| Perplexity (GPT-2) | GPTZero-style | < 18 (too predictable) | > 25 (natural variation) |
| Burstiness | GPTZero-style | < 0.15 (uniform) | > 0.3 (varied) |
| Fast-DetectGPT | Bao et al. (2023) | > 0.55 (negative curvature) | < 0.45 (positive curvature) |
| Cross-model PPL (GPT-Neo) | Binoculars-style | < 15 (both models agree) | > 25 (models disagree) |
| Character Distribution | LD-Score (Narayanasamy, 2026) | Global baseline | Domain-specialized |
| Stylometric (6 sub-signals) | Pangram-style | Formulaic, passive-heavy | Natural, varied |
| Weighted Ensemble | StealthRL-inspired | > 0.5 = AI | < 0.4 = Human |
---
## Architecture
### DiffusionGemma 26B
- **Total params:** 25.2B | **Active:** 3.8B (MoE: 8/128 experts + 1 shared)
- **Generation:** Block-autoregressive discrete diffusion
- **Canvas:** 256 tokens, bidirectional attention
- **Sampler:** Entropy-Bounded Denoising (1-48 steps, temperature 0.8→0.4)
### Manual LoRA Injection
```
Gemma4ClippableLinear
└── linear: Linear4bit (torch.nn.Linear subclass)
    ├── forward: W @ x (frozen, 4-bit, no grad)
    └── LoRA hook: A @ B @ x.detach() * scale (trainable, bf16)
        ├── A: (in_features, rank=8), kaiming init
        └── B: (rank=8, out_features), zero init
```
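The hook-based injection in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name `HookLoRA` is hypothetical, a plain `torch.nn.Linear` stands in for `Linear4bit`, and float32 is used instead of bf16 so the sketch runs on CPU. The product is written `x @ A @ B` so that it matches the shapes given for `A` and `B`.

```python
import math

import torch
import torch.nn as nn


class HookLoRA(nn.Module):
    """Low-rank adapter attached to a frozen linear layer via a forward hook.

    Hypothetical sketch: nn.Linear stands in for the 4-bit Linear4bit module.
    """

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.scale = alpha / rank
        # A: (in_features, rank), Kaiming init; B: (rank, out_features), zero init.
        self.A = nn.Parameter(torch.empty(base.in_features, rank))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        # Freeze the base weights; only A and B receive gradients.
        for p in base.parameters():
            p.requires_grad_(False)
        self.handle = base.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        # Detach the input so no gradient flows back into earlier layers.
        x = inputs[0].detach()
        delta = (x.to(self.A.dtype) @ self.A @ self.B) * self.scale
        # Returning a value from a forward hook replaces the module output.
        return output + delta.to(output.dtype)


base = nn.Linear(16, 4)
adapter = HookLoRA(base)
y = base(torch.randn(2, 16))  # the hook adds the (initially zero) LoRA delta
```

Because `B` starts at zero, the wrapped layer initially reproduces the frozen layer exactly; calling `adapter.handle.remove()` detaches the adapter again.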
### Training Loop
```
for each batch (prompt + target response):
    1. Forward: prompt → encoder → KV cache
       decoder: canvas → bidirectional attention → logits
       (gradient checkpointing: activations NOT stored)
    2. Mask 30-70% of target tokens randomly
    3. Compute loss ONLY on masked positions (memory efficient)
    4. Add entropy regularization (encourage human-like uncertainty)
    5. Backward: recompute activations via checkpoint
       gradient only flows through LoRA params (detached hooks)
    6. Update LoRA weights (AdamW, lr=2e-4)
```
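Steps 2-4 of the loop can be illustrated in plain Python. This is a sketch under stated assumptions: the function names are hypothetical, the entropy term is modeled as a squared distance to the entropy target with an arbitrary weight (the exact form of the regularizer is not specified here), and lists of floats stand in for logit tensors.

```python
import math
import random


def sample_mask(num_tokens, lo=0.3, hi=0.7, rng=random):
    """Pick a random mask ratio in [lo, hi] and mask that share of positions."""
    ratio = rng.uniform(lo, hi)
    k = max(1, round(ratio * num_tokens))
    masked = set(rng.sample(range(num_tokens), k))
    return [i in masked for i in range(num_tokens)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def masked_loss(logits, targets, mask, entropy_target=2.5, entropy_weight=0.01):
    """Cross-entropy averaged over masked positions plus an entropy penalty.

    The penalty form and entropy_weight are assumptions made for illustration.
    """
    nll, entropy, n = 0.0, 0.0, 0
    for row, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked positions contribute nothing to the loss
        probs = softmax(row)
        nll -= math.log(probs[target])
        entropy -= sum(p * math.log(p) for p in probs if p > 0)
        n += 1
    if n == 0:
        return 0.0
    return nll / n + entropy_weight * (entropy / n - entropy_target) ** 2


mask = sample_mask(100, rng=random.Random(0))
print(sum(mask), "of 100 positions masked")
```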
---
## Installation
### Prerequisites
```bash
pip install modal
modal setup
modal secret create hf-secrets HF_TOKEN=hf_your_token
```
### Clone & Deploy
```bash
git clone https://huggingface.co/simonlesaumon/diffusiongemma-humanizer
cd diffusiongemma-humanizer
bash run.sh
```
---
## Usage
### Basic: Humanize AI Text
```python
import torch
from transformers import DiffusionGemmaForBlockDiffusion, AutoTokenizer, BitsAndBytesConfig

# Load the base model in 4-bit (NF4, double quantization, bf16 compute)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = DiffusionGemmaForBlockDiffusion.from_pretrained(
    "google/diffusiongemma-26B-A4B-it",
    quantization_config=bnb,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/diffusiongemma-26B-A4B-it")

# Load fine-tuned LoRA weights with the manual LoRA loader
# (PEFT does not support Gemma4ClippableLinear; see lora/ folder for weights + config)

# Build the chat prompt
ai_text = "Your AI-generated text here..."
messages = [
    {"role": "system", "content": "Rewrite to sound human-written."},
    {"role": "user", "content": ai_text},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

# Initialize the 256-token canvas from the AI text itself
ai_tokens = tokenizer(ai_text, max_length=256, truncation=True,
                      padding="max_length", return_tensors="pt")

# Humanize: denoise the canvas, then strip the prompt tokens from the output
output = model.generate(
    **inputs,
    decoder_input_ids=ai_tokens["input_ids"].to(model.device),
    max_new_tokens=512, max_denoising_steps=24, t_max=0.8, t_min=0.4,
)
prompt_len = inputs["input_ids"].shape[-1]
humanized = tokenizer.decode(output.sequences[0][prompt_len:], skip_special_tokens=True)
```
---
## Training Pipeline
### 6-Step Process (runs on Modal A100 80GB)
| Step | Description | Time |
|------|-------------|------|
| **1. Load Models** | DiffusionGemma 4-bit + GPT-2 + GPT-Neo detectors | ~5 min |
| **2. Baseline Evaluation** | 7-signal detector ensemble on 5 prompts | ~30 sec |
| **3. Build Dataset** | 10K+ synthetic pairs annotated with detector scores | ~10 min |
| **4. LoRA + Training** | Manual LoRA (last 2 layers, 30 modules) + 5-20 epochs | ~10h |
| **5. Post-Training Eval** | Compare ensemble scores before/after | ~30 sec |
| **6. Export to HF** | LoRA weights (5 MB) + results + model card | ~10 sec |
### Training Hyperparameters
| Param | Value | Rationale |
|-------|-------|-----------|
| LoRA rank | 8 | Balance expressiveness vs memory |
| LoRA alpha | 16 | Scaling factor alpha/r = 2 |
| Learning rate | 2e-4 | Standard for LoRA fine-tuning |
| Optimizer | AdamW (paged_adamw_8bit) | VRAM efficient |
| Epochs | 5-20 | Dataset-size dependent |
| Batch size | 1 | VRAM constraint |
| Gradient accumulation | 16 | Effective batch = 16 |
| Mask ratio | 30-70% random | Diffusion training objective |
| Entropy target | 2.5 | Human-like token uncertainty |
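
The derived quantities in the table (LoRA scale and effective batch size) follow from simple arithmetic. The sketch below collects the hyperparameters in a dataclass; the class and field names are hypothetical and not taken from `modal_project/app.py`, and the step count assumes one optimizer update per 16 accumulated examples, with a final partial group still triggering an update.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed above."""

    lora_rank: int = 8
    lora_alpha: int = 16
    learning_rate: float = 2e-4
    batch_size: int = 1
    grad_accum_steps: int = 16
    mask_ratio: tuple = (0.3, 0.7)
    entropy_target: float = 2.5

    @property
    def lora_scale(self) -> float:
        # Scaling factor applied to the low-rank update: alpha / r
        return self.lora_alpha / self.lora_rank

    @property
    def effective_batch(self) -> int:
        return self.batch_size * self.grad_accum_steps

    def optimizer_steps(self, num_examples: int, epochs: int) -> int:
        # Assumes a partial accumulation group at epoch end still takes a step.
        return math.ceil(num_examples / self.effective_batch) * epochs


cfg = TrainConfig()
print(cfg.lora_scale, cfg.effective_batch, cfg.optimizer_steps(10_000, 20))
```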
### Run the Pipeline
```bash
# Quick run (5 epochs, small dataset)
bash run.sh
# Full training (20 epochs, 10K+ dataset)
# Set num_epochs=20 in modal_project/app.py, then:
modal run modal_project/app.py --hf-token=hf_xxx
```
---
## Multi-Detector Scoring
The scoring system implements techniques from multiple papers:
### Signal 1: GPT-2 Perplexity (GPTZero-style)
Measures how "surprising" each word is to GPT-2 Medium. AI text tends to be more predictable (lower perplexity).
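
As a minimal sketch, perplexity is the exponential of the mean negative log-likelihood per token. The function below takes per-token natural-log probabilities as input (in the pipeline these would come from GPT-2 Medium); the names `perplexity` and `perplexity_band` are hypothetical, and the bands reuse the thresholds from the Key Findings table.

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_band(ppl):
    """Map a perplexity to the bands from the signal table (< 18 AI, > 25 human)."""
    if ppl < 18:
        return "AI-like"
    if ppl > 25:
        return "Human-like"
    return "Inconclusive"


# A model that assigns probability 0.25 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 4)
print(round(ppl, 3), perplexity_band(ppl))
```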
### Signal 2: Burstiness (GPTZero-style)
Coefficient of variation of per-sentence perplexity. Human text varies more in complexity.
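
A sketch of the coefficient of variation, assuming per-sentence perplexities have already been computed. The function name is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
import statistics


def burstiness(sentence_ppls):
    """Coefficient of variation (std / mean) of per-sentence perplexities."""
    if len(sentence_ppls) < 2:
        return 0.0  # a single sentence has no variation to measure
    mean = statistics.fmean(sentence_ppls)
    if mean <= 0:
        return 0.0
    return statistics.pstdev(sentence_ppls) / mean


print(burstiness([10.0, 10.0, 10.0]))  # uniform sentences, no burstiness
print(burstiness([10.0, 30.0]))        # varied sentences, higher burstiness
```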
### Signal 3: Fast-DetectGPT (Bao et al., 2023)
Probability curvature analysis: AI text tends to sit near local maxima of the scoring model's log-probability, i.e. in regions of negative curvature.
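
The raw statistic can be sketched in its analytic form, assuming the scoring model also serves as the sampling model: the observed log-likelihood is compared with its expectation under the model's own next-token distributions and normalized by the standard deviation. The function name and list-based inputs are illustrative, and the mapping from this raw value to the 0-1 score used in the signal table is not shown here.

```python
import math


def conditional_curvature(token_ids, distributions):
    """Normalized gap between observed and expected log-likelihood.

    token_ids[t] is the observed token at position t; distributions[t] is the
    model's probability vector over the vocabulary at that position.
    """
    log_lik, expected, variance = 0.0, 0.0, 0.0
    for token, probs in zip(token_ids, distributions):
        log_lik += math.log(probs[token])
        # Mean and variance of log p(v) when v is drawn from the same distribution.
        mu = sum(p * math.log(p) for p in probs if p > 0)
        second = sum(p * math.log(p) ** 2 for p in probs if p > 0)
        expected += mu
        variance += second - mu * mu
    if variance <= 0:
        return 0.0  # degenerate case, e.g. all distributions uniform
    return (log_lik - expected) / math.sqrt(variance)


dist = [0.7, 0.2, 0.1]
print(conditional_curvature([0, 0, 0], [dist] * 3))  # likely tokens: positive
print(conditional_curvature([2, 2, 2], [dist] * 3))  # unlikely tokens: negative
```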
### Signal 4: Cross-Model Perplexity (Binoculars-style)
Perplexity under GPT-Neo 125M is compared with perplexity under GPT-2 Medium. When the two models disagree, the text is more likely human.
### Signal 5: Character Distribution (LD-Score, Narayanasamy 2026)
AI text approximates global character patterns; human text shows domain specialization.
### Signal 6: Stylometric Ensemble (Pangram-style)
6 sub-signals: sentence length σ, hapax legomena ratio, transition marker rate, passive voice rate, formulaic phrase rate, word length σ.
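
Three of these sub-signals can be computed with the standard library alone. The sketch below is illustrative: the regex tokenization, the population standard deviation, and the definition of the hapax ratio as the share of distinct words that occur exactly once are all assumptions. The rate-based sub-signals need phrase lists that are not given here, so they are omitted.

```python
import re
import statistics
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z']+")


def stylometric_features(text):
    """Sentence-length std, hapax legomena ratio, and word-length std."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = WORD_RE.findall(text.lower())
    counts = Counter(words)
    sentence_lengths = [len(WORD_RE.findall(s)) for s in sentences]
    word_lengths = [len(w) for w in words]
    hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "sentence_length_std": statistics.pstdev(sentence_lengths)
        if len(sentence_lengths) > 1 else 0.0,
        "hapax_ratio": hapax / len(counts) if counts else 0.0,
        "word_length_std": statistics.pstdev(word_lengths)
        if len(word_lengths) > 1 else 0.0,
    }


print(stylometric_features("The cat sat. The dog ran away quickly."))
```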
### Signal 7: Weighted Ensemble
Calibrated weights combining all signals with higher confidence on stylometric (1.5x) and Fast-DetectGPT (1.0x).
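
A weighted average with the decision thresholds from the signal table could look like the sketch below. Only the stylometric (1.5x) and Fast-DetectGPT (1.0x) weights come from the description above; the remaining weight is a placeholder, the inputs are assumed to be normalized to [0, 1] with higher meaning more AI-like, and the "Uncertain" label for scores between the two thresholds is an assumption.

```python
# Weights: the first two are stated above; "perplexity" is a placeholder value.
WEIGHTS = {
    "stylometric": 1.5,
    "fast_detectgpt": 1.0,
    "perplexity": 0.5,
}


def ensemble_score(signals, weights=WEIGHTS):
    """Weighted mean of per-signal AI-likelihood scores in [0, 1]."""
    total = sum(weights[name] for name in signals)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(score * weights[name] for name, score in signals.items()) / total


def classify(score):
    """Apply the table's thresholds: > 0.5 is AI, < 0.4 is Human."""
    if score > 0.5:
        return "AI"
    if score < 0.4:
        return "Human"
    return "Uncertain"


score = ensemble_score({"stylometric": 0.3, "fast_detectgpt": 0.0, "perplexity": 0.6})
print(round(score, 3), classify(score))
```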
---
## Results
### Baseline (untrained DiffusionGemma)
- **0/5 texts detected as AI** by weighted ensemble
- Mean ensemble score: **0.350** (threshold: < 0.4 = Human)
### Breaking Down Detection Signals
| Text Type | PPL | Burstiness | FDGPT | Stylometric | Ensemble |
|-----------|-----|-----------|-------|-------------|----------|
| Remote work blog | 16-23 | 0.58-0.96 | 0.000 | 0.29-0.35 | 0.30-0.38 |
| Quantum computing | 14-20 | 0.57-0.70 | 0.000 | 0.23-0.33 | 0.30-0.41 |
| Email declining job | 7-9 | 0.48-0.91 | 0.001 | 0.27-0.33 | 0.44-0.56 |
| French Revolution | 16-18 | 0.53-0.74 | 0.000 | 0.25-0.25 | 0.29-0.50 |
| Headphones review | 14-22 | 0.37-1.25 | 0.000 | 0.22-0.25 | 0.33-0.47 |
### Why DiffusionGemma Evades Detectors
1. **Different statistical pathway:** block-autoregressive diffusion produces token distributions unlike standard AR models
2. **Bidirectional attention:** considers full context when denoising, producing more natural text
3. **Iterative refinement:** entropy-bounded denoising naturally introduces variation
4. **No left-to-right bias:** avoids formulaic transition patterns common in AR text
---
## Research Background
This project synthesizes findings from 30+ papers (see `research/` folder):
- **Sadasivan et al. (2023):** Theoretical ceiling: perfect detectors are impossible as LLMs improve
- **Tarım & Onan (2025):** Diffusion text naturally resists AR-trained detectors
- **Cheng et al. (2025):** Adversarial Paraphrasing: 87.88% TPR reduction via detector-guided feedback
- **Ranganath & Ramesh (2026):** StealthRL: 99.9% attack success with multi-detector GRPO
- **Pedrotti et al. (2025):** DPO style-shifting: few-shot fine-tuning fools detectors
- **Narayanasamy et al. (2026):** LD-Score: character distribution separates human/AI text
- **Xu et al. (2026):** HIP pipeline: base models look human to detectors
Full literature review: `research/technical-diffusion-text-humanization-2026-06-29.md`
---
## Repository Structure
```
diffusiongemma-humanizer/
├── README.md                              # This file
├── research_report.md                     # Gemma + diffusion models + Modal costs
├── research_datasets_training.md          # Training data survey
├── commercial_ai_detectors_report.md      # Pangram, GPTZero, Originality.ai analysis
├── research/
│   ├── architecture-strategy.md           # Architecture decisions & cost breakdown
│   └── technical-diffusion-text-humanization-2026-06-29.md  # Full lit review (30+ papers)
├── modal_project/
│   ├── app.py                             # Complete 6-step training pipeline
│   ├── humanize_french.py                 # French text humanization (standalone)
│   └── upload_hf.py                       # HF upload utilities
├── scripts/
│   ├── run.py                             # Simple launcher
│   ├── launch.py                          # Launcher with UTF-8 logging
│   ├── run_pipeline.ps1                   # PowerShell launcher
│   └── run_pipeline.bat                   # Batch launcher
├── run.sh                                 # Bash launcher (primary)
├── run_french.py                          # French humanization launcher
├── lora/                                  # Fine-tuned LoRA weights
│   ├── lora_weights.pt                    # LoRA parameter state dict
│   └── lora_config.json                   # LoRA configuration
├── baseline_detector_results.json         # Pre-training evaluation
├── post_training_eval.json                # Post-training evaluation
└── experiment_log.json                    # Full experiment config & results
```
---
## License
Apache 2.0, matching the base model `google/diffusiongemma-26B-A4B-it`.
---
*Pipeline last run: 2026-06-30 | GPU: Modal A100 80GB | Framework: PyTorch 2.12 + Transformers 5.12*