	# DiffusionGemma Humanizer — SOTA Text Humanization

	Fine-tuning Google's DiffusionGemma 26B (MoE, 3.8B active, Apache 2.0) to humanize AI-generated text and evade multi-signal AI detectors.

	[![HF Repo](https://img.shields.io/badge/🤗_HF-simonlesaumon/diffusiongemma--humanizer-blue)](https://huggingface.co/simonlesaumon/diffusiongemma-humanizer)
	[![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
	[![GPU](https://img.shields.io/badge/GPU-A100_80GB-orange)]()

	---

	## Table of Contents

	1. [Key Findings](#key-findings)
	2. [Architecture](#architecture)
	3. [Installation](#installation)
	4. [Usage](#usage)
	5. [Training Pipeline](#training-pipeline)
	6. [Multi-Detector Scoring](#multi-detector-scoring)
	7. [Results](#results)
	8. [Research Background](#research-background)
	9. [Repository Structure](#repository-structure)
	10. [License](#license)

	---

	## Key Findings

	### 1. DiffusionGemma base model achieves ~0% AI detection

	On a seven-signal ensemble (GPT-2 perplexity, burstiness, Fast-DetectGPT, cross-model perplexity, character distribution, stylometrics, and their weighted combination), text generated by the DiffusionGemma 26B base model was classified as human in all 5 baseline samples. This is consistent with the hypothesis of Tarım & Onan (2025) that diffusion-generated text naturally resists detectors trained on autoregressive output.

	### 2. Manual LoRA bypasses PEFT incompatibility

	PEFT does not support `Gemma4ClippableLinear` (DiffusionGemma's custom linear wrapper). We implemented Manual LoRA injection via forward hooks that target the underlying `Linear4bit` modules, bypassing PEFT entirely.

	### 3. VRAM optimization strategy

	DiffusionGemma 26B in 4-bit uses 50.8 GB on A100 80GB. Training requires:
	- Last 2 layers only — injects LoRA into 30 modules (not 189 across all layers)
	- Gradient checkpointing — trades compute for memory, recomputing activations during backward
	- Loss only on masked positions — skips padding tokens for memory efficiency
	- bf16 LoRA params — halves activation memory vs float32

	### 4. Multi-detector ensemble scoring

	| Signal | Source | AI Pattern | Human Pattern |
	|--------|--------|-----------|---------------|
	| Perplexity (GPT-2) | GPTZero-style | < 18 (too predictable) | > 25 (natural variation) |
	| Burstiness | GPTZero-style | < 0.15 (uniform) | > 0.3 (varied) |
	| Fast-DetectGPT | Bao et al. (2023) | > 0.55 (negative curvature) | < 0.45 (positive curvature) |
	| Cross-model PPL (GPT-Neo) | Binoculars-style | < 15 (both models agree) | > 25 (models disagree) |
	| Character Distribution | LD-Score (Narayanasamy, 2026) | Global baseline | Domain-specialized |
	| Stylometric (6 sub-signals) | Pangram-style | Formulaic, passive-heavy | Natural, varied |
	| Weighted Ensemble | StealthRL-inspired | > 0.5 = AI | < 0.4 = Human |

	---

	## Architecture

	### DiffusionGemma 26B
	- Total params: 25.2B | Active: 3.8B (MoE: 8/128 experts + 1 shared)
	- Generation: Block-autoregressive discrete diffusion
	- Canvas: 256 tokens, bidirectional attention
	- Sampler: Entropy-Bounded Denoising (1-48 steps, temperature 0.8→0.4)

	### Manual LoRA Injection
	```
	Gemma4ClippableLinear
	└── linear: Linear4bit (torch.nn.Linear subclass)
	    ├── forward: W @ x (frozen, 4-bit, no grad)
	    └── LoRA hook: (x.detach() @ A @ B) * scale (trainable, bf16)
	        ├── A: (in_features, rank=8), kaiming init
	        └── B: (rank=8, out_features), zero init
	```
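
The hook-based injection above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project's actual code: the `HookedLoRA` class name is hypothetical, a plain `nn.Linear` stands in for the `Linear4bit` module, and the adapter runs in float32 on CPU rather than bf16.

```python
import math

import torch
import torch.nn as nn


class HookedLoRA(nn.Module):
    """Trainable low-rank update added to a frozen linear layer via a forward hook."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.scale = alpha / rank  # alpha/r = 2 with the README's settings
        # A: (in_features, rank) with Kaiming init; B: (rank, out_features) with zeros.
        self.A = nn.Parameter(torch.empty(base.in_features, rank))
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        # Freeze the wrapped layer so only A and B are trained.
        for param in base.parameters():
            param.requires_grad_(False)
        self.handle = base.register_forward_hook(self._add_delta)

    def _add_delta(self, module, inputs, output):
        x = inputs[0].detach()  # keep gradients from flowing into earlier layers
        return output + (x @ self.A @ self.B) * self.scale


torch.manual_seed(0)
base = nn.Linear(16, 32)  # stand-in for the Linear4bit inside Gemma4ClippableLinear
lora = HookedLoRA(base)
x = torch.randn(4, 16)
out = base(x)             # hook fires here; B is zero, so out equals the frozen output
out.sum().backward()      # the frozen base weights receive no gradient
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn the delta. Calling `lora.handle.remove()` detaches the adapter again.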

	### Training Loop
	```
	for each batch (prompt + target response):
	    1. Forward: prompt → encoder → KV cache
	                decoder: canvas → bidirectional attention → logits
	                (gradient checkpointing: activations NOT stored)
	    2. Mask 30-70% of target tokens randomly
	    3. Compute loss ONLY on masked positions (memory efficient)
	    4. Add entropy regularization (encourage human-like uncertainty)
	    5. Backward: recompute activations via checkpoint
	                 gradient only flows through LoRA params (detached hooks)
	    6. Update LoRA weights (AdamW, lr=2e-4)
	```
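
Steps 2 to 4 of the loop above can be illustrated with a small pure-Python sketch. The exact form of the entropy regularizer is not given here, so the squared deviation from the entropy target, the `entropy_weight` value, and the use of natural-log units are all assumptions, as are the function names.

```python
import math
import random


def sample_mask(length, rng, low=0.3, high=0.7):
    """Pick a mask ratio in [low, high] and mask that share of target positions."""
    ratio = rng.uniform(low, high)
    n_masked = max(1, round(ratio * length))
    masked = set(rng.sample(range(length), n_masked))
    return [i in masked for i in range(length)]


def masked_loss(probs, targets, mask, entropy_target=2.5, entropy_weight=0.01):
    """Cross-entropy on masked positions plus an entropy penalty (assumed form)."""
    nll, entropy, count = 0.0, 0.0, 0
    for dist, target, is_masked in zip(probs, targets, mask):
        if not is_masked:
            continue  # unmasked positions contribute nothing to the loss
        nll -= math.log(dist[target])
        entropy -= sum(p * math.log(p) for p in dist if p > 0)
        count += 1
    mean_nll, mean_entropy = nll / count, entropy / count
    # Assumed regularizer: pull the mean predictive entropy toward the target.
    return mean_nll + entropy_weight * (mean_entropy - entropy_target) ** 2


rng = random.Random(0)
mask = sample_mask(10, rng)
probs = [[0.25] * 4 for _ in range(10)]  # toy 4-token vocabulary, uniform predictions
targets = [0] * 10
loss = masked_loss(probs, targets, mask)
```

In the real pipeline the per-position distributions would come from the decoder's logits; the toy uniform distributions are only there to make the arithmetic easy to follow.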

	---

	## Installation

	### Prerequisites
	```bash
	pip install modal
	modal setup
	modal secret create hf-secrets HF_TOKEN=hf_your_token
	```

	### Clone & Deploy
	```bash
	git clone https://huggingface.co/simonlesaumon/diffusiongemma-humanizer
	cd diffusiongemma-humanizer
	bash run.sh
	```

	---

	## Usage

	### Basic: Humanize AI Text

	```python
	import torch
	from transformers import AutoTokenizer, BitsAndBytesConfig, DiffusionGemmaForBlockDiffusion

	MODEL_ID = "google/diffusiongemma-26B-A4B-it"

	# Load the base model in 4-bit NF4 with bf16 compute
	bnb = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_compute_dtype=torch.bfloat16,
	    bnb_4bit_use_double_quant=True,
	    bnb_4bit_quant_type="nf4",
	)
	model = DiffusionGemmaForBlockDiffusion.from_pretrained(
	    MODEL_ID, quantization_config=bnb, device_map="auto"
	)
	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

	# Load the fine-tuned LoRA weights with the manual LoRA loader
	# (PEFT cannot wrap Gemma4ClippableLinear; see lora/ folder for weights + config)

	# Build the chat prompt
	ai_text = "Your AI-generated text here..."
	messages = [
	    {"role": "system", "content": "Rewrite to sound human-written."},
	    {"role": "user", "content": ai_text},
	]
	inputs = tokenizer.apply_chat_template(
	    messages, tokenize=True, add_generation_prompt=True,
	    return_dict=True, return_tensors="pt",
	).to(model.device)

	# Seed the 256-token canvas with the AI text, then denoise it
	ai_tokens = tokenizer(
	    ai_text, max_length=256, truncation=True,
	    padding="max_length", return_tensors="pt",
	)
	output = model.generate(
	    **inputs,
	    decoder_input_ids=ai_tokens["input_ids"].to(model.device),
	    max_new_tokens=512, max_denoising_steps=24, t_max=0.8, t_min=0.4,
	)

	# Decode only the newly generated tokens
	humanized = tokenizer.decode(
	    output.sequences[0][inputs["input_ids"].shape[-1]:],
	    skip_special_tokens=True,
	)
	```

	---

	## Training Pipeline

	### 6-Step Process (runs on Modal A100 80GB)

	| Step | Description | Time |
	|------|-------------|------|
	| 1. Load Models | DiffusionGemma 4-bit + GPT-2 + GPT-Neo detectors | ~5 min |
	| 2. Baseline Evaluation | 7-signal detector ensemble on 5 prompts | ~30 sec |
	| 3. Build Dataset | 10K+ synthetic pairs annotated with detector scores | ~10 min |
	| 4. LoRA + Training | Manual LoRA (last 2 layers, 30 modules) + 5-20 epochs | ~10h |
	| 5. Post-Training Eval | Compare ensemble scores before/after | ~30 sec |
	| 6. Export to HF | LoRA weights (5 MB) + results + model card | ~10 sec |

	### Training Hyperparameters

	| Param | Value | Rationale |
	|-------|-------|-----------|
	| LoRA rank | 8 | Balance expressiveness vs memory |
	| LoRA alpha | 16 | Scaling factor alpha/r = 2 |
	| Learning rate | 2e-4 | Standard for LoRA fine-tuning |
	| Optimizer | AdamW (paged_adamw_8bit) | VRAM efficient |
	| Epochs | 5-20 | Dataset-size dependent |
	| Batch size | 1 | VRAM constraint |
	| Gradient accumulation | 16 | Effective batch = 16 |
	| Mask ratio | 30-70% random | Diffusion training objective |
	| Entropy target | 2.5 | Human-like token uncertainty |

	### Run the Pipeline
	```bash
	# Quick run (5 epochs, small dataset)
	bash run.sh

	# Full training (20 epochs, 10K+ dataset)
	# Set num_epochs=20 in modal_project/app.py, then:
	modal run modal_project/app.py --hf-token=hf_xxx
	```

	---

	## Multi-Detector Scoring

	The scoring system implements techniques from multiple papers:

	### Signal 1: GPT-2 Perplexity (GPTZero-style)
	Measures how "surprising" each token is to GPT-2 Medium. AI text tends to be more predictable (lower perplexity).
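
As a rough illustration, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below assumes the per-token natural-log probabilities have already been obtained from the scoring model; the function name is hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Exponential of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Four tokens that the model each assigned probability 0.05.
ppl = perplexity([math.log(0.05)] * 4)
```

A text whose tokens each receive probability 0.05 has a perplexity of about 20, which falls between the "AI" (< 18) and "human" (> 25) bands in the table under Key Findings.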

	### Signal 2: Burstiness (GPTZero-style)
	Coefficient of variation of per-sentence perplexity. Human text varies more in complexity.
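
A minimal sketch of this signal, assuming per-sentence perplexities are already available. Whether the pipeline uses the population or the sample standard deviation is not stated, so the population form here is an assumption, as is the function name.

```python
import statistics


def burstiness(sentence_ppls):
    """Coefficient of variation: standard deviation divided by the mean."""
    if len(sentence_ppls) < 2:
        return 0.0  # a single sentence has no variation to measure
    mean = statistics.fmean(sentence_ppls)
    return statistics.pstdev(sentence_ppls) / mean if mean > 0 else 0.0


uniform = burstiness([20.0, 20.0, 20.0])  # every sentence equally predictable
varied = burstiness([10.0, 30.0])         # perplexity swings between sentences
```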

	### Signal 3: Fast-DetectGPT (Bao et al., 2023)
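	Probability curvature analysis: AI text tends to sit in regions of negative curvature (near local maxima) of the scoring model's log-probability landscape, so alternative tokens sampled from the model score lower than the original text.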
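
The analytic form of the conditional probability curvature from Bao et al. can be sketched as follows, assuming the same model is used for sampling and scoring and that its full next-token distribution is available at every position. How the pipeline maps this raw statistic onto the 0 to 1 score used in the tables is not shown here, and the function name is hypothetical.

```python
import math


def conditional_curvature(dists, token_ids):
    """Observed log-prob minus its expectation under the model, in standard deviations."""
    observed = mean_sum = var_sum = 0.0
    for dist, tok in zip(dists, token_ids):
        logs = [math.log(p) for p in dist]  # requires strictly positive probabilities
        mu = sum(p * lp for p, lp in zip(dist, logs))
        second_moment = sum(p * lp * lp for p, lp in zip(dist, logs))
        observed += logs[tok]
        mean_sum += mu
        var_sum += second_moment - mu * mu
    return (observed - mean_sum) / math.sqrt(var_sum)


# Toy next-token distribution over a 3-token vocabulary, repeated at 5 positions.
dists = [[0.7, 0.2, 0.1]] * 5
likely = conditional_curvature(dists, [0] * 5)    # always the most probable token
unlikely = conditional_curvature(dists, [2] * 5)  # always the least probable token
```

Text that consistently picks high-probability tokens gets a positive statistic, while text full of unlikely tokens gets a negative one.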

	### Signal 4: Cross-Model Perplexity (Binoculars-style)
	Compares the perplexity of GPT-Neo 125M with that of GPT-2 Medium on the same text. When the two models disagree, the text is more likely human.

	### Signal 5: Character Distribution (LD-Score, Narayanasamy 2026)
	AI text approximates global character patterns; human text shows domain specialization.

	### Signal 6: Stylometric Ensemble (Pangram-style)
	6 sub-signals: sentence length σ, hapax legomena ratio, transition marker rate, passive voice rate, formulaic phrase rate, word length σ.
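
Three of the six sub-signals can be computed with the standard library alone. The sketch below is illustrative: the tokenization regex, the sentence splitter, and the choice to divide hapax count by the number of distinct words are assumptions rather than the pipeline's actual definitions.

```python
import re
import statistics
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z']+")


def stylometric_features(text):
    """Sentence-length spread, hapax legomena ratio, and word-length spread."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = WORD_RE.findall(text.lower())
    counts = Counter(words)
    sent_lengths = [len(WORD_RE.findall(s)) for s in sentences]
    hapax = sum(1 for c in counts.values() if c == 1)  # words used exactly once
    return {
        "sentence_length_std": statistics.pstdev(sent_lengths) if sent_lengths else 0.0,
        "hapax_ratio": hapax / len(counts) if counts else 0.0,
        "word_length_std": statistics.pstdev([len(w) for w in words]) if words else 0.0,
    }


features = stylometric_features("The cat sat. The dog ran far away today.")
```

For the sample text, the two sentences have 3 and 6 words, and every distinct word except "the" occurs once.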

	### Signal 7: Weighted Ensemble
	Calibrated weights combine all signals, with higher weights on the stylometric signal (1.5x) and Fast-DetectGPT (1.0x).
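
A weighted ensemble of this kind can be sketched as a normalized weighted mean. Only the 1.5x and 1.0x weights and the 0.5 / 0.4 thresholds come from this README; the 0.5 weights on the remaining signals are placeholders, and the sketch assumes each signal has already been mapped to a 0 to 1 "AI-likeness" score.

```python
WEIGHTS = {
    "stylometric": 1.5,        # stated in this README
    "fast_detectgpt": 1.0,     # stated in this README
    "perplexity": 0.5,         # placeholder weight, not from the source
    "burstiness": 0.5,         # placeholder weight, not from the source
    "cross_model_ppl": 0.5,    # placeholder weight, not from the source
    "char_distribution": 0.5,  # placeholder weight, not from the source
}


def ensemble_score(signals):
    """Weighted mean of per-signal scores, each assumed to lie in [0, 1]."""
    used = {name: value for name, value in signals.items() if name in WEIGHTS}
    total = sum(WEIGHTS[name] for name in used)
    if total == 0:
        raise ValueError("no known signals supplied")
    return sum(WEIGHTS[name] * value for name, value in used.items()) / total


def verdict(score):
    """Apply the thresholds from the scoring table: > 0.5 is AI, < 0.4 is Human."""
    if score > 0.5:
        return "AI"
    if score < 0.4:
        return "Human"
    return "Uncertain"


human_like = ensemble_score({name: 0.2 for name in WEIGHTS})
flagged = ensemble_score({**{name: 0.0 for name in WEIGHTS},
                          "stylometric": 1.0, "fast_detectgpt": 1.0})
```

Scores between 0.4 and 0.5 fall in neither band of the table, so the sketch reports them as "Uncertain" rather than forcing a label.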

	---

	## Results

	### Baseline (DiffusionGemma before fine-tuning)
	- 0/5 texts detected as AI by weighted ensemble
	- Mean ensemble score: 0.350 (threshold: < 0.4 = Human)

	### Breaking Down Detection Signals

	| Text Type | PPL | Burstiness | FDGPT | Stylometric | Ensemble |
	|-----------|-----|-----------|-------|-------------|----------|
	| Remote work blog | 16-23 | 0.58-0.96 | 0.000 | 0.29-0.35 | 0.30-0.38 |
	| Quantum computing | 14-20 | 0.57-0.70 | 0.000 | 0.23-0.33 | 0.30-0.41 |
	| Email declining job | 7-9 | 0.48-0.91 | 0.001 | 0.27-0.33 | 0.44-0.56 |
	| French Revolution | 16-18 | 0.53-0.74 | 0.000 | 0.25-0.25 | 0.29-0.50 |
	| Headphones review | 14-22 | 0.37-1.25 | 0.000 | 0.22-0.25 | 0.33-0.47 |

	### Why DiffusionGemma Evades Detectors
	1. Different statistical pathway — block-autoregressive diffusion produces token distributions unlike standard AR models
	2. Bidirectional attention — considers full context when denoising, producing more natural text
	3. Iterative refinement — entropy-bounded denoising naturally introduces variation
	4. No left-to-right bias — avoids formulaic transition patterns common in AR text

	---

	## Research Background

	This project synthesizes findings from 30+ papers (see `research/` folder):

	- Sadasivan et al. (2023): Theoretical ceiling — perfect detectors impossible as LLMs improve
	- Tarım & Onan (2025): Diffusion text naturally resists AR-trained detectors
	- Cheng et al. (2025): Adversarial Paraphrasing — 87.88% TPR reduction via detector-guided feedback
	- Ranganath & Ramesh (2026): StealthRL — 99.9% attack success with multi-detector GRPO
	- Pedrotti et al. (2025): DPO style-shifting — few-shot fine-tuning fools detectors
	- Narayanasamy et al. (2026): LD-Score — character distribution separates human/AI text
	- Xu et al. (2026): HIP pipeline — base models look human to detectors

	Full literature review: `research/technical-diffusion-text-humanization-2026-06-29.md`

	---

	## Repository Structure

	```
	diffusiongemma-humanizer/
	├── README.md                             # This file
	├── research_report.md                    # Gemma + diffusion models + Modal costs
	├── research_datasets_training.md         # Training data survey
	├── commercial_ai_detectors_report.md     # Pangram, GPTZero, Originality.ai analysis
	├── research/
	│   ├── architecture-strategy.md          # Architecture decisions & cost breakdown
	│   └── technical-diffusion-text-humanization-2026-06-29.md  # Full lit review (30+ papers)
	├── modal_project/
	│   ├── app.py                            # Complete 6-step training pipeline
	│   ├── humanize_french.py                # French text humanization (standalone)
	│   └── upload_hf.py                      # HF upload utilities
	├── scripts/
	│   ├── run.py                            # Simple launcher
	│   ├── launch.py                         # Launcher with UTF-8 logging
	│   ├── run_pipeline.ps1                  # PowerShell launcher
	│   └── run_pipeline.bat                  # Batch launcher
	├── run.sh                                # Bash launcher (primary)
	├── run_french.py                         # French humanization launcher
	├── lora/                                 # Fine-tuned LoRA weights
	│   ├── lora_weights.pt                   # LoRA parameter state dict
	│   └── lora_config.json                  # LoRA configuration
	├── baseline_detector_results.json        # Pre-training evaluation
	├── post_training_eval.json               # Post-training evaluation
	└── experiment_log.json                   # Full experiment config & results
	```

	---

	## License

	Apache 2.0 — matching the base model `google/diffusiongemma-26B-A4B-it`.

	---

	Pipeline last run: 2026-06-30 | GPU: Modal A100 80GB | Framework: PyTorch 2.12 + Transformers 5.12