---
license: mit
language:
- en
tags:
- text-classification
- ai-generated-text-detection
- machine-generated-text
- llm-detection
- pawn
pipeline_tag: text-classification
metrics:
- accuracy
- f1
- precision
- recall
- roc_auc
base_model:
- meta-llama/Llama-3.2-1B-Instruct
- meta-llama/Llama-3.2-1B
datasets:
- yaful/MAGE
model-index:
- name: PAWN++
  results:
  - task:
      type: text-classification
      name: Machine-Generated Text Detection
    dataset:
      name: MAGE
      type: yaful/MAGE
    metrics:
    - type: accuracy
      value: 0.9515
    - type: f1_macro
      value: 0.9515
    - type: roc_auc
      value: 0.9836
    - type: f1
      value: 0.9506
      name: AI F1
    - type: f1
      value: 0.9523
      name: Human F1
---

# PAWN++

[![GitHub](https://img.shields.io/badge/GitHub-automatic--goggles-181717?logo=github)](https://github.com/HSE-Team-142/automatic-goggles/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/HSE-Team-142/automatic-goggles/blob/main/LICENSE)
[![Task](https://img.shields.io/badge/task-AI%20text%20detection-blue)](#)
[![Accuracy](https://img.shields.io/badge/accuracy-95.15%25-success)](#evaluation)
[![ROC AUC](https://img.shields.io/badge/ROC%20AUC-0.984-success)](#evaluation)
[![Macro F1](https://img.shields.io/badge/macro%20F1-0.951-success)](#evaluation)

📦 Repository: [HSE-Team-142/automatic-goggles](https://github.com/HSE-Team-142/automatic-goggles/)

PAWN++ is a detector for identifying machine-generated (AI) text. It extends the
[PAWN](https://www.sciencedirect.com/science/article/pii/S156625352500538X) architecture, which
predicts authorship from the per-token hidden states and probability metrics of a frozen language
model. PAWN++ adds an optional second frozen language model, cross-model metrics (including a
Binoculars-style cross-perplexity score), second-model token metrics, hidden-state fusion, and
aggregated sequence-level features that modulate the representation through FiLM.

This card describes the best-performing configuration
(`mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full`, checkpoint `checkpoint-39884`).

## Model Description

PAWN++ does not fine-tune the backbone LLMs. The two language models are frozen and used only as
feature extractors; the only trained parameters are three lightweight MLP heads and a FiLM
conditioning layer.

- Model type: Frozen-LLM feature extractor + gated MLP classifier (binary)
- Task: Binary classification, `human` (0) vs. `ai` (1)
- Primary model (frozen): `meta-llama/Llama-3.2-1B-Instruct`
- Second model (frozen): `meta-llama/Llama-3.2-1B`
- Language: English
- Max sequence length: 512 tokens
- License: MIT

### Architecture

For each input, the feature extractor runs both frozen LLMs and produces:

1. Per-token metrics for each model (`entropy`, `max_log_probs`, `next_token_log_probs`,
   `rank`, `top_p`), plus the cross-perplexity (xppl) between the two models.
2. Hidden states from both models, fused across layers with `uniform` fusion.
3. Aggregated sequence-level features: per model, `energy`, `mean`, `std`, `var`, `skew`,
   `kurtosis`, `mean_diff`, `std_diff`, `var_2nd`, `entropy_2nd`, `autocorr_2nd`; across models,
   `cov`, `corr`, `cos_sim`, `binoculars_score`.
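The stdlib-only Python sketch below computes the per-token metrics and the Binoculars-style score on toy next-token distributions. It illustrates assumed definitions and is not the repository's implementation. Three definitions are assumptions:

- `rank` is the 1-based position of the observed token when the vocabulary is sorted by probability.
- `top_p` is the cumulative probability of the tokens ranked at or above the observed token.
- `binoculars_score` follows the Binoculars recipe: the primary model's log-perplexity divided by the cross-perplexity of the second model's distribution against the primary model's log-probabilities.

Which model plays which role in PAWN++, and the exact definitions in `automatic-goggles`, may differ.

```python
import math
from typing import Dict, List, Sequence


def token_metrics(probs: Sequence[float], observed: int) -> Dict[str, float]:
    """Per-token metrics for one position, given the model's next-token distribution.

    The definitions of `rank` and `top_p` here are illustrative assumptions.
    """
    log_probs = [math.log(p) for p in probs]
    entropy = -sum(p * lp for p, lp in zip(probs, log_probs))
    # Vocabulary ids sorted from most to least likely (ties keep id order).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    rank = order.index(observed) + 1
    top_p = sum(probs[i] for i in order[:rank])
    return {
        "entropy": entropy,
        "max_log_probs": max(log_probs),
        "next_token_log_probs": log_probs[observed],
        "rank": float(rank),
        "top_p": top_p,
    }


def binoculars_score(
    primary: List[List[float]], second: List[List[float]], tokens: List[int]
) -> float:
    """Primary-model log-perplexity divided by the second->primary cross-perplexity."""
    n = len(tokens)
    log_ppl = -sum(math.log(dist[t]) for dist, t in zip(primary, tokens)) / n
    log_xppl = -sum(
        sum(q * math.log(p) for q, p in zip(d_second, d_primary))
        for d_primary, d_second in zip(primary, second)
    ) / n
    return log_ppl / log_xppl


# Toy example: 4-token vocabulary, 3 positions, made-up probabilities.
primary = [[0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
second = [[0.6, 0.2, 0.1, 0.1], [0.5, 0.2, 0.2, 0.1], [0.3, 0.3, 0.2, 0.2]]
tokens = [0, 1, 3]
print(token_metrics(primary[1], tokens[1]))
print(binoculars_score(primary, second, tokens))
```

In the Binoculars paper, lower scores indicate machine-generated text. In PAWN++ the score is one input feature among many rather than a standalone decision rule.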

Three MLP heads and a FiLM layer process these signals:

- `metrics_nn` maps the per-token metric vector to a 256-dim feature space.
- `gate_nn` takes the concatenated current/next hidden states of both models plus a positional
  scalar and produces 256 gate logits per token; a softmax over the sequence axis yields an
  attention-style weighting that aggregates the token metric features into a single vector.
- The aggregated vector is modulated by a FiLM layer (`gamma`, `beta`) conditioned on the
  normalized sequence-level aggregate features.
- `aggregate_nn` maps the result to a single logit.

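The output is a single logit. Because `ai` is the positive class (label 1) of the binary
cross-entropy objective, `sigmoid(logit)` is the probability of the `ai` class. The prediction is
`ai` when `logit >= 0` (that probability is at least 0.5), and `prob_human = 1 - sigmoid(logit)`.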
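The pooling, FiLM and decision steps described above can be sketched in plain Python with toy sizes. This is a minimal illustration under several assumptions, not the repository's code:

- The trained MLPs are replaced by fixed linear maps with made-up weights.
- Each gate logit is assumed to weight the matching metric-feature dimension (the card lists 256 gates and 256 metric features).
- FiLM is written as `gamma * x + beta`, with `gamma` and `beta` predicted from the aggregate features.
- The logit is assumed to score the `ai` class (label 1).
- Residual connections and dropout are omitted.

```python
import math
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def gated_pool(features: Matrix, gate_logits: Matrix) -> List[float]:
    """Softmax each gate over the sequence axis and pool the matching feature."""
    pooled = []
    for j in range(len(features[0])):
        column = [row[j] for row in gate_logits]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(g - peak) for g in column]
        total = sum(exps)
        pooled.append(sum(e / total * row[j] for e, row in zip(exps, features)))
    return pooled


def linear(weights: Matrix, bias: Sequence[float], x: Sequence[float]) -> List[float]:
    """Dense layer y = W x + b, standing in for a trained MLP."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(x: Sequence[float], gamma: Sequence[float], beta: Sequence[float]) -> List[float]:
    """Feature-wise linear modulation: gamma * x + beta, element by element."""
    return [g * xi + b for g, xi, b in zip(gamma, x, beta)]


def decide(logit: float) -> Dict[str, object]:
    """Decision rule, assuming the logit scores the `ai` class."""
    if logit >= 0:
        p_ai = 1.0 / (1.0 + math.exp(-logit))
    else:  # stable form for negative logits
        e = math.exp(logit)
        p_ai = e / (1.0 + e)
    is_ai = logit >= 0
    return {"label": "ai" if is_ai else "human", "prediction": int(is_ai),
            "prob_human": 1.0 - p_ai, "logit": logit}


# Toy sizes: 3 tokens, 2 feature dims (the card uses 256), 2 aggregate features.
features = [[0.5, -1.0], [1.5, 0.0], [-0.5, 2.0]]  # stands in for metrics_nn output
gate_logits = [[0.0, 2.0], [1.0, 0.0], [0.0, 0.0]]  # stands in for gate_nn output
aggregates = [0.3, -1.2]                            # normalized aggregate features

pooled = gated_pool(features, gate_logits)
gamma = linear([[0.1, 0.0], [0.0, 0.1]], [1.0, 1.0], aggregates)
beta = linear([[0.2, 0.0], [0.0, 0.2]], [0.0, 0.0], aggregates)
logit = linear([[0.8, -0.4]], [0.1], film(pooled, gamma, beta))[0]
print(decide(logit))
```

The softmax runs over the sequence axis separately for each gate. Each pooled dimension is therefore a convex combination of that feature's per-token values, and it stays within their range regardless of text length.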

| Hyperparameter | Value |
|---|---|
| `metric_features` | 256 |
| `gates` | 256 |
| `mlp_hidden_features` | 256 |
| `mlp_hidden_layers` | 3 |
| `mlp_dropout` | 0.0 |
| `token_dropout` | 0.15 |
| `residual` | true |
| `hidden_state_fusion` | uniform |

## Intended Use

- Primary use: Research on machine-generated-text detection and AI-text classification of English
  passages.
- Out of scope: High-stakes decisions (academic misconduct, hiring, moderation) without human
  review; non-English text; short texts; and detecting generators or domains far from the training
  distribution. As with all detectors, predictions should be treated as a signal, not proof.

## Training Data

Trained and evaluated on MAGE, a machine-generated text detection benchmark that spans multiple
domains and many generator models. Here MAGE is framed as a binary human-vs-AI task.

## Training Procedure

- Backbones frozen; only the MLP heads and FiLM layer are trained.
- Objective: Binary cross-entropy with `label_smoothing = 0.2` and `pos_weight = 0.413`.
- Optimizer: AdamW, `learning_rate = 1e-3`, `weight_decay = 1e-2`, `max_grad_norm = 1.0`.
- Schedule: up to 5 epochs (`max_steps = 49855`), batch size 32, early stopping (patience 5),
  seed 42.
- Model selection: best checkpoint by validation AUROC (`checkpoint-39884`, epoch 4, validation
  AUROC ≈ 0.9933).
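The objective and the step counts can be cross-checked with a short sketch. The loss follows the formula documented for PyTorch's `BCEWithLogitsLoss`, with `pos_weight` scaling the positive (`ai`, label 1) term. Smoothing the binary target as `y * (1 - eps) + eps / 2` is an assumption about how the repository applies `label_smoothing`. The step arithmetic assumes a single device and no gradient accumulation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def smoothed_weighted_bce(logit: float, target: int,
                          eps: float = 0.2, pos_weight: float = 0.413) -> float:
    """Weighted BCE-with-logits on a label-smoothed binary target (ai = 1)."""
    y = target * (1.0 - eps) + eps / 2  # 1 -> 0.9, 0 -> 0.1 with eps = 0.2
    log_p = log_sigmoid(logit)          # log P(ai)
    log_not_p = log_p - logit           # log(1 - sigmoid(x)) = log sigmoid(x) - x
    return -(pos_weight * y * log_p + (1.0 - y) * log_not_p)


# Step bookkeeping from the training configuration.
max_steps, epochs, batch_size = 49855, 5, 32
steps_per_epoch = max_steps // epochs
print(steps_per_epoch)               # 9971
print(4 * steps_per_epoch)           # 39884, i.e. checkpoint-39884 closes epoch 4
print(steps_per_epoch * batch_size)  # 319072, an upper bound on examples per epoch
print(smoothed_weighted_bce(0.0, 1))
```

Because `pos_weight` is below 1, errors on `ai` examples weigh less than errors on `human` examples. This plausibly contributes to the conservative behaviour reported in the evaluation.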

## Evaluation

Results on the MAGE test set:

| Metric | Value |
|---|---|
| Accuracy | 0.9515 |
| Macro F1 | 0.9515 |
| ROC AUC | 0.9836 |
| AI precision | 0.9710 |
| AI recall | 0.9311 |
| AI F1 | 0.9506 |
| Human precision | 0.9334 |
| Human recall | 0.9720 |
| Human F1 | 0.9523 |
| Test loss | 0.2456 |
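The per-class F1 scores and the macro F1 in the table follow from the reported precision and recall. The short check below recomputes them. F1 is the harmonic mean of precision and recall, and macro F1 is the unweighted mean of the two class F1 scores.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


ai_f1 = f1(precision=0.9710, recall=0.9311)
human_f1 = f1(precision=0.9334, recall=0.9720)
macro_f1 = (ai_f1 + human_f1) / 2
print(f"AI F1 {ai_f1:.4f}, Human F1 {human_f1:.4f}, Macro F1 {macro_f1:.4f}")
# AI F1 0.9506, Human F1 0.9523, Macro F1 0.9515
```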

Runtime (test split): 1619.5 s, 37.5 samples/s, 1.173 steps/s.
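The runtime figures imply the approximate size of the evaluation run. The arithmetic below uses only the three reported rates, which are rounded, so the results are approximate.

```python
runtime_s, samples_per_s, steps_per_s = 1619.5, 37.5, 1.173

n_samples = runtime_s * samples_per_s  # about 60,731 test examples
n_steps = runtime_s * steps_per_s      # about 1,900 evaluation steps
print(round(n_samples), round(n_steps), round(n_samples / n_steps, 1))
# 60731 1900 32.0
```

About 32 samples per step corresponds to an evaluation batch size matching the training batch size of 32.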

> The model's precision is higher on AI text (0.9710) than on human text (0.9334), so it raises few
> false AI flags. Its recall is higher on human text (0.9720) than on AI text (0.9311). In other words,
> it is conservative about labeling text as AI-generated.

## How to Use

Inference is provided through `inference.py`. It loads the frozen backbones plus the trained heads
from a checkpoint and a training YAML config:

```bash
uv run PAWN++/inference.py \
  --config PAWN++/experiments/MAGE/configs/pawn/two_models/mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full.yaml \
  --checkpoint PAWN++/checkpoint-39884/pytorch_model.bin \
  --text "Your text to classify here."
```

```python
import sys

sys.path.insert(0, "PAWN++")  # make PAWN++/inference.py importable from the repository root
from inference import load_model, predict

model, device = load_model(
    config_path="PAWN++/experiments/MAGE/configs/pawn/two_models/"
    "mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full.yaml",
    checkpoint_path="PAWN++/checkpoint-39884/pytorch_model.bin",
)
results = predict(model, ["Your text to classify here."], device)
# Each result: {"label": "human" | "ai", "prediction": 0 | 1, "prob_human": float, "logit": float}
```

> Note: The Llama-3.2 backbones are gated on the Hugging Face Hub. Request access to both
> repositories and set `HF_TOKEN` in a `.env` file to download them. A GPU is recommended; the code
> falls back to MPS or CPU automatically.
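By default, a text is labeled `ai` whenever `logit >= 0`. For uses where a false AI flag is costly, one option is to post-process the documented result dictionaries with a stricter cutoff on `prob_human`. The helper below is hypothetical and is not part of `inference.py`. The 0.2 cutoff is only a placeholder that would need to be chosen on held-out data, and the records are made up.

```python
from typing import Dict, List


def flag_ai(results: List[Dict], max_prob_human: float = 0.2) -> List[bool]:
    """Flag a text as AI-generated only if prob_human is below a stricter cutoff."""
    return [r["prob_human"] < max_prob_human for r in results]


# Made-up records in the format documented for predict().
results = [
    {"label": "ai", "prediction": 1, "prob_human": 0.05, "logit": 2.9},
    {"label": "ai", "prediction": 1, "prob_human": 0.40, "logit": 0.4},
    {"label": "human", "prediction": 0, "prob_human": 0.90, "logit": -2.2},
]
print(flag_ai(results))  # [True, False, False]
```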

## Limitations and Bias

- English-only; performance on other languages has not been evaluated and is expected to degrade.
- Detection quality depends on the generators and domains seen during training (MAGE). Novel models,
  prompting styles, paraphrasing, or adversarial edits can reduce accuracy.
- Depends on two frozen Llama-3.2-1B backbones, which carry their own data biases.
- Reported metrics reflect the MAGE test distribution and may not transfer out of distribution; see
  the OOD evaluation utilities in the repository.

## Citation

PAWN++ builds on the PAWN (Perplexity Attention Weighted Network) detector:

> PAWN: Perplexity Attention Weighted Network for AI-generated text detection.
> https://www.sciencedirect.com/science/article/pii/S156625352500538X