PAWN++

📦 Repository: HSE-Team-142/automatic-goggles

PAWN++ is a detector for identifying machine-generated (AI) text. It extends the PAWN architecture, which predicts authorship from the per-token hidden states and probability metrics of a frozen language model. PAWN++ adds an optional second frozen language model, cross-model metrics (including a Binoculars-style cross-perplexity score), second-model token metrics, hidden-state fusion, and aggregated sequence-level features that modulate the representation through FiLM.

This card describes the best-performing configuration (mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full, checkpoint checkpoint-39884).

Model Description

PAWN++ does not fine-tune the backbone LLMs. The two language models are frozen and used only as feature extractors; the only trained parameters are three lightweight MLP heads and a FiLM conditioning layer.

Model type: Frozen-LLM feature extractor + gated MLP classifier (binary)
Task: Binary classification — human (0) vs. ai (1)
Primary model (frozen): meta-llama/Llama-3.2-1B-Instruct
Second model (frozen): meta-llama/Llama-3.2-1B
Language: English
Max sequence length: 512 tokens
License: MIT

Architecture

For each input the feature extractor runs both frozen LLMs and produces:

Per-token metrics for each model — entropy, max_log_probs, next_token_log_probs, rank, top_p — plus the cross-perplexity (xppl) between the two models.
Hidden states from both models, fused across layers with uniform fusion.
Aggregated sequence-level features — per model: energy, mean, std, var, skew, kurtosis, mean_diff, std_diff, var_2nd, entropy_2nd, autocorr_2nd; and cross-model: cov, corr, cos_sim, binoculars_score.
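
As a rough illustration of the signals listed above, the sketch below derives entropy, max_log_probs, next_token_log_probs and rank from one position's logits, along with the cross-entropy term that a cross-perplexity averages over the sequence and a Binoculars-style ratio built from it. It is plain Python written for this card, not the repository's feature extractor: all function names are hypothetical, top_p is left out because its exact definition is not given here, and which model plays which role in the cross term is an assumption.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_metrics(logits, next_token_id):
    """Metrics for one position, given the model's logits and the observed next token."""
    logp = log_softmax(logits)
    entropy = -sum(math.exp(lp) * lp for lp in logp)
    next_lp = logp[next_token_id]
    # Rank 1 means the observed token was the model's top choice.
    rank = 1 + sum(1 for lp in logp if lp > next_lp)
    return {
        "entropy": entropy,
        "max_log_probs": max(logp),
        "next_token_log_probs": next_lp,
        "rank": rank,
    }


def cross_entropy(logits_p, logits_q):
    """H(p, q) = -sum_v p(v) * log q(v) for two models' next-token distributions."""
    logp, logq = log_softmax(logits_p), log_softmax(logits_q)
    return -sum(math.exp(lp) * lq for lp, lq in zip(logp, logq))


def binoculars_style_score(logits_a, logits_b, token_ids):
    """Mean negative log-likelihood under model A divided by mean H(A, B).

    Assumption: model A supplies both the perplexity term and the weighting
    distribution of the cross term; the repository may assign roles differently.
    """
    nll = [-log_softmax(la)[t] for la, t in zip(logits_a, token_ids)]
    xent = [cross_entropy(la, lb) for la, lb in zip(logits_a, logits_b)]
    return (sum(nll) / len(nll)) / (sum(xent) / len(xent))
```

When both models produce identical distributions, the cross-entropy reduces to the ordinary entropy, so the ratio compares how surprising the observed tokens are with how surprising the model expects its own samples to be.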

Three MLP heads process these signals:

metrics_nn maps the per-token metric vector to a 256-dim feature space.
gate_nn takes the concatenated current/next hidden states of both models plus a positional scalar and produces 256 gate logits per token; a softmax over the sequence axis yields an attention-style weighting that aggregates the token metric features into a single vector.
The aggregated vector is modulated by a FiLM layer (gamma, beta) conditioned on the normalized sequence-level aggregate features.
aggregate_nn maps the result to a single logit.

The output is a single logit; sigmoid(logit) is the probability of the ai class (label 1), and the prediction is ai when logit >= 0, i.e. when that probability is at least 0.5.
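
The gating, FiLM and decision steps can be sketched in a few lines. This is a minimal plain-Python illustration with toy dimensions, not the repository's implementation: it assumes the gate logits weight the metric features channel by channel (which would fit gates = metric_features = 256), and gated_pool, film and classify are hypothetical names.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_pool(features, gate_logits):
    """Pool [seq_len][channels] features into one [channels] vector.

    For each channel, a softmax over the sequence axis turns the gate logits
    into token weights, which then average that channel's features.
    """
    seq_len, channels = len(features), len(features[0])
    pooled = []
    for k in range(channels):
        weights = softmax([gate_logits[t][k] for t in range(seq_len)])
        pooled.append(sum(w * features[t][k] for t, w in enumerate(weights)))
    return pooled


def film(x, gamma, beta):
    """FiLM modulation: element-wise scale and shift, gamma * x + beta."""
    return [g * v + b for v, g, b in zip(x, gamma, beta)]


def classify(logit):
    """Decision rule from the text: predict ai when logit >= 0."""
    return "ai" if logit >= 0 else "human"


# Toy example: 2 tokens, 2 channels. All-zero gate logits give a plain mean.
features = [[1.0, 0.0], [3.0, 4.0]]
gate_logits = [[0.0, 0.0], [0.0, 0.0]]
pooled = gated_pool(features, gate_logits)                     # [2.0, 2.0]
modulated = film(pooled, gamma=[2.0, 1.0], beta=[0.5, -1.0])   # [4.5, 1.0]
```

In the model described here, gamma and beta would come from the normalized sequence-level aggregate features, and aggregate_nn would map the modulated vector to the logit; both are replaced by fixed toy values in this sketch.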

Hyperparameter	Value
`metric_features`	256
`gates`	256
`mlp_hidden_features`	256
`mlp_hidden_layers`	3
`mlp_dropout`	0.0
`token_dropout`	0.15
`residual`	true
`hidden_state_fusion`	uniform

Intended Use

Primary use: Research on machine-generated-text detection and AI-text classification of English passages.
Out of scope: High-stakes decisions (academic misconduct, hiring, moderation) without human review; non-English text; short texts; and detecting generators or domains far from the training distribution. As with all detectors, predictions should be treated as a signal, not proof.

Training Data

Trained and evaluated on the MAGE benchmark for machine-generated text detection, which spans multiple domains and many generator models, framed as a binary human-vs-AI task.

Training Procedure

Backbones frozen; only the MLP heads and FiLM layer are trained.
Objective: Binary cross-entropy with label_smoothing = 0.2 and pos_weight = 0.413.
Optimizer: AdamW, learning_rate = 1e-3, weight_decay = 1e-2, max_grad_norm = 1.0.
Schedule: up to 5 epochs (max_steps = 49855), batch size 32, early stopping (patience 5), seed 42.
Model selection: best checkpoint by validation AUROC (checkpoint-39884, epoch 4, validation AUROC ≈ 0.9933).
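
For reference, one common way to combine label smoothing with a positive-class weight in binary cross-entropy is sketched below. How the repository combines the two is an assumption here: the sketch smooths hard targets toward 0.5 and applies pos_weight to the positive term in the style of PyTorch's BCEWithLogitsLoss, and smoothed_weighted_bce is a hypothetical name.

```python
import math


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def smoothed_weighted_bce(logit, label, label_smoothing=0.2, pos_weight=0.413):
    """Per-example loss; label is 0 (human) or 1 (ai)."""
    # Smooth the hard target toward 0.5: 1 -> 0.9 and 0 -> 0.1 for smoothing 0.2.
    target = label * (1.0 - label_smoothing) + 0.5 * label_smoothing
    log_p = -softplus(-logit)     # log sigmoid(logit)
    log_not_p = -softplus(logit)  # log (1 - sigmoid(logit))
    return -(pos_weight * target * log_p + (1.0 - target) * log_not_p)
```

Under this formulation, a pos_weight below 1 reduces the contribution of ai examples to the loss relative to human examples.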

Evaluation

Results on the MAGE test set:

Metric	Value
Accuracy	0.9515
Macro F1	0.9515
ROC AUC	0.9836
AI — Precision	0.9710
AI — Recall	0.9311
AI — F1	0.9506
Human — Precision	0.9334
Human — Recall	0.9720
Human — F1	0.9523
Test loss	0.2456

Runtime (test split): 1619.5 s, 37.5 samples/s, 1.173 steps/s.

The model is slightly more precise on AI text (fewer false AI flags) and has higher recall on human text, i.e. it is conservative about labeling text as AI-generated.
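
The reported F1 values can be re-derived from the precision and recall figures in the table, which gives a quick consistency check on the numbers above. A minimal sketch in plain Python; f1 is a helper defined here, not a repository function.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


ai_f1 = f1(0.9710, 0.9311)
human_f1 = f1(0.9334, 0.9720)
macro_f1 = (ai_f1 + human_f1) / 2

print(round(ai_f1, 4), round(human_f1, 4), round(macro_f1, 4))
```

The three values round to 0.9506, 0.9523 and 0.9515, matching the AI F1, Human F1 and Macro F1 rows.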

How to Use

Inference is provided through inference.py, which loads the frozen backbones plus the trained heads from a checkpoint and a training YAML config. It can be run from the command line or called from Python:

uv run PAWN++/inference.py \
  --config PAWN++/experiments/MAGE/configs/pawn/two_models/mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full.yaml \
  --checkpoint PAWN++/checkpoint-39884/pytorch_model.bin \
  --text "Your text to classify here."

from inference import load_model, predict

model, device = load_model(
    config_path="PAWN++/experiments/MAGE/configs/pawn/two_models/"
                "mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full.yaml",
    checkpoint_path="PAWN++/checkpoint-39884/pytorch_model.bin",
)
results = predict(model, ["Your text to classify here."], device)
# each result: {"label": "human"|"ai", "prediction": 0|1, "prob_human": float, "logit": float}

Note: The Llama-3.2 backbones are gated on the Hugging Face Hub. Set HF_TOKEN in a .env file to download them. A GPU is recommended; the code falls back to MPS or CPU automatically.
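
Because predictions are meant as a signal rather than proof, it can help to separate confident calls from borderline ones before acting on them. The sketch below works only on result dictionaries in the shape described above (it reads the label and logit keys); split_by_margin and the margin value are hypothetical choices made for illustration and are not part of inference.py.

```python
def split_by_margin(results, margin=2.0):
    """Split results into confident and borderline lists by |logit|.

    margin is an illustrative threshold, not a calibrated value.
    """
    confident, borderline = [], []
    for result in results:
        bucket = confident if abs(result["logit"]) >= margin else borderline
        bucket.append(result)
    return confident, borderline


# Hand-written dictionaries in the documented shape (not real model output).
example = [
    {"label": "ai", "prediction": 1, "logit": 3.1},
    {"label": "human", "prediction": 0, "logit": -0.4},
]
confident, borderline = split_by_margin(example)
```

Borderline items are natural candidates for human review; a suitable margin would need to be chosen on held-out data.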

Limitations and Bias

English-only; performance on other languages has not been evaluated and is expected to degrade.
Detection quality depends on the generators and domains seen during training (MAGE); novel models, prompting styles, paraphrasing or adversarial edits can reduce accuracy.
Depends on two frozen Llama-3.2-1B backbones, which carry their own data biases.
Reported metrics reflect the MAGE test distribution and may not transfer out of distribution; see the OOD evaluation utilities in the repository.

Citation

PAWN++ builds on the PAWN detector:

PAWN: Perplexity Attention Weighted Networks (machine-generated text detection). https://www.sciencedirect.com/science/article/pii/S156625352500538X
