PAWN++


📦 Repository: HSE-Team-142/automatic-goggles

PAWN++ is a detector for identifying machine-generated (AI) text. It extends the PAWN architecture, which predicts authorship from the per-token hidden states and probability metrics of a frozen language model. PAWN++ adds an optional second frozen language model, cross-model metrics (including a Binoculars-style cross-perplexity score), second-model token metrics, hidden-state fusion, and aggregated sequence-level features that modulate the representation through FiLM.

This card describes the best-performing configuration, mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full (checkpoint-39884).

Model Description

PAWN++ does not fine-tune the backbone LLMs. The two language models are frozen and used only as feature extractors; the only trained parameters are three lightweight MLP heads and a FiLM conditioning layer.

  • Model type: Frozen-LLM feature extractor + gated MLP classifier (binary)
  • Task: Binary classification, human (0) vs. ai (1)
  • Primary model (frozen): meta-llama/Llama-3.2-1B-Instruct
  • Second model (frozen): meta-llama/Llama-3.2-1B
  • Language: English
  • Max sequence length: 512 tokens
  • License: MIT

Architecture

For each input the feature extractor runs both frozen LLMs and produces:

  1. Per-token metrics for each model (entropy, max_log_probs, next_token_log_probs, rank, top_p), plus the cross-perplexity (xppl) between the two models.
  2. Hidden states from both models, fused across layers with uniform fusion.
  3. Aggregated sequence-level features, per model (energy, mean, std, var, skew, kurtosis, mean_diff, std_diff, var_2nd, entropy_2nd, autocorr_2nd) and cross-model (cov, corr, cos_sim, binoculars_score).
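The per-token and sequence-level features can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the repository's feature extractor: each model is assumed to yield a (T, V) array of next-token logits; only four of the five per-token metrics are implemented (top_p is omitted because its exact definition is not given here); the Binoculars-style score is taken as the ratio of model B's mean log-perplexity to the mean cross-perplexity of A's distribution against B's log-probabilities, and which Llama plays A or B in PAWN++ is an assumption; the aggregate definitions (for example, energy as the mean square) are guesses, and var_2nd, entropy_2nd and autocorr_2nd are left out.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def token_metrics(logits: np.ndarray, next_ids: np.ndarray) -> dict:
    """Per-token metrics for one model.

    logits:   (T, V) scores; row t is the distribution over the token at t + 1.
    next_ids: (T,) ids of the tokens that actually follow.
    """
    logp = log_softmax(logits)
    observed = logp[np.arange(len(next_ids)), next_ids]
    return {
        "entropy": -(np.exp(logp) * logp).sum(axis=-1),
        "max_log_probs": logp.max(axis=-1),
        "next_token_log_probs": observed,
        # 1-based rank: 1 + number of vocabulary entries strictly more likely than the observed token
        "rank": 1 + (logp > observed[:, None]).sum(axis=-1),
    }


def cross_perplexity(logits_a: np.ndarray, logits_b: np.ndarray) -> np.ndarray:
    """Per-token cross-entropy -sum_v p_A(v) log p_B(v), i.e. log-scale cross-perplexity."""
    return -(np.exp(log_softmax(logits_a)) * log_softmax(logits_b)).sum(axis=-1)


def binoculars_score(logits_a, logits_b, next_ids) -> float:
    """Mean log-perplexity of the text under B divided by the mean cross-perplexity (assumed roles)."""
    log_ppl_b = -token_metrics(logits_b, next_ids)["next_token_log_probs"].mean()
    return float(log_ppl_b / cross_perplexity(logits_a, logits_b).mean())


def aggregate(series_a: np.ndarray, series_b: np.ndarray) -> dict:
    """Some sequence-level statistics of two per-token series (definitions assumed)."""
    feats = {}
    for name, s in (("a", series_a), ("b", series_b)):
        mu, sd = s.mean(), s.std()
        z = (s - mu) / sd if sd > 0 else np.zeros_like(s)
        d = np.diff(s)  # first differences along the sequence
        feats.update({
            f"{name}_energy": float(np.mean(s ** 2)),
            f"{name}_mean": float(mu),
            f"{name}_std": float(sd),
            f"{name}_var": float(s.var()),
            f"{name}_skew": float(np.mean(z ** 3)),
            f"{name}_kurtosis": float(np.mean(z ** 4) - 3.0),  # excess kurtosis
            f"{name}_mean_diff": float(d.mean()),
            f"{name}_std_diff": float(d.std()),
        })
    feats["cov"] = float(np.cov(series_a, series_b)[0, 1])
    feats["corr"] = float(np.corrcoef(series_a, series_b)[0, 1])
    feats["cos_sim"] = float(series_a @ series_b / (np.linalg.norm(series_a) * np.linalg.norm(series_b)))
    return feats


# Toy example: random logits stand in for the two frozen models' outputs.
rng = np.random.default_rng(0)
T, V = 16, 50
logits_a, logits_b = rng.normal(size=(T, V)), rng.normal(size=(T, V))
next_ids = rng.integers(0, V, size=T)
m_a, m_b = token_metrics(logits_a, next_ids), token_metrics(logits_b, next_ids)
print(binoculars_score(logits_a, logits_b, next_ids))
print(aggregate(m_a["next_token_log_probs"], m_b["next_token_log_probs"])["corr"])
```

When A and B are the same model, the cross-perplexity reduces to the per-token entropy, which is a handy sanity check for any real implementation.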

Three MLP heads and a FiLM layer process these signals:

  • metrics_nn maps the per-token metric vector to a 256-dim feature space.
  • gate_nn takes the concatenated current/next hidden states of both models plus a positional scalar and produces 256 gate logits per token; a softmax over the sequence axis yields an attention-style weighting that aggregates the token metric features into a single vector.
  • The aggregated vector is modulated by a FiLM layer (gamma, beta) conditioned on the normalized sequence-level aggregate features.
  • aggregate_nn maps the result to a single logit.
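A forward pass through these heads can be sketched in NumPy. The following are assumptions, not taken from the repository: random weights stand in for trained ones; each MLP has a single hidden layer (the card lists three hidden layers of width 256); the positional scalar is the normalized token position; the "next" hidden state is the one at position t + 1; padding masks and the residual connection are omitted; and the toy sizes (d_h = 32, 16 aggregate features, an 11-dimensional metric vector) are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_mlp(sizes):
    """Random (W, b) pairs for consecutive layer sizes; stand-ins for trained weights."""
    return [(rng.normal(scale=1 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(x, layers):
    """Affine layers with ReLU between them and no activation after the last."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x


def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def forward(metrics, hs_a, hs_b, agg_feats, params):
    """Map one sequence to one logit.

    metrics:    (T, d_m) per-token metric vectors
    hs_a, hs_b: (T + 1, d_h) fused hidden states of the two models
    agg_feats:  (d_agg,) normalized sequence-level aggregate features
    """
    T = metrics.shape[0]
    feats = mlp(metrics, params["metrics_nn"])                    # (T, k) token features
    pos = np.linspace(0.0, 1.0, T)[:, None]                       # (T, 1) positional scalar
    gate_in = np.concatenate([hs_a[:-1], hs_a[1:], hs_b[:-1], hs_b[1:], pos], axis=1)
    gates = softmax(mlp(gate_in, params["gate_nn"]), axis=0)      # each gate sums to 1 over tokens
    pooled = (gates * feats).sum(axis=0)                          # (k,) gate-weighted token average
    gamma, beta = np.split(agg_feats @ params["film_w"] + params["film_b"], 2)
    modulated = gamma * pooled + beta                             # FiLM conditioning
    return float(mlp(modulated, params["aggregate_nn"])[0])


# k matches the card's 256 gates/metric features; d_m = 11 assumes 5 metrics per model plus xppl.
T, d_m, d_h, d_agg, k = 12, 11, 32, 16, 256
params = {
    "metrics_nn": init_mlp([d_m, k, k]),
    "gate_nn": init_mlp([4 * d_h + 1, k, k]),
    "film_w": rng.normal(scale=0.01, size=(d_agg, 2 * k)),
    "film_b": np.concatenate([np.ones(k), np.zeros(k)]),          # start near gamma = 1, beta = 0
    "aggregate_nn": init_mlp([k, k, 1]),
}
logit = forward(rng.normal(size=(T, d_m)), rng.normal(size=(T + 1, d_h)),
                rng.normal(size=(T + 1, d_h)), rng.normal(size=d_agg), params)
print(logit)
```

The softmax runs over tokens rather than over gates, so each of the 256 gate channels picks its own weighting of the sequence; FiLM then rescales and shifts the pooled vector feature-wise according to the sequence-level statistics.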

The output is a single logit. With ai encoded as the positive class (label 1), sigmoid(logit) is the estimated probability of ai, so prob_human = 1 - sigmoid(logit), and the prediction is ai when logit >= 0.

Hyperparameter         Value
metric_features        256
gates                  256
mlp_hidden_features    256
mlp_hidden_layers      3
mlp_dropout            0.0
token_dropout          0.15
residual               true
hidden_state_fusion    uniform

Intended Use

  • Primary use: Research on machine-generated-text detection and AI-text classification of English passages.
  • Out of scope: High-stakes decisions (academic misconduct, hiring, moderation) without human review; non-English text; short texts; and detecting generators or domains far from the training distribution. As with all detectors, predictions should be treated as a signal, not proof.

Training Data

Trained and evaluated on the MAGE benchmark for machine-generated text detection, which spans multiple domains and many generator models, framed as a binary human-vs-AI task.

Training Procedure

  • Backbones frozen; only the MLP heads and FiLM layer are trained.
  • Objective: Binary cross-entropy with label_smoothing = 0.2 and pos_weight = 0.413.
  • Optimizer: AdamW, learning_rate = 1e-3, weight_decay = 1e-2, max_grad_norm = 1.0.
  • Schedule: up to 5 epochs (max_steps = 49855), batch size 32, early stopping (patience 5), seed 42.
  • Model selection: best checkpoint by validation AUROC (checkpoint-39884, epoch 4, validation AUROC ≈ 0.9933).
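The objective can be written as a weighted binary cross-entropy on logits. The sketch below follows the formula documented for PyTorch's BCEWithLogitsLoss, where pos_weight scales the positive (ai) term; the label-smoothing form, which pulls targets toward 0.5 (1 becomes 0.9 and 0 becomes 0.1 at 0.2), is an assumption about how label_smoothing is applied in the training code.

```python
import numpy as np


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)


def smoothed_bce_with_logits(logits, labels, label_smoothing=0.2, pos_weight=0.413):
    """Mean pos_weight-scaled binary cross-entropy on logits (label 1 = ai).

    Targets are smoothed toward 0.5 (an assumed form): 1 -> 0.9 and 0 -> 0.1 at 0.2.
    """
    y = labels * (1.0 - label_smoothing) + 0.5 * label_smoothing
    # log(1 - sigmoid(x)) == log(sigmoid(-x))
    loss = -(pos_weight * y * log_sigmoid(logits) + (1.0 - y) * log_sigmoid(-logits))
    return float(loss.mean())


logits = np.array([3.0, -3.0, 0.5])
labels = np.array([1.0, 0.0, 0.0])
print(smoothed_bce_with_logits(logits, labels))
```

With pos_weight = 0.413 < 1, loss on ai-labeled examples is down-weighted relative to human-labeled ones. For reference, max_steps = 49855 over 5 epochs is 9971 steps per epoch, and 4 × 9971 = 39884, which matches the selected checkpoint at epoch 4.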

Evaluation

Results on the MAGE test set:

Metric             Value
Accuracy           0.9515
Macro F1           0.9515
ROC AUC            0.9836
AI precision       0.9710
AI recall          0.9311
AI F1              0.9506
Human precision    0.9334
Human recall       0.9720
Human F1           0.9523
Test loss          0.2456
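The F1 rows follow from the precision and recall rows: each per-class F1 is the harmonic mean of that class's precision and recall, and Macro F1 is the unweighted mean of the two per-class scores. This quick check uses only the table's values and rounds to 0.9506, 0.9523 and 0.9515.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the results table.
per_class = {"ai": (0.9710, 0.9311), "human": (0.9334, 0.9720)}
f1s = {name: f1(p, r) for name, (p, r) in per_class.items()}
macro_f1 = sum(f1s.values()) / len(f1s)  # unweighted mean over the two classes
print({name: round(v, 4) for name, v in f1s.items()}, round(macro_f1, 4))
```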

Runtime (test split): 1619.5 s, 37.5 samples/s, 1.173 steps/s.
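The three throughput figures are mutually consistent. Multiplying them out is only a back-of-the-envelope check, not a logged value: it gives roughly 60,700 test samples over roughly 1,900 steps, or about 32 samples per step, in line with a batch size of 32.

```python
runtime_s = 1619.5       # reported test runtime in seconds
samples_per_s = 37.5
steps_per_s = 1.173

approx_samples = runtime_s * samples_per_s      # about 60,700 test samples
approx_steps = runtime_s * steps_per_s          # about 1,900 evaluation steps
samples_per_step = samples_per_s / steps_per_s  # about 32 samples per batch
print(round(approx_samples), round(approx_steps), round(samples_per_step, 1))
```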

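The model is more precise on AI text than on human text (0.971 vs. 0.933) and has higher recall on human text than on AI text (0.972 vs. 0.931): only about 2.8% of human texts are flagged as AI, while about 6.9% of AI-generated texts are labeled human. In other words, it is conservative about labeling text as AI-generated.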

How to Use

Inference is provided through inference.py, which loads the frozen backbones plus the trained heads from a checkpoint and a training YAML config:

```bash
uv run PAWN++/inference.py \
  --config PAWN++/experiments/MAGE/configs/pawn/two_models/mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full.yaml \
  --checkpoint PAWN++/checkpoint-39884/pytorch_model.bin \
  --text "Your text to classify here."
```

The same functionality is available from Python:

```python
from inference import load_model, predict

# Load the frozen backbones and trained heads once, then reuse them for many texts.
model, device = load_model(
    config_path="PAWN++/experiments/MAGE/configs/pawn/two_models/"
                "mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full.yaml",
    checkpoint_path="PAWN++/checkpoint-39884/pytorch_model.bin",
)
results = predict(model, ["Your text to classify here."], device)
# each result: {"label": "human"|"ai", "prediction": 0|1, "prob_human": float, "logit": float}
for r in results:
    print(r["label"], r["prob_human"])
```
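Because predict returns plain dictionaries, downstream handling needs no extra API. The helper below is hypothetical (summarize and its margin parameter are not part of inference.py): it counts the predicted labels and lists the indices whose prob_human falls within margin of 0.5, which are natural candidates for human review.

```python
from collections import Counter


def summarize(results, margin=0.1):
    """Count predicted labels and list indices of borderline predictions.

    `results` is the list returned by predict(); each item has the keys
    "label", "prediction", "prob_human" and "logit". A result is borderline
    when prob_human lies within `margin` of 0.5 (a hypothetical threshold).
    """
    counts = Counter(r["label"] for r in results)
    borderline = [i for i, r in enumerate(results) if abs(r["prob_human"] - 0.5) < margin]
    return dict(counts), borderline
```

For example, call counts, borderline = summarize(results) right after predict.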

Note: The Llama-3.2 backbones are gated on the Hugging Face Hub. Set HF_TOKEN in a .env file to download them. A GPU is recommended; the code falls back to MPS or CPU automatically.
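The fallback order in the note can be sketched with standard PyTorch availability checks. This mirrors the stated order (GPU, then MPS, then CPU) and is not the repository's code; hf_token_present is a hypothetical helper that only checks whether HF_TOKEN reached the environment, and reading the .env file itself (for example with python-dotenv's load_dotenv) is assumed to happen elsewhere.

```python
import os


def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    import torch  # imported lazily so choose_device stays usable without PyTorch
    return choose_device(torch.cuda.is_available(), torch.backends.mps.is_available())


def hf_token_present() -> bool:
    """True if HF_TOKEN is set in the environment (e.g. after loading the .env file)."""
    return bool(os.environ.get("HF_TOKEN"))
```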

Limitations and Bias

  • English-only; performance on other languages is not evaluated and expected to degrade.
  • Detection quality depends on the generators and domains seen during training (MAGE); novel models, prompting styles, paraphrasing or adversarial edits can reduce accuracy.
  • Depends on two frozen Llama-3.2-1B backbones, which carry their own data biases.
  • Reported metrics reflect the MAGE test distribution and may not transfer out of distribution; see the OOD evaluation utilities in the repository.

Citation

PAWN++ builds on the PAWN detector:

PAWN: Perplexity-Aware Watermark-free News (machine-generated text detection). https://www.sciencedirect.com/science/article/pii/S156625352500538X
