---
license: mit
language:
- en
tags:
- text-classification
- ai-generated-text-detection
- machine-generated-text
- llm-detection
- pawn
pipeline_tag: text-classification
metrics:
- accuracy
- f1
- precision
- recall
- roc_auc
base_model:
- meta-llama/Llama-3.2-1B-Instruct
- meta-llama/Llama-3.2-1B
datasets:
- yaful/MAGE
model-index:
- name: PAWN++
  results:
  - task:
      type: text-classification
      name: Machine-Generated Text Detection
    dataset:
      name: MAGE
      type: yaful/MAGE
    metrics:
    - type: accuracy
      value: 0.9515
    - type: f1_macro
      value: 0.9515
    - type: roc_auc
      value: 0.9836
    - type: f1
      value: 0.9506
      name: AI F1
    - type: f1
      value: 0.9523
      name: Human F1
---

# PAWN++

[![GitHub](https://img.shields.io/badge/GitHub-automatic--goggles-181717?logo=github)](https://github.com/HSE-Team-142/automatic-goggles/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/HSE-Team-142/automatic-goggles/blob/main/LICENSE)
[![Task](https://img.shields.io/badge/task-AI%20text%20detection-blue)](#)
[![Accuracy](https://img.shields.io/badge/accuracy-95.15%25-success)](#evaluation)
[![ROC AUC](https://img.shields.io/badge/ROC%20AUC-0.984-success)](#evaluation)
[![Macro F1](https://img.shields.io/badge/macro%20F1-0.951-success)](#evaluation)

📦 **Repository:** [HSE-Team-142/automatic-goggles](https://github.com/HSE-Team-142/automatic-goggles/)

**PAWN++** is a detector for machine-generated (AI) text. It extends the
[PAWN](https://www.sciencedirect.com/science/article/pii/S156625352500538X) architecture, which
predicts authorship from the per-token hidden states and probability metrics of a *frozen* language
model. PAWN++ adds an optional second frozen language model, cross-model metrics (including a
Binoculars-style cross-perplexity score), second-model token metrics, hidden-state fusion, and
aggregated sequence-level features that modulate the representation through FiLM.

This card describes the best-performing configuration
(`mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full`, checkpoint `checkpoint-39884`).

## Model Description

PAWN++ does **not** fine-tune the backbone LLMs. The two language models are frozen and used only as
feature extractors; the only trained parameters are three lightweight MLP heads and a FiLM
conditioning layer.

- **Model type:** Frozen-LLM feature extractor + gated MLP classifier (binary)
- **Task:** Binary classification — `human` (0) vs. `ai` (1)
- **Primary model (frozen):** `meta-llama/Llama-3.2-1B-Instruct`
- **Second model (frozen):** `meta-llama/Llama-3.2-1B`
- **Language:** English
- **Max sequence length:** 512 tokens
- **License:** MIT

### Architecture

For each input the feature extractor runs both frozen LLMs and produces:

1. **Per-token metrics** for each model — `entropy`, `max_log_probs`, `next_token_log_probs`,
   `rank`, `top_p` — plus the **cross-perplexity (xppl)** between the two models.
2. **Hidden states** from both models, fused across layers with `uniform` fusion.
3. **Aggregated sequence-level features** — per model: `energy`, `mean`, `std`, `var`, `skew`,
   `kurtosis`, `mean_diff`, `std_diff`, `var_2nd`, `entropy_2nd`, `autocorr_2nd`; and **cross-model**:
   `cov`, `corr`, `cos_sim`, `binoculars_score`.
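
The per-token metrics in step 1 can be illustrated on a single next-token distribution. The following is a minimal sketch, not the repository's feature extractor: the function names `token_metrics` and `cross_entropy` are hypothetical, the definitions are the common textbook ones, and `top_p` is left out because this card does not define it. Averaging the per-token cross-entropy over a sequence gives a log cross-perplexity in the Binoculars style; which model plays which role is an assumption here.

```python
import math

def token_metrics(probs, next_token_id):
    """Metrics for one position, given a model's next-token probabilities."""
    log_probs = [math.log(p) if p > 0 else float("-inf") for p in probs]
    # Shannon entropy of the predicted distribution (in nats).
    entropy = -sum(p * lp for p, lp in zip(probs, log_probs) if p > 0)
    next_lp = log_probs[next_token_id]
    # Rank 1 means the observed token was the model's top choice.
    rank = 1 + sum(1 for lp in log_probs if lp > next_lp)
    return {
        "entropy": entropy,
        "max_log_probs": max(log_probs),
        "next_token_log_probs": next_lp,
        "rank": rank,
    }

def cross_entropy(p_first, p_second):
    """Per-token cross-entropy between two models' distributions.

    Assumes p_second is strictly positive, as softmax outputs are.
    """
    return -sum(p * math.log(q) for p, q in zip(p_first, p_second) if p > 0)

# Toy 3-token vocabulary; the observed next token has id 1.
metrics = token_metrics([0.5, 0.25, 0.25], next_token_id=1)
xent = cross_entropy([0.5, 0.25, 0.25], [1 / 3, 1 / 3, 1 / 3])
```

In the real pipeline these values would be computed from the logits of each frozen LLM at every position, giving one metric vector per token.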

Three MLP heads and a FiLM layer process these signals, in the following order:

- **`metrics_nn`** maps the per-token metric vector to a 256-dim feature space.
- **`gate_nn`** takes the concatenated current/next hidden states of both models plus a positional
  scalar and produces 256 gate logits per token; a softmax over the sequence axis yields an
  attention-style weighting that aggregates the token metric features into a single vector.
- The aggregated vector is modulated by a **FiLM** layer (`gamma`, `beta`) conditioned on the
  normalized sequence-level aggregate features.
- **`aggregate_nn`** maps the result to a single logit.
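
The gating and FiLM steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the trained module: it assumes one gate channel per feature channel (both are 256 in this configuration), and it uses the standard FiLM form `gamma * x + beta`. The function names `gated_pool` and `film` are hypothetical, and the repository may parameterize either step differently.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_pool(features, gate_logits):
    """Pool [seq_len][dim] token features into one [dim] vector.

    For each channel, a softmax over the sequence axis turns the gate
    logits into weights, which then average that channel's features.
    """
    seq_len, dim = len(features), len(features[0])
    pooled = []
    for d in range(dim):
        weights = softmax([gate_logits[t][d] for t in range(seq_len)])
        pooled.append(sum(w * features[t][d] for t, w in enumerate(weights)))
    return pooled

def film(x, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel."""
    return [g * xi + b for xi, g, b in zip(x, gamma, beta)]

# Two tokens, two channels. Channel 0 is weighted evenly; channel 1 is
# dominated by the second token because its gate logit is much larger.
features = [[1.0, 10.0], [3.0, 20.0]]
gate_logits = [[0.0, 0.0], [0.0, 100.0]]
pooled = gated_pool(features, gate_logits)
modulated = film(pooled, gamma=[2.0, 0.5], beta=[1.0, 0.0])
```

In PAWN++ the `gamma` and `beta` vectors would be produced from the normalized sequence-level aggregate features rather than supplied by hand.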

The output is a single logit; `sigmoid(logit)` is the probability of the `ai` class (label 1), and the
prediction is `ai` when `logit >= 0`, i.e. when that probability is at least 0.5.

| Hyperparameter | Value |
|---|---|
| `metric_features` | 256 |
| `gates` | 256 |
| `mlp_hidden_features` | 256 |
| `mlp_hidden_layers` | 3 |
| `mlp_dropout` | 0.0 |
| `token_dropout` | 0.15 |
| `residual` | true |
| `hidden_state_fusion` | uniform |

## Intended Use

- **Primary use:** Research on machine-generated-text detection and AI-text classification of English
  passages.
- **Out of scope:** High-stakes decisions (academic misconduct, hiring, moderation) without human
  review; non-English text; short texts; and detecting generators or domains far from the training
  distribution. As with all detectors, predictions should be treated as a signal, not proof.

## Training Data

Trained and evaluated on the **MAGE** benchmark for machine-generated text detection, which spans
multiple domains and many generator models, framed as a binary human-vs-AI task.

## Training Procedure

- Backbones frozen; only the MLP heads and FiLM layer are trained.
- **Objective:** Binary cross-entropy with `label_smoothing = 0.2` and `pos_weight = 0.413`.
- **Optimizer:** AdamW, `learning_rate = 1e-3`, `weight_decay = 1e-2`, `max_grad_norm = 1.0`.
- **Schedule:** up to 5 epochs (`max_steps = 49855`), batch size 32, early stopping (patience 5),
  seed 42.
- **Model selection:** best checkpoint by validation AUROC (`checkpoint-39884`, epoch 4, validation
  AUROC ≈ **0.9933**).
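
The objective can be written out for a single example. This is a sketch under assumptions: the smoothing rule (pulling the 0/1 target toward 0.5) is one common binary convention, and the weighting follows the `pos_weight` semantics of PyTorch's `BCEWithLogitsLoss`. The function name `smoothed_weighted_bce` is hypothetical, and the training code may combine the two settings differently.

```python
import math

def smoothed_weighted_bce(logit, label, label_smoothing=0.2, pos_weight=0.413):
    """Binary cross-entropy on a logit with label smoothing and pos_weight."""
    # Assumed smoothing rule: move the hard 0/1 target toward 0.5.
    target = label * (1.0 - label_smoothing) + 0.5 * label_smoothing
    # Numerically stable log(sigmoid(logit)).
    if logit >= 0:
        log_p = -math.log1p(math.exp(-logit))
    else:
        log_p = logit - math.log1p(math.exp(logit))
    # log(1 - sigmoid(logit)) equals log(sigmoid(logit)) - logit.
    log_not_p = log_p - logit
    # pos_weight scales only the positive (label 1) term.
    return -(pos_weight * target * log_p + (1.0 - target) * log_not_p)

loss_ai = smoothed_weighted_bce(logit=0.0, label=1)
loss_human = smoothed_weighted_bce(logit=0.0, label=0)
```

With `pos_weight` below 1, the label-1 (`ai`) term contributes less to the loss than the label-0 (`human`) term for the same prediction error.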

## Evaluation

Results on the MAGE test set:

| Metric | Value |
|---|---|
| Accuracy | 0.9515 |
| Macro F1 | 0.9515 |
| ROC AUC | 0.9836 |
| AI — Precision | 0.9710 |
| AI — Recall | 0.9311 |
| AI — F1 | 0.9506 |
| Human — Precision | 0.9334 |
| Human — Recall | 0.9720 |
| Human — F1 | 0.9523 |
| Test loss | 0.2456 |

**Runtime (test split):** 1619.5 s, 37.5 samples/s, 1.173 steps/s.

> Precision is higher on AI text (0.9710 vs. 0.9334, so fewer false AI flags) and recall is higher on
> human text (0.9720 vs. 0.9311), i.e. the model is conservative about labeling text as AI-generated.
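
As a consistency check, the per-class and macro F1 values can be recomputed from the precision and recall in the table. The inputs below are the four-decimal figures reported above, so the check is only accurate to that rounding.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

ai_f1 = f1(0.9710, 0.9311)
human_f1 = f1(0.9334, 0.9720)
macro_f1 = (ai_f1 + human_f1) / 2  # unweighted mean over the two classes

print(f"AI F1    = {ai_f1:.4f}")     # AI F1    = 0.9506
print(f"Human F1 = {human_f1:.4f}")  # Human F1 = 0.9523
print(f"Macro F1 = {macro_f1:.4f}")  # Macro F1 = 0.9515
```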

## How to Use

Inference is provided through `inference.py`, which loads the frozen backbones plus the trained heads
from a checkpoint and a training YAML config:

```bash
uv run PAWN++/inference.py \
  --config PAWN++/experiments/MAGE/configs/pawn/two_models/mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full.yaml \
  --checkpoint PAWN++/checkpoint-39884/pytorch_model.bin \
  --text "Your text to classify here."
```

```python
from inference import load_model, predict

model, device = load_model(
    config_path="PAWN++/experiments/MAGE/configs/pawn/two_models/"
                "mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full.yaml",
    checkpoint_path="PAWN++/checkpoint-39884/pytorch_model.bin",
)
results = predict(model, ["Your text to classify here."], device)
# each result: {"label": "human"|"ai", "prediction": 0|1, "prob_human": float, "logit": float}
```
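
Because each result carries its raw `logit`, a stricter decision rule can be applied after prediction. The helper below is a hypothetical sketch, not part of `inference.py`: `flag_ai` and the example values are made up for illustration, and `prob_human` is omitted from the example dicts for brevity.

```python
def flag_ai(results, min_logit=0.0):
    """Keep results whose logit is at least min_logit.

    min_logit=0.0 reproduces the default rule (predict `ai` when
    logit >= 0); larger values flag fewer texts as AI-generated.
    """
    return [r for r in results if r["logit"] >= min_logit]

# Made-up outputs shaped like the dicts returned by predict().
example_results = [
    {"label": "ai", "prediction": 1, "logit": 2.3},
    {"label": "ai", "prediction": 1, "logit": 0.2},
    {"label": "human", "prediction": 0, "logit": -1.7},
]

default_flags = flag_ai(example_results)
strict_flags = flag_ai(example_results, min_logit=1.0)
```

Raising the threshold trades recall on AI text for fewer false AI flags; a suitable value would need to be chosen on validation data.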

> **Note:** The Llama-3.2 backbones are gated on the Hugging Face Hub. Set `HF_TOKEN` in a `.env` file
> to download them. A GPU is recommended; the code falls back to MPS or CPU automatically.

## Limitations and Bias

- English-only; performance on other languages has not been evaluated and is expected to degrade.
- Detection quality depends on the generators and domains seen during training (MAGE); novel models,
  prompting styles, paraphrasing or adversarial edits can reduce accuracy.
- Depends on two frozen Llama-3.2-1B backbones, which carry their own data biases.
- Reported metrics reflect the MAGE test distribution and may not transfer out of distribution; see
  the OOD evaluation utilities in the repository.

## Citation

PAWN++ builds on the PAWN detector:

> PAWN: Perplexity Attention Weighted Networks for AI generated text detection.
> https://www.sciencedirect.com/science/article/pii/S156625352500538X