| --- |
| license: mit |
| language: |
| - en |
| tags: |
| - text-classification |
| - ai-generated-text-detection |
| - machine-generated-text |
| - llm-detection |
| - pawn |
| pipeline_tag: text-classification |
| metrics: |
| - accuracy |
| - f1 |
| - precision |
| - recall |
| - roc_auc |
| base_model: |
| - meta-llama/Llama-3.2-1B-Instruct |
| - meta-llama/Llama-3.2-1B |
| datasets: |
| - yaful/MAGE |
| model-index: |
| - name: PAWN++ |
|   results: |
|   - task: |
|       type: text-classification |
|       name: Machine-Generated Text Detection |
|     dataset: |
|       name: MAGE |
|       type: yaful/MAGE |
|     metrics: |
|     - type: accuracy |
|       value: 0.9515 |
|     - type: f1_macro |
|       value: 0.9515 |
|     - type: roc_auc |
|       value: 0.9836 |
|     - type: f1 |
|       value: 0.9506 |
|       name: AI F1 |
|     - type: f1 |
|       value: 0.9523 |
|       name: Human F1 |
| --- |
| |
| # PAWN++ |
|
|
|
|
| 📦 **Repository:** [HSE-Team-142/automatic-goggles](https://github.com/HSE-Team-142/automatic-goggles/) |
|
|
| **PAWN++** is a detector for identifying machine-generated (AI) text. It extends the |
| [PAWN](https://www.sciencedirect.com/science/article/pii/S156625352500538X) architecture, which |
| predicts authorship from the per-token hidden states and probability metrics of a *frozen* language |
| model. PAWN++ adds an optional second frozen language model, cross-model metrics (including a |
| Binoculars-style cross-perplexity score), second-model token metrics, hidden-state fusion, and |
| aggregated sequence-level features that modulate the representation through FiLM. |
|
|
| This card describes the best-performing configuration |
| (`mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full`, checkpoint `checkpoint-39884`). |
|
|
| ## Model Description |
|
|
| PAWN++ does **not** fine-tune the backbone LLMs. The two language models are frozen and used only as |
| feature extractors; the only trained parameters are three lightweight MLP heads and a FiLM |
| conditioning layer. |
|
|
| - **Model type:** Frozen-LLM feature extractor + gated MLP classifier (binary) |
| - **Task:** Binary classification: `human` (0) vs. `ai` (1) |
| - **Primary model (frozen):** `meta-llama/Llama-3.2-1B-Instruct` |
| - **Second model (frozen):** `meta-llama/Llama-3.2-1B` |
| - **Language:** English |
| - **Max sequence length:** 512 tokens |
| - **License:** MIT |
|
|
| ### Architecture |
|
|
| For each input the feature extractor runs both frozen LLMs and produces: |
|
|
| 1. **Per-token metrics** for each model (`entropy`, `max_log_probs`, `next_token_log_probs`, |
| `rank`, `top_p`), plus the **cross-perplexity (xppl)** between the two models. |
| 2. **Hidden states** from both models, fused across layers with `uniform` fusion. |
| 3. **Aggregated sequence-level features**, per model: `energy`, `mean`, `std`, `var`, `skew`, |
| `kurtosis`, `mean_diff`, `std_diff`, `var_2nd`, `entropy_2nd`, `autocorr_2nd`; and **cross-model**: |
| `cov`, `corr`, `cos_sim`, `binoculars_score`. |
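
To make the per-token metrics concrete, here is a minimal pure-Python sketch of how such quantities can be derived from one position's vocabulary logits. It is an illustration under assumptions, not the repository's extractor: the function names are hypothetical, the `top_p` definition (probability mass of tokens at least as likely as the observed one) is a guess at the intended meaning, and the direction of the cross-entropy used for cross-perplexity may differ in the actual code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_metrics(logits, next_token_id):
    """Metrics for one position, given its logits and the id of the observed next token."""
    log_probs = log_softmax(logits)
    probs = [math.exp(lp) for lp in log_probs]
    target_lp = log_probs[next_token_id]
    return {
        "entropy": -sum(p * lp for p, lp in zip(probs, log_probs)),
        "max_log_probs": max(log_probs),
        "next_token_log_probs": target_lp,
        # Rank 1 means the observed token was the model's top choice.
        "rank": 1 + sum(1 for lp in log_probs if lp > target_lp),
        # Assumed definition: mass of all tokens at least as likely as the observed one.
        "top_p": sum(p for p, lp in zip(probs, log_probs) if lp >= target_lp),
    }

def cross_entropy(logits_a, logits_b):
    """Cross-entropy H(p_a, p_b) between two models' next-token distributions."""
    lp_a, lp_b = log_softmax(logits_a), log_softmax(logits_b)
    return -sum(math.exp(a) * b for a, b in zip(lp_a, lp_b))

# Toy 3-token vocabulary; the observed next token has id 1.
metrics = token_metrics([2.0, 1.0, 0.0], next_token_id=1)
xppl_token = cross_entropy([2.0, 1.0, 0.0], [1.5, 1.5, 0.0])
```

Averaging the per-token cross-entropy over a sequence gives a cross-perplexity term in log space, which is the kind of quantity a Binoculars-style score compares against a single model's log-perplexity.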
|
|
| Three MLP heads process these signals: |
|
|
| - **`metrics_nn`** maps the per-token metric vector to a 256-dim feature space. |
| - **`gate_nn`** takes the concatenated current/next hidden states of both models plus a positional |
| scalar and produces 256 gate logits per token; a softmax over the sequence axis yields an |
| attention-style weighting that aggregates the token metric features into a single vector. |
| - The aggregated vector is modulated by a **FiLM** layer (`gamma`, `beta`) conditioned on the |
| normalized sequence-level aggregate features. |
| - **`aggregate_nn`** maps the result to a single logit. |
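
The gated pooling and FiLM steps above can be sketched in a few lines of plain Python. This is a toy illustration rather than the trained model: it assumes that gate `d` weights metric feature `d` (plausible since both are 256-dimensional here), that FiLM takes the common form `gamma * x + beta`, and that `gamma` and `beta` are passed in directly instead of being produced from the aggregate features. The names `gated_pool` and `film` are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def gated_pool(features, gate_logits):
    """Pool [seq_len][dim] features with a softmax over the sequence axis, per dimension."""
    seq_len, dim = len(features), len(features[0])
    pooled = []
    for d in range(dim):
        weights = softmax([gate_logits[t][d] for t in range(seq_len)])
        pooled.append(sum(weights[t] * features[t][d] for t in range(seq_len)))
    return pooled

def film(vector, gamma, beta):
    """Feature-wise linear modulation: scale and shift each dimension."""
    return [g * v + b for v, g, b in zip(vector, gamma, beta)]

# Three tokens, two feature dimensions; equal gate logits give a plain mean.
features = [[1.0, 0.0], [2.0, 0.0], [3.0, 6.0]]
uniform_gates = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
pooled = gated_pool(features, uniform_gates)            # close to [2.0, 2.0]
modulated = film(pooled, gamma=[1.5, 1.0], beta=[0.0, -1.0])
```

Raising one token's gate logit shifts the pooled vector toward that token's features, which is how the gate lets hidden states decide which tokens' metrics matter.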
| |
| The output is a single logit; `sigmoid(logit)` is the probability of the `human` class, so the |
| prediction is `human` when `logit >= 0` (probability at least 0.5) and `ai` otherwise. |
| |
| | Hyperparameter | Value | |
| |---|---| |
| | `metric_features` | 256 | |
| | `gates` | 256 | |
| | `mlp_hidden_features` | 256 | |
| | `mlp_hidden_layers` | 3 | |
| | `mlp_dropout` | 0.0 | |
| | `token_dropout` | 0.15 | |
| | `residual` | true | |
| | `hidden_state_fusion` | uniform | |
| |
| ## Intended Use |
| |
| - **Primary use:** Research on machine-generated-text detection and AI-text classification of English |
| passages. |
| - **Out of scope:** High-stakes decisions (academic misconduct, hiring, moderation) without human |
| review; non-English text; short texts; and text from generators or domains far from the training |
| distribution. As with all detectors, predictions should be treated as a signal, not proof. |
|
|
| ## Training Data |
|
|
| Trained and evaluated on the **MAGE** benchmark for machine-generated text detection, which spans |
| multiple domains and many generator models, framed as a binary human-vs-AI task. |
|
|
| ## Training Procedure |
|
|
| - Backbones frozen; only the MLP heads and FiLM layer are trained. |
| - **Objective:** Binary cross-entropy with `label_smoothing = 0.2` and `pos_weight = 0.413`. |
| - **Optimizer:** AdamW, `learning_rate = 1e-3`, `weight_decay = 1e-2`, `max_grad_norm = 1.0`. |
| - **Schedule:** up to 5 epochs (`max_steps = 49855`), batch size 32, early stopping (patience 5), |
| seed 42. |
| - **Model selection:** best checkpoint by validation AUROC (`checkpoint-39884`, epoch 4, validation |
| AUROC ≈ **0.9933**). |
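
As a rough illustration of the objective, the sketch below computes a per-example binary cross-entropy on logits with a positive-class weight and smoothed targets. It is written under assumptions: the smoothing rule (pulling hard 0/1 targets toward 0.5) is one common convention and may not match the training code, and the function names are hypothetical. The weighting follows the usual `pos_weight` convention of scaling only the positive-target term.

```python
import math

def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def smooth_target(target, label_smoothing):
    """Assumed smoothing: move a hard 0/1 target toward 0.5."""
    return target * (1.0 - label_smoothing) + 0.5 * label_smoothing

def bce_with_logits(logit, target, pos_weight=0.413, label_smoothing=0.2):
    """Weighted binary cross-entropy for a single example."""
    y = smooth_target(target, label_smoothing)
    log_p = -softplus(-logit)      # log(sigmoid(logit))
    log_not_p = -softplus(logit)   # log(1 - sigmoid(logit))
    return -(pos_weight * y * log_p + (1.0 - y) * log_not_p)

# A confident correct logit costs less than a confident wrong one.
loss_good = bce_with_logits(5.0, 1)
loss_bad = bce_with_logits(-5.0, 1)
```

With `pos_weight` below 1, errors on positive-target examples are penalized less than errors on negative-target ones, which nudges the decision boundary away from the positive class.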
|
|
| ## Evaluation |
|
|
| Results on the MAGE test set: |
|
|
| | Metric | Value | |
| |---|---| |
| | Accuracy | 0.9515 | |
| | Macro F1 | 0.9515 | |
| | ROC AUC | 0.9836 | |
| | AI Precision | 0.9710 | |
| | AI Recall | 0.9311 | |
| | AI F1 | 0.9506 | |
| | Human Precision | 0.9334 | |
| | Human Recall | 0.9720 | |
| | Human F1 | 0.9523 | |
| | Test loss | 0.2456 | |
|
|
| **Runtime (test split):** 1619.5 s, 37.5 samples/s, 1.173 steps/s. |
|
|
| > The model is slightly more precise on AI text (fewer false AI flags) and has higher recall on human |
| > text, i.e. it is conservative about labeling text as AI-generated. |
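
The per-class F1 scores and the macro F1 in the table follow directly from the reported precision and recall values. The short check below recomputes them, using only numbers from the table; the helper name `f1` is ours.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

ai_f1 = f1(0.9710, 0.9311)
human_f1 = f1(0.9334, 0.9720)
macro_f1 = (ai_f1 + human_f1) / 2

print(f"AI F1={ai_f1:.4f}  Human F1={human_f1:.4f}  Macro F1={macro_f1:.4f}")
```

Rounded to four decimals, the recomputed values agree with the reported 0.9506, 0.9523, and 0.9515.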
|
|
| ## How to Use |
|
|
| Inference is provided through `inference.py`, which loads the frozen backbones plus the trained heads |
| from a checkpoint and a training YAML config: |
|
|
| ```bash |
| uv run PAWN++/inference.py \ |
| --config PAWN++/experiments/MAGE/configs/pawn/two_models/mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full.yaml \ |
| --checkpoint PAWN++/checkpoint-39884/pytorch_model.bin \ |
| --text "Your text to classify here." |
| ``` |
|
|
| ```python |
| from inference import load_model, predict |
| |
| model, device = load_model( |
| config_path="PAWN++/experiments/MAGE/configs/pawn/two_models/" |
| "mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full.yaml", |
| checkpoint_path="PAWN++/checkpoint-39884/pytorch_model.bin", |
| ) |
| results = predict(model, ["Your text to classify here."], device) |
| # each result: {"label": "human"|"ai", "prediction": 0|1, "prob_human": float, "logit": float} |
| ``` |
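
Since predictions are best treated as a signal, one possible follow-up is to separate confident calls from borderline ones before acting on them. The sketch below is hypothetical: `triage` and its `margin` parameter are not part of the repository, and the sample dictionaries are made-up values that carry only the `prob_human` field from the result schema shown above.

```python
def triage(results, margin=0.2):
    """Split results into confident calls and ones near the 0.5 decision boundary."""
    confident, uncertain = [], []
    for result in results:
        if abs(result["prob_human"] - 0.5) < margin:
            uncertain.append(result)
        else:
            confident.append(result)
    return confident, uncertain

# Made-up results for illustration; real ones come from predict().
sample_results = [
    {"prob_human": 0.97},
    {"prob_human": 0.41},
    {"prob_human": 0.08},
]
confident, uncertain = triage(sample_results)
```

Items in `uncertain` are natural candidates for human review; the right `margin` depends on the application and would need to be chosen on held-out data.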
|
|
| > **Note:** The Llama-3.2 backbones are gated on the Hugging Face Hub. Set `HF_TOKEN` in a `.env` file |
| > to download them. A GPU is recommended; the code falls back to MPS or CPU automatically. |
| |
| ## Limitations and Bias |
| |
| - English-only; performance on other languages has not been evaluated and is expected to degrade. |
| - Detection quality depends on the generators and domains seen during training (MAGE); novel models, |
| prompting styles, paraphrasing or adversarial edits can reduce accuracy. |
| - Depends on two frozen Llama-3.2-1B backbones, which carry their own data biases. |
| - Reported metrics reflect the MAGE test distribution and may not transfer out of distribution; see |
| the OOD evaluation utilities in the repository. |
| |
| ## Citation |
| |
| PAWN++ builds on the PAWN detector: |
| |
| > PAWN (Perplexity Attention Weighted Networks): "Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection", Information Fusion. |
| > https://www.sciencedirect.com/science/article/pii/S156625352500538X |
| |