---
license: mit
tags:
  - abstract-detection
  - scientific-text
  - quality-filtering
  - text-classification
  - pubverse
  - potion-32m
language:
  - en
  - multilingual
library_name: model2vec
pipeline_tag: text-classification
---

# Abstract Archon

Binary classifier that answers: **"Is this text a real research abstract?"**

Pairs [Potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) static embeddings (512-dim) with a logistic regression head. Designed as a quality gate for large-scale scientific publication databases, where abstract fields often contain non-abstract content: figure captions, supplementary material references, author bylines, HTML artifacts, and taxonomy stubs.

## Performance

| Metric | Value |
|--------|-------|
| ROC-AUC | 0.970 |
| Accuracy | 92% |
| Recall (real abstracts, t=0.01) | 99.75% |
| Garbage precision (t=0.01) | 99.0% |

Evaluated on a 20% held-out stratified test split. A separate validation on 500 random publications from a 198M-paper database confirmed that the model identifies real abstracts with a very low false-negative rate.

Sanity check on PMID 39869795: P(abstract) = 0.794, well above the 0.01 threshold (PASS).
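
The two threshold-dependent rows in the table can be reproduced from labels and predicted probabilities. The sketch below is illustrative only: `gate_metrics` is a hypothetical helper, the toy data is made up, and it assumes "garbage precision" means the share of rejected texts (P(abstract) below the threshold) that are truly non-abstracts.

```python
import numpy as np

def gate_metrics(y_true, prob, threshold=0.01):
    """Recall on real abstracts and precision of the 'garbage' decision.

    y_true: 1 = real abstract, 0 = garbage.
    prob:   P(abstract) for each text.
    """
    y_true = np.asarray(y_true).astype(bool)
    kept = np.asarray(prob) >= threshold   # texts the gate lets through
    dropped = ~kept                        # texts the gate rejects

    # Share of real abstracts that survive the gate.
    recall = kept[y_true].mean() if y_true.any() else float("nan")
    # Share of rejected texts that really are garbage (assumed definition).
    garbage_precision = (~y_true)[dropped].mean() if dropped.any() else float("nan")
    return float(recall), float(garbage_precision)

# Toy data, not taken from the evaluation set.
y = [1, 1, 1, 1, 0, 0]
p = [0.90, 0.40, 0.02, 0.005, 0.001, 0.30]
recall, garbage_precision = gate_metrics(y, p)
print(f"recall = {recall:.2f}, garbage precision = {garbage_precision:.2f}")
```

With a threshold this low, the gate favors keeping real abstracts over catching every bad record, so some garbage is expected to pass through.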

## Usage

```python
import numpy as np
from model2vec import StaticModel
from scipy.special import expit

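# Load the classifier head (coefficients, scaler statistics, threshold).
# allow_pickle=True lets NumPy unpickle object arrays; only use it on files you trust.
data = np.load("abstract_archon_head.npz", allow_pickle=True)
coef = data['coef']                  # shape (1, 512)
intercept = data['intercept']        # shape (1,)
scaler_mean = data['scaler_mean']    # shape (512,)
scaler_scale = data['scaler_scale']  # shape (512,)
threshold = float(data['threshold'][0])  # 0.01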

# Load embedding model
embed = StaticModel.from_pretrained("minishlab/potion-base-32M")

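# Predict: the model scores only the first 500 characters
text = "Your abstract text here..."[:500]
emb = embed.encode([text])                     # shape (1, 512)
x_scaled = (emb - scaler_mean) / scaler_scale  # standardize with the stored scaler statistics
logit = x_scaled @ coef.T + intercept
prob = expit(logit).flatten()[0]               # sigmoid gives P(abstract)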

is_abstract = prob >= threshold
print(f"P(abstract) = {prob:.4f}, is_abstract = {is_abstract}")
```
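
For scoring many texts at once, the same steps can be wrapped in a small helper. This is a minimal sketch, not part of the released code: `score_texts` and its `encode`, `head`, and `max_chars` parameters are hypothetical names chosen for illustration. The encoder is passed in as a callable so the function can be exercised without downloading the embedding model, and the sigmoid is written with `tanh`, which is mathematically equivalent to `expit`.

```python
import numpy as np

def score_texts(texts, encode, head, max_chars=500):
    """Return P(abstract) for each text as a 1-D array.

    encode: callable mapping list[str] -> array of shape (n, d),
            for example the `embed.encode` method from the snippet above.
    head:   mapping with 'coef', 'intercept', 'scaler_mean', 'scaler_scale'
            (the loaded NPZ object or a plain dict).
    """
    clipped = [t[:max_chars] for t in texts]             # model input is the first 500 chars
    emb = np.asarray(encode(clipped), dtype=np.float64)
    x = (emb - head["scaler_mean"]) / head["scaler_scale"]
    logit = (x @ np.asarray(head["coef"]).T + head["intercept"]).ravel()
    # Sigmoid via tanh, so large |logit| values do not overflow.
    return 0.5 * (1.0 + np.tanh(0.5 * logit))
```

With the objects from the snippet above, `score_texts(batch, embed.encode, data) >= threshold` would give a boolean keep mask for a list of strings `batch`.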

## Architecture

- **Embeddings**: `minishlab/potion-base-32M` (512-dim, static, deterministic, ~20s for 200K docs)
- **Preprocessing**: StandardScaler on embeddings
- **Head**: LogisticRegression (C=0.01, balanced class weights)
- **Input**: First 500 characters of text
- **Threshold**: 0.01 (calibrated for 99.5%+ recall on real abstracts)
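
The card does not describe how the threshold was calibrated. One common approach, sketched here under that caveat, is to take the predicted probabilities of known real abstracts on a validation set and pick the largest threshold that still keeps the target share of them. `threshold_for_recall` is a hypothetical helper and the validation probabilities are synthetic.

```python
import numpy as np

def threshold_for_recall(prob_pos, target_recall=0.995):
    """Largest threshold t such that at least `target_recall` of the
    positive-class probabilities satisfy prob >= t."""
    p = np.sort(np.asarray(prob_pos, dtype=np.float64))
    n = p.size
    # Number of positives allowed to fall below the threshold
    # (small epsilon guards against floating-point round-down).
    max_missed = int(np.floor(n * (1.0 - target_recall) + 1e-9))
    max_missed = min(max_missed, n - 1)
    return float(p[max_missed])

# Synthetic P(abstract) values for 100 known real abstracts.
val_probs = np.linspace(0.01, 1.0, 100)
t = threshold_for_recall(val_probs, target_recall=0.95)
achieved = float((val_probs >= t).mean())
print(f"threshold = {t:.3f}, recall = {achieved:.3f}")
```

A threshold chosen this way depends on the validation sample, so it is worth re-checking recall when the model is applied to a different corpus.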

## NPZ Keys

| Key | Shape | Description |
|-----|-------|-------------|
| `coef` | (1, 512) | LR coefficients |
| `intercept` | (1,) | LR intercept |
| `classes` | (2,) | Class labels [0, 1] |
| `labels` | (2,) | ['garbage', 'abstract'] |
| `scaler_mean` | (512,) | StandardScaler mean |
| `scaler_scale` | (512,) | StandardScaler scale |
| `embed_model` | str | 'minishlab/potion-base-32M' |
| `version` | str | 'v1' |
| `threshold` | (1,) | Calibrated decision threshold |
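
Before using a downloaded head file, it can help to check that its arrays match the shapes listed above. The following is a small sketch under that assumption; `check_head` and `EXPECTED_SHAPES` are hypothetical names, and the string-valued keys (`embed_model`, `version`) are deliberately left out because the table does not give an array shape for them.

```python
import numpy as np

# Array-valued keys and shapes, copied from the table above.
EXPECTED_SHAPES = {
    "coef": (1, 512),
    "intercept": (1,),
    "classes": (2,),
    "labels": (2,),
    "scaler_mean": (512,),
    "scaler_scale": (512,),
    "threshold": (1,),
}

def check_head(head):
    """Return a list of problems; an empty list means the shapes look right."""
    problems = []
    for key, shape in EXPECTED_SHAPES.items():
        if key not in head:
            problems.append(f"missing key: {key}")
        elif np.shape(head[key]) != shape:
            problems.append(f"{key}: expected {shape}, got {np.shape(head[key])}")
    # A zero scale would cause division by zero during standardization.
    if "scaler_scale" in head and np.any(np.asarray(head["scaler_scale"]) == 0):
        problems.append("scaler_scale contains zeros")
    return problems
```

The function accepts either a plain dict or the object returned by `np.load`, since both support `in` and key lookup.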

## Training Data

Trained on ~4,000 examples (2,000 real abstracts + ~2,000 curated non-abstract texts) from a 198M-publication database. See [abstract-archon-data](https://huggingface.co/datasets/jimnoneill/abstract-archon-data) for the full training set.

Negative examples were manually curated to remove misclassifications. Categories include figure/table captions, supplementary material references, author bylines, journal metadata scrapes, HTML-heavy content, MOESM titles, and taxonomy stubs.

## Part of PubVerse

This model is part of the [PubVerse](https://github.com/jimnoneill/pubverse) scientific literature analysis pipeline, where it serves as a quality gate before clustering and impact analysis.