# Abstract Archon
Binary classifier that answers: "Is this text a real research abstract?"
Uses Potion-base-32M (512-dim static embeddings) + LogisticRegression. Designed as a quality gate for large-scale scientific publication databases where abstract fields often contain non-abstract content (figure captions, supplementary material refs, author bylines, HTML artifacts, taxonomy stubs).
## Performance
| Metric | Value |
|---|---|
| ROC-AUC | 0.970 |
| Accuracy | 92% |
| Recall (real abstracts, t=0.01) | 99.75% |
| Garbage precision (t=0.01) | 99.0% |
Evaluated on a 20% held-out stratified test split. A separate validation on 500 random publications from a 198M-paper database confirmed that the model identifies real abstracts with a very low false-negative rate.

Sanity check (PMID 39869795): P(abstract) = 0.794 (PASS)
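The very low decision threshold (t=0.01) trades garbage recall for near-perfect recall on real abstracts. A sketch of how such a threshold can be calibrated with scikit-learn's `precision_recall_curve`, using synthetic held-out scores (the data here is illustrative, not the model's actual validation set):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Synthetic held-out scores: positives (real abstracts) skew high,
# negatives (garbage) skew low. Stand-ins for real validation scores.
y_true = np.concatenate([np.ones(500), np.zeros(500)])
y_prob = np.concatenate([rng.beta(5, 1, 500), rng.beta(1, 5, 500)])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# precision/recall have len(thresholds)+1 entries; drop the appended
# final point so they align index-for-index with thresholds.
target_recall = 0.995
ok = recall[:-1] >= target_recall
# Among thresholds meeting the recall target, take the highest
# (i.e. the most precise one that still keeps recall >= target).
best = thresholds[ok].max()
print(f"calibrated threshold: {best:.4f}")
```

The same sweep on the real validation scores would yield a threshold near 0.01 for a 99.5%+ recall target.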
## Usage
```python
import numpy as np
from model2vec import StaticModel
from scipy.special import expit

# Load the classifier head (coefficients, scaler stats, threshold)
data = np.load("abstract_archon_head.npz", allow_pickle=True)
coef = data['coef']
intercept = data['intercept']
scaler_mean = data['scaler_mean']
scaler_scale = data['scaler_scale']
threshold = float(data['threshold'][0])  # 0.01

# Load the static embedding model
embed = StaticModel.from_pretrained("minishlab/potion-base-32M")

# Predict on the first 500 characters of the candidate text
text = "Your abstract text here..."[:500]
emb = embed.encode([text])
x_scaled = (emb - scaler_mean) / scaler_scale
logit = x_scaled @ coef.T + intercept
prob = expit(logit).flatten()[0]
is_abstract = prob >= threshold
print(f"P(abstract) = {prob:.4f}, is_abstract = {is_abstract}")
```
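For scoring many candidate texts at once, the single-example snippet generalizes to a batch helper. `classify_abstracts` below is a hypothetical wrapper (not part of the released package); it assumes `embed`, `coef`, `intercept`, `scaler_mean`, `scaler_scale` are already loaded as shown above:

```python
import numpy as np
from scipy.special import expit

def classify_abstracts(texts, embed, coef, intercept,
                       scaler_mean, scaler_scale, threshold=0.01):
    """Score a batch of candidate abstracts.

    `embed` is the loaded StaticModel; the remaining arguments come from
    the NPZ head. Returns (probabilities, boolean is-abstract mask).
    """
    clipped = [t[:500] for t in texts]             # model sees first 500 chars only
    emb = np.asarray(embed.encode(clipped))        # (n, 512) static embeddings
    x = (emb - scaler_mean) / scaler_scale         # apply stored StandardScaler stats
    probs = expit(x @ coef.T + intercept).ravel()  # logistic-regression head
    return probs, probs >= threshold
```

Because the embeddings are static, batch encoding is fast and deterministic, so this scales to millions of records.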
## Architecture
- Embeddings: `minishlab/potion-base-32M` (512-dim, static, deterministic; ~20 s for 200K docs)
- Preprocessing: StandardScaler on embeddings
- Head: LogisticRegression (C=0.01, balanced class weights)
- Input: First 500 characters of text
- Threshold: 0.01 (calibrated for 99.5%+ recall on real abstracts)
## NPZ Keys
| Key | Shape | Description |
|---|---|---|
| `coef` | (1, 512) | LR coefficients |
| `intercept` | (1,) | LR intercept |
| `classes` | (2,) | Class labels [0, 1] |
| `labels` | (2,) | ['garbage', 'abstract'] |
| `scaler_mean` | (512,) | StandardScaler mean |
| `scaler_scale` | (512,) | StandardScaler scale |
| `embed_model` | str | 'minishlab/potion-base-32M' |
| `version` | str | 'v1' |
| `threshold` | (1,) | Calibrated decision threshold |
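A small integrity check against this layout can catch a corrupted or mismatched download before inference. The helper below is a suggested convenience, not part of the released files; pass it the result of `np.load(..., allow_pickle=True)`:

```python
import numpy as np

# Expected array shapes per the NPZ key table above.
EXPECTED_SHAPES = {
    "coef": (1, 512), "intercept": (1,), "classes": (2,), "labels": (2,),
    "scaler_mean": (512,), "scaler_scale": (512,), "threshold": (1,),
}

def check_head_npz(data):
    """Validate a loaded NPZ against the documented Abstract Archon layout."""
    for key, shape in EXPECTED_SHAPES.items():
        assert data[key].shape == shape, f"{key}: got {data[key].shape}, want {shape}"
    assert str(data["embed_model"]) == "minishlab/potion-base-32M"
    assert str(data["version"]) == "v1"
```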
## Training Data
Trained on ~4,000 examples (2,000 real abstracts and ~2,000 curated non-abstract texts) from a 198M-publication database. See `abstract-archon-data` for the full training set.
Negative examples were manually curated to remove misclassifications. Categories include figure/table captions, supplementary material references, author bylines, journal metadata scrapes, HTML-heavy content, MOESM titles, and taxonomy stubs.
## Part of PubVerse
This model is part of the PubVerse scientific literature analysis pipeline, where it serves as a quality gate before clustering and impact analysis.