# Abstract Archon
Binary classifier that answers: "Is this text a real research abstract?"
Uses Potion-base-32M (512-dim static embeddings) + LogisticRegression. Designed as a quality gate for large-scale scientific publication databases where abstract fields often contain non-abstract content (figure captions, supplementary material refs, author bylines, HTML artifacts, taxonomy stubs).
## Performance
| Metric | Value |
|---|---|
| ROC-AUC | 0.970 |
| Accuracy | 92% |
| Recall (real abstracts, t=0.01) | 99.75% |
| Garbage precision (t=0.01) | 99.0% |
Evaluated on a 20% held-out stratified test split. A separate validation on 500 random publications from a 198M-paper database confirmed that the model identifies real abstracts with a very low false-negative rate.

Sanity check (PMID 39869795): P(abstract) = 0.794 (PASS)
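The very low decision threshold (t=0.01) trades garbage recall for near-perfect recall on real abstracts. A sketch of how such a threshold can be calibrated with scikit-learn's `precision_recall_curve`, using synthetic held-out scores (the data here is illustrative, not the model's actual validation set):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Synthetic held-out scores: positives (real abstracts) skew high,
# negatives (garbage) skew low. Stand-ins for real validation scores.
y_true = np.concatenate([np.ones(500), np.zeros(500)])
y_prob = np.concatenate([rng.beta(5, 1, 500), rng.beta(1, 5, 500)])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# precision/recall have len(thresholds)+1 entries; drop the appended
# final point so they align index-for-index with thresholds.
target_recall = 0.995
ok = recall[:-1] >= target_recall
# Among thresholds meeting the recall target, take the highest
# (i.e. the most precise one that still keeps recall >= target).
best = thresholds[ok].max()
print(f"calibrated threshold: {best:.4f}")
```

The same sweep on the real validation scores would yield a threshold near 0.01 for a 99.5%+ recall target.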
## Usage
```python
import numpy as np
from model2vec import StaticModel
from scipy.special import expit

# Load the classifier head (coefficients, scaler stats, threshold)
data = np.load("abstract_archon_head.npz", allow_pickle=True)
coef = data['coef']
intercept = data['intercept']
scaler_mean = data['scaler_mean']
scaler_scale = data['scaler_scale']
threshold = float(data['threshold'][0])  # 0.01

# Load the static embedding model
embed = StaticModel.from_pretrained("minishlab/potion-base-32M")

# Predict on the first 500 characters of the candidate text
text = "Your abstract text here..."[:500]
emb = embed.encode([text])
x_scaled = (emb - scaler_mean) / scaler_scale
logit = x_scaled @ coef.T + intercept
prob = expit(logit).flatten()[0]
is_abstract = prob >= threshold
print(f"P(abstract) = {prob:.4f}, is_abstract = {is_abstract}")
```
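For scoring many candidate texts at once, the single-example snippet generalizes to a batch helper. `classify_abstracts` below is a hypothetical wrapper (not part of the released package); it assumes `embed`, `coef`, `intercept`, `scaler_mean`, `scaler_scale` are already loaded as shown above:

```python
import numpy as np
from scipy.special import expit

def classify_abstracts(texts, embed, coef, intercept,
                       scaler_mean, scaler_scale, threshold=0.01):
    """Score a batch of candidate abstracts.

    `embed` is the loaded StaticModel; the remaining arguments come from
    the NPZ head. Returns (probabilities, boolean is-abstract mask).
    """
    clipped = [t[:500] for t in texts]             # model sees first 500 chars only
    emb = np.asarray(embed.encode(clipped))        # (n, 512) static embeddings
    x = (emb - scaler_mean) / scaler_scale         # apply stored StandardScaler stats
    probs = expit(x @ coef.T + intercept).ravel()  # logistic-regression head
    return probs, probs >= threshold
```

Because the embeddings are static, batch encoding is fast and deterministic, so this scales to millions of records.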
## Architecture
- Embeddings: `minishlab/potion-base-32M` (512-dim, static, deterministic; ~20 s for 200K docs)
- Preprocessing: StandardScaler on embeddings
- Head: LogisticRegression (C=0.01, balanced class weights)
- Input: First 500 characters of text
- Threshold: 0.01 (calibrated for 99.5%+ recall on real abstracts)
## NPZ Keys
| Key | Shape | Description |
|---|---|---|
| `coef` | (1, 512) | LR coefficients |
| `intercept` | (1,) | LR intercept |
| `classes` | (2,) | Class labels [0, 1] |
| `labels` | (2,) | ['garbage', 'abstract'] |
| `scaler_mean` | (512,) | StandardScaler mean |
| `scaler_scale` | (512,) | StandardScaler scale |
| `embed_model` | str | 'minishlab/potion-base-32M' |
| `version` | str | 'v1' |
| `threshold` | (1,) | Calibrated decision threshold |
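A small integrity check against this layout can catch a corrupted or mismatched download before inference. The helper below is a suggested convenience, not part of the released files; pass it the result of `np.load(..., allow_pickle=True)`:

```python
import numpy as np

# Expected array shapes per the NPZ key table above.
EXPECTED_SHAPES = {
    "coef": (1, 512), "intercept": (1,), "classes": (2,), "labels": (2,),
    "scaler_mean": (512,), "scaler_scale": (512,), "threshold": (1,),
}

def check_head_npz(data):
    """Validate a loaded NPZ against the documented Abstract Archon layout."""
    for key, shape in EXPECTED_SHAPES.items():
        assert data[key].shape == shape, f"{key}: got {data[key].shape}, want {shape}"
    assert str(data["embed_model"]) == "minishlab/potion-base-32M"
    assert str(data["version"]) == "v1"
```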
## Training Data
Trained on ~4,000 examples (2,000 real abstracts and ~2,000 curated non-abstract texts) from a 198M-publication database. See `abstract-archon-data` for the full training set.
Negative examples were manually curated to remove misclassifications. Categories include figure/table captions, supplementary material references, author bylines, journal metadata scrapes, HTML-heavy content, MOESM titles, and taxonomy stubs.
## Part of PubVerse
This model is part of the PubVerse scientific literature analysis pipeline, where it serves as a quality gate before clustering and impact analysis.