Use from the Model2Vec library
from model2vec import StaticModel

model = StaticModel.from_pretrained("jimnoneill/abstract-archon")

Abstract Archon

Binary classifier that answers: "Is this text a real research abstract?"

Uses Potion-base-32M (512-dim static embeddings) + LogisticRegression. Designed as a quality gate for large-scale scientific publication databases where abstract fields often contain non-abstract content (figure captions, supplementary material refs, author bylines, HTML artifacts, taxonomy stubs).

Performance

| Metric | Value |
|---|---|
| ROC-AUC | 0.970 |
| Accuracy | 92% |
| Recall (real abstracts, t=0.01) | 99.75% |
| Garbage precision (t=0.01) | 99.0% |

Evaluated on a 20% held-out stratified test split. Separate validation on 500 random publications from a 198M-paper database confirmed that the model correctly identifies real abstracts with a very low false-negative rate.

PMID 39869795 sanity check: P(abstract) = 0.794 (PASS)
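The 0.01 threshold is described below as calibrated for 99.5%+ recall on real abstracts. A minimal sketch of that style of calibration, using synthetic validation scores (the function, data, and random seeds here are illustrative stand-ins, not the card's actual tooling):

```python
import numpy as np

def threshold_for_recall(probs, labels, target_recall=0.995):
    """Largest threshold whose recall on the positive class
    (real abstracts, label 1) still meets the target."""
    pos = np.sort(probs[labels == 1])
    # Number of positives we may afford to lose below the threshold:
    k = int(np.floor(len(pos) * (1 - target_recall)))
    return pos[k] if k < len(pos) else pos[0]

# Toy validation scores: positives cluster high, negatives low.
rng = np.random.default_rng(0)
probs = np.concatenate([rng.beta(8, 1, 1000), rng.beta(1, 8, 1000)])
labels = np.concatenate([np.ones(1000, int), np.zeros(1000, int)])

t = threshold_for_recall(probs, labels)
recall = (probs[labels == 1] >= t).mean()
```

Lowering the threshold this aggressively trades garbage precision for near-total recall, which matches the quality-gate use case: false rejections of real abstracts are the costly error.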

Usage

import numpy as np
from model2vec import StaticModel
from scipy.special import expit

# Load the classifier head (abstract_archon_head.npz from this model repo)
data = np.load("abstract_archon_head.npz", allow_pickle=True)
coef = data['coef']
intercept = data['intercept']
scaler_mean = data['scaler_mean']
scaler_scale = data['scaler_scale']
threshold = float(data['threshold'][0])  # 0.01

# Load embedding model
embed = StaticModel.from_pretrained("minishlab/potion-base-32M")

# Predict
text = "Your abstract text here..."[:500]
emb = embed.encode([text])
x_scaled = (emb - scaler_mean) / scaler_scale
logit = x_scaled @ coef.T + intercept
prob = expit(logit).flatten()[0]

is_abstract = prob >= threshold
print(f"P(abstract) = {prob:.4f}, is_abstract = {is_abstract}")
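The per-document steps above vectorize naturally for batch scoring: the scaling and logistic-regression math broadcasts over a whole matrix of embeddings. A numpy-only sketch with random stand-ins for the NPZ arrays and the Potion embeddings (in real use, load the actual head file and encode texts with the embedding model as above):

```python
import numpy as np
from scipy.special import expit

# Random stand-ins for the real NPZ arrays (512-dim head, as documented).
rng = np.random.default_rng(1)
coef = rng.normal(size=(1, 512))
intercept = np.zeros(1)
scaler_mean = np.zeros(512)
scaler_scale = np.ones(512)
threshold = 0.01

def classify_batch(embs):
    """Scale embeddings, apply the LR head, threshold into is_abstract flags."""
    x = (embs - scaler_mean) / scaler_scale
    probs = expit(x @ coef.T + intercept).ravel()
    return probs, probs >= threshold

embs = rng.normal(size=(8, 512))  # placeholder for embed.encode(texts)
probs, flags = classify_batch(embs)
```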

Architecture

  • Embeddings: minishlab/potion-base-32M (512-dim, static, deterministic, ~20s for 200K docs)
  • Preprocessing: StandardScaler on embeddings
  • Head: LogisticRegression (C=0.01, balanced class weights)
  • Input: First 500 characters of text
  • Threshold: 0.01 (calibrated for 99.5%+ recall on real abstracts)
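The head described above can be reproduced with scikit-learn. A sketch on synthetic 512-dim vectors standing in for Potion embeddings (the data, sample counts, and seed are illustrative; only the hyperparameters come from this card):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-ins: abstracts (label 1) vs garbage (label 0).
X = np.vstack([rng.normal(0.5, 1.0, (200, 512)),
               rng.normal(-0.5, 1.0, (200, 512))])
y = np.array([1] * 200 + [0] * 200)

# StandardScaler + strongly regularized LR with balanced class weights,
# matching the architecture bullets above.
scaler = StandardScaler().fit(X)
clf = LogisticRegression(C=0.01, class_weight="balanced", max_iter=1000)
clf.fit(scaler.transform(X), y)

# These arrays correspond to the NPZ head keys (coef, intercept, ...).
coef, intercept = clf.coef_, clf.intercept_
```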

NPZ Keys

| Key | Shape | Description |
|---|---|---|
| coef | (1, 512) | LR coefficients |
| intercept | (1,) | LR intercept |
| classes | (2,) | Class labels [0, 1] |
| labels | (2,) | ['garbage', 'abstract'] |
| scaler_mean | (512,) | StandardScaler mean |
| scaler_scale | (512,) | StandardScaler scale |
| embed_model | str | 'minishlab/potion-base-32M' |
| version | str | 'v1' |
| threshold | (1,) | Calibrated decision threshold |
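A quick way to sanity-check a downloaded head file against this schema; the in-memory buffer below is a stand-in for abstract_archon_head.npz, written with dummy values:

```python
import io
import numpy as np

expected = ["coef", "intercept", "classes", "labels", "scaler_mean",
            "scaler_scale", "embed_model", "version", "threshold"]

# Build a dummy head file following the documented key schema.
buf = io.BytesIO()
np.savez(buf,
         coef=np.zeros((1, 512)), intercept=np.zeros(1),
         classes=np.array([0, 1]),
         labels=np.array(["garbage", "abstract"]),
         scaler_mean=np.zeros(512), scaler_scale=np.ones(512),
         embed_model="minishlab/potion-base-32M", version="v1",
         threshold=np.array([0.01]))
buf.seek(0)

data = np.load(buf, allow_pickle=True)
missing = [k for k in expected if k not in data.files]
```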

Training Data

Trained on ~4,000 examples (2,000 real abstracts + ~2,000 curated non-abstract texts) from a 198M-publication database. See abstract-archon-data for the full training set.

Negative examples were manually curated to remove misclassifications. Categories include figure/table captions, supplementary material references, author bylines, journal metadata scrapes, HTML-heavy content, MOESM titles, and taxonomy stubs.

Part of PubVerse

This model is part of the PubVerse scientific literature analysis pipeline, where it serves as a quality gate before clustering and impact analysis.
