---
license: mit
tags:
- abstract-detection
- scientific-text
- quality-filtering
- text-classification
- pubverse
- potion-32m
language:
- en
- multilingual
library_name: model2vec
pipeline_tag: text-classification
---

# Abstract Archon

Binary classifier that answers: **"Is this text a real research abstract?"**

Uses [Potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) (512-dim static embeddings) + LogisticRegression. Designed as a quality gate for large-scale scientific publication databases where abstract fields often contain non-abstract content (figure captions, supplementary material refs, author bylines, HTML artifacts, taxonomy stubs).

## Performance

| Metric | Value |
|--------|-------|
| ROC-AUC | 0.970 |
| Accuracy | 92% |
| Recall (real abstracts, t=0.01) | 99.75% |
| Garbage precision (t=0.01) | 99.0% |

Evaluated on a 20% held-out stratified test split. Separate validation on 500 random publications from a 198M-paper database confirmed that the model correctly identifies real abstracts with a very low false-negative rate.

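
The two threshold-dependent rows in the table above can be reproduced from per-text probabilities and ground-truth labels. The sketch below is one plausible reading, not the evaluation script used for this model: it assumes "garbage precision" means the fraction of texts scored below the threshold that are truly non-abstracts. The function name `threshold_metrics` and the toy data are made up for illustration.

```python
def threshold_metrics(probs, labels, threshold=0.01):
    """Return (recall on real abstracts, precision on the garbage class).

    probs  : P(abstract) for each text
    labels : 1 for a real abstract, 0 for garbage
    """
    total_real = kept_real = flagged = flagged_garbage = 0
    for p, y in zip(probs, labels):
        is_abstract = p >= threshold
        if y == 1:
            total_real += 1
            kept_real += is_abstract        # real abstract that passed the gate
        if not is_abstract:
            flagged += 1
            flagged_garbage += (y == 0)     # rejected text that really is garbage
    recall = kept_real / total_real if total_real else float("nan")
    garbage_precision = flagged_garbage / flagged if flagged else float("nan")
    return recall, garbage_precision


# Toy data, invented for this example only
probs = [0.90, 0.40, 0.005, 0.02, 0.001, 0.003]
labels = [1, 1, 1, 0, 0, 0]
recall, garbage_precision = threshold_metrics(probs, labels)
print(f"recall = {recall:.3f}, garbage precision = {garbage_precision:.3f}")
```

A very low threshold such as 0.01 trades garbage recall for abstract recall: almost nothing real is rejected, and whatever is rejected is very likely garbage.
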
PMID 39869795 sanity check: P(abstract) = 0.794 (PASS)

## Usage

```python
import numpy as np
from model2vec import StaticModel
from scipy.special import expit

# Load model
data = np.load("abstract_archon_head.npz", allow_pickle=True)
coef = data['coef']
intercept = data['intercept']
scaler_mean = data['scaler_mean']
scaler_scale = data['scaler_scale']
threshold = float(data['threshold'][0])  # 0.01

# Load embedding model
embed = StaticModel.from_pretrained("minishlab/potion-base-32M")

# Predict
text = "Your abstract text here..."[:500]
emb = embed.encode([text])
x_scaled = (emb - scaler_mean) / scaler_scale
logit = x_scaled @ coef.T + intercept
prob = expit(logit).flatten()[0]
is_abstract = prob >= threshold

print(f"P(abstract) = {prob:.4f}, is_abstract = {is_abstract}")
```

## Architecture

- **Embeddings**: `minishlab/potion-base-32M` (512-dim, static, deterministic, ~20s for 200K docs)
- **Preprocessing**: StandardScaler on embeddings
- **Head**: LogisticRegression (C=0.01, balanced class weights)
- **Input**: First 500 characters of text
- **Threshold**: 0.01 (calibrated for 99.5%+ recall on real abstracts)

## NPZ Keys

| Key | Shape | Description |
|-----|-------|-------------|
| `coef` | (1, 512) | LR coefficients |
| `intercept` | (1,) | LR intercept |
| `classes` | (2,) | Class labels [0, 1] |
| `labels` | (2,) | ['garbage', 'abstract'] |
| `scaler_mean` | (512,) | StandardScaler mean |
| `scaler_scale` | (512,) | StandardScaler scale |
| `embed_model` | str | 'minishlab/potion-base-32M' |
| `version` | str | 'v1' |
| `threshold` | (1,) | Calibrated decision threshold |

## Training Data

Trained on ~4,000 examples (2,000 real abstracts + ~2,000 curated non-abstract texts) from a 198M-publication database. See [abstract-archon-data](https://huggingface.co/datasets/jimnoneill/abstract-archon-data) for the full training set. Negative examples were manually curated to remove misclassifications.
Categories include figure/table captions, supplementary material references, author bylines, journal metadata scrapes, HTML-heavy content, MOESM titles, and taxonomy stubs.

## Part of PubVerse

This model is part of the [PubVerse](https://github.com/jimnoneill/pubverse) scientific literature analysis pipeline, where it serves as a quality gate before clustering and impact analysis.
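
As a rough illustration of the quality-gate role described above, the sketch below filters a stream of records before they reach downstream steps. It is not PubVerse code: the `gate` function, the record layout, and the `"abstract"` key are hypothetical, and `fake_score` is a stand-in so the example runs without downloading the model. In practice `score` would wrap the scoring steps from the Usage section. Skipping empty abstract fields without scoring them is also an assumption of this sketch.

```python
from typing import Callable, Iterable, Iterator

MAX_CHARS = 500    # the model reads only the first 500 characters
THRESHOLD = 0.01   # calibrated decision threshold from this model card


def gate(records: Iterable[dict], score: Callable[[str], float],
         threshold: float = THRESHOLD) -> Iterator[dict]:
    """Yield only the records whose abstract field passes the classifier.

    `score` is any callable that returns P(abstract) for a string.
    """
    for rec in records:
        text = (rec.get("abstract") or "")[:MAX_CHARS]
        # Empty or missing abstracts are dropped without being scored.
        if text and score(text) >= threshold:
            yield rec


def fake_score(text: str) -> float:
    """Stand-in scorer for demonstration only; NOT the real model."""
    return 0.8 if len(text.split()) > 5 else 0.001


papers = [
    {"id": 1, "abstract": "We study the effect of temperature on enzyme kinetics in vitro."},
    {"id": 2, "abstract": "Figure 1."},
    {"id": 3, "abstract": None},
]
kept = [p["id"] for p in gate(papers, fake_score)]
print(kept)
```

Writing the gate as a generator keeps memory use flat on large inputs, since records are scored and passed along one at a time.
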