---
license: mit
tags:
- abstract-detection
- scientific-text
- quality-filtering
- text-classification
- pubverse
- potion-32m
language:
- en
- multilingual
library_name: model2vec
pipeline_tag: text-classification
---
# Abstract Archon

Binary classifier that answers: **"Is this text a real research abstract?"**

It embeds the text with [Potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) (512-dim static embeddings) and scores the embedding with a LogisticRegression head. It is designed as a quality gate for large-scale scientific publication databases where abstract fields often contain non-abstract content (figure captions, supplementary material references, author bylines, HTML artifacts, taxonomy stubs).

## Performance

| Metric | Value |
|--------|-------|
| ROC-AUC | 0.970 |
| Accuracy | 92% |
| Recall (real abstracts, t=0.01) | 99.75% |
| Garbage precision (t=0.01) | 99.0% |

Evaluated on a 20% held-out stratified test split. A separate validation on 500 random publications from a 198M-paper database confirmed that the model identifies real abstracts with a very low false-negative rate.

PMID 39869795 sanity check: P(abstract) = 0.794 (PASS)
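
The two thresholded metrics in the table can be reproduced from per-document scores. The sketch below assumes that "garbage precision" means the fraction of texts rejected by the gate (P(abstract) below the threshold) that are truly non-abstracts; the model card does not define it, so treat that reading, the helper name `gate_metrics`, and the toy data as illustrative assumptions.

```python
def gate_metrics(probs, labels, threshold=0.01):
    """Recall on real abstracts and precision of the 'garbage' verdict.

    probs  : iterable of P(abstract) scores
    labels : iterable of ground-truth labels, 1 = abstract, 0 = garbage
    """
    kept_real = dropped_real = dropped_garbage = 0
    for p, y in zip(probs, labels):
        is_abstract = p >= threshold
        if y == 1 and is_abstract:
            kept_real += 1          # real abstract that passes the gate
        elif y == 1:
            dropped_real += 1       # real abstract wrongly rejected
        elif not is_abstract:
            dropped_garbage += 1    # garbage correctly rejected
    n_real = kept_real + dropped_real
    n_rejected = dropped_real + dropped_garbage
    recall = kept_real / n_real if n_real else float("nan")
    garbage_precision = dropped_garbage / n_rejected if n_rejected else float("nan")
    return recall, garbage_precision


# Toy data, not taken from the model's evaluation set.
toy_probs = [0.90, 0.40, 0.005, 0.002, 0.30, 0.001]
toy_labels = [1, 1, 1, 0, 0, 0]
recall, garbage_precision = gate_metrics(toy_probs, toy_labels)
print(f"recall={recall:.3f}, garbage_precision={garbage_precision:.3f}")
```

With a threshold as low as 0.01, the gate rejects only texts the model is very confident are garbage, which is why recall on real abstracts stays high.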
## Usage

```python
import numpy as np
from model2vec import StaticModel
from scipy.special import expit

# Load the classifier head (expects abstract_archon_head.npz in the working directory).
# allow_pickle=True is kept from the original snippet; only use it with files you trust.
data = np.load("abstract_archon_head.npz", allow_pickle=True)
coef = data["coef"]                  # (1, 512) logistic regression weights
intercept = data["intercept"]        # (1,)
scaler_mean = data["scaler_mean"]    # (512,)
scaler_scale = data["scaler_scale"]  # (512,)
threshold = float(data["threshold"][0])  # 0.01

# Load the embedding model
embed = StaticModel.from_pretrained("minishlab/potion-base-32M")

# Predict: the model only looks at the first 500 characters
text = "Your abstract text here..."
emb = embed.encode([text[:500]])              # shape (1, 512)
x_scaled = (emb - scaler_mean) / scaler_scale  # same standardization as in training
logit = x_scaled @ coef.T + intercept
prob = expit(logit).flatten()[0]
is_abstract = prob >= threshold
print(f"P(abstract) = {prob:.4f}, is_abstract = {is_abstract}")
```
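
For a quality gate over many records, the same steps can be applied to a whole batch at once. The following is a minimal sketch, not part of the published model: the helper name `score_texts`, its `encode` and `head` parameters, and the toy encoder and head are assumptions made so the example runs without downloading anything. In real use, `encode` would be the `encode` method of the loaded `StaticModel` and `head` the loaded `.npz` data.

```python
import numpy as np


def score_texts(texts, encode, head, max_chars=500):
    """Return P(abstract) for each text as a 1-D array.

    encode : callable mapping a list of strings to an (n, d) array
    head   : mapping with 'coef', 'intercept', 'scaler_mean', 'scaler_scale'
    """
    clipped = [t[:max_chars] for t in texts]
    emb = np.asarray(encode(clipped), dtype=np.float64)
    x_scaled = (emb - head["scaler_mean"]) / head["scaler_scale"]
    logits = x_scaled @ np.asarray(head["coef"]).T + head["intercept"]
    # Logistic function written as 0.5 * (1 + tanh(z / 2)), which does not overflow.
    return 0.5 * (1.0 + np.tanh(0.5 * logits.ravel()))


# Stand-in head and encoder (hypothetical, 8 dimensions instead of 512).
rng = np.random.default_rng(0)
dim = 8
toy_head = {
    "coef": rng.normal(size=(1, dim)),
    "intercept": np.array([0.1]),
    "scaler_mean": np.zeros(dim),
    "scaler_scale": np.ones(dim),
}


def toy_encode(texts):
    # Fake embedding: every dimension is the text length divided by 500.
    return np.array([[len(t)] * dim for t in texts], dtype=np.float64) / 500.0


probs = score_texts(["short text", "x" * 2000], toy_encode, toy_head)
print(probs.shape)  # (2,)
```

Comparing `probs >= threshold` then gives one keep/reject decision per record.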
## Architecture

- **Embeddings**: `minishlab/potion-base-32M` (512-dim, static, deterministic; about 20 s to embed 200K documents)
- **Preprocessing**: StandardScaler applied to the embeddings
- **Head**: LogisticRegression (C=0.01, balanced class weights)
- **Input**: First 500 characters of the text
- **Threshold**: 0.01 (calibrated for 99.5%+ recall on real abstracts)
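
The model card does not describe how the threshold was calibrated. One common way to pick a threshold for a target recall, sketched here as an assumption rather than the author's procedure, is to take the scores of known real abstracts on a validation split and choose the largest threshold that still keeps the required fraction of them. The function name `threshold_for_recall` and the toy scores are hypothetical.

```python
import math


def threshold_for_recall(positive_scores, target_recall):
    """Largest threshold t such that at least `target_recall` of the
    positive scores satisfy score >= t."""
    if not positive_scores:
        raise ValueError("need at least one positive score")
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    ranked = sorted(positive_scores, reverse=True)
    # Number of positives that must stay at or above the threshold.
    k = math.ceil(target_recall * len(ranked))
    return ranked[k - 1]


# Toy P(abstract) scores for real abstracts (illustrative only).
scores = [0.90, 0.40, 0.20, 0.01]
t = threshold_for_recall(scores, target_recall=0.75)
achieved = sum(s >= t for s in scores) / len(scores)
print(t, achieved)
```

Calibrating on a split that was not used to fit the head avoids an optimistic recall estimate.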
## NPZ Keys

| Key | Shape | Description |
|-----|-------|-------------|
| `coef` | (1, 512) | LR coefficients |
| `intercept` | (1,) | LR intercept |
| `classes` | (2,) | Class labels [0, 1] |
| `labels` | (2,) | ['garbage', 'abstract'] |
| `scaler_mean` | (512,) | StandardScaler mean |
| `scaler_scale` | (512,) | StandardScaler scale |
| `embed_model` | str | 'minishlab/potion-base-32M' |
| `version` | str | 'v1' |
| `threshold` | (1,) | Calibrated decision threshold |
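
Before wiring the head into a pipeline, it can help to check a loaded file against the table above. The sketch below is an illustrative helper, not something shipped with the model: `check_head` and `EXPECTED_SHAPES` are hypothetical names, and only the array-valued keys are checked because the table lists `embed_model` and `version` as strings without a shape.

```python
import numpy as np

# Expected shapes for the array-valued keys, copied from the table above.
EXPECTED_SHAPES = {
    "coef": (1, 512),
    "intercept": (1,),
    "classes": (2,),
    "labels": (2,),
    "scaler_mean": (512,),
    "scaler_scale": (512,),
    "threshold": (1,),
}


def check_head(head):
    """Return a list of problems found in a dict-like head; empty if none."""
    problems = []
    for key, shape in EXPECTED_SHAPES.items():
        if key not in head:
            problems.append(f"missing key: {key}")
        elif np.shape(head[key]) != shape:
            problems.append(f"{key}: expected {shape}, got {np.shape(head[key])}")
    return problems


# Toy heads built in memory so the sketch runs without the real file.
good_head = {k: np.zeros(s) for k, s in EXPECTED_SHAPES.items()}
bad_head = dict(good_head)
del bad_head["threshold"]
bad_head["coef"] = np.zeros((1, 256))

print(check_head(good_head))  # []
print(check_head(bad_head))
```

The same function should accept the object returned by `np.load("abstract_archon_head.npz", allow_pickle=True)`, since it supports key lookup in the same way as a dict.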
## Training Data

Trained on ~4,000 examples (2,000 real abstracts + ~2,000 curated non-abstract texts) from a 198M-publication database. See [abstract-archon-data](https://huggingface.co/datasets/jimnoneill/abstract-archon-data) for the full training set.

Negative examples were manually curated to remove misclassifications. Categories include figure/table captions, supplementary material references, author bylines, journal metadata scrapes, HTML-heavy content, MOESM titles, and taxonomy stubs.

## Part of PubVerse

This model is part of the [PubVerse](https://github.com/jimnoneill/pubverse) scientific literature analysis pipeline, where it serves as a quality gate before clustering and impact analysis.