---
license: mit
tags:
- abstract-detection
- scientific-text
- quality-filtering
- text-classification
- pubverse
- potion-32m
language:
- en
- multilingual
library_name: model2vec
pipeline_tag: text-classification
---
# Abstract Archon
Binary classifier that answers: **"Is this text a real research abstract?"**
It embeds text with [Potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) (512-dim static embeddings) and scores the embedding with a LogisticRegression head. It is designed as a quality gate for large-scale scientific publication databases, where abstract fields often contain non-abstract content (figure captions, supplementary material references, author bylines, HTML artifacts, taxonomy stubs).
## Performance
| Metric | Value |
|--------|-------|
| ROC-AUC | 0.970 |
| Accuracy | 92% |
| Recall (real abstracts, t=0.01) | 99.75% |
| Garbage precision (t=0.01) | 99.0% |
Evaluated on a 20% held-out stratified test split. A separate validation on 500 random publications from a 198M-paper database confirmed that the model identifies real abstracts with a very low false-negative rate.
PMID 39869795 sanity check: P(abstract) = 0.794 (PASS)
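
The two threshold-dependent rows in the table can be reproduced from held-out scores along these lines. This is a minimal sketch, not the evaluation script behind the reported numbers: `gate_metrics` is a hypothetical helper, the scores are toy values, and it assumes that "garbage precision" means the fraction of rejected texts (P(abstract) below the threshold) that are truly non-abstracts.

```python
import numpy as np

def gate_metrics(probs, labels, threshold=0.01):
    """Recall on real abstracts and precision of the 'garbage' decision.

    labels: 1 = real abstract, 0 = garbage.
    A text is kept when prob >= threshold and rejected otherwise.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    kept = probs >= threshold
    rejected = ~kept

    real = labels == 1
    # Share of real abstracts that pass the gate
    recall_real = kept[real].mean() if real.any() else float("nan")
    # Share of rejected texts that were truly garbage
    garbage_precision = (
        (labels[rejected] == 0).mean() if rejected.any() else float("nan")
    )
    return float(recall_real), float(garbage_precision)

# Toy scores for illustration only, not real model output
probs = [0.90, 0.40, 0.005, 0.002, 0.001, 0.30]
labels = [1, 1, 1, 0, 0, 0]
recall, precision = gate_metrics(probs, labels)
print(f"recall(real) = {recall:.3f}, garbage precision = {precision:.3f}")
```
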
## Usage
```python
import numpy as np
from model2vec import StaticModel
from scipy.special import expit

# Load the classifier head (weights, scaler statistics, threshold)
data = np.load("abstract_archon_head.npz", allow_pickle=True)
coef = data['coef']                  # (1, 512)
intercept = data['intercept']        # (1,)
scaler_mean = data['scaler_mean']    # (512,)
scaler_scale = data['scaler_scale']  # (512,)
threshold = float(data['threshold'][0])  # 0.01

# Load the embedding model
embed = StaticModel.from_pretrained("minishlab/potion-base-32M")

# Predict: the model only sees the first 500 characters
text = "Your abstract text here..."[:500]
emb = embed.encode([text])                      # (1, 512)
x_scaled = (emb - scaler_mean) / scaler_scale   # apply the StandardScaler
logit = x_scaled @ coef.T + intercept
prob = expit(logit).flatten()[0]
is_abstract = prob >= threshold
print(f"P(abstract) = {prob:.4f}, is_abstract = {is_abstract}")
```
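
For a database-scale gate, the same steps can be applied to many texts at once. The sketch below is one possible way to do that, not part of the released code: `score_texts` and `MAX_CHARS` are hypothetical names, and the demo uses a stand-in encoder and a tiny 4-dim head so that it runs without downloading anything. In practice `encode` would be `embed.encode` and the head arrays would come from the NPZ file as in the snippet above.

```python
import numpy as np

MAX_CHARS = 500  # the model only sees the first 500 characters

def score_texts(texts, encode, coef, intercept, scaler_mean, scaler_scale):
    """Return P(abstract) for each text as a 1-D array.

    `encode` is any callable mapping a list of strings to an (n, d) array.
    """
    clipped = [t[:MAX_CHARS] for t in texts]
    emb = np.asarray(encode(clipped), dtype=np.float64)
    x_scaled = (emb - scaler_mean) / scaler_scale
    logits = (x_scaled @ coef.T + intercept).ravel()
    # tanh form of the sigmoid, numerically stable for large |logit|
    return 0.5 * (1.0 + np.tanh(0.5 * logits))

# Stand-in encoder and head, for illustration only
def fake_encode(texts):
    return np.array([[len(t), 0.0, 0.0, 0.0] for t in texts])

probs = score_texts(
    ["", "a" * 1000],
    fake_encode,
    coef=np.array([[1.0, 0.0, 0.0, 0.0]]),
    intercept=np.array([0.0]),
    scaler_mean=np.zeros(4),
    scaler_scale=np.ones(4),
)
print(probs)
```

Comparing the returned array against the threshold (`probs >= threshold`) gives a boolean keep/reject mask for the whole batch.
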
## Architecture
- **Embeddings**: `minishlab/potion-base-32M` (512-dim, static, deterministic, ~20s for 200K docs)
- **Preprocessing**: StandardScaler on embeddings
- **Head**: LogisticRegression (C=0.01, balanced class weights)
- **Input**: First 500 characters of text
- **Threshold**: 0.01 (calibrated for 99.5%+ recall on real abstracts)
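
This card does not describe how the threshold was calibrated. One plausible approach, sketched here purely as an assumption, is to take the scores of known real abstracts and pick the largest threshold that still keeps the target fraction of them. `threshold_for_recall` is a hypothetical helper and the scores are synthetic.

```python
import numpy as np

def threshold_for_recall(pos_probs, target_recall=0.995):
    """Largest threshold t such that at least `target_recall` of the
    positive scores satisfy prob >= t."""
    scores = np.sort(np.asarray(pos_probs, dtype=float))  # ascending
    n = scores.size
    # Number of real abstracts we may reject while still meeting the target
    max_missed = int(np.floor(n * (1.0 - target_recall) + 1e-9))
    return float(scores[max_missed])

# Synthetic scores standing in for P(abstract) on real abstracts
pos_probs = np.linspace(0.001, 1.0, 1000)
t = threshold_for_recall(pos_probs)
kept = int(np.sum(pos_probs >= t))
print(f"threshold = {t:.4f}, recall = {kept / pos_probs.size:.3f}")
```

A low threshold such as 0.01 trades precision on the "abstract" side for recall: the gate rejects only texts the model is very confident are garbage.
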
## NPZ Keys
| Key | Shape | Description |
|-----|-------|-------------|
| `coef` | (1, 512) | LR coefficients |
| `intercept` | (1,) | LR intercept |
| `classes` | (2,) | Class labels [0, 1] |
| `labels` | (2,) | ['garbage', 'abstract'] |
| `scaler_mean` | (512,) | StandardScaler mean |
| `scaler_scale` | (512,) | StandardScaler scale |
| `embed_model` | str | 'minishlab/potion-base-32M' |
| `version` | str | 'v1' |
| `threshold` | (1,) | Calibrated decision threshold |
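
A quick shape check against the table above can catch a corrupted or mismatched head file before it reaches a pipeline. The following is a minimal sketch: `check_head` and `EXPECTED_SHAPES` are hypothetical names, and the demo writes a stand-in file with placeholder values rather than reading the real `abstract_archon_head.npz`.

```python
import os
import tempfile
import numpy as np

EXPECTED_SHAPES = {
    "coef": (1, 512),
    "intercept": (1,),
    "classes": (2,),
    "labels": (2,),
    "scaler_mean": (512,),
    "scaler_scale": (512,),
    "threshold": (1,),
}

def check_head(path):
    """Load the NPZ file; raise ValueError on missing keys or wrong shapes."""
    # allow_pickle=True mirrors the usage snippet; only use it on trusted files
    with np.load(path, allow_pickle=True) as data:
        for key, shape in EXPECTED_SHAPES.items():
            if key not in data.files:
                raise ValueError(f"missing key: {key}")
            if data[key].shape != shape:
                raise ValueError(f"{key}: expected {shape}, got {data[key].shape}")
        return {key: data[key] for key in data.files}

# Stand-in file with placeholder values, for illustration only
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stand_in_head.npz")
    np.savez(
        path,
        coef=np.zeros((1, 512)),
        intercept=np.zeros(1),
        classes=np.array([0, 1]),
        labels=np.array(["garbage", "abstract"]),
        scaler_mean=np.zeros(512),
        scaler_scale=np.ones(512),
        embed_model="minishlab/potion-base-32M",
        version="v1",
        threshold=np.array([0.01]),
    )
    head = check_head(path)

print(sorted(head))
```
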
## Training Data
Trained on ~4,000 examples (2,000 real abstracts + ~2,000 curated non-abstract texts) from a 198M-publication database. See [abstract-archon-data](https://huggingface.co/datasets/jimnoneill/abstract-archon-data) for the full training set.
Negative examples were manually curated to remove misclassifications. Categories include figure/table captions, supplementary material references, author bylines, journal metadata scrapes, HTML-heavy content, MOESM titles, and taxonomy stubs.
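
For readers who want to retrain the head on their own labels, the recipe described in this card (StandardScaler, then LogisticRegression with C=0.01 and balanced class weights, evaluated on a 20% stratified hold-out) could look roughly like this with scikit-learn. It is a sketch under assumptions, not the original training script: the embeddings are random stand-ins for Potion vectors, and `random_state` and `max_iter` are illustrative choices not taken from the source.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in for 512-dim embeddings: two shifted Gaussian clouds
X = np.vstack([
    rng.normal(0.0, 1.0, (200, 512)),  # label 0: garbage
    rng.normal(0.3, 1.0, (200, 512)),  # label 1: abstract
])
y = np.array([0] * 200 + [1] * 200)

# 20% stratified hold-out, as described in the Performance section
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(C=0.01, class_weight="balanced", max_iter=1000)
clf.fit(scaler.transform(X_tr), y_tr)

# Arrays laid out like the corresponding NPZ keys
head = {
    "coef": clf.coef_,                # (1, 512)
    "intercept": clf.intercept_,      # (1,)
    "scaler_mean": scaler.mean_,      # (512,)
    "scaler_scale": scaler.scale_,    # (512,)
}
print("held-out accuracy:", clf.score(scaler.transform(X_te), y_te))
```
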
## Part of PubVerse
This model is part of the [PubVerse](https://github.com/jimnoneill/pubverse) scientific literature analysis pipeline, where it serves as a quality gate before clustering and impact analysis.