jimnoneill/abstract-archon


	---
	license: mit
	tags:
	- abstract-detection
	- scientific-text
	- quality-filtering
	- text-classification
	- pubverse
	- potion-32m
	language:
	- en
	- multilingual
	library_name: model2vec
	pipeline_tag: text-classification
	---

	# Abstract Archon

	Binary classifier that answers: "Is this text a real research abstract?"

	Uses [Potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) static embeddings (512-dim) with a LogisticRegression head. Designed as a quality gate for large-scale scientific publication databases, where abstract fields often contain non-abstract content (figure captions, supplementary material references, author bylines, HTML artifacts, taxonomy stubs).

	## Performance

	| Metric | Value |
	|--------|-------|
	| ROC-AUC | 0.970 |
	| Accuracy | 92% |
	| Recall (real abstracts, t=0.01) | 99.75% |
	| Garbage precision (t=0.01) | 99.0% |

	Evaluated on a 20% held-out stratified test split. A separate validation on 500 random publications from a 198M-paper database confirmed that the model identifies real abstracts with a very low false-negative rate.
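
The two threshold-dependent rows of the table can be reproduced from held-out probabilities with a few lines of NumPy. The sketch below is illustrative only: `gate_metrics`, `toy_probs`, and `toy_labels` are hypothetical names, and it assumes that "garbage precision" means the fraction of texts rejected by the gate that are truly non-abstracts.

```python
import numpy as np

def gate_metrics(probs, labels, threshold=0.01):
    """Recall on real abstracts and precision of the 'garbage' decision.

    probs  : P(abstract) per text
    labels : 1 for a real abstract, 0 for garbage
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)

    kept = probs >= threshold   # texts the gate lets through
    rejected = ~kept            # texts the gate flags as garbage
    real = labels == 1

    # Share of real abstracts that pass the gate.
    recall_real = float((kept & real).sum() / max(real.sum(), 1))
    # Share of rejected texts that really are garbage (assumed definition).
    garbage_precision = float((rejected & ~real).sum() / max(rejected.sum(), 1))
    return {"recall_real": recall_real, "garbage_precision": garbage_precision}

# Toy data, not taken from the model's test split.
toy_probs = [0.90, 0.40, 0.005, 0.002, 0.30, 0.001]
toy_labels = [1, 1, 1, 0, 0, 0]
print(gate_metrics(toy_probs, toy_labels))
```

A threshold as low as 0.01 trades garbage recall for abstract recall: anything the head is not nearly certain about is kept.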

	PMID 39869795 sanity check: P(abstract) = 0.794 (PASS)

	## Usage

	```python
	import numpy as np
	from model2vec import StaticModel
	from scipy.special import expit

	# Load the classifier head (expects abstract_archon_head.npz in the working directory)
	data = np.load("abstract_archon_head.npz", allow_pickle=True)
	coef = data['coef']                  # (1, 512) LR coefficients
	intercept = data['intercept']        # (1,) LR intercept
	scaler_mean = data['scaler_mean']    # (512,) StandardScaler mean
	scaler_scale = data['scaler_scale']  # (512,) StandardScaler scale
	threshold = float(data['threshold'][0])  # 0.01

	# Load embedding model
	embed = StaticModel.from_pretrained("minishlab/potion-base-32M")

	# Predict: the model only sees the first 500 characters
	text = "Your abstract text here..."[:500]
	emb = embed.encode([text])                     # shape (1, 512)
	x_scaled = (emb - scaler_mean) / scaler_scale  # apply the saved StandardScaler
	logit = x_scaled @ coef.T + intercept
	prob = expit(logit).flatten()[0]               # P(abstract)

	is_abstract = prob >= threshold
	print(f"P(abstract) = {prob:.4f}, is_abstract = {is_abstract}")
	```
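
The same arithmetic vectorizes to many texts at once. The following is a minimal sketch, not part of the released code: `score_embeddings` and `sigmoid` are hypothetical helpers, and a toy 4-dimensional head stands in for the real 512-dimensional one so the example runs without downloading anything.

```python
import numpy as np

def sigmoid(z):
    # Logistic function; inputs are clipped to avoid overflow warnings.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

def score_embeddings(emb, coef, intercept, scaler_mean, scaler_scale):
    """Return P(abstract) for each row of `emb` (shape: n_texts x dim)."""
    x_scaled = (np.asarray(emb, dtype=float) - scaler_mean) / scaler_scale
    logits = x_scaled @ coef.T + intercept  # shape (n_texts, 1)
    return sigmoid(logits).ravel()          # shape (n_texts,)

# Toy head with made-up values (the real head has dim = 512).
coef = np.array([[1.0, -1.0, 0.5, 0.0]])
intercept = np.array([0.0])
scaler_mean = np.zeros(4)
scaler_scale = np.ones(4)

# Three fake "embeddings"; with the real model these would come from
# embed.encode([t[:500] for t in texts]).
emb = np.array([[0.0, 0.0, 0.0, 0.0],
                [2.0, 0.0, 0.0, 0.0],
                [0.0, 2.0, 0.0, 0.0]])

probs = score_embeddings(emb, coef, intercept, scaler_mean, scaler_scale)
keep = probs >= 0.01  # boolean mask of texts that pass the gate
print(probs, keep)
```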

	## Architecture

	- Embeddings: `minishlab/potion-base-32M` (512-dim, static, deterministic, ~20s for 200K docs)
	- Preprocessing: StandardScaler on embeddings
	- Head: LogisticRegression (C=0.01, balanced class weights)
	- Input: First 500 characters of text
	- Threshold: 0.01 (calibrated for 99.5%+ recall on real abstracts)
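
One simple way to pick a threshold for a target recall is to take a low quantile of the scores that real abstracts receive. The sketch below shows that idea under stated assumptions: it is not confirmed to be the procedure used for this model, `calibrate_threshold` is a hypothetical helper, and the scores are synthetic.

```python
import numpy as np

def calibrate_threshold(pos_probs, target_recall=0.995):
    """Largest threshold that still keeps at least `target_recall`
    of the positive (real abstract) scores at or above it."""
    scores = np.sort(np.asarray(pos_probs, dtype=float))
    # Number of real abstracts we are allowed to lose.
    allowed_misses = int(np.floor((1.0 - target_recall) * len(scores)))
    # Everything from this sorted position upward is kept.
    return float(scores[allowed_misses])

# Synthetic scores for 2,000 real abstracts (illustrative only).
rng = np.random.default_rng(0)
pos_probs = rng.beta(5, 1, size=2000)

t = calibrate_threshold(pos_probs, target_recall=0.995)
recall = float((pos_probs >= t).mean())
print(f"threshold = {t:.4f}, recall on positives = {recall:.4f}")
```

In practice the threshold would be calibrated on held-out scores rather than on the training set, so that the recall estimate is not optimistic.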

	## NPZ Keys

	| Key | Shape | Description |
	|-----|-------|-------------|
	| `coef` | (1, 512) | LR coefficients |
	| `intercept` | (1,) | LR intercept |
	| `classes` | (2,) | Class labels [0, 1] |
	| `labels` | (2,) | ['garbage', 'abstract'] |
	| `scaler_mean` | (512,) | StandardScaler mean |
	| `scaler_scale` | (512,) | StandardScaler scale |
	| `embed_model` | str | 'minishlab/potion-base-32M' |
	| `version` | str | 'v1' |
	| `threshold` | (1,) | Calibrated decision threshold |
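
A quick shape check against this table can catch a truncated or mismatched head file before it is used. The sketch below is an assumption-laden illustration: `check_head` and `EXPECTED_SHAPES` are hypothetical names, and an in-memory stand-in with zero-valued weights replaces the real file.

```python
import io
import numpy as np

# Array shapes taken from the table above.
EXPECTED_SHAPES = {
    "coef": (1, 512),
    "intercept": (1,),
    "classes": (2,),
    "labels": (2,),
    "scaler_mean": (512,),
    "scaler_scale": (512,),
    "threshold": (1,),
}

def check_head(data):
    """Return a list of problems found in a loaded NPZ file (empty if none)."""
    problems = []
    for key, shape in EXPECTED_SHAPES.items():
        if key not in data.files:
            problems.append(f"missing key: {key}")
        elif data[key].shape != shape:
            problems.append(f"{key}: expected {shape}, got {data[key].shape}")
    return problems

# Build a stand-in file in memory with the same keys (values are placeholders).
buf = io.BytesIO()
np.savez(
    buf,
    coef=np.zeros((1, 512)),
    intercept=np.zeros(1),
    classes=np.array([0, 1]),
    labels=np.array(["garbage", "abstract"]),
    scaler_mean=np.zeros(512),
    scaler_scale=np.ones(512),
    embed_model=np.array("minishlab/potion-base-32M"),
    version=np.array("v1"),
    threshold=np.array([0.01]),
)
buf.seek(0)

data = np.load(buf)
problems = check_head(data)
print(problems or "head file looks consistent")
```

The stand-in stores strings as plain NumPy string arrays, so it loads without `allow_pickle=True`; the published file may differ, which is why the usage snippet passes that flag.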

	## Training Data

	Trained on ~4,000 examples (2,000 real abstracts + ~2,000 curated non-abstract texts) from a 198M-publication database. See [abstract-archon-data](https://huggingface.co/datasets/jimnoneill/abstract-archon-data) for the full training set.

	Negative examples were manually curated to remove misclassifications. Categories include figure/table captions, supplementary material references, author bylines, journal metadata scrapes, HTML-heavy content, MOESM titles, and taxonomy stubs.
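
A head with the hyperparameters listed under Architecture could be fit with scikit-learn roughly as follows. This is a sketch, not the actual training script: the data is synthetic (16-dim Gaussian blobs instead of 512-dim Potion embeddings), and `max_iter`, `random_state`, and the `head` dictionary are assumptions added for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for embeddings: 200 "abstracts" and 200 "garbage" texts.
rng = np.random.default_rng(0)
dim = 16
X = np.vstack([
    rng.normal(0.5, 1.0, size=(200, dim)),   # label 1: abstract
    rng.normal(-0.5, 1.0, size=(200, dim)),  # label 0: garbage
])
y = np.array([1] * 200 + [0] * 200)

# 20% held-out stratified test split, as described under Performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# StandardScaler on embeddings, then a strongly regularized LR head.
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(C=0.01, class_weight="balanced", max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)

test_probs = clf.predict_proba(scaler.transform(X_test))[:, 1]
auc = roc_auc_score(y_test, test_probs)
print(f"ROC-AUC on synthetic data: {auc:.3f}")

# Arrays that map onto the NPZ keys documented above.
head = {
    "coef": clf.coef_,             # (1, dim)
    "intercept": clf.intercept_,   # (1,)
    "classes": clf.classes_,       # [0, 1]
    "scaler_mean": scaler.mean_,   # (dim,)
    "scaler_scale": scaler.scale_, # (dim,)
}
```

Because the embeddings are static, the only trained parameters are the scaler statistics and the logistic regression weights, which is what keeps the released head a small NPZ file.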

	## Part of PubVerse

	This model is part of the [PubVerse](https://github.com/jimnoneill/pubverse) scientific literature analysis pipeline, where it serves as a quality gate before clustering and impact analysis.