arXiv CS Sub-field Linear Probe

A lightweight classifier that maps an arXiv CS abstract to one of six overlapping sub-fields (ai_general, cl_nlp, cv_vision, ir_retrieval, lg_ml, stat_ml). The winning configuration is mpnet_logreg: features from sentence-transformers/all-mpnet-base-v2 fed into a scikit-learn logistic regression classifier.

Training dataset: Tiansve/arxiv-cs-subfield.

Intended use

Educational / research demo of frozen-embedding linear probing for fine-grained text classification. Because the sub-fields overlap, predictions near class boundaries (e.g. ai_general vs lg_ml) are expected to be noisy.

Training details

  • Feature kind: embedding
  • Embedder: sentence-transformers/all-mpnet-base-v2
  • Selection rule: highest validation macro-F1 across 6 configurations.
  • Train/val/test split: stratified 80/10/10, random_state=42.
  • scikit-learn version (at train time): 1.5.0.
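
The split and selection rule listed above could be reproduced roughly as follows. This is a minimal sketch, not the author's training script: the synthetic features stand in for precomputed embeddings, and the two candidate `C` values are hypothetical placeholders for the six real configurations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for precomputed embeddings: 6 classes, 100 samples each.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(6), 100)
X = rng.normal(size=(600, 16)) + np.eye(6)[y] @ rng.normal(size=(6, 16))

# Stratified 80/10/10: hold out 20%, then halve it into val and test.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42
)

# Hypothetical candidates; the card compares six feature/classifier pairs.
candidates = {
    "logreg_c1": LogisticRegression(C=1.0, max_iter=1000),
    "logreg_c10": LogisticRegression(C=10.0, max_iter=1000),
}

# Selection rule: highest validation macro-F1. The test set is not consulted.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = f1_score(y_val, model.predict(X_val), average="macro")

best = max(val_scores, key=val_scores.get)
print(best, round(val_scores[best], 4))
```

Selecting on validation only is what allows the test column in the evaluation table to disagree with the ranking: the test set plays no part in the choice.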

Evaluation

All configurations evaluated on the same held-out test set:

config         val_macro_f1  test_acc  test_macro_f1  test_weighted_f1
mpnet_logreg   0.7581        0.7136    0.7112         0.7116
tfidf_logreg   0.7499        0.7403    0.7390         0.7394
mpnet_linsvm   0.7460        0.7233    0.7200         0.7204
mpnet_mlp      0.7444        0.7282    0.7252         0.7256
minilm_logreg  0.7356        0.7160    0.7124         0.7125
e5_logreg      0.7179        0.7063    0.7058         0.7060
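
For reference, the three test metrics in the table differ only in how per-class scores are aggregated. The sketch below computes them from scratch on toy labels; the labels are invented for illustration, and the helper names `per_class_f1` and `summarize` are hypothetical rather than part of any released code.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating it as the positive label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def summarize(y_true, y_pred):
    """Accuracy, macro-F1 (unweighted mean) and weighted-F1 (by support)."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    f1 = {c: per_class_f1(y_true, y_pred, c) for c in labels}
    n = len(y_true)
    return {
        "acc": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        "macro_f1": sum(f1.values()) / len(labels),
        "weighted_f1": sum(f1[c] * support[c] for c in labels) / n,
    }

# Toy, deliberately imbalanced example (4 vs 2 samples).
y_true = ["cl_nlp", "cl_nlp", "cl_nlp", "cl_nlp", "lg_ml", "lg_ml"]
y_pred = ["cl_nlp", "cl_nlp", "cl_nlp", "lg_ml", "lg_ml", "cl_nlp"]
scores = summarize(y_true, y_pred)
print(scores)
```

On imbalanced toy data the macro and weighted averages diverge. In the table they agree to within about 0.0005, which is consistent with a stratified split over roughly equal-sized classes.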

How to load and run inference

import joblib
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

# Download the classifier and label encoder from the Hub, then load the embedder.
repo = "Tiansve/arxiv-subfield-linear-probe"
clf  = joblib.load(hf_hub_download(repo, "classifier.joblib"))
le   = joblib.load(hf_hub_download(repo, "label_encoder.joblib"))
embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Embed one abstract and rank all six classes by predicted probability.
text = "Title. Abstract text goes here..."
emb  = embedder.encode([text], normalize_embeddings=True)
probs = clf.predict_proba(emb)[0]
ranked = sorted(zip(le.classes_, probs), key=lambda kv: -kv[1])
for label, p in ranked:
    print(f"{label:<14} {p:.3f}")
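
Since predictions are noisy near class boundaries, one option is to flag inputs whose top two probabilities are close. A minimal sketch, assuming a hypothetical helper `top2_margin` and an arbitrary threshold of 0.15 (neither is part of the released artefacts); the probabilities below are made up rather than model output.

```python
import numpy as np

def top2_margin(probs, classes, threshold=0.15):
    """Return (best_label, runner_up_label, margin, is_ambiguous)."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]  # class indices, highest probability first
    first, second = order[0], order[1]
    margin = float(probs[first] - probs[second])
    return classes[first], classes[second], margin, margin < threshold

classes = ["ai_general", "cl_nlp", "cv_vision", "ir_retrieval", "lg_ml", "stat_ml"]
example_probs = [0.34, 0.05, 0.03, 0.04, 0.31, 0.23]  # illustrative values only
result = top2_margin(example_probs, classes)
print(result)
```

With real output, `probs` and `le.classes_` from the snippet above would be passed in place of the illustrative values. The threshold would need tuning on validation data.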

Limitations

  • Single-label simplification of a multi-label problem.
  • No temporal generalisation test: train/val/test sampled from the same window.
  • Small per-class sample (~690 papers/class).
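  • The TF-IDF baseline beats the selected mpnet_logreg on test (macro-F1 0.739 vs 0.7112) despite trailing it on validation (0.7499 vs 0.7581), so the embedding advantage on this task is modest at best; treat the gap as task-dependent rather than universal.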
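
For context on the last point, a TF-IDF plus logistic regression baseline of the kind labelled tfidf_logreg can be assembled in a few lines. This is a sketch under assumptions: the vectorizer settings and the toy corpus are illustrative, not the configuration used for the reported numbers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; real training uses arXiv abstracts.
train_texts = [
    "We fine-tune a language model for machine translation.",
    "A parser for syntactic dependency trees in low-resource languages.",
    "Object detection in images with convolutional networks.",
    "Semantic segmentation of street scenes from video frames.",
]
train_labels = ["cl_nlp", "cl_nlp", "cv_vision", "cv_vision"]

# Assumed settings: sublinear TF and unigram+bigram features.
baseline = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)

pred = baseline.predict(["Image segmentation with convolutional networks."])
print(pred[0])
```

A pipeline like this needs no embedding model at inference time, which makes it a cheap reference point when judging whether frozen embeddings are worth their cost on a given task.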

License

Apache-2.0 for the trained artefacts. The training data itself is subject to arXiv's terms; see the dataset card.

Citation

@misc{arxiv-subfield-probe,
  author = { Tiansve },
  title  = { Tiansve/arxiv-subfield-linear-probe },
  year   = 2026,
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Tiansve/arxiv-subfield-linear-probe}}
}