+---
+language: en
+license: apache-2.0
+library_name: sklearn
+tags:
+- text-classification
+- arxiv
+- scikit-learn
+- sentence-transformers
+datasets:
+- Tiansve/arxiv-cs-subfield
+metrics:
+- f1
+- accuracy
+---
+# arXiv CS Sub-field Linear Probe
+A lightweight classifier that assigns an arXiv CS abstract to one of six
+overlapping sub-fields (`ai_general`, `cl_nlp`, `cv_vision`, `ir_retrieval`, `lg_ml`, `stat_ml`).
+The released configuration is **`mpnet_logreg`**, selected for the highest
+validation macro-F1: embeddings from `sentence-transformers/all-mpnet-base-v2`
+fed into a scikit-learn logistic-regression classifier.
+Training dataset: [Tiansve/arxiv-cs-subfield](https://huggingface.co/datasets/Tiansve/arxiv-cs-subfield).
+## Intended use
+Educational / research demo of frozen-embedding linear probing for fine-grained
+text classification. Because the sub-fields overlap by design, predictions are
+noisy near class boundaries (e.g. `ai_general` vs `lg_ml`).
+## Training details
+- Feature kind: `embedding`
+- Embedder: `sentence-transformers/all-mpnet-base-v2`
+- Selection rule: highest validation macro-F1 across 6 configurations.
+- Train/val/test split: stratified 80/10/10, `random_state=42`.
+- scikit-learn version (at train time): `1.5.0`.
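The stratified 80/10/10 split can be sketched in two steps with scikit-learn's `train_test_split`. This is a minimal illustration, not the author's training script: the toy `texts` and `labels` are placeholders for the real abstracts, and holding out 20% first and then halving it is an assumption about how the split was produced.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data: 6 classes x 50 examples (stand-in for the real abstracts).
classes = ["ai_general", "cl_nlp", "cv_vision", "ir_retrieval", "lg_ml", "stat_ml"]
labels = [c for c in classes for _ in range(50)]
texts = [f"abstract {i}" for i in range(len(labels))]

# Step 1: hold out 20% of the data, preserving class proportions.
X_train, X_hold, y_train, y_hold = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Step 2: split the held-out 20% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(Counter(y_val))
```

Passing `stratify` in both steps keeps every class equally represented in each split, which matters when macro-F1 is the selection metric.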
+## Evaluation
+All six configurations were evaluated on the same held-out test set (rows sorted by validation macro-F1):
+| config        |   val_macro_f1 |   test_acc |   test_macro_f1 |   test_weighted_f1 |
+|:--------------|---------------:|-----------:|----------------:|-------------------:|
+| mpnet_logreg  |         0.7581 |     0.7136 |          0.7112 |             0.7116 |
+| tfidf_logreg  |         0.7499 |     0.7403 |          0.7390 |             0.7394 |
+| mpnet_linsvm  |         0.7460 |     0.7233 |          0.7200 |             0.7204 |
+| mpnet_mlp     |         0.7444 |     0.7282 |          0.7252 |             0.7256 |
+| minilm_logreg |         0.7356 |     0.7160 |          0.7124 |             0.7125 |
+| e5_logreg     |         0.7179 |     0.7063 |          0.7058 |             0.7060 |
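To make the selection rule concrete, the sketch below applies it to the validation and test macro-F1 values from the table. The dictionary layout and variable names are illustrative only and are not taken from the training code.

```python
# (val_macro_f1, test_macro_f1) per configuration, copied from the table.
results = {
    "mpnet_logreg": (0.7581, 0.7112),
    "tfidf_logreg": (0.7499, 0.7390),
    "mpnet_linsvm": (0.7460, 0.7200),
    "mpnet_mlp": (0.7444, 0.7252),
    "minilm_logreg": (0.7356, 0.7124),
    "e5_logreg": (0.7179, 0.7058),
}

# Selection rule: highest validation macro-F1 (test scores are not consulted).
selected = max(results, key=lambda name: results[name][0])

# For comparison only: the configuration that happens to score best on test.
best_on_test = max(results, key=lambda name: results[name][1])

print(f"selected by validation: {selected}")
print(f"best on test:           {best_on_test}")
```

Selecting on validation keeps the test set untouched for reporting, which is why the released model is not the row with the highest test score.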
+## How to load and run inference
+```python
+import joblib
+from huggingface_hub import hf_hub_download
+from sentence_transformers import SentenceTransformer
+
+# Download the fitted classifier and label encoder from the Hub.
+# Unpickling is safest with the scikit-learn version used at train time (1.5.0).
+repo = "Tiansve/arxiv-subfield-linear-probe"
+clf = joblib.load(hf_hub_download(repo, "classifier.joblib"))
+le = joblib.load(hf_hub_download(repo, "label_encoder.joblib"))
+
+# Use the same embedder as at training time.
+embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
+
+text = "Title. Abstract text goes here..."
+emb = embedder.encode([text], normalize_embeddings=True)  # shape (1, 768)
+
+# Class probabilities, printed from most to least likely.
+probs = clf.predict_proba(emb)[0]
+ranked = sorted(zip(le.classes_, probs), key=lambda kv: -kv[1])
+for label, p in ranked:
+    print(f"{label:<14} {p:.3f}")
+```
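Since predictions are expected to be noisy near class boundaries, callers may want to flag low-margin cases rather than trust the top label outright. The helper below is a hypothetical sketch: `flag_ambiguous` and the 0.15 margin are assumptions, not part of the released artefacts, and the probabilities are made-up values standing in for `clf.predict_proba` output.

```python
def flag_ambiguous(labels, probs, margin=0.15):
    """Return (top_label, runner_up, is_ambiguous) for one prediction.

    A prediction is flagged as ambiguous when the gap between the two
    highest probabilities is smaller than `margin` (an arbitrary threshold).
    """
    ranked = sorted(zip(labels, probs), key=lambda kv: -kv[1])
    (top, p1), (second, p2) = ranked[0], ranked[1]
    return top, second, (p1 - p2) < margin


# Made-up probabilities for illustration (not model output).
labels = ["ai_general", "cl_nlp", "cv_vision", "ir_retrieval", "lg_ml", "stat_ml"]
probs = [0.34, 0.05, 0.03, 0.04, 0.41, 0.13]

top, second, ambiguous = flag_ambiguous(labels, probs)
print(top, second, ambiguous)
```

A suitable margin would need to be tuned on validation data; the value here only demonstrates the mechanics.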
+## Limitations
+- Single-label simplification of a multi-label problem.
+- No temporal generalisation test: train/val/test sampled from the same window.
+- Small per-class sample (~690 papers/class).
+- The TF-IDF baseline trails `mpnet_logreg` by under 0.01 macro-F1 on validation
+  and beats it on test (0.7390 vs 0.7112), so the embedding advantage on this
+  task is modest at best; treat the gap as task-dependent rather than universal.
+## License
+Apache-2.0 for the trained artefacts. The training data itself is subject to
+arXiv's terms; see the dataset card.
+## Citation
+```bibtex
+@misc{arxiv-subfield-probe,
+  author = {Tiansve},
+  title  = {Tiansve/arxiv-subfield-linear-probe},
+  year   = 2026,
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/Tiansve/arxiv-subfield-linear-probe}}
+}
+```