---
language: en
license: apache-2.0
library_name: sklearn
tags:
- text-classification
- arxiv
- scikit-learn
- sentence-transformers
datasets:
- Tiansve/arxiv-cs-subfield
metrics:
- f1
- accuracy
---

# arXiv CS Sub-field Linear Probe

A lightweight classifier that maps an arXiv CS abstract to one of six overlapping sub-fields (`ai_general`, `cl_nlp`, `cv_vision`, `ir_retrieval`, `lg_ml`, `stat_ml`). The winning configuration (by validation macro-F1) is **`mpnet_logreg`**: features from `sentence-transformers/all-mpnet-base-v2` fed into a scikit-learn classifier.

Training dataset: [Tiansve/arxiv-cs-subfield](https://huggingface.co/datasets/Tiansve/arxiv-cs-subfield).

## Intended use

Educational / research demo of frozen-embedding linear probing for fine-grained text classification. Predictions are noisy near class boundaries by design (e.g. `ai_general` vs `lg_ml`).

## Training details

- Feature kind: `embedding`
- Embedder: `sentence-transformers/all-mpnet-base-v2`
- Selection rule: highest validation macro-F1 across 6 configurations.
- Train/val/test split: stratified 80/10/10, `random_state=42`.
- scikit-learn version (at train time): `1.5.0`.

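The split and selection rule listed under Training details can be sketched as follows. This is a minimal illustration, not the original training script: the two-step use of `train_test_split`, the synthetic features from `make_classification`, and the two candidate models are all assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for precomputed features and six class labels (assumption).
X, y = make_classification(
    n_samples=1200, n_features=32, n_informative=16,
    n_classes=6, n_clusters_per_class=1, random_state=42,
)

# Step 1: hold out 20% of the data, stratified by label.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Step 2: halve the held-out part into 10% validation and 10% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42
)

# Hypothetical candidate configurations; the card compares six.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "linsvm": LinearSVC(max_iter=5000),
}

# Fit each candidate on train and score it on validation only.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = f1_score(y_val, model.predict(X_val), average="macro")

# Selection rule: highest validation macro-F1. The test set is used once, afterwards.
best_name = max(val_scores, key=val_scores.get)
test_macro_f1 = f1_score(
    y_test, candidates[best_name].predict(X_test), average="macro"
)
print(best_name, round(val_scores[best_name], 4), round(test_macro_f1, 4))
```

Because selection uses only the validation split, the chosen configuration is not guaranteed to have the best test score; the test numbers serve as an unbiased report rather than a selection criterion.
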
## Evaluation

All configurations were evaluated on the same held-out test set. Rows are sorted by validation macro-F1, the selection metric:

| config        | val_macro_f1 | test_acc | test_macro_f1 | test_weighted_f1 |
|:--------------|-------------:|---------:|--------------:|-----------------:|
| mpnet_logreg  |       0.7581 |   0.7136 |        0.7112 |           0.7116 |
| tfidf_logreg  |       0.7499 |   0.7403 |        0.7390 |           0.7394 |
| mpnet_linsvm  |       0.7460 |   0.7233 |        0.7200 |           0.7204 |
| mpnet_mlp     |       0.7444 |   0.7282 |        0.7252 |           0.7256 |
| minilm_logreg |       0.7356 |   0.7160 |        0.7124 |           0.7125 |
| e5_logreg     |       0.7179 |   0.7063 |        0.7058 |           0.7060 |

## How to load and run inference

```python
import joblib
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

repo = "Tiansve/arxiv-subfield-linear-probe"

# Download the fitted classifier and label encoder from the Hub.
clf = joblib.load(hf_hub_download(repo, "classifier.joblib"))
le = joblib.load(hf_hub_download(repo, "label_encoder.joblib"))

# Same embedder as listed under Training details.
embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

text = "Title. Abstract text goes here..."
emb = embedder.encode([text], normalize_embeddings=True)
probs = clf.predict_proba(emb)[0]

# Rank sub-fields by predicted probability, highest first.
ranked = sorted(zip(le.classes_, probs), key=lambda kv: -kv[1])
for label, p in ranked:
    print(f"{label:<14} {p:.3f}")
```

## Limitations

- Single-label simplification of a multi-label problem.
- No temporal generalisation test: train/val/test sampled from the same window.
- Small per-class sample (~690 papers/class).
- The TF-IDF baseline is competitive on test, so the embedding advantage on this task is modest; treat the gap as task-dependent rather than universal.

## License

Apache-2.0 for the trained artefacts. The training data itself is subject to arXiv's terms; see the dataset card.

## Citation

```bibtex
@misc{arxiv-subfield-probe,
  author       = { Tiansve },
  title        = { Tiansve/arxiv-subfield-linear-probe },
  year         = 2026,
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Tiansve/arxiv-subfield-linear-probe}}
}
```