---
language: en
license: apache-2.0
library_name: sklearn
tags:
- text-classification
- arxiv
- scikit-learn
- sentence-transformers
datasets:
- Tiansve/arxiv-cs-subfield
metrics:
- f1
- accuracy
---

# arXiv CS Sub-field Linear Probe

A lightweight classifier that maps an arXiv CS abstract to one of six
overlapping sub-fields (`ai_general`, `cl_nlp`, `cv_vision`, `ir_retrieval`, `lg_ml`, `stat_ml`).
The selected configuration is **`mpnet_logreg`**: embeddings from
`sentence-transformers/all-mpnet-base-v2` fed into a scikit-learn classifier
(logistic regression, going by the configuration name).

Training dataset: [Tiansve/arxiv-cs-subfield](https://huggingface.co/datasets/Tiansve/arxiv-cs-subfield).

## Intended use

Educational / research demo of frozen-embedding linear probing for fine-grained
text classification. Predictions are noisy near class boundaries by design
(e.g. `ai_general` vs `lg_ml`).

## Training details

- Feature kind: `embedding`
- Embedder: `sentence-transformers/all-mpnet-base-v2`
- Selection rule: highest validation macro-F1 across 6 configurations.
- Train/val/test split: stratified 80/10/10, `random_state=42`.
- scikit-learn version (at train time): `1.5.0`.
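
The stratified 80/10/10 split listed above can be reproduced in several ways. The sketch below shows one common approach, two chained calls to `train_test_split`; it is an assumption about the procedure, not the author's confirmed training script, and the toy `texts` and `y` lists are hypothetical stand-ins for the real dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in data: 6 classes x 20 examples each (hypothetical, not the real dataset).
labels = ["ai_general", "cl_nlp", "cv_vision", "ir_retrieval", "lg_ml", "stat_ml"]
texts = [f"abstract {i} about {lab}" for lab in labels for i in range(20)]
y = [lab for lab in labels for _ in range(20)]

# Step 1: hold out 20% of the data, stratified by label.
X_train, X_hold, y_train, y_hold = train_test_split(
    texts, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: split the held-out 20% in half, giving 10% validation and 10% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(Counter(y_val))
```

Stratifying both steps keeps the class proportions the same in all three partitions, which matters when each class has only a few hundred examples.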

## Evaluation

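All configurations were evaluated on the same held-out test set. Rows are sorted by validation macro-F1, the selection criterion; note that `tfidf_logreg`, not the selected model, scores highest on test: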

| config        |   val_macro_f1 |   test_acc |   test_macro_f1 |   test_weighted_f1 |
|:--------------|---------------:|-----------:|----------------:|-------------------:|
| mpnet_logreg  |         0.7581 |     0.7136 |          0.7112 |             0.7116 |
| tfidf_logreg  |         0.7499 |     0.7403 |          0.7390 |             0.7394 |
| mpnet_linsvm  |         0.7460 |     0.7233 |          0.7200 |             0.7204 |
| mpnet_mlp     |         0.7444 |     0.7282 |          0.7252 |             0.7256 |
| minilm_logreg |         0.7356 |     0.7160 |          0.7124 |             0.7125 |
| e5_logreg     |         0.7179 |     0.7063 |          0.7058 |             0.7060 |
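
As a small worked example, the selection rule (highest validation macro-F1) can be applied directly to the numbers in the table. The `results` dictionary and variable names below are illustrative only; the values are copied from the table.

```python
# (val_macro_f1, test_macro_f1) per configuration, copied from the table above.
results = {
    "mpnet_logreg": (0.7581, 0.7112),
    "tfidf_logreg": (0.7499, 0.7390),
    "mpnet_linsvm": (0.7460, 0.7200),
    "mpnet_mlp": (0.7444, 0.7252),
    "minilm_logreg": (0.7356, 0.7124),
    "e5_logreg": (0.7179, 0.7058),
}

# Selection uses validation macro-F1 only; test scores are never consulted.
selected = max(results, key=lambda name: results[name][0])

# For comparison only: the configuration that happens to score best on test.
best_on_test = max(results, key=lambda name: results[name][1])

# Test macro-F1 gap between the best-on-test and the selected configuration.
gap = results[best_on_test][1] - results[selected][1]

print(f"selected by validation: {selected}")
print(f"best on test:           {best_on_test}")
print(f"test macro-F1 gap:      {gap:.4f}")
```

The two choices differ here, which is expected with small validation and test sets: differences of a few points in macro-F1 can fall within sampling noise.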

## How to load and run inference

```python
import joblib
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

# Download the fitted classifier and label encoder from the Hub.
repo = "Tiansve/arxiv-subfield-linear-probe"
clf  = joblib.load(hf_hub_download(repo, "classifier.joblib"))
le   = joblib.load(hf_hub_download(repo, "label_encoder.joblib"))

# The embedder must be the same model that produced the training features.
embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

text = "Title. Abstract text goes here..."
# encode() takes a list of strings and returns one embedding row per input.
emb  = embedder.encode([text], normalize_embeddings=True)
probs = clf.predict_proba(emb)[0]

# Print every label with its probability, most likely first.
ranked = sorted(zip(le.classes_, probs), key=lambda kv: -kv[1])
for label, p in ranked:
    print(f"{label:<14} {p:.3f}")
```
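
Because the sub-fields overlap, it can be more informative to report the top two labels rather than a single prediction. The helper below is a hypothetical sketch, not part of the released artefacts; `top_k_labels` is an illustrative name and the probabilities are made up to keep the example self-contained.

```python
def top_k_labels(classes, probs, k=2):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(zip(classes, probs), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


# Made-up probabilities for one abstract; in practice, pass le.classes_
# and one row of clf.predict_proba(...) from the snippet above.
classes = ["ai_general", "cl_nlp", "cv_vision", "ir_retrieval", "lg_ml", "stat_ml"]
probs = [0.31, 0.05, 0.04, 0.06, 0.42, 0.12]

top = top_k_labels(classes, probs, k=2)
for label, p in top:
    print(f"{label:<14} {p:.3f}")
```

A small gap between the first and second probability is a cheap signal that the abstract sits near a class boundary.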

## Limitations

- Single-label simplification of a multi-label problem.
- No temporal generalisation test: train/val/test sampled from the same window.
- Small per-class sample (~690 papers/class).
- The TF-IDF baseline outperforms the selected embedding model on test
  (macro-F1 0.739 vs. 0.7112), so embeddings show no clear advantage on this
  task; treat the gap as task-dependent rather than universal.

## License

Apache-2.0 for the trained artefacts. The training data itself is subject to
arXiv's terms; see the dataset card.

## Citation

```bibtex
@misc{arxiv-subfield-probe,
  author = { Tiansve },
  title  = { Tiansve/arxiv-subfield-linear-probe },
  year   = 2026,
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Tiansve/arxiv-subfield-linear-probe}}
}
```