Tiansve committed on
Commit 6fefaae · verified · 1 Parent(s): 4440aa2

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +96 -0
README.md ADDED
@@ -0,0 +1,96 @@
+ ---
+ language: en
+ license: apache-2.0
+ library_name: sklearn
+ tags:
+ - text-classification
+ - arxiv
+ - scikit-learn
+ - sentence-transformers
+ datasets:
+ - Tiansve/arxiv-cs-subfield
+ metrics:
+ - f1
+ - accuracy
+ ---
+
+ # arXiv CS Sub-field Linear Probe
+
+ A lightweight classifier that maps an arXiv CS abstract to one of six
+ overlapping sub-fields (`ai_general`, `cl_nlp`, `cv_vision`, `ir_retrieval`, `lg_ml`, `stat_ml`).
+ The selected configuration (highest validation macro-F1) is **`mpnet_logreg`**:
+ features from `sentence-transformers/all-mpnet-base-v2` fed into a scikit-learn classifier.
+
+ Training dataset: [Tiansve/arxiv-cs-subfield](https://huggingface.co/datasets/Tiansve/arxiv-cs-subfield).
+
+ ## Intended use
+
+ Educational / research demo of frozen-embedding linear probing for fine-grained
+ text classification. Because the sub-fields overlap, predictions near class
+ boundaries (e.g. `ai_general` vs `lg_ml`) are expected to be noisy.
+
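One way to act on the boundary noise described above is to flag predictions whose top two class probabilities are close together. The sketch below is a minimal, dependency-free illustration. The helper name `flag_ambiguous`, the example probabilities, and the `margin=0.15` threshold are all assumptions made for illustration and are not part of the released model.

```python
def flag_ambiguous(labels, probs, margin=0.15):
    """Return (top_label, runner_up, is_ambiguous) for a single prediction.

    `margin` is a hypothetical threshold: the prediction is treated as
    ambiguous when the top two probabilities differ by less than it.
    """
    if len(labels) != len(probs) or len(labels) < 2:
        raise ValueError("need one probability per label and at least two labels")
    # Sort (label, probability) pairs from most to least likely.
    ranked = sorted(zip(labels, probs), key=lambda kv: kv[1], reverse=True)
    (top, p_top), (second, p_second) = ranked[0], ranked[1]
    return top, second, (p_top - p_second) < margin


labels = ["ai_general", "cl_nlp", "cv_vision", "ir_retrieval", "lg_ml", "stat_ml"]
# Made-up probabilities for an abstract sitting between ai_general and lg_ml.
example_probs = [0.34, 0.05, 0.04, 0.06, 0.31, 0.20]

result = flag_ambiguous(labels, example_probs)
print(result)  # ('ai_general', 'lg_ml', True)
```

In practice the same helper could be applied to the output of `predict_proba`, with the margin tuned on validation data rather than fixed in advance.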
+ ## Training details
+
+ - Feature kind: `embedding`
+ - Embedder: `sentence-transformers/all-mpnet-base-v2`
+ - Selection rule: highest validation macro-F1 across 6 configurations.
+ - Train/val/test split: stratified 80/10/10, `random_state=42`.
+ - scikit-learn version (at train time): `1.5.0`.
+
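To make the stratified 80/10/10 split concrete, here is a dependency-free sketch of the idea. The card's `random_state=42` suggests a scikit-learn splitter was used, but the card does not show the splitting code, so the function name `stratified_split`, the toy labels, and the rounding behaviour below are assumptions. This sketch will not reproduce the author's exact partition.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):  # fixed class order keeps the split deterministic
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


# Toy data: 20 examples for each of the six sub-fields.
classes = ["ai_general", "cl_nlp", "cv_vision", "ir_retrieval", "lg_ml", "stat_ml"]
toy_labels = [c for c in classes for _ in range(20)]

train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 96 12 12
```

Splitting per class rather than over the whole pool is what keeps every sub-field at the same share of each partition, which matters when per-class counts are small.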
+ ## Evaluation
+
+ All configurations were evaluated on the same held-out test set; rows are sorted by validation macro-F1:
+
+ | config        | val_macro_f1 | test_acc | test_macro_f1 | test_weighted_f1 |
+ |:--------------|-------------:|---------:|--------------:|-----------------:|
+ | mpnet_logreg  |       0.7581 |   0.7136 |        0.7112 |           0.7116 |
+ | tfidf_logreg  |       0.7499 |   0.7403 |        0.7390 |           0.7394 |
+ | mpnet_linsvm  |       0.7460 |   0.7233 |        0.7200 |           0.7204 |
+ | mpnet_mlp     |       0.7444 |   0.7282 |        0.7252 |           0.7256 |
+ | minilm_logreg |       0.7356 |   0.7160 |        0.7124 |           0.7125 |
+ | e5_logreg     |       0.7179 |   0.7063 |        0.7058 |           0.7060 |
+
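The selection rule can be replayed against the reported numbers. The sketch below copies the validation and test macro-F1 columns from the table into a dictionary (the variable names are illustrative assumptions) and shows that choosing by validation score alone picks `mpnet_logreg`, even though a different configuration scores highest on test.

```python
# Macro-F1 scores copied from the evaluation table.
results = {
    "mpnet_logreg":  {"val_macro_f1": 0.7581, "test_macro_f1": 0.7112},
    "tfidf_logreg":  {"val_macro_f1": 0.7499, "test_macro_f1": 0.7390},
    "mpnet_linsvm":  {"val_macro_f1": 0.7460, "test_macro_f1": 0.7200},
    "mpnet_mlp":     {"val_macro_f1": 0.7444, "test_macro_f1": 0.7252},
    "minilm_logreg": {"val_macro_f1": 0.7356, "test_macro_f1": 0.7124},
    "e5_logreg":     {"val_macro_f1": 0.7179, "test_macro_f1": 0.7058},
}

# Model selection looks only at validation scores; test is never consulted.
selected = max(results, key=lambda name: results[name]["val_macro_f1"])

# For comparison only: which configuration happens to score best on test.
best_on_test = max(results, key=lambda name: results[name]["test_macro_f1"])
gap = results[best_on_test]["test_macro_f1"] - results[selected]["test_macro_f1"]

print(selected)       # mpnet_logreg
print(best_on_test)   # tfidf_logreg
print(f"{gap:.4f}")   # 0.0278
```

Picking on validation and reporting test afterwards keeps the test figures an unbiased estimate, at the cost that the selected model is not guaranteed to be the best one on test.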
+ ## How to load and run inference
+
+ ```python
+ import joblib
+ from huggingface_hub import hf_hub_download
+ from sentence_transformers import SentenceTransformer
+
+ repo = "Tiansve/arxiv-subfield-linear-probe"
+ clf = joblib.load(hf_hub_download(repo, "classifier.joblib"))  # trained classifier
+ le = joblib.load(hf_hub_download(repo, "label_encoder.joblib"))  # maps class indices to label names
+ embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
+
+ text = "Title. Abstract text goes here..."
+ emb = embedder.encode([text], normalize_embeddings=True)  # L2-normalised, shape (1, 768)
+ probs = clf.predict_proba(emb)[0]  # one probability per class
+ ranked = sorted(zip(le.classes_, probs), key=lambda kv: -kv[1])  # most likely first
+ for label, p in ranked:
+     print(f"{label:<14} {p:.3f}")
+ ```
+
+ ## Limitations
+
+ - Single-label simplification of a multi-label problem.
+ - No temporal generalisation test: train/val/test are sampled from the same time window.
+ - Small per-class sample (~690 papers/class).
+ - The TF-IDF baseline has the best test scores of all six configurations (test macro-F1
+   0.7390 vs 0.7112 for the selected `mpnet_logreg`), so the embedding advantage seen on validation does not carry over to test; treat the gap as task-dependent rather than universal.
+
+ ## License
+
+ Apache-2.0 for the trained artefacts. The training data itself is subject to
+ arXiv's terms; see the dataset card.
+
+ ## Citation
+
+ ```bibtex
+ @misc{arxiv-subfield-probe,
+   author = {Tiansve},
+   title = {Tiansve/arxiv-subfield-linear-probe},
+   year = 2026,
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/Tiansve/arxiv-subfield-linear-probe}}
+ }
+ ```