	---
	language: en
	license: apache-2.0
	library_name: sklearn
	tags:
	- text-classification
	- arxiv
	- scikit-learn
	- sentence-transformers
	datasets:
	- Tiansve/arxiv-cs-subfield
	metrics:
	- f1
	- accuracy
	---

	# arXiv CS Sub-field Linear Probe

	A lightweight classifier that maps an arXiv CS abstract to one of six
	overlapping sub-fields (`ai_general`, `cl_nlp`, `cv_vision`, `ir_retrieval`, `lg_ml`, `stat_ml`). The selected configuration is
	`mpnet_logreg`: embeddings from `sentence-transformers/all-mpnet-base-v2`
	fed into a scikit-learn logistic regression classifier.

	Training dataset: [Tiansve/arxiv-cs-subfield](https://huggingface.co/datasets/Tiansve/arxiv-cs-subfield).

	## Intended use

	Educational / research demo of frozen-embedding linear probing for fine-grained
	text classification. Predictions are noisy near class boundaries by design
	(e.g. `ai_general` vs `lg_ml`).

	## Training details

	- Feature kind: `embedding`
	- Embedder: `sentence-transformers/all-mpnet-base-v2`
	- Selection rule: highest validation macro-F1 across 6 configurations.
	- Train/val/test split: stratified 80/10/10, `random_state=42`.
	- scikit-learn version (at train time): `1.5.0`.
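
The card does not say how the stratified 80/10/10 split was implemented, so the following is only a dependency-free sketch of the idea. The helper `stratified_split` and the toy label list are hypothetical; in practice scikit-learn's `train_test_split` with its `stratify` argument is the usual tool for this.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists that keep per-class proportions."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Toy labels (hypothetical): 100 examples for each of the six sub-fields.
classes = ["ai_general", "cl_nlp", "cv_vision", "ir_retrieval", "lg_ml", "stat_ml"]
labels = [c for c in classes for _ in range(100)]

train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 480 60 60
```

Splitting inside each class, rather than over the pooled data, is what keeps every sub-field equally represented in the validation and test sets.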

	## Evaluation

	All configurations were evaluated on the same held-out test set; rows are sorted by validation macro-F1:

	| config        | val_macro_f1 | test_acc | test_macro_f1 | test_weighted_f1 |
	|:--------------|-------------:|---------:|--------------:|-----------------:|
	| mpnet_logreg  |       0.7581 |   0.7136 |        0.7112 |           0.7116 |
	| tfidf_logreg  |       0.7499 |   0.7403 |        0.7390 |           0.7394 |
	| mpnet_linsvm  |       0.7460 |   0.7233 |        0.7200 |           0.7204 |
	| mpnet_mlp     |       0.7444 |   0.7282 |        0.7252 |           0.7256 |
	| minilm_logreg |       0.7356 |   0.7160 |        0.7124 |           0.7125 |
	| e5_logreg     |       0.7179 |   0.7063 |        0.7058 |           0.7060 |
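
The selection rule (highest validation macro-F1) can be replayed on the numbers in the table. This is a small sketch, not the author's selection script; the tuple layout and variable names are assumptions made for the example.

```python
# Rows copied from the evaluation table:
# (config, val_macro_f1, test_acc, test_macro_f1)
results = [
    ("mpnet_logreg", 0.7581, 0.7136, 0.7112),
    ("tfidf_logreg", 0.7499, 0.7403, 0.7390),
    ("mpnet_linsvm", 0.7460, 0.7233, 0.7200),
    ("mpnet_mlp", 0.7444, 0.7282, 0.7252),
    ("minilm_logreg", 0.7356, 0.7160, 0.7124),
    ("e5_logreg", 0.7179, 0.7063, 0.7058),
]

# Model selection uses validation macro-F1 only; test scores are never consulted.
selected = max(results, key=lambda row: row[1])[0]

# For comparison, the configuration that happens to score best on test macro-F1.
best_on_test = max(results, key=lambda row: row[3])[0]

print(selected, best_on_test)  # mpnet_logreg tfidf_logreg
```

The two names differ, which is expected when configurations sit this close together: choosing on validation keeps the test numbers an honest estimate, even though the pick is not the top test scorer.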

	## How to load and run inference

	```python
	import joblib
	from huggingface_hub import hf_hub_download
	from sentence_transformers import SentenceTransformer

	# Download the fitted classifier and label encoder from the Hub.
	repo = "Tiansve/arxiv-subfield-linear-probe"
	clf = joblib.load(hf_hub_download(repo, "classifier.joblib"))
	le = joblib.load(hf_hub_download(repo, "label_encoder.joblib"))
	embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

	# Encode the text as an L2-normalised embedding and score every class.
	text = "Title. Abstract text goes here..."
	emb = embedder.encode([text], normalize_embeddings=True)
	probs = clf.predict_proba(emb)[0]

	# Print the labels from most to least probable.
	ranked = sorted(zip(le.classes_, probs), key=lambda kv: -kv[1])
	for label, p in ranked:
	    print(f"{label:<14} {p:.3f}")
	```
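
Because predictions are noisy near class boundaries, it may help to flag inputs whose top two probabilities are close. The sketch below assumes a list of `(label, probability)` pairs like `ranked` in the snippet above; the helper names and the `min_margin=0.10` threshold are hypothetical choices, not part of this repository.

```python
def top_two_margin(scored):
    """Gap between the highest and second-highest probability."""
    probs = sorted((p for _, p in scored), reverse=True)
    return probs[0] - probs[1]


def is_ambiguous(scored, min_margin=0.10):
    """True when the top two classes are closer than min_margin (assumed threshold)."""
    return top_two_margin(scored) < min_margin


# Made-up probabilities for illustration only, not model output.
borderline = [("lg_ml", 0.41), ("ai_general", 0.37), ("stat_ml", 0.12),
              ("cl_nlp", 0.05), ("cv_vision", 0.03), ("ir_retrieval", 0.02)]
clear_cut = [("cv_vision", 0.90), ("lg_ml", 0.04), ("ai_general", 0.03),
             ("cl_nlp", 0.01), ("ir_retrieval", 0.01), ("stat_ml", 0.01)]

print(is_ambiguous(borderline))  # True
print(is_ambiguous(clear_cut))   # False
```

For flagged inputs, reporting the top two labels together is one way to respect the overlap between sub-fields instead of forcing a single answer.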

	## Limitations

	- Single-label simplification of a multi-label problem.
	- No temporal generalisation test: train/val/test sampled from the same window.
	- Small per-class sample (~690 papers/class).
	- The TF-IDF baseline scores highest on the test set (0.7403 accuracy), ahead of
	every embedding configuration, so the embedding advantage seen on validation does
	not carry over to test; treat the gap as task-dependent rather than universal.
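
To get a rough sense of how much the small sample limits these comparisons, the arithmetic below estimates the sampling noise on test accuracy. It assumes that "~690 papers/class" is the per-class total before splitting, so that the 10% test split holds about 414 papers; the card does not state the exact test-set size. The helper `accuracy_half_width` is hypothetical and uses the normal approximation to a binomial proportion.

```python
import math

papers_per_class = 690  # "~690 papers/class" from the list above
n_classes = 6
test_fraction = 0.10    # from the 80/10/10 split

# Assumption: 690 counts all papers of a class, so the test split is about 10% of the total.
n_test = round(papers_per_class * n_classes * test_fraction)


def accuracy_half_width(acc, n, z=1.96):
    """Approximate 95% half-width for an accuracy measured on n examples."""
    return z * math.sqrt(acc * (1.0 - acc) / n)


acc_selected = 0.7136  # mpnet_logreg test accuracy
acc_tfidf = 0.7403     # tfidf_logreg test accuracy

half_width = accuracy_half_width(acc_selected, n_test)
gap = acc_tfidf - acc_selected

print(f"n_test ~ {n_test}, half-width ~ {half_width:.3f}, gap = {gap:.3f}")
```

Under these assumptions the gap between the two configurations is smaller than the half-width of a single accuracy estimate. Both models are scored on the same papers, so a paired test such as McNemar's would be the more appropriate check; this calculation is only an order-of-magnitude guide.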

	## License

	Apache-2.0 for the trained artefacts. The training data itself is subject to
	arXiv's terms; see the dataset card.

	## Citation

	```bibtex
	@misc{arxiv-subfield-probe,
	  author = {Tiansve},
	  title = {Tiansve/arxiv-subfield-linear-probe},
	  year = 2026,
	  publisher = {Hugging Face},
	  howpublished = {\url{https://huggingface.co/Tiansve/arxiv-subfield-linear-probe}}
	}
	```