---
language: en
license: apache-2.0
library_name: sklearn
tags:
- text-classification
- arxiv
- scikit-learn
- sentence-transformers
datasets:
- Tiansve/arxiv-cs-subfield
metrics:
- f1
- accuracy
---

# arXiv CS Sub-field Linear Probe

A lightweight classifier that maps an arXiv CS abstract to one of six
overlapping sub-fields (`ai_general`, `cl_nlp`, `cv_vision`, `ir_retrieval`, `lg_ml`, `stat_ml`).
The selected configuration (highest validation macro-F1) is
**`mpnet_logreg`**: features from `sentence-transformers/all-mpnet-base-v2`
fed into a scikit-learn classifier.

Training dataset: [Tiansve/arxiv-cs-subfield](https://huggingface.co/datasets/Tiansve/arxiv-cs-subfield).

## Intended use

Educational / research demo of frozen-embedding linear probing for fine-grained
text classification. The sub-fields overlap by design, so predictions are noisy
near class boundaries (e.g. `ai_general` vs `lg_ml`).

## Training details

- Feature kind: `embedding`
- Embedder: `sentence-transformers/all-mpnet-base-v2`
- Selection rule: highest validation macro-F1 across 6 configurations.
- Train/val/test split: stratified 80/10/10, `random_state=42`.
- scikit-learn version (at train time): `1.5.0`.

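The split and probe listed above can be sketched roughly as follows. This is a minimal illustration, not the actual training script: random vectors stand in for the frozen `all-mpnet-base-v2` embeddings, the two-step use of `train_test_split` and the `LogisticRegression` settings are assumptions, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 6 classes x 100 samples of 768-dim "embeddings".
# In a real pipeline X would come from the frozen sentence embedder.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(6), 100)
X = rng.normal(size=(len(y), 768)).astype(np.float32)

# Assumed two-step stratified 80/10/10 split with random_state=42.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

# Linear probe: only this classifier is trained; the embedder stays frozen.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Validation macro-F1 is the quantity the selection rule compares.
val_macro_f1 = f1_score(y_val, clf.predict(X_val), average="macro")
print(len(y_train), len(y_val), len(y_test), round(val_macro_f1, 3))
```

On random features the score is near chance; the point is only the shape of the workflow: split once, fit on train, compare configurations on validation, and touch the test split last.
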
## Evaluation

All configurations were evaluated on the same held-out test set. Rows are sorted by validation macro-F1, the selection criterion:

| config        | val_macro_f1 | test_acc | test_macro_f1 | test_weighted_f1 |
|:--------------|-------------:|---------:|--------------:|-----------------:|
| mpnet_logreg  |       0.7581 |   0.7136 |        0.7112 |           0.7116 |
| tfidf_logreg  |       0.7499 |   0.7403 |        0.7390 |           0.7394 |
| mpnet_linsvm  |       0.7460 |   0.7233 |        0.7200 |           0.7204 |
| mpnet_mlp     |       0.7444 |   0.7282 |        0.7252 |           0.7256 |
| minilm_logreg |       0.7356 |   0.7160 |        0.7124 |           0.7125 |
| e5_logreg     |       0.7179 |   0.7063 |        0.7058 |           0.7060 |

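To make the selection rule concrete, the sketch below applies it to the validation and test macro-F1 values copied from the table. The `results` dictionary and the variable names are illustrative only and are not part of the repository.

```python
# (val_macro_f1, test_macro_f1) per configuration, copied from the table above.
results = {
    "mpnet_logreg": (0.7581, 0.7112),
    "tfidf_logreg": (0.7499, 0.7390),
    "mpnet_linsvm": (0.7460, 0.7200),
    "mpnet_mlp": (0.7444, 0.7252),
    "minilm_logreg": (0.7356, 0.7124),
    "e5_logreg": (0.7179, 0.7058),
}

# Selection uses validation scores only; the test set plays no part in it.
selected = max(results, key=lambda name: results[name][0])

# Looking at test scores afterwards shows how well that choice generalised.
best_on_test = max(results, key=lambda name: results[name][1])

print(f"selected by validation: {selected}")
print(f"highest test macro-F1:  {best_on_test}")
```

The two names differ here (`mpnet_logreg` vs `tfidf_logreg`), which is not unusual when validation gaps are small. Choosing by test score instead would leak test information into model selection.
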
## How to load and run inference

```python
import joblib
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

repo = "Tiansve/arxiv-subfield-linear-probe"

# joblib files are pickles: only load them from repositories you trust.
clf = joblib.load(hf_hub_download(repo, "classifier.joblib"))
le = joblib.load(hf_hub_download(repo, "label_encoder.joblib"))

# Same embedder as listed under "Training details".
embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

text = "Title. Abstract text goes here..."
emb = embedder.encode([text], normalize_embeddings=True)

# Class probabilities for the single input, printed highest first.
probs = clf.predict_proba(emb)[0]
ranked = sorted(zip(le.classes_, probs), key=lambda kv: -kv[1])
for label, p in ranked:
    print(f"{label:<14} {p:.3f}")
```

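Because the sub-fields overlap, the top label alone can hide a near-tie. One possible way to surface this is to compare the two highest probabilities in the ranked list. The helper name `top_with_margin` and the 0.15 threshold are hypothetical choices for illustration, not values tuned for or shipped with this model.

```python
def top_with_margin(labels, probs, min_margin=0.15):
    """Return (best_label, runner_up, margin, is_ambiguous) for one prediction.

    `min_margin` is an arbitrary illustrative threshold, not a tuned value.
    """
    ranked = sorted(zip(labels, probs), key=lambda kv: -kv[1])
    (best, p1), (second, p2) = ranked[0], ranked[1]
    margin = p1 - p2
    return best, second, margin, margin < min_margin


# Made-up probabilities for illustration; real ones come from clf.predict_proba.
labels = ["ai_general", "cl_nlp", "cv_vision", "ir_retrieval", "lg_ml", "stat_ml"]
probs = [0.34, 0.05, 0.04, 0.03, 0.31, 0.23]

best, second, margin, ambiguous = top_with_margin(labels, probs)
print(best, second, round(margin, 2), ambiguous)
```

With the real model, `labels` would be `le.classes_` and `probs` the row returned by `clf.predict_proba`. Reporting the runner-up alongside the top label is one way to respect the multi-label nature of the underlying data.
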
## Limitations

- Single-label simplification of a multi-label problem.
- No temporal generalisation test: train/val/test are sampled from the same time window.
- Small per-class sample (~690 papers/class).
- The TF-IDF baseline is competitive: `tfidf_logreg` has the highest test
  macro-F1 (0.739 vs 0.711 for `mpnet_logreg`), so the embedding advantage on
  this task is modest at best. Treat the gap as task-dependent rather than universal.

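As a rough sense of what the small sample implies, the arithmetic below estimates the sampling uncertainty on test accuracy. It assumes about 6 × 690 papers in total with 10% held out (roughly 414 test examples, an estimate derived from the figures in this card rather than a reported number) and uses a simple normal approximation to the binomial.

```python
import math

n_classes = 6
per_class = 690          # approximate, from the limitation above
test_fraction = 0.10     # from the 80/10/10 split
n_test = round(n_classes * per_class * test_fraction)  # estimated, not reported

acc = 0.7136             # test accuracy of mpnet_logreg from the table

# Standard error of a proportion and a 95% normal-approximation half-width.
std_err = math.sqrt(acc * (1 - acc) / n_test)
half_width = 1.96 * std_err

print(f"n_test ~ {n_test}, accuracy = {acc:.4f} +/- {half_width:.3f}")
```

Under these assumptions the half-width is a little over four percentage points, which is larger than the spread in test accuracy across the six configurations (0.7063 to 0.7403). The test ranking should therefore be read with caution. Since all configurations share one test set, a paired comparison would be more precise than this independent-sample estimate.
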
## License

Apache-2.0 for the trained artefacts. The training data itself is subject to
arXiv's terms; see the dataset card.

## Citation

```bibtex
@misc{arxiv-subfield-probe,
  author = {Tiansve},
  title = {Tiansve/arxiv-subfield-linear-probe},
  year = 2026,
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Tiansve/arxiv-subfield-linear-probe}}
}
```