SciBERT (cased) — safetensors mirror for use from R

This is a format-converted mirror of allenai/scibert_scivocab_cased, maintained for teaching a course on transformer-based topic modeling in R.

The model itself is unchanged: same architecture, same weights, same tokenizer, same outputs as the upstream original. What's different is the on-disk format and provenance, both of which matter for a teaching context.

Why this mirror exists

The upstream repo ships pytorch_model.bin (PyTorch pickle format) and vocab.txt. Both are fine for Python users, but they create friction for R users working through the torch (libtorch) and safetensors R packages:

Pickle weights require either the transformers Python library or the R-torch pickle reader, which has known limitations: it cannot remap CUDA-saved tensors to CPU, it can execute arbitrary code on load, and it loads more slowly than safetensors.
tokenizer.json is missing, which forces R code to either depend on a separate WordPiece-tokenization package or to install the Python tokenizers library through reticulate.

This mirror adds:

model.safetensors — the same weights in safetensors format, which is device-agnostic, safe (cannot execute code on load), and faster to read than pickle.

Everything else — config.json, vocab.txt, the model architecture — is identical to the upstream original. The original pytorch_model.bin is preserved alongside the safetensors copy so the repo remains a strict superset of the upstream.
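
Part of why safetensors cannot execute code on load is that the file is only a length-prefixed JSON header followed by raw tensor bytes. The sketch below reads that header with the Python standard library alone. The function names `read_safetensors_header` and `summarize` are illustrative and not part of any package.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as a little-endian unsigned 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON mapping each tensor name to its
        # dtype, shape, and byte offsets into the data section that follows.
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(header):
    """List (name, dtype, shape) per tensor, skipping the optional metadata entry."""
    return [
        (name, info["dtype"], info["shape"])
        for name, info in sorted(header.items())
        if name != "__metadata__"
    ]
```

Calling `summarize(read_safetensors_header("model.safetensors"))` on a local copy of the weights lists every tensor without loading any tensor data and without unpickling anything.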

What it is, briefly

SciBERT is BERT-base trained on a corpus of 1.14 million scientific papers from Semantic Scholar (biomedical and computer science). It uses a specialized vocabulary built from scientific text — terms like protein, algorithm, mitochondria, and gradient are single tokens rather than fragments, which gives the model a meaningful advantage on scientific text compared to general-purpose BERT.
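
To see why a domain vocabulary matters, here is a minimal sketch of the greedy longest-match-first subword step that WordPiece tokenizers apply to each word. The two vocabularies are toy sets invented for illustration, not excerpts from vocab.txt, and the sketch leaves out the whitespace and punctuation splitting that a full BERT tokenizer performs first.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabularies (hypothetical, for illustration only).
general_vocab = {"mit", "##och", "##ond", "##ria"}
science_vocab = general_vocab | {"mitochondria"}

print(wordpiece("mitochondria", general_vocab))
print(wordpiece("mitochondria", science_vocab))
```

With the toy general vocabulary the word is split into four fragments; once the whole word is in the vocabulary it comes back as a single token, which is the effect the scientific vocabulary has on domain terms.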

Property	Value
Architecture	BERT-base
Parameters	~110M
Hidden size	768
Layers	12
Attention heads	12
Vocabulary size	31,116 (cased, scientific)
Max sequence length	512 tokens
Training data	1.14M scientific papers (Semantic Scholar)
Case sensitivity	Cased (preserves capitalization — important for gene names, chemical formulas, acronyms)
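
The ~110M parameter figure can be roughly reproduced from the other rows of the table. The sketch below counts the parameters of a BERT encoder with a pooler and no language-modeling head. The feed-forward size of 3072 and the two token-type embeddings are assumed standard BERT-base values and are not listed in the table; config.json is the authority for both.

```python
def bert_encoder_params(vocab_size, hidden, layers, max_positions,
                        intermediate=3072, type_vocab=2):
    """Count weights and biases of a BERT encoder plus pooler (no MLM head)."""
    # Word, position, and token-type embeddings, plus one LayerNorm (scale + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections with biases.
    ffn = hidden * intermediate + intermediate + intermediate * hidden + hidden
    # Two LayerNorms per layer, each with scale and bias.
    layer_norms = 2 * 2 * hidden
    # Pooler: one dense layer applied to the [CLS] position.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + ffn + layer_norms) + pooler


total = bert_encoder_params(vocab_size=31116, hidden=768, layers=12,
                            max_positions=512)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the count lands just under 110 million, with the word-embedding matrix alone accounting for roughly a fifth of it.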

Important caveat: this is a base BERT, not a sentence-transformer

SciBERT was trained only on masked language modeling. Its token-level representations are excellent on scientific text, but the mean-pooled sentence embeddings cluster poorly — a well-known limitation of base BERT-family models. For sentence-similarity tasks, retrieval, or topic modeling, fine-tuned variants like pritamdeka/S-Scibert-snli-multinli-stsb perform substantially better.

SciBERT itself is best used for: token-level tasks (NER, classification fine-tuning), masked language modeling, or as a starting point for further fine-tuning.
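
For readers new to the term, mean pooling averages the token vectors of a sentence while skipping padding positions. The following is a minimal pure-Python sketch of that operation on nested lists. It illustrates the concept only and is not taken from the R pipeline; the name `mean_pool` is hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token positions with 2-dimensional vectors; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))
```

Every unmasked token contributes equally to the average, so frequent but uninformative tokens weigh as much as content words. That is one intuition for why pooled embeddings from a base model separate sentences less well than those from a model fine-tuned for sentence similarity.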

Usage from R

This mirror is set up to work with a pure-R BERT inference pipeline built on top of the torch (libtorch) R package, with no Python at runtime:

source("bert_r.R")
enc <- load_hf_bert("NetworkIsLife/SciBert_Cased_DAFS")

emb <- embed_texts(enc$model, enc$tokenizer,
                   c("CRISPR-Cas9 enables targeted gene editing.",
                     "Glioblastoma exhibits invasive growth patterns."))
dim(emb)   # 2 x 768

The R loader looks for model.safetensors first and falls back to pytorch_model.bin only if the safetensors file is missing. Because this repo includes model.safetensors, the loader always takes the fast path.
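
The lookup order described above can be sketched in a few lines. This is a Python illustration of the behavior, not the code in bert_r.R, and `pick_weights_file` is a hypothetical name.

```python
from pathlib import Path


def pick_weights_file(model_dir):
    """Prefer model.safetensors; fall back to pytorch_model.bin."""
    model_dir = Path(model_dir)
    # Candidates are tried in priority order; the first existing file wins.
    for name in ("model.safetensors", "pytorch_model.bin"):
        candidate = model_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no weights file found in {model_dir}")
```

Raising an error when neither file exists, rather than returning an empty value, keeps a broken download from surfacing later as a confusing tensor-loading failure.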

For long-term reproducibility in course materials, pin to a specific revision:

enc <- load_hf_bert(
  "NetworkIsLife/SciBert_Cased_DAFS",
  weights_path = hfhub::hub_download(
    "NetworkIsLife/SciBert_Cased_DAFS",
    "model.safetensors",
    revision = "MAIN_COMMIT_HASH_HERE"
  )
)

Replace MAIN_COMMIT_HASH_HERE with the full commit hash of the revision you want to pin, taken from this repo's commit history.
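
Pinning only helps if the revision is an immutable commit hash and not a branch name such as main, which moves whenever the repo changes. As a small sketch, the helper below builds a Hub download URL and rejects anything that is not a full 40-character hexadecimal hash. The function name `pinned_file_url` is hypothetical, and the strict 40-character check is a choice made for this example, not a requirement of the Hub.

```python
import re


def pinned_file_url(repo_id, filename, revision):
    """Build a Hub download URL fixed to one commit."""
    # A full commit hash is 40 lowercase hex characters; branch names and
    # tags are rejected here because they can later point somewhere else.
    if not re.fullmatch(r"[0-9a-f]{40}", revision):
        raise ValueError("revision should be a full 40-character commit hash")
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"
```

A URL built this way keeps returning the same bytes for as long as the commit exists, which is the property course materials need.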

Usage from Python (unchanged from upstream)

from transformers import AutoTokenizer, AutoModel
tok = AutoTokenizer.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
mod = AutoModel.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")

Files in this repo

File	Source	Purpose
`model.safetensors`	converted from upstream `pytorch_model.bin`	model weights, modern format
`pytorch_model.bin`	copied from upstream	model weights, legacy format (kept for compatibility)
`config.json`	copied from upstream	architecture parameters
`vocab.txt`	copied from upstream	WordPiece vocabulary
`README.md`	this file	provenance and usage

Provenance and verification

The model.safetensors file in this repo was produced by Hugging Face's official SFconvertbot (the same automated conversion used across thousands of Hugging Face repos). The conversion is purely a re-serialization: every tensor in the safetensors file is bit-identical to the corresponding tensor in pytorch_model.bin. No re-training, no quantization, no precision loss.

You can verify this yourself in Python:

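import torch
from safetensors.torch import load_file

# weights_only=True restricts unpickling to tensors and plain containers
a = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
b = load_file("model.safetensors")

# Both files must expose exactly the same tensor names
assert set(a.keys()) == set(b.keys())
for k in a:
    # Same dtype, then same shape and values
    assert a[k].dtype == b[k].dtype, f"Dtype mismatch in {k}"
    assert torch.equal(a[k], b[k]), f"Mismatch in {k}"
print("Bit-identical.")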

License and citation

This mirror inherits the upstream license: Apache 2.0. If you use this model in academic work, please cite the original SciBERT paper:

@inproceedings{beltagy-etal-2019-scibert,
  title = "{SciBERT}: A Pretrained Language Model for Scientific Text",
  author = "Beltagy, Iz and Lo, Kyle and Cohan, Arman",
  booktitle = "Proceedings of EMNLP-IJCNLP",
  year = "2019",
  url = "https://www.aclweb.org/anthology/D19-1371"
}

Original model: allenai/scibert_scivocab_cased by the Allen Institute for AI.

Maintenance

This is a teaching artifact for a course on transformer-based topic modeling in R. It will not be updated except to fix conversion errors. For the canonical, maintained version of SciBERT, see the upstream repo.
