SciBERT (cased): safetensors mirror for use from R
This is a format-converted mirror of allenai/scibert_scivocab_cased, maintained for teaching a course on transformer-based topic modeling in R.
The model itself is unchanged: same architecture, same weights, same tokenizer, same outputs as the upstream original. What's different is the on-disk format and provenance, both of which matter for a teaching context.
Why this mirror exists
The upstream repo ships pytorch_model.bin (PyTorch pickle format) and vocab.txt. Both are fine for Python users, but they create friction for R users working through the torch (libtorch) and safetensors R packages:
- Pickle weights require either the transformers Python library or the R-torch pickle reader, which has known limitations (cannot remap CUDA-saved tensors to CPU, executes arbitrary code on load, slower to read than safetensors).
- tokenizer.json is missing, which forces R code to either depend on a separate WordPiece-tokenization package or to install the Python tokenizers library through reticulate.
This mirror adds:
- model.safetensors: the same weights in safetensors format, which is device-agnostic, safe (cannot execute code on load), and faster to read than pickle.
Everything else (config.json, vocab.txt, the model architecture) is identical to the upstream original. The original pytorch_model.bin is preserved alongside the safetensors copy so the repo remains a strict superset of the upstream.
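The "cannot execute code on load" property follows from the file layout: a safetensors file is an 8-byte little-endian header length, a JSON header describing each tensor, and then raw tensor bytes. The sketch below reads only that header with the Python standard library. The function names are illustrative and are not part of the safetensors package or of this repo.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON mapping tensor names to metadata.
        return json.loads(f.read(header_len).decode("utf-8"))


def list_tensors(path):
    """Map each tensor name to its (dtype, shape), skipping the optional metadata entry."""
    header = read_safetensors_header(path)
    return {
        name: (info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }
```

Calling list_tensors("model.safetensors") on a local copy of the weights should list every tensor with its dtype and shape. Nothing in the file is unpickled or evaluated, unlike a pickle-based checkpoint.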
What it is, briefly
SciBERT is BERT-base trained on a corpus of 1.14 million scientific papers from Semantic Scholar (biomedical and computer science). It uses a specialized vocabulary built from scientific text: terms like protein, algorithm, mitochondria, and gradient are single tokens rather than fragments, which gives the model a meaningful advantage on scientific text compared to general-purpose BERT.
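To make "single tokens rather than fragments" concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting in plain Python. Both vocabularies are tiny, made-up examples (they are not the real general-purpose BERT vocabulary or SciBERT's scivocab), and the function name is illustrative.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest-match-first search."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies, for illustration only.
general_vocab = {"mit", "##och", "##ond", "##ria"}
science_vocab = general_vocab | {"mitochondria"}

print(wordpiece("mitochondria", general_vocab))  # ['mit', '##och', '##ond', '##ria']
print(wordpiece("mitochondria", science_vocab))  # ['mitochondria']
```

With the term in the vocabulary, the model sees one embedding for the whole word instead of having to compose its meaning from four fragments.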
| Property | Value |
|---|---|
| Architecture | BERT-base |
| Parameters | ~110M |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocabulary size | 31,116 (cased, scientific) |
| Max sequence length | 512 tokens |
| Training data | 1.14M scientific papers (Semantic Scholar) |
| Case sensitivity | Cased (preserves capitalization, which matters for gene names, chemical formulas, acronyms) |
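The ~110M figure can be checked with back-of-the-envelope arithmetic from the values in the table. This sketch assumes the standard BERT-base feed-forward size of 3072 and two token-type embeddings, neither of which appears in the table, and it counts the encoder plus pooler without the masked-language-modeling head. The number of attention heads does not change the total.

```python
def bert_param_count(vocab_size, hidden, layers, max_positions,
                     intermediate=3072, type_vocab=2):
    """Count parameters of a BERT encoder (embeddings + layers + pooler)."""
    layer_norm = 2 * hidden  # scale and bias
    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_param_count(vocab_size=31116, hidden=768, layers=12, max_positions=512)
print(f"{total:,}")  # 109,938,432
```

Roughly 24M of those parameters sit in the embedding matrix, so the vocabulary size matters more to the total than one might expect.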
Important caveat: this is a base BERT, not a sentence-transformer
SciBERT was trained only on masked language modeling. Its token-level representations are excellent on scientific text, but the mean-pooled sentence embeddings cluster poorly, a well-known limitation of base BERT-family models. For sentence-similarity tasks, retrieval, or topic modeling, fine-tuned variants like pritamdeka/S-Scibert-snli-multinli-stsb perform substantially better.
SciBERT itself is best used for: token-level tasks (NER, classification fine-tuning), masked language modeling, or as a starting point for further fine-tuning.
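For readers new to the term, mean pooling averages the per-token output vectors of a sentence, skipping padding positions, to get one vector per sentence. The sketch below shows the operation in plain Python, with made-up two-dimensional vectors standing in for the 768-dimensional token outputs. The name mean_pool is illustrative and is not a function from this repo's R pipeline.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, ignoring padding (mask value 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # Component-wise mean over the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token positions, the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

The averaging itself is not the problem. The caveat is that base SciBERT was never trained so that these averages land near each other for similar sentences, which is what the fine-tuned sentence-transformer variants add.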
Usage from R
This mirror is set up to work with a pure-R BERT inference pipeline built on top of the torch (libtorch) R package, with no Python at runtime:
```r
source("bert_r.R")
enc <- load_hf_bert("NetworkIsLife/SciBert_Cased_DAFS")
emb <- embed_texts(enc$model, enc$tokenizer,
                   c("CRISPR-Cas9 enables targeted gene editing.",
                     "Glioblastoma exhibits invasive growth patterns."))
dim(emb)  # 2 x 768
```
The R loader looks for model.safetensors first and falls back to pytorch_model.bin if it isn't found. Since this mirror ships the safetensors file, the fast path is the one taken.
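The lookup order described above can be sketched in a few lines of Python. This illustrates the fallback logic only: pick_weights_file is a hypothetical helper, not the actual loader in bert_r.R, whose implementation is not shown here.

```python
from pathlib import Path

# Preferred format first, legacy pickle format as the fallback.
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def pick_weights_file(model_dir):
    """Return the first weights file found in model_dir, in preference order."""
    model_dir = Path(model_dir)
    for name in WEIGHT_FILES:
        candidate = model_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"none of {WEIGHT_FILES} found in {model_dir}"
    )
```

Because this repo keeps both files, a loader written this way would pick model.safetensors here and still work against the upstream repo, which ships only pytorch_model.bin.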
For long-term reproducibility in course materials, pin to a specific revision:
```r
enc <- load_hf_bert(
  "NetworkIsLife/SciBert_Cased_DAFS",
  weights_path = hfhub::hub_download(
    "NetworkIsLife/SciBert_Cased_DAFS",
    "model.safetensors",
    revision = "MAIN_COMMIT_HASH_HERE"
  )
)
```
Replace MAIN_COMMIT_HASH_HERE with the commit hash visible in this repo's commit history.
Usage from Python (unchanged from upstream)
```python
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
mod = AutoModel.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
```
Files in this repo
| File | Source | Purpose |
|---|---|---|
| model.safetensors | converted from upstream pytorch_model.bin | model weights, modern format |
| pytorch_model.bin | copied from upstream | model weights, legacy format (kept for compatibility) |
| config.json | copied from upstream | architecture parameters |
| vocab.txt | copied from upstream | WordPiece vocabulary |
| README.md | this file | provenance and usage |
Provenance and verification
The model.safetensors file in this repo was produced by HuggingFace's official SFconvertbot (the same automated conversion used across thousands of HuggingFace repos). The conversion is purely a re-serialization: every tensor in the safetensors file is bit-identical to the corresponding tensor in pytorch_model.bin. No re-training, no quantization, no precision loss.
You can verify this yourself in Python:
```python
import torch
from safetensors.torch import load_file

# Load both copies of the weights onto the CPU.
a = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
b = load_file("model.safetensors")

# Same tensor names, and every tensor equal element for element.
assert set(a.keys()) == set(b.keys())
for k in a:
    assert torch.equal(a[k], b[k]), f"Mismatch in {k}"
print("Bit-identical.")
```
License and citation
This mirror inherits the upstream license: Apache 2.0. If you use this model in academic work, please cite the original SciBERT paper:
```bibtex
@inproceedings{beltagy-etal-2019-scibert,
  title = "{SciBERT}: A Pretrained Language Model for Scientific Text",
  author = "Beltagy, Iz and Lo, Kyle and Cohan, Arman",
  booktitle = "Proceedings of EMNLP-IJCNLP",
  year = "2019",
  url = "https://www.aclweb.org/anthology/D19-1371"
}
```
Original model: allenai/scibert_scivocab_cased by the Allen Institute for AI.
Maintenance
This is a teaching artifact for a course on transformer-based topic modeling in R. It will not be updated except to fix conversion errors. For the canonical, maintained version of SciBERT, see the upstream repo.