---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- bert
- scibert
- scientific-text
- mirror
- r-compatible
base_model: allenai/scibert_scivocab_cased
pipeline_tag: feature-extraction
---
# SciBERT (cased) — safetensors mirror for use from R
This is a format-converted mirror of [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased), maintained for teaching a course on transformer-based topic modeling in R.
The model itself is unchanged: same architecture, same weights, same tokenizer, same outputs as the upstream original. What's different is the on-disk format and provenance, both of which matter for a teaching context.
## Why this mirror exists
The upstream repo ships `pytorch_model.bin` (PyTorch pickle format) and `vocab.txt`. Both are fine for Python users, but they create friction for R users working through the `torch` (libtorch) and `safetensors` R packages:
- **Pickle weights** require either the `transformers` Python library or the R-torch pickle reader, which has known limitations: it cannot remap CUDA-saved tensors to CPU, and it is slower to read than safetensors. Pickle files can also execute arbitrary code when loaded.
- **`tokenizer.json` is missing**, which forces R code to either depend on a separate WordPiece-tokenization package or to install the Python `tokenizers` library through `reticulate`.
This mirror adds:
- `model.safetensors` — the same weights in [safetensors](https://huggingface.co/docs/safetensors) format, which is device-agnostic, safe (cannot execute code on load), and faster to read than pickle.
Everything else — `config.json`, `vocab.txt`, the model architecture — is identical to the upstream original. The original `pytorch_model.bin` is preserved alongside the safetensors copy so the repo remains a strict superset of the upstream.
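To see why a safetensors file cannot run code on load, it helps to look at the layout: an 8-byte little-endian header length, a JSON header describing each tensor (dtype, shape, byte offsets), then the raw tensor bytes. The sketch below is illustrative only. It hand-writes a tiny file holding one made-up tensor (`demo.weight`) and reads the header back with the standard library; the helper names are hypothetical and not part of the `safetensors` package.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_tiny_safetensors(path, name, values):
    """Write one float32 vector in the safetensors layout (illustration only)."""
    data = struct.pack(f"<{len(values)}f", *values)  # raw little-endian float32 bytes
    header = {
        name: {"dtype": "F32", "shape": [len(values)], "data_offsets": [0, len(data)]}
    }
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as fh:
        fh.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        fh.write(header_bytes)                          # JSON metadata
        fh.write(data)                                  # tensor payload


def read_safetensors_header(path):
    """Return the JSON header; reading it involves no unpickling."""
    with open(path, "rb") as fh:
        (header_len,) = struct.unpack("<Q", fh.read(8))
        return json.loads(fh.read(header_len))


# Throwaway demo file, not a real model.
demo_path = Path(tempfile.mkdtemp()) / "demo.safetensors"
write_tiny_safetensors(demo_path, "demo.weight", [1.0, 2.0, 3.0])
header = read_safetensors_header(demo_path)
print(header["demo.weight"])
```

Pointing `read_safetensors_header` at a downloaded `model.safetensors` should list the BERT tensors in the same way, without loading any weights.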
## What it is, briefly
SciBERT is BERT-base trained on a corpus of 1.14 million scientific papers from Semantic Scholar (biomedical and computer science). It uses a specialized vocabulary built from scientific text — terms like `protein`, `algorithm`, `mitochondria`, and `gradient` are single tokens rather than fragments, which gives the model a meaningful advantage on scientific text compared to general-purpose BERT.
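The effect of a domain vocabulary can be illustrated with a minimal sketch of greedy longest-match-first WordPiece splitting. The two vocabularies here are invented toy sets, not the contents of `vocab.txt`, and the function is a simplified stand-in for a real tokenizer (no lowercasing, punctuation handling, or maximum word length).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shrink the candidate and try again
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies, invented for illustration (not the real vocab.txt contents).
general_vocab = {"mit", "##och", "##ond", "##ria"}
science_vocab = general_vocab | {"mitochondria"}

fragments = wordpiece("mitochondria", general_vocab)
single = wordpiece("mitochondria", science_vocab)
print(fragments)
print(single)
```

With the toy general vocabulary the word is split into four pieces; once the full term is in the vocabulary it stays a single token, so the model sees one embedding for the concept instead of assembling it from fragments.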
| Property | Value |
|----------|-------|
| Architecture | BERT-base |
| Parameters | ~110M |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocabulary size | 31,116 (cased, scientific) |
| Max sequence length | 512 tokens |
| Training data | 1.14M scientific papers (Semantic Scholar) |
| Case sensitivity | **Cased** (preserves capitalization — important for gene names, chemical formulas, acronyms) |
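As a rough check on the parameter count, the figures in the table can be combined into a back-of-the-envelope total for the encoder. This is a sketch under assumptions: the feed-forward size (3072) and the number of segment embeddings (2) are the standard BERT-base values and are not stated in this card, and the count covers the encoder plus pooler only, without a masked-language-modeling head.

```python
# Values from the table above.
hidden = 768
layers = 12
vocab_size = 31_116
max_positions = 512

# Assumed standard BERT-base values (not stated in this card).
intermediate = 3072  # feed-forward inner size
type_vocab = 2       # segment (token type) embeddings

layer_norm = 2 * hidden  # scale + bias

# Word, position, and segment embeddings, followed by one LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
# Query, key, value, and output projections (weights + biases), plus LayerNorm.
attention = 4 * (hidden * hidden + hidden) + layer_norm
# Two dense layers (hidden -> intermediate -> hidden), plus LayerNorm.
feed_forward = (
    (hidden * intermediate + intermediate)
    + (intermediate * hidden + hidden)
    + layer_norm
)
# Dense layer applied to the [CLS] position.
pooler = hidden * hidden + hidden

total = embeddings + layers * (attention + feed_forward) + pooler
print(f"{total:,}")
```

Under these assumptions the total comes to about 109.9 million, which matches the "~110M" in the table. Roughly 24 million of that sits in the embedding matrices, so the vocabulary size has a visible effect on model size.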
## Important caveat: this is a base BERT, not a sentence-transformer
SciBERT was pretrained with BERT's original objectives (masked language modeling and next-sentence prediction), not with a sentence-similarity objective. Its token-level representations work well on scientific text, but the **mean-pooled sentence embeddings cluster poorly**, a well-known limitation of base BERT-family models. For sentence-similarity tasks, retrieval, or topic modeling, fine-tuned variants like [`pritamdeka/S-Scibert-snli-multinli-stsb`](https://huggingface.co/pritamdeka/S-Scibert-snli-multinli-stsb) perform substantially better.
SciBERT itself is best used for token-level tasks (such as NER), fine-tuning for classification, masked language modeling, or as a starting point for further fine-tuning.
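For readers new to the term, mean pooling turns per-token vectors into one sentence vector by averaging over the non-padding positions. The sketch below shows the operation on toy two-dimensional vectors in plain Python; the function name is hypothetical, and real SciBERT outputs have 768 dimensions per token.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors over positions where attention_mask is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy 2-dimensional "token embeddings"; the last position is padding.
tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

The padding vector is excluded, so only the first two tokens contribute. Nothing in the pretraining described above pushes this average to reflect sentence meaning, which is why the sentence-transformer variants are trained with an explicit similarity objective.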
## Usage from R
This mirror is set up to work with a pure-R BERT inference pipeline built on top of the `torch` (libtorch) R package, with no Python at runtime:
```r
source("bert_r.R")
enc <- load_hf_bert("NetworkIsLife/SciBert_Cased_DAFS")
emb <- embed_texts(enc$model, enc$tokenizer,
                   c("CRISPR-Cas9 enables targeted gene editing.",
                     "Glioblastoma exhibits invasive growth patterns."))
dim(emb) # 2 x 768
```
The R loader looks for `model.safetensors` first (the file this mirror adds) and falls back to `pytorch_model.bin` if it isn't found. Since the safetensors file is present, the loader takes the fast path.
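The lookup order described above can be sketched in a few lines. This is a Python illustration of the idea, not the code in `bert_r.R`; the function name `pick_weights_file` and the demo directory are hypothetical.

```python
import tempfile
from pathlib import Path

# Preference order: safetensors first, pickle as the fallback.
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def pick_weights_file(model_dir):
    """Return the first preferred weights file that exists in model_dir."""
    for name in WEIGHT_FILES:
        candidate = Path(model_dir) / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no weights file found in {model_dir}")


# Demo on a throwaway directory with empty placeholder files.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "pytorch_model.bin").touch()
only_pickle = pick_weights_file(demo_dir).name   # fallback is used
(demo_dir / "model.safetensors").touch()
both_present = pick_weights_file(demo_dir).name  # safetensors wins
print(only_pickle, both_present)
```

Keeping the preference order in a single tuple makes the fallback explicit and easy to change if another format is added later.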
For long-term reproducibility in course materials, pin to a specific revision:
```r
enc <- load_hf_bert(
  "NetworkIsLife/SciBert_Cased_DAFS",
  weights_path = hfhub::hub_download(
    "NetworkIsLife/SciBert_Cased_DAFS",
    "model.safetensors",
    revision = "MAIN_COMMIT_HASH_HERE"
  )
)
```
Replace `MAIN_COMMIT_HASH_HERE` with the commit hash visible in this repo's commit history.
## Usage from Python (unchanged from upstream)
```python
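from transformers import AutoTokenizer, AutoModel

repo = "NetworkIsLife/SciBert_Cased_DAFS"
# No tokenizer_config.json is listed for this repo, and BertTokenizer lowercases
# input by default, so request cased tokenization explicitly.
tok = AutoTokenizer.from_pretrained(repo, do_lower_case=False)
mod = AutoModel.from_pretrained(repo)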
```
## Files in this repo
| File | Source | Purpose |
|------|--------|---------|
| `model.safetensors` | converted from upstream `pytorch_model.bin` | model weights, modern format |
| `pytorch_model.bin` | copied from upstream | model weights, legacy format (kept for compatibility) |
| `config.json` | copied from upstream | architecture parameters |
| `vocab.txt` | copied from upstream | WordPiece vocabulary |
| `README.md` | this file | provenance and usage |
## Provenance and verification
The `model.safetensors` file in this repo was produced by Hugging Face's official `SFconvertbot` (the same automated conversion used across thousands of Hugging Face repos). The conversion is purely a re-serialization: every tensor in the safetensors file is bit-identical to the corresponding tensor in `pytorch_model.bin`. No re-training, no quantization, no precision loss.
You can verify this yourself in Python:
```python
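import torch
from safetensors.torch import load_file

# Load both serializations onto the CPU; weights_only=True restricts what unpickling may do.
a = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
b = load_file("model.safetensors")

# Same tensor names, and every tensor equal element for element.
assert set(a.keys()) == set(b.keys())
for k in a:
    assert torch.equal(a[k], b[k]), f"Mismatch in {k}"
print("Bit-identical.")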
```
## License and citation
This mirror inherits the upstream license: **Apache 2.0**. If you use this model in academic work, please cite the original SciBERT paper:
```bibtex
@inproceedings{beltagy-etal-2019-scibert,
  title     = "{SciBERT}: A Pretrained Language Model for Scientific Text",
  author    = "Beltagy, Iz and Lo, Kyle and Cohan, Arman",
  booktitle = "Proceedings of EMNLP-IJCNLP",
  year      = "2019",
  url       = "https://www.aclweb.org/anthology/D19-1371"
}
```
Original model: [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased) by the Allen Institute for AI.
## Maintenance
This is a teaching artifact for a course on transformer-based topic modeling in R. It will not be updated except to fix conversion errors. For the canonical, maintained version of SciBERT, see the [upstream repo](https://huggingface.co/allenai/scibert_scivocab_cased).