---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- bert
- scibert
- scientific-text
- mirror
- r-compatible
base_model: allenai/scibert_scivocab_cased
pipeline_tag: feature-extraction
---

# SciBERT (cased) — safetensors mirror for use from R

This is a format-converted mirror of [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased), maintained for teaching a course on transformer-based topic modeling in R.

The model itself is unchanged: same architecture, same weights, same tokenizer, same outputs as the upstream original. What's different is the on-disk format and provenance, both of which matter for a teaching context.

## Why this mirror exists

The upstream repo ships `pytorch_model.bin` (PyTorch pickle format) and `vocab.txt`. Both are fine for Python users, but they create friction for R users working through the `torch` (libtorch) and `safetensors` R packages:

- **Pickle weights** require either the `transformers` Python library or the R-torch pickle reader, which has known limitations (cannot remap CUDA-saved tensors to CPU, executes arbitrary code on load, slower to read than safetensors).
- **`tokenizer.json` is missing**, which forces R code to either depend on a separate WordPiece-tokenization package or to install the Python `tokenizers` library through `reticulate`.

This mirror adds:

- `model.safetensors` — the same weights in [safetensors](https://huggingface.co/docs/safetensors) format, which is device-agnostic, safe (cannot execute code on load), and faster to read than pickle.

Everything else — `config.json`, `vocab.txt`, the model architecture — is identical to the upstream original. The original `pytorch_model.bin` is preserved alongside the safetensors copy so the repo remains a strict superset of the upstream.
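
To make the "cannot execute code on load" point concrete, here is a minimal sketch of the safetensors layout: an 8-byte little-endian header length, a JSON header, then raw tensor bytes. It uses only the Python standard library. The helper names and the toy tensor are invented for illustration and are not part of this repo or of the `safetensors` package.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_toy_safetensors(path, tensors):
    """Write {name: (dtype, shape, raw_bytes)} in the safetensors layout."""
    header = {}
    offset = 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        offset += len(raw)
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte little-endian length
        f.write(header_bytes)                          # JSON description of tensors
        for _, _, raw in tensors.values():
            f.write(raw)                               # raw tensor bytes, back to back


def read_safetensors_header(path):
    """Return the JSON header without reading any tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


# Toy file: one 2x2 float32 tensor; the tensor name is made up.
toy = {"toy.weight": ("F32", [2, 2], struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))}
toy_path = Path(tempfile.mkdtemp()) / "toy.safetensors"
write_toy_safetensors(toy_path, toy)

header = read_safetensors_header(toy_path)
for name, info in header.items():
    print(name, info["dtype"], info["shape"], info["data_offsets"])
# prints: toy.weight F32 [2, 2] [0, 16]
```

Reading the file involves parsing JSON and slicing bytes, nothing more, which is why a loader never has to run code stored in the file. The same `read_safetensors_header` sketch should also work on a real `model.safetensors`, whose header may additionally contain a `__metadata__` entry.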

## What it is, briefly

SciBERT is BERT-base trained on a corpus of 1.14 million scientific papers from Semantic Scholar (biomedical and computer science). It uses a specialized vocabulary built from scientific text — terms like `protein`, `algorithm`, `mitochondria`, and `gradient` are single tokens rather than fragments, which gives the model a meaningful advantage on scientific text compared to general-purpose BERT.

| Property | Value |
|----------|-------|
| Architecture | BERT-base |
| Parameters | ~110M |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocabulary size | 31,116 (cased, scientific) |
| Max sequence length | 512 tokens |
| Training data | 1.14M scientific papers (Semantic Scholar) |
| Case sensitivity | **Cased** (preserves capitalization — important for gene names, chemical formulas, acronyms) |
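
The "single tokens rather than fragments" effect follows from how WordPiece splits a word: greedy longest-match-first against the vocabulary. The sketch below illustrates that rule with two tiny vocabularies that are invented for the example; they are not taken from `vocab.txt`, and the actual splits produced by any real tokenizer may differ.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Both vocabularies are hypothetical, chosen only to show the contrast.
general_vocab = {"mit", "##och", "##ond", "##ria", "protein"}
science_vocab = general_vocab | {"mitochondria"}

print(wordpiece("mitochondria", general_vocab))  # ['mit', '##och', '##ond', '##ria']
print(wordpiece("mitochondria", science_vocab))  # ['mitochondria']
```

With the whole word in the vocabulary, the model sees one embedding for the term instead of having to compose its meaning from four fragments.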

## Important caveat: this is a base BERT, not a sentence-transformer

SciBERT was pretrained with BERT's self-supervised objectives (masked language modeling and next-sentence prediction), not with a sentence-similarity objective. Its token-level representations are strong on scientific text, but the **mean-pooled sentence embeddings cluster poorly**, a well-known limitation of base BERT-family models. For sentence similarity, retrieval, or topic modeling, fine-tuned variants like [`pritamdeka/S-Scibert-snli-multinli-stsb`](https://huggingface.co/pritamdeka/S-Scibert-snli-multinli-stsb) perform substantially better.

SciBERT itself is best used for token-level tasks such as NER, for fine-tuning on classification, for masked language modeling, or as a starting point for further fine-tuning.
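
For reference, mean pooling is the operation that turns per-token vectors into one sentence vector: average the hidden states of the real tokens and skip the padding. The sketch below uses plain Python lists and made-up 3-dimensional vectors in place of 768-dimensional hidden states. Whether a given pipeline pools exactly this way is an assumption, and `mean_pool` is a hypothetical name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Made-up vectors standing in for the model's last hidden states.
tokens = [[1.0, 0.0, 2.0],   # real token
          [3.0, 4.0, 0.0],   # real token
          [9.0, 9.0, 9.0]]   # padding, must not affect the result
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 2.0, 1.0]
```

The averaging itself is unproblematic. The limitation described above comes from the token vectors, which were never trained so that their mean reflects sentence meaning.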

## Usage from R

This mirror is set up to work with a pure-R BERT inference pipeline built on top of the `torch` (libtorch) R package, with no Python at runtime:

```r
source("bert_r.R")
enc <- load_hf_bert("NetworkIsLife/SciBert_Cased_DAFS")

emb <- embed_texts(enc$model, enc$tokenizer,
                   c("CRISPR-Cas9 enables targeted gene editing.",
                     "Glioblastoma exhibits invasive growth patterns."))
dim(emb)   # 2 x 768
```

The R loader looks for `model.safetensors` first (the file this mirror adds) and falls back to `pytorch_model.bin` only if it isn't found. Because the safetensors file is present in this repo, the loader takes the fast path.
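
The lookup order described above can be sketched as follows. This is an illustration in Python, not the code in `bert_r.R`, whose source is not shown here. `pick_weights_file` is a hypothetical helper, and the empty files only simulate a downloaded snapshot.

```python
import tempfile
from pathlib import Path


def pick_weights_file(repo_dir):
    """Return the preferred weights file in repo_dir (hypothetical helper)."""
    repo_dir = Path(repo_dir)
    # Order matters: safetensors is tried before the legacy pickle file.
    for name in ("model.safetensors", "pytorch_model.bin"):
        candidate = repo_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no weights file found in {repo_dir}")


# Simulate a local snapshot with empty placeholder files.
snapshot = Path(tempfile.mkdtemp())
(snapshot / "pytorch_model.bin").touch()
only_bin = pick_weights_file(snapshot).name   # fallback path

(snapshot / "model.safetensors").touch()
both = pick_weights_file(snapshot).name       # preferred path

print(only_bin, both)  # pytorch_model.bin model.safetensors
```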

For long-term reproducibility in course materials, pin to a specific revision:

```r
enc <- load_hf_bert(
  "NetworkIsLife/SciBert_Cased_DAFS",
  weights_path = hfhub::hub_download(
    "NetworkIsLife/SciBert_Cased_DAFS",
    "model.safetensors",
    revision = "MAIN_COMMIT_HASH_HERE"
  )
)
```

Replace `MAIN_COMMIT_HASH_HERE` with the commit hash visible in this repo's commit history.

## Usage from Python (unchanged from upstream)

```python
from transformers import AutoTokenizer, AutoModel
tok = AutoTokenizer.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
mod = AutoModel.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
```

## Files in this repo

| File | Source | Purpose |
|------|--------|---------|
| `model.safetensors` | converted from upstream `pytorch_model.bin` | model weights, modern format |
| `pytorch_model.bin` | copied from upstream | model weights, legacy format (kept for compatibility) |
| `config.json` | copied from upstream | architecture parameters |
| `vocab.txt` | copied from upstream | WordPiece vocabulary |
| `README.md` | this file | provenance and usage |

## Provenance and verification

The `model.safetensors` file in this repo was produced by HuggingFace's official `SFconvertbot` (the same automated conversion used across thousands of HuggingFace repos). The conversion is purely a re-serialization — every tensor in the safetensors file is bit-identical to the corresponding tensor in `pytorch_model.bin`. No re-training, no quantization, no precision loss.

You can verify this yourself in Python:

```python
import torch
from safetensors.torch import load_file

a = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
b = load_file("model.safetensors")
assert set(a.keys()) == set(b.keys())
for k in a:
    assert torch.equal(a[k], b[k]), f"Mismatch in {k}"
print("Bit-identical.")
```

## License and citation

This mirror inherits the upstream license: **Apache 2.0**. If you use this model in academic work, please cite the original SciBERT paper:

```bibtex
@inproceedings{beltagy-etal-2019-scibert,
  title = "{SciBERT}: A Pretrained Language Model for Scientific Text",
  author = "Beltagy, Iz and Lo, Kyle and Cohan, Arman",
  booktitle = "Proceedings of EMNLP-IJCNLP",
  year = "2019",
  url = "https://www.aclweb.org/anthology/D19-1371"
}
```

Original model: [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased) by the Allen Institute for AI.

## Maintenance

This is a teaching artifact for a course on transformer-based topic modeling in R. It will not be updated except to fix conversion errors. For the canonical, maintained version of SciBERT, see the [upstream repo](https://huggingface.co/allenai/scibert_scivocab_cased).