NetworkIsLife committed on
Commit
dd83bd2
·
verified ·
1 Parent(s): 401bf57

Update README.md

Files changed (1)
  1. README.md +139 -0
README.md CHANGED
@@ -1,3 +1,142 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - en
5
+ library_name: transformers
6
+ tags:
7
+ - bert
8
+ - scibert
9
+ - scientific-text
10
+ - mirror
11
+ - r-compatible
12
+ base_model: allenai/scibert_scivocab_cased
13
+ pipeline_tag: feature-extraction
14
  ---
15
+
16
+ # SciBERT (cased) — safetensors mirror for use from R
17
+
18
+ This is a format-converted mirror of [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased), maintained for teaching a course on transformer-based topic modeling in R.
19
+
20
+ The model itself is unchanged: same architecture, same weights, same tokenizer, same outputs as the upstream original. What's different is the on-disk format and provenance, both of which matter for a teaching context.
21
+
22
+ ## Why this mirror exists
23
+
24
+ The upstream repo ships `pytorch_model.bin` (PyTorch pickle format) and `vocab.txt`. Both are fine for Python users, but they create friction for R users working through the `torch` (libtorch) and `safetensors` R packages:
25
+
26
+ - **Pickle weights** require either the `transformers` Python library or the R-torch pickle reader, which has known limitations (cannot remap CUDA-saved tensors to CPU, executes arbitrary code on load, slower to read than safetensors).
27
+ - **`tokenizer.json` is missing**, which forces R code to either depend on a separate WordPiece-tokenization package or to install the Python `tokenizers` library through `reticulate`.
28
+
29
+ This mirror adds:
30
+
31
+ - `model.safetensors` — the same weights in [safetensors](https://huggingface.co/docs/safetensors) format, which is device-agnostic, safe (cannot execute code on load), and faster to read than pickle.
32
+
33
+ Everything else — `config.json`, `vocab.txt`, the model architecture — is identical to the upstream original. The original `pytorch_model.bin` is preserved alongside the safetensors copy so the repo remains a strict superset of the upstream.
34
+
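To make the "cannot execute code on load" property concrete, here is a minimal sketch that reads only the metadata of a safetensors file with the Python standard library. It relies on the published safetensors layout (an 8-byte little-endian header length, then a JSON header, then raw tensor bytes). The function name `read_safetensors_header` and the tiny in-memory demo file are illustrative assumptions, not part of this repo or of the `safetensors` package.

```python
import io
import json
import struct


def read_safetensors_header(stream):
    """Return the JSON header of a safetensors stream as a dict.

    Hypothetical helper for illustration; real code should use the
    `safetensors` package instead.
    """
    # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
    (header_len,) = struct.unpack("<Q", stream.read(8))
    # Next header_len bytes: UTF-8 JSON mapping tensor names to
    # dtype, shape, and byte offsets into the data section.
    return json.loads(stream.read(header_len).decode("utf-8"))


def build_demo_file():
    """Build a tiny in-memory file with one 2x2 float32 tensor (demo data)."""
    data = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
    header = {"demo.weight": {"dtype": "F32", "shape": [2, 2],
                              "data_offsets": [0, len(data)]}}
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


header = read_safetensors_header(io.BytesIO(build_demo_file()))
for name, info in header.items():
    if name != "__metadata__":  # optional free-form metadata entry
        print(name, info["dtype"], info["shape"])
```

Reading the file is plain JSON parsing plus byte slicing, so nothing stored in the file is ever executed. Pointed at a downloaded copy opened in binary mode, the same helper would list the real tensor names and shapes.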
35
+ ## What it is, briefly
36
+
37
+ SciBERT is BERT-base trained on a corpus of 1.14 million scientific papers from Semantic Scholar (biomedical and computer science). It uses a specialized vocabulary built from scientific text — terms like `protein`, `algorithm`, `mitochondria`, and `gradient` are single tokens rather than fragments, which gives the model a meaningful advantage on scientific text compared to general-purpose BERT.
38
+
39
+ | Property | Value |
40
+ |----------|-------|
41
+ | Architecture | BERT-base |
42
+ | Parameters | ~110M |
43
+ | Hidden size | 768 |
44
+ | Layers | 12 |
45
+ | Attention heads | 12 |
46
+ | Vocabulary size | 31,116 (cased, scientific) |
47
+ | Max sequence length | 512 tokens |
48
+ | Training data | 1.14M scientific papers (Semantic Scholar) |
49
+ | Case sensitivity | **Cased** (preserves capitalization — important for gene names, chemical formulas, acronyms) |
50
+
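As a rough sanity check on the ~110M figure, the parameter count can be reconstructed from the table. This sketch assumes the standard BERT-base layout, including a feed-forward size of 3072, two token-type embeddings, and a pooler layer. Those three values are assumptions not stated in the table and can be checked against `config.json`.

```python
def bert_param_count(vocab=31_116, hidden=768, layers=12,
                     max_pos=512, type_vocab=2, ffn=3072):
    """Count parameters of a BERT encoder with pooler (no LM head)."""
    layer_norm = 2 * hidden  # scale and bias
    # Word, position, and token-type embeddings plus one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden) with biases.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count comes to 109,938,432. The number of attention heads does not change the total, because the heads only partition the same 768-wide projections.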
51
+ ## Important caveat: this is a base BERT, not a sentence-transformer
52
+
53
+ SciBERT was pretrained with BERT's self-supervised objectives (masked language modeling and next-sentence prediction), not with a sentence-similarity objective. Its token-level representations are excellent on scientific text, but the **mean-pooled sentence embeddings cluster poorly**, a well-known limitation of base BERT-family models. For sentence-similarity tasks, retrieval, or topic modeling, fine-tuned variants like [`pritamdeka/S-Scibert-snli-multinli-stsb`](https://huggingface.co/pritamdeka/S-Scibert-snli-multinli-stsb) perform substantially better.
54
+
55
+ SciBERT itself is best used for token-level tasks such as NER, for masked language modeling, or as a starting point for further fine-tuning (for example, text classification).
56
+
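For reference, "mean pooling" here means averaging the token vectors of a sequence while ignoring padding positions. A minimal pure-Python sketch follows. The function name and the toy 2-dimensional vectors are illustrative assumptions; real SciBERT outputs would be 768-dimensional tensors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors where attention_mask is 1, skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three real tokens and one padding position.
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 2.0]
```

The padding vector `[9.0, 9.0]` is excluded by the mask, which is why the result is the average of the first three vectors only.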
57
+ ## Usage from R
58
+
59
+ This mirror is set up to work with a pure-R BERT inference pipeline built on top of the `torch` (libtorch) R package, with no Python at runtime:
60
+
61
+ ```r
62
+ source("bert_r.R")
63
+ enc <- load_hf_bert("NetworkIsLife/SciBert_Cased_DAFS")
64
+
65
+ emb <- embed_texts(enc$model, enc$tokenizer,
66
+ c("CRISPR-Cas9 enables targeted gene editing.",
67
+ "Glioblastoma exhibits invasive growth patterns."))
68
+ dim(emb) # 2 x 768
69
+ ```
70
+
71
+ The R loader looks for `model.safetensors` first and falls back to `pytorch_model.bin` if it isn't found. Because this mirror ships the safetensors file, the loader takes the fast path.
72
+
73
+ For long-term reproducibility in course materials, pin to a specific revision:
74
+
75
+ ```r
76
+ enc <- load_hf_bert(
77
+ "NetworkIsLife/SciBert_Cased_DAFS",
78
+ weights_path = hfhub::hub_download(
79
+ "NetworkIsLife/SciBert_Cased_DAFS",
80
+ "model.safetensors",
81
+ revision = "MAIN_COMMIT_HASH_HERE"
82
+ )
83
+ )
84
+ ```
85
+
86
+ Replace `MAIN_COMMIT_HASH_HERE` with the full hash of the commit you want to pin, as listed in this repo's commit history.
87
+
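Pinning works because the Hub serves each file at a revision-specific URL. The sketch below builds such a URL with the Python standard library. It assumes the Hub's `resolve/<revision>/<filename>` URL pattern, and `pinned_file_url` is a hypothetical helper rather than part of `hfhub` or `huggingface_hub`. The hash in the example is a made-up placeholder.

```python
import re
from urllib.parse import quote

HUB = "https://huggingface.co"


def pinned_file_url(repo_id, filename, revision):
    """Build a download URL for one file at one exact commit."""
    # Require a full 40-character hex commit hash: branch names such as
    # "main" move over time and would defeat the purpose of pinning.
    if not re.fullmatch(r"[0-9a-f]{40}", revision):
        raise ValueError("revision must be a full 40-character commit hash")
    return f"{HUB}/{quote(repo_id)}/resolve/{revision}/{quote(filename)}"


# Placeholder hash for illustration only; substitute a real one.
fake_hash = "0123456789abcdef0123456789abcdef01234567"
url = pinned_file_url("NetworkIsLife/SciBert_Cased_DAFS",
                      "model.safetensors", fake_hash)
print(url)
```

Rejecting branch names in the helper is a design choice for course materials: a URL that contains a commit hash keeps pointing at the same bytes even if the repo is updated later.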
88
+ ## Usage from Python (unchanged from upstream)
89
+
90
+ ```python
91
+ from transformers import AutoTokenizer, AutoModel
92
+ tok = AutoTokenizer.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
93
+ mod = AutoModel.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
94
+ ```
95
+
96
+ ## Files in this repo
97
+
98
+ | File | Source | Purpose |
99
+ |------|--------|---------|
100
+ | `model.safetensors` | converted from upstream `pytorch_model.bin` | model weights, modern format |
101
+ | `pytorch_model.bin` | copied from upstream | model weights, legacy format (kept for compatibility) |
102
+ | `config.json` | copied from upstream | architecture parameters |
103
+ | `vocab.txt` | copied from upstream | WordPiece vocabulary |
104
+ | `README.md` | this file | provenance and usage |
105
+
106
+ ## Provenance and verification
107
+
108
+ The `model.safetensors` file in this repo was produced by Hugging Face's official `SFconvertbot` (the same automated conversion used across thousands of Hugging Face repos). The conversion is purely a re-serialization: every tensor in the safetensors file is bit-identical to the corresponding tensor in `pytorch_model.bin`. No re-training, no quantization, no precision loss.
109
+
110
+ You can verify this yourself in Python:
111
+
112
+ ```python
113
+ import torch
114
+ from safetensors.torch import load_file
115
+
116
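+ a = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)  # weights_only=True restricts unpickling to tensors and plain containers
117
+ b = load_file("model.safetensors")  # dict mapping tensor names to CPU tensors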
118
+ assert set(a.keys()) == set(b.keys())
119
+ for k in a:
120
+ assert torch.equal(a[k], b[k]), f"Mismatch in {k}"
121
+ print("Bit-identical.")
122
+ ```
123
+
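A complementary check, sketched here under assumptions, is to record a SHA-256 digest of each downloaded file so that course participants can confirm they are working with the same bytes. The helper name `sha256_of` and the idea of publishing digests alongside course materials are suggestions, not something this repo currently provides.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a multi-hundred-megabyte weights file
        # never has to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of("model.safetensors")` on two machines and comparing the strings confirms that both hold the same file. The digests of `model.safetensors` and `pytorch_model.bin` will differ from each other, since the two formats serialize the same tensors differently; the tensor-by-tensor comparison above is what checks equality across formats.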
124
+ ## License and citation
125
+
126
+ This mirror inherits the upstream license: **Apache 2.0**. If you use this model in academic work, please cite the original SciBERT paper:
127
+
128
+ ```bibtex
129
+ @inproceedings{beltagy-etal-2019-scibert,
130
+ title = "{SciBERT}: A Pretrained Language Model for Scientific Text",
131
+ author = "Beltagy, Iz and Lo, Kyle and Cohan, Arman",
132
+ booktitle = "Proceedings of EMNLP-IJCNLP",
133
+ year = "2019",
134
+ url = "https://www.aclweb.org/anthology/D19-1371"
135
+ }
136
+ ```
137
+
138
+ Original model: [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased) by the Allen Institute for AI.
139
+
140
+ ## Maintenance
141
+
142
+ This is a teaching artifact for a course on transformer-based topic modeling in R. It will not be updated except to fix conversion errors. For the canonical, maintained version of SciBERT, see the [upstream repo](https://huggingface.co/allenai/scibert_scivocab_cased).