Update README.md
README.md
CHANGED
@@ -1,3 +1,142 @@
---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- bert
- scibert
- scientific-text
- mirror
- r-compatible
base_model: allenai/scibert_scivocab_cased
pipeline_tag: feature-extraction
---

# SciBERT (cased): safetensors mirror for use from R

This is a format-converted mirror of [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased), maintained for teaching a course on transformer-based topic modeling in R.

The model itself is unchanged: same architecture, same weights, same tokenizer, same outputs as the upstream original. What differs is the on-disk format and the provenance, both of which matter in a teaching context.

## Why this mirror exists

The upstream repo ships `pytorch_model.bin` (PyTorch pickle format) and `vocab.txt`. Both are fine for Python users, but they create friction for R users working through the `torch` (libtorch) and `safetensors` R packages:

- **Pickle weights** require either the `transformers` Python library or the R-torch pickle reader, which has known limitations (cannot remap CUDA-saved tensors to CPU, executes arbitrary code on load, slower to read than safetensors).
- **`tokenizer.json` is missing**, which forces R code either to depend on a separate WordPiece-tokenization package or to install the Python `tokenizers` library through `reticulate`.

This mirror adds:

- `model.safetensors`: the same weights in [safetensors](https://huggingface.co/docs/safetensors) format, which is device-agnostic, safe (cannot execute code on load), and faster to read than pickle.

Everything else (`config.json`, `vocab.txt`, the model architecture) is identical to the upstream original. The original `pytorch_model.bin` is preserved alongside the safetensors copy so the repo remains a strict superset of the upstream.
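
To make the "cannot execute code on load" point concrete, here is a minimal sketch of why reading safetensors is safe: the file starts with an 8-byte little-endian length, followed by a JSON header describing each tensor, followed by raw bytes. The helper names `read_safetensors_header` and `write_toy_file` are made up for this illustration, and the toy file is written by hand rather than taken from this repo.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without touching tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form string metadata
    return header


def write_toy_file(path):
    """Hand-write a one-tensor file in the same layout (illustration only)."""
    header = {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
    blob = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)))
        f.write(blob)
        f.write(struct.pack("<2f", 1.0, 2.0))  # raw tensor bytes


with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.safetensors")
    write_toy_file(toy_path)
    toy_header = read_safetensors_header(toy_path)

for name, info in toy_header.items():
    print(name, info["dtype"], info["shape"])
```

Pointing `read_safetensors_header` at a downloaded `model.safetensors` should list the tensor names, dtypes, and shapes in the same way. Nothing is unpickled at any point; the reader only parses JSON and byte offsets.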

## What it is, briefly

SciBERT is BERT-base trained on a corpus of 1.14 million scientific papers from Semantic Scholar (biomedical and computer science). It uses a specialized vocabulary built from scientific text: terms like `protein`, `algorithm`, `mitochondria`, and `gradient` are single tokens rather than fragments, which gives the model a meaningful advantage on scientific text compared to general-purpose BERT.
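
The effect of a domain vocabulary can be sketched with a toy version of WordPiece's greedy longest-match-first splitting. This is an illustration under stated assumptions, not the tokenizer shipped with the model: the function name `wordpiece` and both vocabularies are invented here, and the real `vocab.txt` will split words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no vocabulary entry covers this position
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies, invented for illustration only (not the real vocab files).
general_vocab = {"mit", "##och", "##ond", "##ria"}
science_vocab = general_vocab | {"mitochondria"}

print(wordpiece("mitochondria", general_vocab))  # ['mit', '##och', '##ond', '##ria']
print(wordpiece("mitochondria", science_vocab))  # ['mitochondria']
```

With the whole word in the vocabulary, the model sees one token with its own embedding instead of four fragments whose meaning has to be reassembled from context.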

| Property | Value |
|----------|-------|
| Architecture | BERT-base |
| Parameters | ~110M |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocabulary size | 31,116 (cased, scientific) |
| Max sequence length | 512 tokens |
| Training data | 1.14M scientific papers (Semantic Scholar) |
| Case sensitivity | **Cased** (preserves capitalization, which is important for gene names, chemical formulas, acronyms) |
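
The "~110M" figure can be reproduced from the other rows of the table with a little arithmetic. The sketch below assumes the standard BERT-base settings that the table does not list (feed-forward size 3072, two token-type embeddings, and a pooler layer as in `BertModel`); `bert_param_count` is a name chosen for this example.

```python
def bert_param_count(vocab, hidden=768, layers=12, max_pos=512,
                     type_vocab=2, intermediate=3072):
    """Count weights and biases of a BERT encoder with a pooler head."""
    layer_norm = 2 * hidden  # scale and shift vectors
    # Word, position and token-type embeddings, then one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, then one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (up and down projection), then one LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_param_count(vocab=31_116)
print(f"{total:,}")  # 109,938,432
```

Roughly 24M of those parameters sit in the embedding table, so the vocabulary size accounts for the small difference from a general-purpose BERT-base.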

## Important caveat: this is a base BERT, not a sentence-transformer

SciBERT was pretrained with BERT's self-supervised objectives (chiefly masked language modeling), not with a sentence-similarity objective. Its token-level representations are excellent on scientific text, but the **mean-pooled sentence embeddings cluster poorly**, a well-known limitation of base BERT-family models. For sentence-similarity tasks, retrieval, or topic modeling, fine-tuned variants like [`pritamdeka/S-Scibert-snli-multinli-stsb`](https://huggingface.co/pritamdeka/S-Scibert-snli-multinli-stsb) perform substantially better.

SciBERT itself is best used for token-level tasks (NER, classification fine-tuning), masked language modeling, or as a starting point for further fine-tuning.
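
For readers unfamiliar with the term, "mean pooling" simply averages the per-token vectors of a sentence while skipping padding positions. A minimal sketch, using plain Python lists in place of tensors and a hypothetical helper name `mean_pool`:

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors where attention_mask is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token vectors of size 2; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

The operation itself is trivial; the caveat above is about the input vectors, which a base BERT was never trained to make comparable across sentences.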

## Usage from R

This mirror is set up to work with a pure-R BERT inference pipeline built on top of the `torch` (libtorch) R package, with no Python at runtime:

```r
source("bert_r.R")
enc <- load_hf_bert("NetworkIsLife/SciBert_Cased_DAFS")

emb <- embed_texts(enc$model, enc$tokenizer,
                   c("CRISPR-Cas9 enables targeted gene editing.",
                     "Glioblastoma exhibits invasive growth patterns."))
dim(emb)   # 2 x 768
```

The R loader looks for `model.safetensors` first (the file this mirror adds) and falls back to `pytorch_model.bin` if it isn't found. Since the safetensors file is present, the loader takes the fast path.
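
The lookup order described above can be sketched in a few lines. This is a Python rendering of the described behaviour, not the code in `bert_r.R`; the function name `pick_weights_file` and the constant `PREFERRED_ORDER` are assumptions made for the example.

```python
import tempfile
from pathlib import Path

# Preferred format first, legacy pickle format second.
PREFERRED_ORDER = ("model.safetensors", "pytorch_model.bin")


def pick_weights_file(model_dir):
    """Return the first weights file that exists, in preference order."""
    model_dir = Path(model_dir)
    for name in PREFERRED_ORDER:
        candidate = model_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no weights file found in {model_dir}")


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "pytorch_model.bin").touch()
    only_bin = pick_weights_file(tmp).name   # fallback: only the legacy file exists
    Path(tmp, "model.safetensors").touch()
    both = pick_weights_file(tmp).name       # safetensors wins when both exist

print(only_bin)  # pytorch_model.bin
print(both)      # model.safetensors
```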

For long-term reproducibility in course materials, pin to a specific revision:

```r
enc <- load_hf_bert(
  "NetworkIsLife/SciBert_Cased_DAFS",
  weights_path = hfhub::hub_download(
    "NetworkIsLife/SciBert_Cased_DAFS",
    "model.safetensors",
    revision = "MAIN_COMMIT_HASH_HERE"
  )
)
```

Replace `MAIN_COMMIT_HASH_HERE` with the commit hash visible in this repo's commit history.
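
Pinning works because a branch name such as `main` can move, while a commit hash always names the same snapshot. As a rough sketch of what a pinned download resolves to, the hypothetical helper below builds the Hub's `resolve` URL and rejects anything that is not a full 40-character hash; the all-zero hash is a stand-in, not a real commit of this repo.

```python
import re


def pinned_url(repo_id, filename, revision):
    """Build a Hub download URL that is fixed to one commit."""
    if not re.fullmatch(r"[0-9a-f]{40}", revision):
        raise ValueError("expected a full 40-character commit hash")
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


placeholder = "0" * 40  # stand-in value, replace with a hash from the commit history
url = pinned_url("NetworkIsLife/SciBert_Cased_DAFS", "model.safetensors", placeholder)
print(url)
```

Rejecting branch names up front is a design choice for course materials: a typo or a leftover `main` fails loudly instead of silently tracking whatever the branch points to later.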

## Usage from Python (unchanged from upstream)

```python
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
mod = AutoModel.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
```

## Files in this repo

| File | Source | Purpose |
|------|--------|---------|
| `model.safetensors` | converted from upstream `pytorch_model.bin` | model weights, modern format |
| `pytorch_model.bin` | copied from upstream | model weights, legacy format (kept for compatibility) |
| `config.json` | copied from upstream | architecture parameters |
| `vocab.txt` | copied from upstream | WordPiece vocabulary |
| `README.md` | this file | provenance and usage |

## Provenance and verification

The `model.safetensors` file in this repo was produced by Hugging Face's official `SFconvertbot` (the same automated conversion used across thousands of Hugging Face repos). The conversion is purely a re-serialization: every tensor in the safetensors file is bit-identical to the corresponding tensor in `pytorch_model.bin`. No re-training, no quantization, no precision loss.

You can verify this yourself in Python:

```python
import torch
from safetensors.torch import load_file

# Load both serializations onto the CPU.
a = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
b = load_file("model.safetensors")

# Same tensor names, and every pair of tensors equal in size and elements.
assert set(a.keys()) == set(b.keys())
for k in a:
    assert torch.equal(a[k], b[k]), f"Mismatch in {k}"
print("Bit-identical.")
```

## License and citation

This mirror inherits the upstream license: **Apache 2.0**. If you use this model in academic work, please cite the original SciBERT paper:

```bibtex
@inproceedings{beltagy-etal-2019-scibert,
    title = "{SciBERT}: A Pretrained Language Model for Scientific Text",
    author = "Beltagy, Iz and Lo, Kyle and Cohan, Arman",
    booktitle = "Proceedings of EMNLP-IJCNLP",
    year = "2019",
    url = "https://www.aclweb.org/anthology/D19-1371"
}
```

Original model: [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased) by the Allen Institute for AI.

## Maintenance

This is a teaching artifact for a course on transformer-based topic modeling in R. It will not be updated except to fix conversion errors. For the canonical, maintained version of SciBERT, see the [upstream repo](https://huggingface.co/allenai/scibert_scivocab_cased).