	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	tags:
	- bert
	- scibert
	- scientific-text
	- mirror
	- r-compatible
	base_model: allenai/scibert_scivocab_cased
	pipeline_tag: feature-extraction
	---

	# SciBERT (cased) — safetensors mirror for use from R

	This is a format-converted mirror of [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased), maintained for teaching a course on transformer-based topic modeling in R.

	The model itself is unchanged: same architecture, same weights, same tokenizer, same outputs as the upstream original. What's different is the on-disk format and provenance, both of which matter for a teaching context.

	## Why this mirror exists

	The upstream repo ships `pytorch_model.bin` (PyTorch pickle format) and `vocab.txt`. Both are fine for Python users, but they create friction for R users working through the `torch` (libtorch) and `safetensors` R packages:

	- Pickle weights require either the `transformers` Python library or the R-torch pickle reader, which has known limitations: it cannot remap CUDA-saved tensors to CPU, it can execute arbitrary code on load, and it is slower to read than safetensors.
	- `tokenizer.json` is missing, which forces R code to either depend on a separate WordPiece-tokenization package or to install the Python `tokenizers` library through `reticulate`.

	This mirror adds:

	- `model.safetensors` — the same weights in [safetensors](https://huggingface.co/docs/safetensors) format, which is device-agnostic, safe (cannot execute code on load), and faster to read than pickle.

	Everything else — `config.json`, `vocab.txt`, the model architecture — is identical to the upstream original. The original `pytorch_model.bin` is preserved alongside the safetensors copy so the repo remains a strict superset of the upstream.
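The "cannot execute code on load" property follows from the file layout: an 8-byte little-endian length, a JSON header describing each tensor, then raw tensor bytes. The sketch below reads only that header with the Python standard library. The helper names `read_safetensors_header` and `build_toy_file` are invented for illustration, and the toy file stands in for a real `model.safetensors`.

```python
import io
import json
import struct

def read_safetensors_header(stream):
    """Return the JSON header of a safetensors file as a dict.

    Layout: 8-byte little-endian unsigned length N, then N bytes of UTF-8
    JSON, then the raw tensor data. Nothing is unpickled or executed.
    """
    (header_len,) = struct.unpack("<Q", stream.read(8))
    return json.loads(stream.read(header_len).decode("utf-8"))

def build_toy_file():
    """Build a tiny in-memory file in the same layout (hypothetical contents)."""
    data = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # one 2x2 float32 tensor
    header = {
        "toy.weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, len(data)]}
    }
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data

header = read_safetensors_header(io.BytesIO(build_toy_file()))
for name, info in header.items():
    print(name, info["dtype"], info["shape"])
```

To inspect a real file, pass `open("model.safetensors", "rb")` instead of the in-memory stream. Real headers may also carry a `__metadata__` entry, which has no `dtype` or `shape` and should be skipped in the loop.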

	## What it is, briefly

	SciBERT is BERT-base trained on a corpus of 1.14 million scientific papers from Semantic Scholar (biomedical and computer science). It uses a specialized vocabulary built from scientific text: terms like `protein`, `algorithm`, `mitochondria`, and `gradient` are single tokens rather than fragments, so scientific text is split into fewer pieces than it would be with the general-purpose BERT vocabulary.
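The effect of the vocabulary can be illustrated with a minimal sketch of WordPiece's greedy longest-match-first splitting. Both vocabularies below are toy sets invented for this example, not the real `vocab.txt` files, and `wordpiece` is a hypothetical helper rather than the tokenizer shipped with the model.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabularies, invented for illustration only.
general_vocab = {"mit", "##och", "##ond", "##ria", "gene"}
science_vocab = general_vocab | {"mitochondria"}

print(wordpiece("mitochondria", general_vocab))  # ['mit', '##och', '##ond', '##ria']
print(wordpiece("mitochondria", science_vocab))  # ['mitochondria']
```

With the domain term present in the vocabulary, the word maps to one token and one embedding instead of four fragments.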

	| Property | Value |
	|----------|-------|
	| Architecture | BERT-base |
	| Parameters | ~110M |
	| Hidden size | 768 |
	| Layers | 12 |
	| Attention heads | 12 |
	| Vocabulary size | 31,116 (cased, scientific) |
	| Max sequence length | 512 tokens |
	| Training data | 1.14M scientific papers (Semantic Scholar) |
	| Case sensitivity | Cased (preserves capitalization, which matters for gene names, chemical formulas, and acronyms) |
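As a rough cross-check of the ~110M figure, the parameter count can be derived from the values in the table. This is a sketch under two assumptions not stated in the table: a feed-forward width of 4 × hidden size (3072, the usual BERT-base setting) and two token-type embeddings. It counts the encoder and pooler only, without a masked-language-modeling head. `bert_param_count` is a hypothetical helper.

```python
def bert_param_count(vocab_size, hidden, layers, max_positions,
                     intermediate=None, type_vocab=2):
    """Count parameters of a BERT encoder with pooler (no MLM head)."""
    if intermediate is None:
        intermediate = 4 * hidden  # assumption: standard BERT feed-forward width
    layer_norm = 2 * hidden  # scale and bias
    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> intermediate -> hidden.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

total = bert_param_count(vocab_size=31116, hidden=768, layers=12, max_positions=512)
print(f"{total:,}")  # 109,938,432
```

The number of attention heads does not appear in the formula: heads partition the 768-dimensional projections without adding parameters.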

	## Important caveat: this is a base BERT, not a sentence-transformer

	SciBERT was pretrained with BERT's self-supervised objectives, chiefly masked language modeling, and never with a sentence-level similarity objective. Its token-level representations are strong on scientific text, but mean-pooled sentence embeddings cluster poorly, a well-known limitation of base BERT-family models. For sentence similarity, retrieval, or topic modeling, fine-tuned variants like [`pritamdeka/S-Scibert-snli-multinli-stsb`](https://huggingface.co/pritamdeka/S-Scibert-snli-multinli-stsb) perform substantially better.

	SciBERT itself is best used for token-level tasks such as NER, for masked language modeling, or as a starting point for further fine-tuning (for example, on classification).
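For reference, mean pooling turns per-token vectors into one sentence vector by averaging over the real (non-padding) tokens. The sketch below uses plain Python lists and toy 3-dimensional vectors in place of 768-dimensional hidden states. `mean_pool` is a hypothetical helper, and whether a given pipeline pools exactly this way is an assumption.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors over positions where attention_mask is 1."""
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:  # skip padding positions
            count += 1
            for i, value in enumerate(vec):
                sums[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / count for s in sums]

# Toy hidden states: two real tokens followed by one padding position.
tokens = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 2.0, 1.0]
```

Masking matters because padding vectors would otherwise pull every short sentence toward the same point.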

	## Usage from R

	This mirror is set up to work with a pure-R BERT inference pipeline built on top of the `torch` (libtorch) R package, with no Python at runtime:

	```r
	source("bert_r.R")
	enc <- load_hf_bert("NetworkIsLife/SciBert_Cased_DAFS")

	emb <- embed_texts(enc$model, enc$tokenizer,
	                   c("CRISPR-Cas9 enables targeted gene editing.",
	                     "Glioblastoma exhibits invasive growth patterns."))
	dim(emb) # 2 x 768
	```

	The R loader looks for `model.safetensors` first (the file this mirror adds) and falls back to `pytorch_model.bin` only if it isn't found. Because the safetensors file is present in this repo, the loader always takes the fast path.
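The lookup order described above can be sketched in a few lines. This is a Python illustration of the preference logic only, not the code in `bert_r.R`. `pick_weights_file` is a hypothetical name, and the temporary directory stands in for a downloaded model folder.

```python
from pathlib import Path
import tempfile

def pick_weights_file(model_dir):
    """Prefer model.safetensors; fall back to pytorch_model.bin."""
    model_dir = Path(model_dir)
    for name in ("model.safetensors", "pytorch_model.bin"):
        candidate = model_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no weights file found in {model_dir}")

with tempfile.TemporaryDirectory() as tmp:
    # Only the legacy file exists: the fallback is used.
    (Path(tmp) / "pytorch_model.bin").touch()
    print(pick_weights_file(tmp).name)  # pytorch_model.bin
    # Once the safetensors file is present, it wins.
    (Path(tmp) / "model.safetensors").touch()
    print(pick_weights_file(tmp).name)  # model.safetensors
```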

	For long-term reproducibility in course materials, pin to a specific revision:

	```r
	enc <- load_hf_bert(
	  "NetworkIsLife/SciBert_Cased_DAFS",
	  weights_path = hfhub::hub_download(
	    "NetworkIsLife/SciBert_Cased_DAFS",
	    "model.safetensors",
	    revision = "MAIN_COMMIT_HASH_HERE"
	  )
	)
	```

	Replace `MAIN_COMMIT_HASH_HERE` with the commit hash visible in this repo's commit history.
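Pinning works because the Hub serves files at URLs that include the revision. A minimal sketch of building such a pinned URL follows. `pinned_file_url` is a hypothetical helper, the all-zero hash is a placeholder rather than a real commit of this repo, and rejecting anything other than a full 40-character hash is a design choice of this sketch (the Hub itself also accepts branch names such as `main`).

```python
import re

def pinned_file_url(repo_id, filename, revision):
    """Build a download URL pinned to one full commit hash."""
    # A branch name like "main" can move; a full commit hash cannot.
    if not re.fullmatch(r"[0-9a-f]{40}", revision):
        raise ValueError("revision must be a full 40-character commit hash")
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

# Placeholder hash for illustration only.
example_revision = "0" * 40
print(pinned_file_url("NetworkIsLife/SciBert_Cased_DAFS",
                      "model.safetensors", example_revision))
```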

	## Usage from Python (unchanged from upstream)

	```python
	from transformers import AutoTokenizer, AutoModel
	tok = AutoTokenizer.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
	mod = AutoModel.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
	```

	## Files in this repo

	| File | Source | Purpose |
	|------|--------|---------|
	| `model.safetensors` | converted from upstream `pytorch_model.bin` | model weights, modern format |
	| `pytorch_model.bin` | copied from upstream | model weights, legacy format (kept for compatibility) |
	| `config.json` | copied from upstream | architecture parameters |
	| `vocab.txt` | copied from upstream | WordPiece vocabulary |
	| `README.md` | this file | provenance and usage |

	## Provenance and verification

	The `model.safetensors` file in this repo was produced by HuggingFace's official `SFconvertbot` (the same automated conversion used across thousands of HuggingFace repos). The conversion is purely a re-serialization — every tensor in the safetensors file is bit-identical to the corresponding tensor in `pytorch_model.bin`. No re-training, no quantization, no precision loss.

	You can verify this yourself in Python:

	```python
	import torch
	from safetensors.torch import load_file

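	# Load both files onto the CPU; weights_only=True avoids unpickling arbitrary objects.
	a = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
	b = load_file("model.safetensors")

	# Same tensor names, and every tensor equal element for element.
	assert set(a.keys()) == set(b.keys())
	for k in a:
	    assert torch.equal(a[k], b[k]), f"Mismatch in {k}"
	print("Bit-identical.")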
	```

	## License and citation

	This mirror inherits the upstream license: Apache 2.0. If you use this model in academic work, please cite the original SciBERT paper:

	```bibtex
	@inproceedings{beltagy-etal-2019-scibert,
	  title = "{SciBERT}: A Pretrained Language Model for Scientific Text",
	  author = "Beltagy, Iz and Lo, Kyle and Cohan, Arman",
	  booktitle = "Proceedings of EMNLP-IJCNLP",
	  year = "2019",
	  url = "https://www.aclweb.org/anthology/D19-1371"
	}
	```

	Original model: [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased) by the Allen Institute for AI.

	## Maintenance

	This is a teaching artifact for a course on transformer-based topic modeling in R. It will not be updated except to fix conversion errors. For the canonical, maintained version of SciBERT, see the [upstream repo](https://huggingface.co/allenai/scibert_scivocab_cased).