---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- bert
- scibert
- scientific-text
- mirror
- r-compatible
base_model: allenai/scibert_scivocab_cased
pipeline_tag: feature-extraction
---
# SciBERT (cased): safetensors mirror for use from R

This is a format-converted mirror of [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased), maintained for teaching a course on transformer-based topic modeling in R.

The model itself is unchanged: same architecture, same weights, same tokenizer, same outputs as the upstream original. What differs is the on-disk format and provenance, both of which matter in a teaching context.
## Why this mirror exists

The upstream repo ships `pytorch_model.bin` (PyTorch pickle format) and `vocab.txt`. Both are fine for Python users, but they create friction for R users working with the `torch` (libtorch) and `safetensors` R packages:

- **Pickle weights** require either the `transformers` Python library or the R-torch pickle reader, which has known limitations: it cannot remap CUDA-saved tensors to CPU, it executes arbitrary code on load, and it is slower to read than safetensors.
- **`tokenizer.json` is missing**, which forces R code either to depend on a separate WordPiece-tokenization package or to install the Python `tokenizers` library through `reticulate`.

This mirror adds:

- `model.safetensors`: the same weights in [safetensors](https://huggingface.co/docs/safetensors) format, which is device-agnostic, safe (cannot execute code on load), and faster to read than pickle.

Everything else (`config.json`, `vocab.txt`, the model architecture) is identical to the upstream original. The original `pytorch_model.bin` is preserved alongside the safetensors copy so the repo remains a strict superset of the upstream.
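To make the "cannot execute code on load" point concrete, here is a minimal standard-library sketch of how a safetensors file begins: an 8-byte little-endian header length, followed by a JSON header that describes each tensor. The helper names and the toy file are illustrative assumptions, not part of this repo or of the `safetensors` library; real code would normally use the library's own loader.

```python
import json
import struct
import tempfile
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # The header is UTF-8 JSON; parsing it runs no code from the file.
        return json.loads(f.read(header_len).decode("utf-8"))


def write_toy_safetensors(path):
    """Write a tiny hand-built file holding one float32 tensor (illustration only)."""
    data = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # 4 floats = 16 bytes
    header = {
        "toy.weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, len(data)]}
    }
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))
        f.write(header_bytes)
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.safetensors"
    write_toy_safetensors(toy_path)
    header = read_safetensors_header(toy_path)

for name, info in header.items():
    if name != "__metadata__":  # optional free-form metadata entry
        print(name, info["dtype"], info["shape"])  # toy.weight F32 [2, 2]
```

Pointing `read_safetensors_header` at a downloaded `model.safetensors` should list the tensor names, dtypes, and shapes without reading any weight bytes.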
## What it is, briefly

SciBERT is BERT-base trained on a corpus of 1.14 million scientific papers from Semantic Scholar (biomedical and computer science). It uses a specialized vocabulary built from scientific text: terms like `protein`, `algorithm`, `mitochondria`, and `gradient` are single tokens rather than fragments, which gives the model a meaningful advantage on scientific text compared to general-purpose BERT.

| Property | Value |
|----------|-------|
| Architecture | BERT-base |
| Parameters | ~110M |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocabulary size | 31,116 (cased, scientific) |
| Max sequence length | 512 tokens |
| Training data | 1.14M scientific papers (Semantic Scholar) |
| Case sensitivity | **Cased** (preserves capitalization, which matters for gene names, chemical formulas, acronyms) |
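As a rough check on the "~110M" figure, the parameter count can be worked out from the table values. This sketch assumes the standard BERT-base feed-forward size of 3072 and two token-type embeddings, neither of which appears in the table, and it counts the encoder plus pooler (what `AutoModel` loads), not a masked-language-modeling head.

```python
def bert_parameter_count(vocab_size, hidden, layers, max_positions,
                         intermediate=3072, type_vocab=2):
    """Count trainable parameters of a BERT encoder with pooler.

    intermediate=3072 and type_vocab=2 are the usual BERT-base values,
    assumed here rather than taken from the model card.
    """
    layer_norm = 2 * hidden  # scale and bias
    # Word, position, and token-type embeddings, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block, plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_parameter_count(vocab_size=31_116, hidden=768, layers=12, max_positions=512)
print(f"{total:,}")  # 109,938,432
```

About 22% of those parameters sit in the word-embedding matrix (31,116 × 768), which is where this model differs in size from a BERT-base with a different vocabulary.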
## Important caveat: this is a base BERT, not a sentence-transformer

SciBERT was pretrained only with BERT's self-supervised language-modeling objectives, with no sentence-similarity objective. Its token-level representations are excellent on scientific text, but the **mean-pooled sentence embeddings cluster poorly**, a well-known limitation of base BERT-family models. For sentence-similarity tasks, retrieval, or topic modeling, fine-tuned variants like [`pritamdeka/S-Scibert-snli-multinli-stsb`](https://huggingface.co/pritamdeka/S-Scibert-snli-multinli-stsb) perform substantially better.

SciBERT itself is best used for fine-tuning on token-level or classification tasks (such as NER), for masked language modeling, or as a starting point for further fine-tuning.
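For readers unsure what "mean-pooled" means here, the following pure-Python sketch averages token vectors while skipping padding positions. The function name and the toy 2-dimensional vectors are illustrative assumptions; a real pipeline would do the same arithmetic on 768-dimensional tensors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: 3 real tokens plus 1 padding token, with 2-dimensional vectors
# (real SciBERT hidden states would be 768-dimensional).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [3.0, 4.0]
```

The pooling step itself is simple; the caveat above is about the quality of the vectors being averaged, which pooling cannot fix.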
## Usage from R

This mirror is set up to work with a pure-R BERT inference pipeline built on top of the `torch` (libtorch) R package, with no Python at runtime:

```r
source("bert_r.R")
enc <- load_hf_bert("NetworkIsLife/SciBert_Cased_DAFS")
emb <- embed_texts(enc$model, enc$tokenizer,
                   c("CRISPR-Cas9 enables targeted gene editing.",
                     "Glioblastoma exhibits invasive growth patterns."))
dim(emb)  # 2 x 768
```

The R loader looks for `model.safetensors` first (the file this mirror adds) and falls back to `pytorch_model.bin` if it isn't found. Since the safetensors file is present, the loader takes the fast path.
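The lookup order described above can be sketched in a few lines. This is a Python illustration of the idea under assumed names (`pick_weights_file`, `WEIGHT_FILES`); it is not the R loader's actual code.

```python
import tempfile
from pathlib import Path

# Preference order described above: safetensors first, pickle as fallback.
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def pick_weights_file(model_dir):
    """Return the first weights file found in model_dir, in preference order."""
    for name in WEIGHT_FILES:
        candidate = Path(model_dir) / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no weights file in {model_dir}")


with tempfile.TemporaryDirectory() as tmp:
    # With only the pickle file present, the fallback is chosen.
    (Path(tmp) / "pytorch_model.bin").touch()
    only_pickle = pick_weights_file(tmp).name
    # Once the safetensors file exists too, it takes priority.
    (Path(tmp) / "model.safetensors").touch()
    both_present = pick_weights_file(tmp).name

print(only_pickle, both_present)  # pytorch_model.bin model.safetensors
```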
For long-term reproducibility in course materials, pin to a specific revision:

```r
enc <- load_hf_bert(
  "NetworkIsLife/SciBert_Cased_DAFS",
  weights_path = hfhub::hub_download(
    "NetworkIsLife/SciBert_Cased_DAFS",
    "model.safetensors",
    revision = "MAIN_COMMIT_HASH_HERE"
  )
)
```

Replace `MAIN_COMMIT_HASH_HERE` with a commit hash from this repo's commit history.
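Pinning only helps if the revision is an immutable commit hash and not a branch name. The sketch below, with a hypothetical helper and a made-up hash, rejects anything that is not a full 40-character hash and builds the Hub's `resolve` URL for the pinned file. It only constructs a string and makes no network request.

```python
import re

HUB = "https://huggingface.co"


def pinned_file_url(repo_id, filename, revision):
    """Build a download URL for one file at one exact commit."""
    # A full Git SHA-1 commit hash is 40 hex characters; branch names such as
    # "main" are rejected because they can move to a newer commit.
    if not re.fullmatch(r"[0-9a-f]{40}", revision):
        raise ValueError(f"not a full commit hash: {revision!r}")
    return f"{HUB}/{repo_id}/resolve/{revision}/{filename}"


# Hypothetical hash for illustration only; it is not a real commit of this repo.
example_hash = "0123456789abcdef0123456789abcdef01234567"
url = pinned_file_url("NetworkIsLife/SciBert_Cased_DAFS", "model.safetensors", example_hash)
print(url)
```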
## Usage from Python (unchanged from upstream)

```python
from transformers import AutoTokenizer, AutoModel

# Tokenizer and encoder are loaded from this mirror, exactly as from upstream.
tok = AutoTokenizer.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
mod = AutoModel.from_pretrained("NetworkIsLife/SciBert_Cased_DAFS")
```
## Files in this repo

| File | Source | Purpose |
|------|--------|---------|
| `model.safetensors` | converted from upstream `pytorch_model.bin` | model weights, modern format |
| `pytorch_model.bin` | copied from upstream | model weights, legacy format (kept for compatibility) |
| `config.json` | copied from upstream | architecture parameters |
| `vocab.txt` | copied from upstream | WordPiece vocabulary |
| `README.md` | this file | provenance and usage |
## Provenance and verification

The `model.safetensors` file in this repo was produced by Hugging Face's official `SFconvertbot` (the same automated conversion used across thousands of Hugging Face repos). The conversion is purely a re-serialization: every tensor in the safetensors file is bit-identical to the corresponding tensor in `pytorch_model.bin`. No re-training, no quantization, no precision loss.

You can verify this yourself in Python:

```python
import torch
from safetensors.torch import load_file

# Load both copies of the weights onto the CPU.
a = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
b = load_file("model.safetensors")

# Same tensor names, and every tensor equal element for element.
assert set(a.keys()) == set(b.keys())
for k in a:
    assert torch.equal(a[k], b[k]), f"Mismatch in {k}"
print("Bit-identical.")
```
## License and citation

This mirror inherits the upstream license: **Apache 2.0**. If you use this model in academic work, please cite the original SciBERT paper:

```bibtex
@inproceedings{beltagy-etal-2019-scibert,
  title = "{SciBERT}: A Pretrained Language Model for Scientific Text",
  author = "Beltagy, Iz and Lo, Kyle and Cohan, Arman",
  booktitle = "Proceedings of EMNLP-IJCNLP",
  year = "2019",
  url = "https://www.aclweb.org/anthology/D19-1371"
}
```

Original model: [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased) by the Allen Institute for AI.
## Maintenance

This is a teaching artifact for a course on transformer-based topic modeling in R. It will not be updated except to fix conversion errors. For the canonical, maintained version of SciBERT, see the [upstream repo](https://huggingface.co/allenai/scibert_scivocab_cased).