Model Card for SHARE-14B

SHARE-14B (Social-Humanities AI for Research and Education) is a 14-billion-parameter decoder-only causal language model pretrained exclusively on content relevant to the social sciences and humanities (SSH). It is intended as a domain-specific base model for SSH research and education, and is designed to be used through the MIRROR interface, which surfaces token-level surprisal rather than generating new text.

More information can be found in the paper SHARE: Social-Humanities AI for Research and Education.

Note: This is an intermediate checkpoint released after ~15% of planned pretraining (96B tokens of a target ~630B). It is a base (pretrained-only) model with no SFT, DPO, or RLHF. This base model is not suitable for chat applications.

Model Details

Model Description

SHARE-14B is the first causal language model fully pretrained by and for the SSH disciplines. It mirrors the Phi-4 14B architecture but uses a custom 50,000-token BPE tokenizer trained on the SHARE corpus, and is pretrained exclusively on a curated SSH dataset drawn from Wikipedia, Project Gutenberg, PeS2o, and CORE. On a custom SSH Cloze benchmark, the current checkpoint achieves performance close to Phi-4 14B (0.796 vs 0.818 prior-corrected accuracy) while having seen roughly 100× fewer training tokens.

  • Developed by: João Gonçalves, Sonia de Jager, Petr Knoth, David Pride, Nick Jelicic
  • Funded by: NVIDIA Academic Grant; NWO-SURF Small Compute Grant (EINF-15690); Dutch Research Council (NWO) VENI grant VI.Veni.221S.154
  • Model type: Decoder-only transformer causal language model (Phi-4 architecture)
  • Language(s) (NLP): Primarily English, with a smaller proportion of Dutch
  • License: Custom Responsible AI License (RAIL-SHARE) — non-commercial, no model distillation, restricted text generation use

Model Sources

Uses

Direct Use

SHARE-14B is intended primarily as a base model deployed through the MIRROR interface for SSH researchers, educators, and students. Through MIRROR, the model is used to compute token-level surprisal and entropy on user-written texts in order to:

  • Identify typos, stylistic anomalies, and possible factual mistakes in academic writing
  • Highlight innovative or unexpected contributions in scholarly texts
  • Surface disciplinary biases and norms encoded in SSH literature
  • Support reflective revision of student and scholarly writing in the SSH
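MIRROR's exact pipeline is described in the paper; as a minimal illustration of the two quantities it surfaces, the sketch below computes surprisal (for the token that actually occurred) and entropy (of the model's predictive distribution) in bits. The probability distribution here is a toy stand-in, not real SHARE output.

```python
import math

def surprisal_and_entropy(probs, observed_index):
    """Surprisal of the observed token and entropy of the model's
    next-token distribution, both in bits."""
    surprisal = -math.log2(probs[observed_index])
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return surprisal, entropy

# Toy next-token distribution over a 4-token vocabulary;
# the writer's actual next token is index 1 (p = 0.2).
probs = [0.7, 0.2, 0.05, 0.05]
s, h = surprisal_and_entropy(probs, 1)
# High surprisal relative to entropy flags the token as unexpected.
```

In MIRROR these quantities are computed per token over a user's full text, so an unexpected word shows up as a local spike in surprisal rather than a binary error flag.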

Downstream Use

Potential downstream uses include perplexity-based analyses of SSH texts, domain-specific text classification, and research on the structure and biases of SSH scholarly discourse. Downstream use is governed by the RAIL-SHARE license (non-commercial; no distillation).

Out-of-Scope Use

  • Commercial applications of any kind (forbidden by license)
  • Model distillation into other models (forbidden by license)
  • Unconstrained text generation, especially in academic contexts where it could enable student or faculty fraud
  • STEM, biomedical, mathematical, or coding tasks — the model was deliberately not trained on these domains
  • Use as a chat assistant — the model is base-pretrained only, with no SFT or alignment
  • Multilingual applications outside of English and (to a lesser extent) Dutch
  • Any safety-critical decision-making

Bias, Risks, and Limitations

SHARE-14B inherits the systemic biases present in the open-access English-language SSH scholarship it was trained on. As illustrated in the paper, terms associated with non-Western scholarship (e.g. "African" in the context of locations of knowledge production) can register as unexpected, reflecting the field's existing imbalances rather than properties of the topics themselves.

Other limitations and risks:

  • English-dominant data, which is a meaningful constraint for SSH fields where multilingual scholarship matters
  • Intermediate checkpoint: only ~15% of planned pretraining is complete, so capabilities will continue to evolve
  • Causal interpretation effect: because surprisal is computed on preceding tokens, an early mistake in a text propagates and can mask later anomalies
  • Use in text reading/reviewing could be misused to shortcut careful reading of academic work
  • No alignment or safety tuning has been applied — the model is released as a base model

Recommendations

Users should treat MIRROR outputs as prompts for reflection rather than authoritative judgments. Surprisal does not equal correctness, and unexpectedness can signal innovation as readily as error. When using MIRROR for revision, work from the beginning of the text to mitigate the propagation of earlier surprisal into later tokens. Researchers should be aware of the model's biases toward dominant SSH discourses and read its outputs critically. Use of SHARE for direct text generation is discouraged.

Training Details

Training Data

The training corpus combines three SSH-focused subsets:

  • Wikipedia (English and Dutch): articles selected by traversing the category tree from SSH-relevant main topic classifications using PetScan and extracted with WikiExtractor
  • Project Gutenberg: books filtered by SSH-relevant Library of Congress Classes (B, C, D, G, H, J, K, L, M, N)
  • Academic publications: drawn from PeS2o and CORE, filtered using AllenAI's Field of Science (FoS) classifier to retain SSH disciplines (Art, Business, Economics, Geography, Education, History, Law, Linguistics, Philosophy, Political Science, Psychology, Sociology), plus additional materials provided through agreements with publishers including Open Humanities Press

The full corpus is on the order of tens of billions of tokens. See the technical report for details on filtering and selection.

Training Procedure

Preprocessing

Raw data preprocessing was carried out exclusively on EU servers. A custom BPE tokenizer with a 50,000-token vocabulary was trained on the full SHARE corpus.

Training Hyperparameters

  • Training regime: Mixed precision with FlashAttention-2, torch.compile, Liger Kernel, sequence packing, and FSDP
  • Architecture: Phi-4 14B (decoder-only transformer)
  • Context length: 4096 tokens
  • Warm-up steps: 2000
  • Learning rate: Manually monitored and adjusted between 5-day Snellius runs (started at 1.58e-4, adjusted to 1e-4 for the second run), motivated by concerns that cosine decay underutilizes data fed in later pretraining stages
  • Weight decay: 0.1
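The card specifies 2000 warm-up steps and a manually adjusted peak learning rate, but not the warm-up shape. As an illustrative sketch only (linear ramp and post-warm-up hold are assumptions, not documented choices), the schedule could look like:

```python
def warmup_lr(step, warmup_steps=2000, peak_lr=1.58e-4):
    """Learning rate at a given optimizer step.

    Linear ramp from 0 to peak_lr over warmup_steps, then hold at
    peak_lr until the rate is manually adjusted between runs (the
    card reports a drop to 1e-4 for the second Snellius run).
    The linear shape is an assumption for illustration.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```

Holding the rate flat and adjusting it manually between runs, rather than using cosine decay, reflects the authors' stated concern that cosine decay underweights data seen late in pretraining.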

Speeds, Sizes, Times

Training was initiated on Saturn Cloud using 8× NVIDIA A100 80GB GPUs for 167 hours under FSDP, then continued on the Dutch supercomputer Snellius using 5 nodes of 4× H100 GPUs (20 GPUs total) for approximately 225 hours. As of this checkpoint, the model has been trained on 96 billion tokens (~15% of the planned ~630B-token compute-optimal target across 2 epochs of the data mix).

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Perplexity comparison: Erasmus University Rotterdam research output abstracts from Q3–Q4 2025, which are out of distribution relative to the training data
  • SSH Cloze benchmark: 275 SSH abstracts published in Q1 2026 (25 per Web of Science field across 11 SSH disciplines), constructed by selecting sentences with equivalent-token decisions (e.g. positive/negative, higher/lower) where SSH knowledge is required to predict the correct token

Factors

  • Scientific domain (FoS classifier categories)
  • Faculty affiliation of authors at Erasmus University Rotterdam (used as an ecological-validity check)

Metrics

  • Log-perplexity difference relative to Phi-4 (lower means better SHARE fit)
  • Raw and prior-corrected accuracy on the SSH Cloze benchmark (prior correction accounts for models guessing the more frequent token)
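The log-perplexity difference metric can be sketched as below. The per-token log probabilities are hypothetical illustrative numbers, not real model output, and the prior-correction formula for the Cloze benchmark is not specified in this card, so it is not reproduced here.

```python
import math

def log_perplexity(token_logprobs):
    """Mean negative log probability per token, i.e. log(perplexity)."""
    return -sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token log probabilities for the same abstract
# under the two models being compared.
share_lp = [-2.1, -0.4, -1.3, -0.9]
phi4_lp = [-2.5, -0.7, -1.6, -1.2]

# Negative delta means the abstract is a better fit under SHARE.
delta = log_perplexity(share_lp) - log_perplexity(phi4_lp)
```

Averaging this delta per discipline is what produces the field-level comparison reported in the Results section.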

Results

On the SSH Cloze benchmark, SHARE-14B achieves 77.1% raw accuracy and 79.6% prior-corrected accuracy at the 96B-token checkpoint. This is close to Phi-4 14B (81.8% / 81.8%) despite Phi-4 being trained on roughly 9.8 trillion tokens, and clearly above OLMO-2 13B at the 168B-token Step-20k checkpoint (74.9% / 73.8%) and fully trained Pythia-12B (67.3% / 61.5%).

Perplexity analyses show that the gap between SHARE-14B and Phi-4 is consistently smaller for SSH fields (Art, Education, Sociology) than for STEM fields (Biology, Engineering, Medicine), indicating the intended SSH specialization. At the faculty level, the same pattern holds: Erasmus MC (medical) shows the largest gap, while SSH-focused faculties show the smallest.

Summary

SHARE-14B at 15% of training is already substantially more capable than the smaller SHARE-4B (evaluation perplexity 5.26 vs 11.94) and approaches the performance of Phi-4 on SSH-relevant token prediction at a small fraction of its training cost.

Model Examination

Memorization probes using deterministic generation from texts in the pretraining corpus — including data seen most recently — show that SHARE-14B does not reproduce copyrighted content. The few instances of memorization observed correspond only to disclaimers and standard headers. Early experiments with instruction-tuned variants further suggest that, because the training data deliberately excludes domains such as cybersecurity, biological weapons, and CSAM, classical safety risks are limited; the model also tends to default to harm-reducing framings when prompted with SSH-relevant harmful queries.

Environmental Impact

  • Hardware Type: 8× NVIDIA A100 80GB (Saturn Cloud) and 20× NVIDIA H100 (5 nodes × 4 GPUs, Snellius supercomputer)
  • Hours used: ~167 hours on A100s + ~225 hours on H100s for the current checkpoint
  • Cloud Provider: Saturn Cloud (initial phase) and SURF / Snellius supercomputer (current phase)
  • Compute Region: United States (Saturn Cloud, initial phase only); Netherlands (Snellius)
  • Carbon Emitted: Not yet precisely measured for the 14B model

The project applied Chinchilla scaling laws to budget compute and used efficiency techniques (mixed precision, torch.compile, FlashAttention-2, Liger Kernel, gradient checkpointing) to reduce energy use.

Citation

BibTeX:

@misc{gonçalves2026sharesocialhumanitiesairesearch,
      title={SHARE: Social-Humanities AI for Research and Education}, 
      author={João Gonçalves and Sonia de Jager and Petr Knoth and David Pride and Nick Jelicic},
      year={2026},
      eprint={2604.11152},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.11152}, 
}

APA:

Gonçalves, J., de Jager, S., Knoth, P., Pride, D., & Jelicic, N. (2026). SHARE: Social-humanities AI for research and education. arXiv. https://arxiv.org/abs/2604.11152

Privacy statement

Personal data, such as author names, may be included in the training documents for SHARE. We use legitimate interest as the legal basis for processing this data under the EU's GDPR. The full privacy statement can be consulted here: https://surfdrive.surf.nl/s/gFnxgL6f5jer8yy

Glossary

  • SSH: Social Sciences and Humanities
  • MIRROR: Model Interface for Reflective Research Output Revisions — the user interface that displays per-token surprisal from SHARE rather than generating text
  • Surprisal: Negative log probability of an observed token under the model
  • Prior-corrected accuracy: Cloze accuracy adjusted to discount correct guesses arising from token frequency priors
  • FoS: Field of Science (AllenAI classifier used for disciplinary labelling)
  • RAIL: Responsible AI License

More Information

This model is released as part of an intermediate technical report and is intended to invite feedback from the SSH and ML communities. Companion resources include the SHARE-4B model and the MIRROR interface.

Model Card Authors

João Gonçalves

Model Card Contact

ferreiragoncalves@eshcc.eur.nl
