Scientific-SLM: Scientific Small Language Model

A domain-focused decoder-only transformer trained entirely from scratch on scientific research papers. This model is designed for tasks involving scientific writing, structure, and terminology.


🔬 Overview

This model is part of an exploration into building scientific-specialized language models without relying on existing pretrained checkpoints. It is trained purely from raw text extracted from 45,000+ arXiv papers, covering physics, mathematics, computer science, astronomy, and related domains.

The model is intended for:

  • scientific text generation
  • summarization of research content
  • exploratory reasoning tasks
  • experimentation with custom architectures and tokenizers

🧠 Model Architecture

A custom GPT-style decoder-only Transformer implemented from scratch in PyTorch:

  • 239M parameters
  • 12 Transformer layers
  • 12 attention heads
  • 768-dim hidden size
  • 1024-token context length
  • Causal self-attention
  • Feed-forward MLP with residual pathways
  • Learned token + positional embeddings
  • Custom BPE tokenizer (~32k vocab)

The implementation includes hand-written attention masks, embedding layers, residual connections, and the full autoregressive decoding logic.


📚 Training Data

The model was trained on a custom dataset constructed from:

  • 45,000+ arXiv PDFs, extracted and cleaned
  • PDF β†’ text β†’ chunking β†’ tokenization
  • 1024-token sliding windows
  • Deduplicated scientific text
  • Sharded numpy token dataset optimized for LM training

The dataset is not distributed here due to licensing considerations, but the preprocessing pipeline is documented in the accompanying notebook.
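For illustration, the sliding-window sharding step might look like the sketch below. The function name, shard size, and the `uint16` dtype are assumptions for this example, not the pipeline's actual code:

```python
import numpy as np

def shard_tokens(token_ids, window=1024, stride=1024, shard_size=1_000_000,
                 out_prefix="shard"):
    """Cut a long token stream into fixed-length windows and save NumPy shards.

    stride == window gives non-overlapping 1024-token windows; a smaller
    stride would produce overlapping windows instead.
    """
    ids = np.asarray(token_ids, dtype=np.uint16)  # a ~32k vocab fits in uint16
    windows = [ids[i:i + window]
               for i in range(0, len(ids) - window + 1, stride)]
    per_shard = max(1, shard_size // window)      # windows per shard file
    paths = []
    for s in range(0, len(windows), per_shard):
        arr = np.stack(windows[s:s + per_shard])  # (n_windows, window)
        path = f"{out_prefix}_{s // per_shard:04d}.npy"
        np.save(path, arr)
        paths.append(path)
    return paths
```

At training time each `.npy` shard can be loaded (or memory-mapped) and consumed as ready-made (input, target) windows.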


⚙️ Training Details

Training was conducted on Kaggle GPUs with:

  • Mixed-precision (fp16)
  • Gradient accumulation
  • Cosine learning rate schedule with warmup
  • Checkpointing and resume functionality
  • Perplexity-based validation on held-out shards

No external pretrained model, embedding, or checkpoint was used. Every component (tokenizer, architecture, data pipeline, training loop) was implemented manually.
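A minimal sketch of how these pieces (linear warmup + cosine decay, gradient accumulation, fp16 autocast on GPU) typically fit together; the hyperparameter values and function names are illustrative, not the run's actual settings:

```python
import math
from contextlib import nullcontext

import torch
import torch.nn.functional as F

def lr_at(step, max_lr=3e-4, warmup=1000, total=100_000, min_lr=3e-5):
    """Cosine learning-rate schedule with linear warmup (values illustrative)."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def train_step(model, optimizer, scaler, micro_batches, accum_steps, device="cpu"):
    """One optimizer step, accumulating gradients over several micro-batches."""
    optimizer.zero_grad(set_to_none=True)
    for x, y in micro_batches:
        # fp16 autocast only on CUDA; full precision on CPU.
        ctx = (torch.autocast("cuda", dtype=torch.float16)
               if device == "cuda" else nullcontext())
        with ctx:
            logits = model(x)  # (B, T, vocab)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        # Divide so accumulated gradients average over micro-batches.
        scaler.scale(loss / accum_steps).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()  # last micro-batch loss, for logging
```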


🧪 Intended Use

This model is suitable for:

  • generating scientific-style prose
  • summarizing research papers
  • experimenting with domain-specific LMs
  • studying training dynamics of from-scratch transformer models
  • educational and research applications

Not intended for deployment in critical systems or as a source of factual scientific claims without verification.
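For generation experiments, a generic top-k sampling loop such as the one below can drive any model that maps (B, T) token ids to (B, T, vocab) logits. This is a sketch; the defaults are illustrative and it is not the repository's decoding code:

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens=50, context=1024,
             temperature=0.8, top_k=50):
    """Autoregressive top-k sampling from a causal LM (illustrative defaults)."""
    for _ in range(max_new_tokens):
        # Crop to the context window and take logits at the last position.
        logits = model(idx[:, -context:])[:, -1, :] / temperature
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float("-inf")  # keep only top-k tokens
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample one token
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```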


🚫 Limitations

  • Trained only on arXiv text; does not generalize to domains outside scientific writing.
  • May produce inaccurate or speculative statements.
  • Does not perform factual reasoning or source citation.
  • Limited context window (1024 tokens).
  • Version 1 model β€” improvements planned.

📌 Roadmap

Future versions will explore:

  • larger scientific corpora (PubMed, textbooks, scholarly databases)
  • rotary positional embeddings
  • RMSNorm and FlashAttention-2
  • extended context length
  • scientific QA and reasoning benchmarks
  • scaling models to 500M–1B parameters

📎 Notebook

The full training notebook, including tokenizer, data pipeline, and architecture implementation, is available here:

🔗 GitHub: https://github.com/dipeshlohchab/Research_Papers_SLM/tree/main


📬 Contact

For discussion or collaboration: 🔗 LinkedIn: https://www.linkedin.com/in/dipesh-lohchab/
