# Scientific-SLM: Scientific Small Language Model (SLM)

A domain-focused, decoder-only Transformer trained entirely from scratch on scientific research papers. The model is designed for tasks involving scientific writing, structure, and terminology.
## Overview
This model is part of an exploration into building scientific-specialized language models without relying on existing pretrained checkpoints. It is trained purely from raw text extracted from 45,000+ arXiv papers, covering physics, mathematics, computer science, astronomy, and related domains.
The model is intended for:
- scientific text generation
- summarization of research content
- exploratory reasoning tasks
- experimentation with custom architectures and tokenizers
## Model Architecture
A custom GPT-style, decoder-only Transformer implemented from scratch in PyTorch:
- 239M parameters
- 12 Transformer layers
- 12 attention heads
- 768-dim hidden size
- 1024-token context length
- Causal self-attention
- Feed-forward MLP with residual pathways
- Learned token + positional embeddings
- Custom BPE tokenizer (~32k vocab)
The implementation includes hand-written attention masks, embedding layers, residual connections, and the full autoregressive decoding logic.
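The configuration listed above (12 layers, 12 heads, 768-dim hidden size, 1024-token context, ~32k vocabulary) can be sketched as a minimal PyTorch decoder-only model. This is an illustrative reconstruction, not the repository's actual code: class names, the use of `nn.MultiheadAttention`, and details such as weight tying (which affect the exact parameter count) are assumptions.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One decoder layer: causal self-attention + MLP, each with a residual path."""
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.ones(max_len, max_len, dtype=torch.bool), diagonal=1)
        self.register_buffer("causal_mask", mask)

    def forward(self, x):
        T = x.size(1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=self.causal_mask[:T, :T])
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x

class ScientificSLM(nn.Module):
    """Decoder-only LM with learned token + positional embeddings."""
    def __init__(self, vocab_size=32000, d_model=768, n_heads=12,
                 n_layers=12, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            [Block(d_model, n_heads, max_len) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.ln_f(x))  # (B, T, vocab_size) logits
```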
## Training Data
The model was trained on a custom dataset constructed from:
- 45,000+ arXiv PDFs, extracted and cleaned
- PDF → text → chunking → tokenization
- 1024-token sliding windows
- Deduplicated scientific text
- Sharded numpy token dataset optimized for LM training
The dataset is not distributed here due to licensing considerations, but the preprocessing pipeline is documented in the accompanying notebook.
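The windowing and sharding steps above can be sketched as follows. This is a minimal illustration, not the documented pipeline; function names, the `uint16` dtype, and the shard layout are assumptions.

```python
import numpy as np

def make_windows(tokens, window=1024, stride=1024):
    """Cut a flat token stream into fixed-length training windows.

    With stride == window the windows are non-overlapping; a smaller
    stride gives overlapping ("sliding") windows.
    """
    tokens = np.asarray(tokens, dtype=np.uint16)  # uint16 fits a ~32k vocab
    windows = [tokens[s:s + window]
               for s in range(0, len(tokens) - window + 1, stride)]
    if not windows:
        return np.empty((0, window), dtype=np.uint16)
    return np.stack(windows)

def write_shards(windows, shard_size, out_prefix):
    """Save windows as fixed-size numpy shards: out_prefix_0000.npy, ..."""
    paths = []
    for i in range(0, len(windows), shard_size):
        path = f"{out_prefix}_{i // shard_size:04d}.npy"
        np.save(path, windows[i:i + shard_size])
        paths.append(path)
    return paths
```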
## Training Details
Training was conducted on Kaggle GPUs with:
- Mixed-precision (fp16)
- Gradient accumulation
- Cosine learning rate schedule with warmup
- Checkpointing and resume functionality
- Perplexity-based validation on held-out shards
No external pretrained model, embedding, or checkpoint was used: the tokenizer, architecture, data pipeline, and training loop were all implemented manually.
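The schedule and accumulation logic above can be sketched like this. The hyperparameters (peak/minimum learning rate, warmup length, accumulation factor) are illustrative placeholders, not the values actually used; on CUDA one would normally also wrap the backward pass in a `GradScaler` for fp16.

```python
import math
import torch
import torch.nn.functional as F

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=100, total=1000):
    """Cosine decay with linear warmup (hyperparameters are illustrative)."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def train_steps(model, batches, opt, accum=4, total_steps=1000, device="cpu"):
    """Language-model training with gradient accumulation and the schedule above.

    fp16 autocast is enabled only on CUDA; a torch.amp GradScaler would
    normally be added there as well.
    """
    model.train()
    step = 0
    opt.zero_grad(set_to_none=True)
    for i, (x, y) in enumerate(batches):
        with torch.autocast(device_type=device, enabled=(device == "cuda")):
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        (loss / accum).backward()  # divide so accumulated gradients average
        if (i + 1) % accum == 0:
            for g in opt.param_groups:  # apply the scheduled learning rate
                g["lr"] = lr_at(step, total=total_steps)
            opt.step()
            opt.zero_grad(set_to_none=True)
            step += 1
```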
## Intended Use
This model is suitable for:
- generating scientific-style prose
- summarizing research papers
- experimenting with domain-specific LMs
- studying training dynamics of from-scratch transformer models
- educational and research applications
It is not intended for deployment in critical systems, and its output should not be treated as a source of factual scientific claims without verification.
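As a usage sketch, text generation from any decoder-only model with this interface (token ids in, per-position vocabulary logits out) follows the standard autoregressive loop. The function name and sampling defaults below are illustrative, not taken from the repository.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens=50, temperature=0.8,
             top_k=50, max_len=1024):
    """Sample tokens one at a time, feeding the running sequence back in."""
    for _ in range(max_new_tokens):
        # Only the last max_len tokens fit in the context window.
        logits = model(idx[:, -max_len:])[:, -1, :] / temperature
        if top_k is not None:
            # Keep only the top_k most likely tokens before sampling.
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float("inf")
        probs = torch.softmax(logits, dim=-1)
        idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)
    return idx
```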
## Limitations
- Trained only on arXiv text; it may not generalize to domains outside scientific writing.
- May produce inaccurate or speculative statements.
- Does not perform factual reasoning or source citation.
- Limited context window (1024 tokens).
- Version 1 model; improvements are planned.
## Roadmap
Future versions will explore:
- larger scientific corpora (PubMed, textbooks, scholarly databases)
- rotary positional embeddings
- RMSNorm and FlashAttention-2
- extended context length
- scientific QA and reasoning benchmarks
- scaling models to 500Mβ1B parameters
## Notebook
The full training notebook, including tokenizer, data pipeline, and architecture implementation, is available here:
GitHub: https://github.com/dipeshlohchab/Research_Papers_SLM/tree/main
## Contact
For discussion or collaboration, reach out on LinkedIn: https://www.linkedin.com/in/dipesh-lohchab/