# Scientific-SLM: Scientific Small Language Model (SLM)

A domain-focused, decoder-only Transformer trained entirely from scratch on scientific research papers. The model is designed for tasks involving scientific writing, structure, and terminology.
## Overview
This model is part of an exploration into building scientific-specialized language models without relying on existing pretrained checkpoints. It is trained purely from raw text extracted from 45,000+ arXiv papers, covering physics, mathematics, computer science, astronomy, and related domains.
The model is intended for:
- scientific text generation
- summarization of research content
- exploratory reasoning tasks
- experimentation with custom architectures and tokenizers
## Model Architecture
A custom GPT-style, decoder-only Transformer implemented from scratch in PyTorch:
- 239M parameters
- 12 Transformer layers
- 12 attention heads
- 768-dim hidden size
- 1024-token context length
- Causal self-attention
- Feed-forward MLP with residual pathways
- Learned token + positional embeddings
- Custom BPE tokenizer (~32k vocab)
The implementation includes hand-written attention masks, embedding layers, residual connections, and the full autoregressive decoding logic.
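The configuration listed above (12 layers, 12 heads, 768-dim hidden size, 1024-token context, ~32k vocabulary) can be sketched as a minimal PyTorch decoder-only model. This is an illustrative reconstruction, not the repository's actual code: class names, the use of `nn.MultiheadAttention`, and details such as weight tying (which affect the exact parameter count) are assumptions.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One decoder layer: causal self-attention + MLP, each with a residual path."""
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.ones(max_len, max_len, dtype=torch.bool), diagonal=1)
        self.register_buffer("causal_mask", mask)

    def forward(self, x):
        T = x.size(1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=self.causal_mask[:T, :T])
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x

class ScientificSLM(nn.Module):
    """Decoder-only LM with learned token + positional embeddings."""
    def __init__(self, vocab_size=32000, d_model=768, n_heads=12,
                 n_layers=12, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            [Block(d_model, n_heads, max_len) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.ln_f(x))  # (B, T, vocab_size) logits
```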
## Training Data
The model was trained on a custom dataset constructed from:
- 45,000+ arXiv PDFs, extracted and cleaned
- PDF → text → chunking → tokenization
- 1024-token sliding windows
- Deduplicated scientific text
- Sharded numpy token dataset optimized for LM training
The dataset is not distributed here due to licensing considerations, but the preprocessing pipeline is documented in the accompanying notebook.
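The windowing and sharding steps above can be sketched as follows. This is a minimal illustration, not the documented pipeline; function names, the `uint16` dtype, and the shard layout are assumptions.

```python
import numpy as np

def make_windows(tokens, window=1024, stride=1024):
    """Cut a flat token stream into fixed-length training windows.

    With stride == window the windows are non-overlapping; a smaller
    stride gives overlapping ("sliding") windows.
    """
    tokens = np.asarray(tokens, dtype=np.uint16)  # uint16 fits a ~32k vocab
    windows = [tokens[s:s + window]
               for s in range(0, len(tokens) - window + 1, stride)]
    if not windows:
        return np.empty((0, window), dtype=np.uint16)
    return np.stack(windows)

def write_shards(windows, shard_size, out_prefix):
    """Save windows as fixed-size numpy shards: out_prefix_0000.npy, ..."""
    paths = []
    for i in range(0, len(windows), shard_size):
        path = f"{out_prefix}_{i // shard_size:04d}.npy"
        np.save(path, windows[i:i + shard_size])
        paths.append(path)
    return paths
```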
## Training Details
Training was conducted on Kaggle GPUs with:
- Mixed-precision (fp16)
- Gradient accumulation
- Cosine learning rate schedule with warmup
- Checkpointing and resume functionality
- Perplexity-based validation on held-out shards
No external pretrained model, embedding, or checkpoint was used: the tokenizer, architecture, data pipeline, and training loop were all implemented manually.
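The schedule and accumulation logic above can be sketched like this. The hyperparameters (peak/minimum learning rate, warmup length, accumulation factor) are illustrative placeholders, not the values actually used; on CUDA one would normally also wrap the backward pass in a `GradScaler` for fp16.

```python
import math
import torch
import torch.nn.functional as F

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=100, total=1000):
    """Cosine decay with linear warmup (hyperparameters are illustrative)."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def train_steps(model, batches, opt, accum=4, total_steps=1000, device="cpu"):
    """Language-model training with gradient accumulation and the schedule above.

    fp16 autocast is enabled only on CUDA; a torch.amp GradScaler would
    normally be added there as well.
    """
    model.train()
    step = 0
    opt.zero_grad(set_to_none=True)
    for i, (x, y) in enumerate(batches):
        with torch.autocast(device_type=device, enabled=(device == "cuda")):
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        (loss / accum).backward()  # divide so accumulated gradients average
        if (i + 1) % accum == 0:
            for g in opt.param_groups:  # apply the scheduled learning rate
                g["lr"] = lr_at(step, total=total_steps)
            opt.step()
            opt.zero_grad(set_to_none=True)
            step += 1
```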
## Intended Use
This model is suitable for:
- generating scientific-style prose
- summarizing research papers
- experimenting with domain-specific LMs
- studying training dynamics of from-scratch transformer models
- educational and research applications
It is not intended for deployment in critical systems, and its output should not be treated as a source of factual scientific claims without verification.
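As a usage sketch, text generation from any decoder-only model with this interface (token ids in, per-position vocabulary logits out) follows the standard autoregressive loop. The function name and sampling defaults below are illustrative, not taken from the repository.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens=50, temperature=0.8,
             top_k=50, max_len=1024):
    """Sample tokens one at a time, feeding the running sequence back in."""
    for _ in range(max_new_tokens):
        # Only the last max_len tokens fit in the context window.
        logits = model(idx[:, -max_len:])[:, -1, :] / temperature
        if top_k is not None:
            # Keep only the top_k most likely tokens before sampling.
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float("inf")
        probs = torch.softmax(logits, dim=-1)
        idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)
    return idx
```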
## Limitations
- Trained only on arXiv text; it may not generalize to domains outside scientific writing.
- May produce inaccurate or speculative statements.
- Does not perform factual reasoning or source citation.
- Limited context window (1024 tokens).
- Version 1 model; improvements are planned.
## Roadmap
Future versions will explore:
- larger scientific corpora (PubMed, textbooks, scholarly databases)
- rotary positional embeddings
- RMSNorm and FlashAttention-2
- extended context length
- scientific QA and reasoning benchmarks
- scaling models to 500Mβ1B parameters
## Notebook
The full training notebook, including tokenizer, data pipeline, and architecture implementation, is available here:
GitHub: https://github.com/dipeshlohchab/Research_Papers_SLM/tree/main
## Contact
For discussion or collaboration, reach out on LinkedIn: https://www.linkedin.com/in/dipesh-lohchab/