LLM-350M-Instruct
This is a 350M-parameter language model trained entirely from scratch as a personal learning project: pretraining, finetuning, evaluation, and all.
I'm not a researcher. I don't work at a big lab. I just wanted to understand how LLMs actually work by building one. The whole thing ran on a single rented GPU for under $500, and took a few weeks of evenings and weekends. Everything I learned along the way is documented here.
If you're curious about how LLMs are built, or want a small open model to experiment with, hopefully this is useful. The code is all open source and the training details are fully documented; nothing is hand-wavy.
What this is: A small instruction-following model. It can answer questions, summarize text, explain concepts, and follow simple instructions reasonably well. It's not going to replace anything, but it works, and it was built from scratch.
What this isn't: A state-of-the-art model. It's 350M parameters trained on a hobbyist budget. Manage your expectations accordingly, and check the Limitations section.
- Code: github.com/sandbreak80/llm-350m
- Training logs (pretrain): W&B
- Training logs (finetune): W&B
- Benchmark evals: W&B
Model Architecture
The architecture is designed to be a modern improvement over GPT-2/nanoGPT-style models at the same parameter count. Every component was chosen based on what the research literature shows works at this scale.
Comparison with Reference Model
We use Apex-1-Instruct-350M as a low-bar reference: same dataset, same approximate size, but a 2023-era GPT-2-style architecture. Our architecture incorporates every major improvement from the LLaMA/Mistral lineage:
| Component | GPT-2 / Apex-1 (Reference) | This Model | Why |
|---|---|---|---|
| Positional encoding | Learned absolute | RoPE | Better length generalization, no wasted embedding parameters |
| Normalization | LayerNorm (post-norm) | RMSNorm (pre-norm) | More stable training, cleaner gradient flow |
| Activation | GELU | SwiGLU | ~5-10% better loss at same parameter count (PaLM, LLaMA) |
| Attention | Multi-Head (MHA) | Grouped Query Attention | 4× fewer KV parameters, faster inference, minimal quality loss |
| Context length | 1,024 tokens | 2,048 tokens | 2× longer context at the same compute via RoPE efficiency |
| Attention kernel | Standard | Flash Attention 2 (PyTorch SDPA) | Memory-efficient, no implementation overhead |
| Embedding tying | Not tied | Tied (token emb = LM head) | Saves ~51M parameters, improves training signal |
These changes make our architecture essentially a small LLaMA-style model rather than a GPT-2.
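To make the activation change concrete, a SwiGLU feed-forward block at this model's dimensions looks like the following. This is a minimal sketch at the hyperparameters listed below (1,024 hidden, 2,816 intermediate), not the repo's exact module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated FFN as in PaLM/LLaMA: down(silu(gate(x)) * up(x)).
    Illustrative sketch; dimensions follow the hyperparameter table."""
    def __init__(self, dim: int = 1024, hidden: int = 2816):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)  # gating branch
        self.up = nn.Linear(dim, hidden, bias=False)    # linear branch
        self.down = nn.Linear(hidden, dim, bias=False)  # projection back to dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLU()
out = ffn(torch.randn(2, 1024))  # shape preserved: (2, 1024)
```

Because the gate doubles the up-projection parameters, the intermediate size is set to ~2.75× hidden rather than the classic 4×, keeping the block's parameter count comparable to a GELU FFN.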
Hyperparameters
| Parameter | Value |
|---|---|
| Total parameters | ~322M (350M-class) |
| Non-embedding parameters | 270,582,784 |
| Layers (`n_layers`) | 24 |
| Hidden dimension (`n_embd`) | 1,024 |
| Query heads (`n_heads`) | 16 |
| KV heads (`n_kv_heads`) | 4 (GQA ratio 4:1) |
| Head dimension | 64 |
| FFN intermediate size | 2,816 (~2.75× hidden, SwiGLU accounts for gating) |
| Context length (`block_size`) | 2,048 |
| Vocabulary size | 50,304 (GPT-2, padded to multiple of 64) |
| RoPE theta | 10,000 |
| Bias terms | None (modern practice) |
| Dropout (pretrain) | 0.0 |
| Dropout (finetune) | 0.1 |
Parameter Breakdown
| Component | Parameters |
|---|---|
| Token embeddings (tied to LM head) | 51,511,296 |
| Attention (Q/K/V/O) × 24 layers | ~63M |
| FFN (gate/up/down) × 24 layers | ~208M |
| RMSNorm weights | ~50K |
| Total | ~322M |
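As a sanity check, the exact non-embedding count can be recomputed directly from the hyperparameter table:

```python
# Recomputing the parameter breakdown from the hyperparameter table.
vocab, d, n_layers, ffn_dim = 50_304, 1_024, 24, 2_816
kv_dim = 4 * 64  # 4 KV heads x head_dim 64 (GQA)

emb = vocab * d                                  # token embeddings, tied to LM head
attn = n_layers * (2 * d * d + 2 * d * kv_dim)   # Q and O full-width; K and V shrunk by GQA
ffn = n_layers * 3 * d * ffn_dim                 # gate, up, down projections
norms = (2 * n_layers + 1) * d                   # two RMSNorms per block + final norm

non_embedding = attn + ffn + norms
print(non_embedding)        # 270582784 -- matches the table exactly
print(emb + non_embedding)  # 322094080 total with tied embeddings
```

Note how GQA shows up directly in the arithmetic: K and V project to only 256 dims instead of 1,024, which is the "4× fewer KV parameters" claimed in the architecture comparison.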
Training Data
Pretraining Dataset: FineWeb-Edu
- Source: `HuggingFaceFW/fineweb-edu`, `sample-10BT` split
- Size on disk: 18.4 GiB (tokenized binary, uint16) + 97 MiB validation
- Token count: ~9.9B training tokens, ~50M validation tokens
- Format: streamed from HuggingFace, tokenized with the GPT-2 tiktoken encoder, written to binary `uint16` files for memory-mapped training
- Quality: FineWeb-Edu is filtered CommonCrawl data scored by an educational-quality classifier. It's higher quality than raw web text and has been shown to produce better models per token than general web corpora.
- Train/val split: ~99.5% train / 0.5% val, split by document
Why FineWeb-Edu: Same dataset as the Apex-1 reference model, enabling direct comparison. The educational content bias produces stronger reasoning and writing quality than unfiltered web text at this token count.
Instruction Finetuning Dataset: Alpaca-Cleaned
- Source: `yahma/alpaca-cleaned`
- Size: ~52,000 instruction-response pairs
- Format: Alpaca template (`### Instruction:` / `### Input:` / `### Response:`)
- Preprocessing: tokenized with a loss mask – only response tokens contribute to the loss (prompt tokens masked to `-100`). This prevents the model from over-fitting to instruction formatting.
- Anti-forgetting blend: 2,500 FineWeb-Edu samples mixed in at a 4:1 Alpaca:FineWeb ratio. This preserves base language-modeling quality and prevents catastrophic forgetting of pretraining knowledge – a technique validated in the Apex-1 reference.

Data pipeline: `src/data/prepare.py` streams both datasets, tokenizes, and writes binary JSONL (finetuning) and binary flat files (pretraining). All data is cached to S3 to avoid re-tokenization on restarts.
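The loss-masking scheme can be sketched in a few lines. This is an assumed helper for illustration, not the repo's exact code: prompt tokens get label `-100`, which PyTorch's cross-entropy ignores, so only response tokens drive the loss.

```python
# Illustrative loss-mask construction (assumed helper, not the repo's code).
IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross_entropy

def build_masked_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

ids, labels = build_masked_labels([10, 11, 12], [20, 21])
# ids    -> [10, 11, 12, 20, 21]
# labels -> [-100, -100, -100, 20, 21]
```

With this mask, the model is never rewarded for reproducing the `### Instruction:` boilerplate, only for generating the response given it.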
Training Procedure
Pretraining
| Hyperparameter | Value |
|---|---|
| Max iterations | 60,000 |
| Effective batch size | 262,144 tokens (batch=4 × grad_accum=32 × seq_len=2048) |
| Peak learning rate | 6e-4 |
| Min learning rate | 6e-5 |
| LR schedule | Cosine decay with 2,000-iter linear warmup |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1, fused CUDA) |
| Gradient clipping | 1.0 |
| Precision | bfloat16 (no gradient scaler needed) |
| Tokens trained on | ~15.7B (262,144 × 60,000) |
| `torch.compile` | Disabled (hangs on custom RoPE/GQA ops) |
| Checkpoint interval | Every 500 iterations (spot instance resilience) |
Total compute: ~75 hours on 1× NVIDIA L40S (46GB VRAM). Throughput: ~35,000–36,000 tokens/sec.
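The schedule in the table above is simple to reproduce: linear warmup for 2,000 iterations, then cosine decay from 6e-4 down to 6e-5. A minimal sketch (assumed shape, not the repo's exact function):

```python
import math

def lr_at(it, max_lr=6e-4, min_lr=6e-5, warmup=2_000, max_iters=60_000):
    """Cosine decay with linear warmup, using the pretraining table's values."""
    if it < warmup:
        return max_lr * (it + 1) / warmup          # linear warmup to peak
    t = (it - warmup) / (max_iters - warmup)       # decay progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

# Peak at the end of warmup, min_lr at the final iteration:
print(lr_at(1_999), lr_at(60_000))  # 0.0006 6e-05
```

The 10:1 ratio between peak and minimum LR follows the common GPT/LLaMA pretraining convention.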
Instruction Finetuning (SFT)
| Hyperparameter | Value |
|---|---|
| Max iterations | 1,500 (~3.7 epochs over Alpaca) |
| Effective batch size | 262,144 tokens (same as pretrain) |
| Peak learning rate | 2e-5 |
| Min learning rate | 3e-6 |
| LR schedule | Cosine decay, no warmup |
| Dropout | 0.1 |
| Loss masking | Response tokens only (prompts masked) |
| Final val loss | 1.7189 (vs pretrain 2.708; Δ = −0.989 nats) |
Total compute: ~2.5 hours on 1× NVIDIA L40S. Val loss converged and plateaued cleanly at iter 1,200–1,500.
Training Infrastructure
Hardware
- Instance: AWS EC2 `g6e.xlarge` (on-demand)
- GPU: NVIDIA L40S, 46GB VRAM
- GPU utilization: 100% throughout (~35K tok/s)
- GPU temperature: 63–80°C (stable)
- vCPUs: 4 (used for data loading, CPU-side benchmark evals)
- Memory: 128GB RAM
- Storage: 150GB EBS gp3 + S3 for checkpoint durability
Checkpointing & Resilience
Training infrastructure was built to production standards despite the hobby-scale budget:
- Checkpoint every 500 iterations – minimizes lost work on spot interruptions
- S3 sync every 5 minutes via cron – checkpoints survive instance termination
- SIGTERM handler – catches AWS spot interruption signals and saves an emergency checkpoint before shutdown
- Auto-resume from `latest.pt` – training restarts exactly where it left off with no manual intervention
- Non-blocking CPU eval pipeline – benchmark evals (HellaSwag, LAMBADA) run on CPU via a cron watcher while training occupies the GPU full-time; results log to W&B automatically
The run survived several spot interruptions with minimal iteration loss.
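The SIGTERM pattern is worth sketching, since it is the piece that turns a spot interruption from lost work into a clean checkpoint. This is an assumed shape, not the repo's exact code: the handler only sets a flag, and the training loop polls it between steps.

```python
import signal

# Illustrative spot-interruption handler (assumed shape, not the repo's code).
# The handler does no I/O itself -- it just flips a flag that the training
# loop checks after each optimizer step, so the checkpoint save happens at
# a safe point rather than mid-step.
interrupted = {"flag": False}

def _handle_sigterm(signum, frame):
    interrupted["flag"] = True

signal.signal(signal.SIGTERM, _handle_sigterm)

# Inside the training loop, after each optimizer step:
# if interrupted["flag"]:
#     torch.save({"model": model.state_dict(), "iter": it}, "emergency.pt")
#     break
```

Keeping the handler side-effect-free matters: saving a multi-hundred-MB checkpoint from inside a signal handler is fragile, while polling a flag guarantees the save happens with the model in a consistent state.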
Cost
Total AWS spend for this training run: ~$218 (within a $1,000 budget).

| Item | Cost |
|---|---|
| EC2 compute (pretraining) | ~$140 |
| EC2 compute (finetuning) | ~$6 |
| EBS storage | ~$5 |
| S3 storage + transfer | ~$2 |
| Earlier spot experiments | ~$65 |
| Total | ~$218 |
Scaling Law Analysis
We tracked our val loss against the Chinchilla scaling law prediction throughout training. Using the Hoffmann et al. (2022) formula:
L(N, D) = E + A/N^α + B/D^β

with E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28.
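With these constants, the prediction at any point in training is a one-liner (using N = 350e6 parameters and D = tokens seen):

```python
# Evaluating the Hoffmann et al. (2022) fit with the constants above.
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A / N**alpha + B / D**beta

pred = chinchilla_loss(350e6, 11.8e9)  # ~2.82, matching the 11.8B-token row
```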
| Tokens (B) | Actual Val Loss | Chinchilla Predicted | Δ |
|---|---|---|---|
| 4.2 | 2.9507 | 3.0260 | -0.075 |
| 6.0 | 2.8964 | 2.9458 | -0.049 |
| 7.9 | 2.8475 | 2.8920 | -0.044 |
| 9.7 | 2.8031 | 2.8523 | -0.049 |
| 11.8 | 2.7568 | 2.8157 | -0.059 |
The model consistently beats Chinchilla predictions by ~0.05 nats. This is expected: Chinchilla was fit on GPT-style models, and our architectural improvements (SwiGLU, RMSNorm, RoPE) extract more signal per token.

Over-training: At 15.7B tokens, we train 2.25× past the Chinchilla-optimal ~7B tokens for this model size. This is intentional. Chinchilla optimizes for training compute efficiency: it tells you when to stop training if you care about minimizing FLOPs. But for a model you'll serve at inference time, over-training a smaller model is strictly better: you get higher quality at lower serving cost. This is the core insight behind LLaMA-1/2 and Mistral, and we apply it here at 350M scale. The loss curve confirmed the model had not plateaued at 15.7B tokens; additional tokens would have continued to help.
Benchmark Evaluation
All benchmarks evaluated using src/eval/run_eval.py on CPU (4-vCPU, no GPU). Scores use greedy loglikelihood ranking over answer choices (standard LM-eval-harness methodology). Pretrain evals were run automatically via cron watcher at every 5,000-iteration checkpoint.
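The ranking procedure is easy to sketch: append each candidate answer to the prompt, sum the model's log-probabilities over the candidate's tokens, and pick the argmax. A minimal illustration, where `score_fn` stands in for the model's log-likelihood (the toy scorer here is purely for demonstration):

```python
def rank_choices(score_fn, context, choices):
    """Greedy log-likelihood ranking over answer choices: the choice whose
    continuation gets the highest score wins, as in LM-eval-harness."""
    scores = [score_fn(context, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)

# Toy stand-in scorer: counts word overlap with the context.
def toy_score(context, choice):
    return sum(word in context.split() for word in choice.split())

best = rank_choices(toy_score, "the cat sat on the", ["mat by the cat", "quantum flux"])
# best -> 0 (the first choice scores higher under the toy scorer)
```

With a real model, `score_fn` would sum per-token log-probabilities of the choice conditioned on the context; accuracy is then the fraction of questions where the gold answer ranks first.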
Pretraining Progress
| Checkpoint (iter) | Tokens Seen | Val Loss | HellaSwag (500) | LAMBADA (1000) |
|---|---|---|---|---|
| 37,500 | 9.8B | 2.8031 | 34.6% | 31.0% |
| 40,000 | 10.5B | 2.7852 | 35.2% | 31.2% |
| 45,000 | 11.8B | 2.7568 | 37.0% | 30.4% |
| 60,000 (final) | 15.7B | 2.7081 | 36.6% | 34.5% |
Post-Finetuning (Instruct Model)
Evaluated on checkpoints/finetune/best.pt (iter 1,400, val_loss 1.7189):
| Benchmark | Score | Notes |
|---|---|---|
| HellaSwag (500 samples) | 38.40% | 4-choice commonsense completion; random = 25% |
| LAMBADA (1000 samples) | 34.00% | Last-word prediction accuracy |
| ARC-Easy (500 samples) | 58.20% | 4-choice elementary science QA; random = 25% |
| ARC-Challenge (500 samples) | 27.76% | 4-choice harder science QA; random = 25% |
| WinoGrande (500 samples) | 52.80% | 2-choice commonsense pronoun resolution; random = 50% |
Reference Comparison
| Model | HellaSwag | LAMBADA | Notes |
|---|---|---|---|
| Random baseline | 25.0% | – | 4-choice |
| GPT-2 117M | ~33% | ~39% | |
| GPT-2 1.5B | ~41% | ~54% | |
| Apex-1-Instruct-350M | ~35% (est.) | – | Our reference, val_loss=2.8175 |
| This model (pretrain, 45k) | 37.0% | 30.4% | Mid-training |
| This model (pretrain, final 60k) | 36.6% | 34.5% | Final pretrain checkpoint |
| This model (instruct, final) | 38.40% | 34.00% | Beats Apex-1 reference |
We surpass Apex-1's estimated HellaSwag score mid-training and hold that lead through the full finetuned model. The LAMBADA improvement from pretrain (30.4%) to instruct (34.00%) reflects that instruction finetuning improves contextual word prediction, not just formatting.
Sample Outputs
Generated with temperature=0.7, top_p=0.9, top_k=50, max_new_tokens=256 on GPU.
Instruction: Explain the difference between supervised and unsupervised learning.
Response:
Supervised learning is a type of machine learning where the data is labeled and labeled in a way that the algorithm can understand and use it to make predictions or decisions. In this type of learning, the data is preprocessed to remove any unnecessary information, such as missing values, outliers or irrelevant features, before being used to train the model. The model is trained using labeled data, and the output of the model is then used to make predictions or decisions.
Unsupervised learning, on the other hand, is a type of machine learning where the data is not labeled or preprocessed. Instead, the data is provided to the model as input. The model is trained using unlabeled data, and the output of the model is then used to make predictions or decisions.
In summary, while supervised learning is a type of machine learning where the data is labeled and preprocessed, unsupervised learning is a type of machine learning where the data is not labeled or preprocessed, but instead, is provided to the model as input.
Instruction: What causes the seasons to change on Earth?
Response:
The Earth's seasons are caused by the tilt of its axis, which causes the sun to rise and set at different times of the year. The tilt is known as the tilt of the Earth's axis, which causes the seasons to vary seasonally.
The tilt of the Earth's axis is approximately 23.5 degrees. As the Earth rotates on its axis, the tilt changes. This tilt causes the seasons to change throughout the year.
In addition to the tilt of the Earth's axis, other factors that can influence the seasons include the Earth's tilt and the Earth's rotation. For example, the tilt of the Earth's axis is influenced by the gravitational force of the Sun and the Earth's rotation.
Instruction: Give three tips for staying focused while working from home.
Response:
Set a specific goal: Set a specific task or goal for yourself that you want to achieve. This will help you stay focused and motivated to achieve your goals.
Make a to-do list: Make a to-do list to help you stay organized. This will help you stay on track and ensure that you don't forget any important tasks.
Stay positive: Being negative can often make it difficult to stay focused. Try to stay positive by reminding yourself of your achievements and the positive things you have accomplished. This will help you stay motivated and focused.
Instruction: Summarize the following passage in one sentence. Input: The mitochondria is often referred to as the powerhouse of the cell. It is a double-membrane-bound organelle found in the cytoplasm of eukaryotic cells. Mitochondria generate most of the cell's supply of ATP, which is used as a source of chemical energy.
Response:
The mitochondria is a double-membrane-bound organelle found in the cytoplasm of eukaryotic cells. ATP is produced by the mitochondria, which is a double-membrane-bound organelle.
Instruction: Write a Python function that checks if a string is a palindrome.
Response:
```python
def check_palindrome(string):
    if string == "":
        return False
    else:
        return False
```

This function takes an input string and checks if it is a palindrome. If it is, the function returns True.

Note: The coding output above is incorrect; the model returns False unconditionally. This reflects the expected limitation of a 350M SFT-only model on code generation tasks. For reliable coding output, consider domain fine-tuning on a code-heavy dataset.
Run Locally with Ollama or LM Studio
GGUF files are included in this repo so you can run the model locally without any Python setup.
| File | Size | Notes |
|---|---|---|
| `llm-350m-instruct-q4_k_m.gguf` | 197MB | Recommended – good balance of size and quality |
| `llm-350m-instruct-q8_0.gguf` | 329MB | Higher quality, still small |
| `llm-350m-instruct-f16.gguf` | 617MB | Full precision |
Ollama
# Pull the GGUF directly (once the model is in the Ollama library)
# OR use the included Modelfile:
# 1. Download llm-350m-instruct-q4_k_m.gguf from this repo
# 2. Download the Modelfile from this repo
# 3. Run:
ollama create llm-350m-instruct -f Modelfile
ollama run llm-350m-instruct "Explain how neural networks learn"
The Modelfile in this repo configures the Alpaca prompt format and generation parameters automatically.
LM Studio
- Open LM Studio and search sandbreak80sd/llm-350m-instruct in the search bar
- Download `llm-350m-instruct-q4_k_m.gguf` (197MB)
- Load the model and start chatting
In the system prompt / prompt format settings, use Alpaca format:
### Instruction:
{prompt}
### Response:
llama.cpp (CLI)
# Download the GGUF, then:
./llama-cli -m llm-350m-instruct-q4_k_m.gguf \
--prompt "### Instruction:\nExplain the water cycle.\n\n### Response:\n" \
-n 256 --temp 0.7 --top-p 0.9 --repeat-penalty 1.1
Usage (Python / HuggingFace)
The model is exported in HuggingFace-compatible LlamaForCausalLM format with remapped weight names and GPT-2 tokenizer files.
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"sandbreak80sd/llm-350m-instruct",
torch_dtype="bfloat16",
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("sandbreak80sd/llm-350m-instruct")
Instruction Prompt Format (Alpaca)
### Instruction:
{your instruction here}
### Input:
{optional additional context}
### Response:
Generation Example
prompt = "### Instruction:\nExplain the difference between supervised and unsupervised learning.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
top_p=0.9,
top_k=50,
do_sample=True,
repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Memory Requirements
| Precision | VRAM | Notes |
|---|---|---|
| bfloat16 | ~700MB | Recommended |
| float32 | ~1.4GB | |
| 4-bit quantized | ~200MB | Via bitsandbytes |
Runs comfortably on consumer GPUs (RTX 3060 and above) or CPU inference with llama.cpp.
Codebase
All training code is open source: github.com/sandbreak80/llm-350m
src/
βββ model/
β βββ config.py # ModelConfig dataclass
β βββ model.py # LLM, RMSNorm, RoPE, GQA, SwiGLU, TransformerBlock
βββ data/
β βββ prepare.py # Dataset streaming + tokenization
βββ training/
β βββ train.py # Pretraining loop (DDP-compatible, spot-resilient)
β βββ finetune.py # SFT loop with loss masking
β βββ config.py # TrainConfig, FinetuneConfig dataclasses
βββ eval/
βββ run_eval.py # HellaSwag, LAMBADA, ARC, WinoGrande eval + W&B logging
scripts/
βββ aws_setup.sh # Instance bootstrap (installs deps, mounts EBS, pulls S3)
βββ launch_spot.sh # Spot instance launcher
βββ generate.py # Interactive inference + sample generation
βββ export_to_hf.py # Weight remapping to LlamaForCausalLM + HF push
βββ eval_watcher.sh # Cron-based eval runner (fires at 5k-iter checkpoints)
configs/
βββ pretrain_350m.yaml
βββ finetune_instruct.yaml
Key implementation details:
- RoPE applied per-head with precomputed cos/sin cache
- GQA via `repeat_interleave` – 4 KV heads expanded to 16 for attention
- Flash Attention via `F.scaled_dot_product_attention(is_causal=True)` – no separate package
- `torch.compile` disabled for all training (hangs on custom RoPE/GQA ops) – PyTorch 2.7 native ops used instead
- bfloat16 training with no gradient scaler (L40S native support)
- Data loaded via `np.memmap` – avoids loading the 18GB dataset into RAM
- Checkpoints include the full `ModelConfig` and `TrainConfig` for exact reproducibility
- Checkpoint loading requires `weights_only=False` (PyTorch 2.6+ changed the default)
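The GQA-plus-SDPA pattern from the list above can be sketched in a few lines at this model's shapes (16 query heads, 4 KV heads, head_dim 64). A minimal illustration, not the repo's exact attention module:

```python
import torch
import torch.nn.functional as F

# Shapes from the model card: 16 query heads, 4 KV heads, head_dim 64.
B, T, n_heads, n_kv_heads, head_dim = 2, 8, 16, 4, 64
q = torch.randn(B, n_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)
v = torch.randn(B, n_kv_heads, T, head_dim)

# Expand the 4 KV heads to 16 so every query head has a matching KV head.
rep = n_heads // n_kv_heads
k = k.repeat_interleave(rep, dim=1)
v = v.repeat_interleave(rep, dim=1)

# Memory-efficient attention with a causal mask, no extra package needed.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 16, 8, 64])
```

Note that only the KV *projection weights* are 4× smaller; the `repeat_interleave` expansion happens on activations at attention time, which is why GQA saves parameters and KV-cache memory without changing the attention math.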
Limitations
- Context window: 2,048 tokens. Not suitable for long-document tasks.
- Knowledge depth: 10B pretraining tokens is modest. Expect gaps in niche or technical topics.
- Reasoning: 350M parameters is below the threshold for reliable multi-step reasoning or arithmetic.
- Code generation: SFT on Alpaca-cleaned does not reliably produce correct code. Functions may be structurally valid but semantically wrong.
- Repetition: The model occasionally repeats phrases within a response, a known artifact of SFT-only training without RLHF or DPO.
- No safety alignment: SFT only, no RLHF or DPO. May produce inconsistent or unhelpful outputs on adversarial prompts.
- English only: Trained exclusively on English text.
- Best use cases: Domain fine-tuning on narrow tasks, edge/embedded deployment, educational experimentation, research baseline.
Intended Use
This model is intended for:
- Researchers and students learning LLM training from scratch
- Developers needing a small, openly documented baseline for fine-tuning experiments
- Edge deployment scenarios where model size is constrained (~700MB bfloat16)
- Domain-specific fine-tuning where a small specialized model can match larger general models
Citation
@misc{llm350m2026,
title = {LLM-350M-Instruct: A Reproducible 350M Parameter LLM Trained from Scratch},
author = {Stoner, B.},
year = {2026},
url = {https://huggingface.co/sandbreak80sd/llm-350m-instruct},
note = {Training code: https://github.com/sandbreak80/llm-350m}
}