NPC Nano 0.5B — Base

NPC Nano 0.5B (base) is a 502M-parameter Llama-style decoder-only language model pretrained from scratch on a curated 8.93B-token mix of web, code, math, finance, and conversational data. This release is the base checkpoint at the end of pretraining — instruction-tuned variants will follow under separate model names.

  • Developer: Bottensor (Rama Krishna Bachu)
  • Model type: Llama-architecture causal language model
  • Language: English
  • License: Apache 2.0
  • Technical report: forthcoming on Zenodo

Model details

| Property | Value |
| --- | --- |
| Parameters (total) | 501,531,648 (~0.5B) |
| Architecture | LlamaForCausalLM |
| Hidden size | 1024 |
| Intermediate (FFN) size | 4992 |
| Layers | 24 |
| Attention heads | 16 (head_dim 64) |
| KV heads | 16 (no GQA) |
| Tied input/output embeddings | yes |
| Vocab size | 32,000 |
| Tokenizer | BPE (HF PreTrainedTokenizerFast) |
| Context length | 2,048 tokens |
| Positional encoding | RoPE (theta = 10,000) |
| Activation | SiLU (SwiGLU MLP) |
| Norm | RMSNorm (eps 1e-5) |
| Precision | bfloat16 |
| Attention impl | FlashAttention-2 |
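
As a sanity check, the total parameter count can be reproduced from the dimensions listed above. This is a minimal sketch that assumes a standard bias-free Llama block (four attention projections, a three-matrix SwiGLU MLP, two RMSNorm vectors per layer, one final RMSNorm) and no learned RoPE parameters; the function name is illustrative.

```python
# Dimensions copied from the model details above.
HIDDEN = 1024
FFN = 4992
LAYERS = 24
VOCAB = 32_000


def llama_param_count(hidden, ffn, layers, vocab, tied_embeddings=True):
    """Count parameters of a bias-free Llama-style decoder without GQA."""
    attention = 4 * hidden * hidden   # q, k, v, o projections (KV heads == heads)
    mlp = 3 * hidden * ffn            # gate, up, down projections (SwiGLU)
    norms = 2 * hidden                # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = vocab * hidden       # reused as the LM head when tied
    lm_head = 0 if tied_embeddings else vocab * hidden
    final_norm = hidden
    return layers * per_layer + embeddings + lm_head + final_norm


total = llama_param_count(HIDDEN, FFN, LAYERS, VOCAB)
print(f"{total:,}")  # 501,531,648
```

With tied embeddings the 32.8M-parameter embedding matrix is counted once; untying it would add the same amount again.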

Training

Compute & duration

  • Hardware: single NVIDIA A40 (46 GB), RunPod
  • Effective batch size: 6 × 41 grad-accum × 2,048 seq = 503,808 tokens/step
  • Steps: 17,733
  • Tokens seen: 8,934,027,264 (~8.93B)
  • MFU: 30.7 – 31.2% (stable across the run)
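
The batch and token figures above follow from simple arithmetic, sketched below. The reading of "6" as the micro-batch size is an assumption, and the FLOPs estimate uses the common 6 × parameters × tokens rule of thumb, which is an approximation and not a figure reported for this run.

```python
MICRO_BATCH = 6          # assumed to be sequences per micro-batch
GRAD_ACCUM = 41          # gradient-accumulation steps
SEQ_LEN = 2_048          # tokens per sequence
STEPS = 17_733           # optimizer steps
N_PARAMS = 501_531_648   # total parameters

tokens_per_step = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN
tokens_seen = tokens_per_step * STEPS

# Rule-of-thumb training compute: ~6 FLOPs per parameter per token.
approx_train_flops = 6 * N_PARAMS * tokens_seen

print(f"{tokens_per_step:,}")        # 503,808
print(f"{tokens_seen:,}")            # 8,934,027,264
print(f"{approx_train_flops:.2e}")   # 2.69e+19
```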

Optimizer & schedule

  • AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8), weight decay 0.1
  • Gradient clipping 1.0
  • Peak learning rate 1.0e-3 (winner of a Phase-1 LR ablation over {3e-4, 6e-4, 1e-3})
  • Cosine schedule, horizon = full corpus (8.934B tokens), 1% warmup
  • Z-loss coefficient 1e-4
  • Seed 1337
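
The schedule and the z-loss term can be sketched in plain Python. Several details here are assumptions not stated above: linear warmup, decay to a floor of 0, a step-based horizon (17,733 steps is the full 8.934B-token corpus), and the usual z-loss form of coefficient × (log Z)², where Z is the softmax normalizer. The function names are illustrative.

```python
import math

PEAK_LR = 1.0e-3
TOTAL_STEPS = 17_733
WARMUP_STEPS = round(0.01 * TOTAL_STEPS)  # 1% warmup -> 177 steps
MIN_LR = 0.0  # assumption: no learning-rate floor is stated


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay over the remaining steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


def z_loss(logits, coefficient=1e-4):
    """Auxiliary penalty coefficient * (log Z)^2 for one position's logits."""
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return coefficient * log_z ** 2


for s in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(s, learning_rate(s))
```

In a training loop the z-loss term would typically be averaged over all positions and added to the cross-entropy loss.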

Data mix (natural weights, by token count)

| Source | Share | Approx. tokens |
| --- | --- | --- |
| FineWeb-Edu | 49.0% | ~4.38 B |
| The Stack (Python subset) | 25.9% | ~2.32 B |
| Proof-Pile-2 / OpenWebMath | 15.3% | ~1.37 B |
| SEC EDGAR (10-K / 10-Q filings) | 7.8% | ~696 M |
| UltraChat | 1.9% | ~170 M |
| Crypto whitepapers | 0.07% | ~6.0 M |
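
The shares can be turned back into approximate token counts, or used as weights for source-level sampling. The sketch below is illustrative only: the shares are rounded (they sum to 99.97%), so recomputed counts differ slightly from the listed ones, and drawing a source per sample with `random.choices` is an assumed mechanism, not a description of the actual data loader.

```python
import random

TOTAL_TOKENS = 8_934_027_264

# Shares in percent, as listed in the data mix.
SHARES = {
    "FineWeb-Edu": 49.0,
    "The Stack (Python subset)": 25.9,
    "Proof-Pile-2 / OpenWebMath": 15.3,
    "SEC EDGAR": 7.8,
    "UltraChat": 1.9,
    "Crypto whitepapers": 0.07,
}

approx_tokens = {name: pct / 100 * TOTAL_TOKENS for name, pct in SHARES.items()}
for name, count in approx_tokens.items():
    print(f"{name:30s} {count / 1e9:6.3f} B")

# Hypothetical source-level sampling; random.choices normalizes the weights.
rng = random.Random(1337)
names = list(SHARES)
draws = rng.choices(names, weights=[SHARES[n] for n in names], k=10_000)
print({n: draws.count(n) for n in names})
```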

A small identity-injection shard (500 curated Q: … A: … examples identifying the model as "NPC Nano") was mixed in over the final 2% of training: its sampling weight ramped from 0 to 5% between the 98% and 99% marks of training, then held at 5% for the final 1%. This gives the base model a stable self-identity without requiring SFT.
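
One reading of this schedule is a weight that stays at 0 until 98% of training, ramps to 5% by the 99% mark, and holds there until the end. The sketch below encodes that reading; the linear ramp shape and the function name are assumptions.

```python
def identity_sampling_weight(progress, ramp_start=0.98, hold_start=0.99, peak=0.05):
    """Sampling weight of the identity shard at a given training progress in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    if progress <= ramp_start:
        return 0.0
    if progress >= hold_start:
        return peak
    # Assumed linear ramp between ramp_start and hold_start.
    return peak * (progress - ramp_start) / (hold_start - ramp_start)


for p in (0.50, 0.98, 0.985, 0.99, 1.00):
    print(p, identity_sampling_weight(p))
```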


Evaluation

The model was evaluated at the end of pretraining (checkpoint at step 17,733, 8.93B tokens seen). The full evaluation report, including methodology and per-task details, is in the training repo under reports/phase2_v2_base_eval.md.

Capability benchmarks

| Task | Metric | Score |
| --- | --- | --- |
| HellaSwag | acc_norm | 36.82% |
| ARC-Easy | acc_norm | 49.96% |
| PIQA | acc_norm | 65.02% |
| OpenBookQA | acc_norm | 30.00% |
| WinoGrande | acc | 49.49% |
| GSM8K (5-shot, flex-extract) | exact_match | 1.67% |
| GSM8K (5-shot, strict) | exact_match | 0.68% |

Run via lm-evaluation-harness 0.4.12.
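
A comparable run could be launched through the harness's Python entry point, `lm_eval.simple_evaluate`. This is a sketch under assumptions, not the exact command used for the numbers above: the task names are the harness's standard ones and are assumed to correspond to the listed benchmarks, the few-shot setting for the non-GSM8K tasks is not stated here and is left at the harness default, and the batch size is arbitrary.

```python
MODEL_ID = "ramankrishna10/npc-nano-0.5b-base"

# Standard harness task names, assumed to match the benchmarks listed above.
TASKS = ["hellaswag", "arc_easy", "piqa", "openbookqa", "winogrande"]


def run_benchmarks(tasks=TASKS, num_fewshot=None, batch_size=8):
    """Evaluate the checkpoint with lm-evaluation-harness and return per-task metrics."""
    # Imported lazily so this module loads even when the harness is not installed.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={MODEL_ID},dtype=bfloat16",
        tasks=list(tasks),
        num_fewshot=num_fewshot,  # None keeps each task's default
        batch_size=batch_size,
    )
    return results["results"]


if __name__ == "__main__":
    print(run_benchmarks())                                  # multiple-choice tasks
    print(run_benchmarks(tasks=["gsm8k"], num_fewshot=5))    # GSM8K, 5-shot
```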

Held-out perplexity

| Domain | Perplexity | Tokens |
| --- | --- | --- |
| SEC EDGAR | 6.65 | 301,607 (148 docs) |
| Crypto whitepapers | 11.35 | 22,752 (16 docs) |
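
Perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below assumes a token-weighted aggregate over documents (sum the NLL across all documents, divide by the total token count); whether the reported figures pool tokens this way or average per document is not stated here. The inputs in the example are synthetic.

```python
import math


def corpus_perplexity(doc_nll_sums, doc_token_counts):
    """Token-weighted perplexity from per-document summed NLL (in nats) and token counts."""
    total_nll = sum(doc_nll_sums)
    total_tokens = sum(doc_token_counts)
    if total_tokens == 0:
        raise ValueError("no tokens to evaluate")
    return math.exp(total_nll / total_tokens)


# Synthetic check: a model that is uniform over 4 outcomes has perplexity 4.
counts = [10, 30]
nlls = [n * math.log(4) for n in counts]
ppl = corpus_perplexity(nlls, counts)
print(ppl)
```

Read in the other direction, a perplexity of 6.65 corresponds to a mean loss of about 1.89 nats per token, and 11.35 to about 2.43.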

Identity smoke test (base mode, Q: … A: prompts)

| Cohort | Pass rate |
| --- | --- |
| A: direct identity questions | 94.0% (47 / 50) |
| B: sibling-model questions | 4.0% (2 / 50) |
| C: adversarial / jailbreak | 75.0% (75 / 100) |

Cohort B is expected to be low in base mode — sibling-model knowledge is delivered via SFT, not pretraining.


Intended use

NPC Nano 0.5B base is intended for:

  • Research into small-language-model pretraining, data mixes, and identity injection
  • A starting point for fine-tuning — SFT, DPO/GRPO, and downstream task adapters
  • Benchmarking small-model capability at ~9B-token compute budgets

This is a base model, not an instruction-tuned chat model. It performs best on:

  • Completion-style prompts (web-text continuation, code continuation, math expressions)
  • Plain Q: <question>\nA: few-shot prompts
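
A few-shot prompt in this format can be assembled with a small helper. The blank line between examples and the helper name are assumptions; only the Q: / A: line structure comes from the description above.

```python
def build_qa_prompt(question, examples=()):
    """Build a 'Q: ...\\nA: ...' few-shot prompt that ends with an open 'A:'."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)  # assumed separator between examples


prompt = build_qa_prompt(
    "What is the capital of France?",
    examples=[("What is the capital of Italy?", "Rome")],
)
print(prompt)
```

Ending the prompt on a bare "A:" leaves the model to continue with the answer, which suits a base model better than an instruction.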

Out-of-scope / limitations

  • Not safety-tuned. No RLHF, no DPO, no refusal training. The base model can and will produce undesirable, false, biased, or harmful outputs.
  • Not instruction-following in the chat sense. No chat template applied during pretraining. Use Q: …\nA: prompting or fine-tune for instructions.
  • Short context (2,048 tokens). No long-context training; do not expect coherent generation past the context window.
  • English-only. The training mix is overwhelmingly English; non-English performance is not characterized.
  • Math is weak. GSM8K performance is at the floor for this scale; the model emits arithmetic structure but rarely the right final number.
  • Knowledge cutoff is bounded by the pretraining sources (FineWeb-Edu, EDGAR, etc.); the model has no knowledge of events after those snapshots.
  • No code execution sandboxing. Generated code should not be run without review.

Users are responsible for evaluating fitness for any downstream task and for adding appropriate safety measures.
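
Because the context window is 2,048 tokens, prompt length plus generated tokens should stay within that budget. A minimal sketch of one way to enforce it, assuming the most recent tokens matter most (left truncation); the helper name is illustrative, and truncating a few-shot prompt this way can cut an example in half.

```python
CONTEXT_LENGTH = 2_048


def fit_to_context(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep only the last tokens that leave room for max_new_tokens of generation."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return list(token_ids[-budget:])


long_prompt = list(range(3_000))           # stand-in for real token ids
trimmed = fit_to_context(long_prompt, max_new_tokens=40)
print(len(trimmed))  # 2008
```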


How to use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ramankrishna10/npc-nano-0.5b-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.config.use_cache = True  # speed up generation

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Note: the released config.json carries use_cache: false (the training setting). Set model.config.use_cache = True for fast generation.
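
A base model given a Q: … A: prompt will often keep going and write further question/answer pairs after the first answer. The sketch below trims a decoded completion to the first answer; the stop markers and helper name are assumptions, and the decoded string in the example is hypothetical, not real model output.

```python
def extract_answer(decoded, prompt, stop_markers=("\nQ:", "\n\n")):
    """Return the first answer from decoded text that begins with the prompt."""
    # Drop the echoed prompt when the decoded text starts with it.
    completion = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    for marker in stop_markers:
        cut = completion.find(marker)
        if cut != -1:
            completion = completion[:cut]
    return completion.strip()


prompt = "Q: What is the capital of France?\nA:"
decoded = prompt + " Paris\nQ: What is the capital of Spain?\nA: Madrid"  # hypothetical
print(extract_answer(decoded, prompt))  # Paris
```

If the decoded text does not reproduce the prompt exactly, slicing the generated ids instead (`out[0][inputs["input_ids"].shape[1]:]`) before decoding avoids relying on the string prefix.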


Citation

A technical report is forthcoming on Zenodo. In the meantime, please cite as:

@misc{bachu2026npcnano,
  title  = {NPC Nano 0.5B: A small language model with pretraining-time identity injection},
  author = {Bachu, Rama Krishna},
  year   = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ramankrishna10/npc-nano-0.5b-base}},
  note = {Technical report forthcoming on Zenodo}
}

License

Apache License 2.0. See LICENSE for the full text.
