The sections below show how to use ramankrishna10/npc-nano-0.5b-base with common libraries and local serving apps.

Libraries

How to use ramankrishna10/npc-nano-0.5b-base with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ramankrishna10/npc-nano-0.5b-base")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ramankrishna10/npc-nano-0.5b-base")
model = AutoModelForCausalLM.from_pretrained("ramankrishna10/npc-nano-0.5b-base")

Local Apps

vLLM

How to use ramankrishna10/npc-nano-0.5b-base with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ramankrishna10/npc-nano-0.5b-base"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ramankrishna10/npc-nano-0.5b-base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
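
The same request can be issued from Python. The sketch below is a minimal standard-library client, assuming the vLLM server started above is listening on localhost:8000; the helper names `build_completion_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "ramankrishna10/npc-nano-0.5b-base"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model=MODEL_ID, max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text.

    Requires a running server; kwargs are forwarded to build_completion_request.
    """
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` returns the generated continuation. Passing `base_url="http://localhost:30000"` targets the SGLang server instead.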

Use Docker

docker model run hf.co/ramankrishna10/npc-nano-0.5b-base

SGLang

How to use ramankrishna10/npc-nano-0.5b-base with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ramankrishna10/npc-nano-0.5b-base" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ramankrishna10/npc-nano-0.5b-base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ramankrishna10/npc-nano-0.5b-base" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ramankrishna10/npc-nano-0.5b-base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'


NPC Nano 0.5B — Base

NPC Nano 0.5B (base) is a 502M-parameter Llama-style decoder-only language model pretrained from scratch on a curated 8.93B-token mix of web, code, math, finance, and conversational data. This release is the base checkpoint from the end of pretraining; instruction-tuned variants will follow under separate model names.

Developer: Bottensor (Rama Krishna Bachu)
Model type: Llama-architecture causal language model
Language: English
License: Apache 2.0
Technical report: forthcoming on Zenodo

Model details


Property	Value
Parameters (total)	501,531,648 (~0.5B)
Architecture	LlamaForCausalLM
Hidden size	1024
Intermediate (FFN) size	4992
Layers	24
Attention heads	16 (head_dim 64)
KV heads	16 (no GQA)
Tied input/output embeddings	yes
Vocab size	32,000
Tokenizer	BPE (HF `PreTrainedTokenizerFast`)
Context length	2,048 tokens
Positional encoding	RoPE (theta = 10,000)
Activation	SiLU (SwiGLU MLP)
Norm	RMSNorm (eps 1e-5)
Precision	bfloat16
Attention impl	FlashAttention-2
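
The total parameter count can be checked from the dimensions in the table. The sketch below assumes the standard Llama layout (bias-free q/k/v/o projections, a three-matrix SwiGLU MLP, two RMSNorm weights per layer plus a final norm); the function name `llama_param_count` is illustrative.

```python
def llama_param_count(vocab, hidden, ffn, layers, tied=True):
    """Count parameters for a Llama-style decoder without GQA or biases."""
    embed = vocab * hidden            # token embedding matrix
    attn = 4 * hidden * hidden        # q, k, v, o projections
    mlp = 3 * hidden * ffn            # gate, up, down projections (SwiGLU)
    norms = 2 * hidden                # two RMSNorm weight vectors per layer
    total = embed + layers * (attn + mlp + norms) + hidden  # + final RMSNorm
    if not tied:
        total += vocab * hidden       # separate output head when not tied
    return total


total = llama_param_count(vocab=32_000, hidden=1024, ffn=4992, layers=24)
print(f"{total:,}")  # 501,531,648
```

Because the embeddings are tied, the output head adds no parameters; untying them would add another 32,768,000.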

Training

Compute & duration

Hardware: single NVIDIA A40 (46 GB), RunPod
Effective batch size: 6 × 41 grad-accum × 2,048 seq = 503,808 tokens/step
Steps: 17,733
Tokens seen: 8,934,027,264 (~8.93B)
Model FLOPs utilization (MFU): 30.7–31.2%, stable across the run
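
The batch and token figures above are consistent with each other, as this short check shows. It reads the leading "6" as the micro-batch size in sequences, which is an interpretation of the formula rather than something the card states.

```python
micro_batch = 6        # sequences per forward/backward pass (assumed meaning of "6")
grad_accum = 41        # gradient-accumulation steps per optimizer step
seq_len = 2_048        # tokens per sequence
steps = 17_733         # optimizer steps

tokens_per_step = micro_batch * grad_accum * seq_len
total_tokens = tokens_per_step * steps

print(f"{tokens_per_step:,}")  # 503,808
print(f"{total_tokens:,}")     # 8,934,027,264
```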

Optimizer & schedule

AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8), weight decay 0.1
Gradient clipping 1.0
Peak learning rate 1.0e-3 (winner of a Phase-1 LR ablation over {3e-4, 6e-4, 1e-3})
Cosine schedule, horizon = full corpus (8.934B tokens), 1% warmup
Z-loss coefficient 1e-4
Seed 1337
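
The schedule and the z-loss term listed above can be sketched in plain Python. This is a sketch under stated assumptions, not the training code: the card does not give the final learning rate, so `MIN_LR = 0.0` and the linear warmup shape are assumptions, and `z_loss` follows the common PaLM-style definition (coefficient times the squared log-partition function) for a single position.

```python
import math

PEAK_LR = 1.0e-3
TOTAL_STEPS = 17_733
WARMUP_STEPS = round(0.01 * TOTAL_STEPS)  # 1% warmup -> 177 steps
MIN_LR = 0.0                              # assumption: not stated in the card


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR over the full run."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(1.0, progress)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def z_loss(logits, coeff=1e-4):
    """Auxiliary loss coeff * log(Z)^2 for one position, with Z = sum(exp(logits))."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return coeff * log_z ** 2
```

In training, the z-loss would be averaged over positions and added to the cross-entropy loss; it penalizes drift in the softmax normalizer.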

Data mix (natural weights, by token count)

Source	Share	Approx. tokens
FineWeb-Edu	49.0%	~4.38 B
The Stack (Python subset)	25.9%	~2.32 B
Proof-Pile-2 / OpenWebMath	15.3%	~1.37 B
SEC EDGAR (10-K / 10-Q filings)	7.8%	~696 M
UltraChat	1.9%	~170 M
Crypto whitepapers	0.07%	~6.0 M
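
As a rough cross-check of the table, the approximate token counts can be recomputed from the shares and the total tokens seen. Because the published shares are rounded, the estimates differ slightly from the listed values and the shares sum to just under 100%.

```python
TOTAL_TOKENS = 8_934_027_264  # tokens seen during pretraining

shares_percent = {
    "FineWeb-Edu": 49.0,
    "The Stack (Python subset)": 25.9,
    "Proof-Pile-2 / OpenWebMath": 15.3,
    "SEC EDGAR (10-K / 10-Q filings)": 7.8,
    "UltraChat": 1.9,
    "Crypto whitepapers": 0.07,
}

# Estimated tokens per source = share * total (rounded shares, so approximate).
estimated_tokens = {
    name: share / 100.0 * TOTAL_TOKENS for name, share in shares_percent.items()
}

for name, tokens in estimated_tokens.items():
    print(f"{name}: ~{tokens / 1e9:.3f} B")
print(f"share total: {sum(shares_percent.values()):.2f}%")
```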

A small identity-injection shard (500 curated Q: … A: … examples identifying the model as "NPC Nano") was mixed in over the final 2% of training: its sampling weight ramped from 0 to 5% between the 98% and 99% marks of training, then held at 5% for the final 1%. This gives the base model a stable self-identity without requiring SFT.
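
One way to express that sampling schedule is as a function of training progress. The sketch below reads the description as a ramp over the first half of the final 2% followed by a hold over the final 1%; the linear ramp shape and the name `identity_weight` are assumptions, since the card does not specify them.

```python
def identity_weight(progress, ramp_start=0.98, hold_start=0.99, peak=0.05):
    """Sampling weight of the identity shard at a given fraction of training.

    progress: fraction of training completed, in [0, 1].
    Returns 0 before ramp_start, a linear ramp up to `peak` until hold_start
    (assumed shape), and `peak` from hold_start to the end.
    """
    if progress < ramp_start:
        return 0.0
    if progress >= hold_start:
        return peak
    return peak * (progress - ramp_start) / (hold_start - ramp_start)


for p in (0.50, 0.98, 0.985, 0.99, 1.00):
    print(f"{p:.3f} -> {identity_weight(p):.4f}")
```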

Evaluation

The model was evaluated at the end of pretraining (checkpoint at step 17,733, 8.93B tokens seen). The full evaluation report, including methodology and per-task details, is in the training repo under reports/phase2_v2_base_eval.md.

Capability benchmarks

Task	Metric	Score
HellaSwag	acc_norm	36.82%
ARC-Easy	acc_norm	49.96%
PIQA	acc_norm	65.02%
OpenBookQA	acc_norm	30.00%
WinoGrande	acc	49.49%
GSM8K (5-shot, flex-extract)	exact_match	1.67%
GSM8K (5-shot, strict)	exact_match	0.68%

All benchmarks were run with lm-evaluation-harness 0.4.12.

Held-out perplexity

Domain	Perplexity	Tokens
SEC EDGAR	6.65	301,607 (148 docs)
Crypto whitepapers	11.35	22,752 (16 docs)

Identity smoke test (base mode, `Q: … A:` prompts)

Cohort	Pass rate
A — direct identity questions	94.0% (47 / 50)
B — sibling-model questions	4.0% (2 / 50)
C — adversarial / jailbreak	75.0% (75 / 100)

Cohort B is expected to be low in base mode because sibling-model knowledge is delivered via SFT, not pretraining.

Intended use

NPC Nano 0.5B base is intended for:

Research into small-language-model pretraining, data mixes, and identity injection
A starting point for fine-tuning — SFT, DPO/GRPO, and downstream task adapters
Benchmarking small-model capability at ~9B-token compute budgets

This is a base model, not an instruction-tuned chat model. It performs best on:

Completion-style prompts (web-text continuation, code continuation, math expressions)
Plain Q: <question>\nA: few-shot prompts
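
A few-shot prompt in this Q:/A: style can be assembled with a small helper. The helper name `build_qa_prompt` and the blank line between examples are illustrative choices; the card only specifies the `Q: <question>\nA:` shape.

```python
def build_qa_prompt(question, examples=()):
    """Build a Q:/A: prompt, optionally preceded by solved (question, answer) pairs."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")  # leave the final answer open for the model
    return "\n\n".join(parts)           # blank-line separator is an assumption


prompt = build_qa_prompt(
    "What is the capital of France?",
    examples=[("What is the capital of Italy?", "Rome")],
)
print(prompt)
```

With no examples, the helper returns the zero-shot form `Q: What is the capital of France?\nA:`.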

Out-of-scope / limitations

Not safety-tuned. No RLHF, no DPO, no refusal training. The base model can and will produce undesirable, false, biased, or harmful outputs.
Not instruction-following in the chat sense. No chat template applied during pretraining. Use Q: …\nA: prompting or fine-tune for instructions.
Short context (2,048 tokens). No long-context training; do not expect coherent generation past the context window.
English-only. The training mix is overwhelmingly English; non-English performance is not characterized.
Math is weak. GSM8K performance is at the floor for this scale; the model emits arithmetic structure but rarely the right final number.
Knowledge cutoff is bounded by the pretraining sources (FineWeb-Edu, EDGAR, etc.); the model has no knowledge of events after those snapshots.
No code execution sandboxing. Generated code should not be run without review.
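
Given the 2,048-token window, it helps to reserve room for generated tokens before calling the model. The sketch below trims a list of token ids from the left so that prompt plus new tokens fit; `fit_to_context` is a hypothetical helper, and keeping the most recent tokens is a design assumption (it can drop a leading BOS token).

```python
CONTEXT_LENGTH = 2_048  # from the model card


def fit_to_context(input_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep only the most recent prompt tokens that leave room for generation."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context length")
    return list(input_ids[-budget:])  # a shorter prompt is returned unchanged


long_prompt = list(range(3_000))           # stand-in for tokenizer output
trimmed = fit_to_context(long_prompt, 40)  # leaves room for 40 new tokens
print(len(trimmed))  # 2008
```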

Users are responsible for evaluating fitness for any downstream task and for adding appropriate safety measures.

How to use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ramankrishna10/npc-nano-0.5b-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.config.use_cache = True  # speed up generation

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Note: the released config.json carries use_cache: false (the training setting). Set model.config.use_cache = True for fast generation.

Citation

A technical report is forthcoming on Zenodo. In the meantime, please cite as:

@misc{bachu2026npcnano,
  title  = {NPC Nano 0.5B: A small language model with pretraining-time identity injection},
  author = {Bachu, Rama Krishna},
  year   = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ramankrishna10/npc-nano-0.5b-base}},
  note = {Technical report forthcoming on Zenodo}
}

License

Apache License 2.0. See LICENSE for the full text.
