NPC Nano 0.5B — Base
NPC Nano 0.5B (base) is a 502M-parameter Llama-style decoder-only language model pretrained from scratch on a curated 8.93B-token mix of web, code, math, finance, and conversational data. This release is the base checkpoint at the end of pretraining — instruction-tuned variants will follow under separate model names.
- Developer: Bottensor (Rama Krishna Bachu)
- Model type: Llama-architecture causal language model
- Language: English
- License: Apache 2.0
- Technical report: forthcoming on Zenodo
Model details
| Property | Value |
|---|---|
| Parameters (total) | 501,531,648 (~0.5B) |
| Architecture | LlamaForCausalLM |
| Hidden size | 1024 |
| Intermediate (FFN) size | 4992 |
| Layers | 24 |
| Attention heads | 16 (head_dim 64) |
| KV heads | 16 (no GQA) |
| Tied input/output embeddings | yes |
| Vocab size | 32,000 |
| Tokenizer | BPE (HF PreTrainedTokenizerFast) |
| Context length | 2,048 tokens |
| Positional encoding | RoPE (theta = 10,000) |
| Activation | SiLU (SwiGLU MLP) |
| Norm | RMSNorm (eps 1e-5) |
| Precision | bfloat16 |
| Attention impl | FlashAttention-2 |
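
The parameter total can be cross-checked from the dimensions in the table. The following is a minimal sketch, assuming the usual bias-free Llama layout (four square attention projections, three SwiGLU matrices, and two RMSNorm weight vectors per layer, plus one final RMSNorm); the function name `llama_param_count` is illustrative and not part of any library.

```python
def llama_param_count(vocab, hidden, ffn, layers, tied=True):
    """Count parameters of a bias-free Llama-style decoder (assumed layout)."""
    embed = vocab * hidden            # token embedding matrix
    attn = 4 * hidden * hidden        # q, k, v, o projections (KV heads == heads, no GQA)
    mlp = 3 * hidden * ffn            # gate, up, and down projections (SwiGLU)
    norms = 2 * hidden                # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    final_norm = hidden               # RMSNorm before the output head
    lm_head = 0 if tied else vocab * hidden  # tied embeddings add no extra weights
    return embed + layers * per_layer + final_norm + lm_head


total = llama_param_count(vocab=32_000, hidden=1024, ffn=4992, layers=24)
print(f"{total:,}")
```

Under these assumptions the count comes to 501,531,648, matching the table.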
Training
Compute & duration
- Hardware: single NVIDIA A40 (46 GB), RunPod
- Effective batch size: 6 × 41 grad-accum × 2,048 seq = 503,808 tokens/step
- Steps: 17,733
- Tokens seen: 8,934,027,264 (~8.93B)
- MFU: 30.7 – 31.2% (stable across the run)
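
The batch and token figures above are consistent with each other, as the short check below shows. It assumes that the leading `6` is the per-step micro-batch size; the variable names are illustrative.

```python
micro_batch = 6      # assumed to be the micro-batch size (sequences per forward pass)
grad_accum = 41      # gradient-accumulation steps per optimizer step
seq_len = 2_048      # tokens per sequence
steps = 17_733       # optimizer steps in the run

tokens_per_step = micro_batch * grad_accum * seq_len
tokens_seen = tokens_per_step * steps

print(f"{tokens_per_step:,} tokens/step")
print(f"{tokens_seen:,} tokens seen")
```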
Optimizer & schedule
- AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8), weight decay 0.1
- Gradient clipping 1.0
- Peak learning rate 1.0e-3 (winner of a Phase-1 LR ablation over {3e-4, 6e-4, 1e-3})
- Cosine schedule, horizon = full corpus (8.934B tokens), 1% warmup
- Z-loss coefficient 1e-4
- Seed 1337
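
The learning-rate schedule described above can be sketched as follows. This is an illustration rather than the training code: linear warmup and a final learning rate of zero are assumptions (the card gives only the peak, the warmup fraction, and the cosine shape), and `lr_at` is a hypothetical helper.

```python
import math

PEAK_LR = 1.0e-3
TOTAL_STEPS = 17_733
WARMUP_STEPS = round(0.01 * TOTAL_STEPS)  # 1% warmup, about 177 steps
MIN_LR = 0.0  # assumption: the final learning rate is not stated in the card


def lr_at(step):
    """Learning rate at a given optimizer step (linear warmup, then cosine decay)."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup that reaches the peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for s in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(s, f"{lr_at(s):.6f}")
```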
Data mix (natural weights, by token count)
| Source | Share | Approx. tokens |
|---|---|---|
| FineWeb-Edu | 49.0% | ~4.38 B |
| The Stack (Python subset) | 25.9% | ~2.32 B |
| Proof-Pile-2 / OpenWebMath | 15.3% | ~1.37 B |
| SEC EDGAR (10-K / 10-Q filings) | 7.8% | ~696 M |
| UltraChat | 1.9% | ~170 M |
| Crypto whitepapers | 0.07% | ~6.0 M |
A small identity-injection shard (500 curated Q: … A: … examples identifying the model as "NPC Nano") was mixed in over the final 2% of training. Its sampling weight ramps from 0 to 5% within the last 2% of training and holds at 5% for the last 1%. This gives the base model a stable self-identity without requiring SFT.
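
One way to read the ramp is as a piecewise-linear function of training progress. The sketch below assumes the ramp is linear and runs from 98% to 99% of training, followed by a hold until the end; the card does not spell out the exact shape, and `identity_weight` is a hypothetical name.

```python
def identity_weight(progress, peak=0.05, ramp_start=0.98, hold_start=0.99):
    """Sampling weight of the identity shard at a training progress in [0, 1].

    Assumed shape: zero before ramp_start, linear ramp up to hold_start,
    then constant at peak until the end of training.
    """
    if progress < ramp_start:
        return 0.0
    if progress >= hold_start:
        return peak
    return peak * (progress - ramp_start) / (hold_start - ramp_start)


for p in (0.50, 0.98, 0.985, 0.99, 1.0):
    print(p, round(identity_weight(p), 4))
```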
Evaluation
Evaluated at the end of pretraining (checkpoint at step 17,733, 8.93B tokens
seen). Full evaluation report, including methodology and per-task details, is in
the training repo under reports/phase2_v2_base_eval.md.
Capability benchmarks
| Task | Metric | Score |
|---|---|---|
| HellaSwag | acc_norm | 36.82% |
| ARC-Easy | acc_norm | 49.96% |
| PIQA | acc_norm | 65.02% |
| OpenBookQA | acc_norm | 30.00% |
| WinoGrande | acc | 49.49% |
| GSM8K (5-shot, flex-extract) | exact_match | 1.67% |
| GSM8K (5-shot, strict) | exact_match | 0.68% |
Run via lm-evaluation-harness 0.4.12.
Held-out perplexity
| Domain | Perplexity | Tokens |
|---|---|---|
| SEC EDGAR | 6.65 | 301,607 (148 docs) |
| Crypto whitepapers | 11.35 | 22,752 (16 docs) |
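
Perplexity here can be understood as the exponential of the mean per-token negative log-likelihood. The sketch below shows that relationship only; how the evaluation aggregates across documents is not stated in this card, and the sample values are made up for illustration.

```python
import math


def perplexity(token_nlls):
    """Return exp(mean negative log-likelihood), with NLLs in nats per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical per-token NLLs, not taken from the evaluation report.
example_nlls = [1.2, 2.5, 1.9, 2.0]
print(round(perplexity(example_nlls), 2))
```

By this definition, the SEC EDGAR perplexity of 6.65 corresponds to a mean loss of about 1.89 nats per token.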
Identity smoke test (base mode, Q: … A: prompts)
| Cohort | Pass rate |
|---|---|
| A — direct identity questions | 94.0% (47 / 50) |
| B — sibling-model questions | 4.0% (2 / 50) |
| C — adversarial / jailbreak | 75.0% (75 / 100) |
Cohort B is expected to be low in base mode — sibling-model knowledge is delivered via SFT, not pretraining.
Intended use
NPC Nano 0.5B base is intended for:
- Research into small-language-model pretraining, data mixes, and identity injection
- A starting point for fine-tuning — SFT, DPO/GRPO, and downstream task adapters
- Benchmarking small-model capability at ~9B-token compute budgets
This is a base model, not an instruction-tuned chat model. It performs best on:
- Completion-style prompts (web-text continuation, code continuation, math expressions)
- Plain `Q: <question>\nA:` few-shot prompts
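
A few-shot prompt in this format can be assembled with a small helper. This is a sketch: `build_qa_prompt` is a hypothetical function, the blank line between examples is an assumption, and the sample questions are illustrative.

```python
def build_qa_prompt(examples, question):
    """Format (question, answer) pairs, then leave the final answer open."""
    shots = [f"Q: {q}\nA: {a}" for q, a in examples]
    shots.append(f"Q: {question}\nA:")
    # Assumption: examples are separated by a blank line.
    return "\n\n".join(shots)


prompt = build_qa_prompt(
    [("What is the capital of France?", "Paris"), ("What is 2 + 2?", "4")],
    "What is the capital of Italy?",
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of a single-question prompt.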
Out-of-scope / limitations
- Not safety-tuned. No RLHF, no DPO, no refusal training. The base model can and will produce undesirable, false, biased, or harmful outputs.
- Not instruction-following in the chat sense. No chat template applied during pretraining. Use `Q: …\nA:` prompting or fine-tune for instructions.
- Short context (2,048 tokens). No long-context training; do not expect coherent generation past the context window.
- English-only. The training mix is overwhelmingly English; non-English performance is not characterized.
- Math is weak. GSM8K performance is at the floor for this scale; the model emits arithmetic structure but rarely the right final number.
- Knowledge cutoff is bounded by the pretraining sources (FineWeb-Edu, EDGAR, etc.); the model has no knowledge of events after those snapshots.
- No code execution sandboxing. Generated code should not be run without review.
Users are responsible for evaluating fitness for any downstream task and for adding appropriate safety measures.
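
Because of the 2,048-token window, it can help to trim long prompts so that the prompt and the generated tokens fit together. The sketch below works on plain lists of token IDs. Keeping the most recent tokens is a design choice assumed here, and `fit_to_context` is a hypothetical helper rather than a library function.

```python
CONTEXT_LENGTH = 2_048  # context length from the model details table


def fit_to_context(prompt_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    if max_new_tokens >= context_length:
        raise ValueError("max_new_tokens must be smaller than the context length")
    budget = context_length - max_new_tokens  # tokens available for the prompt
    return prompt_ids[-budget:]


long_prompt = list(range(3_000))  # stand-in for tokenizer output
trimmed = fit_to_context(long_prompt, max_new_tokens=40)
print(len(trimmed))
```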
How to use
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "ramankrishna10/npc-nano-0.5b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.config.use_cache = True # speed up generation
prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Note: the released `config.json` carries `use_cache: false` (the training setting). Set `model.config.use_cache = True` for fast generation.
Citation
A technical report is forthcoming on Zenodo. In the meantime, please cite as:
@misc{bachu2026npcnano,
title = {NPC Nano 0.5B: A small language model with pretraining-time identity injection},
author = {Bachu, Rama Krishna},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/ramankrishna10/npc-nano-0.5b-base}},
note = {Technical report forthcoming on Zenodo}
}
License
Apache License 2.0. See LICENSE for the full text.