NPC Fast 1.7B

Fast agentic router — a 1.7B parameter model that decides whether to handle a request itself or forward it to a larger partner model (NPC Fin 32B).

Trained on top of HuggingFaceTB/SmolLM2-1.7B-Instruct by Bottensor (a Falcon Hash company).

What it is

A small, fast router + agentic model. For every user request it emits:

{"route": "self" | "npc_fin", "reason": "<short>"}
  • self — it handles the task directly (lookup, format conversion, short code, tool calls with obvious args, identity, translation, chit-chat)
  • npc_fin — it forwards to a 32B finance-specialist model (deep multi-step financial reasoning, valuation, derivatives math, long-document synthesis)
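
A minimal sketch of how a caller might dispatch on that decision (the run_self / run_npc_fin callables are hypothetical stand-ins for local generation and a call to the 32B model; they are not part of this repo):

import json
from typing import Callable

def dispatch(router_output: str, user_request: str,
             run_self: Callable[[str], str],
             run_npc_fin: Callable[[str], str]) -> str:
    """Parse the router's JSON decision and forward the request to one backend."""
    decision = json.loads(router_output)        # e.g. {"route": "self", "reason": "..."}
    if decision.get("route") == "npc_fin":
        return run_npc_fin(user_request)        # escalate to the finance specialist
    return run_self(user_request)               # handle the request locally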

Training recipe

  1. Full-weight continual pre-training on top of SmolLM2-1.7B-Instruct

    • 2,825 global steps, bf16, flash-attention-2, gradient checkpointing
    • 5-stage curriculum planned (4K → 16K → 32K → 64K → 64K); actual training stopped after stage 2 (16K)
    • Data: ~60K examples (agentic traces, function calling, tool use, reasoning)
    • Liger fused kernels (fused linear CE + fused RMSNorm + fused SwiGLU + RoPE)
    • YaRN RoPE scaling configured for 128K (factor 16) — but not validated past 16K, see limitations below
  2. Router LoRA fine-tune (rank-32, 3 epochs, 189 steps, final loss 0.001; a configuration sketch follows this list)

    • 500 router pairs (300 self + 200 npc_fin)
    • Merged back into the base weights — this repo is the merged bf16 checkpoint
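
A minimal sketch of what the stage-2 adapter configuration could look like with peft; only the rank-32 setting comes from the recipe above, while the alpha, dropout, and target modules are illustrative assumptions:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
lora_cfg = LoraConfig(
    r=32,                        # rank-32, as stated above
    lora_alpha=64,               # assumption: 2x rank
    lora_dropout=0.05,           # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
router_model = get_peft_model(base, lora_cfg)
# After training on the 500 router pairs, the adapter is merged back into the bf16
# weights (e.g. router_model.merge_and_unload()), which is what this repo ships.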

Evaluation

All evaluations below were run against the merged checkpoint at 16K context:

| Benchmark | Metric | Result |
|---|---|---|
| BFCL (tool calling, n=20) | JSON / name / args accuracy | 100% / 100% / 100% |
| IFEval (n=200, 18 checkable) | instruction pass rate | 77.8% |
| Agentic tool selection (n=100) | JSON valid / tool accuracy | 100% / 57% |
| Router — in-distribution (n=200) | accuracy | 100% (see note) |
| Router — out-of-distribution (n=60) | accuracy | 98.3% |
| Router — OOD escalation | recall / precision | 100% / 100% |
| Needle-in-Haystack @ 16K | pass rate (1 of 5 depths) | 20% |
| Needle-in-Haystack @ 32K+ | pass rate | 0% (see limitations) |

In-distribution router eval uses the same seed query pool as the training set, so the 100% number measures format fidelity, not generalization. The OOD eval uses 60 genuinely novel queries — that 98.3% is the honest router number. The single OOD error was a JSON formatting glitch; the routing decision was correct.
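
When reproducing the router numbers, it helps to score JSON validity and routing correctness separately, so a formatting glitch like the single OOD error is not counted as a wrong routing decision. A minimal sketch (the regex fallback is an assumption about how such glitches can be recovered):

import json
import re

def score_route(model_output: str, expected_route: str) -> dict:
    """Report JSON validity and routing correctness as separate signals."""
    json_ok, route = True, None
    try:
        route = json.loads(model_output).get("route")
    except json.JSONDecodeError:
        json_ok = False
        # Fallback: pull the route out of malformed JSON.
        m = re.search(r'"route"\s*:\s*"(self|npc_fin)"', model_output)
        route = m.group(1) if m else None
    return {"json_valid": json_ok, "route_correct": route == expected_route}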

Intended use

  • Agentic routing — deciding between self-handling and escalation
  • Light tool-calling and function-calling tasks
  • Short-context (≤16K) instruction following
  • Drop-in replacement for SmolLM2 in systems that want a router-fine-tuned head

Limitations and honest disclosures

  • Context is 16K in practice. The config advertises 128K via YaRN scaling, but training stopped after the 16K curriculum stage. Needle-in-haystack at 32K/64K/128K produces degenerate output (repetitive tokens). Use at your own risk past 16K.
  • The router was trained on a small synthetic dataset (500 pairs). The OOD eval is strong, but data diversity is limited; expect edge cases outside the finance-vs-general split.
  • No RLHF / DPO. This is pure continual pre-training plus supervised fine-tuning. Refusal behavior is inherited from the base SmolLM2-Instruct.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ramankrishna10/npc-fast-1.7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("ramankrishna10/npc-fast-1.7b")

SYSTEM = (
  "You are NPC Fast, a capable 1.7B model. Handle most requests yourself. "
  "Only forward to the larger NPC Fin 32B model when a task truly requires "
  "deep multi-step financial analysis that you cannot do well alone.\n\n"
  "Default: route=self.\n"
  "Escalate to npc_fin ONLY if ALL of these are true:\n"
  "  - the task is about finance, markets, banking, derivatives, or valuation\n"
  "  - it requires multi-step quantitative reasoning or deep domain knowledge\n"
  "  - a short answer would be wrong or superficial\n\n"
  "Output exactly one JSON object with fields route and reason."
)

messages = [
  {"role": "system", "content": SYSTEM},
  {"role": "user", "content": "Build a DCF for TSLA with 3 scenarios."},
]
enc = tok.apply_chat_template(messages, tokenize=True, return_tensors="pt",
                               add_generation_prompt=True).to(model.device)
out = model.generate(enc, max_new_tokens=60, do_sample=False)
print(tok.decode(out[0][enc.shape[-1]:], skip_special_tokens=True))
# → {"route": "npc_fin", "reason": "multi-step finance model"}

Runtime 4-bit quantization (bitsandbytes)

from transformers import BitsAndBytesConfig
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "ramankrishna10/npc-fast-1.7b", quantization_config=bnb, device_map="auto",
)

GGUF / llama.cpp

See companion repo: ramankrishna10/npc-fast-1.7b-gguf

Credits

  • Built by Bottensor (a Falcon Hash company), creator: dude.npc
  • Base model: HuggingFaceTB/SmolLM2-1.7B-Instruct
  • Training framework: custom trainer wrapping HF Trainer + Liger-Kernel + FlashAttention-2 + YaRN RoPE scaling

Citation

If you use this model or build on its training recipe, please cite the accompanying preprint:

Bachu, R. K. (2026). NPC Fast 1.7B: Building a Usable Small Model on a Single H100. Zenodo. https://doi.org/10.5281/zenodo.19771040

@misc{bachu2026npcfast,
  title     = {NPC Fast 1.7B: Building a Usable Small Model on a Single H100},
  author    = {Bachu, Rama Krishna},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19771040},
  url       = {https://doi.org/10.5281/zenodo.19771040},
  note      = {Preprint},
}