NSA 117M (FineWeb-Edu) — Remote Code
This repository contains a 117M-parameter NSA decoder-only model shipped with remote code. It exposes NSAConfig and NSAForCausalLM, so you can load it via:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is required: the NSA modules ship with the repository
m = AutoModelForCausalLM.from_pretrained("seconds-0/nsa-117m-byte", trust_remote_code=True)
t = AutoTokenizer.from_pretrained("seconds-0/nsa-117m-byte")

out = m.generate(**t("Hello", return_tensors="pt"), max_new_tokens=16)
print(t.decode(out[0], skip_special_tokens=True))
```
What is NSA?
Native Sparse Attention (NSA) combines three branches — compressed (cmp), selected (sel), and sliding window (win) — mixed by a learned gate. The 117M configuration uses scaled dot-product attention (SDPA) everywhere and keeps strict causality.
Architecture (overview):
- cmp: compressed blocks (tile length l, stride d) attended with causal masks
- sel: top-n selection over blockized keys (block l′, n ranges per step)
- win: sliding window attention of size w
- gate: small MLP (zero-initialized last layer), softmax(τ=1.0); see the sketch after this list
Defaults: l=32, d=16, l′=64, n=16, w=512; GQA groups=2.
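As a rough illustration of the gating above, the PyTorch sketch below shows how a zero-initialized final layer yields uniform 1/3 mixing over the cmp/sel/win branch outputs at initialization. The module name NSAGate, the hidden width, and the GELU nonlinearity are illustrative assumptions, not the repo's actual implementation.

```python
# Hypothetical sketch of NSA branch mixing, not the repository's real module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NSAGate(nn.Module):
    def __init__(self, d_model: int, hidden: int = 64, tau: float = 1.0):
        super().__init__()
        self.tau = tau
        self.fc1 = nn.Linear(d_model, hidden)
        self.fc2 = nn.Linear(hidden, 3)   # one logit per branch: cmp, sel, win
        nn.init.zeros_(self.fc2.weight)   # zero-init last layer ...
        nn.init.zeros_(self.fc2.bias)     # ... so softmax(0) gives uniform 1/3 mixing

    def forward(self, x, o_cmp, o_sel, o_win):
        # x: (batch, seq, d_model); o_*: per-branch attention outputs of the same shape
        logits = self.fc2(F.gelu(self.fc1(x)))                 # (batch, seq, 3)
        w = F.softmax(logits / self.tau, dim=-1)               # gate weights per token
        branches = torch.stack([o_cmp, o_sel, o_win], dim=-1)  # (batch, seq, d_model, 3)
        return (branches * w.unsqueeze(-2)).sum(dim=-1)        # gated mix

x = torch.randn(1, 8, 256)
outs = [torch.randn(1, 8, 256) for _ in range(3)]
mixed = NSAGate(256)(x, *outs)  # at init this equals the mean of the three branches
```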
Performance & Metrics (example targets)
- A100 40GB: ≥600 tok/s; time to first token (TTFT) ≤ 350 ms (batch=1, seq=128)
- RTX 4090: ≥400 tok/s; TTFT ≤ 450 ms
- CPU: ≥10 tok/s; TTFT ≤ 2.0 s
Intended Use / Limitations
- Toy assistant and demos; not suitable for high-stakes use.
Memory Budget (KV Cache)
- Standard LM approx: Mem ≈ t × H × (d_k + d_v) × bytes_per_elem
- NSA decode (M0): Mem ≈ (min(w, t) + n × l′) × H × (d_k + d_v) × bytes_per_elem
- Example (w=512, n=16, l′=64): tokens_cached ≈ min(512, t) + 1024 (FP16 → a few MiB for 117M dims)
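A worked instance of the NSA decode budget above, assuming placeholder values for the KV head count and head dims (the real 117M shapes may differ); the result is per attention layer.

```python
# Worked example of the NSA decode KV budget; H, d_k, d_v are hypothetical
# illustration values, not the repository's published 117M configuration.
w, n, l_sel = 512, 16, 64          # sliding window, selected blocks, selection block size
H, d_k, d_v = 12, 64, 64           # assumed KV heads and head dims (placeholders)
bytes_per_elem = 2                 # FP16
t = 4096                           # decode position

tokens_cached = min(w, t) + n * l_sel                             # 512 + 1024 = 1536
mem_per_layer = tokens_cached * H * (d_k + d_v) * bytes_per_elem  # bytes, per layer
print(f"{mem_per_layer / 2**20:.2f} MiB per layer")               # ~4.5 MiB
```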
Notes
- Tokenizer: byte-level (vocab=256). This is not GPT-2/BPE; input and output are raw UTF-8 bytes (a minimal sketch follows these notes).
- Generation cache: no KV cache in v1, so decoding long sequences is slower; KV caching is a planned follow-up.
- Gate: initialized to uniform mixing by design (zero-init last layer); this differs from the mixing a trained gate converges to.
- Remote code uses SDPA-only paths and includes a safe fallback block if NSA is forcibly disabled via an environment variable.
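To make the byte-level tokenizer note concrete, the sketch below shows the underlying idea only (one token id per UTF-8 byte); the repo's tokenizer may add special tokens or an id offset on top of this.

```python
# Conceptual illustration of byte-level tokenization (vocab = 256):
# each UTF-8 byte of the input is its own token id.
text = "héllo"
ids = list(text.encode("utf-8"))      # [104, 195, 169, 108, 108, 111]
decoded = bytes(ids).decode("utf-8")  # round-trips back to "héllo"
print(ids, decoded)
```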