---
license: apache-2.0
language:
- en
tags:
- nsa
- sparse-attention
- 117m
datasets:
- fineweb-edu
library_name: transformers
pipeline_tag: text-generation
base_model: byte-256
---

# NSA 117M (FineWeb-Edu) — Remote Code

This repository contains a 117M-parameter NSA decoder-only model with remote code. It exposes `NSAConfig` and `NSAForCausalLM`, so you can load it via:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required to load the custom NSAConfig / NSAForCausalLM classes
m = AutoModelForCausalLM.from_pretrained("seconds-0/nsa-117m-byte", trust_remote_code=True)
t = AutoTokenizer.from_pretrained("seconds-0/nsa-117m-byte")
out = m.generate(**t("Hello", return_tensors="pt"), max_new_tokens=16)
print(t.decode(out[0]))
```

## What is NSA

Native Sparse Attention (NSA) combines three branches — compressed (cmp), selected (sel), and sliding window (win) — mixed by a learned gate. The 117M configuration uses SDPA everywhere and keeps strict causality.

Architecture (overview):

- cmp: compressed blocks (tile length l, stride d) attended with causal masks
- sel: top-n selection over blockized keys (block size l′, n ranges per step)
- win: sliding-window attention of size w
- gate: small MLP (zero-initialized last layer), softmax with temperature τ=1.0 (see the sketch after the defaults below)

Defaults: l=32, d=16, l′=64, n=16, w=512; GQA groups=2.
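
The zero-initialized last layer means the gate starts as a uniform mix over the three branches. A minimal sketch of the gating, assuming per-token mixing over already-computed branch outputs (module and argument names are illustrative, not the repository's actual classes):

```python
import torch
import torch.nn as nn

class BranchGate(nn.Module):
    """Hypothetical gate: small MLP, zero-initialized last layer, softmax(τ=1.0)."""
    def __init__(self, d_model: int, hidden: int = 64, tau: float = 1.0):
        super().__init__()
        self.tau = tau
        self.fc1 = nn.Linear(d_model, hidden)
        self.fc2 = nn.Linear(hidden, 3)      # one logit each for cmp, sel, win
        nn.init.zeros_(self.fc2.weight)      # zero-init last layer ...
        nn.init.zeros_(self.fc2.bias)        # ... so the initial mix is uniform (1/3 each)

    def forward(self, x, o_cmp, o_sel, o_win):
        # x: (batch, seq, d_model); o_*: per-branch attention outputs of the same shape
        logits = self.fc2(torch.relu(self.fc1(x))) / self.tau
        w = torch.softmax(logits, dim=-1)                    # (batch, seq, 3)
        branches = torch.stack([o_cmp, o_sel, o_win], dim=-1)
        return (branches * w.unsqueeze(-2)).sum(dim=-1)      # gated mix per token
```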

## Performance & Metrics (example targets)

- A100 40GB: ≥600 tok/s; TTFT ≤ 350 ms (batch=1, seq=128)
- RTX 4090: ≥400 tok/s; TTFT ≤ 450 ms
- CPU: ≥10 tok/s; TTFT ≤ 2.0 s
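
These are targets, not measurements. To check throughput and time-to-first-token on your own hardware, a rough timing loop like the one below works (greedy decoding, batch=1, ~128-byte prompt; results vary with hardware and PyTorch version):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "seconds-0/nsa-117m-byte"
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained(repo)
inputs = tok("a" * 128, return_tensors="pt")    # ~128 bytes with the byte-level tokenizer

with torch.no_grad():
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)  # time-to-first-token
    ttft = time.perf_counter() - start

    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128)
    elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"TTFT: {ttft * 1000:.0f} ms, throughput: {new_tokens / elapsed:.0f} tok/s")
```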

## Intended Use / Limitations

- Toy assistant and demos; not suitable for high-stakes use.

## Memory Budget (KV Cache)

- Standard LM approx (per layer): Mem ≈ t × H × (d_k + d_v) × bytes_per_elem, where t is the number of cached tokens and H the number of KV heads
- NSA decode (M0, per layer): Mem ≈ (min(w, t) + n × l′) × H × (d_k + d_v) × bytes_per_elem
- Example (w=512, n=16, l′=64): tokens_cached ≈ min(512, t) + 1024 (FP16 → a few MiB for 117M dims)
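
A quick numeric check of the budget above, assuming 12 KV heads with d_k = d_v = 64 for the 117M configuration (these dims are assumptions, not values read from the repository's config; multiply by the layer count for the whole model):

```python
def kv_bytes_per_layer(tokens_cached: int, n_heads: int = 12, d_k: int = 64,
                       d_v: int = 64, bytes_per_elem: int = 2) -> int:
    # Mem ≈ tokens_cached × H × (d_k + d_v) × bytes_per_elem (FP16 → 2 bytes/elem)
    return tokens_cached * n_heads * (d_k + d_v) * bytes_per_elem

t = 4096                           # tokens seen so far (prompt + generated)
w, n, l_sel = 512, 16, 64          # NSA defaults from this card

standard = kv_bytes_per_layer(t)                         # full KV cache
nsa = kv_bytes_per_layer(min(w, t) + n * l_sel)          # sliding window + selected blocks
print(f"standard: {standard / 2**20:.1f} MiB/layer, NSA: {nsa / 2**20:.1f} MiB/layer")
```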

## Notes

- Tokenizer: byte-level tokenizer (vocab=256). This is not GPT‑2/BPE; inputs and outputs are raw UTF‑8 bytes (a quick check follows this list).
- Generation cache: no KV cache in v1, so decoding long sequences is slower; a cache is planned as a follow-up.
- Gate: initialized to uniform mixing by design (zero-init last layer); this differs from the behaviour of a trained gate.
- Remote code uses SDPA-only paths and includes a safe fallback if NSA is forcibly disabled via an environment variable.
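
A quick way to see the byte-level behaviour (whether special tokens are added is an assumption to verify against the tokenizer config):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("seconds-0/nsa-117m-byte")
text = "héllo"                              # 6 UTF-8 bytes ("é" encodes to 2 bytes)
ids = tok(text)["input_ids"]
print(len(text.encode("utf-8")), len(ids))  # roughly one id per byte, plus any special tokens
print(tok.decode(ids))                      # decodes back to the original text
```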