---
license: apache-2.0
language:
- en
tags:
- nsa
- sparse-attention
- 117m
datasets:
- fineweb-edu
library_name: transformers
pipeline_tag: text-generation
base_model: byte-256
---

# NSA 117M (FineWeb-Edu) — Remote Code

This repository contains a 117M-parameter NSA decoder-only model with remote code. It exposes `NSAConfig` and `NSAForCausalLM`, so you can load it via:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required to load the custom NSAConfig / NSAForCausalLM classes
m = AutoModelForCausalLM.from_pretrained("seconds-0/nsa-117m-byte", trust_remote_code=True)
t = AutoTokenizer.from_pretrained("seconds-0/nsa-117m-byte")
out = m.generate(**t("Hello", return_tensors="pt"), max_new_tokens=16)
print(t.decode(out[0]))
```

## What is NSA

Native Sparse Attention (NSA) combines three branches — compressed (cmp), selected (sel), and sliding window (win) — mixed by a learned gate. The 117M configuration uses SDPA everywhere and keeps strict causality.

Architecture (overview):

- cmp: compressed blocks (tile length l, stride d) attended with causal masks
- sel: top-n selection over blockized keys (block size l′, n ranges per step)
- win: sliding-window attention of size w
- gate: small MLP (zero-initialized last layer), softmax with temperature τ=1.0 (see the sketch after the defaults below)

Defaults: l=32, d=16, l′=64, n=16, w=512; GQA groups=2.
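
The zero-initialized last layer means the gate starts as a uniform mix over the three branches. A minimal sketch of the gating, assuming per-token mixing over already-computed branch outputs (module and argument names are illustrative, not the repository's actual classes):

```python
import torch
import torch.nn as nn

class BranchGate(nn.Module):
    """Hypothetical gate: small MLP, zero-initialized last layer, softmax(τ=1.0)."""
    def __init__(self, d_model: int, hidden: int = 64, tau: float = 1.0):
        super().__init__()
        self.tau = tau
        self.fc1 = nn.Linear(d_model, hidden)
        self.fc2 = nn.Linear(hidden, 3)      # one logit each for cmp, sel, win
        nn.init.zeros_(self.fc2.weight)      # zero-init last layer ...
        nn.init.zeros_(self.fc2.bias)        # ... so the initial mix is uniform (1/3 each)

    def forward(self, x, o_cmp, o_sel, o_win):
        # x: (batch, seq, d_model); o_*: per-branch attention outputs of the same shape
        logits = self.fc2(torch.relu(self.fc1(x))) / self.tau
        w = torch.softmax(logits, dim=-1)                    # (batch, seq, 3)
        branches = torch.stack([o_cmp, o_sel, o_win], dim=-1)
        return (branches * w.unsqueeze(-2)).sum(dim=-1)      # gated mix per token
```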

## Performance & Metrics (example targets)

- A100 40GB: ≥600 tok/s; TTFT ≤ 350 ms (batch=1, seq=128)
- RTX 4090: ≥400 tok/s; TTFT ≤ 450 ms
- CPU: ≥10 tok/s; TTFT ≤ 2.0 s
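
These are targets, not measurements. To check throughput and time-to-first-token on your own hardware, a rough timing loop like the one below works (greedy decoding, batch=1, ~128-byte prompt; results vary with hardware and PyTorch version):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "seconds-0/nsa-117m-byte"
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained(repo)
inputs = tok("a" * 128, return_tensors="pt")    # ~128 bytes with the byte-level tokenizer

with torch.no_grad():
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)  # time-to-first-token
    ttft = time.perf_counter() - start

    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128)
    elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"TTFT: {ttft * 1000:.0f} ms, throughput: {new_tokens / elapsed:.0f} tok/s")
```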

## Intended Use / Limitations

- Toy assistant and demos; not suitable for high-stakes use.

## Memory Budget (KV Cache)

- Standard LM approx (per layer): Mem ≈ t × H × (d_k + d_v) × bytes_per_elem, where t is the number of cached tokens and H the number of KV heads
- NSA decode (M0, per layer): Mem ≈ (min(w, t) + n × l′) × H × (d_k + d_v) × bytes_per_elem
- Example (w=512, n=16, l′=64): tokens_cached ≈ min(512, t) + 1024 (FP16 → a few MiB for 117M dims)
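
A quick numeric check of the budget above, assuming 12 KV heads with d_k = d_v = 64 for the 117M configuration (these dims are assumptions, not values read from the repository's config; multiply by the layer count for the whole model):

```python
def kv_bytes_per_layer(tokens_cached: int, n_heads: int = 12, d_k: int = 64,
                       d_v: int = 64, bytes_per_elem: int = 2) -> int:
    # Mem ≈ tokens_cached × H × (d_k + d_v) × bytes_per_elem (FP16 → 2 bytes/elem)
    return tokens_cached * n_heads * (d_k + d_v) * bytes_per_elem

t = 4096                           # tokens seen so far (prompt + generated)
w, n, l_sel = 512, 16, 64          # NSA defaults from this card

standard = kv_bytes_per_layer(t)                         # full KV cache
nsa = kv_bytes_per_layer(min(w, t) + n * l_sel)          # sliding window + selected blocks
print(f"standard: {standard / 2**20:.1f} MiB/layer, NSA: {nsa / 2**20:.1f} MiB/layer")
```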

## Notes

- Tokenizer: byte-level tokenizer (vocab=256). This is not GPT‑2/BPE; inputs and outputs are raw UTF‑8 bytes (a quick check follows this list).
- Generation cache: no KV cache in v1, so decoding long sequences is slower; a cache is planned as a follow-up.
- Gate: initialized to uniform mixing by design (zero-init last layer); this differs from the behaviour of a trained gate.
- Remote code uses SDPA-only paths and includes a safe fallback if NSA is forcibly disabled via an environment variable.
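
A quick way to see the byte-level behaviour (whether special tokens are added is an assumption to verify against the tokenizer config):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("seconds-0/nsa-117m-byte")
text = "héllo"                              # 6 UTF-8 bytes ("é" encodes to 2 bytes)
ids = tok(text)["input_ids"]
print(len(text.encode("utf-8")), len(ids))  # roughly one id per byte, plus any special tokens
print(tok.decode(ids))                      # decodes back to the original text
```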