ternlight

A 1.58-bit BitNet-style sentence embedding model distilled from sentence-transformers/all-MiniLM-L6-v2 via quantization-aware training, with post-training int4 quantization at the embedding layer. The shipped binary is 4.6 MB; the full WASM bundle (engine + tokenizer + model) is 7 MB and runs on CPU in ~2 ms per call.

ternlight is designed for short-string semantic similarity — search queries, intent classification, FAQ matching, product cards — deployed on-device (browser, Node, edge runtimes, ARM single-board computers). It is not a frontier model; it trades absolute quality for size and on-device deployability.

Model variants

File	Bin size	Spearman vs teacher	Quality retained vs fp32 student
`model-int4.bin` ⭐	4.6 MB	0.835	95%
`model-embedding-int8.bin`	8.3 MB	0.841	95%
`model-ternary.bin`	2.9 MB	0.710	80%

model-int4.bin is the shipped default. int8 offers a slight quality bump at ~1.8× the size. ternary is the size-extreme variant — useful when bytes are at absolute premium and you can tolerate the ~15 pt drop in pair-ranking quality.

All variants share the same architecture and tokenizer.

How to use

ternlight runs via a custom Rust→WASM inference engine, not via the transformers library. Two paths:

Path 1 — via the `ternlight` npm package (recommended)

npm install ternlight

import { embed, cosineSim, similar } from 'ternlight';

const v1 = embed("arctic terns migrate from pole to pole");
const v2 = embed("longest migration in the animal kingdom");

cosineSim(v1, v2);   // ~0.71 — semantically related, different wording

// Nearest-neighbor search over a corpus
const matches = similar("which seabird travels farthest", corpus, { topK: 5 });

The model and tokenizer are bundled into the npm package — no separate download.

Path 2 — direct download

from huggingface_hub import hf_hub_download

model_bin = hf_hub_download(repo_id="wenshutang/ternlight", filename="model-int4.bin")
tokenizer = hf_hub_download(repo_id="wenshutang/ternlight", filename="tokenizer.json")

The .bin files are a custom BitNet b1.58 format. See the engine source for the binary layout and reference forward pass if you want to implement a custom loader (e.g., in another language or runtime).

Model details

Property	Value
Architecture	2-layer Transformer encoder
Parameters	~9.5M
Output dimension	384 (L2-normalized)
Max input	128 tokens (~95 English words; longer inputs are silently truncated)
d_model	256
Attention heads	4
FFN dim	1024
Vocabulary	30,522 (BERT WordPiece, identical to teacher)
Linear weights	Ternary `{-1, 0, +1}` + per-matrix fp32 scale
Embedding weights (int4 variant)	4-bit per-row PTQ + per-row fp32 scale
Embedding weights (int8 variant)	8-bit per-row PTQ + per-row fp32 scale
Embedding weights (ternary variant)	Ternary, same scheme as linear weights

Training

Distilled from sentence-transformers/all-MiniLM-L6-v2 in three stages:

Distillation objective — MSE loss between student and teacher 384-dim embeddings, plus an optional contrastive term.
BitNet b1.58 quantization-aware training — all linear layers use ternary weights trained end-to-end with the straight-through estimator. Training the model with the quantization constraint from the start (rather than quantizing post-hoc) preserves ~95% of the fp32 student's pair-ranking quality.
Post-training int4 quantization (PTQ) — applied to the token embedding table after QAT completes. The embedding table dominates parameter count, so compressing it aggressively gives the largest size win for the smallest quality cost.

Training data: ~1M sentences from MS MARCO and general web text. English-only.

Provenance (`model-int4.bin`)


Training run	`qat-resume-ep10-ep40`
Source checkpoint	`checkpoint_ep40.pt`
Source code commit	`dff16b1`
Packed at	2026-06-03
SHA-256	`07d8cf...e5b6c98`

Each .bin ships with a .bin.json sidecar containing the full provenance for reproducibility checks.

Evaluation

Spearman rank correlation vs teacher

Held-out MS MARCO queries, 1,000 deterministic random pairs, seed=42. Spearman of 1.0 = the candidate ranks pair similarities identically to the teacher.

Variant	Bin size	Bits/param	Spearman	Pearson
MiniLM-L6 (teacher)	90.9 MB	32.00	1.000	1.000
Student fp32 (pre-QAT)	38.0 MB	32.00	0.883	0.907
ternlight `emb_int8`	8.3 MB	7.37	0.841	0.872
ternlight `emb_int4` ⭐	4.6 MB	4.08	0.835	0.864
ternlight `emb_ternary`	2.9 MB	2.43	0.710	0.756

Full methodology and reproduction scripts: eval/quality/RESULTS.md.

Performance (M-series Mac, Node single-threaded)

Metric	Value
Latency p50	~2 ms
Throughput	~450 emb/sec (sentence-length input)
Cold start	~112 ms (require + first inference)
Memory (RSS, post-warmup)	~150 MB

Throughput scales inversely with sequence length — ~~900 emb/sec on short queries (3-4 tokens), ~150 emb/sec on long paragraphs (~~25 tokens). Methodology: [eval/benchmarks/perf.js][perf-js].

Intended use

Designed for:

Short-string semantic similarity (queries, intents, FAQs, product titles, tags)
On-device deployment — browsers, Node services, Cloudflare Workers, Deno Deploy, Vercel Edge, Raspberry Pi-class ARM single-board computers
Cost-free embedding at any scale (no per-call API charges)
Privacy-sensitive workloads where queries cannot leave the user's device

Not designed for:

Long-document understanding (max input is 128 tokens — silently truncated above)
Multilingual workloads (English-only, inherited from MiniLM-L6)
Maximum absolute quality (use a frontier model like text-embedding-3-large or voyage-3 if quality dominates over size and deployability)

Limitations

English-only: the tokenizer and training data are English. Performance on non-English text is undefined and likely poor.
128-token cap: text longer than 128 BERT WordPiece tokens is silently truncated. Embed at sentence or short-paragraph granularity, not full document.
Custom runtime required: no transformers.AutoModel.from_pretrained() path is provided. Use the ternlight npm package or implement a custom loader from the binary format.
Inherited biases: ternlight is distilled from all-MiniLM-L6-v2, which inherits training-data biases from the sentence-transformers corpus. The same caveats around demographic and topical bias apply.
Pre-alpha (v0.1): the binary format and JS API may change before v1.0.

License

MIT, matching the teacher model and the ternlight project. See LICENSE.

Citation

If you use ternlight in published work, please cite:

@software{ternlight2026,
  title  = {ternlight: a 1.58-bit BitNet sentence embedder in 7 MB of WASM},
  author = {Tang, Wen Shu},
  year   = {2026},
  url    = {https://github.com/soycaporal/ternlight}
}

ternlight builds on:

BitNet b1.58 (Ma et al., 2024) — ternary weight training
sentence-transformers/all-MiniLM-L6-v2 — teacher model

Model tree for wenshutang/ternlight

Base model

nreimers/MiniLM-L6-H384-uncased

Quantized

sentence-transformers/all-MiniLM-L6-v2

Quantized

(84)

this model

Paper for wenshutang/ternlight

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Paper • 2402.17764 • Published Feb 27, 2024 • 630

wenshutang
/

ternlight

ternlight

Model variants

How to use

Path 1 — via the `ternlight` npm package (recommended)

Path 2 — direct download

Model details

Training

Provenance (`model-int4.bin`)

Evaluation

Spearman rank correlation vs teacher

Performance (M-series Mac, Node single-threaded)

Intended use

Limitations

License

Citation

Links

Model tree for wenshutang/ternlight

Paper for wenshutang/ternlight

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

ternlight

Model variants

How to use

Path 1 — via the ternlight npm package (recommended)

Path 2 — direct download

Model details

Training

Provenance (model-int4.bin)

Evaluation

Spearman rank correlation vs teacher

Performance (M-series Mac, Node single-threaded)

Intended use

Limitations

License

Citation

Links

Model tree for wenshutang/ternlight

Paper for wenshutang/ternlight

Path 1 — via the `ternlight` npm package (recommended)

Provenance (`model-int4.bin`)