synth-4.0-modernbert → LiteRT
LiteRT (.tflite) exports of a finetune of
nomic-ai/modernbert-embed-base,
trained on the synth-4.0 wafer-domain dataset. Bundled for on-device
inference on Android (XNNPACK CPU delegate) and other LiteRT-compatible
runtimes.
Each artifact is a self-contained frozen graph: encoder body, mean-pool
over attention_mask, and L2 normalization are all baked in. The
consumer only needs to prepend the prompt prefix, tokenize, pad, and
feed input_ids + attention_mask; the graph returns the L2-normalized
768-dim sentence embedding directly.
License inherits from upstream nomic-ai/modernbert-embed-base (Apache
2.0). Verify against the upstream model card if relicensing matters.
Files
One tflite, multi-signature: a single dynamic_int8 flatbuffer holds
7 graph entry points sharing the same encoder weight blob.
| file | quant | size | signatures |
|---|---|---|---|
| synth-4.0-modernbert_multi_int8.tflite | dynamic_int8 | 187 MB | embed_{128, 256, 512, 1024, 2048, 4096, 8192} (one per seq_len) |
Mirrors Google's gemma-4-E2B-it.litertlm packaging pattern (verified by
inspecting their tf_lite_prefill_decode section, which carries multiple
prefill_{SEQ-LEN} signatures over one weight blob). Earlier this repo
shipped 7 separate per-seqlen .tflite files totalling 1.05 GB; the
multi-signature bundle replaces them at 187 MB (5.6× reduction) with no
quality loss: the weight blob is unchanged, only its duplication across
files is removed.
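To confirm what the bundle exposes, the signatures can be listed directly from the flatbuffer (a minimal sketch; assumes ai_edge_litert is installed and the file is in the working directory):

```python
from ai_edge_litert.interpreter import Interpreter

# One interpreter over the single flatbuffer; every signature shares the weight blob.
interp = Interpreter(model_path="synth-4.0-modernbert_multi_int8.tflite")
print(list(interp.get_signature_list().keys()))
# expected: the seven embed_{seq_len} entry points from the table above
```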
dynamic_int8 = weight-quantized int8 matmuls via XNNPACK (best on ARM
CPUs). fp32 exports were used at conversion time for numerics validation
against the upstream sentence-transformers reference and were not shipped
(they would add ~4 GB of duplicated weights with no inference advantage
on a phone). The conversion script (linked in Provenance) regenerates fp32
if needed.
Numerics validation
Validated against the upstream sentence-transformers reference loaded in fp32, over a 6-string suite:
| signature | mean cosine | min cosine | threshold |
|---|---|---|---|
| embed_128 / embed_256 / embed_512 (verified) | 0.991590 | 0.989623 | 0.98 |
| embed_{1024, 2048, 4096, 8192} (extrapolation) | not run (same code path, same shared weights) | not run | 0.98 |
(The fp32 export was bit-equivalent, cos = 1.0, at conversion time, then discarded.) int8 exhibits ~1–2% cosine drift, which is decision-equivalent on retrieval (post-quant retrieval eval pending). Larger seqlens were not exhaustively run (a sentence-transformers reference forward at seq=8192 on CPU is slow), but they use the same shared weight blob and the same graph code path as the validated short-seq signatures.
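For reference, a minimal sketch of the check described above, comparing LiteRT output against the upstream sentence-transformers model in fp32. `embed_litert` is a hypothetical helper that runs one of the embed_{seq_len} signatures (see the usage section below); it is not shipped in this repo:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

ref = SentenceTransformer("nomic-ai/modernbert-embed-base")  # upstream fp32 reference
texts = ["search_query: was my coffee spending more or less in november?"]

ref_emb = ref.encode(texts, normalize_embeddings=True, convert_to_numpy=True)
lite_emb = embed_litert(texts, seq_len=512)  # hypothetical wrapper over embed_512

# Both sides are unit-norm, so the row-wise dot product is the cosine similarity.
cos = np.sum(ref_emb * lite_emb, axis=1)
print(cos.mean(), cos.min())
assert cos.min() >= 0.98, f"cosine below threshold: {cos.min():.4f}"
```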
Architecture details
ModernBERT-base is a bidirectional encoder. 22 layers, hidden=768, head_dim=64, windowed attention with local=128 and global attention every 3 layers. Vocab=50368 (BPE tokenizer).
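These numbers can be cross-checked against the upstream config (a sketch; attribute names follow the ModernBertConfig in recent transformers releases):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("nomic-ai/modernbert-embed-base")
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.vocab_size)  # expect 22, 768, 50368
print(cfg.local_attention, cfg.global_attn_every_n_layers)     # expect 128, 3
```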
Consequences baked into the exported graph:
- Mean pool over `attention_mask`. Pad-side-agnostic; right-pad in practice.
- L2 normalize: output is unit-norm, ready for cosine retrieval.
- Matryoshka head: model trained at dim=768 with a 256-dim Matryoshka head. To use 256-dim on device: take the first 256 of the 768-dim output and re-normalize. Both dims are quality-validated by Nomic upstream.
Tensor shapes (all variants):
- Input `input_ids`: `[1, seq_len]`, int64
- Input `attention_mask`: `[1, seq_len]`, int64
- Output: `[1, 768]`, float32 (L2-normalized)
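These shapes can be verified from the signature runner itself (a sketch; assumes the multi-signature .tflite is present locally):

```python
from ai_edge_litert.interpreter import Interpreter

interp = Interpreter(model_path="synth-4.0-modernbert_multi_int8.tflite")
runner = interp.get_signature_runner("embed_512")
for name, d in runner.get_input_details().items():
    print(name, d["shape"], d["dtype"])   # expect [1, 512], int64 for both inputs
for name, d in runner.get_output_details().items():
    print(name, d["shape"], d["dtype"])   # expect [1, 768], float32
```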
Inference notes for the bridge
Prompt prefix is required: the model was trained with sentence-transformers prompts. The bridge must prepend before tokenization:
- For retrieval queries: `"search_query: " + text` (note trailing space)
- For documents: `"search_document: " + text` (note trailing space)
Without the prefix, the embedding distribution drifts off-distribution and retrieval quality drops. The prefix is not baked into the .tflite; the consumer is responsible.
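A minimal sketch of prefix handling on the consumer side (the prefix strings are from this card; the helper name is illustrative):

```python
QUERY_PREFIX = "search_query: "    # for retrieval queries
DOC_PREFIX = "search_document: "   # for documents being indexed

def add_prefix(text: str, is_query: bool) -> str:
    # Must run before tokenization; the prefix is not baked into the .tflite.
    return (QUERY_PREFIX if is_query else DOC_PREFIX) + text
```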
Pad with the tokenizer's [PAD] token (id=50283), right-padded. Mean-pool + mask
is pad-side-agnostic, but right-pad matches the converter's traced
sample shapes.
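If the bridge tokenizes without the tokenizer's built-in padding, a manual right-pad looks like this (a sketch; pad id 50283 is from this card):

```python
import numpy as np

PAD_ID = 50283  # [PAD]

def right_pad(token_ids: list[int], seq_len: int) -> tuple[np.ndarray, np.ndarray]:
    token_ids = token_ids[:seq_len]                       # truncate if over-length
    n_pad = seq_len - len(token_ids)
    input_ids = token_ids + [PAD_ID] * n_pad              # pad on the right
    attention_mask = [1] * len(token_ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return (np.array([input_ids], dtype=np.int64),
            np.array([attention_mask], dtype=np.int64))
```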
Reference Python usage
```python
import numpy as np
from transformers import AutoTokenizer
from ai_edge_litert.interpreter import Interpreter

SEQ_LEN = 512
QUERY_PREFIX = "search_query: "

tok = AutoTokenizer.from_pretrained(".")
interp = Interpreter(model_path="synth-4.0-modernbert_multi_int8.tflite")
interp.allocate_tensors()

# Pick the signature for the desired seq_len. Available:
# embed_128, embed_256, embed_512, embed_1024, embed_2048, embed_4096, embed_8192
runner = interp.get_signature_runner(f"embed_{SEQ_LEN}")
input_names = list(runner.get_input_details().keys())  # [tokens, mask] in order

text = "was my coffee spending more or less in november?"
enc = tok(QUERY_PREFIX + text, padding="max_length", truncation=True,
          max_length=SEQ_LEN, return_tensors="np")

out = runner(**{
    input_names[0]: enc["input_ids"].astype(np.int64),
    input_names[1]: enc["attention_mask"].astype(np.int64),
})
emb = list(out.values())[0]  # [1, 768], L2-normalized

# Optional: Matryoshka 256-dim (copy so the renormalization doesn't write back into emb)
emb_256 = emb[:, :256].copy()
emb_256 /= np.linalg.norm(emb_256, axis=1, keepdims=True)
```
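Continuing from the snippet above (reusing `np`, `tok`, and `interp`), a sketch of end-to-end retrieval scoring: pick the smallest bundled signature that fits the prefixed token count, embed the query and documents, and rank by dot product (cosine, since outputs are unit-norm). Helper names and the sample documents are illustrative:

```python
SIGNATURE_SEQ_LENS = (128, 256, 512, 1024, 2048, 4096, 8192)

def pick_seq_len(n_tokens: int) -> int:
    # smallest bundled signature that fits; fall back to 8192 and truncate
    return next((s for s in SIGNATURE_SEQ_LENS if n_tokens <= s), 8192)

def embed(text: str, prefix: str) -> np.ndarray:
    n_tokens = len(tok(prefix + text)["input_ids"])
    seq_len = pick_seq_len(n_tokens)
    runner = interp.get_signature_runner(f"embed_{seq_len}")
    names = list(runner.get_input_details().keys())
    enc = tok(prefix + text, padding="max_length", truncation=True,
              max_length=seq_len, return_tensors="np")
    out = runner(**{names[0]: enc["input_ids"].astype(np.int64),
                    names[1]: enc["attention_mask"].astype(np.int64)})
    return list(out.values())[0][0]          # [768], already L2-normalized

query = embed("was my coffee spending more or less in november?", "search_query: ")
docs = ["Coffee spend in November: $42", "Rent in November: $1800"]
doc_embs = [embed(d, "search_document: ") for d in docs]
scores = [float(query @ d) for d in doc_embs]   # cosine similarity per document
print(sorted(zip(scores, docs), reverse=True))
```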
Provenance
- Upstream base: nomic-ai/modernbert-embed-base
- Training pipeline: synth-4.0 (cached MNR + matryoshka 768/256, lr=5e-4, batch=64, 500 steps, 6× 4090 DDP, gather_across_devices=True)
- Quality leaderboard: NDCG@10 = 0.4675, Recall@10 = 0.5296 (#6 of 21 in retrieval-20260427)
- Conversion script: `on-device/conversion/convert_modernbert_embed.py` on the project-switchboard repo
- Conversion env: litert-torch 0.8.0, transformers 5.5.4, torch 2.9.1+cu128, Python 3.11