synth-4.0-modernbert → LiteRT
LiteRT (.tflite) exports of a finetune of
nomic-ai/modernbert-embed-base,
trained on the synth-4.0 wafer-domain dataset. Bundled for on-device
inference on Android (XNNPACK CPU delegate) and other LiteRT-compatible
runtimes.
Each artifact is a self-contained frozen graph: encoder body, mean-pool
over attention_mask, and L2 normalization are all baked in. The
consumer only needs to prepend the prompt prefix, tokenize, pad, and
feed input_ids + attention_mask; the graph returns the L2-normalized
768-dim sentence embedding directly.
License inherits from upstream nomic-ai/modernbert-embed-base (Apache
2.0). Verify against the upstream model card if relicensing matters.
Files
One tflite, multi-signature: a single dynamic_int8 flatbuffer holds
7 graph entry points sharing the same encoder weight blob.
| file | quant | size | signatures |
|---|---|---|---|
| synth-4.0-modernbert_multi_int8.tflite | dynamic_int8 | 187 MB | embed_{128, 256, 512, 1024, 2048, 4096, 8192} (one per seq_len) |
Mirrors Google's gemma-4-E2B-it.litertlm packaging pattern (verified by
inspecting their tf_lite_prefill_decode section, which carries multiple
prefill_{SEQ-LEN} signatures over one weight blob). Earlier this repo
shipped 7 separate per-seqlen .tflite files totalling 1.05 GB; the
multi-signature bundle replaces them at 187 MB (5.6× reduction) with no
quality loss: the weight blob is unchanged, only its duplication across
files is removed.
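To confirm what the bundle exposes, the signatures can be listed directly from the flatbuffer (a minimal sketch; assumes ai_edge_litert is installed and the file is in the working directory):

```python
from ai_edge_litert.interpreter import Interpreter

# One interpreter over the single flatbuffer; every signature shares the weight blob.
interp = Interpreter(model_path="synth-4.0-modernbert_multi_int8.tflite")
print(list(interp.get_signature_list().keys()))
# expected: the seven embed_{seq_len} entry points from the table above
```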
dynamic_int8 = weight-quantized int8 matmuls via XNNPACK (best on ARM
CPUs). fp32 exports were used at conversion time for numerics validation
against the upstream sentence-transformers reference and were not shipped
(they would add ~4 GB of duplicated weights with no inference advantage
on a phone). The conversion script (linked in Provenance) regenerates fp32
if needed.
Numerics validation
Validated against the upstream sentence-transformers reference loaded in fp32, over a 6-string suite:
| signature | mean cosine | min cosine | threshold |
|---|---|---|---|
| embed_128 / embed_256 / embed_512 (verified) | 0.991590 | 0.989623 | 0.98 |
| embed_{1024, 2048, 4096, 8192} (extrapolation) | not run (same code path, same shared weights) | not run | 0.98 |
(The fp32 export was bit-equivalent, cos = 1.0, at conversion time, then discarded.) int8 exhibits ~1–2% cosine drift, which is decision-equivalent on retrieval (post-quant retrieval eval pending). Larger seqlens were not exhaustively run (a sentence-transformers reference forward at seq=8192 on CPU is slow), but they use the same shared weight blob and the same graph code path as the validated short-seq signatures.
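For reference, a minimal sketch of the check described above, comparing LiteRT output against the upstream sentence-transformers model in fp32. `embed_litert` is a hypothetical helper that runs one of the embed_{seq_len} signatures (see the usage section below); it is not shipped in this repo:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

ref = SentenceTransformer("nomic-ai/modernbert-embed-base")  # upstream fp32 reference
texts = ["search_query: was my coffee spending more or less in november?"]

ref_emb = ref.encode(texts, normalize_embeddings=True, convert_to_numpy=True)
lite_emb = embed_litert(texts, seq_len=512)  # hypothetical wrapper over embed_512

# Both sides are unit-norm, so the row-wise dot product is the cosine similarity.
cos = np.sum(ref_emb * lite_emb, axis=1)
print(cos.mean(), cos.min())
assert cos.min() >= 0.98, f"cosine below threshold: {cos.min():.4f}"
```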
Architecture details
ModernBERT-base is a bidirectional encoder. 22 layers, hidden=768, head_dim=64, windowed attention with local=128 and global attention every 3 layers. Vocab=50368 (BPE tokenizer).
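These numbers can be cross-checked against the upstream config (a sketch; attribute names follow the ModernBertConfig in recent transformers releases):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("nomic-ai/modernbert-embed-base")
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.vocab_size)  # expect 22, 768, 50368
print(cfg.local_attention, cfg.global_attn_every_n_layers)     # expect 128, 3
```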
Consequences baked into the exported graph:
- Mean pool over `attention_mask`. Pad-side-agnostic; right-pad in practice.
- L2 normalize: output is unit-norm, ready for cosine retrieval.
- Matryoshka head: model trained at dim=768 with a 256-dim Matryoshka head. To use 256-dim on device: take the first 256 of the 768-dim output and re-normalize. Both dims are quality-validated by Nomic upstream.
Tensor shapes (all variants):
- Input `input_ids`: `[1, seq_len]`, int64
- Input `attention_mask`: `[1, seq_len]`, int64
- Output: `[1, 768]`, float32 (L2-normalized)
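These shapes can be verified from the signature runner itself (a sketch; assumes the multi-signature .tflite is present locally):

```python
from ai_edge_litert.interpreter import Interpreter

interp = Interpreter(model_path="synth-4.0-modernbert_multi_int8.tflite")
runner = interp.get_signature_runner("embed_512")
for name, d in runner.get_input_details().items():
    print(name, d["shape"], d["dtype"])   # expect [1, 512], int64 for both inputs
for name, d in runner.get_output_details().items():
    print(name, d["shape"], d["dtype"])   # expect [1, 768], float32
```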
Inference notes for the bridge
Prompt prefix is required: the model was trained with sentence-transformers prompts. The bridge must prepend before tokenization:
- For retrieval queries: `"search_query: " + text` (note trailing space)
- For documents: `"search_document: " + text` (note trailing space)
Without the prefix, the embedding distribution drifts off-distribution and retrieval quality drops. The prefix is not baked into the .tflite; the consumer is responsible.
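A minimal sketch of prefix handling on the consumer side (the prefix strings are from this card; the helper name is illustrative):

```python
QUERY_PREFIX = "search_query: "    # for retrieval queries
DOC_PREFIX = "search_document: "   # for documents being indexed

def add_prefix(text: str, is_query: bool) -> str:
    # Must run before tokenization; the prefix is not baked into the .tflite.
    return (QUERY_PREFIX if is_query else DOC_PREFIX) + text
```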
Pad with the tokenizer's [PAD] token (id=50283), right-padded. Mean-pool + mask
is pad-side-agnostic, but right-pad matches the converter's traced
sample shapes.
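If the bridge tokenizes without the tokenizer's built-in padding, a manual right-pad looks like this (a sketch; pad id 50283 is from this card):

```python
import numpy as np

PAD_ID = 50283  # [PAD]

def right_pad(token_ids: list[int], seq_len: int) -> tuple[np.ndarray, np.ndarray]:
    token_ids = token_ids[:seq_len]                       # truncate if over-length
    n_pad = seq_len - len(token_ids)
    input_ids = token_ids + [PAD_ID] * n_pad              # pad on the right
    attention_mask = [1] * len(token_ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return (np.array([input_ids], dtype=np.int64),
            np.array([attention_mask], dtype=np.int64))
```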
Reference Python usage
```python
import numpy as np
from transformers import AutoTokenizer
from ai_edge_litert.interpreter import Interpreter

SEQ_LEN = 512
QUERY_PREFIX = "search_query: "

tok = AutoTokenizer.from_pretrained(".")
interp = Interpreter(model_path="synth-4.0-modernbert_multi_int8.tflite")
interp.allocate_tensors()

# Pick the signature for the desired seq_len. Available:
# embed_128, embed_256, embed_512, embed_1024, embed_2048, embed_4096, embed_8192
runner = interp.get_signature_runner(f"embed_{SEQ_LEN}")
input_names = list(runner.get_input_details().keys())  # [tokens, mask] in order

text = "was my coffee spending more or less in november?"
enc = tok(QUERY_PREFIX + text, padding="max_length", truncation=True,
          max_length=SEQ_LEN, return_tensors="np")

out = runner(**{
    input_names[0]: enc["input_ids"].astype(np.int64),
    input_names[1]: enc["attention_mask"].astype(np.int64),
})
emb = list(out.values())[0]  # [1, 768], L2-normalized

# Optional: Matryoshka 256-dim (copy so the renormalization doesn't write back into emb)
emb_256 = emb[:, :256].copy()
emb_256 /= np.linalg.norm(emb_256, axis=1, keepdims=True)
```
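Continuing from the snippet above (reusing `np`, `tok`, and `interp`), a sketch of end-to-end retrieval scoring: pick the smallest bundled signature that fits the prefixed token count, embed the query and documents, and rank by dot product (cosine, since outputs are unit-norm). Helper names and the sample documents are illustrative:

```python
SIGNATURE_SEQ_LENS = (128, 256, 512, 1024, 2048, 4096, 8192)

def pick_seq_len(n_tokens: int) -> int:
    # smallest bundled signature that fits; fall back to 8192 and truncate
    return next((s for s in SIGNATURE_SEQ_LENS if n_tokens <= s), 8192)

def embed(text: str, prefix: str) -> np.ndarray:
    n_tokens = len(tok(prefix + text)["input_ids"])
    seq_len = pick_seq_len(n_tokens)
    runner = interp.get_signature_runner(f"embed_{seq_len}")
    names = list(runner.get_input_details().keys())
    enc = tok(prefix + text, padding="max_length", truncation=True,
              max_length=seq_len, return_tensors="np")
    out = runner(**{names[0]: enc["input_ids"].astype(np.int64),
                    names[1]: enc["attention_mask"].astype(np.int64)})
    return list(out.values())[0][0]          # [768], already L2-normalized

query = embed("was my coffee spending more or less in november?", "search_query: ")
docs = ["Coffee spend in November: $42", "Rent in November: $1800"]
doc_embs = [embed(d, "search_document: ") for d in docs]
scores = [float(query @ d) for d in doc_embs]   # cosine similarity per document
print(sorted(zip(scores, docs), reverse=True))
```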
Provenance
- Upstream base: nomic-ai/modernbert-embed-base
- Training pipeline: synth-4.0 (cached MNR + matryoshka 768/256, lr=5e-4, batch=64, 500 steps, 6× 4090 DDP, gather_across_devices=True)
- Quality leaderboard: NDCG@10 = 0.4675, Recall@10 = 0.5296 (#6 of 21 in retrieval-20260427)
- Conversion script: `on-device/conversion/convert_modernbert_embed.py` on the project-switchboard repo
- Conversion env: litert-torch 0.8.0, transformers 5.5.4, torch 2.9.1+cu128, Python 3.11