synth-4.0-modernbert → LiteRT

LiteRT (.tflite) exports of a finetune of nomic-ai/modernbert-embed-base, trained on the synth-4.0 wafer-domain dataset. Bundled for on-device inference on Android (XNNPACK CPU delegate) and other LiteRT-compatible runtimes.

Each artifact is a self-contained frozen graph: encoder body, mean-pool over attention_mask, and L2 normalization are all baked in. The consumer only needs to prepend the prompt prefix, tokenize, pad, and feed input_ids + attention_mask; the graph returns the L2-normalized 768-dim sentence embedding directly.

License inherits from upstream nomic-ai/modernbert-embed-base (Apache 2.0). Verify against the upstream model card if relicensing matters.

Files

One tflite, multi-signature: a single dynamic_int8 flatbuffer holds 7 graph entry points sharing the same encoder weight blob.

file                                     quant         size    signatures
synth-4.0-modernbert_multi_int8.tflite   dynamic_int8  187 MB  embed_{128, 256, 512, 1024, 2048, 4096, 8192} (one per seq_len)

Mirrors Google's gemma-4-E2B-it.litertlm packaging pattern (verified by inspecting its tf_lite_prefill_decode section, which carries multiple prefill_{SEQ-LEN} signatures over one weight blob). Earlier, this repo shipped 7 separate per-seqlen .tflite files totalling 1.05 GB; the multi-signature bundle replaces them at 187 MB (a 5.6× reduction) with no quality loss: the weight blob is unchanged, only its duplication across files is removed.
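
The packed entry points can be enumerated at load time. A minimal sketch (the path is this repo's bundle; get_signature_list is the standard LiteRT interpreter call):

from ai_edge_litert.interpreter import Interpreter

# Enumerate the graph entry points packed into the multi-signature flatbuffer.
interp = Interpreter(model_path="synth-4.0-modernbert_multi_int8.tflite")
for name, io in interp.get_signature_list().items():
    print(name, io["inputs"], io["outputs"])
# Expect embed_128 ... embed_8192, all backed by the one shared weight blob.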

dynamic_int8 = weight-quantized int8 matmuls via XNNPACK (best on ARM CPUs). fp32 exports were used at conversion time for numerics validation against the upstream sentence-transformers reference and were not shipped: they would add ~4 GB of duplicated weights with no inference advantage on a phone. The conversion script (linked under Provenance) regenerates fp32 if needed.

Numerics validation

Validated against the upstream sentence-transformers reference loaded in fp32, over a 6-string suite:

signature                                      mean cosine  min cosine  threshold
embed_128 / embed_256 / embed_512 (verified)   0.991590     0.989623    0.98
embed_{1024, 2048, 4096, 8192} (extrapolation)  same code path, same shared weights  0.98

(The fp32 export was bit-equivalent, cos = 1.0, at conversion time, then discarded.) int8 exhibits ~1-2% cosine drift, which is decision-equivalent for retrieval (a post-quant retrieval eval is pending). Larger seqlens were not exhaustively run (a sentence-transformers reference forward pass at seq=8192 on CPU is slow), but they use the same shared weight blob and the same graph code path as the validated short-seq signatures.
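
The shape of the check was roughly as follows. This is a sketch, not the exact validation script: the 6-string suite is not reproduced here, the upstream base id stands in for the synth-4.0 finetune, and embed_with_litert is a hypothetical helper wrapping the signature-runner call shown under Reference Python usage.

import numpy as np
from sentence_transformers import SentenceTransformer

# fp32 reference (stand-in model id; the real check used the finetune).
ref = SentenceTransformer("nomic-ai/modernbert-embed-base")

texts = ["search_query: placeholder one", "search_query: placeholder two"]
ref_emb = ref.encode(texts, normalize_embeddings=True)   # [N, 768], unit-norm

# embed_with_litert: hypothetical helper returning one [768] LiteRT embedding.
lite_emb = np.stack([embed_with_litert(t) for t in texts])

cos = np.sum(ref_emb * lite_emb, axis=1)  # dot of unit vectors = cosine
print(f"mean={cos.mean():.6f} min={cos.min():.6f}")  # gate: min >= 0.98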

Architecture details

ModernBERT-base is a bidirectional encoder. 22 layers, hidden=768, head_dim=64, windowed attention with local=128 and global attention every 3 layers. Vocab=50368 (BPE tokenizer).

Consequences baked into the exported graph:

  • Mean pool over attention_mask. Pad-side-agnostic; right-pad in practice (see the numpy sketch after this list).
  • L2 normalize: output is unit-norm, ready for cosine retrieval.
  • Matryoshka head: the model was trained at dim=768 with a 256-dim Matryoshka head. To use 256-dim on device, take the first 256 of the 768 output dims and re-normalize. Both dims are quality-validated by Nomic upstream.
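
A numpy sketch of the pooling the graph bakes in, assuming hidden states of shape [1, seq_len, 768] from the encoder body (this replicates the in-graph ops for reference; it is not needed at inference):

import numpy as np

def pool_and_normalize(hidden_states, attention_mask):
    # Mean pool over real (unmasked) tokens, then L2-normalize.
    mask = attention_mask[..., None].astype(np.float32)   # [1, seq, 1]
    summed = (hidden_states * mask).sum(axis=1)           # masked sum, [1, 768]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)        # real-token counts
    pooled = summed / counts                              # mean pool
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)  # unit norm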

Tensor shapes (all variants):

  • Input input_ids: [1, seq_len] int64
  • Input attention_mask: [1, seq_len] int64
  • Output: [1, 768] float32 (L2-normalized)
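
The contract can be confirmed at runtime from a signature runner. A minimal check (embed_512 chosen arbitrarily):

from ai_edge_litert.interpreter import Interpreter

interp = Interpreter(model_path="synth-4.0-modernbert_multi_int8.tflite")
runner = interp.get_signature_runner("embed_512")

for name, d in runner.get_input_details().items():
    print("in ", name, d["shape"], d["dtype"])   # expect [1, 512] int64 for both
for name, d in runner.get_output_details().items():
    print("out", name, d["shape"], d["dtype"])   # expect [1, 768] float32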

Inference notes for the bridge

Prompt prefix is required: the model was trained with sentence-transformers prompts. The bridge must prepend the prefix before tokenization:

  • For retrieval queries: "search_query: " + text (note trailing space)
  • For documents: "search_document: " + text (note trailing space)

Without the prefix, embeddings drift off-distribution and retrieval quality drops. The prefix is not baked into the .tflite; the consumer is responsible for it.

Right-pad with the tokenizer's [PAD] token (id=50283). Mean-pool + mask is pad-side-agnostic, but right-padding matches the converter's traced sample shapes.
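
Concretely, for the two roles (illustrative strings only; the tokenizer call mirrors the reference usage below):

from transformers import AutoTokenizer

QUERY_PREFIX = "search_query: "    # trailing space required
DOC_PREFIX = "search_document: "   # trailing space required

tok = AutoTokenizer.from_pretrained(".")  # tokenizer files ship alongside the .tflite

# padding="max_length" right-pads with [PAD] (id=50283) by default for this
# tokenizer; match max_length to the chosen embed_{seq_len} signature.
enc = tok(DOC_PREFIX + "Coffee: $42.10 across 9 transactions in November.",
          padding="max_length", truncation=True, max_length=512,
          return_tensors="np")
assert enc["input_ids"].shape == (1, 512)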

Reference Python usage

import numpy as np
from transformers import AutoTokenizer
from ai_edge_litert.interpreter import Interpreter

SEQ_LEN = 512
QUERY_PREFIX = "search_query: "

tok = AutoTokenizer.from_pretrained(".")  # tokenizer files ship alongside the .tflite
interp = Interpreter(model_path="synth-4.0-modernbert_multi_int8.tflite")
interp.allocate_tensors()

# Pick the signature for the desired seq_len. Available:
#   embed_128, embed_256, embed_512, embed_1024, embed_2048, embed_4096, embed_8192
runner = interp.get_signature_runner(f"embed_{SEQ_LEN}")
input_names = list(runner.get_input_details().keys())  # [tokens, mask] in order

text = "was my coffee spending more or less in november?"
enc = tok(QUERY_PREFIX + text, padding="max_length", truncation=True,
          max_length=SEQ_LEN, return_tensors="np")

out = runner(**{
    input_names[0]: enc["input_ids"].astype(np.int64),
    input_names[1]: enc["attention_mask"].astype(np.int64),
})
emb = list(out.values())[0]  # [1, 768], L2-normalized

# Optional: Matryoshka 256-dim
emb_256 = emb[:, :256].copy()  # copy: slicing yields a view; in-place normalize must not touch emb
emb_256 /= np.linalg.norm(emb_256, axis=1, keepdims=True)

Provenance

  • Upstream base: nomic-ai/modernbert-embed-base
  • Training pipeline: synth-4.0 (cached MNR + matryoshka 768/256, lr=5e-4, batch=64, 500 steps, 6× 4090 DDP, gather_across_devices=True)
  • Quality leaderboard: NDCG@10 = 0.4675, Recall@10 = 0.5296 (#6 of 21 in retrieval-20260427)
  • Conversion script: on-device/conversion/convert_modernbert_embed.py on the project-switchboard repo
  • Conversion env: litert-torch 0.8.0, transformers 5.5.4, torch 2.9.1+cu128, Python 3.11