Run a fine-tuned embedding model entirely in the browser

Community Article Published May 28, 2026

transformers.js + int8 ONNX + a Web Worker — no server, no API key, nothing leaves the tab.

I recently shipped a little demo that scores SaaS taglines 0–100 using a fine-tuned e5-large embedding model. The fun constraint: it runs 100% client-side. There is no inference server and no API key; the model downloads once and then works offline. Here's the exact shape that ended up working, including the two gotchas that cost me an afternoon.

The recipe, top to bottom:

1. Export your model to ONNX.
2. Quantize it (int8) so it's a reasonable browser download.
3. Run it with transformers.js, inside a Web Worker, on the WASM backend.

1. Export to ONNX

transformers.js loads ONNX, not safetensors. optimum does the conversion. For an embedding model you want the feature-extraction task:

pip install "optimum[onnxruntime]" transformers
optimum-cli export onnx --model your-org/your-embedding-model --task feature-extraction onnx_out/

Or in Python, if you want to script it / push to a repo:

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

m = ORTModelForFeatureExtraction.from_pretrained("your-org/your-embedding-model", export=True)
m.save_pretrained("onnx_out")
AutoTokenizer.from_pretrained("your-org/your-embedding-model").save_pretrained("onnx_out")

Layout matters. transformers.js expects config.json + tokenizer.json at the repo root and the ONNX weights under an onnx/ subfolder:

your-repo/
├── config.json
├── tokenizer.json
├── ...
└── onnx/
    └── model.onnx

So move the exported .onnx into onnx/ before you upload it to the Hub.
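Before uploading, it can help to check the file listing against that layout. The following is a minimal sketch, assuming you already have the repo's relative paths as an array. `missingFromLayout` is a hypothetical helper (not part of transformers.js), and its default ONNX path is the unquantized file from the tree above.

```javascript
// Files transformers.js needs at the repo root, per the layout above.
const REQUIRED_ROOT_FILES = ['config.json', 'tokenizer.json'];

// Returns the required paths that are absent from `paths`.
// `onnxFile` is the weights file you intend to load (hypothetical parameter).
function missingFromLayout(paths, onnxFile = 'onnx/model.onnx') {
  const present = new Set(paths);
  return [...REQUIRED_ROOT_FILES, onnxFile].filter((p) => !present.has(p));
}

// Example: the exporter wrote model.onnx at the root instead of under onnx/.
const listing = ['config.json', 'tokenizer.json', 'model.onnx'];
console.log(missingFromLayout(listing)); // reports onnx/model.onnx as missing
```

An empty result means the three files the loader looks for are where it expects them.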

2. Quantize (this is what makes it shippable)

fp32 e5-large is ~1.3 GB — a brutal browser download. Two cheap wins:

import onnx
from onnxconverter_common import float16
from onnxruntime.quantization import quantize_dynamic, QuantType

# fp16: ~half the size, WebGPU-native, basically lossless
onnx.save(float16.convert_float_to_float16(onnx.load("onnx/model.onnx"), keep_io_types=True),
          "onnx/model_fp16.onnx")

# int8 dynamic: ~a quarter the size
quantize_dynamic("onnx/model.onnx", "onnx/model_quantized.onnx", weight_type=QuantType.QInt8)

For e5-large that's roughly 1.3 GB → 670 MB (fp16) → 336 MB (int8).
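Those sizes line up with simple arithmetic: parameter count times bytes per weight. Here is a rough sketch, assuming about 335 million parameters (my assumption, picked to match the figures above) and ignoring graph metadata and any tensors the quantizer leaves in fp32.

```javascript
// Rough ONNX file size from parameter count and bytes per weight.
const BYTES_PER_WEIGHT = { fp32: 4, fp16: 2, int8: 1 };

function estimateSizeMB(paramCount, dtype) {
  const bytes = BYTES_PER_WEIGHT[dtype];
  if (bytes === undefined) throw new Error(`unknown dtype: ${dtype}`);
  return (paramCount * bytes) / 1e6;
}

const PARAMS = 335e6; // assumption: approximate size of an e5-large-class model
for (const dtype of Object.keys(BYTES_PER_WEIGHT)) {
  console.log(dtype, Math.round(estimateSizeMB(PARAMS, dtype)), 'MB');
}
```

Real files come out slightly different, but the estimate is close enough to decide which variant is worth shipping.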

Gotcha #1 — the filename mapping. transformers.js maps the dtype option to a filename: dtype: 'fp16' looks for onnx/model_fp16.onnx, and dtype: 'q8' looks for onnx/model_quantized.onnx (not model_q8.onnx). I named my int8 file model_q8.onnx and spent a while staring at a 404 before I figured that out. Name it model_quantized.onnx.
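To make the mapping concrete, here is a small lookup that mirrors it. This is an illustrative sketch, not the library's code: `onnxPathForDtype` is a hypothetical helper, the fp16 and q8 entries come from the behavior described above, and the fp32 entry (plain model.onnx) is my understanding of the default.

```javascript
// dtype -> filename suffix, as described above (fp32 entry is an assumption).
const DTYPE_SUFFIX = new Map([
  ['fp32', ''],
  ['fp16', '_fp16'],
  ['q8', '_quantized'],
]);

// Returns the path the loader would request for a given dtype.
function onnxPathForDtype(dtype, baseName = 'model') {
  if (!DTYPE_SUFFIX.has(dtype)) throw new Error(`no mapping listed for dtype: ${dtype}`);
  return `onnx/${baseName}${DTYPE_SUFFIX.get(dtype)}.onnx`;
}

console.log(onnxPathForDtype('q8'));   // onnx/model_quantized.onnx
console.log(onnxPathForDtype('fp16')); // onnx/model_fp16.onnx
```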

I verified the int8 model still discriminated correctly by running it through plain onnxruntime on CPU before trusting it in the browser — worth doing, because of gotcha #2.
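One way to put a number on "still discriminated correctly" is to score the same inputs with the fp32 and int8 models and count how many pairs keep their relative order. The sketch below is my own illustration, not the check the demo used, and the score arrays are made-up placeholders.

```javascript
// Fraction of item pairs that both score lists order the same way.
// Pairs tied in the reference scores are skipped; 1 means identical ranking.
function pairwiseAgreement(refScores, testScores) {
  if (refScores.length !== testScores.length) {
    throw new Error('score lists must have equal length');
  }
  let agree = 0;
  let total = 0;
  for (let i = 0; i < refScores.length; i++) {
    for (let j = i + 1; j < refScores.length; j++) {
      const refDiff = refScores[i] - refScores[j];
      if (refDiff === 0) continue;
      total++;
      if (Math.sign(refDiff) === Math.sign(testScores[i] - testScores[j])) agree++;
    }
  }
  return total === 0 ? 1 : agree / total;
}

// Hypothetical numbers: fp32 scores vs int8 scores for the same four inputs.
const fp32Scores = [0.91, 0.12, 0.55, 0.30];
const int8Scores = [0.89, 0.15, 0.52, 0.33];
console.log(pairwiseAgreement(fp32Scores, int8Scores)); // 1: ordering preserved
```

A value well below 1 on a held-out set would suggest the quantized model has lost ranking quality, whatever the raw embedding drift looks like.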

3. Run it — in a Web Worker, on WASM

The naive version works but freezes the page: the transformer forward pass runs on the main thread and blocks the UI for the duration of every embedding. The fix is a Web Worker. Here's a self-contained module worker (no separate file needed) that loads the model and answers embed requests:

const workerSrc = `
  import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3';
  let extractor;
  self.onmessage = async (e) => {
    const { type, id, text } = e.data;
    if (type === 'load') {
      extractor = await pipeline('feature-extraction', 'your-org/your-model', {
        device: 'wasm',          // see gotcha #2
        dtype: 'q8',             // -> onnx/model_quantized.onnx
        progress_callback: (p) => self.postMessage({ type: 'progress', p }),
      });
      self.postMessage({ type: 'ready' });
    } else if (type === 'embed') {
      const out = await extractor(text, { pooling: 'mean', normalize: true });
      const arr = Float32Array.from(out.data);
      self.postMessage({ type: 'embed', id, data: arr }, [arr.buffer]); // transfer, no copy
    }
  };
`;
const worker = new Worker(URL.createObjectURL(new Blob([workerSrc], { type: 'text/javascript' })),
                         { type: 'module' });

And a tiny promise wrapper on the main thread so calling it feels normal:

const pending = new Map();
let reqId = 0;
let markReady;
const ready = new Promise((res) => { markReady = res; });

worker.onmessage = (e) => {
  const m = e.data;
  if (m.type === 'ready') markReady();
  else if (m.type === 'embed') { pending.get(m.id)?.(m.data); pending.delete(m.id); }
  // (also handle 'progress' for a download bar)
};

async function embed(text) {
  await ready; // the worker has no extractor until the model has finished loading
  return new Promise((res) => {
    const id = ++reqId;
    pending.set(id, res);
    worker.postMessage({ type: 'embed', id, text });
  });
}

worker.postMessage({ type: 'load' });
// later, off the main thread, UI never blocks:
const vector = await embed('your text');   // Float32Array
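For the download bar, the 'progress' messages can be folded into a single fraction. This is a sketch under an assumption: that each progress event looks like { status: 'progress', file, loaded, total }, which is what I have seen transformers.js emit, but check the payload in your version. `createProgressTracker` is a hypothetical helper.

```javascript
// Tracks per-file byte counts and returns overall progress in [0, 1].
function createProgressTracker() {
  const files = new Map(); // file name -> { loaded, total }
  return function update(p) {
    // Assumed event shape: { status: 'progress', file, loaded, total }.
    if (p && p.status === 'progress' && p.file && p.total > 0) {
      files.set(p.file, { loaded: p.loaded, total: p.total });
    }
    let loaded = 0;
    let total = 0;
    for (const f of files.values()) {
      loaded += f.loaded;
      total += f.total;
    }
    return total > 0 ? loaded / total : 0;
  };
}

const track = createProgressTracker();
track({ status: 'progress', file: 'tokenizer.json', loaded: 50, total: 100 });
const fraction = track({
  status: 'progress', file: 'onnx/model_quantized.onnx', loaded: 100, total: 300,
});
console.log(fraction); // (50 + 100) / (100 + 300) = 0.375
```

In the main-thread handler this would be something like `if (m.type === 'progress') bar.value = track(m.p);`. The fraction can dip when a new file starts downloading, since its total only becomes known at that point.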

pooling: 'mean' + normalize: true gives you the sentence embedding directly. (If your model wants a prefix — e5 uses "query: " — prepend it before calling embed.)
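A tiny helper keeps the prefix from being forgotten or applied twice. This is a sketch: `withE5Prefix` is a hypothetical name, and which prefix your fine-tuned model expects depends on how it was trained (the e5 family uses "query: " and "passage: ").

```javascript
// e5-style prefixes. Which one applies depends on how the model was trained.
const E5_PREFIXES = new Map([
  ['query', 'query: '],
  ['passage', 'passage: '],
]);

// Prepends the prefix unless the text already starts with it.
function withE5Prefix(text, kind = 'query') {
  const prefix = E5_PREFIXES.get(kind);
  if (prefix === undefined) throw new Error(`unknown prefix kind: ${kind}`);
  const trimmed = text.trim();
  return trimmed.startsWith(prefix) ? trimmed : prefix + trimmed;
}

console.log(withE5Prefix('Ship faster with fewer meetings'));
// query: Ship faster with fewer meetings
```

Usage would then be `await embed(withE5Prefix(userText))`.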

Gotcha #2 — backend choice. I assumed WebGPU. It bit me twice:

device: 'webgpu', dtype: 'fp16' worked but OOM'd on machines with less VRAM (a 670 MB model + activations is a lot for some GPUs).

device: 'webgpu', dtype: 'q8' returned garbage: the embeddings collapsed to a near-constant vector. In my testing, int8 on the WebGPU backend wasn't reliable.

device: 'wasm', dtype: 'q8' was correct and low-RAM (it runs in system memory, not VRAM). A bit slower per call — but in a worker, you don't feel it for single embeddings.
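The collapsed-embedding failure is silent, so a cheap runtime check is worth considering: embed a few clearly unrelated strings after load and see whether the vectors are all nearly identical. This is a sketch of that idea, assuming L2-normalized embeddings (as produced with normalize: true). `looksCollapsed` is a hypothetical helper and the 0.999 threshold is an arbitrary starting point.

```javascript
// Dot product; equals cosine similarity when both vectors are L2-normalized.
function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

// True when every pair of embeddings is nearly identical.
// `threshold` is an assumption; tune it against a known-good backend.
function looksCollapsed(vectors, threshold = 0.999) {
  if (vectors.length < 2) throw new Error('need at least two embeddings');
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      if (dot(vectors[i], vectors[j]) < threshold) return false;
    }
  }
  return true;
}

// Toy 2-d vectors standing in for real embeddings.
console.log(looksCollapsed([[1, 0], [1, 0], [1, 0]])); // true
console.log(looksCollapsed([[1, 0], [0.6, 0.8]]));     // false
```

Healthy e5 embeddings of unrelated texts can still have fairly high cosine similarity, which is why the threshold sits so close to 1.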

So I shipped WASM + int8. If your model is small enough that fp16 fits comfortably, WebGPU-fp16 is faster; for a 300M+ param model on unknown hardware, WASM-int8 is the safe default.
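That rule of thumb can be written down as a small decision function. This is a sketch, not a recommendation from the library: `pickBackend` is a hypothetical helper, and the 300M cutoff is just the figure from the paragraph above turned into a default.

```javascript
// Chooses pipeline options from the rule of thumb above.
// `maxWebGPUParams` is an assumed cutoff; adjust it for your model and users.
function pickBackend({ hasWebGPU, paramCount, maxWebGPUParams = 300e6 }) {
  if (hasWebGPU && paramCount < maxWebGPUParams) {
    return { device: 'webgpu', dtype: 'fp16' };
  }
  return { device: 'wasm', dtype: 'q8' }; // safe default on unknown hardware
}

// In a browser or worker, WebGPU support can be feature-detected like this.
const hasWebGPU = typeof navigator !== 'undefined' && 'gpu' in navigator;
console.log(pickBackend({ hasWebGPU, paramCount: 335e6 })); // wasm + q8 for a 335M model
```

The presence of navigator.gpu only says the API exists; it says nothing about available VRAM, so the fp16 path can still run out of memory on weaker GPUs.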

(Optional) applying a task head, client-side

My demo isn't just embeddings; it's a quality score. I trained the embedder end-to-end with a pairwise ranking loss, then kept the tiny linear head as a 1024-dim weight vector + bias in an .npz. For the browser, export those two arrays to a small JSON; scoring is then just a dot product, no extra model:

// emb = Float32Array from embed(); coef/intercept loaded from a small JSON
let score = intercept;
for (let i = 0; i < coef.length; i++) score += emb[i] * coef[i];

Any linear probe / classifier head you trained on top of frozen (or fine-tuned) embeddings ports to the browser this way.
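Here is a slightly more defensive version of the same idea, with the JSON parsing and a dimension check included. The file format ({ "coef": [...], "intercept": number }) and the helper names are my assumptions, not the demo's actual code.

```javascript
// Parse a head stored as { "coef": [...], "intercept": number } (assumed format).
function parseHead(jsonText) {
  const { coef, intercept } = JSON.parse(jsonText);
  if (!Array.isArray(coef) || typeof intercept !== 'number') {
    throw new Error('head JSON must have a coef array and a numeric intercept');
  }
  return { coef: Float32Array.from(coef), intercept };
}

// Raw linear score: intercept + dot(emb, coef).
function linearScore(emb, head) {
  if (emb.length !== head.coef.length) {
    throw new Error(`dimension mismatch: ${emb.length} vs ${head.coef.length}`);
  }
  let score = head.intercept;
  for (let i = 0; i < emb.length; i++) score += emb[i] * head.coef[i];
  return score;
}

// Toy 3-d head; a real one would match the embedding size (1024 here).
const head = parseHead('{"coef": [0.5, -0.25, 2], "intercept": 1}');
console.log(linearScore(Float32Array.from([2, 4, 0.5]), head)); // 1 + 1 - 1 + 1 = 2
```

The length check catches the easy mistake of pairing a head with embeddings from a different model size.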

Recap of the gotchas

transformers.js needs config.json + tokenizer.json at root, ONNX under onnx/.
dtype: 'q8' → the file must be named model_quantized.onnx.
Run inference in a Web Worker or the UI freezes.
WASM + int8 is the reliable, low-RAM default; WebGPU-int8 can silently return garbage; WebGPU-fp16 can OOM.

Where I used it

This is the engine behind Tagline Rater — type any SaaS hero tagline and a fine-tuned e5-large rates it 0–100, entirely in your browser. The model and data are open: standd/tagline-quality-e5-ranker.

Built by the team behind Hey Lefty.
