# voyage-4-nano-onnx

ONNX conversion of Voyage AI's voyage-4-nano embedding model for use with Transformers.js and ONNX Runtime. This repo contains both fp32 and quantized (q8) variants for flexible deployment.
## Supported Runtimes
| Runtime | Device | Notes |
|---|---|---|
| Transformers.js | WebGPU | Browser GPU acceleration (Chrome 113+, Edge 113+) |
| Transformers.js | WASM/CPU | Universal browser support |
| ONNX Runtime | CPU/GPU | Python, Node.js, C++, C#, Java |
| ONNX Runtime Web | WebGPU/WASM | Direct ONNX Runtime in browser |
## Model Details
| Property | Value |
|---|---|
| Original Model | voyageai/voyage-4-nano |
| Format | ONNX |
| Dimensions | 1024 (default), 512, 256 via Matryoshka |
| Context Length | 32,768 tokens |
| License | Apache 2.0 |
## Available Variants

| Variant | dtype | File | Size | Use Case |
|---|---|---|---|---|
| Full Precision | fp32 | `onnx/model.onnx` | ~740 MB | Highest accuracy |
| Quantized | q8 | `onnx/model_quantized.onnx` | ~345 MB | Balanced speed/accuracy |
## Usage

### Transformers.js (WebGPU)
```javascript
import { pipeline } from '@huggingface/transformers';

// Load with WebGPU (fp32 precision)
const extractor = await pipeline(
  'feature-extraction',
  'jsonMartin/voyage-4-nano-onnx',
  { device: 'webgpu', dtype: 'fp32' }
);

// Or use the quantized model (q8) for a smaller download
const extractorQ8 = await pipeline(
  'feature-extraction',
  'jsonMartin/voyage-4-nano-onnx',
  { device: 'webgpu', dtype: 'q8' }
);

// Document embedding (for indexing)
const docPrefix = "Represent the document for retrieval: ";
const docEmbedding = await extractor(docPrefix + "Your document text", {
  pooling: 'mean',
  normalize: true
});

// Query embedding (for search)
const queryPrefix = "Represent the query for retrieving supporting documents: ";
const queryEmbedding = await extractor(queryPrefix + "Your search query", {
  pooling: 'mean',
  normalize: true
});

console.log(docEmbedding.dims); // [1, 1024]
```
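Because both embeddings are mean-pooled and L2-normalized, scoring a query against a document reduces to a dot product. A minimal sketch (`dotSimilarity` is an illustrative name, not a Transformers.js API):

```javascript
// Cosine similarity of two L2-normalized vectors is just their dot product.
function dotSimilarity(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Usage with the embeddings from above:
// const score = dotSimilarity(queryEmbedding.data, docEmbedding.data);
```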
### Transformers.js (CPU/WASM)
```javascript
import { pipeline } from '@huggingface/transformers';

// CPU inference with the quantized model (default for WASM)
const extractor = await pipeline(
  'feature-extraction',
  'jsonMartin/voyage-4-nano-onnx',
  { dtype: 'q8' } // defaults to CPU/WASM
);

const embedding = await extractor("Your text here", {
  pooling: 'mean',
  normalize: true
});
```
### ONNX Runtime (Python)
```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jsonMartin/voyage-4-nano-onnx")
session = ort.InferenceSession("onnx/model.onnx")

text = "Represent the document for retrieval: Your text here"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Create position_ids
seq_len = inputs["input_ids"].shape[1]
position_ids = np.arange(seq_len).reshape(1, -1).astype(np.int64)

outputs = session.run(None, {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
    "position_ids": position_ids,
})

# Mean pooling over the sequence dimension, masking padding tokens
embeddings = outputs[0]
mask = inputs["attention_mask"]
pooled = (embeddings * mask[:, :, None]).sum(1) / mask.sum(1, keepdims=True)

# L2 normalize
normalized = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```
## Matryoshka Embeddings

Truncate to smaller dimensions with minimal quality loss:
```javascript
// Get the full 1024-dim embedding
const full = await extractor(text, { pooling: 'mean', normalize: false });

// Truncate to 512 dimensions
const truncated = full.data.slice(0, 512);

// Re-normalize
const norm = Math.sqrt(truncated.reduce((s, v) => s + v * v, 0));
const embedding512 = truncated.map(v => v / norm);
```
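The truncate-and-renormalize steps can be packaged into a reusable helper (`truncateEmbedding` is an illustrative name; it accepts plain or typed arrays):

```javascript
// Truncate a Matryoshka embedding to `dim` dimensions and re-normalize
// so the result is unit length again.
function truncateEmbedding(embedding, dim) {
  const truncated = Array.from(embedding).slice(0, dim);
  const norm = Math.sqrt(truncated.reduce((s, v) => s + v * v, 0)) || 1;
  return truncated.map(v => v / norm);
}
```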
## Binary & Int8 Quantization (JavaScript)

These helpers are intended for testing and validation. Production applications using databases with built-in vector functions should pass float arrays directly and let the database handle quantization.
```javascript
/**
 * Convert a Float32 embedding to packed binary (Uint8Array).
 * 1024 dimensions -> 128 bytes (32x compression)
 */
function floatToBinary(embedding) {
  const numBytes = Math.ceil(embedding.length / 8);
  const binary = new Uint8Array(numBytes);
  for (let i = 0; i < embedding.length; i++) {
    if (embedding[i] > 0) {
      const byteIndex = Math.floor(i / 8);
      const bitIndex = 7 - (i % 8); // MSB first
      binary[byteIndex] |= (1 << bitIndex);
    }
  }
  return binary;
}

/**
 * Convert a Float32 embedding to Int8 (scaled to [-127, 127]).
 * 1024 dimensions -> 1024 bytes (4x compression)
 */
function floatToInt8(embedding) {
  let norm = 0;
  for (let i = 0; i < embedding.length; i++) {
    norm += embedding[i] * embedding[i];
  }
  norm = Math.sqrt(norm) || 1e-9;
  const int8 = new Int8Array(embedding.length);
  for (let i = 0; i < embedding.length; i++) {
    const normalized = embedding[i] / norm;
    int8[i] = Math.max(-127, Math.min(127, Math.round(normalized * 127)));
  }
  return int8;
}

/**
 * Hamming distance between packed binary vectors.
 */
function hammingDistance(a, b) {
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    let xor = a[i] ^ b[i];
    while (xor) {
      distance += xor & 1;
      xor >>= 1;
    }
  }
  return distance;
}

// Usage: two-stage retrieval
const embedding = await extractor(text, { pooling: 'mean', normalize: true });
const binary = floatToBinary(Array.from(embedding.data)); // for the fast scan
const int8 = floatToInt8(Array.from(embedding.data));     // for the rescore
```
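The two-stage retrieval pattern mentioned above can be sketched end to end: a cheap Hamming scan over every document's binary code produces a shortlist, which is then rescored with an int8 dot product. `twoStageSearch` is a hypothetical helper for illustration, and the shortlist multiplier of 4 is an arbitrary choice:

```javascript
// Two-stage retrieval sketch: Hamming scan over packed binary codes,
// then int8 dot-product rescore of the top candidates.
// Self-contained: a local popcount-style Hamming distance is inlined.
function twoStageSearch(queryBinary, queryInt8, docsBinary, docsInt8, k = 10) {
  const hamming = (a, b) => {
    let d = 0;
    for (let i = 0; i < a.length; i++) {
      let x = a[i] ^ b[i];
      while (x) { d += x & 1; x >>= 1; }
    }
    return d;
  };

  // Stage 1: fast scan — keep the 4*k closest binary codes.
  const shortlist = docsBinary
    .map((bin, idx) => ({ idx, dist: hamming(queryBinary, bin) }))
    .sort((a, b) => a.dist - b.dist)
    .slice(0, k * 4);

  // Stage 2: int8 dot-product rescore of the shortlist.
  return shortlist
    .map(({ idx }) => {
      let score = 0;
      for (let i = 0; i < queryInt8.length; i++) score += queryInt8[i] * docsInt8[idx][i];
      return { idx, score };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```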
## Instruction Prefixes

**Important:** Use the appropriate prefix for best results:

- **Documents (indexing):** `"Represent the document for retrieval: "`
- **Queries (search):** `"Represent the query for retrieving supporting documents: "`
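To avoid prefix typos scattering through application code, the two strings can be centralized in small wrappers (`forIndexing`/`forSearch` are illustrative names, not part of the model's API):

```javascript
// Prefix strings from this model card; wrap input text before embedding.
const DOC_PREFIX = "Represent the document for retrieval: ";
const QUERY_PREFIX = "Represent the query for retrieving supporting documents: ";

const forIndexing = (text) => DOC_PREFIX + text;
const forSearch = (text) => QUERY_PREFIX + text;
```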
## Storage Sizes
| Format | Size (1024 dims) | Compression | Use Case |
|---|---|---|---|
| Float32 | 4,096 bytes | 1x (baseline) | Full precision |
| Int8 | 1,024 bytes | 4x | Rescore after binary filter |
| Binary | 128 bytes | 32x | Fast Hamming scan |
For 100,000 documents:
- Float32: 390.6 MB
- Int8: 97.7 MB
- Binary: 12.2 MB
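These totals are just bytes-per-vector times document count, converted at 1 MB = 1,048,576 bytes. A quick check, assuming 1,024 dimensions:

```javascript
// Bytes per embedding at 1,024 dimensions.
const BYTES = { float32: 1024 * 4, int8: 1024, binary: 1024 / 8 };

// Total storage in MB for n documents.
const storageMB = (format, n) => (BYTES[format] * n) / (1024 * 1024);

console.log(storageMB("float32", 100_000).toFixed(1)); // 390.6
console.log(storageMB("int8", 100_000).toFixed(1));    // 97.7
console.log(storageMB("binary", 100_000).toFixed(1));  // 12.2
```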
## Conversion Details
- Converted using HuggingFace Optimum
- Validated against original PyTorch model
- Opset version: 18
## License

Apache 2.0 (same as the original model)
## Acknowledgments
- Voyage AI for the original model
- Hugging Face for Transformers and Optimum