voyage-4-nano-onnx

An ONNX conversion of Voyage AI's voyage-4-nano embedding model for use with Transformers.js and ONNX Runtime. This repository contains both fp32 and quantized (q8) variants for flexible deployment.

Supported Runtimes

| Runtime | Device | Notes |
|---|---|---|
| Transformers.js | WebGPU | Browser GPU acceleration (Chrome 113+, Edge 113+) |
| Transformers.js | WASM/CPU | Universal browser support |
| ONNX Runtime | CPU/GPU | Python, Node.js, C++, C#, Java |
| ONNX Runtime Web | WebGPU/WASM | Direct ONNX Runtime in browser |

Model Details

| Property | Value |
|---|---|
| Original Model | voyageai/voyage-4-nano |
| Format | ONNX |
| Dimensions | 1024 (default); 512, 256 via Matryoshka |
| Context Length | 32,768 tokens |
| License | Apache 2.0 |

Available Variants

| Variant | dtype | File | Size | Use Case |
|---|---|---|---|---|
| Full Precision | fp32 | onnx/model.onnx | ~740 MB | Highest accuracy |
| Quantized | q8 | onnx/model_quantized.onnx | ~345 MB | Balanced speed/accuracy |

Usage

Transformers.js (WebGPU)

import { pipeline } from '@huggingface/transformers';

// Load with WebGPU (fp32 precision)
const extractor = await pipeline(
  'feature-extraction',
  'jsonMartin/voyage-4-nano-onnx',
  { device: 'webgpu', dtype: 'fp32' }
);

// Or use quantized model (q8) for smaller download
const extractorQ8 = await pipeline(
  'feature-extraction',
  'jsonMartin/voyage-4-nano-onnx',
  { device: 'webgpu', dtype: 'q8' }
);

// Document embedding (for indexing)
const docPrefix = "Represent the document for retrieval: ";
const docEmbedding = await extractor(docPrefix + "Your document text", {
  pooling: 'mean',
  normalize: true
});

// Query embedding (for search)
const queryPrefix = "Represent the query for retrieving supporting documents: ";
const queryEmbedding = await extractor(queryPrefix + "Your search query", {
  pooling: 'mean',
  normalize: true
});

console.log(docEmbedding.dims);  // [1, 1024]
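
Because the embeddings are returned L2-normalized (`normalize: true`), relevance between a query and a document can be scored with a plain dot product, which equals cosine similarity for unit vectors. A minimal sketch; the `dot` helper below is not part of Transformers.js:

```javascript
// Dot product of two equal-length vectors.
// For L2-normalized embeddings this equals cosine similarity.
function dot(a, b) {
    let sum = 0;
    for (let i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

// e.g. const score = dot(queryEmbedding.data, docEmbedding.data);
```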

Transformers.js (CPU/WASM)

import { pipeline } from '@huggingface/transformers';

// CPU inference with quantized model (default for WASM)
const extractor = await pipeline(
  'feature-extraction',
  'jsonMartin/voyage-4-nano-onnx',
  { dtype: 'q8' }  // defaults to CPU/WASM
);

const embedding = await extractor("Your text here", {
  pooling: 'mean',
  normalize: true
});

ONNX Runtime (Python)

import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("jsonMartin/voyage-4-nano-onnx")
session = ort.InferenceSession("onnx/model.onnx")

text = "Represent the document for retrieval: Your text here"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Create position_ids
seq_len = inputs["input_ids"].shape[1]
position_ids = np.arange(seq_len).reshape(1, -1).astype(np.int64)

outputs = session.run(None, {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
    "position_ids": position_ids
})

# Mean pooling
embeddings = outputs[0]
mask = inputs["attention_mask"]
pooled = (embeddings * mask[:, :, None]).sum(1) / mask.sum(1, keepdims=True)

# L2 normalize
normalized = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

Matryoshka Embeddings

Truncate to smaller dimensions with minimal quality loss:

// Get full 1024-dim embedding
const full = await extractor(text, { pooling: 'mean', normalize: false });

// Truncate to 512 dimensions
const truncated = full.data.slice(0, 512);

// Re-normalize
const norm = Math.sqrt(truncated.reduce((s, v) => s + v*v, 0));
const embedding512 = truncated.map(v => v / norm);
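
The same steps can be wrapped in a reusable helper (a sketch; the function name is illustrative, and it accepts either a plain array or a Float32Array):

```javascript
// Truncate an embedding to `dims` dimensions and re-normalize to unit length.
function truncateAndNormalize(embedding, dims) {
    const truncated = embedding.slice(0, dims);
    let norm = 0;
    for (let i = 0; i < truncated.length; i++) {
        norm += truncated[i] * truncated[i];
    }
    norm = Math.sqrt(norm) || 1e-9;
    return truncated.map(v => v / norm);
}
```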

Binary & Int8 Quantization (JavaScript)

These helpers are intended for testing and validation. Production applications using databases with built-in vector functions should pass float arrays directly and let the database handle quantization.

/**
 * Convert Float32 embedding to packed binary (Uint8Array).
 * 1024 dimensions -> 128 bytes (32x compression)
 */
function floatToBinary(embedding) {
    const numBytes = Math.ceil(embedding.length / 8);
    const binary = new Uint8Array(numBytes);

    for (let i = 0; i < embedding.length; i++) {
        if (embedding[i] > 0) {
            const byteIndex = Math.floor(i / 8);
            const bitIndex = 7 - (i % 8); // MSB first
            binary[byteIndex] |= (1 << bitIndex);
        }
    }
    return binary;
}

/**
 * Convert Float32 embedding to Int8 (scaled to [-127, 127]).
 * 1024 dimensions -> 1024 bytes (4x compression)
 */
function floatToInt8(embedding) {
    let norm = 0;
    for (let i = 0; i < embedding.length; i++) {
        norm += embedding[i] * embedding[i];
    }
    norm = Math.sqrt(norm) || 1e-9;

    const int8 = new Int8Array(embedding.length);
    for (let i = 0; i < embedding.length; i++) {
        const normalized = embedding[i] / norm;
        int8[i] = Math.max(-127, Math.min(127, Math.round(normalized * 127)));
    }
    return int8;
}

/**
 * Hamming distance between packed binary vectors.
 */
function hammingDistance(a, b) {
    let distance = 0;
    for (let i = 0; i < a.length; i++) {
        let xor = a[i] ^ b[i];
        while (xor) {
            distance += xor & 1;
            xor >>= 1;
        }
    }
    return distance;
}

// Usage: Two-stage retrieval
const embedding = await extractor(text, { pooling: 'mean', normalize: true });
const binary = floatToBinary(Array.from(embedding.data));  // For fast scan
const int8 = floatToInt8(Array.from(embedding.data));      // For rescore
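
The rescoring half of the two-stage pipeline can be sketched as an int8 dot product. Since `floatToInt8` scales unit vectors by 127, dividing the integer dot product by 127² recovers an approximate cosine score (the function name is illustrative):

```javascript
// Approximate cosine similarity between two int8-quantized embeddings
// produced by floatToInt8 (values scaled to [-127, 127]).
function int8Similarity(a, b) {
    let dot = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
    }
    return dot / (127 * 127);
}
```

In a two-stage search, Hamming distance over the binary vectors selects a candidate set quickly, and `int8Similarity` re-ranks only those candidates.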

Instruction Prefixes

Important: Use the appropriate prefix for best results:

  • Documents (indexing): "Represent the document for retrieval: "
  • Queries (search): "Represent the query for retrieving supporting documents: "
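
A small wrapper keeps the prefixes consistent between indexing and search (the constant and function names are illustrative):

```javascript
// Task-specific instruction prefixes for voyage-4-nano.
const DOC_PREFIX = "Represent the document for retrieval: ";
const QUERY_PREFIX = "Represent the query for retrieving supporting documents: ";

// Prepend the appropriate instruction before embedding.
const forIndexing = (text) => DOC_PREFIX + text;
const forSearch = (text) => QUERY_PREFIX + text;
```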

Storage Sizes

| Format | Size (1024 dims) | Compression | Use Case |
|---|---|---|---|
| Float32 | 4,096 bytes | 1x (baseline) | Full precision |
| Int8 | 1,024 bytes | 4x | Rescore after binary filter |
| Binary | 128 bytes | 32x | Fast Hamming scan |

For 100,000 documents:

  • Float32: 390.6 MB
  • Int8: 97.7 MB
  • Binary: 12.2 MB
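
These totals follow directly from the per-vector byte counts (sizes are in MiB, matching the figures above); a quick check:

```javascript
// Bytes per embedding for each storage format.
const bytesPerVector = {
    float32: (dims) => dims * 4,            // 4 bytes per dimension
    int8: (dims) => dims,                   // 1 byte per dimension
    binary: (dims) => Math.ceil(dims / 8),  // 1 bit per dimension, packed
};

// Total corpus size in MiB for `count` embeddings of `dims` dimensions.
function corpusSizeMB(format, dims, count) {
    return (bytesPerVector[format](dims) * count) / (1024 * 1024);
}

console.log(corpusSizeMB("float32", 1024, 100_000).toFixed(1)); // "390.6"
```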

Conversion Details

  • Converted with Hugging Face Optimum
  • Validated against original PyTorch model
  • Opset version: 18

License

Apache 2.0 (same as original model)
