voyage-4-nano-onnx

An ONNX conversion of Voyage AI's voyage-4-nano embedding model for use with Transformers.js and ONNX Runtime. This repository contains both fp32 and quantized (q8) variants for flexible deployment.

Supported Runtimes

| Runtime | Device | Notes |
|---|---|---|
| Transformers.js | WebGPU | Browser GPU acceleration (Chrome 113+, Edge 113+) |
| Transformers.js | WASM/CPU | Universal browser support |
| ONNX Runtime | CPU/GPU | Python, Node.js, C++, C#, Java |
| ONNX Runtime Web | WebGPU/WASM | Direct ONNX Runtime in browser |

Model Details

| Property | Value |
|---|---|
| Original Model | voyageai/voyage-4-nano |
| Format | ONNX |
| Dimensions | 1024 (default); 512, 256 via Matryoshka |
| Context Length | 32,768 tokens |
| License | Apache 2.0 |

Available Variants

| Variant | dtype | File | Size | Use Case |
|---|---|---|---|---|
| Full Precision | fp32 | onnx/model.onnx | ~740 MB | Highest accuracy |
| Quantized | q8 | onnx/model_quantized.onnx | ~345 MB | Balanced speed/accuracy |

Usage

Transformers.js (WebGPU)

import { pipeline } from '@huggingface/transformers';

// Load with WebGPU (fp32 precision)
const extractor = await pipeline(
  'feature-extraction',
  'jsonMartin/voyage-4-nano-onnx',
  { device: 'webgpu', dtype: 'fp32' }
);

// Or use quantized model (q8) for smaller download
const extractorQ8 = await pipeline(
  'feature-extraction',
  'jsonMartin/voyage-4-nano-onnx',
  { device: 'webgpu', dtype: 'q8' }
);

// Document embedding (for indexing)
const docPrefix = "Represent the document for retrieval: ";
const docEmbedding = await extractor(docPrefix + "Your document text", {
  pooling: 'mean',
  normalize: true
});

// Query embedding (for search)
const queryPrefix = "Represent the query for retrieving supporting documents: ";
const queryEmbedding = await extractor(queryPrefix + "Your search query", {
  pooling: 'mean',
  normalize: true
});

console.log(docEmbedding.dims);  // [1, 1024]
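
Because the embeddings are returned L2-normalized (`normalize: true`), relevance between a query and a document can be scored with a plain dot product, which equals cosine similarity for unit vectors. A minimal sketch; the `dot` helper below is not part of Transformers.js:

```javascript
// Dot product of two equal-length vectors.
// For L2-normalized embeddings this equals cosine similarity.
function dot(a, b) {
    let sum = 0;
    for (let i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

// e.g. const score = dot(queryEmbedding.data, docEmbedding.data);
```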

Transformers.js (CPU/WASM)

import { pipeline } from '@huggingface/transformers';

// CPU inference with quantized model (default for WASM)
const extractor = await pipeline(
  'feature-extraction',
  'jsonMartin/voyage-4-nano-onnx',
  { dtype: 'q8' }  // defaults to CPU/WASM
);

const embedding = await extractor("Your text here", {
  pooling: 'mean',
  normalize: true
});

ONNX Runtime (Python)

import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("jsonMartin/voyage-4-nano-onnx")
session = ort.InferenceSession("onnx/model.onnx")

text = "Represent the document for retrieval: Your text here"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Create position_ids
seq_len = inputs["input_ids"].shape[1]
position_ids = np.arange(seq_len).reshape(1, -1).astype(np.int64)

outputs = session.run(None, {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
    "position_ids": position_ids
})

# Mean pooling
embeddings = outputs[0]
mask = inputs["attention_mask"]
pooled = (embeddings * mask[:, :, None]).sum(1) / mask.sum(1, keepdims=True)

# L2 normalize
normalized = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

Matryoshka Embeddings

Truncate to smaller dimensions with minimal quality loss:

// Get full 1024-dim embedding
const full = await extractor(text, { pooling: 'mean', normalize: false });

// Truncate to 512 dimensions
const truncated = full.data.slice(0, 512);

// Re-normalize
const norm = Math.sqrt(truncated.reduce((s, v) => s + v*v, 0));
const embedding512 = truncated.map(v => v / norm);
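
The same steps can be wrapped in a reusable helper (a sketch; the function name is illustrative, and it accepts either a plain array or a Float32Array):

```javascript
// Truncate an embedding to `dims` dimensions and re-normalize to unit length.
function truncateAndNormalize(embedding, dims) {
    const truncated = embedding.slice(0, dims);
    let norm = 0;
    for (let i = 0; i < truncated.length; i++) {
        norm += truncated[i] * truncated[i];
    }
    norm = Math.sqrt(norm) || 1e-9;
    return truncated.map(v => v / norm);
}
```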

Binary & Int8 Quantization (JavaScript)

These helpers are intended for testing and validation. Production applications using databases with built-in vector functions should pass float arrays directly and let the database handle quantization.

/**
 * Convert Float32 embedding to packed binary (Uint8Array).
 * 1024 dimensions -> 128 bytes (32x compression)
 */
function floatToBinary(embedding) {
    const numBytes = Math.ceil(embedding.length / 8);
    const binary = new Uint8Array(numBytes);

    for (let i = 0; i < embedding.length; i++) {
        if (embedding[i] > 0) {
            const byteIndex = Math.floor(i / 8);
            const bitIndex = 7 - (i % 8); // MSB first
            binary[byteIndex] |= (1 << bitIndex);
        }
    }
    return binary;
}

/**
 * Convert Float32 embedding to Int8 (scaled to [-127, 127]).
 * 1024 dimensions -> 1024 bytes (4x compression)
 */
function floatToInt8(embedding) {
    let norm = 0;
    for (let i = 0; i < embedding.length; i++) {
        norm += embedding[i] * embedding[i];
    }
    norm = Math.sqrt(norm) || 1e-9;

    const int8 = new Int8Array(embedding.length);
    for (let i = 0; i < embedding.length; i++) {
        const normalized = embedding[i] / norm;
        int8[i] = Math.max(-127, Math.min(127, Math.round(normalized * 127)));
    }
    return int8;
}

/**
 * Hamming distance between packed binary vectors.
 */
function hammingDistance(a, b) {
    let distance = 0;
    for (let i = 0; i < a.length; i++) {
        let xor = a[i] ^ b[i];
        while (xor) {
            distance += xor & 1;
            xor >>= 1;
        }
    }
    return distance;
}

// Usage: Two-stage retrieval
const embedding = await extractor(text, { pooling: 'mean', normalize: true });
const binary = floatToBinary(Array.from(embedding.data));  // For fast scan
const int8 = floatToInt8(Array.from(embedding.data));      // For rescore
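
The rescoring half of the two-stage pipeline can be sketched as an int8 dot product. Since `floatToInt8` scales unit vectors by 127, dividing the integer dot product by 127² recovers an approximate cosine score (the function name is illustrative):

```javascript
// Approximate cosine similarity between two int8-quantized embeddings
// produced by floatToInt8 (values scaled to [-127, 127]).
function int8Similarity(a, b) {
    let dot = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
    }
    return dot / (127 * 127);
}
```

In a two-stage search, Hamming distance over the binary vectors selects a candidate set quickly, and `int8Similarity` re-ranks only those candidates.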

Instruction Prefixes

Important: Use the appropriate prefix for best results:

  • Documents (indexing): "Represent the document for retrieval: "
  • Queries (search): "Represent the query for retrieving supporting documents: "
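
A small wrapper keeps the prefixes consistent between indexing and search (the constant and function names are illustrative):

```javascript
// Task-specific instruction prefixes for voyage-4-nano.
const DOC_PREFIX = "Represent the document for retrieval: ";
const QUERY_PREFIX = "Represent the query for retrieving supporting documents: ";

// Prepend the appropriate instruction before embedding.
const forIndexing = (text) => DOC_PREFIX + text;
const forSearch = (text) => QUERY_PREFIX + text;
```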

Storage Sizes

| Format | Size (1024 dims) | Compression | Use Case |
|---|---|---|---|
| Float32 | 4,096 bytes | 1x (baseline) | Full precision |
| Int8 | 1,024 bytes | 4x | Rescore after binary filter |
| Binary | 128 bytes | 32x | Fast Hamming scan |

For 100,000 documents:

  • Float32: 390.6 MB
  • Int8: 97.7 MB
  • Binary: 12.2 MB
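
These totals follow directly from the per-vector byte counts (sizes are in MiB, matching the figures above); a quick check:

```javascript
// Bytes per embedding for each storage format.
const bytesPerVector = {
    float32: (dims) => dims * 4,            // 4 bytes per dimension
    int8: (dims) => dims,                   // 1 byte per dimension
    binary: (dims) => Math.ceil(dims / 8),  // 1 bit per dimension, packed
};

// Total corpus size in MiB for `count` embeddings of `dims` dimensions.
function corpusSizeMB(format, dims, count) {
    return (bytesPerVector[format](dims) * count) / (1024 * 1024);
}

console.log(corpusSizeMB("float32", 1024, 100_000).toFixed(1)); // "390.6"
```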

Conversion Details

  • Converted with Hugging Face Optimum
  • Validated against original PyTorch model
  • Opset version: 18

License

Apache 2.0 (same as original model)
