Introduction

This repository hosts the LFM2.5-ColBERT-350M late-interaction retrieval model for the React Native ExecuTorch library, exported for the XNNPACK (Android / generic CPU) and MLX (Apple GPU) delegates.

Unlike a standard sentence embedder, which produces one vector per text, ColBERT is a multi-vector (late-interaction) model: it produces one 128-dimensional vector per token, i.e. a [numTokens, 128] matrix. Relevance is computed with MaxSim: for each query token, take the maximum dot product over all document tokens, then sum those maxima across the query tokens. Use it when you want stronger retrieval quality than single-vector embeddings, e.g. for RAG or search.
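Written out, the MaxSim description above corresponds to the following formula. The symbols are introduced here for illustration only: q_1..q_N are the query token vectors, d_1..d_M are the document token vectors, and each is a 128-dimensional row of the model output.

```latex
\mathrm{MaxSim}(Q, D) \;=\; \sum_{i=1}^{N} \; \max_{1 \le j \le M} \; q_i \cdot d_j
```

The sum runs over query tokens and the max over document tokens, so the score is not symmetric: swapping query and document generally gives a different value.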

Compatibility

The MLX variant requires a physical Apple Silicon device; it does not run on the iOS simulator. The XNNPACK variant runs on the CPU and has no such restriction. Make sure your runtime matches the ExecuTorch version used to export these .pte files; with React Native ExecuTorch, the library constants guarantee this.
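One way to encode the rule above is a small helper that picks the variant folder from facts about the device. This is a minimal sketch, not part of React Native ExecuTorch: `DeviceInfo` and `pickBackend` are hypothetical names, the caller is assumed to supply both flags, and treating every physical Apple device as MLX-capable is an assumption.

```typescript
// Hypothetical helper, not a React Native ExecuTorch API.
type Backend = "mlx" | "xnnpack";

interface DeviceInfo {
  isApple: boolean;          // assumed to be supplied by the caller
  isPhysicalDevice: boolean; // false on simulators and emulators
}

// Returns the name of the variant folder to load from.
function pickBackend(device: DeviceInfo): Backend {
  // MLX needs a physical Apple Silicon device; everything else,
  // including the iOS simulator, falls back to the CPU (XNNPACK) variant.
  return device.isApple && device.isPhysicalDevice ? "mlx" : "xnnpack";
}
```

The return values match the xnnpack/ and mlx/ folder names in this repository, so the result can be used directly when building a path to the right .pte file.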

Using it (late interaction)

The model is a per-token embedder; scoring is the consumer's concern:

  1. Prepend the role marker the model was trained with: "[Q] " for queries, "[D] " for documents.
  2. Run forward to get the per-token [S, 128] matrix for each text, where S is the number of tokens.
  3. Score each query↔document pair with MaxSim, optionally excluding document tokens whose ids are on the skiplist (punctuation) so they don't contribute. For this model, the skiplist (from its config_sentence_transformers.json) tokenizes to the inclusive id ranges [510..524, 535..541, 568..573, 600..603], 32 ids in total.
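The steps above can be sketched in plain TypeScript. This is an illustration under stated assumptions, not the library's API: all function and constant names are hypothetical, the embedding matrices are assumed to come from the model's forward pass (step 2, not shown), and `docTokenIds` is assumed to hold one token id per document row, in the same order.

```typescript
// Sketch only: every name below is illustrative, not a React Native ExecuTorch API.
type TokenEmbeddings = number[][]; // [S, 128], one row per token

// Step 1: prepend the role marker the model was trained with.
function withRoleMarker(text: string, role: "query" | "document"): string {
  return (role === "query" ? "[Q] " : "[D] ") + text;
}

// The inclusive skiplist id ranges listed in step 3.
const SKIPLIST_RANGES: ReadonlyArray<readonly [number, number]> = [
  [510, 524], [535, 541], [568, 573], [600, 603],
];

// Expand inclusive [start, end] ranges into a set of token ids.
function buildSkiplist(
  ranges: ReadonlyArray<readonly [number, number]>,
): Set<number> {
  const ids = new Set<number>();
  for (const [start, end] of ranges) {
    for (let id = start; id <= end; id++) ids.add(id);
  }
  return ids;
}

function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Step 3: for each query token, take the max dot product over the
// document tokens that were kept, then sum those maxima.
function maxSim(
  query: TokenEmbeddings,
  doc: TokenEmbeddings,
  docTokenIds?: number[],   // assumed aligned with the rows of `doc`
  skiplist?: Set<number>,
): number {
  // Drop document rows whose token id is on the skiplist, if one is given.
  const keptDoc =
    docTokenIds && skiplist
      ? doc.filter((_, j) => !skiplist.has(docTokenIds[j]))
      : doc;
  if (keptDoc.length === 0) return 0; // nothing left to match against

  let score = 0;
  for (const q of query) {
    let best = -Infinity;
    for (const d of keptDoc) best = Math.max(best, dot(q, d));
    score += best;
  }
  return score;
}
```

To rank candidates, compute `maxSim` between one query matrix and each document matrix and sort by score, highest first. Document matrices can be computed once and cached, since only the query side changes between searches.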

Repository Structure

  • xnnpack/, mlx/ — the partitioned .pte files + per-backend config.json.
  • tokenizer.json — wire to tokenizerSource.
  • config.json, tokenizer_config.json — reference metadata.