| --- |
| title: README |
| emoji: ๐ |
| colorFrom: blue |
| colorTo: green |
| sdk: static |
| pinned: false |
| --- |
| |
| # UniVec |
|
|
| UniVec publishes vector conversion models that map embeddings from one model's vector space into another without re-embedding the source text. |
|
|
| ## What is vector conversion? |
|
|
| A corpus embedded with a particular model is bound to that model's vector space: queries must be encoded by the same model for nearest-neighbour search to remain meaningful. Migrating to a different embedder (whether driven by deprecation, an upgrade or a provider change) normally requires re-embedding every document. The cost scales with corpus size and recurs each time the underlying model changes. |
|
|
| A conversion model takes pre-computed source-space vectors and outputs target-space vectors. The training objective is retrieval-order preservation: for a given query, the top-K nearest neighbours in the converted space should match the top-K in the native target space, despite differences in dimensionality, distance distribution and noise structure. |
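One way to check retrieval-order preservation is to measure how much the top-K neighbour sets overlap between converted vectors and native target vectors. The sketch below is an illustration only, not UniVec's training or evaluation code. It assumes unit-normalized rows (so the dot product equals cosine similarity), and the helper names `top_k_ids` and `top_k_overlap` are hypothetical.

```python
import numpy as np


def top_k_ids(queries: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k most similar corpus rows for each query row."""
    scores = queries @ corpus.T  # cosine similarity when rows are unit-normalized
    # argsort is ascending, so negate the scores to rank the most similar rows first
    return np.argsort(-scores, axis=1)[:, :k]


def top_k_overlap(ids_a: np.ndarray, ids_b: np.ndarray) -> float:
    """Mean fraction of neighbour ids shared per query, ignoring rank order."""
    k = ids_a.shape[1]
    shared = [len(set(a) & set(b)) for a, b in zip(ids_a.tolist(), ids_b.tolist())]
    return float(np.mean(shared)) / k
```

To compare spaces, compute `top_k_ids` once on converted queries against a converted corpus and once on native target-model queries against a native target-model corpus, then pass both results to `top_k_overlap`. A value of 1.0 means every query retrieved the same K documents in both spaces.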
|
|
| The category sits adjacent to embedding generation (text -> vector) and embedding distillation (large model -> small model) but addresses a separate problem: relocating existing vectors between spaces. |
|
|
| ## What's published here |
|
|
| Two tracks, both released under Apache 2.0. |
|
|
| **General-purpose converters** are trained on broad, heterogeneous data to generalise across domains. These are the recommended models for production self-hosting. Coverage targets the most-requested source/target pairs across OpenAI, Cohere, Google, AWS Titan, Snowflake Arctic, BGE and GTE. |
|
|
| **Benchmarking converters** are trained on MTEB-aligned distributions and serve as reference points against published translation results. Open weights, open metrics, identical evaluation script. |
|
|
| Both tracks ship as single ONNX files. They run on CPU or GPU via ONNX Runtime, take a `(batch, source_dim)` array of unit-normalized vectors and return a `(batch, target_dim)` array in the target model's space. |
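Because the models expect unit-normalized input, vectors that were stored unnormalized need an L2 normalization pass first. A minimal sketch follows; `l2_normalize` is a hypothetical helper name, and the `eps` guard for all-zero rows is an assumption rather than documented behaviour.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of a (batch, dim) array to unit L2 norm, as float32."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Clamp tiny norms so all-zero rows stay zero instead of producing NaN
    return vectors / np.maximum(norms, eps)
```

Rows that are already unit length pass through unchanged up to floating-point rounding, so applying the helper unconditionally before conversion is harmless.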
|
|
| ## Available models |
|
|
| The catalog below lists what is currently published on the Hub. The hosted API at <https://univec.ai> covers additional pairs and bridge configurations. |
|
|
| ### General-purpose |
|
|
| | Source | Target | Model | |
| |--------|--------|-------| |
| | Alibaba GTE Large EN v1.5 | Nomic Embed Text v1.5 | [Conversion model](https://huggingface.co/univec/convert-alibaba_nlp_gte_large_en_v1.5-to-nomic_embed_text_v1.5) | |
| | Amazon Titan Embed Text v2.0 | OpenAI text-embedding-ada-002 | [Conversion model](https://huggingface.co/univec/convert-amazon_titan_embed_text_v2.0-to-openai_text_embedding_ada_002) | |
| | BAAI BGE-M3 | OpenAI text-embedding-ada-002 | [Conversion model](https://huggingface.co/univec/convert-baai_bge_m3-to-openai_text_embedding_ada_002) | |
| | BAAI BGE-M3 | Snowflake Arctic Embed L v2.0 | [Conversion model](https://huggingface.co/univec/convert-baai_bge_m3-to-snowflake_arctic_embed_l_v2.0) | |
| | Cohere Embed English v3.0 | OpenAI text-embedding-ada-002 | [Conversion model](https://huggingface.co/univec/convert-cohere_embed_english_v3.0-to-openai_text_embedding_ada_002) | |
| | Google EmbeddingGemma 300M | OpenAI text-embedding-ada-002 | [Conversion model](https://huggingface.co/univec/convert-google_embeddinggemma_300m-to-openai_text_embedding_ada_002) | |
| | OpenAI text-embedding-3-large | Google Gemini text-embedding-004 | [Conversion model](https://huggingface.co/univec/convert-openai_text_embedding_3_large-to-gemini_text_embedding_004) | |
| | OpenAI text-embedding-3-small | OpenAI text-embedding-ada-002 | [Conversion model](https://huggingface.co/univec/convert-openai_text_embedding_3_small-to-openai_text_embedding_ada_002) | |
| | Snowflake Arctic Embed L v2.0 | BAAI BGE-M3 | [Conversion model](https://huggingface.co/univec/convert-snowflake_arctic_embed_l_v2.0-to-baai_bge_m3) | |
| | Snowflake Arctic Embed L v2.0 | OpenAI text-embedding-ada-002 | [Conversion model](https://huggingface.co/univec/convert-snowflake_arctic_embed_l_v2.0-to-openai_text_embedding_ada_002) | |
|
|
| ### Benchmarking (MTEB-aligned) |
|
|
| | Source | Target | Model | |
| |--------|--------|-------| |
| | BAAI BGE-M3 | Snowflake Arctic Embed L v2.0 | [Conversion model](https://huggingface.co/univec/convert-baai_bge_m3-to-snowflake_arctic_embed_l_v2.0-mteb) | |
| | Cohere Embed English v3.0 | OpenAI text-embedding-ada-002 | [Conversion model](https://huggingface.co/univec/convert-cohere_embed_english_v3.0-to-openai_text_embedding_ada_002-mteb) | |
| | Google EmbeddingGemma 300M | OpenAI text-embedding-ada-002 | [Conversion model](https://huggingface.co/univec/convert-google_embeddinggemma_300m-to-openai_text_embedding_ada_002-mteb) | |
| | OpenAI text-embedding-3-small | OpenAI text-embedding-ada-002 | [Conversion model](https://huggingface.co/univec/convert-openai_text_embedding_3_small-to-openai_text_embedding_ada_002-mteb) | |
| | Snowflake Arctic Embed L v2.0 | BAAI BGE-M3 | [Conversion model](https://huggingface.co/univec/convert-snowflake_arctic_embed_l_v2.0-to-baai_bge_m3-mteb) | |
|
|
| ## Quick start |
|
|
| ```bash |
| pip install onnxruntime numpy |
| ``` |
|
|
| ```python |
| import numpy as np |
| import onnxruntime as ort |
| |
| session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"]) |
| input_name = session.get_inputs()[0].name |
| |
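| # Placeholder: replace with your own source-model embeddings, |
| # a unit-normalized array of shape (N, source_dim). |
| embeddings = ... |
| |
| # Cast to float32 before running the converter. |
| converted = session.run(None, {input_name: embeddings.astype(np.float32)})[0] |
| # converted has shape (N, target_dim) and lives in the target model's space. |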
| ``` |
|
|
| Each model repository also ships `univec_inference.py`, a CLI covering batching, GPU execution and `.npy` / `.jsonl` input. |
|
|
| ## What's not published here |
|
|
| The set above is a curated subset of the UniVec catalog. The full catalog covers around 100 conversion pairs and includes bridge configurations (two-hop conversions through an intermediary model when no direct pair is trained). The hosted API at <https://univec.ai> exposes the unpublished pairs along with managed inference. |
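A bridge configuration amounts to running two converters back to back. The sketch below illustrates the idea for locally available converters. It does not describe how the hosted API implements bridges. The name `chain_converters` is hypothetical, and re-normalizing before each hop is an assumption based on the unit-normalized input contract stated earlier.

```python
from typing import Callable, Sequence

import numpy as np

# A converter is any callable mapping (batch, dim_in) -> (batch, dim_out)
Converter = Callable[[np.ndarray], np.ndarray]


def chain_converters(vectors: np.ndarray, hops: Sequence[Converter]) -> np.ndarray:
    """Apply each converter in order, re-normalizing rows before every hop."""
    out = np.asarray(vectors, dtype=np.float32)
    for convert in hops:
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        # Assumed: each hop expects unit-norm input; clamp to avoid division by zero
        out = out / np.maximum(norms, 1e-12)
        out = np.asarray(convert(out), dtype=np.float32)
    return out
```

With ONNX Runtime, each hop can be a thin wrapper such as `lambda x: session.run(None, {input_name: x})[0]`, using the same calls as in the quick start. Errors from the two hops can compound, so retrieval quality is worth checking on a sample before relying on a chained conversion.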
|
|
| ## License |
|
|
| Apache 2.0. Free for commercial use, redistribution and fine-tuning. |
|
|
| ## Links |
|
|
| - Website and hosted API: <https://univec.ai> |
| - Discussion: the discussion tab of each model repository on Hugging Face |
|
|