---
license: apache-2.0
language:
- multilingual
library_name: transformers
pipeline_tag: feature-extraction
base_model: Qwen/Qwen3-Embedding-0.6B
tags:
- onnx
- teradata
- byom
- embeddings
- feature-extraction
- qwen
- qwen3
- decoder
---
> Read the disclaimer below before using this model.
----
# qwen3-embedding-0.6b -- ONNX for Teradata BYOM
This repository hosts an **ONNX-converted** version of the upstream
model [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B),
packaged for the Teradata Vantage `mldb.ONNXEmbeddings` BYOM
function. It is **not** the original PyTorch model -- only the
inference graph and tokenizer needed for in-database embedding
generation.
What's different from upstream:
- **Format**: ONNX (opset 14, IR version 8 -- BYOM 6+ compatible),
produced from the upstream weights with architecture-aware
post-processing baked in.
- **Precision**: dynamic int8 quantization. See the variants table
below for what is shipped for this model.
- **Pooling and post-processing**: **last_token** pooling is applied
inside the graph, so the `sentence_embedding` output is already the
pooled embedding vector. The model expects
a query-time instruction prefix (see "Instruction prefix" below).
- **Verification**: every variant's cosine fidelity vs. the
upstream PyTorch reference is recorded on a fixed FLORES-200
sample. Numbers may not generalize to your data.
## Model details
| | |
|---|---|
| Upstream repo | [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) |
| Architecture | `Qwen3ForCausalLM` (decoder) |
| Parameters | 595,776,512 |
| Output dimensions | 1024 |
| Pooling | `last_token` |
| Instruction prefix | yes |
| Max input tokens (advertised) | 32768 |
| Languages | 100+ (upstream claim) |
| License | apache-2.0 |
| ONNX opset | 14 |
| ONNX IR version | 8 (BYOM 6+ compatible) |
The upstream model card claims support for "100+ languages" in prose but does not publish an enumerated list of language codes. Treat the count of 100 as an approximate upstream claim, not a precise enumeration.
### Instruction prefix
This model was trained to expect a short natural-language instruction
prepended to each **query** at encode time. Documents are embedded
without a prefix. The ONNX graph itself is prefix-agnostic -- the prefix is
plain text that flows through the tokenizer. Downstream BYOM SQL is
responsible for prepending it (typically with a CTE that concatenates
the instruction with each input row).
The upstream model card uses a free-form natural-language **task
description** prepended to each query, in the format:
```
Instruct: {task_description}
Query:{query}
```
Example task descriptions from the upstream model card (snapshot at
publish time -- see the upstream card for fuller guidance):
- Given a web search query, retrieve relevant passages that answer the query
- Given a scientific claim, retrieve documents that support or refute it
- Retrieve semantically similar text
Instructions should be customized to your task; the upstream card
reports a typical 1-5% quality improvement vs. unprefixed queries.
For multilingual deployments, write instructions in English. See
[`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) for the
canonical guidance.
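
As a minimal sketch of the prefixing step, assuming the upstream template is `Instruct: {task_description}`, a newline, then `Query:{query}` (check the upstream card for the exact spacing), a client-side helper could look like the following. The name `build_query_text` is hypothetical and not part of any library; documents pass through unchanged.

```python
def build_query_text(task_description: str, query: str) -> str:
    """Prepend the instruction to a query (assumed upstream template)."""
    return f"Instruct: {task_description}\nQuery:{query}"


# Example task description taken from the list above.
TASK = "Given a web search query, retrieve relevant passages that answer the query"

queries = ["What is the capital of China?", "Explain gravity"]
documents = ["The capital of China is Beijing."]

# Only queries get the prefix; documents are embedded as-is.
query_texts = [build_query_text(TASK, q) for q in queries]
document_texts = list(documents)
```

The same string can be produced in SQL instead; what matters is that the text reaching the tokenizer already contains the instruction.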
## Quantization variants
This repository ships the following variants. Quality numbers are
measured against the upstream PyTorch reference on a fixed
FLORES-200 sample. The **Size** column is the on-disk size of the
ONNX weight file in megabytes (MB, 10^6 bytes).
| Variant | Size (MB) | p50 cosine | R@1 |
|---|---|---|---|
| `ffn_skip` | 1391.6 | 0.993496 | 0.930 |
How to read the quality columns:
- **p50 cosine** is the median cosine similarity between this
variant's embeddings and the fp32 ONNX reference, computed over
a fixed evaluation set. Higher means closer to the unquantized
model; **1.0** is identical.
- **R@1** is top-1 retrieval consistency: if you use this variant
as a search index, R@1 is the fraction of queries that get the
same nearest neighbor as the fp32 reference would. Higher is
better.
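
The two metrics can be sketched in plain Python as follows. This is an illustration of the definitions above, not the evaluation harness used to produce the table; the function names are hypothetical, and embeddings are assumed to be lists of equal-length float vectors with non-zero norm.

```python
import math
from statistics import median


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def p50_cosine(reference, variant):
    """Median cosine between matching rows of two embedding sets."""
    return median(cosine(r, v) for r, v in zip(reference, variant))


def nearest(query, corpus):
    """Index of the corpus vector most similar to the query."""
    return max(range(len(corpus)), key=lambda i: cosine(query, corpus[i]))


def recall_at_1(ref_queries, ref_corpus, var_queries, var_corpus):
    """Fraction of queries whose top-1 neighbor matches the reference."""
    hits = sum(
        nearest(rq, ref_corpus) == nearest(vq, var_corpus)
        for rq, vq in zip(ref_queries, var_queries)
    )
    return hits / len(ref_queries)


# Toy 2-D vectors standing in for reference and quantized embeddings.
ref_corpus = [[1.0, 0.0], [0.0, 1.0]]
var_corpus = [[0.9, 0.1], [0.1, 0.9]]
ref_queries = [[1.0, 0.1], [0.2, 1.0]]
var_queries = [[1.0, 0.2], [0.3, 1.0]]

p50 = p50_cosine(ref_corpus, var_corpus)
r_at_1 = recall_at_1(ref_queries, ref_corpus, var_queries, var_corpus)
```

Running the same two functions on embeddings of your own data is a quick way to check whether the published numbers carry over to your workload.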
Notes:
- **ffn_skip**: dynamic int8 with the feed-forward (FFN) MatMul
layers kept in **fp32**, while attention and projection MatMuls
stay quantized. The FFN layers are where most of the quantization
error in transformer blocks concentrates; leaving them in fp32
recovers most of the quality loss for a modest size increase.
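The artifact is roughly **1.7x smaller than fp32** (an fp32 export of
595,776,512 parameters is about 2.4 GB) and larger than a fully
quantized per_channel int8 build, which is not shipped in this
repository. Use this variant when retrieval quality is the priority
and per_channel drift on your workload would be unacceptable.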
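
The size figures can be sanity-checked with simple arithmetic. The sketch below assumes every parameter is stored as a 4-byte float in an fp32 export and ignores graph metadata, so the fp32 figure is an estimate, not a measured file size.

```python
PARAMETERS = 595_776_512   # from the model details table
FFN_SKIP_MB = 1391.6       # from the variants table (MB = 10^6 bytes)
BYTES_PER_FP32 = 4         # assumption: all weights stored as fp32

# Estimated fp32 weight size in MB, and how it compares to ffn_skip.
fp32_mb = PARAMETERS * BYTES_PER_FP32 / 1e6
ratio = fp32_mb / FFN_SKIP_MB

print(f"estimated fp32 size: {fp32_mb:.1f} MB")
print(f"fp32 / ffn_skip:     {ratio:.2f}x")
```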
## Quickstart: using this model with Teradata BYOM
Requires Teradata Vantage with **BYOM 6+** (`mldb.ONNXEmbeddings`).
```python
import getpass
import teradataml as tdml
from huggingface_hub import hf_hub_download
repo_id = "Teradata/qwen3-embedding-0.6b"
model_id = "qwen3-embedding-0.6b" # arbitrary, used as the BYOM model_id
onnx_file = "onnx/model-ffn_skip.onnx"
# 1. Download the ONNX + tokenizer for the chosen variant.
hf_hub_download(repo_id=repo_id, filename=onnx_file, local_dir="./")
hf_hub_download(repo_id=repo_id, filename="tokenizer.json", local_dir="./")
# 2. Connect to Vantage.
tdml.create_context(
    host=input("host: "),
    username=input("user: "),
    password=getpass.getpass("password: "),
)
# 3. Load model + tokenizer into BYOM tables (one-time per model_id).
tdml.save_byom(model_id=model_id, model_file=onnx_file,
               table_name="embeddings_models")
tdml.save_byom(model_id=model_id, model_file="tokenizer.json",
               table_name="embeddings_tokenizers")
```
Then call `mldb.ONNXEmbeddings` against an input table whose
`txt` column carries the strings to embed:
```sql
SELECT *
FROM mldb.ONNXEmbeddings(
    ON (SELECT id, txt FROM your_input_table) AS InputTable
    ON (SELECT model_id, model FROM embeddings_models
        WHERE model_id = 'qwen3-embedding-0.6b') AS ModelTable DIMENSION
    ON (SELECT model_id, tokenizer FROM embeddings_tokenizers
        WHERE model_id = 'qwen3-embedding-0.6b') AS TokenizerTable DIMENSION
    USING
        Accumulate('id')
        ModelOutputTensor('sentence_embedding')
        OutputFormat('FLOAT32(1024)')
        OverwriteCachedModel('*')
) AS t
ORDER BY id;
```
Pooling rule **`last_token`** is applied **inside** the converted
ONNX graph -- the output tensor named above already contains the
pooled, post-processed embedding vector. Because this model expects an
instruction prefix, prepend the instruction text to each **query**
`txt` before calling `ONNXEmbeddings` and leave document rows
unprefixed; the prefix is plain text that the tokenizer handles
unchanged.
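
One way to do the prefixing in-database is to generate the `InputTable` subquery from Python. The sketch below is a rough illustration under several assumptions: `prefixed_input_sql` and `your_query_table` are hypothetical names, the table is assumed to have `id` and `txt` columns, the instruction template follows the upstream card, and you should verify that your SQL client preserves the newline inside the string literal.

```python
def prefixed_input_sql(task_description: str,
                       table: str = "your_query_table") -> str:
    """Build a subquery that prepends the instruction to each txt value.

    `table` must be a trusted identifier; it is interpolated as-is.
    """
    prefix = f"Instruct: {task_description}\nQuery:"
    # Escape single quotes for a SQL string literal by doubling them.
    literal = prefix.replace("'", "''")
    return f"SELECT id, '{literal}' || txt AS txt FROM {table}"


sql = prefixed_input_sql("Retrieve semantically similar text")
print(sql)
```

The resulting statement can take the place of the plain `SELECT id, txt FROM your_input_table` subquery for query rows, while document rows keep the unprefixed form.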
## Original model attribution
The original weights and training methodology belong to
**the Qwen team at Alibaba**. Please cite their work, not this
repository, in academic contexts. The canonical upstream model card
is at
[`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B);
refer to it for benchmarks, training details, intended use, and
citation information.
## Reporting issues
For ONNX-conversion or BYOM-compatibility issues specific to this
Teradata-converted artifact, please open a **Discussion** on this
model's Hugging Face page. Questions about the underlying model
quality, training, or intended use should go to the upstream
maintainer's model card.
----
DISCLAIMER: The content herein ("Content") is provided "AS IS" and is not covered by any Teradata Operations, Inc. and its affiliates ("Teradata") agreements. Its listing here does not constitute certification or endorsement by Teradata.
To the extent any of the Content contains or is related to any artificial intelligence ("AI") or other language learning models ("Models") that interoperate with the products and services of Teradata, by accessing, bringing, deploying or using such Models, you acknowledge and agree that you are solely responsible for ensuring compliance with all applicable laws, regulations, and restrictions governing the use, deployment, and distribution of AI technologies. This includes, but is not limited to, AI Diffusion Rules, European Union AI Act, AI-related laws and regulations, privacy laws, export controls, and financial or sector-specific regulations.
While Teradata may provide support, guidance, or assistance in the deployment or implementation of Models to interoperate with Teradata's products and/or services, you remain fully responsible for ensuring that your Models, data, and applications comply with all relevant legal and regulatory obligations. Our assistance does not constitute legal or regulatory approval, and Teradata disclaims any liability arising from non-compliance with applicable laws.
You must determine the suitability of the Models for any purpose. Given the probabilistic nature of machine learning and modeling, the use of the Models may in some situations result in incorrect output that does not accurately reflect the action generated. You should evaluate the accuracy of any output as appropriate for your use case, including by using human review of the output.