Initial publish via td-embeddings

Files changed (7)

README.md +232 -0
config.json +77 -0
onnx/model-ffn_skip.onnx +3 -0
onnx/model-fp32.onnx +3 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +55 -0

README.md ADDED

	@@ -0,0 +1,232 @@

+---
+license: apache-2.0
+language:
+  - en
+library_name: transformers
+pipeline_tag: feature-extraction
+base_model: nomic-ai/nomic-embed-text-v1.5
+tags:
+  - onnx
+  - teradata
+  - byom
+  - embeddings
+  - feature-extraction
+---
+> Read the disclaimer below before using this model.
+----
+# nomic-embed-text-v1.5 -- ONNX for Teradata BYOM
+This repository hosts an **ONNX-converted** version of the upstream
+model [`nomic-ai/nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5),
+packaged for the Teradata Vantage `mldb.ONNXEmbeddings` BYOM
+function. It is **not** the original PyTorch model -- only the
+inference graph and tokenizer needed for in-database embedding
+generation.
+What's different from upstream:
+- **Format**: ONNX (opset 14, IR version 8 -- BYOM 6+ compatible),
+  produced from the upstream weights with architecture-aware
+  post-processing baked in.
+- **Precision**: dynamic int8 quantization. See the variants table
+  below for what is shipped for this model.
+- **Pooling and post-processing**: **mean** pooling is applied inside
+  the graph, so the `sentence_embedding` output tensor is already the
+  pooled sentence vector. The model expects an instruction prefix on
+  every input (see "Instruction prefix" below).
+- **Verification**: every variant's cosine fidelity vs. the
+  upstream PyTorch reference is recorded on a fixed
+  FLORES-200 sample. Numbers may not generalize
+  to your data.
+## Model details
+| | |
+|---|---|
+| Upstream repo | [`nomic-ai/nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) |
+| Architecture | `NomicBertModel` (encoder) |
+| Parameters | 136,731,648 |
+| Output dimensions | 768 |
+| Pooling | `mean` |
+| Instruction prefix | yes |
+| Max input tokens (native / advertised) | 2048 / 8192 |
+| Languages | 1 |
+| License | apache-2.0 |
+| ONNX opset | 14 |
+| ONNX IR version | 8 (BYOM 6+ compatible) |
+<details>
+<summary>Full language list (1)</summary>
+- `en`
+</details>
+### Instruction prefix
+This model was trained with two **fixed literal prefixes** that must
+be prepended to the raw text before encoding. Unlike free-form
+instruction-tuned models, the prefix wording is not customisable --
+the model only understands these specific tokens. The ONNX graph
+itself is prefix-agnostic; downstream BYOM SQL is responsible for
+prepending the prefix to each input row (typically with a CTE that
+concatenates the prefix string with the input text).
+Use the following prefixes (snapshot at publish time -- see the
+upstream model card for any updates):
+- `search_query: ` -- for query-side text
+- `search_document: ` -- for document / passage-side text
+**Both sides of a retrieval pair must be prefixed**: prepend
+`search_query: ` to user queries and `search_document: ` to the
+indexed passages. Omitting the prefix degrades retrieval quality
+materially. See
+[`nomic-ai/nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) for the
+canonical guidance.
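+Example SQL (prepend the prefix at query time via a CTE; the table
+names match the Quickstart section):
+```sql
+WITH prefixed_queries AS (
+  SELECT id,
+         -- ONNXEmbeddings reads the strings to embed from the `txt` column.
+         'search_query: ' || query_text AS txt
+  FROM   my_query_table
+)
+SELECT *
+FROM   mldb.ONNXEmbeddings(
+         ON prefixed_queries AS InputTable
+         ON (SELECT model_id, model FROM embeddings_models
+              WHERE model_id = 'nomic-embed-text-v1.5') AS ModelTable DIMENSION
+         ON (SELECT model_id, model AS tokenizer FROM embeddings_tokenizers
+              WHERE model_id = 'nomic-embed-text-v1.5') AS TokenizerTable DIMENSION
+         USING
+           Accumulate('id')
+           ModelOutputTensor('sentence_embedding')
+           OutputFormat('FLOAT32(768)')
+       ) AS s;
+```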
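The prefixing rule from the "Instruction prefix" section can also be applied client-side, before rows are loaded into the input table. A minimal Python sketch follows; the helper name `add_prefix` and the `PREFIXES` mapping are illustrative and not part of BYOM or the upstream model. Only the two prefix strings are taken from this README.

```python
# Illustrative helper (hypothetical names); only the prefix strings come from the README.
PREFIXES = {
    "query": "search_query: ",
    "document": "search_document: ",
}


def add_prefix(text: str, side: str) -> str:
    """Prepend the fixed prefix for the given side ('query' or 'document')."""
    try:
        prefix = PREFIXES[side]
    except KeyError:
        raise ValueError(f"side must be one of {sorted(PREFIXES)}, got {side!r}") from None
    # Avoid double-prefixing text that already carries the expected prefix.
    return text if text.startswith(prefix) else prefix + text


if __name__ == "__main__":
    print(add_prefix("what is BYOM?", "query"))
    print(add_prefix("BYOM lets you score models in-database.", "document"))
```

Prefixing in one place, either in SQL or in the client, makes it harder to apply the prefix twice or to mix up the query and document sides.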
+## Quantization variants
+This repository ships the following variants. Quality numbers are
+measured against the upstream PyTorch reference on a fixed
+FLORES-200 sample. The **Size** column is the
+on-disk size of the ONNX weight file in megabytes (MB, 10^6 bytes).
+| Variant | Size (MB) | p50 cosine | R@1 |
+|---|---|---|---|
+| `fp32` | 547.8 | 1.000000 | — |
+| `ffn_skip` | 414.2 | 0.991608 | 0.851 |
+How to read the quality columns:
+- **p50 cosine** is the median cosine similarity between this
+  variant's embeddings and the fp32 ONNX reference, computed over
+  a fixed evaluation set. Higher means closer to the unquantized
+  model; **1.0** is identical.
+- **R@1** is top-1 retrieval consistency: if you use this variant
+  as a search index, R@1 is the fraction of queries that get the
+  same nearest neighbor as the fp32 reference would. Higher is
+  better.
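The two quality columns can be illustrated with a small pure-Python sketch. It assumes row-aligned lists of embedding vectors for the variant and the reference. The function names and the toy vectors are hypothetical, and the exact evaluation protocol behind the published numbers is not described in this README.

```python
import math
from statistics import median


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def p50_cosine(variant, reference):
    """Median cosine similarity between row-aligned embeddings."""
    return median(cosine(v, r) for v, r in zip(variant, reference))


def nearest(query, docs):
    """Index of the document with the highest cosine similarity to the query."""
    return max(range(len(docs)), key=lambda i: cosine(query, docs[i]))


def recall_at_1(var_queries, var_docs, ref_queries, ref_docs):
    """Fraction of queries whose top-1 neighbour matches the reference's top-1."""
    hits = sum(
        nearest(vq, var_docs) == nearest(rq, ref_docs)
        for vq, rq in zip(var_queries, ref_queries)
    )
    return hits / len(ref_queries)


# Toy data (made up for illustration, not taken from the evaluation set).
ref_docs = [[1.0, 0.0], [0.0, 1.0]]
var_docs = [[0.9, 0.1], [0.1, 0.9]]
ref_queries = [[1.0, 0.2], [0.2, 1.0]]
var_queries = [[1.0, 0.3], [0.6, 0.5]]

print("p50 cosine (docs):", p50_cosine(var_docs, ref_docs))
print("R@1:", recall_at_1(var_queries, var_docs, ref_queries, ref_docs))
```

Note that R@1 as defined here measures agreement with the reference model's ranking, not agreement with human relevance labels.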
+Notes:
+- **fp32**: full-precision reference. Useful as an accuracy ceiling,
+  but BYOM users will usually prefer a quantized variant for
+  in-database scoring because it is smaller and faster to load.
+- **ffn_skip**: dynamic int8 with the feed-forward (FFN) MatMul
+  layers kept in **fp32**, while attention and projection MatMuls
+  stay quantized. The FFN layers are where most of the quantization
+  error in transformer blocks concentrates; leaving them in fp32
+  recovers most of the quality loss for a modest size increase.
+  The artifact is about **24% smaller than fp32** (414.2 MB vs.
+  547.8 MB). No fully quantized `per_channel` int8 variant is
+  shipped in this repository.
+## Quickstart: using this model with Teradata BYOM
+Requires Teradata Vantage with **BYOM 6+** (`mldb.ONNXEmbeddings`).
+```python
+import getpass
+import teradataml as tdml
+from huggingface_hub import hf_hub_download
+repo_id   = "Teradata/nomic-embed-text-v1.5"
+model_id  = "nomic-embed-text-v1.5"        # arbitrary, used as the BYOM model_id
+onnx_file = "onnx/model-ffn_skip.onnx"
+# 1. Download the ONNX + tokenizer for the chosen variant.
+hf_hub_download(repo_id=repo_id, filename=onnx_file,       local_dir="./")
+hf_hub_download(repo_id=repo_id, filename="tokenizer.json", local_dir="./")
+# 2. Connect to Vantage.
+tdml.create_context(
+    host=input("host: "),
+    username=input("user: "),
+    password=getpass.getpass("password: "),
+)
+# 3. Load model + tokenizer into BYOM tables (one-time per model_id).
+tdml.save_byom(model_id=model_id, model_file=onnx_file,
+               table_name="embeddings_models")
+tdml.save_byom(model_id=model_id, model_file="tokenizer.json",
+               table_name="embeddings_tokenizers")
+```
+Then call `mldb.ONNXEmbeddings` against an input table whose
+`txt` column carries the strings to embed:
+```sql
+SELECT *
+FROM mldb.ONNXEmbeddings(
+    ON (SELECT id, txt FROM your_input_table) AS InputTable
+    ON (SELECT model_id, model FROM embeddings_models
+         WHERE model_id = 'nomic-embed-text-v1.5') AS ModelTable DIMENSION
+    -- save_byom stores the uploaded file in the `model` column, so alias it.
+    ON (SELECT model_id, model AS tokenizer FROM embeddings_tokenizers
+         WHERE model_id = 'nomic-embed-text-v1.5') AS TokenizerTable DIMENSION
+    USING
+        Accumulate('id')
+        ModelOutputTensor('sentence_embedding')
+        OutputFormat('FLOAT32(768)')
+        OverwriteCachedModel('*')
+) AS t
+ORDER BY id;
+```
+Pooling rule **`mean`** is applied **inside** the converted
+ONNX graph -- the output tensor named above already contains the
+pooled, post-processed embedding vector. Because this model uses
+instruction prefixes, prepend `search_query: ` or `search_document: `
+to each input `txt` before calling `ONNXEmbeddings`; the prefix is
+plain text that the tokenizer handles like any other input.
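For readers unfamiliar with the pooling rule, here is a minimal sketch of masked mean pooling in plain Python. It only illustrates the idea: the function name `mean_pool` is hypothetical, the converted graph performs this step internally, and the README does not say whether any further post-processing (such as normalization) follows.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / count for s in summed]


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```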
+## Original model attribution
+The original weights and training methodology belong to
+**Nomic AI**. Please cite their work, not this
+repository, in academic contexts. The canonical upstream model card
+is at
+[`nomic-ai/nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5);
+refer to it for benchmarks, training details, intended use, and
+citation information.
+## Reporting issues
+For ONNX-conversion or BYOM-compatibility issues specific to this
+Teradata-converted artifact, please open a **Discussion** on this
+model's Hugging Face page. Questions about the underlying model
+quality, training, or intended use should go to the upstream
+maintainer's model card.
+----
+DISCLAIMER: The content herein ("Content") is provided "AS IS" and is not covered by any Teradata Operations, Inc. and its affiliates ("Teradata") agreements. Its listing here does not constitute certification or endorsement by Teradata.
+To the extent any of the Content contains or is related to any artificial intelligence ("AI") or other language learning models ("Models") that interoperate with the products and services of Teradata, by accessing, bringing, deploying or using such Models, you acknowledge and agree that you are solely responsible for ensuring compliance with all applicable laws, regulations, and restrictions governing the use, deployment, and distribution of AI technologies. This includes, but is not limited to, AI Diffusion Rules, European Union AI Act, AI-related laws and regulations, privacy laws, export controls, and financial or sector-specific regulations.
+While Teradata may provide support, guidance, or assistance in the deployment or implementation of Models to interoperate with Teradata's products and/or services, you remain fully responsible for ensuring that your Models, data, and applications comply with all relevant legal and regulatory obligations. Our assistance does not constitute legal or regulatory approval, and Teradata disclaims any liability arising from non-compliance with applicable laws.
+You must determine the suitability of the Models for any purpose. Given the probabilistic nature of machine learning and modeling, the use of the Models may in some situations result in incorrect output that does not accurately reflect the action generated. You should evaluate the accuracy of any output as appropriate for your use case, including by using human review of the output.

config.json ADDED

	@@ -0,0 +1,77 @@

+{
+  "activation_function": "swiglu",
+  "architectures": [
+    "NomicBertModel"
+  ],
+  "attention_probs_dropout_prob": 0.0,
+  "attn_pdrop": 0.0,
+  "auto_map": {
+    "AutoConfig": "nomic-ai/nomic-bert-2048--configuration_hf_nomic_bert.NomicBertConfig",
+    "AutoModel": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertModel",
+    "AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining",
+    "AutoModelForSequenceClassification": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForSequenceClassification",
+    "AutoModelForMultipleChoice": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForMultipleChoice",
+    "AutoModelForQuestionAnswering": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForQuestionAnswering",
+    "AutoModelForTokenClassification": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForTokenClassification"
+  },
+  "bos_token_id": null,
+  "causal": false,
+  "classifier_dropout": null,
+  "dense_seq_output": true,
+  "embd_pdrop": 0.0,
+  "eos_token_id": null,
+  "fused_bias_fc": true,
+  "fused_dropout_add_ln": true,
+  "head_dim": 64,
+  "hidden_act": "silu",
+  "hidden_dropout_prob": 0.0,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_epsilon": 1e-12,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 2048,
+  "max_trained_positions": 2048,
+  "mlp_fc1_bias": false,
+  "mlp_fc2_bias": false,
+  "model_type": "nomic_bert",
+  "n_embd": 768,
+  "n_head": 12,
+  "n_inner": 3072,
+  "n_layer": 12,
+  "n_positions": 8192,
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "pad_vocab_size_multiple": 64,
+  "parallel_block": false,
+  "parallel_block_tied_norm": false,
+  "prenorm": false,
+  "qkv_proj_bias": false,
+  "reorder_and_upcast_attn": false,
+  "resid_pdrop": 0.0,
+  "rope_parameters": {
+    "rope_theta": 1000.0,
+    "rope_type": "default"
+  },
+  "rotary_emb_base": 1000,
+  "rotary_emb_fraction": 1.0,
+  "rotary_emb_interleaved": false,
+  "rotary_emb_scale_base": null,
+  "rotary_scaling_factor": null,
+  "scale_attn_by_inverse_layer_idx": false,
+  "scale_attn_weights": true,
+  "summary_activation": null,
+  "summary_first_dropout": 0.0,
+  "summary_proj_to_labels": true,
+  "summary_type": "cls_index",
+  "summary_use_proj": true,
+  "torch_dtype": "float32",
+  "transformers_version": "5.3.0.dev0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "use_flash_attn": true,
+  "use_rms_norm": false,
+  "use_xentropy": true,
+  "vocab_size": 30528
+}
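This config stores several quantities under two names (for example `hidden_size` and `n_embd`). The sketch below is a hedged sanity check over a hand-copied subset of the fields. The `ALIASES` pairs are an assumption based only on the values matching in this file, and `check_config` is a hypothetical helper, not part of `transformers`.

```python
# Subset of config.json, copied by hand for illustration.
CONFIG = {
    "hidden_size": 768,
    "n_embd": 768,
    "num_attention_heads": 12,
    "n_head": 12,
    "head_dim": 64,
    "num_hidden_layers": 12,
    "n_layer": 12,
    "intermediate_size": 3072,
    "n_inner": 3072,
    "max_position_embeddings": 2048,
    "max_trained_positions": 2048,
    "n_positions": 8192,
}

# Pairs assumed to describe the same quantity (an assumption, see lead-in).
ALIASES = [
    ("hidden_size", "n_embd"),
    ("num_attention_heads", "n_head"),
    ("num_hidden_layers", "n_layer"),
    ("intermediate_size", "n_inner"),
    ("max_position_embeddings", "max_trained_positions"),
]


def check_config(cfg):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = [
        f"{a}={cfg[a]} != {b}={cfg[b]}" for a, b in ALIASES if cfg[a] != cfg[b]
    ]
    if cfg["head_dim"] * cfg["n_head"] != cfg["hidden_size"]:
        problems.append("head_dim * n_head != hidden_size")
    return problems


print(check_config(CONFIG) or "no inconsistencies found")
print("native / advertised context:",
      CONFIG["max_position_embeddings"], "/", CONFIG["n_positions"])
```

The last line prints the two values that the README's model-details table reports as "native / advertised" maximum input tokens.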

onnx/model-ffn_skip.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f22476412ebd4237cf35dd70200eacabe461ed4caf3e22495c0e78eae057317b
+size 414200091

onnx/model-fp32.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:02f4a06ad4826e945578302f4d6f567b81aaa2d05f5fed0827983ea02f1ea71c
+size 547759252
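The two Git LFS pointers give the exact byte sizes behind the README's **Size (MB)** column. The sketch below parses the pointer text and derives the megabyte values (10^6 bytes) and the relative size reduction. The helper name `parse_lfs_pointer` is illustrative; the pointer contents are copied from this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


FP32_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:02f4a06ad4826e945578302f4d6f567b81aaa2d05f5fed0827983ea02f1ea71c
size 547759252
"""
FFN_SKIP_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:f22476412ebd4237cf35dd70200eacabe461ed4caf3e22495c0e78eae057317b
size 414200091
"""

fp32_bytes = int(parse_lfs_pointer(FP32_POINTER)["size"])
ffn_bytes = int(parse_lfs_pointer(FFN_SKIP_POINTER)["size"])

fp32_mb = fp32_bytes / 1e6   # MB as defined in the README: 10^6 bytes
ffn_mb = ffn_bytes / 1e6
reduction = 1 - ffn_bytes / fp32_bytes

print(f"fp32: {fp32_mb:.1f} MB, ffn_skip: {ffn_mb:.1f} MB")
print(f"ffn_skip is {reduction:.1%} smaller than fp32")
```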

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,55 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "model_max_length": 8192,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}