sasha-smirnov committed
Commit 6644dbf · verified
1 Parent(s): 9b5bd7e

Initial publish via td-embeddings

README.md ADDED
@@ -0,0 +1,232 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ library_name: transformers
+ pipeline_tag: feature-extraction
+ base_model: nomic-ai/nomic-embed-text-v1.5
+ tags:
+ - onnx
+ - teradata
+ - byom
+ - embeddings
+ - feature-extraction
+ ---
+
+
+
+ > Read the disclaimer below before using this model.
+
+ ----
+
+ # nomic-embed-text-v1.5 -- ONNX for Teradata BYOM
+
+ This repository hosts an **ONNX-converted** version of the upstream
+ model [`nomic-ai/nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5),
+ packaged for the Teradata Vantage `mldb.ONNXEmbeddings` BYOM
+ function. It is **not** the original PyTorch model -- only the
+ inference graph and tokenizer needed for in-database embedding
+ generation.
+
+ What's different from upstream:
+
+ - **Format**: ONNX (opset 14, IR version 8 -- BYOM 6+ compatible),
+   produced from the upstream weights with architecture-aware
+   post-processing baked in.
+ - **Precision**: dynamic int8 quantization. See the variants table
+   below for what is shipped for this model.
+ - **Pooling and post-processing**: **mean** pooling is applied inside
+   the graph, so the `sentence_embedding` output tensor is already
+   pooled. The model expects an instruction prefix on both query and
+   document text (see "Instruction prefix" below).
+ - **Verification**: every variant's cosine fidelity vs. the
+   upstream PyTorch reference is recorded on a fixed
+   FLORES-200 sample. Numbers may not generalize
+   to your data.
+
+ ## Model details
+
+ | Property | Value |
+ |---|---|
+ | Upstream repo | [`nomic-ai/nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) |
+ | Architecture | `NomicBertModel` (encoder) |
+ | Parameters | 136,731,648 |
+ | Output dimensions | 768 |
+ | Pooling | `mean` |
+ | Instruction prefix | yes |
+ | Max input tokens (native / advertised) | 2048 / 8192 |
+ | Languages | 1 |
+ | License | apache-2.0 |
+ | ONNX opset | 14 |
+ | ONNX IR version | 8 (BYOM 6+ compatible) |
+
+ <details>
+ <summary>Full language list (1)</summary>
+
+ - `en`
+
+ </details>
+
+ ### Instruction prefix
+
+ This model was trained with **fixed literal prefixes** that must
+ be prepended to the raw text before encoding. Unlike free-form
+ instruction-tuned models, the prefix wording is not customisable --
+ the model only recognizes these exact strings. The ONNX graph
+ itself is prefix-agnostic; downstream BYOM SQL is responsible for
+ prepending the prefix to each input row (typically with a CTE that
+ concatenates the prefix string with the input text).
+
+ For retrieval, use the following two prefixes (snapshot at publish
+ time -- see the upstream model card for any updates and for the
+ `clustering: ` and `classification: ` task prefixes):
+
+ - `search_query: ` -- for query-side text
+ - `search_document: ` -- for document / passage-side text
+
+ **Both sides of a retrieval pair must be prefixed**: prepend
+ `search_query: ` to user queries and `search_document: ` to the
+ indexed passages. Omitting the prefix degrades retrieval quality
+ materially. See
+ [`nomic-ai/nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) for the
+ canonical guidance.
+
+ Example SQL (prepend the prefix at query time via a CTE):
+
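+ ```sql
+ -- Prepend the query-side prefix; the input text column is named txt,
+ -- matching the Quickstart below.
+ WITH prefixed_queries AS (
+     SELECT id,
+            'search_query: ' || query_text AS txt
+     FROM my_query_table
+ )
+ SELECT *
+ FROM mldb.ONNXEmbeddings(
+     ON prefixed_queries AS InputTable
+     ON (SELECT model_id, model FROM embeddings_models
+         WHERE model_id = 'nomic-embed-text-v1.5') AS ModelTable DIMENSION
+     -- save_byom stores the file in a column named model; alias it as tokenizer.
+     ON (SELECT model_id, model AS tokenizer FROM embeddings_tokenizers
+         WHERE model_id = 'nomic-embed-text-v1.5') AS TokenizerTable DIMENSION
+     USING
+         Accumulate('id')
+         ModelOutputTensor('sentence_embedding')
+ ) AS s;
+ ```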
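The same prefixing can be done client-side before rows are loaded into Vantage. A minimal Python sketch, assuming only the two retrieval prefixes listed above; the helper name `add_prefix` and its `role` argument are hypothetical and not part of any Teradata or Nomic API:

```python
# Retrieval prefixes as listed in this README (snapshot at publish time).
PREFIXES = {
    "query": "search_query: ",
    "document": "search_document: ",
}


def add_prefix(text: str, role: str) -> str:
    """Return `text` with the fixed prefix for `role` ("query" or "document")."""
    if role not in PREFIXES:
        raise ValueError(f"unknown role {role!r}; expected one of {sorted(PREFIXES)}")
    return PREFIXES[role] + text


# Queries and indexed passages get different prefixes.
queries = [add_prefix(q, "query") for q in ["what is BYOM?"]]
passages = [add_prefix(p, "document") for p in ["BYOM scores models in-database."]]
```

Rejecting unknown roles keeps a typo from silently producing unprefixed or wrongly prefixed text.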
+
+ ## Quantization variants
+
+ This repository ships the following variants. Quality numbers are
+ measured against the upstream PyTorch reference on a fixed
+ FLORES-200 sample. The **Size** column is the
+ on-disk size of the ONNX weight file in megabytes (MB, 10^6 bytes).
+
+ | Variant | Size (MB) | p50 cosine | R@1 |
+ |---|---|---|---|
+ | `fp32` | 547.8 | 1.000000 | n/a |
+ | `ffn_skip` | 414.2 | 0.991608 | 0.851 |
+
+
+ How to read the quality columns:
+
+ - **p50 cosine** is the median cosine similarity between this
+   variant's embeddings and the fp32 ONNX reference, computed over
+   a fixed evaluation set. Higher means closer to the unquantized
+   model; **1.0** is identical.
+ - **R@1** is top-1 retrieval consistency: if you use this variant
+   as a search index, R@1 is the fraction of queries that get the
+   same nearest neighbor as the fp32 reference would. Higher is
+   better.
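The two quality columns can be sketched in plain Python as follows. This is an illustration of the definitions above, not the evaluation code used to produce the table: the function names are hypothetical, embeddings are plain lists of floats, and nearest neighbors are assumed to be ranked by cosine similarity.

```python
import math
from statistics import median


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def p50_cosine(reference, variant):
    """Median of the per-row cosine similarity between paired embeddings."""
    return median(cosine(r, v) for r, v in zip(reference, variant))


def nearest(query, docs):
    """Index of the document with the highest cosine similarity to the query."""
    return max(range(len(docs)), key=lambda i: cosine(query, docs[i]))


def top1_consistency(ref_queries, ref_docs, var_queries, var_docs):
    """Fraction of queries whose top-1 document is the same under both models."""
    same = sum(
        nearest(rq, ref_docs) == nearest(vq, var_docs)
        for rq, vq in zip(ref_queries, var_queries)
    )
    return same / len(ref_queries)
```

Running both functions on a sample of your own data gives a more relevant estimate than the published numbers, which come from a fixed evaluation sample.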
+
+ Notes:
+ - **fp32**: full-precision reference. Useful as an accuracy ceiling,
+   but BYOM users usually want an int8 variant for in-database
+   scoring because it is smaller and loads faster.
+ - **ffn_skip**: dynamic int8 with the feed-forward (FFN) MatMul
+   layers kept in **fp32**, while attention and projection MatMuls
+   stay quantized. The FFN layers are where most of the quantization
+   error in transformer blocks concentrates; leaving them in fp32
+   recovers most of the quality loss at the cost of a larger file.
+   At 414.2 MB the artifact is about **24% smaller than fp32**
+   (547.8 MB) and larger than a fully quantized `per_channel` int8
+   variant would be; no such variant is shipped in this repository.
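The size relationship between the two shipped variants can be checked directly from the byte counts in the Git LFS pointers included in this commit. A short sketch; the variable and function names are illustrative only:

```python
# Byte sizes from the Git LFS pointers of the two ONNX files in this commit.
FP32_BYTES = 547_759_252
FFN_SKIP_BYTES = 414_200_091


def to_mb(n_bytes: int) -> float:
    """Convert bytes to megabytes, using MB = 10^6 bytes as in the table."""
    return n_bytes / 1e6


ratio = FP32_BYTES / FFN_SKIP_BYTES                      # how many times larger fp32 is
reduction_pct = 100 * (1 - FFN_SKIP_BYTES / FP32_BYTES)  # size saved vs. fp32, in percent

print(f"fp32:     {to_mb(FP32_BYTES):.1f} MB")
print(f"ffn_skip: {to_mb(FFN_SKIP_BYTES):.1f} MB")
print(f"fp32 is {ratio:.2f}x the size of ffn_skip ({reduction_pct:.1f}% reduction)")
```

The rounded MB values match the Size column of the variants table.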
+
+ ## Quickstart: using this model with Teradata BYOM
+
+ Requires Teradata Vantage with **BYOM 6+** (`mldb.ONNXEmbeddings`).
+
+ ```python
+ import getpass
+ import teradataml as tdml
+ from huggingface_hub import hf_hub_download
+
+ repo_id = "Teradata/nomic-embed-text-v1.5"
+ model_id = "nomic-embed-text-v1.5"  # arbitrary, used as the BYOM model_id
+ onnx_file = "onnx/model-ffn_skip.onnx"
+
+ # 1. Download the ONNX + tokenizer for the chosen variant.
+ hf_hub_download(repo_id=repo_id, filename=onnx_file, local_dir="./")
+ hf_hub_download(repo_id=repo_id, filename="tokenizer.json", local_dir="./")
+
+ # 2. Connect to Vantage.
+ tdml.create_context(
+     host=input("host: "),
+     username=input("user: "),
+     password=getpass.getpass("password: "),
+ )
+
+ # 3. Load model + tokenizer into BYOM tables (one-time per model_id).
+ tdml.save_byom(model_id=model_id, model_file=onnx_file,
+                table_name="embeddings_models")
+ tdml.save_byom(model_id=model_id, model_file="tokenizer.json",
+                table_name="embeddings_tokenizers")
+ ```
+
+ Then call `mldb.ONNXEmbeddings` against an input table whose
+ `txt` column carries the strings to embed:
+
+ ```sql
+ SELECT *
+ FROM mldb.ONNXEmbeddings(
+     ON (SELECT id, txt FROM your_input_table) AS InputTable
+     ON (SELECT model_id, model FROM embeddings_models
+         WHERE model_id = 'nomic-embed-text-v1.5') AS ModelTable DIMENSION
+     -- save_byom stores the file in a column named model; alias it as tokenizer.
+     ON (SELECT model_id, model AS tokenizer FROM embeddings_tokenizers
+         WHERE model_id = 'nomic-embed-text-v1.5') AS TokenizerTable DIMENSION
+     USING
+         Accumulate('id')
+         ModelOutputTensor('sentence_embedding')
+         OutputFormat('FLOAT32(768)')
+         OverwriteCachedModel('*')
+ ) AS t
+ ORDER BY id;
+ ```
+
+ Pooling rule **`mean`** is applied **inside** the converted
+ ONNX graph -- the output tensor named above already contains the
+ pooled, post-processed embedding vector. Because this model expects
+ an instruction prefix, prepend `search_query: ` or
+ `search_document: ` to each input `txt` before calling
+ `ONNXEmbeddings`; the prefix is plain text that the tokenizer handles
+ unchanged.
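For readers unfamiliar with the pooling rule, the sketch below shows what masked mean pooling computes for a single input. It is illustrative only: the converted graph does this internally, the function name is hypothetical, and any further post-processing the graph may apply is not reproduced here.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1.

    token_embeddings: list of per-token vectors (lists of floats).
    attention_mask:   list of 0/1 flags, 0 marking padding tokens.
    """
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three token vectors; the last one is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
```

Masking matters because padding tokens would otherwise pull the average toward meaningless values.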
+
+ ## Original model attribution
+
+ The original weights and training methodology belong to
+ **Nomic AI**. Please cite their work, not this
+ repository, in academic contexts. The canonical upstream model card
+ is at
+ [`nomic-ai/nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5);
+ refer to it for benchmarks, training details, intended use, and
+ citation information.
+
+ ## Reporting issues
+
+ For ONNX-conversion or BYOM-compatibility issues specific to this
+ Teradata-converted artifact, please open a **Discussion** on this
+ model's Hugging Face page. Questions about the underlying model
+ quality, training, or intended use should go to the upstream
+ maintainer's model card.
+
+ ----
+
+ DISCLAIMER: The content herein ("Content") is provided "AS IS" and is not covered by any Teradata Operations, Inc. and its affiliates ("Teradata") agreements. Its listing here does not constitute certification or endorsement by Teradata.
+
+ To the extent any of the Content contains or is related to any artificial intelligence ("AI") or other language learning models ("Models") that interoperate with the products and services of Teradata, by accessing, bringing, deploying or using such Models, you acknowledge and agree that you are solely responsible for ensuring compliance with all applicable laws, regulations, and restrictions governing the use, deployment, and distribution of AI technologies. This includes, but is not limited to, AI Diffusion Rules, European Union AI Act, AI-related laws and regulations, privacy laws, export controls, and financial or sector-specific regulations.
+
+ While Teradata may provide support, guidance, or assistance in the deployment or implementation of Models to interoperate with Teradata's products and/or services, you remain fully responsible for ensuring that your Models, data, and applications comply with all relevant legal and regulatory obligations. Our assistance does not constitute legal or regulatory approval, and Teradata disclaims any liability arising from non-compliance with applicable laws.
+
+ You must determine the suitability of the Models for any purpose. Given the probabilistic nature of machine learning and modeling, the use of the Models may in some situations result in incorrect output that does not accurately reflect the action generated. You should evaluate the accuracy of any output as appropriate for your use case, including by using human review of the output.
config.json ADDED
@@ -0,0 +1,77 @@
+ {
+   "activation_function": "swiglu",
+   "architectures": [
+     "NomicBertModel"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "attn_pdrop": 0.0,
+   "auto_map": {
+     "AutoConfig": "nomic-ai/nomic-bert-2048--configuration_hf_nomic_bert.NomicBertConfig",
+     "AutoModel": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertModel",
+     "AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining",
+     "AutoModelForSequenceClassification": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForSequenceClassification",
+     "AutoModelForMultipleChoice": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForMultipleChoice",
+     "AutoModelForQuestionAnswering": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForQuestionAnswering",
+     "AutoModelForTokenClassification": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForTokenClassification"
+   },
+   "bos_token_id": null,
+   "causal": false,
+   "classifier_dropout": null,
+   "dense_seq_output": true,
+   "embd_pdrop": 0.0,
+   "eos_token_id": null,
+   "fused_bias_fc": true,
+   "fused_dropout_add_ln": true,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_dropout_prob": 0.0,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_epsilon": 1e-12,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 2048,
+   "max_trained_positions": 2048,
+   "mlp_fc1_bias": false,
+   "mlp_fc2_bias": false,
+   "model_type": "nomic_bert",
+   "n_embd": 768,
+   "n_head": 12,
+   "n_inner": 3072,
+   "n_layer": 12,
+   "n_positions": 8192,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pad_vocab_size_multiple": 64,
+   "parallel_block": false,
+   "parallel_block_tied_norm": false,
+   "prenorm": false,
+   "qkv_proj_bias": false,
+   "reorder_and_upcast_attn": false,
+   "resid_pdrop": 0.0,
+   "rope_parameters": {
+     "rope_theta": 1000.0,
+     "rope_type": "default"
+   },
+   "rotary_emb_base": 1000,
+   "rotary_emb_fraction": 1.0,
+   "rotary_emb_interleaved": false,
+   "rotary_emb_scale_base": null,
+   "rotary_scaling_factor": null,
+   "scale_attn_by_inverse_layer_idx": false,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.0,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "torch_dtype": "float32",
+   "transformers_version": "5.3.0.dev0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "use_flash_attn": true,
+   "use_rms_norm": false,
+   "use_xentropy": true,
+   "vocab_size": 30528
+ }
onnx/model-ffn_skip.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f22476412ebd4237cf35dd70200eacabe461ed4caf3e22495c0e78eae057317b
+ size 414200091
onnx/model-fp32.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:02f4a06ad4826e945578302f4d6f567b81aaa2d05f5fed0827983ea02f1ea71c
+ size 547759252
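The two ONNX entries above are Git LFS pointers, which record the SHA-256 digest and byte size of the real file. A hedged sketch of how a downloaded file could be checked against such a pointer, using only the Python standard library; the helper names are hypothetical and the pointer text is assumed to be passed without the leading `+` diff markers:

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(pointer_text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in pointer_text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: Path, pointer_text: str) -> bool:
    """Return True if the file at `path` has the size and SHA-256 in the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    algo, _, expected_digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo!r}")
    if path.stat().st_size != int(fields["size"]):
        return False  # cheap size check before hashing a large file
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Hash in 1 MiB chunks so large model files are not read into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_digest


# Pointer for onnx/model-ffn_skip.onnx, copied from this commit.
FFN_SKIP_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f22476412ebd4237cf35dd70200eacabe461ed4caf3e22495c0e78eae057317b
size 414200091
"""
```

Calling `matches_pointer(Path("onnx/model-ffn_skip.onnx"), FFN_SKIP_POINTER)` after a download would then confirm the file is complete and unmodified, assuming that local path.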
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 8192,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }