sasha-smirnov committed (verified)
Commit 0218b87 · 1 parent: f19e9c4

Initial publish via td-embeddings

README.md ADDED
@@ -0,0 +1,230 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - code
5
+ library_name: transformers
6
+ pipeline_tag: feature-extraction
7
+ base_model: codesage/codesage-large-v2
8
+ tags:
9
+ - onnx
10
+ - teradata
11
+ - byom
12
+ - embeddings
13
+ - feature-extraction
14
+ ---
15
+
16
+
17
+
18
+ > Read the disclaimer below before using this model.
19
+
20
+ ----
21
+
22
+ # codesage-large-v2 -- ONNX for Teradata BYOM
23
+
24
+ This repository hosts an **ONNX-converted** version of the upstream
25
+ model [`codesage/codesage-large-v2`](https://huggingface.co/codesage/codesage-large-v2),
26
+ packaged for the Teradata Vantage `mldb.ONNXEmbeddings` BYOM
27
+ function. It is **not** the original PyTorch model -- only the
28
+ inference graph and tokenizer needed for in-database embedding
29
+ generation.
30
+
31
+ What's different from upstream:
32
+
33
+ - **Format**: ONNX (opset 14, IR version 8 -- BYOM 6+ compatible),
34
+ produced from the upstream weights with architecture-aware
35
+ post-processing baked in.
36
+ - **Precision**: dynamic int8 quantization. See the variants table
37
+ below for what is shipped for this model.
38
+ - **Pooling and post-processing**: this graph emits the raw
39
+ `sentence_embedding` tensor. Pooling rule is
40
+ **mean**.
41
+ - **Verification**: every variant's cosine fidelity vs. the
42
+ upstream PyTorch reference is recorded on a fixed
43
+ CodeSearchNet sample. Numbers may not generalize
44
+ to your data.
45
+
46
+ ## Model details
47
+
48
+ | Property | Value |
49
+ |---|---|
50
+ | Upstream repo | [`codesage/codesage-large-v2`](https://huggingface.co/codesage/codesage-large-v2) |
51
+ | Architecture | `CodeSage` (encoder) |
52
+ | Parameters | 1,313,464,320 |
53
+ | Output dimensions | 2048 |
54
+ | Pooling | `mean` |
55
+ | Instruction prefix | no |
56
+ | Max input tokens (advertised) | 2048 |
57
+ | Languages | 9 |
58
+ | License | apache-2.0 |
59
+ | ONNX opset | 14 |
60
+ | ONNX IR version | 8 (BYOM 6+ compatible) |
61
+
62
+ <details>
63
+ <summary>Full language list (9)</summary>
64
+
65
+ - `c`
66
+ - `c-sharp`
67
+ - `go`
68
+ - `java`
69
+ - `javascript`
70
+ - `typescript`
71
+ - `php`
72
+ - `python`
73
+ - `ruby`
74
+
75
+ </details>
76
+
77
+ ## Quantization variants
78
+
79
+ This repository ships the following variants. Quality numbers are
80
+ measured against the upstream PyTorch reference on a fixed
81
+ CodeSearchNet sample. The **Size** column is the
82
+ on-disk size of the ONNX weight file in megabytes (MB, 10^6 bytes).
83
+
84
+ | Variant | Size (MB) | p50 cosine | R@1 |
85
+ |---|---|---|---|
86
+ | `ffn_skip` | 1318.9 | 0.819499 | 0.919 |
87
+
88
+
89
+ How to read the quality columns:
90
+
91
+ - **p50 cosine** is the median cosine similarity between this
92
+ variant's embeddings and the fp32 ONNX reference, computed over
93
+ a fixed evaluation set. Higher means closer to the unquantized
94
+ model; **1.0** is identical.
95
+ - **R@1** is top-1 retrieval consistency: if you use this variant
96
+ as a search index, R@1 is the fraction of queries that get the
97
+ same nearest neighbor as the fp32 reference would. Higher is
98
+ better.
99
+
100
+ Notes:
101
+ - **ffn_skip**: dynamic int8 with the feed-forward (FFN) MatMul
102
+ layers kept in **fp32**, while attention and projection MatMuls
103
+ stay quantized. The FFN layers are where most of the quantization
104
+ error in transformer blocks concentrates; leaving them in fp32
105
+ recovers most of the quality loss for a modest size increase.
106
+ The artifact is roughly **3x smaller than fp32** (larger than the
107
+ per_channel int8 sibling).
108
+
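+ Both quality columns can be reproduced from two row-aligned embedding
+ matrices (variant vs. reference) with a few lines of NumPy. The sketch
+ below is illustrative -- it is not the exact evaluation harness behind
+ the table above, and the function names are ours:
+
+ ```python
+ import numpy as np
+
+ def p50_cosine(variant: np.ndarray, reference: np.ndarray) -> float:
+     """Median cosine similarity between row-aligned (n, d) embedding matrices."""
+     v = variant / np.linalg.norm(variant, axis=1, keepdims=True)
+     r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
+     return float(np.median(np.sum(v * r, axis=1)))
+
+ def recall_at_1(var_q, var_d, ref_q, ref_d) -> float:
+     """Fraction of queries whose nearest document matches the reference's pick."""
+     def top1(q, d):
+         q = q / np.linalg.norm(q, axis=1, keepdims=True)
+         d = d / np.linalg.norm(d, axis=1, keepdims=True)
+         return np.argmax(q @ d.T, axis=1)
+     return float(np.mean(top1(var_q, var_d) == top1(ref_q, ref_d)))
+ ```
+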
109
+ ## Quickstart: using this model with Teradata BYOM
110
+
111
+ Requires Teradata Vantage with **BYOM 6+** (`mldb.ONNXEmbeddings`).
112
+
113
+ ```python
114
+ import getpass
115
+ import teradataml as tdml
116
+ from huggingface_hub import hf_hub_download
117
+
118
+ repo_id = "Teradata/codesage-large-v2"
119
+ model_id = "codesage-large-v2" # arbitrary, used as the BYOM model_id
120
+ onnx_file = "onnx/model-ffn_skip.onnx"
121
+
122
+ # 1. Download the ONNX + tokenizer for the chosen variant.
123
+ hf_hub_download(repo_id=repo_id, filename=onnx_file, local_dir="./")
124
+ hf_hub_download(repo_id=repo_id, filename="tokenizer.json", local_dir="./")
125
+
126
+ # 2. Connect to Vantage.
127
+ tdml.create_context(
128
+ host=input("host: "),
129
+ username=input("user: "),
130
+ password=getpass.getpass("password: "),
131
+ )
132
+
133
+ # 3. Load model + tokenizer into BYOM tables (one-time per model_id).
134
+ tdml.save_byom(model_id=model_id, model_file=onnx_file,
135
+ table_name="embeddings_models")
136
+ tdml.save_byom(model_id=model_id, model_file="tokenizer.json",
137
+ table_name="embeddings_tokenizers")
138
+ ```
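+
+ Optionally, confirm both rows landed before moving on to SQL. A minimal
+ check -- assuming your teradataml release exposes `list_byom` (recent
+ versions do):
+
+ ```python
+ # Each call should report one row whose model_id is "codesage-large-v2".
+ tdml.list_byom(table_name="embeddings_models", model_id=model_id)
+ tdml.list_byom(table_name="embeddings_tokenizers", model_id=model_id)
+ ```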
139
+
140
+ Then call `mldb.ONNXEmbeddings` against an input table whose
+ `txt` column carries the strings to embed.
+
+ This model emits a **2048-dimensional** embedding, and Teradata's
+ wide-table output projection is capped at **2048 columns**, so the
+ `FLOAT32(2048) + Accumulate('id')` projection that smaller models use
+ does not fit here: 2048 embedding columns plus the accumulated `id`
+ column would already need 2049. Pick the SQL form that matches your
+ Vantage version:
148
+
149
+ **Option A -- `VARBYTE` (works on TD 17.20 and TD 20.0+)**
150
+
151
+ The vector lands as raw header-less float32 bytes
152
+ (`2048 * 4 = 8192` bytes),
153
+ which fits in a single `VARBYTE` column and dodges the 2048-column cap.
154
+ Decode on the client side or wrap the call in a UDF that returns
155
+ `FLOAT[]`.
156
+
157
+ ```sql
158
+ SELECT *
159
+ FROM mldb.ONNXEmbeddings(
160
+ ON (SELECT id, txt FROM your_input_table) AS InputTable
161
+ ON (SELECT model_id, model FROM embeddings_models
162
+ WHERE model_id = 'codesage-large-v2') AS ModelTable DIMENSION
163
+ ON (SELECT model_id, tokenizer FROM embeddings_tokenizers
164
+ WHERE model_id = 'codesage-large-v2') AS TokenizerTable DIMENSION
165
+ USING
166
+ Accumulate('id')
167
+ ModelOutputTensor('sentence_embedding')
168
+ OutputFormat('VARBYTE(8192)')
169
+ OverwriteCachedModel('*')
170
+ ) AS t
171
+ ORDER BY id;
172
+ ```
173
+
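+ If you stay on the `VARBYTE` path, the client-side decode mentioned
+ above is a single `np.frombuffer` per row. A rough sketch, assuming the
+ query result has been pulled into a pandas DataFrame and that the
+ embedding bytes arrive in a column named `sentence_embedding` (the
+ column name here is illustrative -- check the actual result schema):
+
+ ```python
+ import numpy as np
+ import pandas as pd
+
+ def decode_varbyte(df: pd.DataFrame, column: str = "sentence_embedding") -> np.ndarray:
+     """Stack raw float32 VARBYTE payloads into an (n, 2048) matrix."""
+     # Assumes little-endian float32 payloads; switch to dtype=">f4" if yours are big-endian.
+     return np.vstack([np.frombuffer(buf, dtype="<f4") for buf in df[column]])
+ ```
+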
174
+ **Option B -- `VECTOR` (TD 20.0+ only)**
175
+
176
+ Vantage 20.0 introduced a native `VECTOR` datatype that holds the
177
+ full embedding as a single typed column, with native vector-similarity
178
+ operators available on it.
179
+
180
+ ```sql
181
+ SELECT *
182
+ FROM mldb.ONNXEmbeddings(
183
+ ON (SELECT id, txt FROM your_input_table) AS InputTable
184
+ ON (SELECT model_id, model FROM embeddings_models
185
+ WHERE model_id = 'codesage-large-v2') AS ModelTable DIMENSION
186
+ ON (SELECT model_id, tokenizer FROM embeddings_tokenizers
187
+ WHERE model_id = 'codesage-large-v2') AS TokenizerTable DIMENSION
188
+ USING
189
+ Accumulate('id')
190
+ ModelOutputTensor('sentence_embedding')
191
+ OutputFormat('VECTOR')
192
+ OverwriteCachedModel('*')
193
+ ) AS t
194
+ ORDER BY id;
195
+ ```
196
+
197
+ Use `VECTOR` if your Vantage version supports it; otherwise fall back
198
+ to `VARBYTE`. Both forms emit the same underlying float32 values.
199
+
200
+ Pooling rule **`mean`** is applied **inside** the converted
201
+ ONNX graph -- the output tensor named above already contains the
202
+ pooled, post-processed embedding vector.
203
+
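+ To sanity-check the converted graph outside the database, you can run it
+ locally with `onnxruntime` and the bundled `tokenizer.json`. This is a
+ rough sketch, not part of the Teradata workflow: input names can vary
+ between exports (hence the introspection below), and the plain
+ `tokenizers` encoding may differ slightly from the upstream
+ `CodeSageTokenizer` behavior.
+
+ ```python
+ import numpy as np
+ import onnxruntime as ort
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_file("tokenizer.json")
+ sess = ort.InferenceSession("onnx/model-ffn_skip.onnx")
+
+ enc = tok.encode("def add(a, b):\n    return a + b")
+ ids = np.array([enc.ids], dtype=np.int64)
+ mask = np.ones_like(ids)
+
+ # Feed only the inputs the graph actually declares.
+ feeds = {}
+ for inp in sess.get_inputs():
+     if "mask" in inp.name:
+         feeds[inp.name] = mask
+     elif "input" in inp.name or "ids" in inp.name:
+         feeds[inp.name] = ids
+
+ (embedding,) = sess.run(["sentence_embedding"], feeds)
+ print(embedding.shape)  # expect (1, 2048): mean pooling already applied in-graph
+ ```
+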
204
+ ## Original model attribution
205
+
206
+ The original weights and training methodology belong to
207
+ **the CodeSage authors**. Please cite their work, not this
208
+ repository, in academic contexts. The canonical upstream model card
209
+ is at
210
+ [`codesage/codesage-large-v2`](https://huggingface.co/codesage/codesage-large-v2);
211
+ refer to it for benchmarks, training details, intended use, and
212
+ citation information.
213
+
214
+ ## Reporting issues
215
+
216
+ For ONNX-conversion or BYOM-compatibility issues specific to this
217
+ Teradata-converted artifact, please open a **Discussion** on this
218
+ model's Hugging Face page. Questions about the underlying model
219
+ quality, training, or intended use should go to the upstream
220
+ maintainer's model card.
221
+
222
+ ----
223
+
224
+ DISCLAIMER: The content herein ("Content") is provided "AS IS" and is not covered by any Teradata Operations, Inc. and its affiliates ("Teradata") agreements. Its listing here does not constitute certification or endorsement by Teradata.
225
+
226
+ To the extent any of the Content contains or is related to any artificial intelligence ("AI") or other language learning models ("Models") that interoperate with the products and services of Teradata, by accessing, bringing, deploying or using such Models, you acknowledge and agree that you are solely responsible for ensuring compliance with all applicable laws, regulations, and restrictions governing the use, deployment, and distribution of AI technologies. This includes, but is not limited to, AI Diffusion Rules, European Union AI Act, AI-related laws and regulations, privacy laws, export controls, and financial or sector-specific regulations.
227
+
228
+ While Teradata may provide support, guidance, or assistance in the deployment or implementation of Models to interoperate with Teradata's products and/or services, you remain fully responsible for ensuring that your Models, data, and applications comply with all relevant legal and regulatory obligations. Our assistance does not constitute legal or regulatory approval, and Teradata disclaims any liability arising from non-compliance with applicable laws.
229
+
230
+ You must determine the suitability of the Models for any purpose. Given the probabilistic nature of machine learning and modeling, the use of the Models may in some situations result in incorrect output that does not accurately reflect the action generated. You should evaluate the accuracy of any output as appropriate for your use case, including by using human review of the output.
config.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "_name_or_path": "codesage/codesage-large-v2",
3
+ "architectures": [
4
+ "CodeSage"
5
+ ],
6
+ "auto_map": {
7
+ "AutoConfig": "config_codesage.CodeSageConfig",
8
+ "AutoTokenizer": "tokenization_codesage.CodeSageTokenizer",
9
+ "AutoModel": "modeling_codesage.CodeSageModel",
10
+ "AutoModelForMaskedLM": "modeling_codesage.CodeSageForMaskedLM",
11
+ "AutoModelForSequenceClassification": "modeling_codesage.CodeSageForSequenceClassification"
12
+ },
13
+ "activation_function": "gelu_new",
14
+ "attention_dropout_prob": 0.1,
15
+ "embedding_dropout_prob": 0.1,
16
+ "initializer_range": 0.02,
17
+ "layer_norm_epsilon": 1e-05,
18
+ "hidden_size": 2048,
19
+ "num_attention_heads": 16,
20
+ "num_hidden_layers": 24,
21
+ "intermediate_size": 8192,
22
+ "max_position_embeddings": 2048,
23
+ "residual_dropout_prob": 0.1,
24
+ "vocab_size": 49154
25
+ }
onnx/model-ffn_skip.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e276a373d90da222c97e3248bd8f40c595a01c1613a022958fe93de24a8e41b4
3
+ size 1318916135
special_tokens_map.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|endoftext|>",
4
+ "<fim_prefix>",
5
+ "<fim_middle>",
6
+ "<fim_suffix>",
7
+ "<fim_pad>",
8
+ "<filename>",
9
+ "<gh_stars>",
10
+ "<issue_start>",
11
+ "<issue_comment>",
12
+ "<issue_closed>",
13
+ "<jupyter_start>",
14
+ "<jupyter_text>",
15
+ "<jupyter_code>",
16
+ "<jupyter_output>",
17
+ "<empty_output>",
18
+ "<commit_before>",
19
+ "<commit_msg>",
20
+ "<commit_after>",
21
+ "<reponame>"
22
+ ],
23
+ "bos_token": "<|endoftext|>",
24
+ "eos_token": "<|endoftext|>",
25
+ "mask_token": "<mask>",
26
+ "pad_token": "<pad>",
27
+ "unk_token": "<|endoftext|>"
28
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "additional_special_tokens": [
4
+ "<|endoftext|>",
5
+ "<fim_prefix>",
6
+ "<fim_middle>",
7
+ "<fim_suffix>",
8
+ "<fim_pad>",
9
+ "<filename>",
10
+ "<gh_stars>",
11
+ "<issue_start>",
12
+ "<issue_comment>",
13
+ "<issue_closed>",
14
+ "<jupyter_start>",
15
+ "<jupyter_text>",
16
+ "<jupyter_code>",
17
+ "<jupyter_output>",
18
+ "<empty_output>",
19
+ "<commit_before>",
20
+ "<commit_msg>",
21
+ "<commit_after>",
22
+ "<reponame>"
23
+ ],
24
+ "bos_token": "<|endoftext|>",
25
+ "eos_token": "<|endoftext|>",
26
+ "add_eos_token": true,
27
+ "model_max_length": 1000000000000000019884624838656,
28
+ "unk_token": "<|endoftext|>",
29
+ "vocab_size": 49152,
30
+ "tokenizer_class": "CodeSageTokenizer",
31
+ "auto_map": {
32
+ "AutoTokenizer": ["tokenization_codesage.CodeSageTokenizer", null]
33
+ }
34
+ }