---
language:
- multilingual
- af
- sq
- ar
- an
- hy
- ast
- az
- ba
- eu
- bar
- be
- bn
- inc
- bs
- br
- bg
- my
- ca
- ceb
- ce
- zh
- cv
- hr
- cs
- da
- nl
- en
- et
- fi
- fr
- gl
- ka
- de
- el
- gu
- ht
- he
- hi
- hu
- is
- io
- id
- ga
- it
- ja
- jv
- kn
- kk
- ky
- ko
- la
- lv
- lt
- roa
- nds
- lm
- mk
- mg
- ms
- ml
- mr
- mn
- min
- ne
- new
- nb
- nn
- oc
- fa
- pms
- pl
- pt
- pa
- ro
- ru
- sco
- sr
- hr
- scn
- sk
- sl
- aze
- es
- su
- sw
- sv
- tl
- tg
- th
- ta
- tt
- te
- tr
- uk
- ud
- uz
- vi
- vo
- war
- cy
- fry
- pnb
- yo
tags:
- onnx
- awesome-align
- word-alignment
- bert
pipeline_tag: feature-extraction
license: apache-2.0
datasets:
- wikipedia
---
# Awesome-Align mBERT (ONNX FP32)

This repository contains an ONNX export of **bert-base-multilingual-cased** optimized for word alignment with the **awesome-align** methodology.

### Model Details
- **Base Model:** `bert-base-multilingual-cased`
- **Truncation:** The encoder is truncated to the **first 8 layers**. According to the [awesome-align research](https://github.com/neulab/awesome-align), layer 8 is the "sweet spot" for extracting cross-lingual word embeddings for alignment (a hedged export sketch follows this list).
- **Format:** ONNX (full precision, FP32)
- **Size:** ~596 MB
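
For reference, here is a hedged sketch of how such a layer-truncated FP32 export could be produced with `torch.onnx.export`. It is illustrative only and not necessarily the exact script used to build `model.onnx` in this repository; the wrapper class name and the opset choice are assumptions.

```python
import torch
from transformers import BertModel

class TruncatedEncoder(torch.nn.Module):
    """Illustrative wrapper: layer-8-truncated mBERT that returns only the last hidden state."""
    def __init__(self):
        super().__init__()
        bert = BertModel.from_pretrained("bert-base-multilingual-cased")
        bert.encoder.layer = bert.encoder.layer[:8]   # keep only the first 8 transformer layers
        bert.config.num_hidden_layers = 8
        self.bert = bert.eval()

    def forward(self, input_ids, attention_mask):
        return self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

model = TruncatedEncoder()
dummy_ids = torch.ones(1, 16, dtype=torch.long)
dummy_mask = torch.ones(1, 16, dtype=torch.long)

torch.onnx.export(
    model,
    (dummy_ids, dummy_mask),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=14,
)
```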

### Usage
This model is intended to be used with `onnxruntime` to extract embeddings for the source and target sentences. Because the export is already truncated, the output of `session.run` is the layer-8 hidden state, the layer awesome-align identifies as best for cross-lingual feature extraction. Alignments are then computed with cosine similarity and a mutual-argmax (intersection) rule.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# 1. Load Model and Tokenizer
# For INT8: use "cstr/awesome-align-onnx-int8"
model_id = "cstr/awesome-align-onnx"
session = ort.InferenceSession("model.onnx", providers=['CPUExecutionProvider'])
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_word_embeddings(words):
    # Tokenize with subword mapping
    encoded = tokenizer(words, is_split_into_words=True, return_tensors="np")

    # Track which subwords belong to which original word index
    word_map = []
    for i, w in enumerate(words):
        sub_tokens = tokenizer.tokenize(w) or [tokenizer.unk_token]
        word_map.extend([i] * len(sub_tokens))

    # Run inference
    outputs = session.run(None, {
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"]
    })

    # Slicing: [Batch 0, remove CLS/SEP, all hidden features]
    embeddings = outputs[0][0, 1:-1, :]
    return embeddings, word_map

def align(src_words, tgt_words):
    # Get embeddings and maps
    src_embeds, src_map = get_word_embeddings(src_words)
    tgt_embeds, tgt_map = get_word_embeddings(tgt_words)

    # Compute Cosine Similarity
    src_norm = src_embeds / np.linalg.norm(src_embeds, axis=-1, keepdims=True)
    tgt_norm = tgt_embeds / np.linalg.norm(tgt_embeds, axis=-1, keepdims=True)
    similarity = np.dot(src_norm, tgt_norm.T)

    # Mutual Argmax (Intersection) Logic
    best_tgt_for_src = np.argmax(similarity, axis=1)
    best_src_for_tgt = np.argmax(similarity, axis=0)

    alignment = set()
    for i, j in enumerate(best_tgt_for_src):
        if best_src_for_tgt[j] == i:
            alignment.add((src_map[i], tgt_map[j]))

    return sorted(list(alignment))

# Example Usage
src = ["the", "cat", "sat"]
tgt = ["le", "chat", "assis"]
links = align(src, tgt)

print(f"Alignment Links: {links}")
# Output: [(0, 0), (1, 1), (2, 2)]
```

### Technical Notes

* **Subword Handling**: The script above handles BERT subword tokenization by mapping sub-tokens (e.g., `['ass', '##is']`) back to their parent word index.
* **Layer 8 Extraction**: The ONNX model is pre-truncated; the output of `session.run` is the 768-dimensional hidden state of the 8th layer.
* **Precision**: The mutual-argmax logic yields high-precision 1-to-1 alignments. For higher recall, you can apply a symmetrization heuristic such as "Grow-Diag-Final" to the `similarity` matrix (a sketch follows this list).
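
The mutual-argmax intersection is deliberately conservative. The sketch below is a simplified, illustrative Grow-Diag-Final symmetrization over the same `similarity` matrix; the function name and tie-breaking details are assumptions rather than part of this repository. It returns subword-index pairs, so map them back to word indices with `src_map` / `tgt_map` exactly as in `align` above.

```python
import numpy as np

def grow_diag_final(similarity):
    """Simplified Grow-Diag-Final symmetrization (illustrative sketch)."""
    n_src, n_tgt = similarity.shape
    forward = {(i, int(np.argmax(similarity[i]))) for i in range(n_src)}
    reverse = {(int(np.argmax(similarity[:, j])), j) for j in range(n_tgt)}
    alignment = forward & reverse          # high-precision intersection, as in `align`
    union = forward | reverse
    neighbors = [(-1, 0), (0, -1), (1, 0), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]

    # GROW-DIAG: absorb union links adjacent to current links that cover a new row or column
    changed = True
    while changed:
        changed = False
        for i, j in sorted(alignment):
            for di, dj in neighbors:
                cand = (i + di, j + dj)
                if cand in union and cand not in alignment:
                    src_free = all(a != cand[0] for a, _ in alignment)
                    tgt_free = all(b != cand[1] for _, b in alignment)
                    if src_free or tgt_free:
                        alignment.add(cand)
                        changed = True

    # FINAL: add remaining union links whose source or target position is still unaligned
    for i, j in sorted(union - alignment):
        src_free = all(a != i for a, _ in alignment)
        tgt_free = all(b != j for _, b in alignment)
        if src_free or tgt_free:
            alignment.add((i, j))
    return sorted(alignment)
```

For example, inside `align` you could replace the mutual-argmax block with `{(src_map[i], tgt_map[j]) for i, j in grow_diag_final(similarity)}` to trade some precision for recall.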

The original model card follows:

# BERT multilingual base model (cased)

Pretrained model on the top 104 languages with the largest Wikipedia using a masked language modeling (MLM) objective.
It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in
[this repository](https://github.com/google-research/bert). This model is case sensitive: it makes a difference
between english and English.

Disclaimer: The team releasing BERT did not write a model card for this model, so this model card has been written by
the Hugging Face team.

## Model description

BERT is a transformers model pretrained on a large corpus of multilingual data in a self-supervised fashion. This means
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
the entire masked sentence through the model and has to predict the masked words. This is different from traditional
recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like
GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
sentence.
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes
they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
predict whether the two sentences followed each other or not (see the sketch after this list).
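
As a quick, hedged illustration of the NSP objective (not part of the original card; the example sentences are made up), the pretrained NSP head can be queried through `BertForNextSentencePrediction`:

```python
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

# Score whether sentence B plausibly follows sentence A using the pretrained NSP head.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-multilingual-cased")
model.eval()

sentence_a = "The weather was terrible, so we stayed inside."
sentence_b = "We spent the afternoon playing board games."
encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits
# Index 0 = "B follows A", index 1 = "B is a random sentence"
probs = torch.softmax(logits, dim=-1)
print(f"P(B follows A) = {probs[0, 0]:.3f}")
```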

This way, the model learns an inner representation of the languages in the training set that can then be used to
extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a
standard classifier using the features produced by the BERT model as inputs.
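
As a toy sketch of that idea (assuming scikit-learn is installed; the inline dataset and labels are made up for illustration), frozen mBERT `[CLS]` features can feed a standard classifier:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertModel, BertTokenizer

# Extract frozen [CLS] features from mBERT and train a simple scikit-learn classifier on them.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = BertModel.from_pretrained("bert-base-multilingual-cased").eval()

texts = ["Great movie!", "Das war furchtbar.", "J'ai adoré ce film.", "Terrible plot."]
labels = [1, 0, 1, 0]  # made-up sentiment labels, for illustration only

with torch.no_grad():
    enc = tokenizer(texts, padding=True, return_tensors="pt")
    feats = bert(**enc).last_hidden_state[:, 0, :].numpy()  # one [CLS] vector per sentence

clf = LogisticRegression(max_iter=1000).fit(feats, labels)
print(clf.predict(feats))
```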

## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at models like GPT2.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-multilingual-cased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] Hello I'm a model model. [SEP]",
  'score': 0.10182085633277893,
  'token': 13192,
  'token_str': 'model'},
 {'sequence': "[CLS] Hello I'm a world model. [SEP]",
  'score': 0.052126359194517136,
  'token': 11356,
  'token_str': 'world'},
 {'sequence': "[CLS] Hello I'm a data model. [SEP]",
  'score': 0.048930276185274124,
  'token': 11165,
  'token_str': 'data'},
 {'sequence': "[CLS] Hello I'm a flight model. [SEP]",
  'score': 0.02036019042134285,
  'token': 23578,
  'token_str': 'flight'},
 {'sequence': "[CLS] Hello I'm a business model. [SEP]",
  'score': 0.020079681649804115,
  'token': 14155,
  'token_str': 'business'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = TFBertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

## Training data

The BERT model was pretrained on the 104 languages with the largest Wikipedias. You can find the complete list
[here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).

## Training procedure

### Preprocessing

The texts are tokenized using WordPiece with a shared vocabulary of 110,000. Languages with a larger Wikipedia are
under-sampled and low-resource languages are oversampled. For scripts without whitespace, such as Chinese, Japanese
Kanji and Korean Hanja, spaces are added around every character in the CJK Unicode block.
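
For example (an illustrative check, not part of the original card), CJK ideographs end up as single-character tokens:

```python
from transformers import BertTokenizer

# The multilingual WordPiece tokenizer splits CJK ideographs character by character.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.tokenize("東京は日本の首都です。"))  # kanji appear as individual single-character tokens
```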

The inputs of the model are then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
the other cases sentence B is another random sentence from the corpus. Note that what is considered a sentence here is a
consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two
"sentences" has a combined length of less than 512 tokens.

The details of the masking procedure for each sentence are the following (a toy illustration follows the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token, different from the one they replace.
- In the 10% remaining cases, the masked tokens are left as is.
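
As a toy illustration (not part of the original card; a simplified reimplementation of the 80/10/10 rule above), the masking could look like this:

```python
import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def mask_tokens(token_ids, mask_prob=0.15):
    """Toy 80/10/10 masking: returns the corrupted ids and the MLM labels."""
    labels = [-100] * len(token_ids)              # -100 = position ignored by the MLM loss
    masked = list(token_ids)
    for pos, tok in enumerate(token_ids):
        if tok in tokenizer.all_special_ids or random.random() >= mask_prob:
            continue
        labels[pos] = tok
        roll = random.random()
        if roll < 0.8:                            # 80%: replace with [MASK]
            masked[pos] = tokenizer.mask_token_id
        elif roll < 0.9:                          # 10%: replace with a random vocabulary token
            masked[pos] = random.randrange(tokenizer.vocab_size)
        # remaining 10%: keep the original token
    return masked, labels

ids = tokenizer("Hello I'm a model.")["input_ids"]
print(mask_tokens(ids))
```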

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
  author        = {Jacob Devlin and
                   Ming{-}Wei Chang and
                   Kenton Lee and
                   Kristina Toutanova},
  title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
                   Understanding},
  journal       = {CoRR},
  volume        = {abs/1810.04805},
  year          = {2018},
  url           = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint        = {1810.04805},
  timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```