yezdata committed on
Commit
c47a352
·
verified ·
1 Parent(s): 81b23e9

Upload 14 files

.gitattributes ADDED
@@ -0,0 +1 @@
1
+ emcoder/model.safetensors filter=lfs diff=lfs merge=lfs -text
LICENSE.md ADDED
@@ -0,0 +1,402 @@
1
+ Attribution-NonCommercial-NoDerivatives 4.0 International
2
+
3
+ =======================================================================
4
+
5
+ Creative Commons Corporation ("Creative Commons") is not a law firm and
6
+ does not provide legal services or legal advice. Distribution of
7
+ Creative Commons public licenses does not create a lawyer-client or
8
+ other relationship. Creative Commons makes its licenses and related
9
+ information available on an "as-is" basis. Creative Commons gives no
10
+ warranties regarding its licenses, any material licensed under their
11
+ terms and conditions, or any related information. Creative Commons
12
+ disclaims all liability for damages resulting from their use to the
13
+ fullest extent possible.
14
+
15
+ Using Creative Commons Public Licenses
16
+
17
+ Creative Commons public licenses provide a standard set of terms and
18
+ conditions that creators and other rights holders may use to share
19
+ original works of authorship and other material subject to copyright
20
+ and certain other rights specified in the public license below. The
21
+ following considerations are for informational purposes only, are not
22
+ exhaustive, and do not form part of our licenses.
23
+
24
+ Considerations for licensors: Our public licenses are
25
+ intended for use by those authorized to give the public
26
+ permission to use material in ways otherwise restricted by
27
+ copyright and certain other rights. Our licenses are
28
+ irrevocable. Licensors should read and understand the terms
29
+ and conditions of the license they choose before applying it.
30
+ Licensors should also secure all rights necessary before
31
+ applying our licenses so that the public can reuse the
32
+ material as expected. Licensors should clearly mark any
33
+ material not subject to the license. This includes other CC-
34
+ licensed material, or material used under an exception or
35
+ limitation to copyright. More considerations for licensors:
36
+ wiki.creativecommons.org/Considerations_for_licensors
37
+
38
+ Considerations for the public: By using one of our public
39
+ licenses, a licensor grants the public permission to use the
40
+ licensed material under specified terms and conditions. If
41
+ the licensor's permission is not necessary for any reason--for
42
+ example, because of any applicable exception or limitation to
43
+ copyright--then that use is not regulated by the license. Our
44
+ licenses grant only permissions under copyright and certain
45
+ other rights that a licensor has authority to grant. Use of
46
+ the licensed material may still be restricted for other
47
+ reasons, including because others have copyright or other
48
+ rights in the material. A licensor may make special requests,
49
+ such as asking that all changes be marked or described.
50
+ Although not required by our licenses, you are encouraged to
51
+ respect those requests where reasonable. More considerations
52
+ for the public:
53
+ wiki.creativecommons.org/Considerations_for_licensees
54
+
55
+ =======================================================================
56
+
57
+ Creative Commons Attribution-NonCommercial-NoDerivatives 4.0
58
+ International Public License
59
+
60
+ By exercising the Licensed Rights (defined below), You accept and agree
61
+ to be bound by the terms and conditions of this Creative Commons
62
+ Attribution-NonCommercial-NoDerivatives 4.0 International Public
63
+ License ("Public License"). To the extent this Public License may be
64
+ interpreted as a contract, You are granted the Licensed Rights in
65
+ consideration of Your acceptance of these terms and conditions, and the
66
+ Licensor grants You such rights in consideration of benefits the
67
+ Licensor receives from making the Licensed Material available under
68
+ these terms and conditions.
69
+
70
+
71
+ Section 1 -- Definitions.
72
+
73
+ a. Adapted Material means material subject to Copyright and Similar
74
+ Rights that is derived from or based upon the Licensed Material
75
+ and in which the Licensed Material is translated, altered,
76
+ arranged, transformed, or otherwise modified in a manner requiring
77
+ permission under the Copyright and Similar Rights held by the
78
+ Licensor. For purposes of this Public License, where the Licensed
79
+ Material is a musical work, performance, or sound recording,
80
+ Adapted Material is always produced where the Licensed Material is
81
+ synched in timed relation with a moving image.
82
+
83
+ b. Copyright and Similar Rights means copyright and/or similar rights
84
+ closely related to copyright including, without limitation,
85
+ performance, broadcast, sound recording, and Sui Generis Database
86
+ Rights, without regard to how the rights are labeled or
87
+ categorized. For purposes of this Public License, the rights
88
+ specified in Section 2(b)(1)-(2) are not Copyright and Similar
89
+ Rights.
90
+
91
+ c. Effective Technological Measures means those measures that, in the
92
+ absence of proper authority, may not be circumvented under laws
93
+ fulfilling obligations under Article 11 of the WIPO Copyright
94
+ Treaty adopted on December 20, 1996, and/or similar international
95
+ agreements.
96
+
97
+ d. Exceptions and Limitations means fair use, fair dealing, and/or
98
+ any other exception or limitation to Copyright and Similar Rights
99
+ that applies to Your use of the Licensed Material.
100
+
101
+ e. Licensed Material means the artistic or literary work, database,
102
+ or other material to which the Licensor applied this Public
103
+ License.
104
+
105
+ f. Licensed Rights means the rights granted to You subject to the
106
+ terms and conditions of this Public License, which are limited to
107
+ all Copyright and Similar Rights that apply to Your use of the
108
+ Licensed Material and that the Licensor has authority to license.
109
+
110
+ g. Licensor means the individual(s) or entity(ies) granting rights
111
+ under this Public License.
112
+
113
+ h. NonCommercial means not primarily intended for or directed towards
114
+ commercial advantage or monetary compensation. For purposes of
115
+ this Public License, the exchange of the Licensed Material for
116
+ other material subject to Copyright and Similar Rights by digital
117
+ file-sharing or similar means is NonCommercial provided there is
118
+ no payment of monetary compensation in connection with the
119
+ exchange.
120
+
121
+ i. Share means to provide material to the public by any means or
122
+ process that requires permission under the Licensed Rights, such
123
+ as reproduction, public display, public performance, distribution,
124
+ dissemination, communication, or importation, and to make material
125
+ available to the public including in ways that members of the
126
+ public may access the material from a place and at a time
127
+ individually chosen by them.
128
+
129
+ j. Sui Generis Database Rights means rights other than copyright
130
+ resulting from Directive 96/9/EC of the European Parliament and of
131
+ the Council of 11 March 1996 on the legal protection of databases,
132
+ as amended and/or succeeded, as well as other essentially
133
+ equivalent rights anywhere in the world.
134
+
135
+ k. You means the individual or entity exercising the Licensed Rights
136
+ under this Public License. Your has a corresponding meaning.
137
+
138
+
139
+ Section 2 -- Scope.
140
+
141
+ a. License grant.
142
+
143
+ 1. Subject to the terms and conditions of this Public License,
144
+ the Licensor hereby grants You a worldwide, royalty-free,
145
+ non-sublicensable, non-exclusive, irrevocable license to
146
+ exercise the Licensed Rights in the Licensed Material to:
147
+
148
+ a. reproduce and Share the Licensed Material, in whole or
149
+ in part, for NonCommercial purposes only; and
150
+
151
+ b. produce and reproduce, but not Share, Adapted Material
152
+ for NonCommercial purposes only.
153
+
154
+ 2. Exceptions and Limitations. For the avoidance of doubt, where
155
+ Exceptions and Limitations apply to Your use, this Public
156
+ License does not apply, and You do not need to comply with
157
+ its terms and conditions.
158
+
159
+ 3. Term. The term of this Public License is specified in Section
160
+ 6(a).
161
+
162
+ 4. Media and formats; technical modifications allowed. The
163
+ Licensor authorizes You to exercise the Licensed Rights in
164
+ all media and formats whether now known or hereafter created,
165
+ and to make technical modifications necessary to do so. The
166
+ Licensor waives and/or agrees not to assert any right or
167
+ authority to forbid You from making technical modifications
168
+ necessary to exercise the Licensed Rights, including
169
+ technical modifications necessary to circumvent Effective
170
+ Technological Measures. For purposes of this Public License,
171
+ simply making modifications authorized by this Section 2(a)
172
+ (4) never produces Adapted Material.
173
+
174
+ 5. Downstream recipients.
175
+
176
+ a. Offer from the Licensor -- Licensed Material. Every
177
+ recipient of the Licensed Material automatically
178
+ receives an offer from the Licensor to exercise the
179
+ Licensed Rights under the terms and conditions of this
180
+ Public License.
181
+
182
+ b. No downstream restrictions. You may not offer or impose
183
+ any additional or different terms or conditions on, or
184
+ apply any Effective Technological Measures to, the
185
+ Licensed Material if doing so restricts exercise of the
186
+ Licensed Rights by any recipient of the Licensed
187
+ Material.
188
+
189
+ 6. No endorsement. Nothing in this Public License constitutes or
190
+ may be construed as permission to assert or imply that You
191
+ are, or that Your use of the Licensed Material is, connected
192
+ with, or sponsored, endorsed, or granted official status by,
193
+ the Licensor or others designated to receive attribution as
194
+ provided in Section 3(a)(1)(A)(i).
195
+
196
+ b. Other rights.
197
+
198
+ 1. Moral rights, such as the right of integrity, are not
199
+ licensed under this Public License, nor are publicity,
200
+ privacy, and/or other similar personality rights; however, to
201
+ the extent possible, the Licensor waives and/or agrees not to
202
+ assert any such rights held by the Licensor to the limited
203
+ extent necessary to allow You to exercise the Licensed
204
+ Rights, but not otherwise.
205
+
206
+ 2. Patent and trademark rights are not licensed under this
207
+ Public License.
208
+
209
+ 3. To the extent possible, the Licensor waives any right to
210
+ collect royalties from You for the exercise of the Licensed
211
+ Rights, whether directly or through a collecting society
212
+ under any voluntary or waivable statutory or compulsory
213
+ licensing scheme. In all other cases the Licensor expressly
214
+ reserves any right to collect such royalties, including when
215
+ the Licensed Material is used other than for NonCommercial
216
+ purposes.
217
+
218
+
219
+ Section 3 -- License Conditions.
220
+
221
+ Your exercise of the Licensed Rights is expressly made subject to the
222
+ following conditions.
223
+
224
+ a. Attribution.
225
+
226
+ 1. If You Share the Licensed Material, You must:
227
+
228
+ a. retain the following if it is supplied by the Licensor
229
+ with the Licensed Material:
230
+
231
+ i. identification of the creator(s) of the Licensed
232
+ Material and any others designated to receive
233
+ attribution, in any reasonable manner requested by
234
+ the Licensor (including by pseudonym if
235
+ designated);
236
+
237
+ ii. a copyright notice;
238
+
239
+ iii. a notice that refers to this Public License;
240
+
241
+ iv. a notice that refers to the disclaimer of
242
+ warranties;
243
+
244
+ v. a URI or hyperlink to the Licensed Material to the
245
+ extent reasonably practicable;
246
+
247
+ b. indicate if You modified the Licensed Material and
248
+ retain an indication of any previous modifications; and
249
+
250
+ c. indicate the Licensed Material is licensed under this
251
+ Public License, and include the text of, or the URI or
252
+ hyperlink to, this Public License.
253
+
254
+ For the avoidance of doubt, You do not have permission under
255
+ this Public License to Share Adapted Material.
256
+
257
+ 2. You may satisfy the conditions in Section 3(a)(1) in any
258
+ reasonable manner based on the medium, means, and context in
259
+ which You Share the Licensed Material. For example, it may be
260
+ reasonable to satisfy the conditions by providing a URI or
261
+ hyperlink to a resource that includes the required
262
+ information.
263
+
264
+ 3. If requested by the Licensor, You must remove any of the
265
+ information required by Section 3(a)(1)(A) to the extent
266
+ reasonably practicable.
267
+
268
+
269
+ Section 4 -- Sui Generis Database Rights.
270
+
271
+ Where the Licensed Rights include Sui Generis Database Rights that
272
+ apply to Your use of the Licensed Material:
273
+
274
+ a. for the avoidance of doubt, Section 2(a)(1) grants You the right
275
+ to extract, reuse, reproduce, and Share all or a substantial
276
+ portion of the contents of the database for NonCommercial purposes
277
+ only and provided You do not Share Adapted Material;
278
+
279
+ b. if You include all or a substantial portion of the database
280
+ contents in a database in which You have Sui Generis Database
281
+ Rights, then the database in which You have Sui Generis Database
282
+ Rights (but not its individual contents) is Adapted Material; and
283
+
284
+ c. You must comply with the conditions in Section 3(a) if You Share
285
+ all or a substantial portion of the contents of the database.
286
+
287
+ For the avoidance of doubt, this Section 4 supplements and does not
288
+ replace Your obligations under this Public License where the Licensed
289
+ Rights include other Copyright and Similar Rights.
290
+
291
+
292
+ Section 5 -- Disclaimer of Warranties and Limitation of Liability.
293
+
294
+ a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
295
+ EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
296
+ AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
297
+ ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
298
+ IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
299
+ WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
300
+ PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
301
+ ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
302
+ KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
303
+ ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
304
+
305
+ b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
306
+ TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
307
+ NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
308
+ INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
309
+ COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
310
+ USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
311
+ ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
312
+ DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
313
+ IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
314
+
315
+ c. The disclaimer of warranties and limitation of liability provided
316
+ above shall be interpreted in a manner that, to the extent
317
+ possible, most closely approximates an absolute disclaimer and
318
+ waiver of all liability.
319
+
320
+
321
+ Section 6 -- Term and Termination.
322
+
323
+ a. This Public License applies for the term of the Copyright and
324
+ Similar Rights licensed here. However, if You fail to comply with
325
+ this Public License, then Your rights under this Public License
326
+ terminate automatically.
327
+
328
+ b. Where Your right to use the Licensed Material has terminated under
329
+ Section 6(a), it reinstates:
330
+
331
+ 1. automatically as of the date the violation is cured, provided
332
+ it is cured within 30 days of Your discovery of the
333
+ violation; or
334
+
335
+ 2. upon express reinstatement by the Licensor.
336
+
337
+ For the avoidance of doubt, this Section 6(b) does not affect any
338
+ right the Licensor may have to seek remedies for Your violations
339
+ of this Public License.
340
+
341
+ c. For the avoidance of doubt, the Licensor may also offer the
342
+ Licensed Material under separate terms or conditions or stop
343
+ distributing the Licensed Material at any time; however, doing so
344
+ will not terminate this Public License.
345
+
346
+ d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
347
+ License.
348
+
349
+
350
+ Section 7 -- Other Terms and Conditions.
351
+
352
+ a. The Licensor shall not be bound by any additional or different
353
+ terms or conditions communicated by You unless expressly agreed.
354
+
355
+ b. Any arrangements, understandings, or agreements regarding the
356
+ Licensed Material not stated herein are separate from and
357
+ independent of the terms and conditions of this Public License.
358
+
359
+
360
+ Section 8 -- Interpretation.
361
+
362
+ a. For the avoidance of doubt, this Public License does not, and
363
+ shall not be interpreted to, reduce, limit, restrict, or impose
364
+ conditions on any use of the Licensed Material that could lawfully
365
+ be made without permission under this Public License.
366
+
367
+ b. To the extent possible, if any provision of this Public License is
368
+ deemed unenforceable, it shall be automatically reformed to the
369
+ minimum extent necessary to make it enforceable. If the provision
370
+ cannot be reformed, it shall be severed from this Public License
371
+ without affecting the enforceability of the remaining terms and
372
+ conditions.
373
+
374
+ c. No term or condition of this Public License will be waived and no
375
+ failure to comply consented to unless expressly agreed to by the
376
+ Licensor.
377
+
378
+ d. Nothing in this Public License constitutes or may be interpreted
379
+ as a limitation upon, or waiver of, any privileges and immunities
380
+ that apply to the Licensor or You, including from the legal
381
+ processes of any jurisdiction or authority.
382
+
383
+ =======================================================================
384
+
385
+ Creative Commons is not a party to its public
386
+ licenses. Notwithstanding, Creative Commons may elect to apply one of
387
+ its public licenses to material it publishes and in those instances
388
+ will be considered the “Licensor.” The text of the Creative Commons
389
+ public licenses is dedicated to the public domain under the CC0 Public
390
+ Domain Dedication. Except for the limited purpose of indicating that
391
+ material is shared under a Creative Commons public license or as
392
+ otherwise permitted by the Creative Commons policies published at
393
+ creativecommons.org/policies, Creative Commons does not authorize the
394
+ use of the trademark "Creative Commons" or any other trademark or logo
395
+ of Creative Commons without its prior written consent including,
396
+ without limitation, in connection with any unauthorized modifications
397
+ to any of its public licenses or any other arrangements,
398
+ understandings, or agreements concerning use of licensed material. For
399
+ the avoidance of doubt, this paragraph does not form part of the
400
+ public licenses.
401
+
402
+ Creative Commons may be contacted at creativecommons.org.
README.md ADDED
@@ -0,0 +1,220 @@
1
+ ---
2
+ language:
3
+ - en
4
+ license: cc-by-nc-nd-4.0
5
+ library_name: generic
6
+ tags:
7
+ - emotion-recognition
8
+ - bayesian-deep-learning
9
+ - mc-dropout
10
+ - uncertainty-quantification
11
+ - multi-label-classification
12
+ datasets:
13
+ - Skylion007/openwebtext
14
+ - google-research-datasets/go_emotions
15
+ snippet: |
16
+   from transformers import AutoTokenizer
17
+   from huggingface_hub import snapshot_download
18
+   from emcoder import EmCoder
19
+   model_dir = snapshot_download(repo_id="yezdata/EmCoder")
20
+
21
+   tokenizer = AutoTokenizer.from_pretrained(model_dir)
22
+   model = EmCoder.from_pretrained(model_dir)
23
+ metrics:
24
+ - precision
25
+ - recall
26
+ - f1
27
+ model-index:
28
+ - name: EmCoder (v1)
29
+   results:
30
+   - task:
31
+       type: text-classification
32
+       name: Multi-label Emotion Classification
33
+     dataset:
34
+       name: GoEmotions
35
+       type: go_emotions
36
+       split: test
37
+     metrics:
38
+     - name: Macro F1
39
+       type: f1
40
+       value: 0.44
41
+     - name: Macro Precision
42
+       type: precision
43
+       value: 0.408
44
+     - name: Macro Recall
45
+       type: recall
46
+       value: 0.495
47
+ ---
48
+
49
+ # EmCoder
50
+ <blockquote>
51
+ <b>Probabilistic Emotion Recognition & Uncertainty Quantification</b><br>
52
+ <b>28-emotion multi-label classifier trained with MC Dropout</b>
53
+ </blockquote>
54
+
55
+
56
+ Unlike standard classifiers, EmCoder estimates its own epistemic uncertainty with Monte Carlo Dropout, so downstream pipelines can flag predictions the model is unsure about. This makes it a candidate for high-stakes AI pipelines.<br>
57
+ EmCoder is optimized for **MC Dropout inference**.
58
+
59
+
60
+
61
+ ## SOTA benchmark
62
+ ### Evaluation on the GoEmotions test split (macro avg metrics)
63
+ EmCoder reaches a macro F1 of 0.440, within 0.01 to 0.06 of the baselines below, while having ~34% fewer parameters than RoBERTa-base (82.1M vs. 125M) and ~45% fewer than ModernBERT-base (82.1M vs. 149M), and it additionally provides uncertainty estimates.
64
+ | Model | Precision | Recall | F1-Score | Params |
65
+ | :--- | :--- | :--- | :--- | :--- |
66
+ | **EmCoder (v1)** | **0.408** | **0.495** | **0.440** | **82.1M** |
67
+ | Google BERT (Original) | 0.400 | 0.630 | 0.460 | 110M |
68
+ | RoBERTa-base | 0.575 | 0.396 | 0.450 | 125M |
69
+ | ModernBERT-base | 0.652 | 0.443 | 0.500 | 149M |
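The size comparison can be checked directly from the `Params` column. The snippet below is only a small arithmetic sketch using the parameter counts (in millions) listed in the table; the variable names are illustrative.

```python
# Parameter counts in millions, copied from the table above.
params_m = {
    "EmCoder (v1)": 82.1,
    "Google BERT (Original)": 110,
    "RoBERTa-base": 125,
    "ModernBERT-base": 149,
}

emcoder = params_m["EmCoder (v1)"]

# Relative size reduction of EmCoder with respect to each baseline.
reduction = {
    name: 1 - emcoder / count
    for name, count in params_m.items()
    if name != "EmCoder (v1)"
}

for name, r in reduction.items():
    print(f"{name:<24} {r:.1%} fewer parameters")
```

Under these numbers the reduction is about 25% against BERT, 34% against RoBERTa-base, and 45% against ModernBERT-base.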
70
+
71
+
72
+ ## How to use
73
+ Since `.safetensors` files only store model weights and not the class logic, you need to use the provided `emcoder.py` to enable **MC Dropout inference**.<br>EmCoder v1.0 requires the `roberta-base` tokenizer for correct token-to-embedding mapping.
74
+ ### 1. Setup & Tokenization
75
+ Install dependencies
76
+ ```bash
77
+ pip install -r requirements.txt
78
+ ```
79
+ Set up EmCoder
80
+ ```python
81
+ from transformers import AutoTokenizer
82
+ from huggingface_hub import snapshot_download
83
+ from emcoder import EmCoder # Ensure emcoder.py is in your directory
84
+
85
+ repo_id = "yezdata/EmCoder"
86
+ model_dir = snapshot_download(repo_id=repo_id)
87
+ print(model_dir)
88
+
89
+ # Load the same tokenizer used during training
90
+ tokenizer = AutoTokenizer.from_pretrained(model_dir)
91
+
92
+ # Initialize with same config as training
93
+ model = EmCoder.from_pretrained(model_dir)
94
+ ```
95
+ ### 2. Bayesian inference
96
+ To obtain probabilistic outputs and uncertainty metrics, use the `mc_forward` method:
97
+ ```python
98
+ import torch
99
+
100
+ # Perform 50 stochastic passes
101
+ N_SAMPLES = 50
102
+ model.eval()
103
+
104
+ inputs = tokenizer("I am so happy you are here!", return_tensors="pt")
105
+ logits_mc = model.mc_forward(inputs['input_ids'], inputs['attention_mask'], n_samples=N_SAMPLES) # Automatically keeps Dropout active, even when in model.eval
106
+
107
+ # Bayesian Post-processing
108
+ probs_all = torch.sigmoid(logits_mc) # (n_samples, B, 28)
109
+
110
+ mean_probs = probs_all.mean(dim=0) # Mean Predicted Probability
111
+ uncertainty = probs_all.std(dim=0) # Epistemic Uncertainty (Standard Deviation)
112
+
113
+
114
+ # Formatted Output
115
+ m_probs = mean_probs.squeeze(0)
116
+ u_vals = uncertainty.squeeze(0)
117
+
118
+ print(f"{'Emotion':<15} | {'Prob':<10} | {'Uncertainty':<10}")
119
+ print("-" * 40)
120
+
121
+ sorted_indices = torch.argsort(m_probs, descending=True)
122
+
123
+ for idx in sorted_indices:
124
+     prob, unc = m_probs[idx].item(), u_vals[idx].item()
125
+     label = model.config.id2label[idx.item()]
126
+
127
+     if prob > 0.05: # Print only emotions with prob > 5% (optional for clarity)
128
+         print(f"{label:<15} | {prob:>8.2%} | ±{unc:>8.4f}")
129
+ ```
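One possible way to act on these outputs is to combine the mean probability with its spread. The sketch below redoes the mean/std aggregation in plain Python on made-up per-pass probabilities and then applies a simple decision rule. The 0.5 threshold is the one used in the Performance section; the `summarize` helper, the `max_std` cutoff, and the sample values are hypothetical and not part of EmCoder.

```python
from statistics import mean, stdev


def summarize(samples, threshold=0.5, max_std=0.15):
    """Aggregate one label's probabilities from T stochastic passes (T >= 2).

    Returns (mean_prob, std, decision). `max_std` is an illustrative cutoff.
    """
    mu = mean(samples)
    sigma = stdev(samples)  # sample std (n - 1), like torch.std's default
    if sigma > max_std:
        decision = "uncertain"  # passes disagree too much to trust the mean
    elif mu >= threshold:
        decision = "positive"
    else:
        decision = "negative"
    return mu, sigma, decision


# Made-up sigmoid outputs from 5 MC Dropout passes for three labels.
mc_probs = {
    "joy": [0.91, 0.88, 0.93, 0.90, 0.89],
    "surprise": [0.15, 0.70, 0.35, 0.80, 0.25],
    "anger": [0.02, 0.03, 0.01, 0.02, 0.02],
}

results = {label: summarize(p) for label, p in mc_probs.items()}
for label, (mu, sigma, decision) in results.items():
    print(f"{label:<10} {mu:.3f} ±{sigma:.3f} {decision}")
```

Here `surprise` has a mean close to the threshold but a large spread across passes, so the rule routes it to a separate "uncertain" bucket instead of forcing a yes/no label.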
130
+
131
+
132
+ ## Model Architecture
133
+ ![EmCoder Architecture](outputs/architecture.png)
134
+
135
+
136
+ ### Optimization
137
+ The model is trained using a Weighted Bayesian Binary Cross Entropy loss:
138
+
139
+ $$
140
+ \mathcal{L}_{Bayesian} = \frac{1}{T} \sum_{t=1}^{T} \text{BCEWithLogits}(z^{(t)}, y; w)
141
+ $$
142
+
143
+ Where weights $w$ are calculated using a logarithmic class-balancing scale to handle extreme label imbalance:
144
+
145
+ $$
146
+ w_{c} = \max\left( 0.1, \min\left( 20, 1 + \ln \left( \frac{N_{neg,c} + \epsilon}{N_{pos,c} + \epsilon} \right) \right) \right)
147
+ $$
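To make the clamping behaviour concrete, here is a minimal sketch of the weight formula in plain Python. The label counts are invented for illustration, and the value of `eps` is an assumption, since the README does not state it.

```python
import math


def class_weight(n_pos, n_neg, eps=1e-6, w_min=0.1, w_max=20.0):
    """w_c = max(0.1, min(20, 1 + ln((N_neg + eps) / (N_pos + eps))))."""
    raw = 1.0 + math.log((n_neg + eps) / (n_pos + eps))
    return max(w_min, min(w_max, raw))


# Hypothetical (positive, negative) counts for 1000 training examples.
counts = {
    "rare": (5, 995),
    "balanced": (500, 500),
    "dominant": (900, 100),
    "absent": (0, 1000),
}

weights = {name: class_weight(p, n) for name, (p, n) in counts.items()}
for name, w in weights.items():
    print(f"{name:<10} {w:.3f}")
```

A rare label receives a weight of about 6.3, a balanced one exactly 1, and the two extremes are clipped to the bounds 0.1 and 20, which keeps very rare or missing labels from dominating the loss.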
148
+
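The loss itself can be sketched in the same spirit: average a weighted binary cross-entropy over the $T$ stochastic passes. The snippet below is a plain-Python illustration, not the training code. It assumes that $w$ scales the positive-class term (similar to `pos_weight` in PyTorch's `BCEWithLogitsLoss`); the README does not say whether that or a per-class rescaling is used. All function names and numbers are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logits, targets, weights):
    """Mean over classes of -[w*y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]."""
    terms = [
        w * y * softplus(-z) + (1.0 - y) * softplus(z)
        for z, y, w in zip(logits, targets, weights)
    ]
    return sum(terms) / len(terms)


def bayesian_bce(logits_per_pass, targets, weights):
    """Average the weighted BCE over T stochastic forward passes."""
    losses = [weighted_bce_with_logits(z, targets, weights) for z in logits_per_pass]
    return sum(losses) / len(losses)


targets = [1.0, 0.0]           # two labels: first present, second absent
weights = [2.0, 1.0]           # assumed positive-class weights
logits_per_pass = [
    [0.0, 0.0],                # pass 1: maximally unsure
    [2.0, -2.0],               # pass 2: confident and correct
]

loss = bayesian_bce(logits_per_pass, targets, weights)
print(f"{loss:.4f}")
```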
149
+
150
+
151
+ ## Performance
152
+ **Using threshold of 0.5 for binarizing predictions**
153
+ | | precision | recall | f1-score | support |
154
+ |:---------------|------------:|---------:|-----------:|----------:|
155
+ | micro avg | 0.494 | 0.596 | 0.54 | 6329 |
156
+ | macro avg | 0.408 | 0.495 | 0.44 | 6329 |
157
+ | weighted avg | 0.492 | 0.596 | 0.535 | 6329 |
158
+ | samples avg | 0.525 | 0.616 | 0.544 | 6329 |
159
+ |----------------|-------------|----------|------------|-----------|
160
+ | admiration | 0.541 | 0.673 | 0.599 | 504 |
161
+ | amusement | 0.688 | 0.909 | 0.783 | 264 |
162
+ | anger | 0.419 | 0.47 | 0.443 | 198 |
163
+ | annoyance | 0.31 | 0.25 | 0.277 | 320 |
164
+ | approval | 0.304 | 0.271 | 0.287 | 351 |
165
+ | caring | 0.229 | 0.281 | 0.252 | 135 |
166
+ | confusion | 0.26 | 0.497 | 0.342 | 153 |
167
+ | curiosity | 0.432 | 0.764 | 0.552 | 284 |
168
+ | desire | 0.453 | 0.518 | 0.483 | 83 |
169
+ | disappointment | 0.176 | 0.152 | 0.163 | 151 |
170
+ | disapproval | 0.279 | 0.404 | 0.33 | 267 |
171
+ | disgust | 0.447 | 0.545 | 0.491 | 123 |
172
+ | embarrassment | 0.325 | 0.351 | 0.338 | 37 |
173
+ | excitement | 0.288 | 0.427 | 0.344 | 103 |
174
+ | fear | 0.47 | 0.692 | 0.56 | 78 |
175
+ | gratitude | 0.834 | 0.943 | 0.885 | 352 |
176
+ | grief | 0 | 0 | 0 | 6 |
177
+ | joy | 0.445 | 0.652 | 0.529 | 161 |
178
+ | love | 0.724 | 0.895 | 0.801 | 238 |
179
+ | nervousness | 0.24 | 0.261 | 0.25 | 23 |
180
+ | optimism | 0.483 | 0.543 | 0.511 | 186 |
181
+ | pride | 0.667 | 0.375 | 0.48 | 16 |
182
+ | realization | 0.226 | 0.166 | 0.191 | 145 |
183
+ | relief | 0.222 | 0.182 | 0.2 | 11 |
184
+ | remorse | 0.516 | 0.857 | 0.644 | 56 |
185
+ | sadness | 0.405 | 0.545 | 0.464 | 156 |
186
+ | surprise | 0.429 | 0.539 | 0.478 | 141 |
187
+ | neutral | 0.602 | 0.695 | 0.645 | 1787 |
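As a reading aid for the table, the sketch below recomputes two of its entries: a per-class F1 from precision and recall, and the macro F1 as the unweighted mean of the 28 per-class F1 scores. All numbers are copied from the table; the helper names are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class F1 scores, in table order (admiration ... neutral).
per_class_f1 = [
    0.599, 0.783, 0.443, 0.277, 0.287, 0.252, 0.342,
    0.552, 0.483, 0.163, 0.330, 0.491, 0.338, 0.344,
    0.560, 0.885, 0.000, 0.529, 0.801, 0.250, 0.511,
    0.480, 0.191, 0.200, 0.644, 0.464, 0.478, 0.645,
]

gratitude_f1 = f1(0.834, 0.943)                    # per-class F1 from P and R
macro_f1 = sum(per_class_f1) / len(per_class_f1)   # unweighted mean over classes
f1_of_macro_pr = f1(0.408, 0.495)                  # not the same quantity

print(f"gratitude F1:       {gratitude_f1:.3f}")
print(f"macro F1:           {macro_f1:.3f}")
print(f"F1 of macro P/R:    {f1_of_macro_pr:.3f}")
```

The macro F1 of 0.44 is the mean of the per-class F1 scores, so it differs slightly from the F1 computed from the macro-averaged precision and recall. Classes with zero F1, such as `grief` (support 6), pull the macro average down as much as any frequent class would.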
188
+
189
+
190
+
191
+ **Model uncertainty estimation**
192
+ ![epistemic_unc](outputs/epistemic_unc_scatter.png)
193
+
194
+ **Confusion matrix**
195
+ ![multi_label_confusion_matrix](outputs/confusion_matrix.png)
196
+
197
+
198
+
199
+ ## Workflow
200
+ ![EmCoder Workflow](outputs/workflow.png)
201
+
202
+
203
+ ### Note
204
+ This model was trained on the GoEmotions dataset (Reddit comments, i.e. the social-network domain), so it may not generalize well to other domains.
205
+
206
+
207
+ ## Citation
208
+ If you use this model, please cite it as follows:
209
+
210
+ ```bibtex
211
+ @software{jez2026emcoder,
212
+ author = {Václav Jež},
213
+ title = {EmCoder: Probabilistic Emotion Recognition \& Uncertainty Quantification},
214
+ year = {2026},
215
+ publisher = {GitHub},
216
+ journal = {GitHub repository},
217
+ howpublished = {\url{https://github.com/yezdata/emcoder}},
218
+ version = {1.0.0}
219
+ }
220
+ ```
emcoder.py ADDED
@@ -0,0 +1,155 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ from safetensors.torch import load_file
4
+ from pydantic import BaseModel, model_validator, field_validator
5
+
6
+
7
+ class ModelConfig(BaseModel):
8
+     vocab_size: int
9
+     max_seq_len: int
10
+
11
+     d_model: int
12
+     n_head: int
13
+     n_layers: int
14
+     d_ffn: int
15
+
16
+     dropout: float
17
+
18
+     num_labels: int
19
+     id2label: dict[int, str]
20
+     label2id: dict[str, int]
21
+
22
+     base_encoder_path: str
23
+
24
+     @field_validator("id2label", mode="before")
25
+     @classmethod
26
+     def coerce_keys_to_int(cls, v):
27
+         return {int(k): val for k, val in v.items()}
28
+
29
+     @model_validator(mode='after')
30
+     def check_consistency(self):
31
+         if len(self.id2label) != self.num_labels:
32
+             raise ValueError("num_labels does not match id2label dictionary len")
33
+         return self
34
+
35
+
36
+
37
+
38
+ class EmCoderCore(nn.Module):
39
+     """The core encoder architecture of EmCoder, without the classification head."""
40
+     def __init__(self, config: ModelConfig):
41
+         super().__init__()
42
+
43
+         self.token_embedding = nn.Embedding(
44
+             config.vocab_size,
45
+             config.d_model
46
+         )
47
+         self.pos_embedding = nn.Embedding(
48
+             config.max_seq_len,
49
+             config.d_model
50
+         )
51
+
52
+         self.embed_norm = nn.LayerNorm(config.d_model)
53
+
54
+         encoder_layer = nn.TransformerEncoderLayer(
55
+             d_model=config.d_model,
56
+             nhead=config.n_head,
57
+             dim_feedforward=config.d_ffn,
58
+             dropout=config.dropout,
59
+             activation="gelu",
60
+             norm_first=True,
61
+             batch_first=True
62
+         )
63
+         self.encoder = nn.TransformerEncoder(
64
+             encoder_layer=encoder_layer,
65
+             num_layers=config.n_layers
66
+         )
67
+
68
+         self.final_norm = nn.LayerNorm(config.d_model)
69
+         self.dropout = nn.Dropout(config.dropout)
70
+
71
+     def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
72
+         """Standard forward pass through the encoder."""
73
+         seq_len = x.size(1)
74
+         pos_ids = torch.arange(seq_len, device=x.device).unsqueeze(0)
75
+
76
+         x = self.token_embedding(x) + self.pos_embedding(pos_ids)
77
+
78
+         x = self.embed_norm(x)
79
+         x = self.dropout(x)
80
+
81
+         padding_mask = (mask == 0)
82
+
83
+         encoded = self.encoder(x, src_key_padding_mask=padding_mask)
84
+         return self.final_norm(encoded)
85
+
86
+
87
+
+ class EmCoder(nn.Module):
+     """The full EmCoder model, including the classification head."""
+     def __init__(self, encoder: EmCoderCore, config: ModelConfig):
+         super().__init__()
+
+         self.encoder = encoder
+         self.config = config
+
+         self.classifier = nn.Sequential(
+             nn.Linear(config.d_model, config.d_model),
+             nn.GELU(),
+             nn.Dropout(config.dropout),
+             nn.Linear(config.d_model, config.num_labels)
+         )
+
+
+     def _set_mc_dropout(self, active: bool = True):
+         for m in self.modules():
+             if isinstance(m, nn.Dropout):
+                 m.train(active)
+
+
+     @classmethod
+     def from_pretrained(cls, emcoder_path: str):
+         """Loads the EmCoder model from the specified directory."""
+         # Use model_config.json to initialize the same parameters as in training
+         with open(f"{emcoder_path}/model_config.json", "r") as f:
+             model_config = ModelConfig.model_validate_json(f.read())
+
+
+         encoder = EmCoderCore(model_config)
+         model = cls(encoder, model_config)
+
+         state_dict = load_file(f"{emcoder_path}/model.safetensors")
+         model.load_state_dict(state_dict, strict=True)
+         return model
+
+
+     @staticmethod
+     def _masked_mean_pooling(features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
+         mask = mask.unsqueeze(-1)  # (B, S, 1)
+         masked_features = features * mask  # (B, S, D)
+         sum_masked_features = masked_features.sum(dim=1)  # (B, D)
+         count_tokens = torch.clamp(mask.sum(dim=1), min=1e-9)  # (B, 1)
+         return sum_masked_features / count_tokens  # (B, D)
+
+
+     def mc_forward(self, x: torch.Tensor, mask: torch.Tensor, n_samples: int) -> torch.Tensor:
+         """Performs Monte Carlo Dropout inference to quantify epistemic uncertainty."""
+         self._set_mc_dropout(active=True)
+
+         B, S = x.shape
+         x_stacked = x.repeat(n_samples, 1)  # (n_samples * B, S)
+         mask_stacked = mask.repeat(n_samples, 1)
+
+         features = self.encoder(x_stacked, mask_stacked)
+         pooled = self._masked_mean_pooling(features, mask_stacked)
+         logits = self.classifier(pooled)  # (n_samples * B, num_labels)
+
+         return logits.view(n_samples, B, -1)
+
+
+     def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
+         """Standard forward pass without MC Dropout."""
+         features = self.encoder(x, mask)
+
+         pooled = self._masked_mean_pooling(features, mask)
+         return self.classifier(pooled)
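`mc_forward` returns raw logits of shape `(n_samples, B, num_labels)` and leaves the aggregation to the caller. The following is a minimal sketch of one way to reduce those samples to a prediction and an uncertainty score. It assumes independent sigmoid outputs per label (an assumption: the diff does not show the loss or the output activation), and both the helper name `summarize_mc_logits` and the use of per-label variance as the uncertainty measure are illustrative, not part of the repository.

```python
import torch


def summarize_mc_logits(mc_logits: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Reduce MC Dropout samples to mean probabilities and per-label variance.

    mc_logits: (n_samples, B, num_labels), e.g. the output of EmCoder.mc_forward.
    Hypothetical helper; assumes sigmoid (multi-label) outputs.
    """
    probs = torch.sigmoid(mc_logits)             # (n_samples, B, num_labels)
    mean_probs = probs.mean(dim=0)               # (B, num_labels)
    # Spread across the stochastic passes; unbiased=False so that
    # n_samples == 1 yields zeros instead of NaN.
    variance = probs.var(dim=0, unbiased=False)  # (B, num_labels)
    return mean_probs, variance


if __name__ == "__main__":
    torch.manual_seed(0)
    # Random stand-in for model.mc_forward(x, mask, n_samples=20)
    fake_logits = torch.randn(20, 4, 28)
    mean_probs, variance = summarize_mc_logits(fake_logits)
    print(mean_probs.shape, variance.shape)
```

A label whose variance is high across passes is one where the dropout-perturbed sub-networks disagree, which is the usual reading of epistemic uncertainty under MC Dropout.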
emcoder/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f79307191a44f91b6c9b7e2373062bd655a38efef31a16831e7629d18ce33f50
+ size 328565600
emcoder/model_config.json ADDED
@@ -0,0 +1,71 @@
+ {
+   "vocab_size": 50265,
+   "max_seq_len": 512,
+   "d_model": 768,
+   "n_head": 12,
+   "n_layers": 6,
+   "d_ffn": 3072,
+   "dropout": 0.15,
+   "num_labels": 28,
+   "id2label": {
+     "0": "admiration",
+     "1": "amusement",
+     "2": "anger",
+     "3": "annoyance",
+     "4": "approval",
+     "5": "caring",
+     "6": "confusion",
+     "7": "curiosity",
+     "8": "desire",
+     "9": "disappointment",
+     "10": "disapproval",
+     "11": "disgust",
+     "12": "embarrassment",
+     "13": "excitement",
+     "14": "fear",
+     "15": "gratitude",
+     "16": "grief",
+     "17": "joy",
+     "18": "love",
+     "19": "nervousness",
+     "20": "optimism",
+     "21": "pride",
+     "22": "realization",
+     "23": "relief",
+     "24": "remorse",
+     "25": "sadness",
+     "26": "surprise",
+     "27": "neutral"
+   },
+   "label2id": {
+     "admiration": 0,
+     "amusement": 1,
+     "anger": 2,
+     "annoyance": 3,
+     "approval": 4,
+     "caring": 5,
+     "confusion": 6,
+     "curiosity": 7,
+     "desire": 8,
+     "disappointment": 9,
+     "disapproval": 10,
+     "disgust": 11,
+     "embarrassment": 12,
+     "excitement": 13,
+     "fear": 14,
+     "gratitude": 15,
+     "grief": 16,
+     "joy": 17,
+     "love": 18,
+     "nervousness": 19,
+     "optimism": 20,
+     "pride": 21,
+     "realization": 22,
+     "relief": 23,
+     "remorse": 24,
+     "sadness": 25,
+     "surprise": 26,
+     "neutral": 27
+   },
+   "base_encoder_path": "models/v1/pretrain/checkpoints/epoch_2/step_40000"
+ }
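The configuration above fixes the size of every weight tensor, so it can be cross-checked against the 328565600-byte size recorded in the `model.safetensors` LFS pointer. The sketch below counts parameters for the layer layout shown in the model file, assuming PyTorch's default bias terms and 4-byte float32 storage (both assumptions). The function name `count_parameters` is illustrative and not part of the repository.

```python
def count_parameters(vocab_size: int, max_seq_len: int, d_model: int,
                     n_layers: int, d_ffn: int, num_labels: int) -> int:
    """Count parameters for the EmCoder layout (embeddings, encoder, head)."""
    layer_norm = 2 * d_model                                   # weight + bias
    embeddings = (vocab_size + max_seq_len) * d_model          # token + position tables
    attention = 4 * d_model * d_model + 4 * d_model            # Q, K, V, output projections with biases
    feed_forward = 2 * d_model * d_ffn + d_ffn + d_model       # two linear layers with biases
    encoder_layer = attention + feed_forward + 2 * layer_norm  # plus norm1 and norm2
    classifier = (d_model * d_model + d_model) + (d_model * num_labels + num_labels)
    # The extra 2 * layer_norm covers embed_norm and final_norm.
    return embeddings + n_layers * encoder_layer + 2 * layer_norm + classifier


n_params = count_parameters(vocab_size=50265, max_seq_len=512, d_model=768,
                            n_layers=6, d_ffn=3072, num_labels=28)
file_size = 328_565_600  # bytes, from the model.safetensors LFS pointer

print(n_params)                  # 82139164
print(file_size - 4 * n_params)  # 8944
```

Under these assumptions the weights account for all but 8944 bytes of the file, a remainder small enough to be plausibly explained by the safetensors header. `n_head` does not appear in the count because splitting attention into heads does not change the projection sizes.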
emcoder/model_state.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "train_loss": 0.264223575592041,
+   "eval_loss": 0.2328128303236821
+ }
emcoder/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
emcoder/tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
emcoder/train_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "bayesian_train": true,
+   "loss_weights": "log",
+   "tokenized_ds_dir": "data/goemotions_v1_seq512",
+   "encoder_lr": 0.00001,
+   "head_lr": 0.0005,
+   "lr_warmup": 0.05,
+   "weight_decay": 0.01,
+   "batch_size": 32,
+   "gradient_accumulation_steps": 8,
+   "num_epochs": 10
+ }
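With `batch_size` 32 and `gradient_accumulation_steps` 8, each optimizer step sees 32 × 8 = 256 examples. The sketch below derives step counts from these values. It assumes `lr_warmup` is a fraction of the total optimizer steps and that a partial final accumulation window still triggers a step (neither is confirmed by the diff). The `num_examples` argument is a hypothetical placeholder, since the dataset size is not part of this commit.

```python
import math


def schedule_summary(num_examples: int, batch_size: int = 32, grad_accum: int = 8,
                     num_epochs: int = 10, lr_warmup: float = 0.05) -> dict:
    """Derive optimizer-step counts from the train_config.json values.

    Assumes lr_warmup is a fraction of total optimizer steps (an assumption).
    """
    effective_batch = batch_size * grad_accum                    # examples per optimizer step
    steps_per_epoch = math.ceil(num_examples / effective_batch)  # round up the partial window
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = int(lr_warmup * total_steps)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


summary = schedule_summary(num_examples=40_000)  # hypothetical dataset size
print(summary)
```

The two learning rates in the config (`encoder_lr` at 1e-5, `head_lr` at 5e-4) suggest the pretrained encoder is updated more gently than the freshly initialized classification head. How they are applied is not shown in this commit.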
outputs/architecture.png ADDED

Git LFS Details

  • SHA256: 229562b918c3486a436e38527897d2e7087c65b0e16a3ad957063e2efa2217ad
  • Pointer size: 131 Bytes
  • Size of remote file: 378 kB
outputs/confusion_matrix.png ADDED
outputs/epistemic_unc_scatter.png ADDED

Git LFS Details

  • SHA256: 0ee421e2f00a10cb2f4839060cd3b8a9245dbe97680355774f9e18c336f37467
  • Pointer size: 131 Bytes
  • Size of remote file: 440 kB
outputs/workflow.png ADDED

Git LFS Details

  • SHA256: 4b500480998dcb30822266bcd6929af399c82c312fa329ddbdecb74c1fa188ac
  • Pointer size: 131 Bytes
  • Size of remote file: 608 kB
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ pydantic>=2.13.3
+ torch>=2.11.0
+ transformers>=5.7.0
+ huggingface-hub>=1.13.0
+ safetensors>=0.7.0