maadgrom committed on
Commit 0cecd81 · verified · 1 Parent(s): d78d445

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +164 -0
  2. manifest.json +3 -2
README.md ADDED
@@ -0,0 +1,164 @@
+ ---
+ library_name: transformers.js
+ pipeline_tag: token-classification
+ base_model: microsoft/xtremedistil-l6-h256-uncased
+ language:
+ - en
+ license: cc-by-4.0
+ tags:
+ - token-classification
+ - slot-filling
+ - ner
+ - transformers.js
+ - onnx
+ - quantized
+ datasets:
+ - AmazonScience/massive
+ metrics:
+ - f1
+ - precision
+ - recall
+ - accuracy
+ model-index:
+ - name: notes-slots
+   results:
+   - task:
+       type: token-classification
+       name: Slot Extraction
+     dataset:
+       name: MASSIVE en-US (+ synthetic productivity/realistic)
+       type: AmazonScience/massive
+     metrics:
+     - type: f1
+       value: 0.8535
+       name: Token F1 (fp32)
+     - type: f1
+       value: 0.7392
+       name: Token F1 (q8, shipped)
+     - type: accuracy
+       value: 0.9550
+       name: Accuracy (fp32)
+ ---
+
+ # notes-slots
+
+ Compact token-classification model that extracts scheduling/task slots from short
+ English notes (**participants**, **datetimes**, **priorities**, and
+ **recurrences**) and runs fully client-side via [Transformers.js](https://huggingface.co/docs/transformers.js).
+
+ The shipped artifact is an **INT8-quantized ONNX** bundle (~13 MB) intended for
+ in-browser WASM inference, not a PyTorch checkpoint.
+
+ ## Model details
+
+ | Property | Value |
+ |---|---|
+ | Base model | [`microsoft/xtremedistil-l6-h256-uncased`](https://huggingface.co/microsoft/xtremedistil-l6-h256-uncased) (MIT) |
+ | Architecture | `BertForTokenClassification`: 6 layers, hidden size 256, 8 heads, intermediate 1024, vocab 30522, max positions 512 (~13M params) |
+ | Task | Token classification (BIO slot tagging) |
+ | Schema version | `slot-labels-v0.3.0` |
+ | Model version | 0.1.0 |
+ | Languages | English |
+ | Runtime | Transformers.js v4, WASM device, dtype `q8` |
+ | Bundle size | 13.32 MB |
+ | `transformers` (training) | 4.57.6 |
+ | License | CC BY 4.0 |
+
+ ### Labels (9, BIO)
+
+ `O`, `B-PARTICIPANT`, `I-PARTICIPANT`, `B-PRIORITY`, `I-PRIORITY`,
+ `B-DATETIME`, `I-DATETIME`, `B-RECURRENCE`, `I-RECURRENCE`
+
+ A bundled `transitions.json` carries empirical BIO transition log-probabilities
+ (Laplace-smoothed, invalid transitions hard-zeroed) for optional Viterbi-style
+ decoding on top of the raw token logits.
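
The optional Viterbi-style decoding mentioned in the README can be sketched roughly as below. This is a minimal sketch, not the repo's decoder: the layout of `transitions.json` is not shown in this commit, so the function assumes the file has already been loaded into a K×K array `transitions[from][to]` of log-probabilities, with `-Infinity` standing in for invalid transitions, and that `emissions[t][k]` holds per-token log-scores (for example, a log-softmax of the logits). The names `viterbi`, `emissions`, and `transitions` are hypothetical.

```js
// Viterbi decoding: find the label sequence with the highest total score,
// where the score is the sum of per-token emissions and label-to-label transitions.
// ASSUMPTION: transitions[from][to] is a log-probability, -Infinity if invalid.
function viterbi(emissions, transitions) {
  const T = emissions.length;
  if (T === 0) return [];
  const K = emissions[0].length;

  let score = emissions[0].slice(); // best score of any path ending in each label
  const back = [];                  // back[t - 1][j] = best previous label for j at step t

  for (let t = 1; t < T; t++) {
    const next = new Array(K).fill(-Infinity);
    const ptr = new Array(K).fill(0);
    for (let j = 0; j < K; j++) {
      for (let i = 0; i < K; i++) {
        const s = score[i] + transitions[i][j] + emissions[t][j];
        if (s > next[j]) {
          next[j] = s;
          ptr[j] = i;
        }
      }
    }
    back.push(ptr);
    score = next;
  }

  // Pick the best final label, then follow the back-pointers to the start.
  let best = 0;
  for (let j = 1; j < K; j++) if (score[j] > score[best]) best = j;
  const path = [best];
  for (let t = back.length - 1; t >= 0; t--) {
    best = back[t][best];
    path.unshift(best);
  }
  return path;
}

// Toy example with three labels: 0 = O, 1 = B-X, 2 = I-X; O -> I-X is invalid.
const toyTransitions = [
  [0, 0, -Infinity],
  [0, 0, 0],
  [0, 0, 0],
];
const toyEmissions = [
  [0, -2, -3],    // token 0 prefers O
  [-3, -1, -0.5], // token 1 narrowly prefers I-X, which cannot follow O
];
const toyPath = viterbi(toyEmissions, toyTransitions);
console.log(toyPath); // [0, 1], i.e. O then B-X
```

A per-token argmax on the toy input would pick `O, I-X`, which is not a valid BIO sequence; the transition scores steer the decoder to `O, B-X` instead. The sketch does not constrain the first label, so a sequence may still start with an `I-` tag unless a start constraint is added.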
+
+ ## Intended use
+
+ - **In scope:** extracting participants / datetimes / priorities / recurrence
+   cues from short, informal English notes and reminders (calendar, to-do, email
+   intent style text).
+ - **Out of scope:** long documents, languages other than English, normalization
+   of extracted spans into structured datetimes (use a downstream parser such as
+   `chrono-node` for that), and any high-stakes decisioning.
+
+ ## Usage (Transformers.js)
+
+ ```js
+ import { pipeline } from "@huggingface/transformers";
+
+ // Load the token-classification pipeline with the INT8 (q8) ONNX weights.
+ const tagger = await pipeline("token-classification", "jottypro/notes-slots", {
+   dtype: "q8",
+ });
+
+ // Tag a short note; the result is a list of per-token predictions.
+ const out = await tagger("call Sarah next Friday at 5pm, high priority, every week");
+ console.log(out);
+ ```
+
+ The ONNX weights live at `onnx/model_quantized.onnx`, which is the layout
+ Transformers.js expects when loading from the Hub.
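
Because the pipeline tags individual tokens, callers usually need to merge `B-`/`I-` tags into spans before handing text to a downstream parser. The sketch below shows one way to do that. It assumes each prediction carries `entity`, `word`, and `index` fields and that WordPiece continuation pieces start with `##`; the helper name `groupSpans` and the sample tokens are hypothetical and are not actual model output.

```js
// Merge per-token BIO predictions into labelled spans.
// ASSUMPTION: tokens look like [{ entity: "B-DATETIME", word: "next", index: 3 }, ...].
function groupSpans(tokens) {
  const spans = [];
  let current = null; // span being built: { type, text, lastIndex }

  const flush = () => {
    if (current) spans.push({ type: current.type, text: current.text });
    current = null;
  };

  for (const { entity, word, index } of tokens) {
    if (entity === "O") {
      flush();
      continue;
    }
    const prefix = entity.slice(0, 1);       // "B" or "I"
    const type = entity.slice(2);            // e.g. "DATETIME"
    const isSubword = word.startsWith("##"); // WordPiece continuation piece
    const continues =
      current !== null &&
      current.type === type &&
      index === current.lastIndex + 1 &&
      (prefix === "I" || isSubword);

    if (continues) {
      // Subword pieces are glued on; whole words are joined with a space.
      current.text += isSubword ? word.slice(2) : " " + word;
      current.lastIndex = index;
    } else {
      // A B- tag, a type change, a gap, or a stray I- tag starts a new span.
      flush();
      current = { type, text: word, lastIndex: index };
    }
  }
  flush();
  return spans;
}

// Hand-written sample input for illustration only.
const sampleTokens = [
  { entity: "B-PARTICIPANT", word: "sarah", index: 2 },
  { entity: "B-DATETIME", word: "next", index: 3 },
  { entity: "I-DATETIME", word: "friday", index: 4 },
  { entity: "I-DATETIME", word: "at", index: 5 },
  { entity: "I-DATETIME", word: "5", index: 6 },
  { entity: "I-DATETIME", word: "##pm", index: 7 },
];
console.log(groupSpans(sampleTokens));
```

The resulting `DATETIME` span text (`next friday at 5pm` for the sample) is what would be passed to a date parser, since the model itself does no normalization.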
+
+ ## Training
+
+ - **Data:** [AmazonScience/MASSIVE](https://huggingface.co/datasets/AmazonScience/massive)
+   `en-US` (config `en-US`, revision `d2362678…`), filtered to the
+   `calendar` / `datetime` / `email` / `lists` scenarios with MASSIVE slots
+   remapped onto the local 4-slot schema (e.g. `person`/`relation`/`email_address`
+   → `PARTICIPANT`, `date`/`time`/`time_zone` → `DATETIME`,
+   `general_frequency` → `RECURRENCE`), combined with synthetic
+   *productivity* and *realistic* note generators.
+ - **Augmentation:** light, training-split only (`AUGMENT_FACTOR=2`): random
+   filler-word prefix, trailing punctuation, occasional `O`-token dropout;
+   deduplicated against originals.
+ - **Hyperparameters:** 10 epochs with early stopping (patience 2, restore best
+   by F1), batch size 64, learning rate 5e-5, cosine schedule, warmup ratio 0.1,
+   weight decay 0.01, label smoothing 0.1, max sequence length 128, seed 42.
+ - **Quantization:** dynamic, per-channel `QInt8`, applied to `MatMul` and
+   `Gather` ops via ONNX Runtime.
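
The slot remapping described in the **Data** bullet can be illustrated with a small lookup. This is only a sketch of the idea: the README lists the mapping with "e.g.", so the table below may be incomplete, and both the choice to send unmapped slots to `O` and the assumption that the source tags are already in BIO form are mine rather than the author's. The names `SLOT_MAP` and `remapTag` are hypothetical.

```js
// Source-slot -> local-slot lookup, limited to the pairs named in the README.
const SLOT_MAP = new Map([
  ["person", "PARTICIPANT"],
  ["relation", "PARTICIPANT"],
  ["email_address", "PARTICIPANT"],
  ["date", "DATETIME"],
  ["time", "DATETIME"],
  ["time_zone", "DATETIME"],
  ["general_frequency", "RECURRENCE"],
]);

// Remap one BIO tag such as "B-person" onto the local schema.
// ASSUMPTION: slots without a mapping are treated as outside ("O").
function remapTag(tag) {
  if (tag === "O") return "O";
  const prefix = tag.slice(0, 2); // "B-" or "I-"
  const target = SLOT_MAP.get(tag.slice(2));
  return target === undefined ? "O" : prefix + target;
}

console.log(["O", "B-person", "B-date", "I-date", "B-place_name"].map(remapTag));
```

One consequence of a many-to-one mapping is that adjacent source slots (for example a `date` followed by a `time`) become two back-to-back `DATETIME` spans; whether the training pipeline merges them is not stated in the README.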
+
+ ## Evaluation
+
+ Token-level metrics (seqeval) on the held-out test split (n ≈ 559).
+ **The q8 column reflects the artifact actually shipped in this repo.**
+
+ | Metric | fp32 | q8 (shipped) |
+ |---|---|---|
+ | Accuracy | 0.9550 | 0.9050 |
+ | Precision | 0.8283 | 0.8718 |
+ | Recall | 0.8802 | 0.6416 |
+ | **F1** | **0.8535** | **0.7392** |
+ | DATETIME F1 | 0.8191 | 0.6724 |
+ | PARTICIPANT F1 | 0.9208 | 0.8943 |
+ | PRIORITY F1 | 0.7979 | 0.6316 |
+ | RECURRENCE F1 | 0.8981 | 0.7093 |
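
As a quick consistency check on the table, the overall F1 values can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1Score` is hypothetical; the inputs are copied from the Precision and Recall rows.

```js
// Harmonic mean of precision and recall; returns 0 when both are 0.
function f1Score(precision, recall) {
  const sum = precision + recall;
  return sum === 0 ? 0 : (2 * precision * recall) / sum;
}

console.log(f1Score(0.8283, 0.8802).toFixed(4)); // fp32 -> "0.8535"
console.log(f1Score(0.8718, 0.6416).toFixed(4)); // q8   -> "0.7392"
```

Both results match the **F1** row, which also shows why the q8 score falls despite higher precision: the harmonic mean is pulled toward the lower of the two inputs, here the reduced recall.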
+
+ ## Limitations and bias
+
+ - **Quantization cost:** INT8 quantization raises precision slightly but cuts
+   recall substantially (0.88 → 0.64; F1 0.85 → 0.74). The shipped model misses
+   more true spans than the fp32 model; tune downstream thresholds accordingly.
+ - **Domain:** trained on short calendar/task-style English notes plus synthetic
+   data; expect degradation on long-form text, other domains, or other languages.
+ - **Synthetic data:** part of the training distribution is generated, so phrasing
+   diversity and demographic coverage of names/relations are limited and may carry
+   generator biases.
+ - **No span normalization:** the model tags spans only; converting a `DATETIME`
+   span to an actual timestamp is a downstream concern.
+
+ ## License and attribution
+
+ Released under **CC BY 4.0**, consistent with the MASSIVE training data
+ (CC BY 4.0). The base model `microsoft/xtremedistil-l6-h256-uncased` is MIT.
+
+ Part of the training data is derived from the MASSIVE dataset; CC BY 4.0
+ requires attribution to that source:
+
+ ```bibtex
+ @misc{fitzgerald2022massive,
+   title = {MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages},
+   author = {FitzGerald, Jack and others},
+   year = {2022},
+   eprint = {2204.08582},
+   archivePrefix = {arXiv}
+ }
+ ```
manifest.json CHANGED
@@ -1,8 +1,9 @@
  {
    "base_model": "microsoft/xtremedistil-l6-h256-uncased",
-   "bundle_size_mb": 13.32,
-   "created_at": "2026-05-03T11:46:52.122552+00:00",
+   "bundle_size_mb": 13.325,
+   "created_at": "2026-05-17T10:39:56.428239+00:00",
    "files": {
+     "README.md": "e2d57051221c694b32de2b4152770940ea68dfeae6339581eaa6b6bfa710ae00",
      "config.json": "a42323a8b73212169370a0265018350775bc8556e795c3e3348293ce877f0c26",
      "onnx/model_quantized.onnx": "0b9edd8df5baa28b4a6e4e7b07c796234ab2e19280e38eb7e1705d09462f143f",
      "special_tokens_map.json": "b6d346be366a7d1d48332dbc9fdf3bf8960b5d879522b7799ddba59e76237ee3",