first

Files changed (14)

README.md +199 -3
config.json +165 -0
onnx/model.onnx +3 -0
onnx/model_bnb4.onnx +3 -0
onnx/model_fp16.onnx +3 -0
onnx/model_int8.onnx +3 -0
onnx/model_q4.onnx +3 -0
onnx/model_q4f16.onnx +3 -0
onnx/model_quantized.onnx +3 -0
onnx/model_uint8.onnx +3 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +62 -0
vocab.txt +0 -0

README.md CHANGED

@@ -1,3 +1,199 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language: multilingual
+library_name: transformers.js
+pipeline_tag: text-classification
+base_model: huawei-noah/TinyBERT_General_4L_312D
+tags:
+  - autofill
+  - field-classification
+  - bert
+  - tinybert
+  - onnx
+  - transformers.js
+  - browser
+---
+# TinyBERT Address Autofill
+A compact field-type classifier for HTML form autofill. Given a string
+describing a single form field's attributes, it predicts one of 66 classes:
+65 autofill field types (`given-name`, `family-name`, `email`, `postal-code`,
+`address-line1`, `cc-number`, etc.) plus `other` for fields that should not be
+filled.
+The model is fine-tuned from `huawei-noah/TinyBERT_General_4L_312D` on a
+corpus of manually annotated shopping and address forms collected by Mozilla, and is
+intended to run client-side inside Firefox (or any Transformers.js host) as
+a replacement or augmentation for the existing regex-based heuristic field
+detector.
+## ONNX variants
+All variants live under `onnx/` and are loadable through Transformers.js by
+passing the corresponding `dtype` argument.
+| File | Precision | Size | Transformers.js `dtype` |
+| --- | --- | ---: | --- |
+| `onnx/model.onnx` | fp32 | 57.6 MB | `fp32` |
+| `onnx/model_fp16.onnx` | fp16 | 28.9 MB | `fp16` |
+| `onnx/model_quantized.onnx` | int8 dynamic (default) | 14.6 MB | `q8` |
+| `onnx/model_int8.onnx` | int8 dynamic (same file contents as `model_quantized.onnx`) | 14.6 MB | `int8` |
+| `onnx/model_uint8.onnx` | uint8 dynamic | 14.6 MB | `uint8` |
+| `onnx/model_q4.onnx` | 4-bit weight-only on MatMul | 42.3 MB | `q4` |
+| `onnx/model_q4f16.onnx` | 4-bit on top of fp16 | 22.4 MB | `q4f16` |
+| `onnx/model_bnb4.onnx` | bitsandbytes NF4 | 41.9 MB | `bnb4` |
+## How to use
+### Transformers.js (browser)
+```js
+import { pipeline } from "@huggingface/transformers";
+const classifier = await pipeline(
+  "text-classification",
+  "vazish/tinybert-address-autofill",
+  { dtype: "q8" }   // "q8" is the smallest file; try "fp16" for highest fidelity
+);
+const out = await classifier(
+  "a-c-postal-code billing zip code dwfrm billing address fields postal code"
+);
+// → [{ label: "postal-code", score: 0.99 }]
+```
+### Python (Optimum + ONNX Runtime)
+```python
+from optimum.onnxruntime import ORTModelForSequenceClassification
+from transformers import AutoTokenizer, pipeline
+model = ORTModelForSequenceClassification.from_pretrained(
+    "vazish/tinybert-address-autofill",
+    file_name="onnx/model.onnx",   # or onnx/model_quantized.onnx, etc.
+)
+tokenizer = AutoTokenizer.from_pretrained("vazish/tinybert-address-autofill")
+clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
+clf("email email mail **email")
+# → [{"label": "email", "score": 0.99}]
+```
+## Input format
+The model expects a single string per field, built by concatenating that
+field's HTML attributes after light normalisation:
+1. Concatenate (in order): `type` + `autocomplete` + `id` + `name` +
+   `placeholder` + the field's computed `<label>` text.
+2. Split camelCase boundaries to whitespace (`firstName` → `first name`).
+3. Lowercase the whole thing.
+4. If the field declares an `autocomplete` attribute, prepend an
+   `a-c-<value>` token (e.g. `a-c-postal-code`).
+5. Optionally include adjacent-field context — `bb`-prefixed tokens for
+   the previous field on the same form and `aa`-prefixed tokens for the
+   next. Including adjacent context improves accuracy by roughly 8 percentage
+   points relative to the same model trained on isolated fields.
+Example input for a "first name" field followed by a "last name" field:
+```
+first name first name enter first name aaa-c-family-name aalast aaname
+```
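The normalisation steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's actual preprocessing code: the names `field_text` and `build_input`, the dict-based field representation, and the `label` key are hypothetical. The sketch follows step 1 literally, so a declared `autocomplete` value shows up both as the `a-c-` token and as a raw token, which may differ from the real pipeline. It also assumes `autocomplete` holds a single token.

```python
import re

# Attribute order from step 1; "label" stands in for the computed <label> text (assumed key name).
ATTRS = ("type", "autocomplete", "id", "name", "placeholder", "label")


def field_text(field):
    """Apply steps 1-4 to one field (a dict of attribute name -> string) and return its tokens."""
    # Step 1: concatenate the attributes in order; missing attributes count as empty.
    raw = " ".join(field.get(attr, "") for attr in ATTRS)
    # Step 2: put a space at every lowercase/digit -> uppercase boundary (firstName -> first Name).
    split = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", raw)
    # Step 3: lowercase, then break into whitespace-separated tokens.
    tokens = split.lower().split()
    # Step 4: prepend the a-c-<value> marker when autocomplete is declared.
    autocomplete = field.get("autocomplete", "").strip().lower()
    if autocomplete:
        tokens.insert(0, "a-c-" + autocomplete)
    return tokens


def build_input(field, prev=None, next_=None):
    """Step 5: append bb-prefixed tokens for the previous field and aa-prefixed tokens for the next."""
    tokens = field_text(field)
    if prev is not None:
        tokens += ["bb" + token for token in field_text(prev)]
    if next_ is not None:
        tokens += ["aa" + token for token in field_text(next_)]
    return " ".join(tokens)


current = {"id": "firstName", "name": "firstName", "placeholder": "Enter first name"}
following = {"autocomplete": "family-name", "id": "lastName"}
print(build_input(current, next_=following))
# first name first name enter first name aaa-c-family-name aafamily-name aalast aaname
```

The printed string differs from the README example by the extra `aafamily-name` token, which is the visible effect of keeping the raw `autocomplete` value as well as the marker.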
+## Training
+| | |
+| --- | --- |
+| Base model | `huawei-noah/TinyBERT_General_4L_312D` (4 layers, hidden 312, intermediate 1200, 12 heads, ~14M params, max sequence length 512) |
+| Head | `BertForSequenceClassification`, 66 output classes |
+| Training set | ~360 real shopping / checkout / address forms, 6,691 labelled fields |
+| Validation / test | ~246 forms, 4,300 fields, split into validation and test |
+| Regions covered | US, CA, GB, FR, DE, BR, ES, JP, AT, IN, IT, PL, AU, CH (supported); some additional regions also represented for evaluation |
+| Optimizer / schedule | Hugging Face `Trainer` defaults, 50 epochs |
+| Hardware | Apple M1 MacBook Pro, ~75 minutes wall time |
+Each form field is annotated with `data-mozautofill-type="<type>"` set to
+the expected autofill class; fields that should not be filled receive no
+attribute and are mapped to `other`.
+## Evaluation
+Evaluated on the project's held-out test set (2,168 labelled fields drawn
+from real address / shopping forms) using ONNX Runtime on CPU.
+- **Total** — strict exact-match accuracy.
+- **Close** — counts predictions on closely related labels as correct
+  (e.g. `street-address` predicted when ground truth is `address-line1`,
+  `tel` predicted when ground truth is `tel-national`).
+- **Blank** — false-fill rate. Fraction of `other`-labelled fields the
+  model predicted as a real autofill type. Lower is better; this metric
+  matters most for user experience because high false-fill means filling
+  search boxes, comments, and gift-card fields with personal data.
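The three metrics defined above can be computed from parallel lists of true and predicted labels. The sketch below is illustrative only: the function name `evaluate` and the `CLOSE_PAIRS` set are hypothetical, and the set contains just the two "close" pairs named in the text, treated as directional (truth, prediction). The project's full list of close pairs is not given here.

```python
# (ground truth, prediction) pairs counted as "close"; only the two examples from the text.
CLOSE_PAIRS = {
    ("address-line1", "street-address"),
    ("tel-national", "tel"),
}


def evaluate(y_true, y_pred, close_pairs=CLOSE_PAIRS):
    """Return Total, Close and Blank as fractions in [0, 1]."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    # Total: strict exact-match accuracy.
    exact = sum(truth == pred for truth, pred in pairs)
    # Close: exact matches plus predictions on a closely related label.
    close = sum(truth == pred or (truth, pred) in close_pairs for truth, pred in pairs)
    # Blank: share of `other`-labelled fields predicted as a real autofill type.
    blank_preds = [pred for truth, pred in pairs if truth == "other"]
    false_fills = sum(pred != "other" for pred in blank_preds)
    return {
        "total": exact / len(pairs),
        "close": close / len(pairs),
        "blank": false_fills / len(blank_preds) if blank_preds else 0.0,
    }


y_true = ["address-line1", "tel-national", "other", "other", "email"]
y_pred = ["street-address", "tel", "other", "email", "email"]
print(evaluate(y_true, y_pred))
# {'total': 0.4, 'close': 0.8, 'blank': 0.5}
```

In this toy example two of five predictions are exact, two more are close, and one of the two `other` fields is falsely filled.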
+| Variant | Total | Close | Blank | Throughput (CPU) |
+| --- | ---: | ---: | ---: | ---: |
+| fp32 | **89.62%** | 91.51% | 2.40% | ~218/s |
+| fp16 | **89.71%** | 91.61% | 2.31% | ~132/s |
+| bnb4 | 88.42% | 90.64% | 2.77% | ~214/s |
+| q4 | 88.01% | 90.54% | 2.58% | ~209/s |
+| q4f16 | 88.01% | 90.54% | 2.58% | ~95/s |
+| uint8 | 87.27% | 89.53% | 3.27% | ~163/s |
+| int8 / quantized | 84.82% | 87.73% | **1.94%** | ~257/s |
+For reference, the existing Firefox regex-based heuristic detector reaches
+roughly 85% total accuracy on comparable test sets.
+Highlights:
+- **fp16** matches fp32 to within about 0.1 percentage points on every
+  metric while halving the file size. It is the recommended high-fidelity
+  variant. CPU throughput is about 40% lower than fp32 (~132/s vs ~218/s)
+  because most CPUs lack native fp16 ops, but the gap closes on hardware
+  with fp16 support and on WebGPU.
+- **int8 / quantized** has the lowest exact accuracy but **the lowest
+  false-fill rate of any variant** (1.94%, below the fp32 baseline). It
+  errs toward `other` when uncertain — the safer failure mode for an
+  autofill UI. This is the recommended size-constrained default.
+- 4-bit variants (`q4`, `q4f16`, `bnb4`) cluster around 88% total accuracy;
+  `q4f16` is the smallest of the three at 22.4 MB.
+## Limitations
+- Trained primarily on the supported-region list above. Accuracy on
+  regions with no training data drops ~5–10 percentage points;
+  adding region-specific samples to the training set typically recovers
+  most of that gap.
+- Underrepresented field types (`address-line3`, `additional-name`,
+  `phonetic-*`, `tel-local-prefix`, etc.) have very few training examples
+  and are sometimes confidently misclassified.
+- Reduced-precision variants disagree with fp32 on roughly 0.1% (`fp16`)
+  to ~5% (`int8`) of inputs. The evaluation table above shows the net
+  effect of those disagreements on each metric.
+- The model assumes the preprocessing described under "Input format"
+  (camelCase-split, lowercased, with optional `a-c-`/`bb`/`aa` markers).
+  Feeding raw HTML attribute strings without this normalisation will
+  degrade accuracy.
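The disagreement figures in the list above are the fraction of inputs on which a variant's top label differs from the fp32 label. A minimal sketch of that calculation, assuming the predictions of both variants have already been collected as label lists (the function name `disagreement_rate` is hypothetical):

```python
def disagreement_rate(reference_labels, variant_labels):
    """Fraction of inputs where the variant's label differs from the reference (e.g. fp32) label."""
    if len(reference_labels) != len(variant_labels):
        raise ValueError("both label lists must cover the same inputs")
    if not reference_labels:
        return 0.0
    differing = sum(ref != var for ref, var in zip(reference_labels, variant_labels))
    return differing / len(reference_labels)


fp32_labels = ["email", "tel", "other", "name"]
int8_labels = ["email", "tel", "other", "given-name"]
print(disagreement_rate(fp32_labels, int8_labels))
# 0.25
```

Note that this measures agreement with fp32, not correctness: a variant can disagree with fp32 and still be right on a given field.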
+## Citation
+This model is built on TinyBERT:
+```bibtex
+@inproceedings{jiao-etal-2020-tinybert,
+  title     = {{TinyBERT}: Distilling {BERT} for Natural Language Understanding},
+  author    = {Jiao, Xiaoqi and Yin, Yichun and Shang, Lifeng and Jiang, Xin
+               and Chen, Xiao and Li, Linlin and Wang, Fang and Liu, Qun},
+  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
+  year      = {2020},
+  pages     = {4163--4174},
+  url       = {https://aclanthology.org/2020.findings-emnlp.372}
+}
+```
+If you use this checkpoint, please also cite the Mozilla autofill ML
+investigation that produced it (citation forthcoming).
+## License
+Apache 2.0.

config.json ADDED

	@@ -0,0 +1,165 @@

+{
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "cell": {},
+  "classifier_dropout": null,
+  "dtype": "float32",
+  "emb_size": 312,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 312,
+  "id2label": {
+    "1": "other",
+    "2": "given-name",
+    "3": "family-name",
+    "4": "name",
+    "5": "additional-name",
+    "6": "phonetic-given-name",
+    "7": "phonetic-family-name",
+    "8": "phonetic-name",
+    "9": "honorific-prefix",
+    "10": "honorific-suffix",
+    "11": "nickname",
+    "12": "street-address",
+    "13": "address-lookup",
+    "14": "address-line1",
+    "15": "address-line2",
+    "16": "address-line3",
+    "17": "address-level1",
+    "18": "address-level2",
+    "19": "address-level3",
+    "20": "address-level4",
+    "21": "street",
+    "22": "address-streetname",
+    "23": "address-housenumber",
+    "24": "address-extra-housesuffix",
+    "25": "postal-code",
+    "26": "postal-code-lookup",
+    "27": "postal-code-and-city",
+    "28": "postal-code-or-suburb",
+    "29": "country",
+    "30": "country-name",
+    "31": "tel",
+    "32": "tel-country-code",
+    "33": "tel-national",
+    "34": "tel-area-code",
+    "35": "tel-local",
+    "36": "tel-local-prefix",
+    "37": "tel-local-suffix",
+    "38": "tel-extension",
+    "39": "organization",
+    "40": "organization-title",
+    "41": "bday",
+    "42": "bday-day",
+    "43": "bday-month",
+    "44": "bday-year",
+    "45": "email",
+    "46": "apartment",
+    "47": "floor",
+    "48": "stair",
+    "49": "building",
+    "50": "block",
+    "51": "address-extra",
+    "52": "cc-name",
+    "53": "cc-given-name",
+    "54": "cc-additional-name",
+    "55": "cc-family-name",
+    "56": "cc-number",
+    "57": "cc-exp",
+    "58": "cc-exp-month",
+    "59": "cc-exp-year",
+    "60": "cc-csc",
+    "61": "cc-type",
+    "62": "sex",
+    "63": "id-number",
+    "64": "vat-number",
+    "65": "reference-point",
+    "66": "loginname"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 1200,
+  "label2id": {
+    "additional-name": 5,
+    "address-extra": 51,
+    "address-extra-housesuffix": 24,
+    "address-housenumber": 23,
+    "address-level1": 17,
+    "address-level2": 18,
+    "address-level3": 19,
+    "address-level4": 20,
+    "address-line1": 14,
+    "address-line2": 15,
+    "address-line3": 16,
+    "address-lookup": 13,
+    "address-streetname": 22,
+    "apartment": 46,
+    "bday": 41,
+    "bday-day": 42,
+    "bday-month": 43,
+    "bday-year": 44,
+    "block": 50,
+    "building": 49,
+    "cc-additional-name": 54,
+    "cc-csc": 60,
+    "cc-exp": 57,
+    "cc-exp-month": 58,
+    "cc-exp-year": 59,
+    "cc-family-name": 55,
+    "cc-given-name": 53,
+    "cc-name": 52,
+    "cc-number": 56,
+    "cc-type": 61,
+    "country": 29,
+    "country-name": 30,
+    "email": 45,
+    "family-name": 3,
+    "floor": 47,
+    "given-name": 2,
+    "honorific-prefix": 9,
+    "honorific-suffix": 10,
+    "id-number": 63,
+    "loginname": 66,
+    "name": 4,
+    "nickname": 11,
+    "organization": 39,
+    "organization-title": 40,
+    "other": 1,
+    "phonetic-family-name": 7,
+    "phonetic-given-name": 6,
+    "phonetic-name": 8,
+    "postal-code": 25,
+    "postal-code-and-city": 27,
+    "postal-code-lookup": 26,
+    "postal-code-or-suburb": 28,
+    "reference-point": 65,
+    "sex": 62,
+    "stair": 48,
+    "street": 21,
+    "street-address": 12,
+    "tel": 31,
+    "tel-area-code": 34,
+    "tel-country-code": 32,
+    "tel-extension": 38,
+    "tel-local": 35,
+    "tel-local-prefix": 36,
+    "tel-local-suffix": 37,
+    "tel-national": 33,
+    "vat-number": 64
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 4,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "pre_trained": "",
+  "problem_type": "single_label_classification",
+  "structure": [],
+  "transformers_version": "4.57.6",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
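Because `id2label` and `label2id` are maintained as two separate maps, it can help to confirm that they are exact inverses. The sketch below is one way to check that, assuming the config has been parsed into a dict; the function name `check_label_maps` and the report keys are hypothetical. It also reports the id range, since the ids in this config run from 1 to 66 rather than starting at 0. Whether the missing id 0 matters depends on how the classifier head was sized, which this diff does not show.

```python
import json


def check_label_maps(config):
    """Compare id2label and label2id and summarise the id range."""
    id2label = {int(key): label for key, label in config["id2label"].items()}
    label2id = config["label2id"]
    problems = []
    # Every id -> label entry must map back to the same id.
    for index, label in id2label.items():
        if label2id.get(label) != index:
            problems.append(f"id {index} -> {label!r}, but label2id gives {label2id.get(label)!r}")
    # Every label in label2id must also appear in id2label.
    known_labels = set(id2label.values())
    for label in label2id:
        if label not in known_labels:
            problems.append(f"label {label!r} is missing from id2label")
    ids = sorted(id2label)
    return {
        "num_labels": len(ids),
        "min_id": ids[0] if ids else None,
        "max_id": ids[-1] if ids else None,
        "contiguous": bool(ids) and ids == list(range(ids[0], ids[-1] + 1)),
        "problems": problems,
    }


# Small excerpt of the config above, used in place of reading config.json from disk.
excerpt = json.loads(
    '{"id2label": {"1": "other", "2": "given-name", "3": "family-name"},'
    ' "label2id": {"family-name": 3, "given-name": 2, "other": 1}}'
)
report = check_label_maps(excerpt)
print(report["num_labels"], report["min_id"], report["max_id"], report["problems"])
# 3 1 3 []
```

Running the same function on the full `config.json` would be expected to give `num_labels` 66 and `min_id` 1, based on the entries listed in the diff.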

onnx/model.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6ca907af8e24d3ec61567bdd42dfe31aeb873a9eeff8023df2aa1fc5b73a06d7
+size 57560725

onnx/model_bnb4.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bb31133ea6c69123ba1273ee52bae452cb83e16d5d3f765d8d28675f6df96da1
+size 41914512

onnx/model_fp16.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5192afb88fa3b4f64db2362f32518ee8ed445fba49fbe426310d0e8108a3aab6
+size 28851649

onnx/model_int8.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d3b69884aab01e3af514935c70a24bf935ea87272ebb4d0dd6656fc8efe4eb6d
+size 14563081

onnx/model_q4.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e13491b840aa0ae3666c443a280277455f84d3012fe33d031a3b13576610856b
+size 42260304

onnx/model_q4f16.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:87f433c2775284502dd25c6ab8eaf281f3a01dca8797bfc695599939493dc230
+size 22361887

onnx/model_quantized.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d3b69884aab01e3af514935c70a24bf935ea87272ebb4d0dd6656fc8efe4eb6d
+size 14563081

onnx/model_uint8.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b9a6ff07e301d108e2d7e0e777b777182066c08556bdee3b26f381f26a9f9b4f
+size 14563097

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,62 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "max_length": 512,
+  "model_max_length": 1000000000000000019884624838656,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}

vocab.txt ADDED

The diff for this file is too large to render. See raw diff