Milan Straka committed on
Commit ·
41cfe08
Parent(s): 354716a
Better formulation.
README.md
CHANGED
@@ -14,7 +14,9 @@ tags:
 ## Version History
 
 - **version 1.1**: Version 1.1 was released in Jan 2024, with a change to the
-tokenizer; the
+tokenizer; the model parameters were mostly kept the same, but the embeddings
+were enlarged (by copying suitable rows) to correspond to the updated
+tokenizer.
 
 The tokenizer in the initial release (a) contained a hole (51959 did not
 correspond to any token), and (b) mapped several tokens (unseen during training
@@ -25,7 +27,7 @@ tags:
 
 In version 1.1, the tokenizer was modified by (a) removing the hole, (b)
 mapping all tokens to a unique ID. That also required increasing the
-vocabulary
+vocabulary size and embeddings weights (by replicating the embedding of the
 `[UNK]` token). Without finetuning, version 1.1 and version 1.0 gives exactly
 the same results on any input, and the tokens in version 1.0 that mapped to
 a different ID than the `[UNK]` token map to the same ID in version 1.1.
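The change described in the README text can be illustrated with a small, self-contained sketch. It is not the author's actual conversion script: the helper names (`find_holes`, `assign_unique_ids`, `enlarge_embeddings`), the toy vocabulary, and the choice to fill holes first and then append new IDs at the end are all assumptions made for illustration. The only properties taken from the README are that tokens not mapped to the `[UNK]` ID keep their ID, that every token ends up with a unique ID, and that new embedding rows are copies of the `[UNK]` row.

```python
from typing import Dict, List

UNK = "[UNK]"


def find_holes(vocab: Dict[str, int]) -> List[int]:
    """Return the IDs below the maximum ID that no token maps to."""
    used = set(vocab.values())
    return [i for i in range(max(used) + 1) if i not in used]


def assign_unique_ids(old_vocab: Dict[str, int]) -> Dict[str, int]:
    """Give every token its own ID (hypothetical scheme, see lead-in).

    Tokens that did not share the [UNK] ID keep their old ID. Tokens that
    did share it first fill any holes, then get fresh IDs at the end.
    """
    unk_id = old_vocab[UNK]
    new_vocab = {t: i for t, i in old_vocab.items() if t == UNK or i != unk_id}
    free_ids = find_holes(old_vocab)
    next_id = max(old_vocab.values()) + 1
    for token, old_id in old_vocab.items():
        if token != UNK and old_id == unk_id:
            if free_ids:
                new_vocab[token] = free_ids.pop(0)
            else:
                new_vocab[token] = next_id
                next_id += 1
    return new_vocab


def enlarge_embeddings(
    embeddings: List[List[float]],
    old_vocab: Dict[str, int],
    new_vocab: Dict[str, int],
) -> List[List[float]]:
    """Build the enlarged embedding matrix by copying suitable rows.

    Each token's new row is a copy of the row it used under the old
    vocabulary, so tokens that used to share the [UNK] ID get a copy
    of the [UNK] embedding.
    """
    size = max(new_vocab.values()) + 1
    unk_row = embeddings[old_vocab[UNK]]
    # Start from the old rows; any extra rows default to the [UNK] row.
    rows = [
        list(embeddings[i]) if i < len(embeddings) else list(unk_row)
        for i in range(size)
    ]
    for token, new_id in new_vocab.items():
        rows[new_id] = list(embeddings[old_vocab[token]])
    return rows


# Toy data: ID 3 is a hole, and "x" and "y" share the [UNK] ID 0.
old_vocab = {UNK: 0, "a": 1, "b": 2, "c": 4, "x": 0, "y": 0}
old_emb = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [9.0, 9.9], [4.0, 4.1]]

new_vocab = assign_unique_ids(old_vocab)
new_emb = enlarge_embeddings(old_emb, old_vocab, new_vocab)
```

Because every token looks up the same embedding values before and after the conversion, a model using the enlarged matrix produces the same outputs as before until it is finetuned, which matches the behaviour the README describes for versions 1.0 and 1.1.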