Update README.md

README.md (changed)
````diff
@@ -58,22 +58,22 @@ Import the tokenizer and model:
 
 ```python
 tokenizer = tokun.huggingface.ByteTokenizer()
-model = hh.from_pretrained_keras('tokun/variants/
+model = hh.from_pretrained_keras('tokun/variants/16x4/')
 ```
 
 ### With Base Tensorflow / Keras
 
 You can directly load the weights [from the repository](../models/).
 
-For the most performant variant of the model, `
+For the most performant variant of the model, `16x4`:
 
 ```python
 import tensorflow as tf
 import tokun.model
 import urllib.request
 
-urllib.request.urlretrieve('https://github.com/apehex/tokun/raw/main/models/
-model = tf.keras.models.load_model('model.keras')
+urllib.request.urlretrieve('https://github.com/apehex/tokun/raw/main/models/16x4/1/7.7.keras', 'model.keras')
+model = tf.keras.models.load_model('model.keras', compile=False)
 ```
 
 ## Usage
@@ -121,7 +121,7 @@ print(__p.shape) # back to x shape
 ### With Base Tensorflow / Keras
 
 ```python
-__x = tokun.pipeline.preprocess(text=__s, groups=[
+__x = tokun.pipeline.preprocess(text=__s, groups=[16, 4], expand=[1], flatten=True)
 __e = model._encoder(__x) # final embedding = input for another model
 # these embeddings would be the input of a LLM
 __o = llm(__e) # replace with your LLM
@@ -178,10 +178,6 @@ Notes on each iteration:
 - `tokun-4`: [Github][article-file-tokun-4]
 - `tokun-16`: [Github][article-file-tokun-16]
 
-## TODO
-
-See [TODO](TODO.md).
-
 ## Credits
 
 This project was inspired by a video from Andrej Karpathy, ["Let's build the GPT tokenizer"][youtube-karpathy-tokenizer].
````
The updated sections read:

Import the tokenizer and model:

```python
tokenizer = tokun.huggingface.ByteTokenizer()
model = hh.from_pretrained_keras('tokun/variants/16x4/')
```
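tokun tokenizes raw bytes rather than entries from a learned subword vocabulary. As a rough illustration of the idea only (a minimal sketch assuming a fixed-width UTF-32-BE byte encoding; `byte_tokenize` and `byte_detokenize` are hypothetical helpers, not the `ByteTokenizer` API):

```python
def byte_tokenize(text: str) -> list[int]:
    # hypothetical stand-in: map each character to its 4 UTF-32-BE bytes,
    # so every codepoint becomes a fixed-size run of byte "tokens"
    return list(text.encode('utf-32-be'))

def byte_detokenize(ids: list[int]) -> str:
    # inverse: reassemble the bytes and decode them back to text
    return bytes(ids).decode('utf-32-be')

ids = byte_tokenize('A✓')
# 'A' (U+0041) -> [0, 0, 0, 65]; '✓' (U+2713) -> [0, 0, 39, 19]
```

Because every codepoint maps to the same number of bytes, there is no out-of-vocabulary case and the mapping is exactly invertible.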
### With Base Tensorflow / Keras

You can directly load the weights [from the repository](../models/).

For the most performant variant of the model, `16x4`:

```python
import tensorflow as tf
import tokun.model
import urllib.request

urllib.request.urlretrieve('https://github.com/apehex/tokun/raw/main/models/16x4/1/7.7.keras', 'model.keras')
model = tf.keras.models.load_model('model.keras', compile=False)
```
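The snippet above re-downloads the weights on every run; a small standard-library guard can cache them locally instead (`fetch_once` is an illustrative helper, not part of tokun):

```python
import os
import urllib.request

def fetch_once(url: str, path: str) -> str:
    # download the file only when it is not already on disk,
    # so repeated runs skip the network round trip
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path

# model = tf.keras.models.load_model(
#     fetch_once('https://github.com/apehex/tokun/raw/main/models/16x4/1/7.7.keras', 'model.keras'),
#     compile=False)
```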
## Usage

### With Base Tensorflow / Keras

```python
__x = tokun.pipeline.preprocess(text=__s, groups=[16, 4], expand=[1], flatten=True)
__e = model._encoder(__x) # final embedding = input for another model
# these embeddings would be the input of a LLM
__o = llm(__e) # replace with your LLM
```
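The `groups=[16, 4]` argument chunks the byte sequence hierarchically, which requires the sequence length to divide the product of the group sizes. A minimal sketch of the padding this implies (`pad_to_unit` is an illustration assumption, not `tokun.pipeline`'s actual code):

```python
import math

def pad_to_unit(ids: list[int], groups: list[int]) -> list[int]:
    # pad with zero bytes so the length is a multiple of the product of the
    # group sizes (16 * 4 = 64 for the `16x4` variant), so the flat sequence
    # can be reshaped into nested groups
    unit = math.prod(groups)
    return ids + [0] * ((-len(ids)) % unit)

padded = pad_to_unit(list(range(70)), [16, 4])
# 70 bytes round up to 128, i.e. two full 64-byte units
```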
Notes on each iteration:

- `tokun-4`: [Github][article-file-tokun-4]
- `tokun-16`: [Github][article-file-tokun-16]

## Credits

This project was inspired by a video from Andrej Karpathy, ["Let's build the GPT tokenizer"][youtube-karpathy-tokenizer].