"""
Title: GPT2 Text Generation with KerasHub
Author: Chen Qian
Date created: 2023/04/17
Last modified: 2024/04/12
Description: Use KerasHub GPT2 model and `samplers` to do text generation.
Accelerator: GPU
"""
"""
In this tutorial, you will learn to use [KerasHub](https://keras.io/keras_hub/) to load a
pre-trained Large Language Model (LLM) - the [GPT-2 model](https://openai.com/research/better-language-models)
(originally created by OpenAI), finetune it to a specific text style, and
generate text based on user input (also known as a prompt). You will also learn
how GPT-2 adapts quickly to non-English languages, such as Chinese.
"""
"""
## Before we begin
Colab offers different kinds of runtimes. Make sure to go to **Runtime ->
Change runtime type** and choose the GPU Hardware Accelerator runtime
(which should have >12GB of host RAM and ~15GB of GPU RAM) since you will
finetune the GPT-2 model. Running this tutorial on a CPU runtime will take hours.
"""
"""
## Install KerasHub, Choose Backend and Import Dependencies
This example uses [Keras 3](https://keras.io/keras_3/) to work in any of
`"tensorflow"`, `"jax"` or `"torch"`. Support for Keras 3 is baked into
KerasHub; simply change the `"KERAS_BACKEND"` environment variable to select
the backend of your choice. We select the JAX backend below.
"""
"""shell
pip install git+https://github.com/keras-team/keras-hub.git -q
"""
import os

os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"

import keras_hub
import keras
import tensorflow as tf
import time

# Run at mixed precision to save memory and speed up compute on supported GPUs.
keras.mixed_precision.set_global_policy("mixed_float16")
"""
## Introduction to Generative Large Language Models (LLMs)
Large language models (LLMs) are a type of machine learning model trained on a
large corpus of text data to generate outputs for various natural language
processing (NLP) tasks, such as text generation, question answering, and
machine translation.

Generative LLMs are typically based on deep learning neural networks, such as
the [Transformer architecture](https://arxiv.org/abs/1706.03762) invented by
Google researchers in 2017, and are trained on massive amounts of text data,
often involving billions of words. These models, such as Google [LaMDA](https://blog.google/technology/ai/lamda/)
and [PaLM](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html),
are trained with large datasets from various data sources, which allows them to
generate output for many tasks. The core of a generative LLM is predicting the
next word in a sentence, often referred to as **Causal LM Pretraining**. In this
way LLMs can generate coherent text based on user prompts. For a more
pedagogical discussion on language models, you can refer to the
[Stanford CS324 LLM class](https://stanford-cs324.github.io/winter2022/lectures/introduction/).
"""
"""
## Introduction to KerasHub
Large Language Models are complex to build and expensive to train from scratch.
Luckily there are pretrained LLMs available for use right away. [KerasHub](https://keras.io/keras_hub/)
provides a large number of pre-trained checkpoints that allow you to experiment
with SOTA models without needing to train them yourself.

KerasHub is a natural language processing library that supports users through
their entire development cycle. KerasHub offers both pretrained models and
modularized building blocks, so developers can easily reuse pretrained models
or stack up their own LLM.

In a nutshell, for generative LLMs, KerasHub offers:

- Pretrained models with a `generate()` method, e.g.,
  `keras_hub.models.GPT2CausalLM` and `keras_hub.models.OPTCausalLM`.
- `Sampler` classes that implement generation algorithms such as Top-K, Beam and
  contrastive search. These samplers can be used to generate text with
  custom models.
"""
"""
## Load a pre-trained GPT-2 model and generate some text
KerasHub provides a number of pre-trained models, such as [Google
BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
and [GPT-2](https://openai.com/research/better-language-models). You can see
the list of models available in the [KerasHub repository](https://github.com/keras-team/keras-hub/tree/master/keras_hub/models).

It's very easy to load the GPT-2 model as you can see below:
"""
# To speed up training and generation, we use a preprocessor with a sequence
# length of 128 instead of the model's full length of 1024.
preprocessor = keras_hub.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_hub.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)
"""
Once the model is loaded, you can use it to generate some text right away. Run
the cells below to give it a try. It's as simple as calling a single function
*generate()*:
"""
start = time.time()
output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print("\nGPT-2 output:")
print(output)
end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")
"""
Try another one:
"""
start = time.time()
output = gpt2_lm.generate("That Italian restaurant is", max_length=200)
print("\nGPT-2 output:")
print(output)
end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")
"""
Notice how much faster the second call is. This is because the computational
graph is [XLA compiled](https://www.tensorflow.org/xla) in the first run and
reused in the second behind the scenes.

The quality of the generated text looks OK, but we can improve it via
fine-tuning.
"""
"""
## More on the GPT-2 model from KerasHub

Next up, we will actually fine-tune the model to update its parameters, but
before we do, let's take a look at the full set of tools we have for working
with GPT2.

The code of GPT2 can be found
[here](https://github.com/keras-team/keras-hub/blob/master/keras_hub/models/gpt2/).
Conceptually the `GPT2CausalLM` can be hierarchically broken down into several
modules in KerasHub, all of which have a *from_preset()* function that loads a
pretrained model:

- `keras_hub.models.GPT2Tokenizer`: The tokenizer used by the GPT2 model, which
  is a [byte-pair encoder](https://huggingface.co/course/chapter6/5?fw=pt).
- `keras_hub.models.GPT2CausalLMPreprocessor`: the preprocessor used by GPT2
  causal LM training. It does the tokenization along with other preprocessing
  work such as creating the label and appending the end token (see the sketch
  below).
- `keras_hub.models.GPT2Backbone`: the GPT2 model, which is a stack of
  `keras_hub.layers.TransformerDecoder`. This is usually just referred to as
  `GPT2`.
- `keras_hub.models.GPT2CausalLM`: wraps `GPT2Backbone`, multiplying its output
  by the embedding matrix to generate logits over the vocab tokens.
"""
"""
## Finetune on Reddit dataset
Now that you know the GPT-2 model from KerasHub, you can take one step further
to finetune the model so that it generates text in a specific style, short or
long, strict or casual. In this tutorial, we will use the reddit dataset as an
example.
"""
import tensorflow_datasets as tfds
reddit_ds = tfds.load("reddit_tifu", split="train", as_supervised=True)
"""
Let's take a look inside sample data from the reddit TensorFlow Dataset. There
are two features:

- **document**: text of the post.
- **title**: the title.
"""
for document, title in reddit_ds:
    print(document.numpy())
    print(title.numpy())
    break
"""
In our case, we are performing next word prediction in a language model, so we
only need the 'document' feature.
"""
train_ds = (
    reddit_ds.map(lambda document, _: document)
    .batch(32)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)
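"""
You can peek at one batch from the pipeline (a quick check) to confirm it
yields raw document strings; the model's attached preprocessor will tokenize
them on the fly during training:
"""

# Each element is a batch of 32 raw document strings.
for batch in train_ds.take(1):
    print(batch.shape)  # (32,)
    print(batch[0].numpy()[:100])  # first 100 bytes of the first document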
"""
Now you can finetune the model using the familiar *fit()* function. Note that
`preprocessor` will be automatically called inside the `fit()` method since
`GPT2CausalLM` is a `keras_hub.models.Task` instance.

This step would take quite a bit of GPU memory and a long time if we were to
train the model all the way to a fully trained state. Here we just use part of
the dataset for demo purposes.
"""
train_ds = train_ds.take(500)
num_epochs = 1

# Linearly decaying learning rate.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)
gpt2_lm.fit(train_ds, epochs=num_epochs)
"""
After fine-tuning is finished, you can again generate text using the same
*generate()* function. This time, the text will be closer to the Reddit writing
style, and the generated length will be close to our preset length from the
training set.
"""
start = time.time()
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)
end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")
"""
## Into the Sampling Method
In KerasHub, we offer a few sampling methods, e.g., contrastive search,
Top-K and beam sampling. By default, our `GPT2CausalLM` uses Top-K search, but
you can choose your own sampling method.

Much like optimizers and activations, there are two ways to specify your custom
sampler:

- Use a string identifier, such as `"greedy"`; this way you use the default
  configuration.
- Pass a `keras_hub.samplers.Sampler` instance; this way you can use a custom
  configuration.
"""
# Use a string identifier.
gpt2_lm.compile(sampler="top_k")
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)
# Use a `Sampler` instance. `GreedySampler` tends to repeat itself.
greedy_sampler = keras_hub.samplers.GreedySampler()
gpt2_lm.compile(sampler=greedy_sampler)
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)
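"""
For a custom configuration, you can pass parameters to the sampler class
directly. As a sketch, `keras_hub.samplers.TopKSampler` takes `k`, the number
of candidate tokens to sample from at each step; a smaller `k` makes the output
more focused:
"""

# Top-K sampling restricted to the 5 most likely tokens at each step.
top_k_sampler = keras_hub.samplers.TopKSampler(k=5)
gpt2_lm.compile(sampler=top_k_sampler)
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)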
"""
For more details on KerasHub `Sampler` class, you can check the code
[here](https://github.com/keras-team/keras-hub/tree/master/keras_hub/samplers).
"""
"""
## Finetune on Chinese Poem Dataset
We can also finetune GPT2 on non-English datasets. For readers who know
Chinese, this part illustrates how to fine-tune GPT2 on a Chinese poem dataset
to teach our model to become a poet!

Because GPT2 uses a byte-pair encoder, and the original pretraining dataset
contains some Chinese characters, we can use the original vocab to finetune on
a Chinese dataset.
"""
"""shell
# Load the Chinese poetry dataset.
git clone https://github.com/chinese-poetry/chinese-poetry.git
"""
"""
Load text from the json files. We only use《全唐诗》(the Complete Tang Poems)
for demo purposes.
"""
import os
import json

poem_collection = []
for file in os.listdir("chinese-poetry/全唐诗"):
    if ".json" not in file or "poet" not in file:
        continue
    full_filename = "%s/%s" % ("chinese-poetry/全唐诗", file)
    with open(full_filename, "r") as f:
        content = json.load(f)
        poem_collection.extend(content)

paragraphs = ["".join(data["paragraphs"]) for data in poem_collection]
"""
Let's take a look at sample data.
"""
print(paragraphs[0])
"""
Similar to the Reddit example, we convert the data to a TF dataset, and only
use part of it for training.
"""
train_ds = (
    tf.data.Dataset.from_tensor_slices(paragraphs)
    .batch(16)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Running through the whole dataset takes a long time, so we only take 500
# batches and run 1 epoch for demo purposes.
train_ds = train_ds.take(500)
num_epochs = 1

learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-4,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)
gpt2_lm.fit(train_ds, epochs=num_epochs)
"""
Let's check the result!
"""
output = gpt2_lm.generate("昨夜雨疏风骤", max_length=200)
print(output)
"""
Not bad 😀
"""