#
# 1. [Prepare the dataset](#1-prepare-the-dataset)
# 2. [Train a Tokenizer](#2-train-a-tokenizer)
# 3. [Preprocess the dataset](#3-preprocess-the-dataset)
# 4. [Pre-train BERT on Habana Gaudi](#4-pre-train-bert-on-habana-gaudi)
#
# _Note: Steps 1 to 3 can/should be run on a different instance type, since they are CPU-intensive tasks._
# %%
# ## 1. Prepare the dataset
# We log in to the [Hugging Face Hub](https://huggingface.co/models) so we can push our dataset, tokenizer, model artifacts, logs, and metrics to the Hub during and after training.
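# If you are not yet authenticated (e.g. via `huggingface-cli login` in a terminal), you can log in from the notebook. This is a minimal sketch; it assumes you have a Hub account and a write-enabled access token.
from huggingface_hub import notebook_login
notebook_login()  # prompts for your Hugging Face access token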
from huggingface_hub import HfApi
user_id = HfApi().whoami()["name"]
print(f"user id '{user_id}' will be used during the example")
from datasets import concatenate_datasets, load_dataset
# The [original BERT](https://arxiv.org/abs/1810.04805) was pretrained on the [Wikipedia](https://huggingface.co/datasets/wikipedia) and [BookCorpus](https://huggingface.co/datasets/bookcorpus) datasets. Both datasets are available on the [Hugging Face Hub](https://huggingface.co/datasets) and can be loaded with `datasets`.
#
# _Note: For Wikipedia we will use the `20220301` dump, which is different from the one used for the original BERT._
#
# As a first step, we load both datasets and merge them together to create one big dataset.
bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20220301.en", split="train")
wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])  # only keep the 'text' column
assert bookcorpus.features.type == wiki.features.type
raw_datasets = concatenate_datasets([bookcorpus, wiki])
print(raw_datasets)
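# Optional sanity check (a sketch, not part of the original steps): peek at one example and, for a
# quick dry run, work on a small slice of the corpus with `.select()`.
print(raw_datasets[0]["text"][:200])  # first 200 characters of the first example
# raw_datasets = raw_datasets.select(range(100_000))  # uncomment for a fast test run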
# %% [markdown]
# > We are not going to do any advanced dataset preparation, like de-duplication, filtering or other pre-processing. If you are planning to apply this notebook to train your own BERT model from scratch, I highly recommend including those data preparation steps in your workflow. This will help you improve your language model.
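# >
# > As an illustration only (we skip it in this example), a minimal length filter with the `datasets` `filter()` method could look like the snippet below; proper de-duplication would require more involved tooling.
#
# ```python
# # drop very short examples (illustrative threshold, not used in this example)
# raw_datasets = raw_datasets.filter(lambda example: len(example["text"].split()) > 10, num_proc=4)
# ```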
# ## 2. Train a Tokenizer
#
# To be able to train our model we need to convert our text into a tokenized format. Most Transformer models come with a pre-trained tokenizer, but since we are pre-training our model from scratch we also need to train a tokenizer on our data. We can do this with `transformers` and the `BertTokenizerFast` class.
#
# More information about training a new tokenizer can be found in the [Hugging Face Course](https://huggingface.co/course/chapter6/2?fw=pt).
from tqdm import tqdm
from transformers import BertTokenizerFast

# repository id for saving the tokenizer
tokenizer_id = "chaoyan/bert-base-uncased-cat"

# create a python generator to dynamically load the data
def batch_iterator(batch_size=10000):
    for i in tqdm(range(0, len(raw_datasets), batch_size)):
        yield raw_datasets[i : i + batch_size]["text"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert_tokenizer = tokenizer.train_new_from_iterator(text_iterator=batch_iterator(), vocab_size=32_000)
print(bert_tokenizer)
bert_tokenizer.save_pretrained("cat_tokenizer")
# We push the tokenizer to the [Hugging Face Hub](https://huggingface.co/models) so we can load it later when training our model.
bert_tokenizer.push_to_hub(tokenizer_id)
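# As a quick sanity check (an optional sketch, not part of the original steps), we can reload the
# tokenizer from the local directory and tokenize a sample sentence.
reloaded_tokenizer = BertTokenizerFast.from_pretrained("cat_tokenizer")
print(reloaded_tokenizer.tokenize("the cat sat on the mat."))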