# %%
# ## 4. Pre-train BERT on processed dataset
import os

# hyperparameters
hyperparameters = {
    "model_config_id": "bert-base-uncased",
    "dataset_id": "chaoyan/processed_bert_dataset",
    "tokenizer_id": "cat_tokenizer",
    "repository_id": "bert-base-uncased-cat",
    "max_steps": 100_000,
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
}

hyperparameters_string = " ".join(f"--{key} {value}" for key, value in hyperparameters.items())
cmd_str = f"python3 run_mlm_local.py {hyperparameters_string}"
os.system(cmd_str)
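# %% [markdown]
# The launch above can also be sketched with `subprocess` instead of `os.system`: passing an argument list avoids shell-quoting issues, and `check=True` surfaces a non-zero exit status. The hyperparameters are the same ones defined above; the `subprocess.run` call is left commented out since it needs `run_mlm_local.py` and the dataset.

```python
import subprocess

# Same hyperparameters as in the cell above.
hyperparameters = {
    "model_config_id": "bert-base-uncased",
    "dataset_id": "chaoyan/processed_bert_dataset",
    "tokenizer_id": "cat_tokenizer",
    "repository_id": "bert-base-uncased-cat",
    "max_steps": 100_000,
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
}

# Build an argument list instead of a shell string.
args = ["python3", "run_mlm_local.py"]
for key, value in hyperparameters.items():
    args += [f"--{key}", str(value)]

# subprocess.run(args, check=True)  # raises CalledProcessError on failure
```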
# %% [markdown]
# 
# _This [experiment](https://huggingface.co/philschmid/bert-base-uncased-2022-habana-test-6) ran for 60k steps_
#
# In our `hyperparameters` we defined a `max_steps` property, which limited the pre-training to `100_000` steps. The `100_000` steps with a global batch size of `256` took around 12.5 hours.
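# %% [markdown]
# As a sanity check, the global batch size of `256` decomposes into per-device batch size, device count, and gradient accumulation. A sketch: the 8 accelerators match a Gaudi-based dl1.24xlarge instance, but the gradient-accumulation factor of 2 is an assumption — only the product of 256 is stated here.

```python
# Hypothetical decomposition of the global batch size of 256.
per_device_train_batch_size = 16  # from the hyperparameters above
num_devices = 8                   # HPUs on a dl1.24xlarge instance
gradient_accumulation_steps = 2   # assumed, to reach the stated global size

global_batch_size = per_device_train_batch_size * num_devices * gradient_accumulation_steps
print(global_batch_size)  # 256
```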
#
# BERT was originally pre-trained on [1 Million Steps](https://arxiv.org/pdf/1810.04805.pdf) with a global batch size of `256`:
# > We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
#
# This means a full pre-training would take around 125 hours (12.5 hours * 10) and would cost around ~$1,650 using Habana Gaudi on AWS, which is extremely cheap.
#
# For comparison, the DeepSpeed team, which holds the record for the [fastest BERT pre-training](https://www.deepspeed.ai/tutorials/bert-pretraining/), [reported](https://www.deepspeed.ai/tutorials/bert-pretraining/) that pre-training BERT on one [DGX-2](https://www.nvidia.com/en-us/data-center/dgx-2/) (powered by 16 NVIDIA V100 GPUs with 32GB of memory each) takes around 33.25 hours.
#
# To compare costs we can use the [p3dn.24xlarge](https://aws.amazon.com/de/ec2/instance-types/p3/) as a reference, which comes with 8x NVIDIA V100 32GB GPUs and costs ~$31.22/h. We would need two of these instances to match the setup DeepSpeed reported; for now we ignore any overhead created by the multi-node setup (I/O, network, etc.).
# This would bring the cost of the DeepSpeed GPU-based training on AWS to around ~$2,075, which is 25% more than what Habana Gaudi currently delivers.
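# %% [markdown]
# The cost comparison above can be checked with a few lines of arithmetic (a sketch; the hourly price is the on-demand rate quoted in the text and may have changed):

```python
# Gaudi: 100k steps took ~12.5h, so 1M steps is ~10x that, quoted at ~$1,650.
gaudi_hours = 12.5 * 10
gaudi_total = 1_650

# GPU baseline: one DGX-2 run (33.25h) approximated by two p3dn.24xlarge.
p3dn_price_per_hour = 31.22
gpu_total = 33.25 * 2 * p3dn_price_per_hour

print(f"Gaudi: ~${gaudi_total} in {gaudi_hours:.0f}h")
print(f"GPU:   ~${gpu_total:,.2f}")
print(f"GPU premium over Gaudi: {gpu_total / gaudi_total - 1:.1%}")
```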
# _Something to note here is that using [DeepSpeed](https://www.deepspeed.ai/tutorials/bert-pretraining/#deepspeed-single-gpu-throughput-results) in general improves performance by a factor of ~2._
#
# We are looking forward to re-doing the experiment once the [Gaudi DeepSpeed integration](https://docs.habana.ai/en/latest/PyTorch/DeepSpeed/DeepSpeed_User_Guide.html#deepspeed-configs) is more widely available.
#
#
# ## Conclusion
#
# That's it for this tutorial. Now you know the basics of how to pre-train BERT from scratch using Hugging Face Transformers and Habana Gaudi. You also saw how easy it is to migrate from the `Trainer` to the `GaudiTrainer`.
#
# We compared our implementation with the [fastest BERT-pretraining](https://www.deepspeed.ai/tutorials/bert-pretraining/) results and saw that Habana Gaudi delivers the same pre-training at roughly 20% lower cost (the GPU setup costs ~25% more), letting us pre-train BERT for ~$1,650.
#
# These results are incredible, since they will allow companies to adapt their pre-trained models to their language and domain to [improve accuracy by up to 10%](https://huggingface.co/pile-of-law/legalbert-large-1.7M-1#evaluation-results) compared to the general BERT models.
#