# %%
# ## 4. Pre-train BERT on processed dataset
import os

# hyperparameters
hyperparameters = {
    "model_config_id": "bert-base-uncased",
    "dataset_id": "chaoyan/processed_bert_dataset",
    "tokenizer_id": "cat_tokenizer",
    "repository_id": "bert-base-uncased-cat",
    "max_steps": 100_000,
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
}

hyperparameters_string = " ".join(f"--{key} {value}" for key, value in hyperparameters.items())
cmd_str = f"python3 run_mlm_local.py {hyperparameters_string}"
os.system(cmd_str)
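# %% [markdown]
# The launch above can also be sketched with `subprocess` instead of `os.system`: passing an argument list avoids shell-quoting issues, and `check=True` surfaces a non-zero exit status. The hyperparameters are the same ones defined above; the `subprocess.run` call is left commented out since it needs `run_mlm_local.py` and the dataset.

```python
import subprocess

# Same hyperparameters as in the cell above.
hyperparameters = {
    "model_config_id": "bert-base-uncased",
    "dataset_id": "chaoyan/processed_bert_dataset",
    "tokenizer_id": "cat_tokenizer",
    "repository_id": "bert-base-uncased-cat",
    "max_steps": 100_000,
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
}

# Build an argument list instead of a shell string.
args = ["python3", "run_mlm_local.py"]
for key, value in hyperparameters.items():
    args += [f"--{key}", str(value)]

# subprocess.run(args, check=True)  # raises CalledProcessError on failure
```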
# %% [markdown]
# 
# _This [experiment](https://huggingface.co/philschmid/bert-base-uncased-2022-habana-test-6) ran for 60k steps_
#
# In our `hyperparameters` we defined a `max_steps` property, which limited the pre-training to `100_000` steps. The `100_000` steps with a global batch size of `256` took around 12.5 hours.
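# %% [markdown]
# As a sanity check, the global batch size of `256` decomposes into per-device batch size, device count, and gradient accumulation. A sketch: the 8 accelerators match a Gaudi-based dl1.24xlarge instance, but the gradient-accumulation factor of 2 is an assumption — only the product of 256 is stated here.

```python
# Hypothetical decomposition of the global batch size of 256.
per_device_train_batch_size = 16  # from the hyperparameters above
num_devices = 8                   # HPUs on a dl1.24xlarge instance
gradient_accumulation_steps = 2   # assumed, to reach the stated global size

global_batch_size = per_device_train_batch_size * num_devices * gradient_accumulation_steps
print(global_batch_size)  # 256
```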
#
# BERT was originally pre-trained on [1 Million Steps](https://arxiv.org/pdf/1810.04805.pdf) with a global batch size of `256`:
# > We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
#
# This means a full pre-training would take around 125 hours (12.5 hours * 10) and would cost around ~$1,650 using Habana Gaudi on AWS, which is extremely cheap.
#
# For comparison, the DeepSpeed team, which holds the record for the [fastest BERT pre-training](https://www.deepspeed.ai/tutorials/bert-pretraining/), [reported](https://www.deepspeed.ai/tutorials/bert-pretraining/) that pre-training BERT on one [DGX-2](https://www.nvidia.com/en-us/data-center/dgx-2/) (powered by 16 NVIDIA V100 GPUs with 32GB of memory each) takes around 33.25 hours.
#
# To compare costs we can use the [p3dn.24xlarge](https://aws.amazon.com/de/ec2/instance-types/p3/) as a reference, which comes with 8x NVIDIA V100 32GB GPUs and costs ~$31.22/h. We would need two of these instances to match the setup DeepSpeed reported; for now we ignore any overhead created by the multi-node setup (I/O, network, etc.).
# This would bring the cost of the DeepSpeed GPU-based training on AWS to around ~$2,075, which is 25% more than what Habana Gaudi currently delivers.
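# %% [markdown]
# The cost comparison above can be checked with a few lines of arithmetic (a sketch; the hourly price is the on-demand rate quoted in the text and may have changed):

```python
# Gaudi: 100k steps took ~12.5h, so 1M steps is ~10x that, quoted at ~$1,650.
gaudi_hours = 12.5 * 10
gaudi_total = 1_650

# GPU baseline: one DGX-2 run (33.25h) approximated by two p3dn.24xlarge.
p3dn_price_per_hour = 31.22
gpu_total = 33.25 * 2 * p3dn_price_per_hour

print(f"Gaudi: ~${gaudi_total} in {gaudi_hours:.0f}h")
print(f"GPU:   ~${gpu_total:,.2f}")
print(f"GPU premium over Gaudi: {gpu_total / gaudi_total - 1:.1%}")
```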
# _Something to note here is that using [DeepSpeed](https://www.deepspeed.ai/tutorials/bert-pretraining/#deepspeed-single-gpu-throughput-results) in general improves performance by a factor of ~2._
#
# We are looking forward to re-doing the experiment once the [Gaudi DeepSpeed integration](https://docs.habana.ai/en/latest/PyTorch/DeepSpeed/DeepSpeed_User_Guide.html#deepspeed-configs) is more widely available.
#
#
# ## Conclusion
#
# That's it for this tutorial. Now you know the basics of how to pre-train BERT from scratch using Hugging Face Transformers and Habana Gaudi. You also saw how easy it is to migrate from the `Trainer` to the `GaudiTrainer`.
#
# We compared our implementation with the [fastest BERT-pretraining](https://www.deepspeed.ai/tutorials/bert-pretraining/) results and saw that Habana Gaudi delivers the same pre-training at roughly 20% lower cost (the GPU setup costs ~25% more), letting us pre-train BERT for ~$1,650.
#
# These results are incredible, since they will allow companies to adapt their pre-trained models to their language and domain to [improve accuracy by up to 10%](https://huggingface.co/pile-of-law/legalbert-large-1.7M-1#evaluation-results) compared to the general BERT models.
#