Text Generation
Transformers
Safetensors
English
llama
tiny-model
sub-1M
cpu
small
tiny
quark
1m
text-generation-inference
Instructions to use LH-Tech-AI/Quark-0.5M with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use LH-Tech-AI/Quark-0.5M with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="LH-Tech-AI/Quark-0.5M")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("LH-Tech-AI/Quark-0.5M")
model = AutoModelForCausalLM.from_pretrained("LH-Tech-AI/Quark-0.5M")
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use LH-Tech-AI/Quark-0.5M with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "LH-Tech-AI/Quark-0.5M"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "LH-Tech-AI/Quark-0.5M",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker
docker model run hf.co/LH-Tech-AI/Quark-0.5M
- SGLang
How to use LH-Tech-AI/Quark-0.5M with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "LH-Tech-AI/Quark-0.5M" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "LH-Tech-AI/Quark-0.5M",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "LH-Tech-AI/Quark-0.5M" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "LH-Tech-AI/Quark-0.5M",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
- Docker Model Runner
How to use LH-Tech-AI/Quark-0.5M with Docker Model Runner:
docker model run hf.co/LH-Tech-AI/Quark-0.5M
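The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and the response shape (`choices[0].text`) is assumed to follow the OpenAI-compatible completions format that the curl examples target. It needs a server that is already running on the given port.

```python
import json
import urllib.request

MODEL_ID = "LH-Tech-AI/Quark-0.5M"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) the same POST request as the curl examples."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion (assumed response shape)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # vLLM listens on port 8000 in the example above; SGLang uses 30000.
    print(complete("Once upon a time,", base_url="http://localhost:8000"))
```

For the SGLang server, pass `base_url="http://localhost:30000"` instead.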
Create logs.log
logs.log
ADDED
@@ -0,0 +1,528 @@
+[*] Loading libraries...
+[*] Loading tokenizer...
+[*] Gathering 100 million tokens by streaming dataset...
+Resolving data files: 100%|██████████████| 2410/2410 [00:00<00:00, 30853.46it/s]
+[*] Gathering tokens: 100%|██| 400000000/400000000 [13:58<00:00, 477048.96tok/s]
+[+] Collected 400,000,000 tokens → 1,562,500 chunks.
+[*] Setting up model...
+[*] Model parameters: 465,504
+[*] Defining training arguments...
+[*] Starting training...
+{'loss': '5.986', 'grad_norm': '0.5017', 'learning_rate': '9.9e-05', 'epoch': '0.008192'}
+{'loss': '5.403', 'grad_norm': '0.394', 'learning_rate': '0.000199', 'epoch': '0.01638'}
+{'loss': '4.75', 'grad_norm': '0.9517', 'learning_rate': '0.000299', 'epoch': '0.02458'}
+{'loss': '4.192', 'grad_norm': '1.073', 'learning_rate': '0.000399', 'epoch': '0.03277'}
+{'loss': '3.702', 'grad_norm': '1.364', 'learning_rate': '0.000499', 'epoch': '0.04096'}
+1%|█ | 500/36624 [00:34<40:21, 14.92it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 126.22it/s]
+{'loss': '3.378', 'grad_norm': '1.906', 'learning_rate': '0.0004986', 'epoch': '0.04915'}
+{'loss': '3.195', 'grad_norm': '1.332', 'learning_rate': '0.0004972', 'epoch': '0.05734'}
+{'loss': '3.085', 'grad_norm': '1.36', 'learning_rate': '0.0004959', 'epoch': '0.06553'}
+{'loss': '3.011', 'grad_norm': '1.354', 'learning_rate': '0.0004945', 'epoch': '0.07373'}
+{'loss': '2.955', 'grad_norm': '1.423', 'learning_rate': '0.0004931', 'epoch': '0.08192'}
+3%|█ | 1000/36624 [01:08<40:59, 14.48it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 185.00it/s]
+{'loss': '2.914', 'grad_norm': '1.194', 'learning_rate': '0.0004917', 'epoch': '0.09011'}
+{'loss': '2.887', 'grad_norm': '1.145', 'learning_rate': '0.0004903', 'epoch': '0.0983'}
+{'loss': '2.861', 'grad_norm': '1.353', 'learning_rate': '0.0004889', 'epoch': '0.1065'}
+{'loss': '2.833', 'grad_norm': '1.226', 'learning_rate': '0.0004876', 'epoch': '0.1147'}
+{'loss': '2.824', 'grad_norm': '1.226', 'learning_rate': '0.0004862', 'epoch': '0.1229'}
+4%|██ | 1500/36624 [01:42<40:32, 14.44it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 182.87it/s]
+{'loss': '2.806', 'grad_norm': '1.204', 'learning_rate': '0.0004848', 'epoch': '0.1311'}
+{'loss': '2.786', 'grad_norm': '1.139', 'learning_rate': '0.0004834', 'epoch': '0.1393'}
+{'loss': '2.777', 'grad_norm': '1.099', 'learning_rate': '0.000482', 'epoch': '0.1475'}
+{'loss': '2.765', 'grad_norm': '1.127', 'learning_rate': '0.0004806', 'epoch': '0.1556'}
+{'loss': '2.754', 'grad_norm': '1.186', 'learning_rate': '0.0004793', 'epoch': '0.1638'}
+5%|██ | 2000/36624 [02:16<39:37, 14.56it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.92it/s]
+{'loss': '2.749', 'grad_norm': '1.068', 'learning_rate': '0.0004779', 'epoch': '0.172'}
+{'loss': '2.732', 'grad_norm': '1.086', 'learning_rate': '0.0004765', 'epoch': '0.1802'}
+{'loss': '2.73', 'grad_norm': '1.105', 'learning_rate': '0.0004751', 'epoch': '0.1884'}
+{'loss': '2.721', 'grad_norm': '1.213', 'learning_rate': '0.0004737', 'epoch': '0.1966'}
+{'loss': '2.717', 'grad_norm': '1.168', 'learning_rate': '0.0004723', 'epoch': '0.2048'}
+7%|███ | 2500/36624 [02:50<39:00, 14.58it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.91it/s]
+{'loss': '2.708', 'grad_norm': '1.081', 'learning_rate': '0.0004709', 'epoch': '0.213'}
+{'loss': '2.705', 'grad_norm': '1.083', 'learning_rate': '0.0004696', 'epoch': '0.2212'}
+{'loss': '2.697', 'grad_norm': '1.079', 'learning_rate': '0.0004682', 'epoch': '0.2294'}
+{'loss': '2.692', 'grad_norm': '1.123', 'learning_rate': '0.0004668', 'epoch': '0.2376'}
+{'loss': '2.687', 'grad_norm': '1.147', 'learning_rate': '0.0004654', 'epoch': '0.2458'}
+8%|███ | 3000/36624 [03:24<37:58, 14.76it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 192.12it/s]
+{'loss': '2.681', 'grad_norm': '1.052', 'learning_rate': '0.000464', 'epoch': '0.2539'}
+{'loss': '2.676', 'grad_norm': '1.099', 'learning_rate': '0.0004626', 'epoch': '0.2621'}
+{'loss': '2.674', 'grad_norm': '1.084', 'learning_rate': '0.0004613', 'epoch': '0.2703'}
+{'loss': '2.672', 'grad_norm': '1.057', 'learning_rate': '0.0004599', 'epoch': '0.2785'}
+{'loss': '2.672', 'grad_norm': '1.103', 'learning_rate': '0.0004585', 'epoch': '0.2867'}
+10%|████ | 3500/36624 [03:59<38:12, 14.45it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 199.64it/s]
+{'loss': '2.661', 'grad_norm': '1.062', 'learning_rate': '0.0004571', 'epoch': '0.2949'}
+{'loss': '2.658', 'grad_norm': '1.055', 'learning_rate': '0.0004557', 'epoch': '0.3031'}
+{'loss': '2.656', 'grad_norm': '1.06', 'learning_rate': '0.0004543', 'epoch': '0.3113'}
+{'loss': '2.653', 'grad_norm': '1.1', 'learning_rate': '0.000453', 'epoch': '0.3195'}
+{'loss': '2.651', 'grad_norm': '1.137', 'learning_rate': '0.0004516', 'epoch': '0.3277'}
+11%|█████ | 4000/36624 [04:33<37:14, 14.60it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.63it/s]
+{'loss': '2.648', 'grad_norm': '1.009', 'learning_rate': '0.0004502', 'epoch': '0.3359'}
+{'loss': '2.639', 'grad_norm': '1', 'learning_rate': '0.0004488', 'epoch': '0.3441'}
+{'loss': '2.641', 'grad_norm': '1.044', 'learning_rate': '0.0004474', 'epoch': '0.3522'}
+{'loss': '2.641', 'grad_norm': '1.039', 'learning_rate': '0.000446', 'epoch': '0.3604'}
+{'loss': '2.637', 'grad_norm': '1.036', 'learning_rate': '0.0004446', 'epoch': '0.3686'}
+12%|█████ | 4500/36624 [05:07<36:26, 14.69it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 193.36it/s]
+{'loss': '2.632', 'grad_norm': '0.9873', 'learning_rate': '0.0004433', 'epoch': '0.3768'}
+{'loss': '2.631', 'grad_norm': '1.043', 'learning_rate': '0.0004419', 'epoch': '0.385'}
+{'loss': '2.63', 'grad_norm': '1.063', 'learning_rate': '0.0004405', 'epoch': '0.3932'}
+{'loss': '2.624', 'grad_norm': '1.026', 'learning_rate': '0.0004391', 'epoch': '0.4014'}
+{'loss': '2.624', 'grad_norm': '1.011', 'learning_rate': '0.0004377', 'epoch': '0.4096'}
+14%|██████ | 5000/36624 [05:41<36:09, 14.58it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 189.30it/s]
+{'loss': '2.625', 'grad_norm': '1.08', 'learning_rate': '0.0004363', 'epoch': '0.4178'}
+{'loss': '2.621', 'grad_norm': '1.007', 'learning_rate': '0.000435', 'epoch': '0.426'}
+{'loss': '2.618', 'grad_norm': '1.025', 'learning_rate': '0.0004336', 'epoch': '0.4342'}
+{'loss': '2.616', 'grad_norm': '0.9491', 'learning_rate': '0.0004322', 'epoch': '0.4424'}
+{'loss': '2.615', 'grad_norm': '1.072', 'learning_rate': '0.0004308', 'epoch': '0.4505'}
+15%|██████ | 5500/36624 [06:15<35:20, 14.67it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.28it/s]
+{'loss': '2.604', 'grad_norm': '0.986', 'learning_rate': '0.0004294', 'epoch': '0.4587'}
+{'loss': '2.609', 'grad_norm': '0.9908', 'learning_rate': '0.000428', 'epoch': '0.4669'}
+{'loss': '2.606', 'grad_norm': '0.9686', 'learning_rate': '0.0004267', 'epoch': '0.4751'}
+{'loss': '2.61', 'grad_norm': '1.009', 'learning_rate': '0.0004253', 'epoch': '0.4833'}
+{'loss': '2.606', 'grad_norm': '1.003', 'learning_rate': '0.0004239', 'epoch': '0.4915'}
+16%|███████ | 6000/36624 [06:49<34:56, 14.61it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 178.44it/s]
+{'loss': '2.602', 'grad_norm': '0.9795', 'learning_rate': '0.0004225', 'epoch': '0.4997'}
+{'loss': '2.601', 'grad_norm': '1.023', 'learning_rate': '0.0004211', 'epoch': '0.5079'}
+{'loss': '2.596', 'grad_norm': '1.023', 'learning_rate': '0.0004197', 'epoch': '0.5161'}
+{'loss': '2.598', 'grad_norm': '0.9583', 'learning_rate': '0.0004184', 'epoch': '0.5243'}
+{'loss': '2.597', 'grad_norm': '0.9572', 'learning_rate': '0.000417', 'epoch': '0.5325'}
+18%|███████ | 6500/36624 [07:24<34:21, 14.61it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 207.81it/s]
+{'loss': '2.596', 'grad_norm': '1.056', 'learning_rate': '0.0004156', 'epoch': '0.5407'}
+{'loss': '2.594', 'grad_norm': '1.007', 'learning_rate': '0.0004142', 'epoch': '0.5488'}
+{'loss': '2.593', 'grad_norm': '0.9365', 'learning_rate': '0.0004128', 'epoch': '0.557'}
+{'loss': '2.593', 'grad_norm': '0.9879', 'learning_rate': '0.0004114', 'epoch': '0.5652'}
+{'loss': '2.594', 'grad_norm': '1.078', 'learning_rate': '0.00041', 'epoch': '0.5734'}
+19%|████████ | 7000/36624 [07:58<33:22, 14.79it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 149.60it/s]
+{'loss': '2.589', 'grad_norm': '1.011', 'learning_rate': '0.0004087', 'epoch': '0.5816'}
+{'loss': '2.585', 'grad_norm': '0.9979', 'learning_rate': '0.0004073', 'epoch': '0.5898'}
+{'loss': '2.587', 'grad_norm': '0.9675', 'learning_rate': '0.0004059', 'epoch': '0.598'}
+{'loss': '2.584', 'grad_norm': '0.9291', 'learning_rate': '0.0004045', 'epoch': '0.6062'}
+{'loss': '2.583', 'grad_norm': '0.9513', 'learning_rate': '0.0004031', 'epoch': '0.6144'}
+20%|████████ | 7500/36624 [08:32<33:15, 14.60it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 179.61it/s]
+{'loss': '2.584', 'grad_norm': '1.012', 'learning_rate': '0.0004017', 'epoch': '0.6226'}
+{'loss': '2.585', 'grad_norm': '1.012', 'learning_rate': '0.0004004', 'epoch': '0.6308'}
+{'loss': '2.578', 'grad_norm': '1.016', 'learning_rate': '0.000399', 'epoch': '0.639'}
+{'loss': '2.58', 'grad_norm': '0.994', 'learning_rate': '0.0003976', 'epoch': '0.6471'}
+{'loss': '2.578', 'grad_norm': '1.003', 'learning_rate': '0.0003962', 'epoch': '0.6553'}
+22%|█████████ | 8000/36624 [09:06<32:34, 14.64it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 178.38it/s]
+{'loss': '2.581', 'grad_norm': '1.01', 'learning_rate': '0.0003948', 'epoch': '0.6635'}
+{'loss': '2.573', 'grad_norm': '0.9192', 'learning_rate': '0.0003934', 'epoch': '0.6717'}
+{'loss': '2.577', 'grad_norm': '0.955', 'learning_rate': '0.0003921', 'epoch': '0.6799'}
+{'loss': '2.575', 'grad_norm': '1.005', 'learning_rate': '0.0003907', 'epoch': '0.6881'}
+{'loss': '2.577', 'grad_norm': '0.922', 'learning_rate': '0.0003893', 'epoch': '0.6963'}
+23%|█████████ | 8500/36624 [09:40<31:54, 14.69it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 189.22it/s]
+{'loss': '2.573', 'grad_norm': '0.9621', 'learning_rate': '0.0003879', 'epoch': '0.7045'}
+{'loss': '2.57', 'grad_norm': '0.9889', 'learning_rate': '0.0003865', 'epoch': '0.7127'}
+{'loss': '2.568', 'grad_norm': '0.9244', 'learning_rate': '0.0003851', 'epoch': '0.7209'}
+{'loss': '2.57', 'grad_norm': '1.009', 'learning_rate': '0.0003837', 'epoch': '0.7291'}
+{'loss': '2.567', 'grad_norm': '0.9754', 'learning_rate': '0.0003824', 'epoch': '0.7373'}
+25%|██████████ | 9000/36624 [10:14<31:31, 14.60it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 177.72it/s]
+{'loss': '2.57', 'grad_norm': '0.964', 'learning_rate': '0.000381', 'epoch': '0.7454'}
+{'loss': '2.567', 'grad_norm': '0.9354', 'learning_rate': '0.0003796', 'epoch': '0.7536'}
+{'loss': '2.569', 'grad_norm': '0.9461', 'learning_rate': '0.0003782', 'epoch': '0.7618'}
+{'loss': '2.565', 'grad_norm': '0.9415', 'learning_rate': '0.0003768', 'epoch': '0.77'}
+{'loss': '2.566', 'grad_norm': '0.9319', 'learning_rate': '0.0003754', 'epoch': '0.7782'}
+26%|██████████ | 9500/36624 [10:49<31:23, 14.40it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 187.52it/s]
+{'loss': '2.56', 'grad_norm': '0.917', 'learning_rate': '0.0003741', 'epoch': '0.7864'}
+{'loss': '2.562', 'grad_norm': '0.982', 'learning_rate': '0.0003727', 'epoch': '0.7946'}
+{'loss': '2.563', 'grad_norm': '0.9996', 'learning_rate': '0.0003713', 'epoch': '0.8028'}
+{'loss': '2.559', 'grad_norm': '0.9066', 'learning_rate': '0.0003699', 'epoch': '0.811'}
+{'loss': '2.562', 'grad_norm': '0.9582', 'learning_rate': '0.0003685', 'epoch': '0.8192'}
+27%|██████████ | 10000/36624 [11:23<30:09, 14.72it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 182.08it/s]
+{'loss': '2.557', 'grad_norm': '0.9477', 'learning_rate': '0.0003671', 'epoch': '0.8274'}
+{'loss': '2.56', 'grad_norm': '0.9513', 'learning_rate': '0.0003658', 'epoch': '0.8356'}
+{'loss': '2.559', 'grad_norm': '0.9462', 'learning_rate': '0.0003644', 'epoch': '0.8437'}
+{'loss': '2.558', 'grad_norm': '0.9505', 'learning_rate': '0.000363', 'epoch': '0.8519'}
+{'loss': '2.556', 'grad_norm': '0.9055', 'learning_rate': '0.0003616', 'epoch': '0.8601'}
+29%|███████████ | 10500/36624 [11:57<29:42, 14.66it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 186.19it/s]
+{'loss': '2.552', 'grad_norm': '0.9765', 'learning_rate': '0.0003602', 'epoch': '0.8683'}
+{'loss': '2.557', 'grad_norm': '0.9443', 'learning_rate': '0.0003588', 'epoch': '0.8765'}
+{'loss': '2.555', 'grad_norm': '0.8971', 'learning_rate': '0.0003574', 'epoch': '0.8847'}
+{'loss': '2.553', 'grad_norm': '0.9489', 'learning_rate': '0.0003561', 'epoch': '0.8929'}
+{'loss': '2.552', 'grad_norm': '1', 'learning_rate': '0.0003547', 'epoch': '0.9011'}
+30%|███████████ | 11000/36624 [12:31<28:47, 14.83it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 176.59it/s]
+{'loss': '2.557', 'grad_norm': '0.915', 'learning_rate': '0.0003533', 'epoch': '0.9093'}
+{'loss': '2.552', 'grad_norm': '0.911', 'learning_rate': '0.0003519', 'epoch': '0.9175'}
+{'loss': '2.554', 'grad_norm': '0.9488', 'learning_rate': '0.0003505', 'epoch': '0.9257'}
+{'loss': '2.547', 'grad_norm': '0.9326', 'learning_rate': '0.0003491', 'epoch': '0.9339'}
+{'loss': '2.555', 'grad_norm': '0.9041', 'learning_rate': '0.0003478', 'epoch': '0.942'}
+31%|████████████ | 11500/36624 [13:06<28:39, 14.61it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 193.03it/s]
+{'loss': '2.547', 'grad_norm': '0.9229', 'learning_rate': '0.0003464', 'epoch': '0.9502'}
+{'loss': '2.547', 'grad_norm': '0.9645', 'learning_rate': '0.000345', 'epoch': '0.9584'}
+{'loss': '2.548', 'grad_norm': '0.9408', 'learning_rate': '0.0003436', 'epoch': '0.9666'}
+{'loss': '2.546', 'grad_norm': '0.9032', 'learning_rate': '0.0003422', 'epoch': '0.9748'}
+{'loss': '2.549', 'grad_norm': '0.918', 'learning_rate': '0.0003408', 'epoch': '0.983'}
+33%|████████████ | 12000/36624 [13:40<28:04, 14.62it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 188.44it/s]
+{'loss': '2.547', 'grad_norm': '0.9086', 'learning_rate': '0.0003395', 'epoch': '0.9912'}
+{'loss': '2.544', 'grad_norm': '0.9125', 'learning_rate': '0.0003381', 'epoch': '0.9994'}
+{'loss': '2.541', 'grad_norm': '0.9181', 'learning_rate': '0.0003367', 'epoch': '1.008'}
+{'loss': '2.545', 'grad_norm': '0.9132', 'learning_rate': '0.0003353', 'epoch': '1.016'}
+{'loss': '2.542', 'grad_norm': '0.9156', 'learning_rate': '0.0003339', 'epoch': '1.024'}
+34%|█████████████ | 12500/36624 [14:15<27:29, 14.62it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 114.07it/s]
+{'loss': '2.538', 'grad_norm': '0.9441', 'learning_rate': '0.0003325', 'epoch': '1.032'}
+{'loss': '2.542', 'grad_norm': '0.9385', 'learning_rate': '0.0003312', 'epoch': '1.04'}
+{'loss': '2.536', 'grad_norm': '0.9842', 'learning_rate': '0.0003298', 'epoch': '1.048'}
+{'loss': '2.542', 'grad_norm': '0.9319', 'learning_rate': '0.0003284', 'epoch': '1.057'}
+{'loss': '2.537', 'grad_norm': '0.8883', 'learning_rate': '0.000327', 'epoch': '1.065'}
+35%|██████████████ | 13000/36624 [14:50<27:04, 14.54it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 171.43it/s]
+{'loss': '2.54', 'grad_norm': '0.9869', 'learning_rate': '0.0003256', 'epoch': '1.073'}
+{'loss': '2.539', 'grad_norm': '0.8919', 'learning_rate': '0.0003242', 'epoch': '1.081'}
+{'loss': '2.533', 'grad_norm': '0.9155', 'learning_rate': '0.0003228', 'epoch': '1.089'}
+{'loss': '2.537', 'grad_norm': '0.9485', 'learning_rate': '0.0003215', 'epoch': '1.098'}
+{'loss': '2.539', 'grad_norm': '0.9354', 'learning_rate': '0.0003201', 'epoch': '1.106'}
+37%|██████████████ | 13500/36624 [15:24<26:16, 14.67it/s]
+Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 199.73it/s]
|
| 200 |
+
{'loss': '2.535', 'grad_norm': '0.9028', 'learning_rate': '0.0003187', 'epoch': '1.114'}
|
| 201 |
+
{'loss': '2.533', 'grad_norm': '0.9042', 'learning_rate': '0.0003173', 'epoch': '1.122'}
|
| 202 |
+
{'loss': '2.533', 'grad_norm': '0.9192', 'learning_rate': '0.0003159', 'epoch': '1.13'}
|
| 203 |
+
{'loss': '2.533', 'grad_norm': '0.8816', 'learning_rate': '0.0003145', 'epoch': '1.139'}
|
| 204 |
+
{'loss': '2.53', 'grad_norm': '0.9064', 'learning_rate': '0.0003132', 'epoch': '1.147'}
|
| 205 |
+
38%|βββββββββββββββ | 14000/36624 [15:58<26:09, 14.42it/s]
|
| 206 |
+
Writing model shards: 100%|ββββββββββββββββββββββ| 1/1 [00:00<00:00, 174.55it/s]
|
| 207 |
+
{'loss': '2.534', 'grad_norm': '0.9424', 'learning_rate': '0.0003118', 'epoch': '1.155'}
|
| 208 |
+
{'loss': '2.53', 'grad_norm': '0.9198', 'learning_rate': '0.0003104', 'epoch': '1.163'}
|
| 209 |
+
{'loss': '2.53', 'grad_norm': '0.9234', 'learning_rate': '0.000309', 'epoch': '1.171'}
|
| 210 |
+
{'loss': '2.533', 'grad_norm': '1.027', 'learning_rate': '0.0003076', 'epoch': '1.18'}
|
| 211 |
+
{'loss': '2.531', 'grad_norm': '0.9083', 'learning_rate': '0.0003062', 'epoch': '1.188'}
|
| 212 |
+
40%|βββββββββββββββ | 14500/36624 [16:32<25:17, 14.58it/s]
|
| 213 |
+
Writing model shards: 100%|ββββββββββββββββββββββ| 1/1 [00:00<00:00, 192.31it/s]
|
| 214 |
+
{'loss': '2.53', 'grad_norm': '0.8941', 'learning_rate': '0.0003049', 'epoch': '1.196'}
|
| 215 |
+
{'loss': '2.533', 'grad_norm': '0.9395', 'learning_rate': '0.0003035', 'epoch': '1.204'}
|
| 216 |
+
{'loss': '2.53', 'grad_norm': '0.9605', 'learning_rate': '0.0003021', 'epoch': '1.212'}
|
| 217 |
+
{'loss': '2.53', 'grad_norm': '0.9029', 'learning_rate': '0.0003007', 'epoch': '1.221'}
|
| 218 |
+
{'loss': '2.529', 'grad_norm': '0.9056', 'learning_rate': '0.0002993', 'epoch': '1.229'}
|
| 219 |
+
41%|ββββββββββββββββ | 15000/36624 [17:07<24:39, 14.62it/s]
|
| 220 |
+
Writing model shards: 100%|ββββββββββββββββββββββ| 1/1 [00:00<00:00, 180.68it/s]
|
| 221 |
+
{'loss': '2.528', 'grad_norm': '0.8955', 'learning_rate': '0.0002979', 'epoch': '1.237'}
|
| 222 |
+
{'loss': '2.53', 'grad_norm': '0.9041', 'learning_rate': '0.0002965', 'epoch': '1.245'}
|
| 223 |
+
{'loss': '2.527', 'grad_norm': '0.9242', 'learning_rate': '0.0002952', 'epoch': '1.253'}
|
| 224 |
+
{'loss': '2.525', 'grad_norm': '0.9313', 'learning_rate': '0.0002938', 'epoch': '1.261'}
|
| 225 |
+
{'loss': '2.525', 'grad_norm': '0.9721', 'learning_rate': '0.0002924', 'epoch': '1.27'}
|
| 226 |
+
42%|ββββββββββββββββ | 15500/36624 [17:41<23:50, 14.77it/s]
|
| 227 |
+
Writing model shards: 100%|ββββββββββββββββββββββ| 1/1 [00:00<00:00, 195.36it/s]
|
| 228 |
+
{'loss': '2.522', 'grad_norm': '0.9043', 'learning_rate': '0.000291', 'epoch': '1.278'}
|
| 229 |
+
{'loss': '2.524', 'grad_norm': '0.9181', 'learning_rate': '0.0002896', 'epoch': '1.286'}
|
| 230 |
+
{'loss': '2.527', 'grad_norm': '0.9111', 'learning_rate': '0.0002882', 'epoch': '1.294'}
|
| 231 |
+
{'loss': '2.523', 'grad_norm': '0.9105', 'learning_rate': '0.0002869', 'epoch': '1.302'}
|
| 232 |
+
{'loss': '2.526', 'grad_norm': '1.005', 'learning_rate': '0.0002855', 'epoch': '1.311'}
|
| 233 |
+
44%|βββββββββββββββββ | 16000/36624 [18:15<23:29, 14.63it/s]
|
| 234 |
+
Writing model shards: 100%|ββββββββββββββββββββββ| 1/1 [00:00<00:00, 192.30it/s]
|
| 235 |
+
{'loss': '2.526', 'grad_norm': '0.9184', 'learning_rate': '0.0002841', 'epoch': '1.319'}
|
| 236 |
+
{'loss': '2.52', 'grad_norm': '0.8872', 'learning_rate': '0.0002827', 'epoch': '1.327'}
|
| 237 |
+
{'loss': '2.519', 'grad_norm': '0.9441', 'learning_rate': '0.0002813', 'epoch': '1.335'}
|
| 238 |
+
{'loss': '2.525', 'grad_norm': '0.9462', 'learning_rate': '0.0002799', 'epoch': '1.343'}
|
| 239 |
+
{'loss': '2.525', 'grad_norm': '0.9307', 'learning_rate': '0.0002786', 'epoch': '1.352'}
|
| 240 |
+
45%|βββββββββββββββββ | 16500/36624 [18:49<23:00, 14.58it/s]
|
| 241 |
+
Writing model shards: 100%|ββββββββββββββββββββββ| 1/1 [00:00<00:00, 184.49it/s]
|
| 242 |
+
{'loss': '2.519', 'grad_norm': '0.9708', 'learning_rate': '0.0002772', 'epoch': '1.36'}
|
| 243 |
+
{'loss': '2.522', 'grad_norm': '0.9035', 'learning_rate': '0.0002758', 'epoch': '1.368'}
|
| 244 |
+
{'loss': '2.518', 'grad_norm': '0.9394', 'learning_rate': '0.0002744', 'epoch': '1.376'}
|
| 245 |
+
{'loss': '2.521', 'grad_norm': '0.9519', 'learning_rate': '0.000273', 'epoch': '1.384'}
|
| 246 |
+
{'loss': '2.518', 'grad_norm': '0.915', 'learning_rate': '0.0002716', 'epoch': '1.393'}
|
| 247 |
+
46%|ββββββββββββββββββ | 17000/36624 [19:23<22:15, 14.69it/s]
|
| 248 |
+
Writing model shards: 100%|ββββββββββββββββββββββ| 1/1 [00:00<00:00, 188.87it/s]
|
| 249 |
+
{'loss': '2.517', 'grad_norm': '0.9166', 'learning_rate': '0.0002702', 'epoch': '1.401'}
|
| 250 |
+
{'loss': '2.513', 'grad_norm': '0.9377', 'learning_rate': '0.0002689', 'epoch': '1.409'}
|
| 251 |
+
{'loss': '2.516', 'grad_norm': '0.9178', 'learning_rate': '0.0002675', 'epoch': '1.417'}
|
| 252 |
+
{'loss': '2.519', 'grad_norm': '0.9151', 'learning_rate': '0.0002661', 'epoch': '1.425'}
|
| 253 |
+
{'loss': '2.515', 'grad_norm': '0.9612', 'learning_rate': '0.0002647', 'epoch': '1.434'}
|
| 254 |
+
48%|ββββββββββββββββββ | 17500/36624 [19:58<21:56, 14.53it/s]
|
| 255 |
+
Writing model shards: 100%|ββββββββββββββββββββββ| 1/1 [00:00<00:00, 176.02it/s]
|
| 256 |
+
{'loss': '2.519', 'grad_norm': '0.9229', 'learning_rate': '0.0002633', 'epoch': '1.442'}
|
| 257 |
+
{'loss': '2.518', 'grad_norm': '0.9195', 'learning_rate': '0.0002619', 'epoch': '1.45'}
|
| 258 |
+
{'loss': '2.514', 'grad_norm': '0.9046', 'learning_rate': '0.0002606', 'epoch': '1.458'}
|
| 259 |
+
{'loss': '2.52', 'grad_norm': '0.9383', 'learning_rate': '0.0002592', 'epoch': '1.466'}
|
| 260 |
+
{'loss': '2.516', 'grad_norm': '0.9361', 'learning_rate': '0.0002578', 'epoch': '1.474'}
|
| 261 |
+
49%|βββββββββββββββββββ | 18000/36624 [20:32<21:18, 14.57it/s]
|
| 262 |
+
Writing model shards: 100%|ββββββββββββββββββββββ| 1/1 [00:00<00:00, 184.81it/s]
|
| 263 |
+
{'loss': '2.509', 'grad_norm': '0.9623', 'learning_rate': '0.0002564', 'epoch': '1.483'}
|
| 264 |
+
{'loss': '2.511', 'grad_norm': '0.9627', 'learning_rate': '0.000255', 'epoch': '1.491'}
|
| 265 |
+
{'loss': '2.516', 'grad_norm': '0.9481', 'learning_rate': '0.0002536', 'epoch': '1.499'}
|
| 266 |
+
{'loss': '2.516', 'grad_norm': '0.9699', 'learning_rate': '0.0002523', 'epoch': '1.507'}
|
| 267 |
+
{'loss': '2.514', 'grad_norm': '0.9232', 'learning_rate': '0.0002509', 'epoch': '1.515'}
|
| 268 |
+
51%|βββββββββββββββββββ | 18500/36624 [21:06<20:39, 14.63it/s]
|
| 269 |
+
Writing model shards: 100%|ββββββββββββββββββββββ| 1/1 [00:00<00:00, 181.07it/s]
|
| 270 |
+
{'loss': '2.508', 'grad_norm': '0.8967', 'learning_rate': '0.0002495', 'epoch': '1.524'}
|
| 271 |
+
{'loss': '2.51', 'grad_norm': '0.9512', 'learning_rate': '0.0002481', 'epoch': '1.532'}
|
| 272 |
+
{'loss': '2.511', 'grad_norm': '0.9096', 'learning_rate': '0.0002467', 'epoch': '1.54'}
|
| 273 |
+
{'loss': '2.509', 'grad_norm': '0.9213', 'learning_rate': '0.0002453', 'epoch': '1.548'}
|
| 274 |
+
{'loss': '2.513', 'grad_norm': '0.9172', 'learning_rate': '0.000244', 'epoch': '1.556'}
|
| 275 |
+
52%|ββββββββββββββββββββ | 19000/36624 [21:40<20:00, 14.69it/s]
|
| 276 |
+
Writing model shards: 100%|ββββββββββββββββββββββ| 1/1 [00:00<00:00, 180.06it/s]
|
| 277 |
+
{'loss': '2.51', 'grad_norm': '0.9369', 'learning_rate': '0.0002426', 'epoch': '1.565'}
{'loss': '2.512', 'grad_norm': '0.9091', 'learning_rate': '0.0002412', 'epoch': '1.573'}
{'loss': '2.512', 'grad_norm': '0.8935', 'learning_rate': '0.0002398', 'epoch': '1.581'}
{'loss': '2.51', 'grad_norm': '0.9206', 'learning_rate': '0.0002384', 'epoch': '1.589'}
{'loss': '2.507', 'grad_norm': '0.9272', 'learning_rate': '0.000237', 'epoch': '1.597'}
53% | 19500/36624 [22:15<19:28, 14.66it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 183.45it/s]
{'loss': '2.51', 'grad_norm': '0.9499', 'learning_rate': '0.0002356', 'epoch': '1.606'}
{'loss': '2.513', 'grad_norm': '0.9095', 'learning_rate': '0.0002343', 'epoch': '1.614'}
{'loss': '2.508', 'grad_norm': '0.9086', 'learning_rate': '0.0002329', 'epoch': '1.622'}
{'loss': '2.507', 'grad_norm': '0.9389', 'learning_rate': '0.0002315', 'epoch': '1.63'}
{'loss': '2.514', 'grad_norm': '0.8963', 'learning_rate': '0.0002301', 'epoch': '1.638'}
55% | 20000/36624 [22:49<18:57, 14.61it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 174.24it/s]
{'loss': '2.506', 'grad_norm': '0.978', 'learning_rate': '0.0002287', 'epoch': '1.646'}
{'loss': '2.507', 'grad_norm': '0.9966', 'learning_rate': '0.0002273', 'epoch': '1.655'}
{'loss': '2.507', 'grad_norm': '0.9281', 'learning_rate': '0.000226', 'epoch': '1.663'}
{'loss': '2.51', 'grad_norm': '0.9063', 'learning_rate': '0.0002246', 'epoch': '1.671'}
{'loss': '2.509', 'grad_norm': '0.9708', 'learning_rate': '0.0002232', 'epoch': '1.679'}
56% | 20500/36624 [23:23<18:18, 14.67it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 182.16it/s]
{'loss': '2.505', 'grad_norm': '0.946', 'learning_rate': '0.0002218', 'epoch': '1.687'}
{'loss': '2.507', 'grad_norm': '0.9184', 'learning_rate': '0.0002204', 'epoch': '1.696'}
{'loss': '2.506', 'grad_norm': '0.9702', 'learning_rate': '0.000219', 'epoch': '1.704'}
{'loss': '2.499', 'grad_norm': '0.9535', 'learning_rate': '0.0002177', 'epoch': '1.712'}
{'loss': '2.502', 'grad_norm': '0.9017', 'learning_rate': '0.0002163', 'epoch': '1.72'}
57% | 21000/36624 [23:57<18:03, 14.42it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 190.95it/s]
{'loss': '2.509', 'grad_norm': '0.9587', 'learning_rate': '0.0002149', 'epoch': '1.728'}
{'loss': '2.504', 'grad_norm': '0.9648', 'learning_rate': '0.0002135', 'epoch': '1.737'}
{'loss': '2.503', 'grad_norm': '0.953', 'learning_rate': '0.0002121', 'epoch': '1.745'}
{'loss': '2.5', 'grad_norm': '0.9445', 'learning_rate': '0.0002107', 'epoch': '1.753'}
{'loss': '2.501', 'grad_norm': '0.9414', 'learning_rate': '0.0002093', 'epoch': '1.761'}
59% | 21500/36624 [24:32<17:10, 14.67it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 52.34it/s]
{'loss': '2.503', 'grad_norm': '0.9309', 'learning_rate': '0.000208', 'epoch': '1.769'}
{'loss': '2.502', 'grad_norm': '0.9301', 'learning_rate': '0.0002066', 'epoch': '1.778'}
{'loss': '2.504', 'grad_norm': '0.895', 'learning_rate': '0.0002052', 'epoch': '1.786'}
{'loss': '2.502', 'grad_norm': '0.9428', 'learning_rate': '0.0002038', 'epoch': '1.794'}
{'loss': '2.501', 'grad_norm': '0.9539', 'learning_rate': '0.0002024', 'epoch': '1.802'}
60% | 22000/36624 [25:06<16:28, 14.79it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 203.41it/s]
{'loss': '2.5', 'grad_norm': '0.9179', 'learning_rate': '0.000201', 'epoch': '1.81'}
{'loss': '2.501', 'grad_norm': '0.9195', 'learning_rate': '0.0001997', 'epoch': '1.819'}
{'loss': '2.499', 'grad_norm': '1.047', 'learning_rate': '0.0001983', 'epoch': '1.827'}
{'loss': '2.499', 'grad_norm': '0.931', 'learning_rate': '0.0001969', 'epoch': '1.835'}
{'loss': '2.499', 'grad_norm': '0.9269', 'learning_rate': '0.0001955', 'epoch': '1.843'}
61% | 22500/36624 [25:40<16:03, 14.65it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 194.41it/s]
{'loss': '2.501', 'grad_norm': '0.939', 'learning_rate': '0.0001941', 'epoch': '1.851'}
{'loss': '2.495', 'grad_norm': '0.9119', 'learning_rate': '0.0001927', 'epoch': '1.859'}
{'loss': '2.499', 'grad_norm': '0.9755', 'learning_rate': '0.0001914', 'epoch': '1.868'}
{'loss': '2.497', 'grad_norm': '0.9444', 'learning_rate': '0.00019', 'epoch': '1.876'}
{'loss': '2.496', 'grad_norm': '0.9551', 'learning_rate': '0.0001886', 'epoch': '1.884'}
63% | 23000/36624 [26:15<15:34, 14.58it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 183.11it/s]
{'loss': '2.5', 'grad_norm': '0.9524', 'learning_rate': '0.0001872', 'epoch': '1.892'}
{'loss': '2.502', 'grad_norm': '0.9583', 'learning_rate': '0.0001858', 'epoch': '1.9'}
{'loss': '2.497', 'grad_norm': '0.9206', 'learning_rate': '0.0001844', 'epoch': '1.909'}
{'loss': '2.495', 'grad_norm': '0.9133', 'learning_rate': '0.0001831', 'epoch': '1.917'}
{'loss': '2.491', 'grad_norm': '0.9201', 'learning_rate': '0.0001817', 'epoch': '1.925'}
64% | 23500/36624 [26:49<14:51, 14.72it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 175.74it/s]
{'loss': '2.499', 'grad_norm': '0.9536', 'learning_rate': '0.0001803', 'epoch': '1.933'}
{'loss': '2.497', 'grad_norm': '0.9332', 'learning_rate': '0.0001789', 'epoch': '1.941'}
{'loss': '2.491', 'grad_norm': '0.9358', 'learning_rate': '0.0001775', 'epoch': '1.95'}
{'loss': '2.493', 'grad_norm': '0.9568', 'learning_rate': '0.0001761', 'epoch': '1.958'}
{'loss': '2.494', 'grad_norm': '0.9585', 'learning_rate': '0.0001747', 'epoch': '1.966'}
66% | 24000/36624 [27:23<14:14, 14.78it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 178.42it/s]
{'loss': '2.496', 'grad_norm': '0.9171', 'learning_rate': '0.0001734', 'epoch': '1.974'}
{'loss': '2.495', 'grad_norm': '0.9789', 'learning_rate': '0.000172', 'epoch': '1.982'}
{'loss': '2.493', 'grad_norm': '0.9548', 'learning_rate': '0.0001706', 'epoch': '1.991'}
{'loss': '2.495', 'grad_norm': '1.021', 'learning_rate': '0.0001692', 'epoch': '1.999'}
{'loss': '2.486', 'grad_norm': '0.9158', 'learning_rate': '0.0001678', 'epoch': '2.007'}
67% | 24500/36624 [27:58<14:15, 14.17it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 42.99it/s]
{'loss': '2.494', 'grad_norm': '0.9763', 'learning_rate': '0.0001664', 'epoch': '2.015'}
{'loss': '2.495', 'grad_norm': '0.9613', 'learning_rate': '0.0001651', 'epoch': '2.023'}
{'loss': '2.489', 'grad_norm': '0.9664', 'learning_rate': '0.0001637', 'epoch': '2.031'}
{'loss': '2.498', 'grad_norm': '1.016', 'learning_rate': '0.0001623', 'epoch': '2.04'}
{'loss': '2.488', 'grad_norm': '0.9416', 'learning_rate': '0.0001609', 'epoch': '2.048'}
68% | 25000/36624 [28:33<13:16, 14.60it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 180.22it/s]
{'loss': '2.489', 'grad_norm': '0.9494', 'learning_rate': '0.0001595', 'epoch': '2.056'}
{'loss': '2.486', 'grad_norm': '0.9252', 'learning_rate': '0.0001581', 'epoch': '2.064'}
{'loss': '2.492', 'grad_norm': '0.9568', 'learning_rate': '0.0001568', 'epoch': '2.072'}
{'loss': '2.492', 'grad_norm': '0.9466', 'learning_rate': '0.0001554', 'epoch': '2.081'}
{'loss': '2.486', 'grad_norm': '0.9349', 'learning_rate': '0.000154', 'epoch': '2.089'}
70% | 25500/36624 [29:07<12:43, 14.57it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 167.93it/s]
{'loss': '2.489', 'grad_norm': '0.9689', 'learning_rate': '0.0001526', 'epoch': '2.097'}
{'loss': '2.488', 'grad_norm': '0.9909', 'learning_rate': '0.0001512', 'epoch': '2.105'}
{'loss': '2.49', 'grad_norm': '0.9703', 'learning_rate': '0.0001498', 'epoch': '2.113'}
{'loss': '2.487', 'grad_norm': '1.02', 'learning_rate': '0.0001484', 'epoch': '2.122'}
{'loss': '2.485', 'grad_norm': '1.005', 'learning_rate': '0.0001471', 'epoch': '2.13'}
71% | 26000/36624 [29:41<12:02, 14.70it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 201.60it/s]
{'loss': '2.486', 'grad_norm': '0.9853', 'learning_rate': '0.0001457', 'epoch': '2.138'}
{'loss': '2.485', 'grad_norm': '0.9916', 'learning_rate': '0.0001443', 'epoch': '2.146'}
{'loss': '2.488', 'grad_norm': '0.9691', 'learning_rate': '0.0001429', 'epoch': '2.154'}
{'loss': '2.488', 'grad_norm': '0.9773', 'learning_rate': '0.0001415', 'epoch': '2.163'}
{'loss': '2.482', 'grad_norm': '0.953', 'learning_rate': '0.0001401', 'epoch': '2.171'}
72% | 26500/36624 [30:15<11:31, 14.63it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 187.70it/s]
{'loss': '2.487', 'grad_norm': '0.9432', 'learning_rate': '0.0001388', 'epoch': '2.179'}
{'loss': '2.487', 'grad_norm': '0.962', 'learning_rate': '0.0001374', 'epoch': '2.187'}
{'loss': '2.488', 'grad_norm': '0.9646', 'learning_rate': '0.000136', 'epoch': '2.195'}
{'loss': '2.483', 'grad_norm': '0.9822', 'learning_rate': '0.0001346', 'epoch': '2.203'}
{'loss': '2.484', 'grad_norm': '0.9462', 'learning_rate': '0.0001332', 'epoch': '2.212'}
74% | 27000/36624 [30:50<11:02, 14.53it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 148.78it/s]
{'loss': '2.485', 'grad_norm': '0.983', 'learning_rate': '0.0001318', 'epoch': '2.22'}
{'loss': '2.483', 'grad_norm': '0.9827', 'learning_rate': '0.0001305', 'epoch': '2.228'}
{'loss': '2.486', 'grad_norm': '0.987', 'learning_rate': '0.0001291', 'epoch': '2.236'}
{'loss': '2.486', 'grad_norm': '1.003', 'learning_rate': '0.0001277', 'epoch': '2.244'}
{'loss': '2.487', 'grad_norm': '0.9763', 'learning_rate': '0.0001263', 'epoch': '2.253'}
75% | 27500/36624 [31:24<10:20, 14.70it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 197.76it/s]
{'loss': '2.484', 'grad_norm': '0.9718', 'learning_rate': '0.0001249', 'epoch': '2.261'}
{'loss': '2.482', 'grad_norm': '0.964', 'learning_rate': '0.0001235', 'epoch': '2.269'}
{'loss': '2.486', 'grad_norm': '0.9918', 'learning_rate': '0.0001221', 'epoch': '2.277'}
{'loss': '2.482', 'grad_norm': '0.9895', 'learning_rate': '0.0001208', 'epoch': '2.285'}
{'loss': '2.483', 'grad_norm': '0.9978', 'learning_rate': '0.0001194', 'epoch': '2.294'}
76% | 28000/36624 [31:58<10:07, 14.20it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 187.44it/s]
{'loss': '2.484', 'grad_norm': '1.008', 'learning_rate': '0.000118', 'epoch': '2.302'}
{'loss': '2.484', 'grad_norm': '1.029', 'learning_rate': '0.0001166', 'epoch': '2.31'}
{'loss': '2.482', 'grad_norm': '0.9828', 'learning_rate': '0.0001152', 'epoch': '2.318'}
{'loss': '2.484', 'grad_norm': '0.9815', 'learning_rate': '0.0001138', 'epoch': '2.326'}
{'loss': '2.484', 'grad_norm': '0.97', 'learning_rate': '0.0001125', 'epoch': '2.335'}
78% | 28500/36624 [32:32<09:16, 14.61it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 194.61it/s]
{'loss': '2.48', 'grad_norm': '1.007', 'learning_rate': '0.0001111', 'epoch': '2.343'}
{'loss': '2.477', 'grad_norm': '1.018', 'learning_rate': '0.0001097', 'epoch': '2.351'}
{'loss': '2.477', 'grad_norm': '0.945', 'learning_rate': '0.0001083', 'epoch': '2.359'}
{'loss': '2.479', 'grad_norm': '1.001', 'learning_rate': '0.0001069', 'epoch': '2.367'}
{'loss': '2.48', 'grad_norm': '0.9743', 'learning_rate': '0.0001055', 'epoch': '2.376'}
79% | 29000/36624 [33:07<08:43, 14.57it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 193.78it/s]
{'loss': '2.48', 'grad_norm': '1.005', 'learning_rate': '0.0001042', 'epoch': '2.384'}
{'loss': '2.482', 'grad_norm': '1.016', 'learning_rate': '0.0001028', 'epoch': '2.392'}
{'loss': '2.473', 'grad_norm': '0.9854', 'learning_rate': '0.0001014', 'epoch': '2.4'}
{'loss': '2.476', 'grad_norm': '0.9408', 'learning_rate': '0.0001', 'epoch': '2.408'}
{'loss': '2.475', 'grad_norm': '0.9968', 'learning_rate': '9.862e-05', 'epoch': '2.416'}
81% | 29500/36624 [33:41<08:09, 14.55it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 189.43it/s]
{'loss': '2.476', 'grad_norm': '1.016', 'learning_rate': '9.723e-05', 'epoch': '2.425'}
{'loss': '2.48', 'grad_norm': '0.9962', 'learning_rate': '9.585e-05', 'epoch': '2.433'}
{'loss': '2.478', 'grad_norm': '1.032', 'learning_rate': '9.447e-05', 'epoch': '2.441'}
{'loss': '2.477', 'grad_norm': '1.001', 'learning_rate': '9.308e-05', 'epoch': '2.449'}
{'loss': '2.477', 'grad_norm': '0.9866', 'learning_rate': '9.17e-05', 'epoch': '2.457'}
82% | 30000/36624 [34:15<07:36, 14.53it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 184.80it/s]
{'loss': '2.476', 'grad_norm': '1.028', 'learning_rate': '9.031e-05', 'epoch': '2.466'}
{'loss': '2.475', 'grad_norm': '0.9801', 'learning_rate': '8.893e-05', 'epoch': '2.474'}
{'loss': '2.48', 'grad_norm': '0.9972', 'learning_rate': '8.755e-05', 'epoch': '2.482'}
{'loss': '2.479', 'grad_norm': '0.9972', 'learning_rate': '8.616e-05', 'epoch': '2.49'}
{'loss': '2.479', 'grad_norm': '1.076', 'learning_rate': '8.478e-05', 'epoch': '2.498'}
83% | 30500/36624 [34:50<07:03, 14.47it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 156.83it/s]
{'loss': '2.472', 'grad_norm': '0.9888', 'learning_rate': '8.339e-05', 'epoch': '2.507'}
{'loss': '2.477', 'grad_norm': '1.045', 'learning_rate': '8.201e-05', 'epoch': '2.515'}
{'loss': '2.475', 'grad_norm': '1.039', 'learning_rate': '8.063e-05', 'epoch': '2.523'}
{'loss': '2.476', 'grad_norm': '1.038', 'learning_rate': '7.924e-05', 'epoch': '2.531'}
{'loss': '2.474', 'grad_norm': '1.013', 'learning_rate': '7.786e-05', 'epoch': '2.539'}
85% | 31000/36624 [35:25<06:26, 14.55it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 173.79it/s]
{'loss': '2.478', 'grad_norm': '0.9911', 'learning_rate': '7.647e-05', 'epoch': '2.548'}
{'loss': '2.475', 'grad_norm': '1.003', 'learning_rate': '7.509e-05', 'epoch': '2.556'}
{'loss': '2.476', 'grad_norm': '0.9986', 'learning_rate': '7.37e-05', 'epoch': '2.564'}
{'loss': '2.475', 'grad_norm': '1.034', 'learning_rate': '7.232e-05', 'epoch': '2.572'}
{'loss': '2.476', 'grad_norm': '0.9733', 'learning_rate': '7.094e-05', 'epoch': '2.58'}
86% | 31500/36624 [35:59<05:53, 14.51it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 181.57it/s]
{'loss': '2.468', 'grad_norm': '1.055', 'learning_rate': '6.955e-05', 'epoch': '2.588'}
{'loss': '2.475', 'grad_norm': '1.026', 'learning_rate': '6.817e-05', 'epoch': '2.597'}
{'loss': '2.476', 'grad_norm': '1.029', 'learning_rate': '6.678e-05', 'epoch': '2.605'}
{'loss': '2.471', 'grad_norm': '1.032', 'learning_rate': '6.54e-05', 'epoch': '2.613'}
{'loss': '2.473', 'grad_norm': '1.007', 'learning_rate': '6.402e-05', 'epoch': '2.621'}
87% | 32000/36624 [36:34<05:15, 14.66it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 196.78it/s]
{'loss': '2.47', 'grad_norm': '1.028', 'learning_rate': '6.263e-05', 'epoch': '2.629'}
{'loss': '2.47', 'grad_norm': '0.9969', 'learning_rate': '6.125e-05', 'epoch': '2.638'}
{'loss': '2.473', 'grad_norm': '1.037', 'learning_rate': '5.986e-05', 'epoch': '2.646'}
{'loss': '2.47', 'grad_norm': '1', 'learning_rate': '5.848e-05', 'epoch': '2.654'}
{'loss': '2.468', 'grad_norm': '1.034', 'learning_rate': '5.71e-05', 'epoch': '2.662'}
89% | 32500/36624 [37:08<04:47, 14.34it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 181.32it/s]
{'loss': '2.471', 'grad_norm': '1.073', 'learning_rate': '5.571e-05', 'epoch': '2.67'}
{'loss': '2.47', 'grad_norm': '1.003', 'learning_rate': '5.433e-05', 'epoch': '2.679'}
{'loss': '2.472', 'grad_norm': '1.033', 'learning_rate': '5.294e-05', 'epoch': '2.687'}
{'loss': '2.469', 'grad_norm': '1.076', 'learning_rate': '5.156e-05', 'epoch': '2.695'}
{'loss': '2.469', 'grad_norm': '1.061', 'learning_rate': '5.017e-05', 'epoch': '2.703'}
90% | 33000/36624 [37:43<04:11, 14.39it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 188.79it/s]
{'loss': '2.468', 'grad_norm': '1.043', 'learning_rate': '4.879e-05', 'epoch': '2.711'}
{'loss': '2.474', 'grad_norm': '1.115', 'learning_rate': '4.741e-05', 'epoch': '2.72'}
{'loss': '2.469', 'grad_norm': '1.028', 'learning_rate': '4.602e-05', 'epoch': '2.728'}
{'loss': '2.468', 'grad_norm': '1.017', 'learning_rate': '4.464e-05', 'epoch': '2.736'}
{'loss': '2.471', 'grad_norm': '0.9948', 'learning_rate': '4.325e-05', 'epoch': '2.744'}
91% | 33500/36624 [38:18<03:37, 14.39it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 155.26it/s]
{'loss': '2.466', 'grad_norm': '1.027', 'learning_rate': '4.187e-05', 'epoch': '2.752'}
{'loss': '2.467', 'grad_norm': '1.023', 'learning_rate': '4.049e-05', 'epoch': '2.761'}
{'loss': '2.47', 'grad_norm': '1.021', 'learning_rate': '3.91e-05', 'epoch': '2.769'}
{'loss': '2.468', 'grad_norm': '0.9946', 'learning_rate': '3.772e-05', 'epoch': '2.777'}
{'loss': '2.464', 'grad_norm': '1.031', 'learning_rate': '3.633e-05', 'epoch': '2.785'}
93% | 34000/36624 [38:52<03:00, 14.52it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 183.19it/s]
{'loss': '2.467', 'grad_norm': '1.05', 'learning_rate': '3.495e-05', 'epoch': '2.793'}
{'loss': '2.469', 'grad_norm': '1.043', 'learning_rate': '3.356e-05', 'epoch': '2.801'}
{'loss': '2.468', 'grad_norm': '0.9955', 'learning_rate': '3.218e-05', 'epoch': '2.81'}
{'loss': '2.461', 'grad_norm': '0.9882', 'learning_rate': '3.08e-05', 'epoch': '2.818'}
{'loss': '2.463', 'grad_norm': '1.023', 'learning_rate': '2.941e-05', 'epoch': '2.826'}
94% | 34500/36624 [39:27<02:28, 14.27it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 201.85it/s]
{'loss': '2.464', 'grad_norm': '1.062', 'learning_rate': '2.803e-05', 'epoch': '2.834'}
{'loss': '2.465', 'grad_norm': '1.065', 'learning_rate': '2.664e-05', 'epoch': '2.842'}
{'loss': '2.468', 'grad_norm': '1.01', 'learning_rate': '2.526e-05', 'epoch': '2.851'}
{'loss': '2.463', 'grad_norm': '0.9994', 'learning_rate': '2.388e-05', 'epoch': '2.859'}
{'loss': '2.464', 'grad_norm': '1.032', 'learning_rate': '2.249e-05', 'epoch': '2.867'}
96% | 35000/36624 [40:02<01:53, 14.32it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 174.93it/s]
{'loss': '2.467', 'grad_norm': '1.233', 'learning_rate': '2.111e-05', 'epoch': '2.875'}
{'loss': '2.466', 'grad_norm': '1.069', 'learning_rate': '1.972e-05', 'epoch': '2.883'}
{'loss': '2.469', 'grad_norm': '1.015', 'learning_rate': '1.834e-05', 'epoch': '2.892'}
{'loss': '2.466', 'grad_norm': '1.033', 'learning_rate': '1.696e-05', 'epoch': '2.9'}
{'loss': '2.463', 'grad_norm': '1.04', 'learning_rate': '1.557e-05', 'epoch': '2.908'}
97% | 35500/36624 [40:37<01:17, 14.42it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 184.19it/s]
{'loss': '2.469', 'grad_norm': '1.007', 'learning_rate': '1.419e-05', 'epoch': '2.916'}
{'loss': '2.468', 'grad_norm': '1.025', 'learning_rate': '1.28e-05', 'epoch': '2.924'}
{'loss': '2.465', 'grad_norm': '1.033', 'learning_rate': '1.142e-05', 'epoch': '2.933'}
{'loss': '2.464', 'grad_norm': '1.045', 'learning_rate': '1.003e-05', 'epoch': '2.941'}
{'loss': '2.464', 'grad_norm': '1.008', 'learning_rate': '8.651e-06', 'epoch': '2.949'}
98% | 36000/36624 [41:11<00:43, 14.40it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 163.65it/s]
{'loss': '2.463', 'grad_norm': '1.035', 'learning_rate': '7.267e-06', 'epoch': '2.957'}
{'loss': '2.46', 'grad_norm': '1.018', 'learning_rate': '5.883e-06', 'epoch': '2.965'}
{'loss': '2.463', 'grad_norm': '1.01', 'learning_rate': '4.498e-06', 'epoch': '2.973'}
{'loss': '2.463', 'grad_norm': '1.014', 'learning_rate': '3.114e-06', 'epoch': '2.982'}
{'loss': '2.459', 'grad_norm': '0.9757', 'learning_rate': '1.73e-06', 'epoch': '2.99'}
100% | 36500/36624 [41:46<00:08, 14.23it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 177.29it/s]
{'loss': '2.464', 'grad_norm': '1.004', 'learning_rate': '3.46e-07', 'epoch': '2.998'}
100% | 36623/36624 [41:55<00:00, 11.70it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 197.55it/s]
{'train_runtime': '2516', 'train_samples_per_second': '1863', 'train_steps_per_second': '14.56', 'train_loss': '2.575', 'epoch': '3'}
100% | 36624/36624 [41:56<00:00, 14.56it/s]
Writing model shards: 100% | 1/1 [00:00<00:00, 160.99it/s]
[*] Training finished.