WARNING ⚠️ user config directory '/home/azureuser/.config/Ultralytics' is not writable, using '/tmp/Ultralytics'. Set YOLO_CONFIG_DIR to override.
Creating new Ultralytics Settings v0.0.6 file ✅
View Ultralytics Settings with 'yolo settings' or at '/tmp/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.
CUDA initialization: The NVIDIA driver on your system is too old (found version 12080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:119.)
======================================================================
FINAL TRAINING ON H100 - BALANCED DATASET
======================================================================
GPU Available: False
======================================================================
STEP 1: Downloading Datasets from Roboflow
======================================================================
Dataset 1: New helmet images (212)...
loading Roboflow workspace...
loading Roboflow project...
Downloading Dataset Version Zip in ~/helmet_212 to yolov8:: 100%|██████████| 12500/12500 [00:00<00:00, 40786.43it/s]
Extracting Dataset Version Zip to ~/helmet_212 in yolov8:: 100%|██████████| 427/427 [00:00<00:00, 12797.93it/s]
Dataset 2: No-helmet images (499)...
loading Roboflow workspace...
loading Roboflow project...
Downloading Dataset Version Zip in ~/no_helmet_499 to yolov8:: 100%|██████████| 32368/32368 [00:00<00:00, 52238.32it/s]
Extracting Dataset Version Zip to ~/no_helmet_499 in yolov8:: 100%|██████████| 1003/1003 [00:00<00:00, 11989.05it/s]
Dataset 3: With-helmet images (300)...
loading Roboflow workspace...
loading Roboflow project...
Downloading Dataset Version Zip in ~/with_helmet_300 to yolov8:: 100%|██████████| 18485/18485 [00:00<00:00, 55136.84it/s]
Extracting Dataset Version Zip to ~/with_helmet_300 in yolov8:: 100%|██████████| 605/605 [00:00<00:00, 13499.49it/s]
Dataset 4: Triple-riding (626)...
loading Roboflow workspace...
loading Roboflow project...
Downloading Dataset Version Zip in ~/triple_riding_626 to yolov8:: 100%|██████████| 30315/30315 [00:00<00:00, 55162.02it/s]
Extracting Dataset Version Zip to ~/triple_riding_626 in yolov8:: 100%|██████████| 1264/1264 [00:00<00:00, 14533.21it/s]
✅ All datasets downloaded!
======================================================================
STEP 2: Merging ALL Datasets
======================================================================
Unified classes (8): ['Helmet', 'Motorcycle', 'Rider', 'Triple Riding', 'helmet', 'more-than-2-person-on-2-wheeler', 'no helmet', 'with helmet']
Copying datasets...
helmet212: 206 images
nohelmet499: 496 images
withhelmet300: 242 images
triple626: 626 images
Final merged dataset:
train: 1335 images
valid: 186 images
test: 116 images
Config saved: /home/azureuser/final_merged_h100/data.yaml
======================================================================
STEP 3: TRAINING ON H100 (96GB VRAM!)
======================================================================
Downloading https://github.com/ultralytics/assets/releases/download/v8.4.0/yolo26m.pt to 'yolo26m.pt': 100% ━━━━━━━━━━━━ 42.2MB 320.4MB/s 0.1s
Training config:
Model: YOLO26m
Epochs: 150 (faster with H100)
Batch: -1 (auto - H100 can handle 64-128!)
Image size: 640
Classes: 8
Starting training...
Ultralytics 8.4.37 🚀 Python-3.12.3 torch-2.11.0+cu130
Traceback (most recent call last):
  File "/home/azureuser/train_h100_final.py", line 182, in <module>
    results = model.train(
              ^^^^^^^^^^^^
  File "/home/azureuser/yolo_h100_env/lib/python3.12/site-packages/ultralytics/engine/model.py", line 781, in train
    self.trainer = (trainer or self._smart_load("trainer"))(overrides=args, _callbacks=self.callbacks)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/yolo_h100_env/lib/python3.12/site-packages/ultralytics/models/yolo/detect/train.py", line 63, in __init__
    super().__init__(cfg, overrides, _callbacks)
  File "/home/azureuser/yolo_h100_env/lib/python3.12/site-packages/ultralytics/engine/trainer.py", line 128, in __init__
    self.device = select_device(self.args.device)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/yolo_h100_env/lib/python3.12/site-packages/ultralytics/utils/torch_utils.py", line 230, in select_device
    raise ValueError(
ValueError: Invalid CUDA 'device=0' requested. Use 'device=cpu' or pass valid CUDA device(s) if available, i.e. 'device=0' or 'device=0,1,2,3' for Multi-GPU.
torch.cuda.is_available(): False
torch.cuda.device_count(): 1
os.environ['CUDA_VISIBLE_DEVICES']: None
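
The failure above appears to follow from the driver warning near the top of the log: the driver reports version 12080, while the installed wheel is tagged cu130. The sketch below decodes both values and shows one way the training script could fall back to CPU instead of raising. It assumes that the driver number uses CUDA's integer encoding (1000 * major + 10 * minor) and that the wheel suffix "cuNNN" means major = all digits but the last and minor = the last digit. The helper names (decode_driver_cuda, wheel_cuda, pick_device) are hypothetical and do not come from train_h100_final.py.

```python
import re
from typing import Optional, Tuple


def decode_driver_cuda(version: int) -> Tuple[int, int]:
    """Decode CUDA's integer version encoding (1000*major + 10*minor)."""
    return version // 1000, (version % 1000) // 10


def wheel_cuda(torch_version: str) -> Optional[Tuple[int, int]]:
    """Extract the CUDA version a wheel was built for from a '+cuNNN' suffix.

    Assumption: the last digit is the minor version and the rest is the major
    version. Returns None for builds without such a suffix (for example CPU-only).
    """
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    return int(digits[:-1]), int(digits[-1])


def pick_device(cuda_available: bool, requested: str = "0") -> str:
    """Return the requested CUDA device only if CUDA is usable, else 'cpu'."""
    return requested if cuda_available else "cpu"


# Values copied from the log above.
driver = decode_driver_cuda(12080)
wheel = wheel_cuda("2.11.0+cu130")
if wheel is not None and driver < wheel:
    print(f"driver supports CUDA {driver[0]}.{driver[1]}, "
          f"wheel needs CUDA {wheel[0]}.{wheel[1]}")
```

In the training script, passing device=pick_device(torch.cuda.is_available()) to model.train(...) would avoid the ValueError, at the cost of training on CPU. To actually use the GPU, the log's own warning points to the two options: update the NVIDIA driver, or install a PyTorch build compiled for a CUDA version the current driver supports.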