vidalnt committed
Commit 8974c96 · verified · 1 Parent(s): 9c1e941

Upload 3 files

Files changed (3)
  1. checkpoint_last.pt +3 -0
  2. hydra_train.log +217 -0
  3. hydra_train2.log +0 -0
checkpoint_last.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9a96af868b72e07027aa333664743af18fc9160ecbd7c18cc0ff8748e4a92b63
+ size 1497630535
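The checkpoint is stored as a Git LFS pointer rather than the binary itself: the three-line pointer above records only the spec version, the object's SHA-256 digest, and its byte size (~1.4 GB). As a sketch, a pointer like this can be parsed with a few lines of Python (`parse_lfs_pointer` is a hypothetical helper, not part of this repo):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a git-lfs pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")  # each line is "<key> <value>"
        fields[key] = value
    fields["size"] = int(fields["size"])     # byte count of the real object
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:9a96af868b72e07027aa333664743af18fc9160ecbd7c18cc0ff8748e4a92b63
size 1497630535"""

info = parse_lfs_pointer(pointer)
print(info["size"])   # 1497630535
```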
hydra_train.log ADDED
@@ -0,0 +1,217 @@
+ [2026-02-02 23:13:56,489][fairseq.distributed.utils][INFO] - setting CUDA device=1 on rank 1
+ [2026-02-02 23:13:57,103][fairseq.distributed.utils][INFO] - distributed init (rank 0): env://
+ [2026-02-02 23:13:57,110][fairseq.distributed.utils][INFO] - initialized host a8491112a835 as rank 0
+ [2026-02-02 23:13:57,112][fairseq.distributed.utils][INFO] - distributed init (rank 1): env://
+ [2026-02-02 23:13:57,118][fairseq.distributed.utils][INFO] - initialized host a8491112a835 as rank 1
+ [2026-02-02 23:13:57,510][fairseq_cli.train][INFO] - {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': 'json', 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': 'tb_asa_finetune', 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': 'examples/dinosr', 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 2, 'distributed_num_procs': 2, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'env://', 'distributed_port': 29500, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': True, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_base_algorithm': 'localsgd', 'localsgd_frequency': 3, 'nprocs_per_node': 2, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': 
None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False, 'not_fsdp_flatten_parameters': False}, 'dataset': {'_name': None, 'num_workers': 4, 'skip_invalid_size_inputs_valid_test': True, 'max_tokens': 1400000, 'batch_size': None, 'required_batch_size_multiple': 1, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 5, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': True, 'max_tokens_valid': 1400000, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0, 'grouped_shuffling': False, 'update_epoch_batch_itr': False, 'update_ordered_indices_seed': False}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 25000, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [4], 'lr': [5e-05], 'stop_min_lr': -1.0, 'use_bmuf': False, 'skip_remainder_batch': False, 'debug_param_names': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': '/kaggle/working/dinosr_base.ckpt', 'continue_once': None, 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': True, 'reset_meters': False, 'reset_optimizer': True, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 2500, 'keep_interval_updates': 1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': 1, 'no_save': False, 'no_epoch_checkpoints': True, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 
'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': True, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 2}, 'generation': {'_name': None, 'beam': 5, 'beam_mt': 0, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'max_len_a_mt': 0.0, 'max_len_b_mt': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'lenpen_mt': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False, 'eos_token': None}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': {'_name': 'dinosr', 'extractor_mode': layer_norm, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.0, 'dropout_features': 0.0, 'final_dim': 0, 'layer_norm_first': False, 'conv_feature_layers': '[(512, 10, 5)] + [(512, 
3, 2)] * 4 + [(512,2,2)] + [(512,2,2)]', 'conv_bias': False, 'logit_temp': 0.1, 'quantize_targets': False, 'quantize_input': False, 'same_quantizer': False, 'target_glu': False, 'feature_grad_mult': 0.1, 'quantizer_depth': 1, 'quantizer_factor': 3, 'latent_vars': 320, 'latent_groups': 2, 'latent_dim': 0, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'require_same_masks': True, 'mask_dropout': 0.0, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_before': False, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'num_negatives': 100, 'negatives_from_everywhere': False, 'cross_sample_negatives': 0, 'codebook_negatives': 0, 'conv_pos': 95, 'conv_pos_groups': 16, 'pos_conv_depth': 5, 'latent_temp': [2.0, 0.5, 0.999995], 'max_positions': 100000, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'crop_seq_to_multiple': 1, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False, 'adp_num': -1, 'adp_dim': 64, 'adp_act_fn': 'relu', 'adp_trf_idx': 'all', 'discrete': True, 'codebook_size': 256, 'normal_init_codebook': False, 'codebook_init_decay': 0.9, 'codebook_end_decay': 0.9, 'codebook_end_decay_step': 0, 'freeze_teacher_step': 200001, 'freeze_pre_enc_modules': True, 'loss_beta': 0.0, 'loss_scale': None, 'average_top_k_layers': 8, 'enable_asa': True, 'layer_norm_target_layer': False, 'instance_norm_target_layer': True, 'instance_norm_targets': False, 'layer_norm_targets': False, 'batch_norm_target_layer': False, 'group_norm_target_layer': False, 'ema_decay': 0.999, 'ema_end_decay': 0.9999, 'ema_anneal_end_step': 15000, 'ema_transformer_only': True, 'ema_layers_only': True, 'max_update': '${optimization.max_update}', 'min_target_var': 0.1, 'min_pred_var': 0.01}, 'task': {'_name': 'audio_pretraining', 'data': '/kaggle/dataset/manifests', 'labels': None, 
'multi_corpus_keys': None, 'multi_corpus_sampling_weights': None, 'binarized_dataset': False, 'sample_rate': 16000, 'normalize': True, 'enable_padding': False, 'max_sample_size': 320000, 'min_sample_size': 32000, 'num_batch_buckets': 0, 'tpu': False, 'text_compression_level': none, 'rebuild_batches': True, 'precompute_mask_config': None, 'post_save_script': None, 'subsample': 1.0, 'seed': 1}, 'criterion': {'_name': 'model', 'loss_weights': {}, 'log_keys': ['ema_decay', 'target_ppl', 'pred_ppl', 'codebook_decay', 'asa_I_mu', 'asa_I_sigma', 'asa_Sigma_prime_mu', 'asa_Sigma_prime_sigma'], 'can_sum': True}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9,0.98)', 'adam_eps': 1e-06, 'weight_decay': 0.01, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [5e-05]}, 'lr_scheduler': {'_name': 'tri_stage', 'warmup_steps': 0, 'hold_steps': 0, 'decay_steps': 0, 'phase_ratio': [0.1, 0.4, 0.5], 'init_lr_scale': 0.01, 'final_lr_scale': 0.01, 'max_update': 25000.0, 'lr': [5e-05]}, 'scoring': None, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}, 'job_logging_cfg': {'version': 1, 'formatters': {'simple': {'format': '[%(asctime)s][%(name)s][%(levelname)s] - %(message)s'}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'simple', 'stream': 'ext://sys.stdout'}, 'file': {'class': 'logging.FileHandler', 'formatter': 'simple', 'filename': '/kaggle/working/fairseq/outputs/2026-02-02/23-13-56/hydra_train.log'}}, 'root': {'level': 'INFO', 'handlers': ['console', 'file']}, 'disable_existing_loggers': False}}
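For reference, the `tri_stage` scheduler in the config above (phase_ratio [0.1, 0.4, 0.5], init_lr_scale 0.01, lr 5e-05, max_update 25000) resolves to a 2500-step linear warmup from 5e-07 to the peak, a 10000-step hold, and an exponential decay over the remaining 12500 steps. A rough sketch of that schedule (an approximation of fairseq's tri_stage behavior, not its exact code) reproduces the lr values logged at num_updates 100 and 200 below:

```python
import math

def tri_stage_lr(step, peak_lr=5e-05, max_update=25000,
                 phase_ratio=(0.1, 0.4, 0.5),
                 init_lr_scale=0.01, final_lr_scale=0.01):
    """Three-phase LR: linear warmup, hold at peak, exponential decay."""
    warmup = int(max_update * phase_ratio[0])   # 2500 steps
    hold = int(max_update * phase_ratio[1])     # 10000 steps
    decay = int(max_update * phase_ratio[2])    # 12500 steps
    init_lr = init_lr_scale * peak_lr           # 5e-07
    if step < warmup:                           # phase 1: linear warmup
        return init_lr + (peak_lr - init_lr) * step / warmup
    step -= warmup
    if step < hold:                             # phase 2: hold at peak
        return peak_lr
    step -= hold
    # phase 3: exponential decay toward final_lr_scale * peak_lr
    return peak_lr * math.exp(math.log(final_lr_scale) * step / decay)

print(round(tri_stage_lr(100), 10))   # 2.48e-06, as logged at num_updates=100
print(round(tri_stage_lr(200), 10))   # 4.46e-06, as logged at num_updates=200
```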
+ [2026-02-02 23:13:57,585][dinosr.models.dinosr][INFO] - SAVC: Adversarial Style Augmentation enabled.
+ [2026-02-02 23:13:59,050][fairseq_cli.train][INFO] - DinosrModel(
+   (feature_extractor): ConvFeatureExtractionModel(
+     (conv_layers): ModuleList(
+       (0): Sequential(
+         (0): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
+         (1): Dropout(p=0.0, inplace=False)
+         (2): Sequential(
+           (0): TransposeLast()
+           (1): Fp32LayerNorm((512,), eps=1e-05, elementwise_affine=True)
+           (2): TransposeLast()
+         )
+         (3): GELU(approximate='none')
+       )
+       (1-4): 4 x Sequential(
+         (0): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
+         (1): Dropout(p=0.0, inplace=False)
+         (2): Sequential(
+           (0): TransposeLast()
+           (1): Fp32LayerNorm((512,), eps=1e-05, elementwise_affine=True)
+           (2): TransposeLast()
+         )
+         (3): GELU(approximate='none')
+       )
+       (5-6): 2 x Sequential(
+         (0): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
+         (1): Dropout(p=0.0, inplace=False)
+         (2): Sequential(
+           (0): TransposeLast()
+           (1): Fp32LayerNorm((512,), eps=1e-05, elementwise_affine=True)
+           (2): TransposeLast()
+         )
+         (3): GELU(approximate='none')
+       )
+     )
+   )
+   (asa_module): SAVC_ASA(
+     (grl): GradientReversal()
+   )
+   (post_extract_proj): Linear(in_features=512, out_features=768, bias=True)
+   (dropout_input): Dropout(p=0.0, inplace=False)
+   (dropout_features): Dropout(p=0.0, inplace=False)
+   (encoder): TransformerEncoder(
+     (pos_conv): Sequential(
+       (0): Sequential(
+         (0): Conv1d(768, 768, kernel_size=(19,), stride=(1,), padding=(9,), groups=16)
+         (1): SamePad()
+         (2): TransposeLast()
+         (3): LayerNorm((768,), eps=1e-05, elementwise_affine=False)
+         (4): TransposeLast()
+         (5): GELU(approximate='none')
+       )
+       (1): Sequential(
+         (0): Conv1d(768, 768, kernel_size=(19,), stride=(1,), padding=(9,), groups=16)
+         (1): SamePad()
+         (2): TransposeLast()
+         (3): LayerNorm((768,), eps=1e-05, elementwise_affine=False)
+         (4): TransposeLast()
+         (5): GELU(approximate='none')
+       )
+       (2): Sequential(
+         (0): Conv1d(768, 768, kernel_size=(19,), stride=(1,), padding=(9,), groups=16)
+         (1): SamePad()
+         (2): TransposeLast()
+         (3): LayerNorm((768,), eps=1e-05, elementwise_affine=False)
+         (4): TransposeLast()
+         (5): GELU(approximate='none')
+       )
+       (3): Sequential(
+         (0): Conv1d(768, 768, kernel_size=(19,), stride=(1,), padding=(9,), groups=16)
+         (1): SamePad()
+         (2): TransposeLast()
+         (3): LayerNorm((768,), eps=1e-05, elementwise_affine=False)
+         (4): TransposeLast()
+         (5): GELU(approximate='none')
+       )
+       (4): Sequential(
+         (0): Conv1d(768, 768, kernel_size=(19,), stride=(1,), padding=(9,), groups=16)
+         (1): SamePad()
+         (2): TransposeLast()
+         (3): LayerNorm((768,), eps=1e-05, elementwise_affine=False)
+         (4): TransposeLast()
+         (5): GELU(approximate='none')
+       )
+     )
+     (layers): ModuleList(
+       (0-11): 12 x TransformerSentenceEncoderLayer(
+         (self_attn): MultiheadAttention(
+           (dropout_module): FairseqDropout()
+           (k_proj): Linear(in_features=768, out_features=768, bias=True)
+           (v_proj): Linear(in_features=768, out_features=768, bias=True)
+           (q_proj): Linear(in_features=768, out_features=768, bias=True)
+           (out_proj): Linear(in_features=768, out_features=768, bias=True)
+         )
+         (dropout1): Dropout(p=0.1, inplace=False)
+         (dropout2): Dropout(p=0.0, inplace=False)
+         (dropout3): Dropout(p=0.1, inplace=False)
+         (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+         (fc1): Linear(in_features=768, out_features=3072, bias=True)
+         (fc2): Linear(in_features=3072, out_features=768, bias=True)
+         (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+       )
+     )
+     (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+   )
+   (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
+   (heads): ModuleList(
+     (0-7): 8 x Linear(in_features=768, out_features=256, bias=True)
+   )
+ )
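The conv_feature_layers string in the config above, '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] + [(512,2,2)]', is the (dim, kernel, stride) spec behind this feature extractor: the cumulative stride is 5 * 2^6 = 320 samples, i.e. one encoder frame every 20 ms of 16 kHz audio. A quick check of that arithmetic (the receptive-field recurrence is standard, not fairseq code):

```python
# conv_feature_layers from the model config: (dim, kernel, stride) per block
layers = [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] + [(512, 2, 2)]

stride = 1
receptive_field = 1
for _, kernel, s in layers:
    receptive_field += (kernel - 1) * stride  # grows by (k-1) * cumulative stride
    stride *= s

print(stride)                    # 320 samples between frames
print(1000 * stride / 16000)     # 20.0 ms hop at 16 kHz
print(receptive_field)           # 400 samples (25 ms) seen by each frame
```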
+ [2026-02-02 23:13:59,053][fairseq_cli.train][INFO] - task: AudioPretrainingTask
+ [2026-02-02 23:13:59,053][fairseq_cli.train][INFO] - model: DinosrModel
+ [2026-02-02 23:13:59,053][fairseq_cli.train][INFO] - criterion: ModelCriterion
+ [2026-02-02 23:13:59,054][fairseq_cli.train][INFO] - num. shared model params: 94,740,224 (num. trained: 94,740,224)
+ [2026-02-02 23:13:59,055][fairseq_cli.train][INFO] - num. expert model params: 0 (num. trained: 0)
+ [2026-02-02 23:13:59,171][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.1.0.bias
+ [2026-02-02 23:13:59,171][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.2.0.bias
+ [2026-02-02 23:13:59,171][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.3.0.bias
+ [2026-02-02 23:13:59,171][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.4.0.bias
+ [2026-02-02 23:13:59,171][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.5.0.bias
+ [2026-02-02 23:13:59,171][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.6.0.bias
+ [2026-02-02 23:13:59,171][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.pos_conv.0.3.weight
+ [2026-02-02 23:13:59,172][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.pos_conv.0.3.bias
+ [2026-02-02 23:13:59,172][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.pos_conv.1.3.weight
+ [2026-02-02 23:13:59,172][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.pos_conv.1.3.bias
+ [2026-02-02 23:13:59,172][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.pos_conv.2.3.weight
+ [2026-02-02 23:13:59,172][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.pos_conv.2.3.bias
+ [2026-02-02 23:13:59,172][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.pos_conv.3.3.weight
+ [2026-02-02 23:13:59,172][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.pos_conv.3.3.bias
+ [2026-02-02 23:13:59,172][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.pos_conv.4.3.weight
+ [2026-02-02 23:13:59,172][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- encoder.pos_conv.4.3.bias
+ [2026-02-02 23:13:59,240][fairseq.utils][INFO] - ***********************CUDA enviroments for all 2 workers***********************
+ [2026-02-02 23:13:59,241][fairseq.utils][INFO] - rank 0: capabilities = 7.5 ; total memory = 14.563 GB ; name = Tesla T4
+ [2026-02-02 23:13:59,241][fairseq.utils][INFO] - rank 1: capabilities = 7.5 ; total memory = 14.563 GB ; name = Tesla T4
+ [2026-02-02 23:13:59,241][fairseq.utils][INFO] - ***********************CUDA enviroments for all 2 workers***********************
+ [2026-02-02 23:13:59,241][fairseq_cli.train][INFO] - training on 2 devices (GPUs/TPUs)
+ [2026-02-02 23:13:59,241][fairseq_cli.train][INFO] - max tokens per device = 1400000 and max sentences per device = None
+ [2026-02-02 23:13:59,242][fairseq.trainer][INFO] - Preparing to load checkpoint /kaggle/working/dinosr_base.ckpt
+ [2026-02-02 23:14:01,494][dinosr.models.dinosr][INFO] - making ema teacher
+ [2026-02-02 23:14:01,836][fairseq.trainer][INFO] - Loaded checkpoint /kaggle/working/dinosr_base.ckpt (epoch 428 @ 0 updates)
+ [2026-02-02 23:14:01,884][fairseq.trainer][INFO] - loading train data for epoch 428
+ [2026-02-02 23:14:01,906][fairseq.data.audio.raw_audio_dataset][INFO] - loaded 27124, skipped 22 samples
+ [2026-02-02 23:14:01,914][fairseq.tasks.fairseq_task][INFO] - can_reuse_epoch_itr = True
+ [2026-02-02 23:14:01,915][fairseq.tasks.fairseq_task][INFO] - reuse_dataloader = True
+ [2026-02-02 23:14:01,915][fairseq.tasks.fairseq_task][INFO] - rebuild_batches = True
+ [2026-02-02 23:14:01,915][fairseq.tasks.fairseq_task][INFO] - batches will be rebuilt for each epoch
+ [2026-02-02 23:14:01,915][fairseq.tasks.fairseq_task][INFO] - creating new batches for epoch 428
+ [2026-02-02 23:14:01,969][fairseq.data.iterators][INFO] - grouped total_num_itrs = 539
+ [2026-02-02 23:14:01,972][fairseq.trainer][INFO] - begin training epoch 428
+ [2026-02-02 23:14:01,972][fairseq_cli.train][INFO] - Start iterating over samples
+ [2026-02-02 23:14:06,867][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
+ [2026-02-02 23:14:09,870][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
+ [2026-02-02 23:14:12,814][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 16.0
+ [2026-02-02 23:14:15,725][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
+ [2026-02-02 23:14:18,668][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 4.0
+ [2026-02-02 23:14:21,533][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
+ [2026-02-02 23:14:24,443][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
+ [2026-02-02 23:14:27,308][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
+ [2026-02-02 23:14:33,114][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
+ [2026-02-02 23:14:35,985][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.125
+ [2026-02-02 23:14:53,388][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0625
+ [2026-02-02 23:15:17,206][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.03125
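The run of "gradient overflow detected" messages above is fp16 dynamic loss scaling at work: each overflow skips the optimizer step and halves the scale, here twelve times from fp16_init_scale 128 down to 0.03125, and the scaler grows again only after a long window of overflow-free updates. A toy sketch of that policy (a simplification, not fairseq's actual scaler class; min_scale mirrors the config's min_loss_scale):

```python
class DynamicLossScaler:
    """Halve scale on overflow; double after `window` consecutive clean steps."""
    def __init__(self, init_scale=128.0, window=2000, min_scale=0.0001):
        self.scale = init_scale
        self.window = window
        self.min_scale = min_scale
        self._clean_steps = 0

    def update(self, overflowed: bool):
        if overflowed:
            # skip this update and shrink the scale, but never below min_scale
            self.scale = max(self.scale / 2, self.min_scale)
            self._clean_steps = 0
        else:
            self._clean_steps += 1
            if self._clean_steps % self.window == 0:
                self.scale *= 2   # recover gradually once training is stable

scaler = DynamicLossScaler()
for _ in range(12):               # twelve overflows, as in the log above
    scaler.update(overflowed=True)
print(scaler.scale)               # 0.03125
```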
+ [2026-02-02 23:19:57,349][train_inner][INFO] - {"epoch": 428, "update": 427.301, "loss": "5.304", "ntokens": "17008.6", "nsentences": "50.68", "ema_decay": "999.003", "target_ppl": "136.565", "pred_ppl": "107.323", "codebook_decay": "0.9", "sample_size": "17008.6", "asa_I_mu": "1", "asa_I_sigma": "1", "asa_Sigma_prime_mu": "0.966", "asa_Sigma_prime_sigma": "0.974", "wps": "4784.1", "ups": "0.28", "wpb": "17008.6", "bsz": "50.7", "num_updates": "100", "lr": "2.48e-06", "gnorm": "11.353", "loss_scale": "0.0312", "train_wall": "354", "gb_free": "8", "wall": "0"}
+ [2026-02-02 23:25:25,450][train_inner][INFO] - {"epoch": 428, "update": 427.486, "loss": "4.89", "ntokens": "16939.6", "nsentences": "49.89", "ema_decay": "999.01", "target_ppl": "135.689", "pred_ppl": "147.797", "codebook_decay": "0.9", "sample_size": "16939.6", "asa_I_mu": "1", "asa_I_sigma": "1", "asa_Sigma_prime_mu": "0.966", "asa_Sigma_prime_sigma": "0.974", "wps": "5163", "ups": "0.3", "wpb": "16939.6", "bsz": "49.9", "num_updates": "200", "lr": "4.46e-06", "gnorm": "2.632", "loss_scale": "0.0312", "train_wall": "327", "gb_free": "8.2", "wall": "0"}
+ [2026-02-02 23:25:45,345][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.015625
+ [2026-02-02 23:30:56,900][train_inner][INFO] - {"epoch": 428, "update": 427.673, "loss": "4.766", "ntokens": "16942.9", "nsentences": "51.14", "ema_decay": "999.017", "target_ppl": "134.889", "pred_ppl": "150.375", "codebook_decay": "0.9", "sample_size": "16942.9", "asa_I_mu": "1", "asa_I_sigma": "1", "asa_Sigma_prime_mu": "0.966", "asa_Sigma_prime_sigma": "0.974", "wps": "5111.8", "ups": "0.3", "wpb": "16942.9", "bsz": "51.1", "num_updates": "300", "lr": "6.44e-06", "gnorm": "2.172", "loss_scale": "0.0156", "train_wall": "330", "gb_free": "7.8", "wall": "0"}
+ [2026-02-02 23:36:23,526][train_inner][INFO] - {"epoch": 428, "update": 427.859, "loss": "4.642", "ntokens": "16985.2", "nsentences": "49.97", "ema_decay": "999.023", "target_ppl": "133.678", "pred_ppl": "149.727", "codebook_decay": "0.9", "sample_size": "16985.2", "asa_I_mu": "1", "asa_I_sigma": "1", "asa_Sigma_prime_mu": "0.966", "asa_Sigma_prime_sigma": "0.974", "wps": "5200.3", "ups": "0.31", "wpb": "16985.2", "bsz": "50", "num_updates": "400", "lr": "8.42e-06", "gnorm": "1.829", "loss_scale": "0.0156", "train_wall": "325", "gb_free": "8.3", "wall": "0"}
+ [2026-02-02 23:40:31,854][fairseq.checkpoint_utils][INFO] - Preparing to save checkpoint for epoch 428 @ 476 updates
+ [2026-02-02 23:40:31,855][fairseq.trainer][INFO] - Saving checkpoint to /kaggle/working/fairseq/outputs/2026-02-02/23-13-56/checkpoints/checkpoint_last.pt
+ [2026-02-02 23:40:34,649][fairseq.trainer][INFO] - Finished saving checkpoint to /kaggle/working/fairseq/outputs/2026-02-02/23-13-56/checkpoints/checkpoint_last.pt
+ [2026-02-02 23:40:34,650][fairseq.checkpoint_utils][INFO] - Saved checkpoint checkpoints/checkpoint_last.pt (epoch 428 @ 476 updates, score None) (writing took 2.7960509399999864 seconds)
+ [2026-02-02 23:40:34,652][fairseq_cli.train][INFO] - end of epoch 428 (average epoch stats below)
+ [2026-02-02 23:40:34,655][train][INFO] - {"epoch": 428, "train_loss": "2.242", "train_ntokens": "39807.8", "train_nsentences": "124.018", "train_ema_decay": "999.304", "train_target_ppl": "139.531", "train_pred_ppl": "143.484", "train_codebook_decay": "0.929", "train_sample_size": "16977.8", "train_asa_I_mu": "1", "train_asa_I_sigma": "1", "train_asa_Sigma_prime_mu": "0.966", "train_asa_Sigma_prime_sigma": "0.974", "train_wps": "14442.6", "train_ups": "0.36", "train_wpb": "39807.8", "train_bsz": "124", "train_num_updates": "476", "train_lr": "9.9248e-06", "train_gnorm": "3.135", "train_loss_scale": "0.0156", "train_train_wall": "1838", "train_gb_free": "8.1", "train_wall": "0"}
+ [2026-02-02 23:40:34,656][fairseq.tasks.fairseq_task][INFO] - can_reuse_epoch_itr = True
+ [2026-02-02 23:40:34,694][fairseq.data.iterators][INFO] - grouped total_num_itrs = 539
+ [2026-02-02 23:40:34,697][fairseq.trainer][INFO] - begin training epoch 428
+ [2026-02-02 23:40:34,697][fairseq_cli.train][INFO] - Start iterating over samples
+ [2026-02-02 23:41:54,466][train_inner][INFO] - {"epoch": 428, "update": 427.045, "loss": "4.507", "ntokens": "17042.7", "nsentences": "50.48", "ema_decay": "999.03", "target_ppl": "133.149", "pred_ppl": "148.784", "codebook_decay": "0.9", "sample_size": "17042.7", "asa_I_mu": "1", "asa_I_sigma": "1", "asa_Sigma_prime_mu": "0.966", "asa_Sigma_prime_sigma": "0.974", "wps": "5149.8", "ups": "0.3", "wpb": "17042.7", "bsz": "50.5", "num_updates": "500", "lr": "1.04e-05", "gnorm": "1.391", "loss_scale": "0.0156", "train_wall": "327", "gb_free": "8.2", "wall": "0"}
+ [2026-02-02 23:47:20,798][train_inner][INFO] - {"epoch": 428, "update": 427.23, "loss": "4.376", "ntokens": "16964.8", "nsentences": "50.33", "ema_decay": "999.037", "target_ppl": "132.864", "pred_ppl": "148.479", "codebook_decay": "0.9", "sample_size": "16964.8", "asa_I_mu": "1", "asa_I_sigma": "1", "asa_Sigma_prime_mu": "0.966", "asa_Sigma_prime_sigma": "0.974", "wps": "5198.7", "ups": "0.31", "wpb": "16964.8", "bsz": "50.3", "num_updates": "600", "lr": "1.238e-05", "gnorm": "1.965", "loss_scale": "0.0156", "train_wall": "325", "gb_free": "8.2", "wall": "0"}
+ [2026-02-02 23:52:48,491][train_inner][INFO] - {"epoch": 428, "update": 427.416, "loss": "4.266", "ntokens": "16981.7", "nsentences": "50.23", "ema_decay": "999.043", "target_ppl": "132.45", "pred_ppl": "147.933", "codebook_decay": "0.9", "sample_size": "16981.7", "asa_I_mu": "1.001", "asa_I_sigma": "1", "asa_Sigma_prime_mu": "0.966", "asa_Sigma_prime_sigma": "0.973", "wps": "5182.2", "ups": "0.31", "wpb": "16981.7", "bsz": "50.2", "num_updates": "700", "lr": "1.436e-05", "gnorm": "1.478", "loss_scale": "0.0156", "train_wall": "327", "gb_free": "7.9", "wall": "0"}
+ [2026-02-02 23:58:17,112][train_inner][INFO] - {"epoch": 428, "update": 427.601, "loss": "4.146", "ntokens": "16925.1", "nsentences": "50.96", "ema_decay": "999.05", "target_ppl": "131.881", "pred_ppl": "147.057", "codebook_decay": "0.9", "sample_size": "16925.1", "asa_I_mu": "1.001", "asa_I_sigma": "1", "asa_Sigma_prime_mu": "0.965", "asa_Sigma_prime_sigma": "0.973", "wps": "5150.4", "ups": "0.3", "wpb": "16925.1", "bsz": "51", "num_updates": "800", "lr": "1.634e-05", "gnorm": "1.261", "loss_scale": "0.0156", "train_wall": "328", "gb_free": "8", "wall": "0"}
+ [2026-02-02 23:59:32,435][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0078125
+ [2026-02-03 00:03:48,216][train_inner][INFO] - {"epoch": 428, "update": 427.788, "loss": "4.024", "ntokens": "16946.3", "nsentences": "50.56", "ema_decay": "999.057", "target_ppl": "131.805", "pred_ppl": "146.525", "codebook_decay": "0.9", "sample_size": "16946.3", "asa_I_mu": "1.001", "asa_I_sigma": "1.001", "asa_Sigma_prime_mu": "0.965", "asa_Sigma_prime_sigma": "0.972", "wps": "5118.2", "ups": "0.3", "wpb": "16946.3", "bsz": "50.6", "num_updates": "900", "lr": "1.832e-05", "gnorm": "1.625", "loss_scale": "0.0078", "train_wall": "330", "gb_free": "8", "wall": "0"}
+ [2026-02-03 00:09:13,939][train_inner][INFO] - {"epoch": 428, "update": 427.974, "loss": "3.898", "ntokens": "16979.8", "nsentences": "49.99", "ema_decay": "999.063", "target_ppl": "131.923", "pred_ppl": "146.114", "codebook_decay": "0.9", "sample_size": "16979.8", "asa_I_mu": "1.001", "asa_I_sigma": "1.001", "asa_Sigma_prime_mu": "0.964", "asa_Sigma_prime_sigma": "0.971", "wps": "5213.1", "ups": "0.31", "wpb": "16979.8", "bsz": "50", "num_updates": "1000", "lr": "2.03e-05", "gnorm": "1.406", "loss_scale": "0.0078", "train_wall": "325", "gb_free": "8", "wall": "0"}
+ [2026-02-03 00:09:57,404][fairseq.checkpoint_utils][INFO] - Preparing to save checkpoint for epoch 428 @ 1014 updates
+ [2026-02-03 00:09:57,405][fairseq.trainer][INFO] - Saving checkpoint to /kaggle/working/fairseq/outputs/2026-02-02/23-13-56/checkpoints/checkpoint_last.pt
+ [2026-02-03 00:10:01,186][fairseq.trainer][INFO] - Finished saving checkpoint to /kaggle/working/fairseq/outputs/2026-02-02/23-13-56/checkpoints/checkpoint_last.pt
+ [2026-02-03 00:10:01,187][fairseq.checkpoint_utils][INFO] - Saved checkpoint checkpoints/checkpoint_last.pt (epoch 428 @ 1014 updates, score None) (writing took 3.782398663999629 seconds)
+ [2026-02-03 00:10:01,188][fairseq_cli.train][INFO] - end of epoch 428 (average epoch stats below)
+ [2026-02-03 00:10:01,190][train][INFO] - {"epoch": 428, "train_loss": "4.147", "train_ntokens": "16951.5", "train_nsentences": "50.3364", "train_ema_decay": "999.05", "train_target_ppl": "132.223", "train_pred_ppl": "147.242", "train_codebook_decay": "0.9", "train_sample_size": "16951.5", "train_asa_I_mu": "1.001", "train_asa_I_sigma": "1", "train_asa_Sigma_prime_mu": "0.965", "train_asa_Sigma_prime_sigma": "0.973", "train_wps": "5162.6", "train_ups": "0.3", "train_wpb": "16951.5", "train_bsz": "50.3", "train_num_updates": "1014", "train_lr": "2.05772e-05", "train_gnorm": "1.539", "train_loss_scale": "0.0078", "train_train_wall": "1757", "train_gb_free": "8.3", "train_wall": "0"}
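Because the run uses log_format json, the train_inner and train lines above are machine-readable: each is a timestamped prefix followed by one JSON object whose values are strings. A small sketch (hypothetical helper, not part of fairseq) for extracting loss and lr per update from such lines:

```python
import json
import re

def parse_train_inner(lines):
    """Yield (num_updates, loss, lr) from fairseq json-format train_inner log lines."""
    pattern = re.compile(r"\[train_inner\]\[INFO\] - (\{.*\})")
    for line in lines:
        m = pattern.search(line)
        if not m:
            continue                      # skip non-metric lines
        rec = json.loads(m.group(1))      # all values are logged as strings
        yield int(rec["num_updates"]), float(rec["loss"]), float(rec["lr"])

sample = ('[2026-02-02 23:19:57,349][train_inner][INFO] - '
          '{"num_updates": "100", "loss": "5.304", "lr": "2.48e-06"}')
print(list(parse_train_inner([sample])))   # [(100, 5.304, 2.48e-06)]
```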
+ [2026-02-03 00:10:01,191][fairseq.tasks.fairseq_task][INFO] - can_reuse_epoch_itr = True
+ [2026-02-03 00:10:01,225][fairseq.tasks.fairseq_task][INFO] - creating new batches for epoch 429
+ [2026-02-03 00:10:01,249][fairseq.data.iterators][INFO] - grouped total_num_itrs = 539
+ [2026-02-03 00:10:01,252][fairseq.trainer][INFO] - begin training epoch 429
+ [2026-02-03 00:10:01,252][fairseq_cli.train][INFO] - Start iterating over samples
+ [2026-02-03 00:14:42,597][train_inner][INFO] - {"epoch": 429, "update": 428.16, "loss": "3.755", "ntokens": "16890", "nsentences": "48", "ema_decay": "999.07", "target_ppl": "131.975", "pred_ppl": "145.727", "codebook_decay": "0.9", "sample_size": "16890", "asa_I_mu": "1.002", "asa_I_sigma": "1.001", "asa_Sigma_prime_mu": "0.963", "asa_Sigma_prime_sigma": "0.971", "wps": "5139.2", "ups": "0.3", "wpb": "16890", "bsz": "48", "num_updates": "1100", "lr": "2.228e-05", "gnorm": "1.622", "loss_scale": "0.0078", "train_wall": "324", "gb_free": "8.1", "wall": "0"}
+ [2026-02-03 00:20:08,394][train_inner][INFO] - {"epoch": 429, "update": 428.345, "loss": "3.629", "ntokens": "16926.7", "nsentences": "51.43", "ema_decay": "999.077", "target_ppl": "131.592", "pred_ppl": "145.048", "codebook_decay": "0.9", "sample_size": "16926.7", "asa_I_mu": "1.002", "asa_I_sigma": "1.002", "asa_Sigma_prime_mu": "0.962", "asa_Sigma_prime_sigma": "0.97", "wps": "5195.6", "ups": "0.31", "wpb": "16926.7", "bsz": "51.4", "num_updates": "1200", "lr": "2.426e-05", "gnorm": "1.542", "loss_scale": "0.0078", "train_wall": "325", "gb_free": "8.1", "wall": "0"}
+ [2026-02-03 00:25:36,806][train_inner][INFO] - {"epoch": 429, "update": 428.531, "loss": "3.465", "ntokens": "16967.5", "nsentences": "50.38", "ema_decay": "999.083", "target_ppl": "132.038", "pred_ppl": "144.868", "codebook_decay": "0.9", "sample_size": "16967.5", "asa_I_mu": "1.003", "asa_I_sigma": "1.002", "asa_Sigma_prime_mu": "0.961", "asa_Sigma_prime_sigma": "0.969", "wps": "5166.6", "ups": "0.3", "wpb": "16967.5", "bsz": "50.4", "num_updates": "1300", "lr": "2.624e-05", "gnorm": "1.577", "loss_scale": "0.0078", "train_wall": "327", "gb_free": "8", "wall": "0"}
+ [2026-02-03 00:31:04,453][train_inner][INFO] - {"epoch": 429, "update": 428.716, "loss": "3.341", "ntokens": "16964.5", "nsentences": "50.09", "ema_decay": "999.09", "target_ppl": "131.531", "pred_ppl": "144.034", "codebook_decay": "0.9", "sample_size": "16964.5", "asa_I_mu": "1.003", "asa_I_sigma": "1.003", "asa_Sigma_prime_mu": "0.96", "asa_Sigma_prime_sigma": "0.969", "wps": "5177.7", "ups": "0.31", "wpb": "16964.5", "bsz": "50.1", "num_updates": "1400", "lr": "2.822e-05", "gnorm": "1.746", "loss_scale": "0.0078", "train_wall": "327", "gb_free": "8.1", "wall": "0"}
+ [2026-02-03 00:36:30,253][train_inner][INFO] - {"epoch": 429, "update": 428.902, "loss": "3.224", "ntokens": "16992.8", "nsentences": "51.68", "ema_decay": "999.097", "target_ppl": "131.978", "pred_ppl": "144.071", "codebook_decay": "0.9", "sample_size": "16992.8", "asa_I_mu": "1.004", "asa_I_sigma": "1.004", "asa_Sigma_prime_mu": "0.959", "asa_Sigma_prime_sigma": "0.968", "wps": "5215.7", "ups": "0.31", "wpb": "16992.8", "bsz": "51.7", "num_updates": "1500", "lr": "3.02e-05", "gnorm": "1.647", "loss_scale": "0.0078", "train_wall": "325", "gb_free": "7.9", "wall": "0"}
+ [2026-02-03 00:39:22,389][fairseq.checkpoint_utils][INFO] - Preparing to save checkpoint for epoch 429 @ 1553 updates
+ [2026-02-03 00:39:22,390][fairseq.trainer][INFO] - Saving checkpoint to /kaggle/working/fairseq/outputs/2026-02-02/23-13-56/checkpoints/checkpoint_last.pt
+ [2026-02-03 00:39:25,941][fairseq.trainer][INFO] - Finished saving checkpoint to /kaggle/working/fairseq/outputs/2026-02-02/23-13-56/checkpoints/checkpoint_last.pt
+ [2026-02-03 00:39:25,942][fairseq.checkpoint_utils][INFO] - Saved checkpoint checkpoints/checkpoint_last.pt (epoch 429 @ 1553 updates, score None) (writing took 3.553410245000123 seconds)
+ [2026-02-03 00:39:25,943][fairseq_cli.train][INFO] - end of epoch 429 (average epoch stats below)
+ [2026-02-03 00:39:25,951][train][INFO] - {"epoch": 429, "train_loss": "3.437", "train_ntokens": "16952.7", "train_nsentences": "50.3228", "train_ema_decay": "999.086", "train_target_ppl": "131.832", "train_pred_ppl": "144.62", "train_codebook_decay": "0.9", "train_sample_size": "16952.7", "train_asa_I_mu": "1.003", "train_asa_I_sigma": "1.003", "train_asa_Sigma_prime_mu": "0.961", "train_asa_Sigma_prime_sigma": "0.969", "train_wps": "5177.8", "train_ups": "0.31", "train_wpb": "16952.7", "train_bsz": "50.3", "train_num_updates": "1553", "train_lr": "3.12494e-05", "train_gnorm": "1.636", "train_loss_scale": "0.0078", "train_train_wall": "1755", "train_gb_free": "8", "train_wall": "0"}
+ [2026-02-03 00:39:25,952][fairseq.tasks.fairseq_task][INFO] - can_reuse_epoch_itr = True
+ [2026-02-03 00:39:25,984][fairseq.tasks.fairseq_task][INFO] - creating new batches for epoch 430
+ [2026-02-03 00:39:26,007][fairseq.data.iterators][INFO] - grouped total_num_itrs = 539
+ [2026-02-03 00:39:26,010][fairseq.trainer][INFO] - begin training epoch 430
+ [2026-02-03 00:39:26,010][fairseq_cli.train][INFO] - Start iterating over samples
hydra_train2.log ADDED
The diff for this file is too large to render. See raw diff