kvkk_set3 / train.log
13:4: not a valid test operator: (
13:4: not a valid test operator: 535.86.10
2026-04-28 05:14:39.684201: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Traceback (most recent call last):
File "/workspace/finetune/main_chars_lstm.py", line 26, in <module>
import tensorflow as tf
File "/usr/local/lib/python3.8/dist-packages/tensorflow/__init__.py", line 101, in <module>
from tensorflow_core import *
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/__init__.py", line 28, in <module>
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "/usr/local/lib/python3.8/dist-packages/tensorflow/__init__.py", line 50, in __getattr__
module = self._load()
File "/usr/local/lib/python3.8/dist-packages/tensorflow/__init__.py", line 44, in _load
module = _importlib.import_module(self.__name__)
File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/__init__.py", line 63, in <module>
from tensorflow.python.framework.framework_lib import * # pylint: disable=redefined-builtin
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/framework_lib.py", line 52, in <module>
from tensorflow.python.framework.importer import import_graph_def
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/importer.py", line 28, in <module>
from tensorflow.python.framework import function
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/function.py", line 38, in <module>
from tensorflow.python.ops import variable_scope as vs
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 40, in <module>
from tensorflow.python.ops import init_ops
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/ops/init_ops.py", line 45, in <module>
from tensorflow.python.ops import linalg_ops_impl
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 844, in exec_module
File "<frozen importlib._bootstrap_external>", line 939, in get_code
File "<frozen importlib._bootstrap_external>", line 1038, in get_data
KeyboardInterrupt
13:4: not a valid test operator: (
13:4: not a valid test operator: 535.86.10
2026-04-28 05:14:49.627917: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:From /workspace/finetune/main_chars_lstm.py:36: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
[train] params: {"batch_size": 128, "buffer": 2000, "char_lstm_size": 25, "chars": "/workspace/finetune/data_kvkk_set3_v3/vocab.chars.txt", "dim": 50, "dim_chars": 100, "dropout": 0.5, "early_stop_max_steps": 600, "epochs": 20, "learning_rate": 0.001, "log_step_count_steps": 200, "lstm_size": 100, "min_steps": 600000, "num_oov_buckets": 1, "save_checkpoints_secs": 500, "save_summary_steps": 1000, "tags": "/workspace/finetune/data_kvkk_set3_v3/vocab.tags.txt", "trainable_embeddings": true, "vectors": "/workspace/finetune/data_kvkk_set3_v3/vectors.npz", "words": "/workspace/finetune/data_kvkk_set3_v3/vocab.words.txt"}
Using config: {'_model_dir': '/workspace/finetune/results/model', '_tf_random_seed': None, '_save_summary_steps': 1000, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 500, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 200, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f22cadd3190>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Not using Distribute Coordinator.
Running training and evaluation locally (non-distributed).
Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 500.
Calling model_fn.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
From /workspace/finetune/main_chars_lstm.py:104: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
From /workspace/finetune/main_chars_lstm.py:169: The name tf.metrics.accuracy is deprecated. Please use tf.compat.v1.metrics.accuracy instead.
From /py_packages/tf_metrics/__init__.py:152: The name tf.diag_part is deprecated. Please use tf.linalg.tensor_diag_part instead.
From /workspace/finetune/main_chars_lstm.py:175: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
From /workspace/finetune/main_chars_lstm.py:180: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.
From /workspace/finetune/main_chars_lstm.py:181: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
Done calling model_fn.
Create CheckpointSaverHook.
Graph was finalized.
2026-04-28 05:14:53.664462: I tensorflow/core/platform/profile_utils/cpu_utils.cc:109] CPU Frequency: 2000000000 Hz
2026-04-28 05:14:53.695769: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x695d390 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2026-04-28 05:14:53.695810: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2026-04-28 05:14:53.700326: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2026-04-28 05:14:53.861749: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6640640 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2026-04-28 05:14:53.861787: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA H100, Compute Capability 9.0
2026-04-28 05:14:53.862462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 05:14:53.862490: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 05:14:54.169364: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 05:14:54.198790: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 05:14:54.206171: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 05:14:54.213877: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 05:14:54.225162: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 05:14:54.226485: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 05:14:54.226851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 05:14:54.228166: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 05:14:54.233928: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 05:14:54.233952: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 05:14:54.233961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 05:14:54.234404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Running local_init_op.
Done running local_init_op.
Saving checkpoints for 0 into /workspace/finetune/results/model/model.ckpt.
2026-04-28 05:14:57.835187: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
loss = 111.06471, step = 0
global_step/sec: 15.9321
loss = 5.9110317, step = 200 (12.553 sec)
global_step/sec: 17.0678
loss = 2.240281, step = 400 (11.721 sec)
13:4: not a valid test operator: (
13:4: not a valid test operator: 535.86.10
2026-04-28 05:15:33.664526: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:From /workspace/finetune/main_chars_lstm.py:36: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
[train] params: {"batch_size": 128, "buffer": 2000, "char_lstm_size": 25, "chars": "/workspace/finetune/data_kvkk_set3_v3/vocab.chars.txt", "dim": 50, "dim_chars": 100, "dropout": 0.5, "early_stop_max_steps": 600, "epochs": 20, "learning_rate": 0.001, "log_step_count_steps": 200, "lstm_size": 100, "min_steps": 600000, "num_oov_buckets": 1, "save_checkpoints_secs": 500, "save_summary_steps": 1000, "tags": "/workspace/finetune/data_kvkk_set3_v3/vocab.tags.txt", "trainable_embeddings": true, "vectors": "/workspace/finetune/data_kvkk_set3_v3/vectors.npz", "words": "/workspace/finetune/data_kvkk_set3_v3/vocab.words.txt"}
Using config: {'_model_dir': '/workspace/finetune/results/model', '_tf_random_seed': None, '_save_summary_steps': 1000, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 500, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 200, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe5412e7190>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Not using Distribute Coordinator.
Running training and evaluation locally (non-distributed).
Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 500.
Calling model_fn.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
From /workspace/finetune/main_chars_lstm.py:104: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
From /workspace/finetune/main_chars_lstm.py:169: The name tf.metrics.accuracy is deprecated. Please use tf.compat.v1.metrics.accuracy instead.
From /py_packages/tf_metrics/__init__.py:152: The name tf.diag_part is deprecated. Please use tf.linalg.tensor_diag_part instead.
From /workspace/finetune/main_chars_lstm.py:175: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
From /workspace/finetune/main_chars_lstm.py:180: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.
From /workspace/finetune/main_chars_lstm.py:181: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
Done calling model_fn.
Create CheckpointSaverHook.
Graph was finalized.
2026-04-28 05:15:37.160482: I tensorflow/core/platform/profile_utils/cpu_utils.cc:109] CPU Frequency: 2000000000 Hz
2026-04-28 05:15:37.198169: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6a89f10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2026-04-28 05:15:37.198203: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2026-04-28 05:15:37.203160: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2026-04-28 05:15:37.389778: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x695b9c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2026-04-28 05:15:37.389819: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA H100, Compute Capability 9.0
2026-04-28 05:15:37.390462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 05:15:37.390490: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 05:15:37.703030: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 05:15:37.735401: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 05:15:37.743084: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 05:15:37.750452: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 05:15:37.762361: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 05:15:37.764124: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 05:15:37.766984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 05:15:37.768618: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 05:15:37.774465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 05:15:37.774485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 05:15:37.774492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 05:15:37.774878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-0
Running local_init_op.
Done running local_init_op.
Saving checkpoints for 0 into /workspace/finetune/results/model/model.ckpt.
2026-04-28 05:15:40.981003: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
loss = 102.289, step = 0
global_step/sec: 16.2585
loss = 5.986624, step = 200 (12.302 sec)
global_step/sec: 16.5611
loss = 2.1498528, step = 400 (12.076 sec)
global_step/sec: 16.1212
loss = 1.312039, step = 600 (12.406 sec)
global_step/sec: 16.2189
loss = 1.3136858, step = 800 (12.331 sec)
global_step/sec: 16.2345
loss = 1.3853596, step = 1000 (12.320 sec)
global_step/sec: 15.9785
loss = 1.2304411, step = 1200 (12.517 sec)
global_step/sec: 16.0121
loss = 0.7067219, step = 1400 (12.491 sec)
global_step/sec: 16.1801
loss = 0.93841916, step = 1600 (12.361 sec)
global_step/sec: 16.1054
loss = 0.65951395, step = 1800 (12.418 sec)
global_step/sec: 16.1093
loss = 0.8376456, step = 2000 (12.415 sec)
global_step/sec: 16.4475
loss = 0.45452726, step = 2200 (12.160 sec)
global_step/sec: 16.3659
loss = 0.6535889, step = 2400 (12.220 sec)
global_step/sec: 16.8136
loss = 0.54435384, step = 2600 (11.895 sec)
global_step/sec: 16.8206
loss = 0.5056425, step = 2800 (11.891 sec)
global_step/sec: 16.4608
loss = 0.54167736, step = 3000 (12.150 sec)
global_step/sec: 16.5302
loss = 0.9743586, step = 3200 (12.099 sec)
global_step/sec: 16.6723
loss = 0.29906678, step = 3400 (11.996 sec)
global_step/sec: 16.9803
loss = 0.5273884, step = 3600 (11.780 sec)
global_step/sec: 16.9109
loss = 0.51336044, step = 3800 (11.825 sec)
global_step/sec: 17.2923
loss = 0.6441746, step = 4000 (11.566 sec)
global_step/sec: 17.4957
loss = 0.34444237, step = 4200 (11.432 sec)
global_step/sec: 17.3879
loss = 0.3049839, step = 4400 (11.501 sec)
global_step/sec: 17.517
loss = 0.33523333, step = 4600 (11.418 sec)
global_step/sec: 17.1766
loss = 0.5199193, step = 4800 (11.643 sec)
global_step/sec: 17.223
loss = 0.40655118, step = 5000 (11.613 sec)
global_step/sec: 17.67
loss = 0.42372644, step = 5200 (11.319 sec)
global_step/sec: 17.67
loss = 0.08840948, step = 5400 (11.318 sec)
global_step/sec: 18.0636
loss = 0.21405059, step = 5600 (11.072 sec)
global_step/sec: 17.8529
loss = 0.5103779, step = 5800 (11.203 sec)
global_step/sec: 17.9147
loss = 0.16661471, step = 6000 (11.164 sec)
global_step/sec: 16.9307
loss = 0.24831116, step = 6200 (11.813 sec)
global_step/sec: 17.287
loss = 0.07917696, step = 6400 (11.569 sec)
global_step/sec: 17.287
loss = 0.19675535, step = 6600 (11.569 sec)
global_step/sec: 17.2728
loss = 0.45481026, step = 6800 (11.579 sec)
global_step/sec: 17.3187
loss = 0.1687935, step = 7000 (11.548 sec)
global_step/sec: 17.3065
loss = 0.10605991, step = 7200 (11.556 sec)
global_step/sec: 17.2856
loss = 0.21068978, step = 7400 (11.570 sec)
global_step/sec: 17.6764
loss = 0.10583621, step = 7600 (11.314 sec)
global_step/sec: 17.1817
loss = 0.25232738, step = 7800 (11.640 sec)
global_step/sec: 17.3914
loss = 0.5265127, step = 8000 (11.500 sec)
global_step/sec: 17.2989
loss = 0.15738869, step = 8200 (11.561 sec)
global_step/sec: 17.4263
loss = 0.13792074, step = 8400 (11.477 sec)
Saving checkpoints for 8428 into /workspace/finetune/results/model/model.ckpt.
Calling model_fn.
Done calling model_fn.
Starting evaluation at 2026-04-28T05:24:01Z
Graph was finalized.
2026-04-28 05:24:01.172161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 05:24:01.172213: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 05:24:01.172243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 05:24:01.172249: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 05:24:01.172255: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 05:24:01.172260: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 05:24:01.172265: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 05:24:01.172272: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 05:24:01.172522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 05:24:01.172549: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 05:24:01.172553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 05:24:01.172557: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 05:24:01.172822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-8428
Running local_init_op.
Done running local_init_op.
Evaluation [10/100]
Evaluation [20/100]
Evaluation [30/100]
Evaluation [40/100]
Evaluation [50/100]
Evaluation [60/100]
Evaluation [70/100]
Evaluation [80/100]
Evaluation [90/100]
Evaluation [100/100]
Finished evaluation at 2026-04-28-05:24:06
Saving dict for global step 8428: acc = 0.997914, f1 = 0.9942703, global_step = 8428, loss = 0.14689502, precision = 0.9936579, recall = 0.99488366
Saving 'checkpoint_path' summary for global step 8428: /workspace/finetune/results/model/model.ckpt-8428
global_step/sec: 11.5153
loss = 0.13722146, step = 8600 (17.368 sec)
global_step/sec: 17.6419
loss = 0.31847942, step = 8800 (11.336 sec)
global_step/sec: 17.8273
loss = 0.28958094, step = 9000 (11.219 sec)
global_step/sec: 17.5481
loss = 0.30293107, step = 9200 (11.397 sec)
global_step/sec: 17.5695
loss = 0.11271095, step = 9400 (11.383 sec)
global_step/sec: 17.8699
loss = 0.2657137, step = 9600 (11.192 sec)
global_step/sec: 17.8482
loss = 0.06779647, step = 9800 (11.205 sec)
global_step/sec: 17.8093
loss = 0.116514206, step = 10000 (11.230 sec)
global_step/sec: 17.6214
loss = 0.17775589, step = 10200 (11.350 sec)
global_step/sec: 18.1588
loss = 0.11069703, step = 10400 (11.014 sec)
global_step/sec: 18.0034
loss = 0.024612904, step = 10600 (11.109 sec)
global_step/sec: 18.1334
loss = 0.17613989, step = 10800 (11.029 sec)
global_step/sec: 18.0531
loss = 0.13324213, step = 11000 (11.079 sec)
global_step/sec: 18.0712
loss = 0.12831438, step = 11200 (11.067 sec)
global_step/sec: 17.947
loss = 0.2804634, step = 11400 (11.144 sec)
global_step/sec: 18.1069
loss = 0.2336666, step = 11600 (11.046 sec)
global_step/sec: 17.9596
loss = 0.16818011, step = 11800 (11.136 sec)
global_step/sec: 17.9847
loss = 0.062304914, step = 12000 (11.121 sec)
global_step/sec: 17.8009
loss = 0.15490824, step = 12200 (11.235 sec)
global_step/sec: 18.0888
loss = 0.102054656, step = 12400 (11.056 sec)
global_step/sec: 17.9338
loss = 0.04917127, step = 12600 (11.152 sec)
global_step/sec: 17.9571
loss = 0.07043046, step = 12800 (11.138 sec)
global_step/sec: 16.5145
loss = 0.22558558, step = 13000 (12.111 sec)
global_step/sec: 17.9296
loss = 0.0617373, step = 13200 (11.154 sec)
global_step/sec: 18.0051
loss = 0.080938876, step = 13400 (11.108 sec)
global_step/sec: 17.8975
loss = 0.049816668, step = 13600 (11.175 sec)
global_step/sec: 17.6146
loss = 0.091686785, step = 13800 (11.354 sec)
global_step/sec: 17.6869
loss = 0.09760392, step = 14000 (11.308 sec)
global_step/sec: 17.7814
loss = 0.08662301, step = 14200 (11.247 sec)
global_step/sec: 17.8851
loss = 0.12617636, step = 14400 (11.183 sec)
global_step/sec: 17.1209
loss = 0.13539296, step = 14600 (11.682 sec)
global_step/sec: 18.0841
loss = 0.21070999, step = 14800 (11.059 sec)
global_step/sec: 17.9193
loss = 0.09002066, step = 15000 (11.161 sec)
global_step/sec: 18.0027
loss = 0.07087046, step = 15200 (11.109 sec)
global_step/sec: 18.0166
loss = 0.1169194, step = 15400 (11.101 sec)
global_step/sec: 17.7961
loss = 0.094311416, step = 15600 (11.238 sec)
global_step/sec: 18.069
loss = 0.08750421, step = 15800 (11.069 sec)
global_step/sec: 17.8826
loss = 0.013639748, step = 16000 (11.184 sec)
global_step/sec: 17.9857
loss = 0.051932037, step = 16200 (11.120 sec)
global_step/sec: 17.8354
loss = 0.11339444, step = 16400 (11.214 sec)
global_step/sec: 17.8948
loss = 0.07497531, step = 16600 (11.176 sec)
global_step/sec: 17.9611
loss = 0.09614849, step = 16800 (11.135 sec)
global_step/sec: 17.984
loss = 0.044727564, step = 17000 (11.121 sec)
global_step/sec: 17.9836
loss = 0.06893462, step = 17200 (11.121 sec)
Saving checkpoints for 17244 into /workspace/finetune/results/model/model.ckpt.
Calling model_fn.
Done calling model_fn.
Starting evaluation at 2026-04-28T05:32:21Z
Graph was finalized.
2026-04-28 05:32:21.131189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 05:32:21.131230: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 05:32:21.131259: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 05:32:21.131265: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 05:32:21.131271: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 05:32:21.131276: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 05:32:21.131281: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 05:32:21.131288: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 05:32:21.131566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 05:32:21.131590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 05:32:21.131594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 05:32:21.131598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 05:32:21.131906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-17244
Running local_init_op.
Done running local_init_op.
Evaluation [10/100]
Evaluation [20/100]
Evaluation [30/100]
Evaluation [40/100]
Evaluation [50/100]
Evaluation [60/100]
Evaluation [70/100]
Evaluation [80/100]
Evaluation [90/100]
Evaluation [100/100]
Finished evaluation at 2026-04-28-05:32:25
Saving dict for global step 17244: acc = 0.9983455, f1 = 0.9955927, global_step = 17244, loss = 0.10923356, precision = 0.99532104, recall = 0.99586445
Saving 'checkpoint_path' summary for global step 17244: /workspace/finetune/results/model/model.ckpt-17244
global_step/sec: 12.2121
loss = 0.07070118, step = 17400 (16.377 sec)
global_step/sec: 17.9306
loss = 0.07339883, step = 17600 (11.154 sec)
global_step/sec: 17.8084
loss = 0.12021345, step = 17800 (11.231 sec)
global_step/sec: 17.8753
loss = 0.0967865, step = 18000 (11.189 sec)
global_step/sec: 17.6591
loss = 0.0594576, step = 18200 (11.325 sec)
global_step/sec: 17.9587
loss = 0.06613392, step = 18400 (11.137 sec)
global_step/sec: 17.6629
loss = 0.11538398, step = 18600 (11.323 sec)
global_step/sec: 17.6883
loss = 0.008395612, step = 18800 (11.307 sec)
global_step/sec: 18.0527
loss = 0.08756232, step = 19000 (11.079 sec)
global_step/sec: 17.9936
loss = 0.023801446, step = 19200 (11.115 sec)
global_step/sec: 17.7398
loss = 0.11921042, step = 19400 (11.274 sec)
global_step/sec: 17.6782
loss = 0.048651278, step = 19600 (11.314 sec)
global_step/sec: 17.7342
loss = 0.17171317, step = 19800 (11.278 sec)
global_step/sec: 17.756
loss = 0.072808385, step = 20000 (11.264 sec)
global_step/sec: 17.7842
loss = 0.08458197, step = 20200 (11.246 sec)
global_step/sec: 18.1083
loss = 0.09870511, step = 20400 (11.045 sec)
global_step/sec: 17.8018
loss = 0.014859796, step = 20600 (11.235 sec)
global_step/sec: 17.9517
loss = 0.02439493, step = 20800 (11.141 sec)
global_step/sec: 17.8295
loss = 0.063162625, step = 21000 (11.218 sec)
global_step/sec: 17.8725
loss = 0.026917815, step = 21200 (11.190 sec)
global_step/sec: 17.8992
loss = 0.13421822, step = 21400 (11.174 sec)
global_step/sec: 18.0597
loss = 0.11149919, step = 21600 (11.074 sec)
global_step/sec: 17.3746
loss = 0.019822836, step = 21800 (11.511 sec)
global_step/sec: 17.9356
loss = 0.064364076, step = 22000 (11.151 sec)
global_step/sec: 17.8118
loss = 0.02763486, step = 22200 (11.228 sec)
global_step/sec: 17.6316
loss = 0.09873259, step = 22400 (11.343 sec)
global_step/sec: 17.9541
loss = 0.07205546, step = 22600 (11.140 sec)
global_step/sec: 17.9894
loss = 0.052541614, step = 22800 (11.118 sec)
global_step/sec: 17.7515
loss = 0.07275927, step = 23000 (11.267 sec)
global_step/sec: 17.8268
loss = 0.13450235, step = 23200 (11.219 sec)
global_step/sec: 18.1835
loss = 0.016338944, step = 23400 (10.999 sec)
global_step/sec: 18.0212
loss = 0.07616836, step = 23600 (11.098 sec)
global_step/sec: 17.9681
loss = 0.0625878, step = 23800 (11.131 sec)
global_step/sec: 17.786
loss = 0.04009646, step = 24000 (11.245 sec)
global_step/sec: 17.8586
loss = 0.047451556, step = 24200 (11.199 sec)
global_step/sec: 17.7901
loss = 0.04998851, step = 24400 (11.242 sec)
global_step/sec: 17.985
loss = 0.043762565, step = 24600 (11.120 sec)
global_step/sec: 17.9799
loss = 0.06248969, step = 24800 (11.124 sec)
Saving checkpoints for 25000 into /workspace/finetune/results/model/model.ckpt.
Calling model_fn.
Done calling model_fn.
Starting evaluation at 2026-04-28T05:39:40Z
Graph was finalized.
2026-04-28 05:39:40.866675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 05:39:40.866717: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 05:39:40.866747: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 05:39:40.866754: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 05:39:40.866760: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 05:39:40.866765: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 05:39:40.866770: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 05:39:40.866777: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 05:39:40.867014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 05:39:40.867040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 05:39:40.867044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 05:39:40.867048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 05:39:40.867311: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-25000
Running local_init_op.
Done running local_init_op.
Evaluation [10/100]
Evaluation [20/100]
Evaluation [30/100]
Evaluation [40/100]
Evaluation [50/100]
Evaluation [60/100]
Evaluation [70/100]
Evaluation [80/100]
Evaluation [90/100]
Evaluation [100/100]
Finished evaluation at 2026-04-28-05:39:45
Saving dict for global step 25000: acc = 0.9985107, f1 = 0.99607116, global_step = 25000, loss = 0.11429862, precision = 0.99621725, recall = 0.9959251
Saving 'checkpoint_path' summary for global step 25000: /workspace/finetune/results/model/model.ckpt-25000
Loss for final step: 0.057276487.
Calling model_fn.
Done calling model_fn.
Graph was finalized.
2026-04-28 05:39:45.501816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 05:39:45.501861: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 05:39:45.501878: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 05:39:45.501884: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 05:39:45.501890: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 05:39:45.501896: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 05:39:45.501901: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 05:39:45.501908: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 05:39:45.502144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 05:39:45.502167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 05:39:45.502171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 05:39:45.502176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 05:39:45.502823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-25000
Running local_init_op.
Done running local_init_op.
[predict] wrote /workspace/finetune/results/score/train.preds.txt
Calling model_fn.
Done calling model_fn.
Graph was finalized.
2026-04-28 05:40:35.402702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 05:40:35.402752: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 05:40:35.402780: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 05:40:35.402786: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 05:40:35.402792: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 05:40:35.402797: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 05:40:35.402802: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 05:40:35.402809: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 05:40:35.403043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 05:40:35.403069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 05:40:35.403073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 05:40:35.403077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 05:40:35.403350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-25000
Running local_init_op.
Done running local_init_op.
[predict] wrote /workspace/finetune/results/score/testa.preds.txt
Calling model_fn.
Done calling model_fn.
Graph was finalized.
2026-04-28 05:40:42.254590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 05:40:42.254635: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 05:40:42.254654: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 05:40:42.254662: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 05:40:42.254668: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 05:40:42.254674: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 05:40:42.254679: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 05:40:42.254685: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 05:40:42.254917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 05:40:42.254938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 05:40:42.254942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 05:40:42.254947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 05:40:42.255201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-25000
Running local_init_op.
Done running local_init_op.
[predict] wrote /workspace/finetune/results/score/testb.preds.txt