kvkk_set2 / train.log
13:4: not a valid test operator: (
13:4: not a valid test operator: 535.86.10
2026-04-28 04:33:01.866542: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:From /workspace/finetune/main_chars_lstm.py:36: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
[train] params: {"batch_size": 128, "buffer": 2000, "char_lstm_size": 25, "chars": "/workspace/finetune/data_kvkk_set2_v3/vocab.chars.txt", "dim": 50, "dim_chars": 100, "dropout": 0.5, "early_stop_max_steps": 600, "epochs": 20, "learning_rate": 0.001, "log_step_count_steps": 200, "lstm_size": 100, "min_steps": 600000, "num_oov_buckets": 1, "save_checkpoints_secs": 500, "save_summary_steps": 1000, "tags": "/workspace/finetune/data_kvkk_set2_v3/vocab.tags.txt", "trainable_embeddings": true, "vectors": "/workspace/finetune/data_kvkk_set2_v3/vectors.npz", "words": "/workspace/finetune/data_kvkk_set2_v3/vocab.words.txt"}
Using config: {'_model_dir': '/workspace/finetune/results/model', '_tf_random_seed': None, '_save_summary_steps': 1000, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 500, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 200, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe07b876190>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Not using Distribute Coordinator.
Running training and evaluation locally (non-distributed).
Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 500.
Calling model_fn.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
From /workspace/finetune/main_chars_lstm.py:104: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
From /workspace/finetune/main_chars_lstm.py:169: The name tf.metrics.accuracy is deprecated. Please use tf.compat.v1.metrics.accuracy instead.
From /py_packages/tf_metrics/__init__.py:152: The name tf.diag_part is deprecated. Please use tf.linalg.tensor_diag_part instead.
From /workspace/finetune/main_chars_lstm.py:175: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
From /workspace/finetune/main_chars_lstm.py:180: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.
From /workspace/finetune/main_chars_lstm.py:181: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
Done calling model_fn.
Create CheckpointSaverHook.
Graph was finalized.
2026-04-28 04:33:05.296490: I tensorflow/core/platform/profile_utils/cpu_utils.cc:109] CPU Frequency: 2000000000 Hz
2026-04-28 04:33:05.328827: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x64d0070 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2026-04-28 04:33:05.328873: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2026-04-28 04:33:05.333954: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2026-04-28 04:33:05.523685: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x660a6c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2026-04-28 04:33:05.523735: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA H100, Compute Capability 9.0
2026-04-28 04:33:05.524596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 04:33:05.524637: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 04:33:05.859509: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 04:33:05.891835: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 04:33:05.899990: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 04:33:05.908493: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 04:33:05.920724: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 04:33:05.922071: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 04:33:05.922505: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 04:33:05.923841: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 04:33:05.929938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 04:33:05.929961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 04:33:05.929969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 04:33:05.930412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Running local_init_op.
Done running local_init_op.
Saving checkpoints for 0 into /workspace/finetune/results/model/model.ckpt.
2026-04-28 04:33:09.576112: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
loss = 117.38817, step = 0
global_step/sec: 15.9262
loss = 4.773464, step = 200 (12.558 sec)
global_step/sec: 16.4469
loss = 2.8829772, step = 400 (12.160 sec)
global_step/sec: 16.4739
loss = 1.6619699, step = 600 (12.140 sec)
global_step/sec: 17.5305
loss = 1.4286156, step = 800 (11.409 sec)
global_step/sec: 16.591
loss = 1.1506453, step = 1000 (12.055 sec)
global_step/sec: 16.3338
loss = 1.6229303, step = 1200 (12.245 sec)
global_step/sec: 16.319
loss = 1.1030784, step = 1400 (12.256 sec)
global_step/sec: 16.0311
loss = 0.48492754, step = 1600 (12.476 sec)
global_step/sec: 15.9238
loss = 0.7973819, step = 1800 (12.560 sec)
global_step/sec: 15.9882
loss = 0.5444262, step = 2000 (12.510 sec)
global_step/sec: 16.0443
loss = 0.34822357, step = 2200 (12.465 sec)
global_step/sec: 16.0029
loss = 0.46935606, step = 2400 (12.498 sec)
global_step/sec: 15.9425
loss = 0.5678671, step = 2600 (12.548 sec)
global_step/sec: 16.3265
loss = 0.5309495, step = 2800 (12.247 sec)
global_step/sec: 16.3088
loss = 0.7498473, step = 3000 (12.264 sec)
global_step/sec: 15.5706
loss = 0.6110069, step = 3200 (12.844 sec)
global_step/sec: 16.7051
loss = 0.24717766, step = 3400 (11.973 sec)
global_step/sec: 16.8543
loss = 1.3274518, step = 3600 (11.866 sec)
global_step/sec: 16.4027
loss = 0.33537441, step = 3800 (12.193 sec)
global_step/sec: 16.2972
loss = 0.29465103, step = 4000 (12.272 sec)
global_step/sec: 16.6392
loss = 0.2784903, step = 4200 (12.020 sec)
global_step/sec: 16.6742
loss = 0.2184174, step = 4400 (11.994 sec)
global_step/sec: 16.755
loss = 0.44008517, step = 4600 (11.937 sec)
global_step/sec: 17.096
loss = 0.25767207, step = 4800 (11.698 sec)
global_step/sec: 16.899
loss = 0.17028493, step = 5000 (11.835 sec)
global_step/sec: 16.7529
loss = 0.17947096, step = 5200 (11.938 sec)
global_step/sec: 17.0098
loss = 0.28755808, step = 5400 (11.758 sec)
global_step/sec: 17.141
loss = 0.25743997, step = 5600 (11.668 sec)
global_step/sec: 17.1619
loss = 0.44591618, step = 5800 (11.654 sec)
global_step/sec: 17.2099
loss = 0.44130975, step = 6000 (11.621 sec)
global_step/sec: 16.7787
loss = 0.9595022, step = 6200 (11.920 sec)
global_step/sec: 16.5132
loss = 0.17380977, step = 6400 (12.112 sec)
global_step/sec: 17.1766
loss = 0.62790394, step = 6600 (11.644 sec)
global_step/sec: 16.6513
loss = 0.115854025, step = 6800 (12.011 sec)
global_step/sec: 16.8948
loss = 0.36853623, step = 7000 (11.839 sec)
global_step/sec: 16.8772
loss = 0.15148377, step = 7200 (11.850 sec)
global_step/sec: 17.1629
loss = 0.23943508, step = 7400 (11.653 sec)
global_step/sec: 16.7432
loss = 0.13623583, step = 7600 (11.945 sec)
global_step/sec: 17.1363
loss = 0.13282633, step = 7800 (11.671 sec)
global_step/sec: 17.5276
loss = 0.0774439, step = 8000 (11.413 sec)
global_step/sec: 17.2208
loss = 0.08795112, step = 8200 (11.612 sec)
Saving checkpoints for 8284 into /workspace/finetune/results/model/model.ckpt.
Calling model_fn.
Done calling model_fn.
Starting evaluation at 2026-04-28T04:41:29Z
Graph was finalized.
2026-04-28 04:41:29.885621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 04:41:29.885671: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 04:41:29.885708: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 04:41:29.885715: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 04:41:29.885723: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 04:41:29.885730: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 04:41:29.885736: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 04:41:29.885743: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 04:41:29.886053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 04:41:29.886087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 04:41:29.886092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 04:41:29.886097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 04:41:29.886396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-8284
Running local_init_op.
Done running local_init_op.
Evaluation [10/100]
Evaluation [20/100]
Evaluation [30/100]
Evaluation [40/100]
Evaluation [50/100]
Evaluation [60/100]
Evaluation [70/100]
Evaluation [80/100]
Evaluation [90/100]
Evaluation [100/100]
Finished evaluation at 2026-04-28-04:41:34
Saving dict for global step 8284: acc = 0.99759144, f1 = 0.9933167, global_step = 8284, loss = 0.17885803, precision = 0.99258435, recall = 0.99405015
Saving 'checkpoint_path' summary for global step 8284: /workspace/finetune/results/model/model.ckpt-8284
global_step/sec: 11.4752
loss = 0.24231434, step = 8400 (17.429 sec)
global_step/sec: 17.3567
loss = 0.11804509, step = 8600 (11.523 sec)
global_step/sec: 17.0883
loss = 0.23675942, step = 8800 (11.704 sec)
global_step/sec: 17.4953
loss = 0.14836943, step = 9000 (11.432 sec)
global_step/sec: 17.495
loss = 0.1359005, step = 9200 (11.432 sec)
global_step/sec: 17.082
loss = 0.21677959, step = 9400 (11.709 sec)
global_step/sec: 17.2371
loss = 0.14764589, step = 9600 (11.603 sec)
global_step/sec: 17.2015
loss = 0.09986603, step = 9800 (11.627 sec)
global_step/sec: 17.0483
loss = 0.23195946, step = 10000 (11.732 sec)
global_step/sec: 17.4384
loss = 0.11996138, step = 10200 (11.469 sec)
global_step/sec: 17.4722
loss = 0.077587605, step = 10400 (11.447 sec)
global_step/sec: 17.2922
loss = 0.13140821, step = 10600 (11.566 sec)
global_step/sec: 17.3569
loss = 0.1550746, step = 10800 (11.523 sec)
global_step/sec: 17.3439
loss = 0.16950673, step = 11000 (11.531 sec)
global_step/sec: 17.4338
loss = 0.1085785, step = 11200 (11.472 sec)
global_step/sec: 17.5506
loss = 0.114634454, step = 11400 (11.396 sec)
global_step/sec: 17.7854
loss = 0.09087747, step = 11600 (11.245 sec)
global_step/sec: 17.3965
loss = 0.0742746, step = 11800 (11.497 sec)
global_step/sec: 17.3043
loss = 0.11870754, step = 12000 (11.558 sec)
global_step/sec: 17.6434
loss = 0.15823495, step = 12200 (11.335 sec)
global_step/sec: 17.4929
loss = 0.058683336, step = 12400 (11.433 sec)
global_step/sec: 17.6003
loss = 0.16609168, step = 12600 (11.364 sec)
global_step/sec: 17.6682
loss = 0.2326163, step = 12800 (11.320 sec)
global_step/sec: 17.6474
loss = 0.2841699, step = 13000 (11.333 sec)
global_step/sec: 17.306
loss = 0.17810643, step = 13200 (11.557 sec)
global_step/sec: 17.3409
loss = 0.049505234, step = 13400 (11.533 sec)
global_step/sec: 17.453
loss = 0.04317367, step = 13600 (11.459 sec)
global_step/sec: 17.4567
loss = 0.107658744, step = 13800 (11.457 sec)
global_step/sec: 17.7515
loss = 0.040947437, step = 14000 (11.267 sec)
global_step/sec: 17.5825
loss = 0.07045764, step = 14200 (11.375 sec)
global_step/sec: 17.4036
loss = 0.07488018, step = 14400 (11.492 sec)
global_step/sec: 17.6369
loss = 0.25159824, step = 14600 (11.340 sec)
global_step/sec: 17.7241
loss = 0.08559203, step = 14800 (11.284 sec)
global_step/sec: 17.4513
loss = 0.03275448, step = 15000 (11.461 sec)
global_step/sec: 17.5323
loss = 0.13171089, step = 15200 (11.407 sec)
global_step/sec: 17.6592
loss = 0.047302723, step = 15400 (11.326 sec)
global_step/sec: 17.6405
loss = 0.13909113, step = 15600 (11.338 sec)
global_step/sec: 17.6345
loss = 0.08797115, step = 15800 (11.342 sec)
global_step/sec: 17.8771
loss = 0.08592886, step = 16000 (11.187 sec)
global_step/sec: 17.5025
loss = 0.042692125, step = 16200 (11.427 sec)
global_step/sec: 17.5749
loss = 0.050394714, step = 16400 (11.380 sec)
global_step/sec: 17.6533
loss = 0.050355732, step = 16600 (11.329 sec)
global_step/sec: 17.6026
loss = 0.14212108, step = 16800 (11.362 sec)
Saving checkpoints for 16921 into /workspace/finetune/results/model/model.ckpt.
Calling model_fn.
Done calling model_fn.
Starting evaluation at 2026-04-28T04:49:49Z
Graph was finalized.
2026-04-28 04:49:49.668155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 04:49:49.668199: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 04:49:49.668238: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 04:49:49.668246: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 04:49:49.668254: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 04:49:49.668262: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 04:49:49.668268: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 04:49:49.668274: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 04:49:49.668591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 04:49:49.668623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 04:49:49.668628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 04:49:49.668633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 04:49:49.668917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-16921
Running local_init_op.
Done running local_init_op.
Evaluation [10/100]
Evaluation [20/100]
Evaluation [30/100]
Evaluation [40/100]
Evaluation [50/100]
Evaluation [60/100]
Evaluation [70/100]
Evaluation [80/100]
Evaluation [90/100]
Evaluation [100/100]
Finished evaluation at 2026-04-28-04:49:54
Saving dict for global step 16921: acc = 0.99804306, f1 = 0.9946128, global_step = 16921, loss = 0.13094354, precision = 0.99445856, recall = 0.9947671
Saving 'checkpoint_path' summary for global step 16921: /workspace/finetune/results/model/model.ckpt-16921
global_step/sec: 12.0425
loss = 0.069523394, step = 17000 (16.608 sec)
global_step/sec: 16.8534
loss = 0.03588748, step = 17200 (11.867 sec)
global_step/sec: 17.389
loss = 0.060875297, step = 17400 (11.502 sec)
global_step/sec: 17.5332
loss = 0.093826056, step = 17600 (11.407 sec)
global_step/sec: 17.5162
loss = 0.04980874, step = 17800 (11.418 sec)
global_step/sec: 17.5958
loss = 0.08436108, step = 18000 (11.367 sec)
global_step/sec: 17.6499
loss = 0.07031548, step = 18200 (11.332 sec)
global_step/sec: 17.4897
loss = 0.08018917, step = 18400 (11.435 sec)
global_step/sec: 17.6163
loss = 0.028933227, step = 18600 (11.353 sec)
global_step/sec: 17.3619
loss = 0.06399143, step = 18800 (11.519 sec)
global_step/sec: 17.5151
loss = 0.104281485, step = 19000 (11.419 sec)
global_step/sec: 17.5888
loss = 0.10930133, step = 19200 (11.371 sec)
global_step/sec: 17.6307
loss = 0.056066155, step = 19400 (11.344 sec)
global_step/sec: 17.7029
loss = 0.19223732, step = 19600 (11.298 sec)
global_step/sec: 17.8645
loss = 0.13925654, step = 19800 (11.195 sec)
global_step/sec: 17.6568
loss = 0.05859512, step = 20000 (11.327 sec)
global_step/sec: 17.7123
loss = 0.06567615, step = 20200 (11.291 sec)
global_step/sec: 17.8937
loss = 0.106875956, step = 20400 (11.177 sec)
global_step/sec: 17.6091
loss = 0.03964001, step = 20600 (11.358 sec)
global_step/sec: 17.4414
loss = 0.09287375, step = 20800 (11.467 sec)
global_step/sec: 17.6155
loss = 0.03273791, step = 21000 (11.354 sec)
global_step/sec: 17.8087
loss = 0.0277282, step = 21200 (11.230 sec)
global_step/sec: 17.8569
loss = 0.07149267, step = 21400 (11.200 sec)
global_step/sec: 17.9393
loss = 0.05456364, step = 21600 (11.149 sec)
global_step/sec: 17.7899
loss = 0.011094809, step = 21800 (11.242 sec)
global_step/sec: 17.5774
loss = 0.04235983, step = 22000 (11.378 sec)
global_step/sec: 17.8274
loss = 0.056289375, step = 22200 (11.218 sec)
global_step/sec: 17.8135
loss = 0.06535977, step = 22400 (11.228 sec)
global_step/sec: 17.5839
loss = 0.11343026, step = 22600 (11.374 sec)
global_step/sec: 17.8426
loss = 0.035084844, step = 22800 (11.209 sec)
global_step/sec: 17.9744
loss = 0.02807188, step = 23000 (11.127 sec)
global_step/sec: 17.6131
loss = 0.12693703, step = 23200 (11.355 sec)
global_step/sec: 17.8415
loss = 0.064061224, step = 23400 (11.210 sec)
global_step/sec: 17.7546
loss = 0.090565205, step = 23600 (11.265 sec)
global_step/sec: 17.7205
loss = 0.0683707, step = 23800 (11.286 sec)
global_step/sec: 17.8176
loss = 0.031541705, step = 24000 (11.225 sec)
global_step/sec: 17.906
loss = 0.04308015, step = 24200 (11.169 sec)
global_step/sec: 17.643
loss = 0.12729162, step = 24400 (11.336 sec)
global_step/sec: 17.5792
loss = 0.05107689, step = 24600 (11.377 sec)
global_step/sec: 17.6869
loss = 0.037356377, step = 24800 (11.308 sec)
Saving checkpoints for 25000 into /workspace/finetune/results/model/model.ckpt.
Calling model_fn.
Done calling model_fn.
Starting evaluation at 2026-04-28T04:57:32Z
Graph was finalized.
2026-04-28 04:57:32.870073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 04:57:32.870119: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 04:57:32.870154: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 04:57:32.870162: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 04:57:32.870170: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 04:57:32.870177: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 04:57:32.870184: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 04:57:32.870191: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 04:57:32.870457: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 04:57:32.870492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 04:57:32.870497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 04:57:32.870502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 04:57:32.870790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-25000
Running local_init_op.
Done running local_init_op.
Evaluation [10/100]
Evaluation [20/100]
Evaluation [30/100]
Evaluation [40/100]
Evaluation [50/100]
Evaluation [60/100]
Evaluation [70/100]
Evaluation [80/100]
Evaluation [90/100]
Evaluation [100/100]
Finished evaluation at 2026-04-28-04:57:37
Saving dict for global step 25000: acc = 0.99808574, f1 = 0.9946638, global_step = 25000, loss = 0.12328317, precision = 0.9939835, recall = 0.995345
Saving 'checkpoint_path' summary for global step 25000: /workspace/finetune/results/model/model.ckpt-25000
Loss for final step: 0.035551548.
Calling model_fn.
Done calling model_fn.
Graph was finalized.
2026-04-28 04:57:37.865806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 04:57:37.865856: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 04:57:37.865881: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 04:57:37.865889: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 04:57:37.865897: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 04:57:37.865906: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 04:57:37.865912: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 04:57:37.865920: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 04:57:37.866178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 04:57:37.866207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 04:57:37.866212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 04:57:37.866217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 04:57:37.866511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-25000
Running local_init_op.
Done running local_init_op.
[predict] wrote /workspace/finetune/results/score/train.preds.txt
Calling model_fn.
Done calling model_fn.
Graph was finalized.
2026-04-28 04:58:27.768805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 04:58:27.768847: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 04:58:27.768867: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 04:58:27.768874: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 04:58:27.768879: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 04:58:27.768884: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 04:58:27.768889: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 04:58:27.768895: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 04:58:27.769123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 04:58:27.769147: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 04:58:27.769151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 04:58:27.769155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 04:58:27.769420: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-25000
Running local_init_op.
Done running local_init_op.
[predict] wrote /workspace/finetune/results/score/testa.preds.txt
Calling model_fn.
Done calling model_fn.
Graph was finalized.
2026-04-28 04:58:34.490921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA H100 major: 9 minor: 0 memoryClockRate(GHz): 1.98
pciBusID: 0000:ad:00.0
2026-04-28 04:58:34.490967: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2026-04-28 04:58:34.491002: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2026-04-28 04:58:34.491008: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2026-04-28 04:58:34.491014: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2026-04-28 04:58:34.491020: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2026-04-28 04:58:34.491025: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2026-04-28 04:58:34.491030: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2026-04-28 04:58:34.491264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0
2026-04-28 04:58:34.491287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2026-04-28 04:58:34.491292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0
2026-04-28 04:58:34.491296: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N
2026-04-28 04:58:34.491562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60325 MB memory) -> physical GPU (device: 0, name: NVIDIA H100, pci bus id: 0000:ad:00.0, compute capability: 9.0)
Restoring parameters from /workspace/finetune/results/model/model.ckpt-25000
Running local_init_op.
Done running local_init_op.
[predict] wrote /workspace/finetune/results/score/testb.preds.txt