3TF-14B / eval_lighteval|aime24|0.log
INFO 10-26 02:40:28 [__init__.py:235] Automatically detected platform cuda.
[2025-10-26 02:40:30,553] [ INFO]: --- INIT SEEDS --- (pipeline.py:249)
[2025-10-26 02:40:30,554] [ INFO]: --- LOADING TASKS --- (pipeline.py:210)
[2025-10-26 02:40:30,558] [ WARNING]: Careful, the task aime24 is using evaluation data to build the few shot examples. (lighteval_task.py:269)
[2025-10-26 02:40:34,877] [ INFO]: --- LOADING MODEL --- (pipeline.py:177)
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-26 02:40:41,255] [ INFO]: Using max model len 32768 (config.py:1604)
[2025-10-26 02:40:41,418] [ INFO]: Chunked prefill is enabled with max_num_batched_tokens=2048. (config.py:2434)
INFO 10-26 02:40:45 [__init__.py:235] Automatically detected platform cuda.
INFO 10-26 02:40:46 [core.py:572] Waiting for init message from front-end.
INFO 10-26 02:40:46 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/mnt/public/wucanhui/outputs/Qwen3-14B-math-reasoning/checkpoint-2562', speculative_config=None, tokenizer='/mnt/public/wucanhui/outputs/Qwen3-14B-math-reasoning/checkpoint-2562', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config={}, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=1234, served_model_name=/mnt/public/wucanhui/outputs/Qwen3-14B-math-reasoning/checkpoint-2562, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
INFO 10-26 02:40:50 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 10-26 02:40:50 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 10-26 02:40:50 [gpu_model_runner.py:1843] Starting to load model /mnt/public/wucanhui/outputs/Qwen3-14B-math-reasoning/checkpoint-2562...
INFO 10-26 02:40:50 [gpu_model_runner.py:1875] Loading model from scratch...
INFO 10-26 02:40:50 [cuda.py:290] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/6 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 17% Completed | 1/6 [00:36<03:00, 36.20s/it]
Loading safetensors checkpoint shards: 33% Completed | 2/6 [01:14<02:30, 37.71s/it]
Loading safetensors checkpoint shards: 50% Completed | 3/6 [01:50<01:49, 36.62s/it]
Loading safetensors checkpoint shards: 67% Completed | 4/6 [02:24<01:11, 35.85s/it]
Loading safetensors checkpoint shards: 83% Completed | 5/6 [03:01<00:35, 35.94s/it]
Loading safetensors checkpoint shards: 100% Completed | 6/6 [03:35<00:00, 35.90s/it]
INFO 10-26 02:44:27 [default_loader.py:262] Loading weights took 216.39 seconds
INFO 10-26 02:44:27 [gpu_model_runner.py:1892] Model loading took 27.5185 GiB and 216.948136 seconds
INFO 10-26 02:44:28 [gpu_worker.py:255] Available KV cache memory: 97.63 GiB
INFO 10-26 02:44:28 [kv_cache_utils.py:833] GPU KV cache size: 639,792 tokens
INFO 10-26 02:44:28 [kv_cache_utils.py:837] Maximum concurrency for 32,768 tokens per request: 19.52x
INFO 10-26 02:44:28 [core.py:193] init engine (profile, create kv cache, warmup model) took 1.09 seconds
[2025-10-26 02:44:29,374] [ INFO]: [CACHING] Initializing data cache (cache_management.py:105)
[2025-10-26 02:44:29,379] [ INFO]: --- RUNNING MODEL --- (pipeline.py:330)
[2025-10-26 02:44:29,380] [ INFO]: Running SamplingMethod.GENERATIVE requests (pipeline.py:313)
[2025-10-26 02:44:30,432] [ INFO]: Cache: Starting to process 30/30 samples (not found in cache) for tasks lighteval|aime24|0 (12234f074d9327ac, GENERATIVE) (cache_management.py:399)
[2025-10-26 02:44:30,434] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:206)
Splits:   0%|          | 0/1 [00:00<?, ?it/s]
[2025-10-26 02:44:30,465] [ WARNING]: context_size + max_new_tokens=33247 which is greater than self.max_length=32768. Truncating context to 0 tokens. (vllm_model.py:367)
Adding requests: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 30/30 [00:00<00:00, 2927.76it/s]
Processed prompts: 0%| | 0/480 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 3%|β–Ž | 16/480 [00:40<19:23, 2.51s/it, est. speed input: 87.35 toks/s, output: 487.48 toks/s]
Processed prompts: 7%|β–‹ | 32/480 [03:32<55:05, 7.38s/it, est. speed input: 32.27 toks/s, output: 395.23 toks/s]
Processed prompts: 10%|β–ˆ | 48/480 [10:41<1:57:00, 16.25s/it, est. speed input: 16.61 toks/s, output: 363.25 toks/s]
Processed prompts: 13%|β–ˆβ–Ž | 64/480 [12:38<1:28:12, 12.72s/it, est. speed input: 19.64 toks/s, output: 577.37 toks/s]
Processed prompts: 17%|β–ˆβ–‹ | 80/480 [15:08<1:16:48, 11.52s/it, est. speed input: 21.37 toks/s, output: 695.52 toks/s]
Processed prompts: 20%|β–ˆβ–ˆ | 96/480 [16:40<1:01:10, 9.56s/it, est. speed input: 27.06 toks/s, output: 953.26 toks/s]
Processed prompts: 23%|β–ˆβ–ˆβ–Ž | 112/480 [18:07<50:21, 8.21s/it, est. speed input: 28.27 toks/s, output: 924.75 toks/s] 
Processed prompts: 27%|β–ˆβ–ˆβ–‹ | 128/480 [20:02<46:10, 7.87s/it, est. speed input: 28.22 toks/s, output: 891.32 toks/s]
Processed prompts: 30%|β–ˆβ–ˆβ–ˆ | 144/480 [21:12<38:02, 6.79s/it, est. speed input: 29.47 toks/s, output: 898.38 toks/s]
Processed prompts: 33%|β–ˆβ–ˆβ–ˆβ–Ž | 160/480 [23:09<37:01, 6.94s/it, est. speed input: 29.16 toks/s, output: 992.64 toks/s]
Processed prompts: 37%|β–ˆβ–ˆβ–ˆβ–‹ | 176/480 [24:38<33:01, 6.52s/it, est. speed input: 29.93 toks/s, output: 1262.14 toks/s]
Processed prompts: 40%|β–ˆβ–ˆβ–ˆβ–ˆ | 192/480 [24:54<23:16, 4.85s/it, est. speed input: 31.56 toks/s, output: 1341.49 toks/s]
Processed prompts: 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 208/480 [27:19<27:44, 6.12s/it, est. speed input: 30.21 toks/s, output: 1259.30 toks/s]
Processed prompts: 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 224/480 [27:53<20:55, 4.90s/it, est. speed input: 31.35 toks/s, output: 1332.05 toks/s]
Processed prompts: 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 240/480 [28:13<15:16, 3.82s/it, est. speed input: 32.66 toks/s, output: 1455.63 toks/s]
Processed prompts: 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 256/480 [28:43<12:02, 3.22s/it, est. speed input: 33.60 toks/s, output: 1478.79 toks/s]
Processed prompts: 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 272/480 [31:29<18:36, 5.37s/it, est. speed input: 31.84 toks/s, output: 1384.43 toks/s]
Processed prompts: 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 288/480 [34:43<23:40, 7.40s/it, est. speed input: 30.18 toks/s, output: 1332.79 toks/s]
Processed prompts: 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 304/480 [36:23<20:42, 7.06s/it, est. speed input: 29.98 toks/s, output: 1403.25 toks/s]
Processed prompts: 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 320/480 [37:08<15:25, 5.79s/it, est. speed input: 30.75 toks/s, output: 1481.63 toks/s]
Processed prompts: 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 336/480 [42:42<24:44, 10.31s/it, est. speed input: 27.79 toks/s, output: 1376.64 toks/s]
Processed prompts: 73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 352/480 [43:29<17:17, 8.11s/it, est. speed input: 28.10 toks/s, output: 1372.93 toks/s]
Processed prompts: 77%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 368/480 [43:59<11:37, 6.23s/it, est. speed input: 28.60 toks/s, output: 1378.64 toks/s]
Processed prompts: 80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 384/480 [44:01<07:01, 4.39s/it, est. speed input: 29.53 toks/s, output: 1493.31 toks/s]
Processed prompts: 83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 400/480 [44:52<05:23, 4.05s/it, est. speed input: 29.86 toks/s, output: 1538.22 toks/s]
Processed prompts: 87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 416/480 [45:42<04:00, 3.76s/it, est. speed input: 30.32 toks/s, output: 1595.22 toks/s]
Processed prompts: 90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 432/480 [46:09<02:30, 3.14s/it, est. speed input: 30.82 toks/s, output: 1618.27 toks/s]
Processed prompts: 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 448/480 [47:20<01:52, 3.52s/it, est. speed input: 30.89 toks/s, output: 1630.94 toks/s]
Processed prompts:  97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 464/480 [47:38<00:39,  2.48s/it, est. speed input: 31.70 toks/s, output: 1684.27 toks/s]
Processed prompts: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 480/480 [48:04<00:00,  6.01s/it, est. speed input: 32.13 toks/s, output: 1767.75 toks/s]
Splits: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [48:04<00:00, 2884.22s/it]
Creating parquet from Arrow format: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 4.27ba/s]
[2025-10-26 03:32:40,749] [ INFO]: Cached 30 samples of lighteval|aime24|0 (12234f074d9327ac, GENERATIVE) at /mnt/public/wucanhui/outputs/Qwen3-14B-math-reasoning/checkpoint-2562/081b5a149587018c/lighteval|aime24|0/12234f074d9327ac/GENERATIVE.parquet. (cache_management.py:345)
Generating train split: 30 examples [00:00, 159.79 examples/s]
[rank0]:[W1026 03:32:46.311872257 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-10-26 03:32:47,149] [ INFO]: --- POST-PROCESSING MODEL RESPONSES --- (pipeline.py:344)
[2025-10-26 03:32:47,160] [ INFO]: --- COMPUTING METRICS --- (pipeline.py:371)
[2025-10-26 03:32:47,161] [ WARNING]: n undefined in the pass@k. We assume it's the same as the sample's number of predictions. (metrics_sample.py:1302)
[2025-10-26 03:32:50,281] [ INFO]: --- DISPLAYING RESULTS --- (pipeline.py:432)
[2025-10-26 03:32:50,292] [ INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:422)
[2025-10-26 03:32:50,293] [ INFO]: Saving experiment tracker (evaluation_tracker.py:246)
[2025-10-26 03:32:53,278] [ INFO]: Saving results to /mnt/public/wucanhui/lighteval/results/results/mnt/public/wucanhui/outputs/Qwen3-14B-math-reasoning/checkpoint-2562/results_2025-10-26T03-32-50.294155.json (evaluation_tracker.py:310)
| Task |Version| Metric |Value | |Stderr|
|------------------|-------|-------------|-----:|---|-----:|
|all | |pass@k_with_k|0.9000|Β± |0.0557|
| | |avg@k_with_k |0.6417|Β± |0.0664|
|lighteval:aime24:0| |pass@k_with_k|0.9000|Β± |0.0557|
| | |avg@k_with_k |0.6417|Β± |0.0664|