3TF-14B / eval_lighteval|aime25|0.log
Upload folder using huggingface_hub
c67ee0f
INFO 10-26 02:40:28 [__init__.py:235] Automatically detected platform cuda.
[2025-10-26 02:40:30,546] [ INFO]: --- INIT SEEDS --- (pipeline.py:249)
[2025-10-26 02:40:30,547] [ INFO]: --- LOADING TASKS --- (pipeline.py:210)
[2025-10-26 02:40:30,551] [ WARNING]: Careful, the task aime25 is using evaluation data to build the few shot examples. (lighteval_task.py:269)
[2025-10-26 02:40:34,973] [ INFO]: --- LOADING MODEL --- (pipeline.py:177)
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-26 02:40:41,401] [ INFO]: Using max model len 32768 (config.py:1604)
[2025-10-26 02:40:41,745] [ INFO]: Chunked prefill is enabled with max_num_batched_tokens=2048. (config.py:2434)
INFO 10-26 02:40:46 [__init__.py:235] Automatically detected platform cuda.
INFO 10-26 02:40:47 [core.py:572] Waiting for init message from front-end.
INFO 10-26 02:40:47 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/mnt/public/wucanhui/outputs/Qwen3-14B-math-reasoning/checkpoint-2562', speculative_config=None, tokenizer='/mnt/public/wucanhui/outputs/Qwen3-14B-math-reasoning/checkpoint-2562', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config={}, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=1234, served_model_name=/mnt/public/wucanhui/outputs/Qwen3-14B-math-reasoning/checkpoint-2562, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
INFO 10-26 02:40:50 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 10-26 02:40:50 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 10-26 02:40:50 [gpu_model_runner.py:1843] Starting to load model /mnt/public/wucanhui/outputs/Qwen3-14B-math-reasoning/checkpoint-2562...
INFO 10-26 02:40:50 [gpu_model_runner.py:1875] Loading model from scratch...
INFO 10-26 02:40:50 [cuda.py:290] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/6 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 17% Completed | 1/6 [00:36<03:00, 36.20s/it]
Loading safetensors checkpoint shards: 33% Completed | 2/6 [01:14<02:30, 37.71s/it]
Loading safetensors checkpoint shards: 50% Completed | 3/6 [01:50<01:49, 36.62s/it]
Loading safetensors checkpoint shards: 67% Completed | 4/6 [02:24<01:11, 35.85s/it]
Loading safetensors checkpoint shards: 83% Completed | 5/6 [03:01<00:35, 35.94s/it]
Loading safetensors checkpoint shards: 100% Completed | 6/6 [03:35<00:00, 35.90s/it]
INFO 10-26 02:44:27 [default_loader.py:262] Loading weights took 216.39 seconds
INFO 10-26 02:44:27 [gpu_model_runner.py:1892] Model loading took 27.5185 GiB and 216.927435 seconds
INFO 10-26 02:44:28 [gpu_worker.py:255] Available KV cache memory: 97.63 GiB
INFO 10-26 02:44:29 [kv_cache_utils.py:833] GPU KV cache size: 639,792 tokens
INFO 10-26 02:44:29 [kv_cache_utils.py:837] Maximum concurrency for 32,768 tokens per request: 19.52x
INFO 10-26 02:44:29 [core.py:193] init engine (profile, create kv cache, warmup model) took 1.23 seconds
[2025-10-26 02:44:29,506] [ INFO]: [CACHING] Initializing data cache (cache_management.py:105)
[2025-10-26 02:44:29,511] [ INFO]: --- RUNNING MODEL --- (pipeline.py:330)
[2025-10-26 02:44:29,512] [ INFO]: Running SamplingMethod.GENERATIVE requests (pipeline.py:313)
[2025-10-26 02:44:30,493] [ INFO]: Cache: Starting to process 30/30 samples (not found in cache) for tasks lighteval|aime25|0 (824021a82e1c701e, GENERATIVE) (cache_management.py:399)
[2025-10-26 02:44:30,494] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:206)
Splits: 0%| | 0/1 [00:00<?, ?it/s]
[2025-10-26 02:44:30,527] [ WARNING]: context_size + max_new_tokens=33622 which is greater than self.max_length=32768. Truncating context to 0 tokens. (vllm_model.py:367)
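The truncation warning reflects a budget check in lighteval's vLLM backend: when prompt plus requested generation would overflow the model's max_length, the prompt is clipped to whatever room generation leaves, which here bottoms out at zero. A minimal sketch of that rule (my reading of the warning's arithmetic, not lighteval's exact code):

```python
def clip_context(context_size: int, max_new_tokens: int, max_length: int) -> int:
    """Clip the prompt so prompt + generation fits within max_length.

    Sketch of the budget rule implied by the warning, not the library's
    actual implementation.
    """
    if context_size + max_new_tokens > max_length:
        # Keep only the room generation leaves over; can bottom out at 0.
        context_size = max(0, max_length - max_new_tokens)
    return context_size

# The logged case: 33,622 requested tokens against max_length=32,768.
# If max_new_tokens consumes the full 32,768 budget, the 854-token
# context is clipped to 0, as the warning reports.
print(clip_context(854, 32_768, 32_768))  # 0
```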
Adding requests: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 30/30 [00:00<00:00, 2861.70it/s]
Processed prompts: 0%| | 0/480 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 3%|β–Ž | 16/480 [11:36<5:36:30, 43.51s/it, est. speed input: 7.49 toks/s, output: 290.25 toks/s]
Processed prompts: 7%|β–‹ | 32/480 [12:24<2:26:59, 19.69s/it, est. speed input: 21.99 toks/s, output: 576.39 toks/s]
Processed prompts: 10%|β–ˆ | 48/480 [15:00<1:49:06, 15.15s/it, est. speed input: 24.47 toks/s, output: 754.55 toks/s]
Processed prompts: 13%|β–ˆβ–Ž | 64/480 [15:10<1:05:17, 9.42s/it, est. speed input: 30.84 toks/s, output: 1028.12 toks/s]
Processed prompts: 17%|β–ˆβ–‹ | 80/480 [16:23<51:11, 7.68s/it, est. speed input: 34.15 toks/s, output: 1135.02 toks/s] 
Processed prompts: 20%|β–ˆβ–ˆ | 96/480 [18:00<45:33, 7.12s/it, est. speed input: 35.32 toks/s, output: 1118.93 toks/s]
Processed prompts: 23%|β–ˆβ–ˆβ–Ž | 112/480 [18:03<29:49, 4.86s/it, est. speed input: 47.81 toks/s, output: 1268.53 toks/s]
Processed prompts: 27%|β–ˆβ–ˆβ–‹ | 128/480 [21:04<40:32, 6.91s/it, est. speed input: 43.46 toks/s, output: 1188.71 toks/s]
Processed prompts: 30%|β–ˆβ–ˆβ–ˆ | 144/480 [25:51<57:56, 10.35s/it, est. speed input: 37.40 toks/s, output: 1108.45 toks/s]
Processed prompts: 33%|β–ˆβ–ˆβ–ˆβ–Ž | 160/480 [28:25<53:59, 10.12s/it, est. speed input: 36.69 toks/s, output: 1218.22 toks/s]
Processed prompts: 37%|β–ˆβ–ˆβ–ˆβ–‹ | 176/480 [29:18<40:45, 8.04s/it, est. speed input: 37.06 toks/s, output: 1302.43 toks/s]
Processed prompts: 40%|β–ˆβ–ˆβ–ˆβ–ˆ | 192/480 [29:35<28:24, 5.92s/it, est. speed input: 38.76 toks/s, output: 1358.64 toks/s]
Processed prompts: 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 208/480 [34:13<42:32, 9.39s/it, est. speed input: 35.11 toks/s, output: 1304.15 toks/s]
Processed prompts: 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 224/480 [37:12<42:24, 9.94s/it, est. speed input: 33.81 toks/s, output: 1334.06 toks/s]
Processed prompts: 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 240/480 [37:27<28:53, 7.22s/it, est. speed input: 34.79 toks/s, output: 1343.51 toks/s]
Processed prompts: 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 256/480 [37:52<20:38, 5.53s/it, est. speed input: 35.70 toks/s, output: 1343.47 toks/s]
Processed prompts: 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 272/480 [38:26<15:34, 4.49s/it, est. speed input: 36.48 toks/s, output: 1442.67 toks/s]
Processed prompts: 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 288/480 [38:39<10:50, 3.39s/it, est. speed input: 37.83 toks/s, output: 1575.07 toks/s]
Processed prompts: 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 304/480 [39:43<10:30, 3.58s/it, est. speed input: 37.98 toks/s, output: 1608.13 toks/s]
Processed prompts: 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 320/480 [41:00<10:31, 3.94s/it, est. speed input: 37.98 toks/s, output: 1594.87 toks/s]
Processed prompts: 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 336/480 [45:19<18:17, 7.62s/it, est. speed input: 35.40 toks/s, output: 1490.98 toks/s]
Processed prompts: 73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 352/480 [47:59<17:47, 8.34s/it, est. speed input: 34.42 toks/s, output: 1501.80 toks/s]
Processed prompts: 77%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 368/480 [48:22<11:41, 6.26s/it, est. speed input: 34.80 toks/s, output: 1502.65 toks/s]
Processed prompts: 80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 384/480 [49:21<08:46, 5.49s/it, est. speed input: 34.92 toks/s, output: 1545.40 toks/s]
Processed prompts: 83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 400/480 [49:40<05:35, 4.20s/it, est. speed input: 35.34 toks/s, output: 1555.65 toks/s]
Processed prompts: 87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 416/480 [50:27<04:05, 3.83s/it, est. speed input: 35.54 toks/s, output: 1601.45 toks/s]
Processed prompts: 90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 432/480 [52:10<03:41, 4.61s/it, est. speed input: 35.16 toks/s, output: 1633.81 toks/s]
Processed prompts: 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 448/480 [52:32<01:56, 3.64s/it, est. speed input: 35.76 toks/s, output: 1715.27 toks/s]
Processed prompts: 97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 464/480 [53:17<00:54, 3.38s/it, est. speed input: 36.05 toks/s, output: 1747.92 toks/s]
Processed prompts: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 480/480 [55:29<00:00, 6.94s/it, est. speed input: 35.30 toks/s, output: 1712.28 toks/s]
Splits: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [55:29<00:00, 3329.65s/it]
Creating parquet from Arrow format: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 3.69ba/s]
[2025-10-26 03:40:07,290] [ INFO]: Cached 30 samples of lighteval|aime25|0 (824021a82e1c701e, GENERATIVE) at /mnt/public/wucanhui/outputs/Qwen3-14B-math-reasoning/checkpoint-2562/081b5a149587018c/lighteval|aime25|0/824021a82e1c701e/GENERATIVE.parquet. (cache_management.py:345)
Generating train split: 30 examples [00:00, 142.17 examples/s]
[rank0]:[W1026 03:40:13.128387541 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-10-26 03:40:14,126] [ INFO]: --- POST-PROCESSING MODEL RESPONSES --- (pipeline.py:344)
[2025-10-26 03:40:14,141] [ INFO]: --- COMPUTING METRICS --- (pipeline.py:371)
[2025-10-26 03:40:14,143] [ WARNING]: n undefined in the pass@k. We assume it's the same as the sample's number of predictions. (metrics_sample.py:1302)
[2025-10-26 03:40:17,146] [ INFO]: --- DISPLAYING RESULTS --- (pipeline.py:432)
[2025-10-26 03:40:17,160] [ INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:422)
[2025-10-26 03:40:17,160] [ INFO]: Saving experiment tracker (evaluation_tracker.py:246)
[2025-10-26 03:40:20,475] [ INFO]: Saving results to /mnt/public/wucanhui/lighteval/results/results/mnt/public/wucanhui/outputs/Qwen3-14B-math-reasoning/checkpoint-2562/results_2025-10-26T03-40-17.161750.json (evaluation_tracker.py:310)
| Task |Version| Metric |Value | |Stderr|
|------------------|-------|-------------|-----:|---|-----:|
|all | |pass@k_with_k|0.8000|Β± |0.0743|
| | |avg@k_with_k |0.5333|Β± |0.0713|
|lighteval:aime25:0| |pass@k_with_k|0.8000|Β± |0.0743|
| | |avg@k_with_k |0.5333|Β± |0.0713|
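The 480 prompts over 30 cached samples correspond to 16 generations per AIME problem, so per the warning above, k=16. pass@k_with_k=0.8000 then implies 24 of 30 problems had at least one correct sample among the 16. The reported stderr is consistent with the sample standard deviation of the per-problem pass indicators divided by sqrt(n); a quick check, assuming exactly 24 passing problems (which 0.8000 over 30 implies):

```python
import statistics

# Per-problem pass indicators implied by pass@16 = 0.8000 over 30 problems:
# 24 problems with at least one correct sample, 6 with none.
passes = [1.0] * 24 + [0.0] * 6

mean = statistics.mean(passes)
# stderr as sample standard deviation (ddof=1) over sqrt(n).
stderr = statistics.stdev(passes) / len(passes) ** 0.5

print(round(mean, 4))    # 0.8
print(round(stderr, 4))  # 0.0743, matching the table's ± column
```

The same construction applied to per-problem mean accuracies (avg@k) would yield the 0.5333 Β± 0.0713 row, but those per-problem fractions are not recoverable from the summary table alone.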