File size: 12,641 Bytes
e998d4b
INFO 10-26 08:02:52 [__init__.py:235] Automatically detected platform cuda.
[2025-10-26 08:02:53,782] [    INFO]: --- INIT SEEDS --- (pipeline.py:249)
[2025-10-26 08:02:53,783] [    INFO]: --- LOADING TASKS --- (pipeline.py:210)
[2025-10-26 08:02:53,786] [ WARNING]: Careful, the task aime24 is using evaluation data to build the few shot examples. (lighteval_task.py:269)
[2025-10-26 08:02:58,782] [    INFO]: --- LOADING MODEL --- (pipeline.py:177)
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-26 08:03:06,003] [    INFO]: Using max model len 32768 (config.py:1604)
[2025-10-26 08:03:06,707] [    INFO]: Chunked prefill is enabled with max_num_batched_tokens=2048. (config.py:2434)
INFO 10-26 08:03:11 [__init__.py:235] Automatically detected platform cuda.
INFO 10-26 08:03:13 [core.py:572] Waiting for init message from front-end.
INFO 10-26 08:03:13 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/mnt/public/wucanhui/outputs/Qwen3-4B-math-reasoning/checkpoint-2562', speculative_config=None, tokenizer='/mnt/public/wucanhui/outputs/Qwen3-4B-math-reasoning/checkpoint-2562', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config={}, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=1234, served_model_name=/mnt/public/wucanhui/outputs/Qwen3-4B-math-reasoning/checkpoint-2562, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
INFO 10-26 08:03:17 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 10-26 08:03:17 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 10-26 08:03:17 [gpu_model_runner.py:1843] Starting to load model /mnt/public/wucanhui/outputs/Qwen3-4B-math-reasoning/checkpoint-2562...
INFO 10-26 08:03:18 [gpu_model_runner.py:1875] Loading model from scratch...
INFO 10-26 08:03:18 [cuda.py:290] Using Flash Attention backend on V1 engine.

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]

Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:31<00:31, 31.34s/it]

Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:52<00:00, 25.45s/it]

INFO 10-26 08:04:11 [default_loader.py:262] Loading weights took 53.10 seconds
INFO 10-26 08:04:11 [gpu_model_runner.py:1892] Model loading took 7.5552 GiB and 53.217135 seconds
INFO 10-26 08:04:12 [gpu_worker.py:255] Available KV cache memory: 117.60 GiB
INFO 10-26 08:04:12 [kv_cache_utils.py:833] GPU KV cache size: 856,336 tokens
INFO 10-26 08:04:12 [kv_cache_utils.py:837] Maximum concurrency for 32,768 tokens per request: 26.13x
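The KV-cache figures above are consistent with a back-of-the-envelope check. This is a hedged sketch, assuming Qwen3-4B's attention shape (36 layers, 8 KV heads, head dim 128; verify against the checkpoint's config.json) and bf16 cache entries, not vLLM's actual accounting code:

```python
# Back-of-the-envelope check of the KV-cache figures in the log.
# Assumed Qwen3-4B attention config (verify in the checkpoint's config.json):
num_layers, num_kv_heads, head_dim = 36, 8, 128
bytes_per_elem = 2  # bf16
# K and V, per layer, per KV head, per head dim:
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

cache_tokens = 856_336  # "GPU KV cache size" reported above
max_seq_len = 32_768

print(kv_bytes_per_token / 1024)                  # 144.0 KiB per token
print(cache_tokens * kv_bytes_per_token / 2**30)  # ~117.6 GiB, matching the log
print(round(cache_tokens / max_seq_len, 2))       # 26.13 (the reported concurrency)
```

The "26.13x" concurrency line is just the cache size in tokens divided by the 32,768-token-per-request budget.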
INFO 10-26 08:04:13 [core.py:193] init engine (profile, create kv cache, warmup model) took 1.35 seconds
[2025-10-26 08:04:13,688] [    INFO]: [CACHING] Initializing data cache (cache_management.py:105)
[2025-10-26 08:04:13,695] [    INFO]: --- RUNNING MODEL --- (pipeline.py:330)
[2025-10-26 08:04:13,696] [    INFO]: Running SamplingMethod.GENERATIVE requests (pipeline.py:313)
[2025-10-26 08:04:14,670] [    INFO]: Cache: Starting to process 30/30 samples (not found in cache) for tasks lighteval|aime24|0 (12234f074d9327ac, GENERATIVE) (cache_management.py:399)
[2025-10-26 08:04:14,671] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:206)

Splits:   0%|          | 0/1 [00:00<?, ?it/s][2025-10-26 08:04:14,701] [ WARNING]: context_size + max_new_tokens=33247 which is greater than self.max_length=32768. Truncating context to 0 tokens. (vllm_model.py:367)
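The truncation-to-zero in the warning above follows from simple budget arithmetic: the generation budget is reserved first, and only the remainder of max_length is left for the prompt. A minimal sketch, assuming max_new_tokens was set to the full 32,768 (which is what the numbers imply: 33,247 βˆ’ 32,768 = 479 prompt tokens); this is not the actual vllm_model.py code:

```python
# Reproduce the truncation warning's arithmetic (a sketch, not lighteval's code).
# Assumption: max_new_tokens was configured equal to max_length.
max_length = 32_768
max_new_tokens = 32_768                  # assumed generation budget
context_size = 33_247 - max_new_tokens   # 479 prompt tokens, from the log
allowed_context = max(0, max_length - max_new_tokens)
print(context_size)     # 479
print(allowed_context)  # 0 -> "Truncating context to 0 tokens"
```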


Adding requests:   0%|          | 0/30 [00:00<?, ?it/s]
Adding requests: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 30/30 [00:00<00:00, 2828.89it/s]


Processed prompts:   0%|          | 0/480 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Processed prompts:   3%|β–Ž         | 16/480 [00:19<09:23,  1.21s/it, est. speed input: 173.05 toks/s, output: 842.39 toks/s]

Processed prompts:   7%|β–‹         | 32/480 [02:10<34:21,  4.60s/it, est. speed input: 52.41 toks/s, output: 303.21 toks/s] 

Processed prompts:  10%|β–ˆ         | 48/480 [02:35<23:02,  3.20s/it, est. speed input: 67.72 toks/s, output: 431.83 toks/s]

Processed prompts:  13%|β–ˆβ–Ž        | 64/480 [06:48<56:43,  8.18s/it, est. speed input: 36.14 toks/s, output: 414.29 toks/s]

Processed prompts:  17%|β–ˆβ–‹        | 80/480 [11:58<1:21:30, 12.23s/it, est. speed input: 26.84 toks/s, output: 518.27 toks/s]

Processed prompts:  20%|β–ˆβ–ˆ        | 96/480 [19:04<1:49:32, 17.12s/it, est. speed input: 23.55 toks/s, output: 586.62 toks/s]

Processed prompts:  23%|β–ˆβ–ˆβ–Ž       | 112/480 [19:16<1:12:09, 11.76s/it, est. speed input: 26.59 toks/s, output: 800.47 toks/s]

Processed prompts:  27%|β–ˆβ–ˆβ–‹       | 128/480 [19:16<47:04,  8.03s/it, est. speed input: 29.80 toks/s, output: 1216.72 toks/s] 

Processed prompts:  30%|β–ˆβ–ˆβ–ˆ       | 144/480 [19:19<31:12,  5.57s/it, est. speed input: 32.27 toks/s, output: 1322.60 toks/s]

Processed prompts:  33%|β–ˆβ–ˆβ–ˆβ–Ž      | 160/480 [19:52<23:53,  4.48s/it, est. speed input: 33.90 toks/s, output: 1406.40 toks/s]

Processed prompts:  37%|β–ˆβ–ˆβ–ˆβ–‹      | 176/480 [19:55<16:00,  3.16s/it, est. speed input: 36.47 toks/s, output: 1508.94 toks/s]

Processed prompts:  40%|β–ˆβ–ˆβ–ˆβ–ˆ      | 192/480 [20:23<13:10,  2.74s/it, est. speed input: 38.56 toks/s, output: 1605.36 toks/s]

Processed prompts:  43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 208/480 [21:56<16:40,  3.68s/it, est. speed input: 37.53 toks/s, output: 1508.98 toks/s]

Processed prompts:  47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 224/480 [25:14<26:51,  6.30s/it, est. speed input: 34.53 toks/s, output: 1477.46 toks/s]

Processed prompts:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 240/480 [25:18<17:55,  4.48s/it, est. speed input: 35.98 toks/s, output: 1513.08 toks/s]

Processed prompts:  53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 256/480 [27:19<20:10,  5.40s/it, est. speed input: 34.98 toks/s, output: 1447.90 toks/s]

Processed prompts:  57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 272/480 [28:37<18:11,  5.25s/it, est. speed input: 35.09 toks/s, output: 1500.56 toks/s]

Processed prompts:  60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 288/480 [28:40<11:56,  3.73s/it, est. speed input: 36.28 toks/s, output: 1511.14 toks/s]

Processed prompts:  63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 304/480 [29:22<09:57,  3.39s/it, est. speed input: 36.93 toks/s, output: 1538.79 toks/s]

Processed prompts:  67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 320/480 [29:29<06:39,  2.50s/it, est. speed input: 38.27 toks/s, output: 1599.71 toks/s]

Processed prompts:  70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 336/480 [35:40<20:53,  8.71s/it, est. speed input: 32.76 toks/s, output: 1412.11 toks/s]

Processed prompts:  73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž  | 352/480 [36:02<13:52,  6.51s/it, est. speed input: 33.63 toks/s, output: 1527.18 toks/s]

Processed prompts:  77%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹  | 368/480 [37:35<11:46,  6.31s/it, est. speed input: 33.59 toks/s, output: 1596.24 toks/s]

Processed prompts:  80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  | 384/480 [37:39<07:11,  4.49s/it, est. speed input: 34.75 toks/s, output: 1658.85 toks/s]

Processed prompts:  83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 400/480 [38:45<05:50,  4.38s/it, est. speed input: 34.84 toks/s, output: 1784.84 toks/s]

Processed prompts:  87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 416/480 [39:05<03:40,  3.44s/it, est. speed input: 35.55 toks/s, output: 1890.39 toks/s]

Processed prompts:  90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 432/480 [40:29<03:10,  3.98s/it, est. speed input: 35.40 toks/s, output: 1928.69 toks/s]

Processed prompts:  93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 448/480 [40:34<01:32,  2.89s/it, est. speed input: 36.23 toks/s, output: 1998.74 toks/s]

Processed prompts:  97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 464/480 [42:23<01:04,  4.05s/it, est. speed input: 35.60 toks/s, output: 1961.46 toks/s]

Processed prompts: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 480/480 [43:03<00:00,  3.60s/it, est. speed input: 35.86 toks/s, output: 1958.48 toks/s]


Splits: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [43:04<00:00, 2584.05s/it]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]
Creating parquet from Arrow format: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00,  4.79ba/s]
[2025-10-26 08:47:25,202] [    INFO]: Cached 30 samples of lighteval|aime24|0 (12234f074d9327ac, GENERATIVE) at /mnt/public/wucanhui/outputs/Qwen3-4B-math-reasoning/checkpoint-2562/0619260e1176b049/lighteval|aime24|0/12234f074d9327ac/GENERATIVE.parquet. (cache_management.py:345)

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 30 examples [00:00, 190.79 examples/s]
[rank0]:[W1026 08:47:30.553971068 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-10-26 08:47:31,519] [    INFO]: --- POST-PROCESSING MODEL RESPONSES --- (pipeline.py:344)
[2025-10-26 08:47:31,531] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:371)
[2025-10-26 08:47:31,532] [ WARNING]: n undefined in the pass@k. We assume it's the same as the sample's number of predictions. (metrics_sample.py:1302)
[2025-10-26 08:47:36,019] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:432)
[2025-10-26 08:47:36,036] [    INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:422)
[2025-10-26 08:47:36,037] [    INFO]: Saving experiment tracker (evaluation_tracker.py:246)
[2025-10-26 08:47:38,929] [    INFO]: Saving results to /mnt/public/wucanhui/lighteval/results/results/mnt/public/wucanhui/outputs/Qwen3-4B-math-reasoning/checkpoint-2562/results_2025-10-26T08-47-36.038278.json (evaluation_tracker.py:310)
|       Task       |Version|   Metric    |Value |   |Stderr|
|------------------|-------|-------------|-----:|---|-----:|
|all               |       |pass@k_with_k|0.7333|Β±  |0.0821|
|                  |       |avg@k_with_k |0.3604|Β±  |0.0684|
|lighteval:aime24:0|       |pass@k_with_k|0.7333|Β±  |0.0821|
|                  |       |avg@k_with_k |0.3604|Β±  |0.0684|
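A sketch of how the two metrics in this table relate, assuming k equals the number of generations per problem (16 here: 480 processed prompts over 30 AIME24 samples), which is what the "n undefined in the pass@k" warning earlier in the log indicates. The data below is illustrative, not the run's actual predictions:

```python
# pass@k vs avg@k when k equals the number of generations n per problem
# (per the "n undefined in the pass@k" warning above). Data is illustrative.
def pass_at_k(correct):
    """With k == n, a problem passes if any generation is correct."""
    return 1.0 if any(correct) else 0.0

def avg_at_k(correct):
    """Mean correctness over the generations for one problem."""
    return sum(correct) / len(correct)

# Toy run: 3 problems x 4 generations (booleans = "answer matched").
samples = [
    [True, False, False, True],
    [False, False, False, False],
    [True, True, True, True],
]
print(sum(pass_at_k(s) for s in samples) / len(samples))  # 0.6666...
print(sum(avg_at_k(s) for s in samples) / len(samples))   # 0.5
```

Read this way, the table's 0.7333 corresponds to 22 of the 30 problems solved at least once across the 16 samples, and 0.3604 to the mean per-problem solve rate.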