INFO 10-26 08:02:52 [__init__.py:235] Automatically detected platform cuda.
[2025-10-26 08:02:53,803] [INFO]: --- INIT SEEDS --- (pipeline.py:249)
[2025-10-26 08:02:53,804] [INFO]: --- LOADING TASKS --- (pipeline.py:210)
[2025-10-26 08:02:53,807] [WARNING]: Careful, the task aime25 is using evaluation data to build the few shot examples. (lighteval_task.py:269)
[2025-10-26 08:02:59,213] [INFO]: --- LOADING MODEL --- (pipeline.py:177)
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-26 08:03:06,080] [INFO]: Using max model len 32768 (config.py:1604)
[2025-10-26 08:03:06,785] [INFO]: Chunked prefill is enabled with max_num_batched_tokens=2048. (config.py:2434)
INFO 10-26 08:03:11 [__init__.py:235] Automatically detected platform cuda.
INFO 10-26 08:03:13 [core.py:572] Waiting for init message from front-end.
INFO 10-26 08:03:13 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/mnt/public/wucanhui/outputs/Qwen3-4B-math-reasoning/checkpoint-2562', speculative_config=None, tokenizer='/mnt/public/wucanhui/outputs/Qwen3-4B-math-reasoning/checkpoint-2562', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config={}, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=1234, served_model_name=/mnt/public/wucanhui/outputs/Qwen3-4B-math-reasoning/checkpoint-2562, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
INFO 10-26 08:03:17 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 10-26 08:03:17 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 10-26 08:03:17 [gpu_model_runner.py:1843] Starting to load model /mnt/public/wucanhui/outputs/Qwen3-4B-math-reasoning/checkpoint-2562...
INFO 10-26 08:03:18 [gpu_model_runner.py:1875] Loading model from scratch...
INFO 10-26 08:03:18 [cuda.py:290] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:31<00:31, 31.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:52<00:00, 26.37s/it]
INFO 10-26 08:04:11 [default_loader.py:262] Loading weights took 53.17 seconds
INFO 10-26 08:04:11 [gpu_model_runner.py:1892] Model loading took 7.5552 GiB and 53.295208 seconds
INFO 10-26 08:04:12 [gpu_worker.py:255] Available KV cache memory: 117.60 GiB
INFO 10-26 08:04:12 [kv_cache_utils.py:833] GPU KV cache size: 856,336 tokens
INFO 10-26 08:04:12 [kv_cache_utils.py:837] Maximum concurrency for 32,768 tokens per request: 26.13x
INFO 10-26 08:04:13 [core.py:193] init engine (profile, create kv cache, warmup model) took 1.40 seconds
[2025-10-26 08:04:13,651] [INFO]: [CACHING] Initializing data cache (cache_management.py:105)
[2025-10-26 08:04:13,659] [INFO]: --- RUNNING MODEL --- (pipeline.py:330)
[2025-10-26 08:04:13,659] [INFO]: Running SamplingMethod.GENERATIVE requests (pipeline.py:313)
[2025-10-26 08:04:14,650] [INFO]: Cache: Starting to process 30/30 samples (not found in cache) for tasks lighteval|aime25|0 (824021a82e1c701e, GENERATIVE) (cache_management.py:399)
[2025-10-26 08:04:14,652] [WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:206)
Splits:   0%|          | 0/1 [00:00<?, ?it/s]
[2025-10-26 08:04:14,687] [WARNING]: context_size + max_new_tokens=33622 which is greater than self.max_length=32768. Truncating context to 0 tokens. (vllm_model.py:367)
Adding requests: 100%|██████████| 30/30 [00:00<00:00, 2662.26it/s]
Processed prompts:   0%|          | 0/480 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts:   3%|█         | 16/480 [00:31<15:04, 1.95s/it, est. speed input: 100.59 toks/s, output: 600.28 toks/s]
Processed prompts:   7%|█         | 32/480 [02:27<37:51, 5.07s/it, est. speed input: 56.72 toks/s, output: 381.04 toks/s]
Processed prompts:  10%|█         | 48/480 [05:36<58:44, 8.16s/it, est. speed input: 57.95 toks/s, output: 402.16 toks/s]
Processed prompts:  13%|██        | 64/480 [13:22<1:54:00, 16.44s/it, est. speed input: 41.31 toks/s, output: 398.09 toks/s]
Processed prompts:  17%|██        | 80/480 [19:16<2:03:12, 18.48s/it, est. speed input: 31.03 toks/s, output: 297.18 toks/s]
Processed prompts:  20%|██        | 96/480 [19:34<1:20:29, 12.58s/it, est. speed input: 35.71 toks/s, output: 492.45 toks/s]
Processed prompts:  23%|███       | 112/480 [19:35<52:04, 8.49s/it, est. speed input: 40.49 toks/s, output: 722.20 toks/s]
Processed prompts:  27%|███       | 128/480 [19:35<33:59, 5.80s/it, est. speed input: 45.17 toks/s, output: 913.72 toks/s]
Processed prompts:  30%|███       | 144/480 [19:38<22:34, 4.03s/it, est. speed input: 48.94 toks/s, output: 1125.63 toks/s]
Processed prompts:  33%|████      | 160/480 [19:59<17:04, 3.20s/it, est. speed input: 50.80 toks/s, output: 1227.06 toks/s]
Processed prompts:  37%|████      | 176/480 [20:05<11:49, 2.33s/it, est. speed input: 53.00 toks/s, output: 1238.89 toks/s]
Processed prompts:  40%|████      | 192/480 [20:06<07:50, 1.63s/it, est. speed input: 55.39 toks/s, output: 1290.57 toks/s]
Processed prompts:  43%|█████     | 208/480 [20:56<09:27, 2.09s/it, est. speed input: 55.62 toks/s, output: 1443.63 toks/s]
Processed prompts:  47%|█████     | 224/480 [21:27<08:44, 2.05s/it, est. speed input: 56.47 toks/s, output: 1568.26 toks/s]
Processed prompts:  50%|█████     | 240/480 [21:32<06:03, 1.52s/it, est. speed input: 57.75 toks/s, output: 1577.05 toks/s]
Processed prompts:  53%|██████    | 256/480 [21:49<05:10, 1.39s/it, est. speed input: 58.42 toks/s, output: 1568.61 toks/s]
Processed prompts:  57%|██████    | 272/480 [21:53<03:35, 1.04s/it, est. speed input: 60.40 toks/s, output: 1658.71 toks/s]
Processed prompts:  60%|██████    | 288/480 [22:10<03:21, 1.05s/it, est. speed input: 63.05 toks/s, output: 1727.32 toks/s]
Processed prompts:  63%|███████   | 304/480 [22:24<02:55, 1.00it/s, est. speed input: 64.07 toks/s, output: 1735.45 toks/s]
Processed prompts:  67%|███████   | 320/480 [23:26<04:59, 1.87s/it, est. speed input: 63.77 toks/s, output: 1844.10 toks/s]
Processed prompts:  70%|███████   | 336/480 [25:13<07:57, 3.32s/it, est. speed input: 61.00 toks/s, output: 1793.67 toks/s]
Processed prompts:  73%|████████  | 352/480 [25:26<05:26, 2.55s/it, est. speed input: 62.88 toks/s, output: 1887.97 toks/s]
Processed prompts:  77%|████████  | 368/480 [25:42<03:54, 2.09s/it, est. speed input: 64.04 toks/s, output: 1935.60 toks/s]
Processed prompts:  80%|████████  | 384/480 [25:51<02:36, 1.63s/it, est. speed input: 65.85 toks/s, output: 2000.84 toks/s]
Processed prompts:  83%|█████████ | 400/480 [26:01<01:46, 1.33s/it, est. speed input: 66.89 toks/s, output: 2071.80 toks/s]
Processed prompts:  87%|█████████ | 416/480 [26:20<01:22, 1.30s/it, est. speed input: 67.95 toks/s, output: 2220.24 toks/s]
Processed prompts:  90%|█████████ | 432/480 [26:35<00:56, 1.17s/it, est. speed input: 68.87 toks/s, output: 2249.41 toks/s]
Processed prompts:  93%|██████████| 448/480 [26:54<00:37, 1.18s/it, est. speed input: 69.58 toks/s, output: 2321.91 toks/s]
Processed prompts:  97%|██████████| 464/480 [27:55<00:31, 1.98s/it, est. speed input: 68.62 toks/s, output: 2367.29 toks/s]
Processed prompts: 100%|██████████| 480/480 [28:41<00:00, 3.59s/it, est. speed input: 68.25 toks/s, output: 2368.48 toks/s]
Splits: 100%|██████████| 1/1 [28:41<00:00, 1721.99s/it]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 5.67ba/s]
[2025-10-26 08:33:01,888] [INFO]: Cached 30 samples of lighteval|aime25|0 (824021a82e1c701e, GENERATIVE) at /mnt/public/wucanhui/outputs/Qwen3-4B-math-reasoning/checkpoint-2562/0619260e1176b049/lighteval|aime25|0/824021a82e1c701e/GENERATIVE.parquet. (cache_management.py:345)
Generating train split: 30 examples [00:00, 230.88 examples/s]
[rank0]:[W1026 08:33:06.472513423 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-10-26 08:33:07,407] [INFO]: --- POST-PROCESSING MODEL RESPONSES --- (pipeline.py:344)
[2025-10-26 08:33:07,416] [INFO]: --- COMPUTING METRICS --- (pipeline.py:371)
[2025-10-26 08:33:07,417] [WARNING]: n undefined in the pass@k. We assume it's the same as the sample's number of predictions. (metrics_sample.py:1302)
[2025-10-26 08:33:09,279] [INFO]: --- DISPLAYING RESULTS --- (pipeline.py:432)
[2025-10-26 08:33:09,291] [INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:422)
[2025-10-26 08:33:09,292] [INFO]: Saving experiment tracker (evaluation_tracker.py:246)
[2025-10-26 08:33:11,624] [INFO]: Saving results to /mnt/public/wucanhui/lighteval/results/results/mnt/public/wucanhui/outputs/Qwen3-4B-math-reasoning/checkpoint-2562/results_2025-10-26T08-33-09.292915.json (evaluation_tracker.py:310)
| Task               | Version | Metric        |  Value |   | Stderr |
|--------------------|---------|---------------|-------:|---|-------:|
| all                |         | pass@k_with_k | 0.5333 | ± | 0.0926 |
|                    |         | avg@k_with_k  | 0.2750 | ± | 0.0672 |
| lighteval:aime25:0 |         | pass@k_with_k | 0.5333 | ± | 0.0926 |
|                    |         | avg@k_with_k  | 0.2750 | ± | 0.0672 |