INFO: Started server process [66828]
INFO: Waiting for application startup.
Loading model from /root/model via vLLM ...
INFO 02-25 02:30:55 [model.py:541] Resolved architecture: LlamaForCausalLM
INFO 02-25 02:30:55 [model.py:1561] Using max model len 2048
INFO 02-25 02:30:55 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 02-25 02:30:55 [vllm.py:624] Asynchronous scheduling is enabled.
WARNING 02-25 02:30:55 [vllm.py:662] Enforce eager set, overriding optimization level to -O0
INFO 02-25 02:30:55 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=66868) INFO 02-25 02:30:57 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config: model='/root/model', speculative_config=None, tokenizer='/root/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/root/model, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
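For reference, the key settings echoed in this config dump (bfloat16 weights, max_seq_len=2048, enforce_eager=True, tensor_parallel_size=1) map onto vLLM's Python API roughly as follows. This is a minimal sketch, not the exact launch command from this deployment: it builds an offline engine rather than the HTTP server, and the model path is taken from the log.

```python
# Minimal sketch: the offline-engine equivalent of the server config above.
# All values are read from the log; this is not the actual launch command.
from vllm import LLM

llm = LLM(
    model="/root/model",        # tokenizer defaults to the same path
    dtype="bfloat16",           # dtype=torch.bfloat16 in the config dump
    max_model_len=2048,         # "Using max model len 2048"
    enforce_eager=True,         # disables CUDA graphs, -O0 as warned above
    tensor_parallel_size=1,
)
```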
(EngineCore_DP0 pid=66868) INFO 02-25 02:30:59 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://65.109.75.18:42949 backend=nccl
(EngineCore_DP0 pid=66868) INFO 02-25 02:30:59 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=66868) INFO 02-25 02:30:59 [gpu_model_runner.py:4033] Starting to load model /root/model...
(EngineCore_DP0 pid=66868) INFO 02-25 02:31:00 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=66868) Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
(EngineCore_DP0 pid=66868) Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:02, 1.55it/s]
(EngineCore_DP0 pid=66868) Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:02, 1.48it/s]
(EngineCore_DP0 pid=66868) Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:02<00:01, 1.38it/s]
(EngineCore_DP0 pid=66868) Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:02<00:00, 1.74it/s]
(EngineCore_DP0 pid=66868) Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.62it/s]
(EngineCore_DP0 pid=66868) Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.58it/s]
(EngineCore_DP0 pid=66868)
(EngineCore_DP0 pid=66868) INFO 02-25 02:31:04 [default_loader.py:291] Loading weights took 3.34 seconds
(EngineCore_DP0 pid=66868) INFO 02-25 02:31:04 [gpu_model_runner.py:4130] Model loading took 14.96 GiB memory and 4.032548 seconds
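The 14.96 GiB weight footprint is what you would expect from a roughly 8B-parameter Llama checkpoint in bfloat16, at 2 bytes per parameter. The parameter count below is an assumption for illustration; the log does not state it.

```python
# bfloat16 stores each parameter in 2 bytes.
n_params = 8.03e9                 # assumed parameter count (hypothetical)
gib = n_params * 2 / 2**30
print(f"{gib:.2f} GiB")           # ~14.96 GiB, matching the log
```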
(EngineCore_DP0 pid=66868) INFO 02-25 02:31:06 [gpu_worker.py:356] Available KV cache memory: 3.36 GiB
(EngineCore_DP0 pid=66868) INFO 02-25 02:31:06 [kv_cache_utils.py:1307] GPU KV cache size: 27,504 tokens
(EngineCore_DP0 pid=66868) INFO 02-25 02:31:06 [kv_cache_utils.py:1312] Maximum concurrency for 2,048 tokens per request: 13.43x
(EngineCore_DP0 pid=66868) INFO 02-25 02:31:06 [core.py:272] init engine (profile, create kv cache, warmup model) took 1.40 seconds
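The two KV-cache numbers are consistent with each other: the reported 13.43x concurrency is simply the 27,504-token cache budget divided by the 2,048-token per-request maximum, and 3.36 GiB spread over 27,504 tokens works out to 128 KiB per token, which matches Llama-3-8B-like dimensions (32 layers, 8 KV heads, head dim 128, bf16). Those model dimensions are an assumption; the arithmetic itself comes straight from the log.

```python
# Concurrency = KV-cache token budget / max tokens per request.
kv_tokens = 27_504
max_len = 2_048
print(kv_tokens / max_len)                   # ~13.43x, as reported

# Per-token KV size, assuming Llama-3-8B-like dims (hypothetical):
# 32 layers * 8 KV heads * 128 head_dim * 2 tensors (K and V) * 2 bytes (bf16)
bytes_per_token = 32 * 8 * 128 * 2 * 2
print(bytes_per_token)                       # 131072 bytes = 128 KiB
print(kv_tokens * bytes_per_token / 2**30)   # ~3.36 GiB, matching the log
```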
(EngineCore_DP0 pid=66868) WARNING 02-25 02:31:07 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=66868) INFO 02-25 02:31:07 [vllm.py:762] Cudagraph is disabled under eager mode
INFO: Application startup complete.
ERROR: [Errno 98] error while attempting to bind on address ('0.0.0.0', 8000): address already in use
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
vLLM engine ready
Dashboard available at http://0.0.0.0:8000/
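Despite the "vLLM engine ready" and dashboard banners at the end, the server did not come up: uvicorn failed to bind port 8000 ([Errno 98], EADDRINUSE) and shut the application back down, and the wrapper evidently prints its banner without checking that outcome. Free the port, or start on another one, before retrying. A small self-contained check, with the host and port taken from the log:

```python
# Quick check whether 0.0.0.0:8000 is free before launching the server.
import socket

def port_is_free(host: str = "0.0.0.0", port: int = 8000) -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:          # Errno 98 / EADDRINUSE lands here
            return False

if not port_is_free():
    print("port 8000 is busy; stop the old server or choose a different port")
```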