Tool calling doesn't work (transformers is already patched to the right version).
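This is roughly how I'm calling it, as a minimal sketch using the OpenAI-compatible endpoint: the served model name and port come from the log below, while the `get_weather` tool and its schema are just placeholders I made up for the repro.

```python
# Minimal repro sketch. Assumptions: server from the log below on :8000,
# served model name "Qwen3.5-27B"; "get_weather" is a placeholder tool.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

resp = client.chat.completions.create(
    model="Qwen3.5-27B",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

# Expected: a populated tool_calls list; what I get instead is no tool call.
print(resp.choices[0].message.tool_calls)
```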
Here is my startup log:
(APIServer pid=1) INFO 04-05 15:37:17 [utils.py:299]
(APIServer pid=1) INFO 04-05 15:37:17 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 04-05 15:37:17 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.1rc1.dev29+g93726b2a1
(APIServer pid=1) INFO 04-05 15:37:17 [utils.py:299] █▄█▀ █ █ █ █ model /models/QuantTrio/Qwopus3.5-27B-v3-AWQ
(APIServer pid=1) INFO 04-05 15:37:17 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 04-05 15:37:17 [utils.py:299]
(APIServer pid=1) INFO 04-05 15:37:17 [utils.py:233] non-default args: {'model_tag': '/models/QuantTrio/Qwopus3.5-27B-v3-AWQ', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'enable_force_include_usage': True, 'model': '/models/QuantTrio/Qwopus3.5-27B-v3-AWQ', 'max_model_len': 128000, 'served_model_name': ['Qwen3.5-27B'], 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.98, 'kv_cache_dtype': 'fp8_e4m3', 'enable_prefix_caching': True, 'limit_mm_per_prompt': {'image': 1, 'video': 0}, 'max_num_batched_tokens': 4096, 'max_num_seqs': 3}
(APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
(APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
(APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
(APIServer pid=1) INFO 04-05 15:37:20 [model.py:554] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 04-05 15:37:20 [model.py:1685] Using max model len 128000
(APIServer pid=1) INFO 04-05 15:37:20 [awq_marlin.py:252] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
(APIServer pid=1) INFO 04-05 15:37:20 [cache.py:253] Using fp8_e4m3 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) INFO 04-05 15:37:20 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=1) WARNING 04-05 15:37:20 [config.py:306] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=1) INFO 04-05 15:37:20 [config.py:326] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=1) INFO 04-05 15:37:20 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 04-05 15:37:20 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
(APIServer pid=1) The tokenizer you are loading from '/models/QuantTrio/Qwopus3.5-27B-v3-AWQ' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
(APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
(APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
(APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
(APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
(APIServer pid=1) The tokenizer you are loading from '/models/QuantTrio/Qwopus3.5-27B-v3-AWQ' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
(APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
(APIServer pid=1) The tokenizer you are loading from '/models/QuantTrio/Qwopus3.5-27B-v3-AWQ' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
(EngineCore pid=198) INFO 04-05 15:37:30 [core.py:105] Initializing a V1 LLM engine (v0.19.1rc1.dev29+g93726b2a1) with config: model='/models/QuantTrio/Qwopus3.5-27B-v3-AWQ', speculative_config=None, tokenizer='/models/QuantTrio/Qwopus3.5-27B-v3-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=awq_marlin, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-27B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 4, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=198) The tokenizer you are loading from '/models/QuantTrio/Qwopus3.5-27B-v3-AWQ' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
(EngineCore pid=198) INFO 04-05 15:37:31 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.18.0.2:49055 backend=nccl
(EngineCore pid=198) INFO 04-05 15:37:31 [parallel_state.py:1712] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=198) The tokenizer you are loading from '/models/QuantTrio/Qwopus3.5-27B-v3-AWQ' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
(EngineCore pid=198) The tokenizer you are loading from '/models/QuantTrio/Qwopus3.5-27B-v3-AWQ' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
(EngineCore pid=198) INFO 04-05 15:37:35 [gpu_model_runner.py:4735] Starting to load model /models/QuantTrio/Qwopus3.5-27B-v3-AWQ...
(EngineCore pid=198) INFO 04-05 15:37:35 [cuda.py:418] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=198) INFO 04-05 15:37:35 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=198) INFO 04-05 15:37:35 [gdn_linear_attn.py:150] Using Triton/FLA GDN prefill kernel
(EngineCore pid=198) INFO 04-05 15:37:35 [awq_marlin.py:420] Using MarlinLinearKernel for AWQMarlinLinearMethod
(EngineCore pid=198) INFO 04-05 15:37:36 [cuda.py:362] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
Loading safetensors checkpoint shards: 0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 12% Completed | 1/8 [00:00<00:02, 2.94it/s]
Loading safetensors checkpoint shards: 25% Completed | 2/8 [00:00<00:01, 3.19it/s]
Loading safetensors checkpoint shards: 38% Completed | 3/8 [00:00<00:01, 3.43it/s]
Loading safetensors checkpoint shards: 50% Completed | 4/8 [00:01<00:01, 3.56it/s]
Loading safetensors checkpoint shards: 62% Completed | 5/8 [00:01<00:00, 3.58it/s]
Loading safetensors checkpoint shards: 75% Completed | 6/8 [00:01<00:00, 3.73it/s]
Loading safetensors checkpoint shards: 88% Completed | 7/8 [00:02<00:00, 3.46it/s]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:02<00:00, 3.81it/s]
(EngineCore pid=198)
(EngineCore pid=198) INFO 04-05 15:37:38 [default_loader.py:384] Loading weights took 2.11 seconds
(EngineCore pid=198) INFO 04-05 15:37:39 [gpu_model_runner.py:4820] Model loading took 19.78 GiB memory and 3.644680 seconds
(EngineCore pid=198) INFO 04-05 15:37:39 [interface.py:601] Setting attention block size to 1568 tokens to ensure that attention page size is >= mamba page size.
(EngineCore pid=198) INFO 04-05 15:37:39 [interface.py:625] Padding mamba page size by 0.13% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore pid=198) INFO 04-05 15:37:40 [gpu_model_runner.py:5760] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore pid=198) INFO 04-05 15:37:51 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/d8bc79fadb/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=198) INFO 04-05 15:37:51 [backends.py:1111] Dynamo bytecode transform time: 5.71 s
(EngineCore pid=198) INFO 04-05 15:37:52 [backends.py:372] Cache the graph of compile range (1, 4096) for later use
(EngineCore pid=198) INFO 04-05 15:38:10 [backends.py:390] Compiling a graph for compile range (1, 4096) takes 18.73 s
(EngineCore pid=198) INFO 04-05 15:38:12 [decorators.py:655] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/f69742409a60d9ca229b1df88a38af02968d0ccc0f8125d79307eec06c0e9c3a/rank_0_0/model
(EngineCore pid=198) INFO 04-05 15:38:12 [monitor.py:48] torch.compile took 26.61 s in total
(EngineCore pid=198) INFO 04-05 15:38:52 [monitor.py:76] Initial profiling/warmup run took 40.62 s
(EngineCore pid=198) INFO 04-05 15:38:53 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=4
(EngineCore pid=198) INFO 04-05 15:38:53 [gpu_model_runner.py:5883] Profiling CUDA graph memory: PIECEWISE=3 (largest=4), FULL=2 (largest=2)
(EngineCore pid=198) INFO 04-05 15:38:54 [gpu_model_runner.py:5962] Estimated CUDA graph memory: 0.43 GiB total
(EngineCore pid=198) INFO 04-05 15:38:55 [gpu_worker.py:436] Available KV cache memory: 5.51 GiB
(EngineCore pid=198) INFO 04-05 15:38:55 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9800 to 0.9937 to maintain the same effective KV cache size.
(EngineCore pid=198) INFO 04-05 15:38:55 [kv_cache_utils.py:1319] GPU KV cache size: 43,904 tokens
(EngineCore pid=198) INFO 04-05 15:38:55 [kv_cache_utils.py:1324] Maximum concurrency for 128,000 tokens per request: 1.31x
(EngineCore pid=198) 2026-04-05 15:38:55,256 - INFO - autotuner.py:446 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=198) 2026-04-05 15:38:55,533 - INFO - autotuner.py:455 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 3/3 [00:00<00:00, 42.69it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 2/2 [00:00<00:00, 11.54it/s]
(EngineCore pid=198) INFO 04-05 15:38:56 [gpu_model_runner.py:6053] Graph capturing finished in 1 secs, took 0.46 GiB
(EngineCore pid=198) INFO 04-05 15:38:56 [gpu_worker.py:597] CUDA graph pool memory: 0.46 GiB (actual), 0.43 GiB (estimated), difference: 0.03 GiB (6.4%).
(EngineCore pid=198) INFO 04-05 15:38:56 [core.py:283] init engine (profile, create kv cache, warmup model) took 76.82 seconds
(EngineCore pid=198) INFO 04-05 15:38:57 [vllm.py:799] Asynchronous scheduling is enabled.
(EngineCore pid=198) INFO 04-05 15:38:57 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) INFO 04-05 15:38:57 [api_server.py:604] Supported tasks: ['generate']
(APIServer pid=1) INFO 04-05 15:38:57 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
(APIServer pid=1) The tokenizer you are loading from '/models/QuantTrio/Qwopus3.5-27B-v3-AWQ' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
(APIServer pid=1) INFO 04-05 15:38:58 [hf.py:314] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
(APIServer pid=1) The tokenizer you are loading from '/models/QuantTrio/Qwopus3.5-27B-v3-AWQ' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
(APIServer pid=1) INFO 04-05 15:39:02 [base.py:245] Multi-modal warmup completed in 3.681s
(APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_interleaved', 'mrope_section'}
(APIServer pid=1) INFO 04-05 15:39:02 [api_server.py:608] Starting vLLM server on http://0.0.0.0:8000
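In case it's related: the log keeps repeating the tokenizer warning that suggests setting fix_mistral_regex=True. As a quick standalone check I tried loading the tokenizer with that flag directly (sketch below; the flag name comes from the warning text itself, and I'm not sure how, or whether, it gets forwarded through vLLM's own tokenizer loading):

```python
# Sketch only: fix_mistral_regex is the flag named in the repeated warning above;
# passing it here via AutoTokenizer kwargs is my assumption, not a confirmed fix.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "/models/QuantTrio/Qwopus3.5-27B-v3-AWQ",
    fix_mistral_regex=True,  # flag suggested by the tokenizer warning in the log
)
print(tok("hello world"))
```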