Beam search is much slower than sampling search; code and test results are provided.
#5 opened by iflamed
The code (vllm==0.8.0, transformers==4.51.3, flashinfer-python==0.2.2):
from vllm import LLM, SamplingParams
from vllm.sampling_params import BeamSearchParams
import time

model_path = "ByteDance-Seed/Seed-X-Instruct-7B"
model = LLM(
    model=model_path,
    max_num_seqs=512,
    tensor_parallel_size=1,
    enable_prefix_caching=True,
    gpu_memory_utilization=0.95,
    task="generate",
)
prompts = [
    # without CoT
    {"prompt": "Translate the following English sentence into Chinese:\nMay the force be with you <zh>"},
    # with CoT
    {"prompt": "Translate the following English sentence into Chinese and explain it in detail:\nMay the force be with you <zh>"},
]
messages = [
    # without CoT
    "Translate the following English sentence into Chinese:\nMay the force be with you <zh>",
    # with CoT
    "Translate the following English sentence into Chinese and explain it in detail:\nMay the force be with you <zh>",
]
# Beam search (We recommend using beam search decoding)
decoding_params = BeamSearchParams(beam_width=4, max_tokens=512)

# Time beam search decoding
start_time = time.time()
results = model.beam_search(prompts, decoding_params)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"BeamSearchParams program run time: {elapsed_time:.4f} seconds")
for output in results:
    generated_text = output.sequences[0].text
    print(f"Generated text: {generated_text!r}")
# Greedy sampling (temperature=0)
decoding_sample_params = SamplingParams(
    temperature=0,
    max_tokens=512,
    skip_special_tokens=True,
)

# Time sampling decoding
start_time = time.time()
results = model.generate(messages, decoding_sample_params)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"SamplingParams program run time: {elapsed_time:.4f} seconds")
responses = [res.outputs[0].text.strip() for res in results]
print(responses)
Results:
INFO 07-24 20:38:44 [__init__.py:256] Automatically detected platform cuda.
INFO 07-24 20:38:52 [config.py:1693] Chunked prefill is enabled with max_num_batched_tokens=8192.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO 07-24 20:38:56 [core.py:53] Initializing a V1 LLM engine (v0.8.0) with config: model='ByteDance-Seed/Seed-X-Instruct-7B', speculative_config=None, tokenizer='ByteDance-Seed/Seed-X-Instruct-7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=ByteDance-Seed/Seed-X-Instruct-7B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
2025-07-24 20:38:56,450 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
WARNING 07-24 20:38:56 [utils.py:2282] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fa6d5329730>
INFO 07-24 20:38:57 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 07-24 20:38:57 [cuda.py:215] Using Flash Attention backend on V1 engine.
INFO 07-24 20:38:57 [gpu_model_runner.py:1128] Starting to load model ByteDance-Seed/Seed-X-Instruct-7B...
INFO 07-24 20:38:57 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
INFO 07-24 20:38:58 [weight_utils.py:257] Using model weights format ['*.safetensors']
INFO 07-24 20:38:58 [weight_utils.py:307] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.35s/it]
INFO 07-24 20:39:01 [loader.py:429] Loading weights took 2.75 seconds
INFO 07-24 20:39:01 [gpu_model_runner.py:1140] Model loading took 14.0045 GB and 4.048324 seconds
INFO 07-24 20:39:08 [backends.py:409] Using cache directory: /root/.cache/vllm/torch_compile_cache/a61cc1c890/rank_0_0 for vLLM's torch.compile
INFO 07-24 20:39:08 [backends.py:419] Dynamo bytecode transform time: 6.91 s
INFO 07-24 20:39:08 [backends.py:115] Directly load the compiled graph for shape None from the cache
INFO 07-24 20:39:14 [monitor.py:33] torch.compile takes 6.91 s in total
2025-07-24 20:39:14,915 - INFO - flashinfer.jit: Loading JIT ops: sampling
/root/miniconda3/envs/seedx/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
2025-07-24 20:39:15,080 - INFO - flashinfer.jit: Finished loading JIT ops: sampling
INFO 07-24 20:39:15 [kv_cache_utils.py:537] GPU KV cache size: 56,400 tokens
INFO 07-24 20:39:15 [kv_cache_utils.py:540] Maximum concurrency for 32,768 tokens per request: 1.72x
INFO 07-24 20:39:29 [gpu_model_runner.py:1436] Graph capturing finished in 14 secs, took 0.51 GiB
INFO 07-24 20:39:29 [core.py:138] init engine (profile, create kv cache, warmup model) took 28.04 seconds
BeamSearchParams program run time: 22.1455 seconds
Generated text: '<s> Translate the following English sentence into Chinese:\nMay the force be with you <zh><s> 愿原力与你同在</s>'
Generated text: '<s> Translate the following English sentence into Chinese and explain it in detail:\nMay the force be with you <zh><s> 愿原力与你同在</s>'
Processed prompts: 100%|████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10.96it/s, est. speed input: 235.87 toks/s, output: 109.69 toks/s]
SamplingParams program run time: 0.1910 seconds
['愿原力与你同在', '愿原力与你同在']
Beam search takes 22.1455 seconds while sampling takes 0.1910 seconds, over 100x slower for the same prompts and token budget.
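For what it's worth, the gap looks consistent with how `LLM.beam_search` works in recent vLLM versions: it appears to be emulated in Python on top of the engine, re-submitting every live beam candidate at each decode step, so a 512-token budget means hundreds of engine round-trips per prompt, while `generate` finishes in a single call. A fairer throughput comparison is parallel sampling with the same candidate count. Below is a minimal sketch (not from the original benchmark), assuming the `model` and `messages` objects defined above and vLLM's `n` parallel-sampling parameter on SamplingParams:

import time
from vllm import SamplingParams

# Sketch: 4 candidates per prompt via parallel sampling instead of beam search.
parallel_params = SamplingParams(
    n=4,              # same candidate count as beam_width=4
    temperature=0.7,  # must be > 0 so the n candidates can differ
    max_tokens=512,
    skip_special_tokens=True,
)

start_time = time.time()
results = model.generate(messages, parallel_params)
elapsed_time = time.time() - start_time
print(f"n=4 sampling run time: {elapsed_time:.4f} seconds")

for res in results:
    # Each request carries n completions.
    for seq in res.outputs:
        print(seq.text.strip())

If this still runs in well under a second, the engine itself is not the bottleneck and the overhead sits in the beam search wrapper.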