Beam search is much slower than sampling search; code and test results are provided.
#5 opened by iflamed
The code (vllm==0.8.0, transformers==4.51.3, flashinfer-python==0.2.2):
from vllm import LLM, SamplingParams
from vllm.sampling_params import BeamSearchParams
import time

model_path = "ByteDance-Seed/Seed-X-Instruct-7B"
model = LLM(
    model=model_path,
    max_num_seqs=512,
    tensor_parallel_size=1,
    enable_prefix_caching=True,
    gpu_memory_utilization=0.95,
    task="generate",
)
prompts = [
    # without CoT
    {"prompt": "Translate the following English sentence into Chinese:\nMay the force be with you <zh>"},
    # with CoT
    {"prompt": "Translate the following English sentence into Chinese and explain it in detail:\nMay the force be with you <zh>"},
]
messages = [
    # without CoT
    "Translate the following English sentence into Chinese:\nMay the force be with you <zh>",
    # with CoT
    "Translate the following English sentence into Chinese and explain it in detail:\nMay the force be with you <zh>",
]
# Beam search (We recommend using beam search decoding)
decoding_params = BeamSearchParams(beam_width=4, max_tokens=512)

# Time beam search decoding
start_time = time.time()
results = model.beam_search(prompts, decoding_params)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"BeamSearchParams program run time: {elapsed_time:.4f} seconds")
for output in results:
    generated_text = output.sequences[0].text
    print(f"Generated text: {generated_text!r}")
# Greedy sampling (temperature=0)
decoding_sample_params = SamplingParams(
    temperature=0,
    max_tokens=512,
    skip_special_tokens=True,
)

# Time sampling decoding
start_time = time.time()
results = model.generate(messages, decoding_sample_params)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"SamplingParams program run time: {elapsed_time:.4f} seconds")
responses = [res.outputs[0].text.strip() for res in results]
print(responses)
Results:
INFO 07-24 20:38:44 [__init__.py:256] Automatically detected platform cuda.
INFO 07-24 20:38:52 [config.py:1693] Chunked prefill is enabled with max_num_batched_tokens=8192.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO 07-24 20:38:56 [core.py:53] Initializing a V1 LLM engine (v0.8.0) with config: model='ByteDance-Seed/Seed-X-Instruct-7B', speculative_config=None, tokenizer='ByteDance-Seed/Seed-X-Instruct-7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=ByteDance-Seed/Seed-X-Instruct-7B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
2025-07-24 20:38:56,450 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
WARNING 07-24 20:38:56 [utils.py:2282] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fa6d5329730>
INFO 07-24 20:38:57 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 07-24 20:38:57 [cuda.py:215] Using Flash Attention backend on V1 engine.
INFO 07-24 20:38:57 [gpu_model_runner.py:1128] Starting to load model ByteDance-Seed/Seed-X-Instruct-7B...
INFO 07-24 20:38:57 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
INFO 07-24 20:38:58 [weight_utils.py:257] Using model weights format ['*.safetensors']
INFO 07-24 20:38:58 [weight_utils.py:307] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.35s/it]
INFO 07-24 20:39:01 [loader.py:429] Loading weights took 2.75 seconds
INFO 07-24 20:39:01 [gpu_model_runner.py:1140] Model loading took 14.0045 GB and 4.048324 seconds
INFO 07-24 20:39:08 [backends.py:409] Using cache directory: /root/.cache/vllm/torch_compile_cache/a61cc1c890/rank_0_0 for vLLM's torch.compile
INFO 07-24 20:39:08 [backends.py:419] Dynamo bytecode transform time: 6.91 s
INFO 07-24 20:39:08 [backends.py:115] Directly load the compiled graph for shape None from the cache
INFO 07-24 20:39:14 [monitor.py:33] torch.compile takes 6.91 s in total
2025-07-24 20:39:14,915 - INFO - flashinfer.jit: Loading JIT ops: sampling
/root/miniconda3/envs/seedx/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
2025-07-24 20:39:15,080 - INFO - flashinfer.jit: Finished loading JIT ops: sampling
INFO 07-24 20:39:15 [kv_cache_utils.py:537] GPU KV cache size: 56,400 tokens
INFO 07-24 20:39:15 [kv_cache_utils.py:540] Maximum concurrency for 32,768 tokens per request: 1.72x
INFO 07-24 20:39:29 [gpu_model_runner.py:1436] Graph capturing finished in 14 secs, took 0.51 GiB
INFO 07-24 20:39:29 [core.py:138] init engine (profile, create kv cache, warmup model) took 28.04 seconds
BeamSearchParams program run time: 22.1455 seconds
Generated text: '<s> Translate the following English sentence into Chinese:\nMay the force be with you <zh><s> 愿原力与你同在</s>'
Generated text: '<s> Translate the following English sentence into Chinese and explain it in detail:\nMay the force be with you <zh><s> 愿原力与你同在</s>'
Processed prompts: 100%|████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10.96it/s, est. speed input: 235.87 toks/s, output: 109.69 toks/s]
SamplingParams program run time: 0.1910 seconds
['愿原力与你同在', '愿原力与你同在']
Beam search takes 22.1455 seconds while sampling takes 0.1910 seconds, over 100x slower for the same prompts and token budget.
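For what it's worth, the gap looks consistent with how `LLM.beam_search` works in recent vLLM versions: it appears to be emulated in Python on top of the engine, re-submitting every live beam candidate at each decode step, so a 512-token budget means hundreds of engine round-trips per prompt, while `generate` finishes in a single call. A fairer throughput comparison is parallel sampling with the same candidate count. Below is a minimal sketch (not from the original benchmark), assuming the `model` and `messages` objects defined above and vLLM's `n` parallel-sampling parameter on SamplingParams:

import time
from vllm import SamplingParams

# Sketch: 4 candidates per prompt via parallel sampling instead of beam search.
parallel_params = SamplingParams(
    n=4,              # same candidate count as beam_width=4
    temperature=0.7,  # must be > 0 so the n candidates can differ
    max_tokens=512,
    skip_special_tokens=True,
)

start_time = time.time()
results = model.generate(messages, parallel_params)
elapsed_time = time.time() - start_time
print(f"n=4 sampling run time: {elapsed_time:.4f} seconds")

for res in results:
    # Each request carries n completions.
    for seq in res.outputs:
        print(seq.text.strip())

If this still runs in well under a second, the engine itself is not the bottleneck and the overhead sits in the beam search wrapper.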