INFO 09-18 14:31:14 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=3508930) INFO 09-18 14:31:16 [api_server.py:1805] vLLM API server version 0.10.1.1
(APIServer pid=3508930) INFO 09-18 14:31:16 [utils.py:326] non-default args: {'model_tag': '/data/wyt/codes/DocDPO/sft/checkpoints_multilang/ted_base_balanced_en_zhdefr_320/dpo/merged/checkpoint-1800', 'host': '0.0.0.0', 'port': 8011, 'model': '/data/wyt/codes/DocDPO/sft/checkpoints_multilang/ted_base_balanced_en_zhdefr_320/dpo/merged/checkpoint-1800', 'served_model_name': ['qwen'], 'enable_prefix_caching': True}
(APIServer pid=3508930) INFO 09-18 14:31:22 [__init__.py:711] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=3508930) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=3508930) INFO 09-18 14:31:22 [__init__.py:1750] Using max model len 32768
(APIServer pid=3508930) INFO 09-18 14:31:23 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 09-18 14:31:27 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=3509752) INFO 09-18 14:31:29 [core.py:636] Waiting for init message from front-end.
(EngineCore_0 pid=3509752) INFO 09-18 14:31:29 [core.py:74] Initializing a V1 LLM engine (v0.10.1.1) with config: model='/data/wyt/codes/DocDPO/sft/checkpoints_multilang/ted_base_balanced_en_zhdefr_320/dpo/merged/checkpoint-1800', speculative_config=None, tokenizer='/data/wyt/codes/DocDPO/sft/checkpoints_multilang/ted_base_balanced_en_zhdefr_320/dpo/merged/checkpoint-1800', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_0 pid=3509752) INFO 09-18 14:31:30 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=3509752) WARNING 09-18 14:31:30 [topk_topp_sampler.py:61] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=3509752) INFO 09-18 14:31:30 [gpu_model_runner.py:1953] Starting to load model /data/wyt/codes/DocDPO/sft/checkpoints_multilang/ted_base_balanced_en_zhdefr_320/dpo/merged/checkpoint-1800...
(EngineCore_0 pid=3509752) INFO 09-18 14:31:30 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=3509752) INFO 09-18 14:31:30 [cuda.py:328] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=3509752) 
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
(EngineCore_0 pid=3509752) 
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:03,  1.29s/it]
(EngineCore_0 pid=3509748) 
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:03<00:03,  1.70s/it]
(EngineCore_0 pid=3509744) 
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:04<00:01,  1.39s/it]
(EngineCore_0 pid=3509748) 
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.53s/it]
(EngineCore_0 pid=3509748) 
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.52s/it]
(EngineCore_0 pid=3509748) 
(EngineCore_0 pid=3509744) INFO 09-18 14:31:37 [default_loader.py:262] Loading weights took 6.39 seconds
(EngineCore_0 pid=3509744) INFO 09-18 14:31:37 [gpu_model_runner.py:2007] Model loading took 14.2488 GiB and 6.587665 seconds
(EngineCore_0 pid=3509744) INFO 09-18 14:31:44 [backends.py:548] Using cache directory: /data/wyt/.cache/vllm/torch_compile_cache/1fe949e292/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=3509744) INFO 09-18 14:31:44 [backends.py:559] Dynamo bytecode transform time: 6.39 s
(EngineCore_0 pid=3509744) INFO 09-18 14:31:49 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 4.697 s
(EngineCore_0 pid=3509744) INFO 09-18 14:31:52 [monitor.py:34] torch.compile takes 6.39 s in total
(EngineCore_0 pid=3509744) INFO 09-18 14:31:53 [gpu_worker.py:276] Available KV cache memory: 51.38 GiB
(EngineCore_0 pid=3509744) INFO 09-18 14:31:53 [kv_cache_utils.py:849] GPU KV cache size: 962,112 tokens
(EngineCore_0 pid=3509744) INFO 09-18 14:31:53 [kv_cache_utils.py:853] Maximum concurrency for 32,768 tokens per request: 29.36x
(EngineCore_0 pid=3509744) 
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/67 [00:00<?, ?it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   3%|β–Ž         | 2/67 [00:00<00:03, 18.81it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   6%|β–Œ         | 4/67 [00:00<00:03, 19.08it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   9%|β–‰         | 6/67 [00:00<00:03, 19.08it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  12%|β–ˆβ–        | 8/67 [00:00<00:03, 18.31it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  16%|β–ˆβ–‹        | 11/67 [00:00<00:02, 18.95it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  19%|β–ˆβ–‰        | 13/67 [00:00<00:02, 18.88it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  22%|β–ˆβ–ˆβ–       | 15/67 [00:00<00:02, 18.86it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  27%|β–ˆβ–ˆβ–‹       | 18/67 [00:00<00:02, 19.62it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  31%|β–ˆβ–ˆβ–ˆβ–      | 21/67 [00:01<00:02, 20.42it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  36%|β–ˆβ–ˆβ–ˆβ–Œ      | 24/67 [00:01<00:02, 21.04it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  40%|β–ˆβ–ˆβ–ˆβ–ˆ      | 27/67 [00:01<00:01, 21.36it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  45%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 30/67 [00:01<00:01, 21.03it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 33/67 [00:01<00:01, 21.63it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 36/67 [00:01<00:01, 22.82it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 39/67 [00:01<00:01, 23.81it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 42/67 [00:01<00:01, 23.68it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 45/67 [00:02<00:00, 24.29it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 48/67 [00:02<00:00, 24.79it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ  | 51/67 [00:02<00:00, 25.57it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  81%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  | 54/67 [00:02<00:00, 26.28it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 57/67 [00:02<00:00, 26.36it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 60/67 [00:02<00:00, 26.96it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 64/67 [00:02<00:00, 28.16it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 67/67 [00:02<00:00, 27.23it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 67/67 [00:02<00:00, 23.13it/s]
(EngineCore_0 pid=3509744) INFO 09-18 14:31:56 [gpu_model_runner.py:2708] Graph capturing finished in 3 secs, took 1.56 GiB
(EngineCore_0 pid=3509744) INFO 09-18 14:31:56 [core.py:214] init engine (profile, create kv cache, warmup model) took 19.28 seconds
(APIServer pid=3508927) INFO 09-18 14:31:57 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 60132
(APIServer pid=3508927) INFO 09-18 14:31:57 [api_server.py:1611] Supported_tasks: ['generate']
(APIServer pid=3508927) WARNING 09-18 14:31:57 [__init__.py:1625] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=3508927) INFO 09-18 14:31:57 [serving_responses.py:120] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=3508927) INFO 09-18 14:31:57 [serving_chat.py:134] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=3508927) INFO 09-18 14:31:57 [serving_completion.py:77] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=3508927) INFO 09-18 14:31:57 [api_server.py:1880] Starting vLLM API server 0 on http://0.0.0.0:8012
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:36] Available routes are:
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /docs, Methods: HEAD, GET
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=3508927) INFO 09-18 14:31:57 [launcher.py:44] Route: /invocations, Met
(APIServer pid=3508933) INFO 09-18 14:32:00 [chat_utils.py:470] Detected the chat template content format to
(APIServer pid=3508930) INFO 09-18 14:32:00 [chat_utils.py:470] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to ov
(APIServer pid=3508933) INFO 09-18 14:32:07 [loggers.py:123] Engine 000: Avg prompt throughput: 47.2 tokens/s
(APIServer pid=3508930) INFO 09-18 14:32:07 [loggers.py:123] Engine 000: Avg prompt throughput: 47.2 tokens/s, Avg gene
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO 09-18 14:32:17 [loggers.py:123] Engine 000: Avg prompt throughput: 608.0 tokens/
(APIServer pid=3508930) INFO 09-18 14:32:17 [loggers.py:123] Engine 000: Avg prompt throughput: 608.0 tokens/s, Avg gener
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO 09-18 14:32:27 [loggers.py:123] Engine 000: Avg prompt throughput: 195.2 tokens/
(APIServer pid=3508930) INFO 09-18 14:32:27 [loggers.py:123] Engine 000: Avg prompt throughput: 195.2 tokens/s, Avg gener
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:39414 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:39414 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO 09-18 14:32:37 [loggers.py:123] Engine 000: Avg prompt throughput: 325.8 tokens/
(APIServer pid=3508930) INFO 09-18 14:32:37 [loggers.py:123] Engine 000: Avg prompt throughput: 325.8 tokens/s, Avg gener
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:39414 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO 09-18 14:32:47 [loggers.py:123] Engine 000: Avg prompt throughput: 694.0 tokens/
(APIServer pid=3508930) INFO 09-18 14:32:47 [loggers.py:123] Engine 000: Avg prompt throughput: 694.1 tokens/s, Avg gener
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:48072 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:41594 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO 09-18 14:32:57 [loggers.py:123] Engine 000: Avg prompt throughput: 373.7 tokens/s, Avg gener
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO 09-18 14:33:07 [loggers.py:123] Engine 000: Avg prompt throughput: 249.6 tokens/
(APIServer pid=3508930) INFO 09-18 14:33:07 [loggers.py:123] Engine 000: Avg prompt throughput: 209.7 tokens/s, Avg gener
(APIServer pid=3508933) INFO:     127.0.0.1:45372 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:45372 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO 09-18 14:33:17 [loggers.py:123] Engine 000: Avg prompt throughput: 356.6 tokens/
(APIServer pid=3508930) INFO 09-18 14:33:17 [loggers.py:123] Engine 000: Avg prompt throughput: 356.6 tokens/s, Avg generation throughput: 84.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508933) INFO 09-18 14:33:27 [loggers.py:123] Engine 000: Avg prompt throughput: 786.5 tokens/
(APIServer pid=3508930) INFO 09-18 14:33:27 [loggers.py:123] Engine 000: Avg prompt throughput: 786.5 tokens/s, Avg gener
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:54502 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO 09-18 14:33:37 [loggers.py:123] Engine 000: Avg prompt throughput: 253.4 tokens/
(APIServer pid=3508930) INFO 09-18 14:33:37 [loggers.py:123] Engine 000: Avg prompt throughput: 253.4 tokens/s, Avg gener
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:54502 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO 09-18 14:33:47 [loggers.py:123] Engine 000: Avg prompt throughput: 376.9 tokens/s, Avg generation throughput: 81.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit
(APIServer pid=3508933) INFO:     127.0.0.1:36442 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508933) INFO 09-18 14:33:57 [loggers.py:123] Engine 000: Avg prompt throughput: 848.9 tokens/
(APIServer pid=3508930) INFO 09-18 14:33:57 [loggers.py:123] Engine 000: Avg prompt throughput: 849.0 tokens/s, Avg gener
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO:     127.0.0.1:35254 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508933) INFO 09-18 14:34:07 [loggers.py:123] Engine 000: Avg prompt throughput: 269.5 tokens/
(APIServer pid=3508930) INFO 09-18 14:34:07 [loggers.py:123] Engine 000: Avg prompt throughput: 269.5 tokens/s, Avg gener
(APIServer pid=3508933) INFO:     127.0.0.1:56172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3508930) INFO:     127.0.0.1:32922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
tion throughput: 84.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 57.6%
(APIServer pid=3508927) INFO:     127.0.0.1:58182 - "POST /v1/chat/completions HTTP/1.1" 200 OK
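For reference, the `POST /v1/chat/completions` requests traced above can be reproduced with a minimal client sketch (not part of the log). `build_chat_request` is a hypothetical helper; `"qwen"` is the `--served-model-name` from the launch command, and the sampling values echo the default chat sampling params the server reported at startup. The target URL assumes one of the ports shown in the log (e.g. 8011).

```python
import json

def build_chat_request(user_message: str, max_tokens: int = 256) -> dict:
    """Build a /v1/chat/completions request body for the server above.

    The sampling fields mirror the defaults logged at startup:
    {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}.
    """
    return {
        "model": "qwen",  # --served-model-name from the launch logs
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "repetition_penalty": 1.05,
    }

# Serialized body; to actually send it (not done here), POST this JSON to
# e.g. http://0.0.0.0:8011/v1/chat/completions with Content-Type: application/json.
payload = json.dumps(build_chat_request("Translate this sentence into German."))
```

Since prefix caching is enabled (`enable_prefix_caching: True`), requests that share a common prompt prefix reuse cached KV blocks, which is consistent with the nonzero prefix cache hit rate in the final throughput lines.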