Optimize GPU KV cache memory usage
I am running Voxtral via vLLM on an NVIDIA GeForce RTX 4090 (~24 GB of memory).
With the default configuration I hit 100% memory usage really quickly (with only one audio stream):
docker run --gpus all -p 8000:8000 vllm-voxtral:latest mistralai/Voxtral-Mini-4B-Realtime-2602 --compilation_config '{"cudagraph_mode": "PIECEWISE"}' --gpu-memory-utilization 0.9
But I found out that with the following parameters I can decrease the memory usage. One stream can run for a very long time (more than an hour, using no more than 34.6% of memory after 10 minutes):
--max-model-len 100
--max-num-seqs 90000
I am curious how I can further optimize memory usage to run more streams for longer.
I couldn't find any information on how or why those parameters improve memory usage.
The following parameter also has no effect, and I didn't find a parameter that could decrease the size of the sliding window:
--no-disable-sliding-window
I will be grateful for any info that might help.
Hey, I strongly suggest taking a look at the vLLM documentation for this:
https://docs.vllm.ai/en/stable/configuration/conserving_memory/
https://docs.vllm.ai/en/stable/configuration/optimization/?h=max_num_seqs#preemption
Basically the way to reduce memory usage is to reduce the number of tokens vLLM might receive:
- max-model-len is the maximum context length your model can have, so depending on your task you might expect a low or high context length and you can adjust it
- max-num-seqs is the number of requests vLLM can process at the same time. If you plan on serving only yourself, one request at a time, you can set it to 1 or a few more, but 90000 is definitely not what you want
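For illustration, a launch command along those lines might look like the sketch below. The flag values here are assumptions for a low-concurrency setup, not a recommendation; tune them to the context length and number of parallel requests you actually expect:

```shell
# Sketch only: the --max-model-len and --max-num-seqs values below are
# illustrative assumptions for a low-concurrency deployment.
docker run --gpus all -p 8000:8000 vllm-voxtral:latest \
    mistralai/Voxtral-Mini-4B-Realtime-2602 \
    --compilation_config '{"cudagraph_mode": "PIECEWISE"}' \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 4
```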
Hi @juliendenize, thank you for the response!
I actually made a mistake when describing my problem. I meant the following instead:
--max-model-len 90000
--max-num-seqs 100
My goal is to optimize memory consumption while maximizing the number of concurrent audio streams. This setup is not for personal use.
I set --max-model-len to 90000 because, according to the documentation, one token corresponds to an 80 ms chunk. My plan was to live-transcribe 2-hour meetings:
--max-model-len >= 7200 / 0.08 = 90000
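As a sanity check, the arithmetic above can be reproduced with a quick calculation (assuming the documented 80 ms of audio per token; the duration is just the 2-hour example):

```shell
# Tokens needed for a given audio duration at 80 ms of audio per token.
duration_s=7200        # 2-hour meeting
ms_per_token=80
echo $(( duration_s * 1000 / ms_per_token ))   # prints 90000
```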
However, it does not behave as I expected.
I split the streams into 4-minute chunks, which helped increase the number of concurrent streams from 1 to 10, with around 97% GPU KV cache memory usage. However, this is still not sufficient for my use case. Now I am trying to reduce the chunk size even further (e.g., to 10 seconds), but this introduces multiple issues related to WebSocket handling and increased load on the vLLM server.
In any case, thank you very much for the links and the explanation of the parameters; I found them very helpful!
If I achieve better results (more streams with lower memory consumption), I will share them here.
The best I can achieve on an NVIDIA GeForce RTX 4090 with vLLM (0.16.1rc1.dev173+g8fa68a8ce):
- 58 streams on 10-second audio chunks (max-model-len=200)
- 34 streams on 20-second audio chunks (max-model-len=400)
- 13 streams on 1-minute audio chunks (max-model-len=800)
- 9 streams on 5-minute audio chunks (max-model-len=4000)
max-num-seqs doesn't affect memory usage at all. It only limits how many requests the model can run simultaneously.
This means you can feed at most n streams to the model within n seconds/minutes, and then you need to send input_audio_buffer.commit with final=True, because otherwise the model stays locked against the vLLM memory limit.
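The relationship between max-model-len and stream count can be sketched as a back-of-the-envelope capacity estimate. The KV cache token count below is a hypothetical placeholder; the real value should be read from vLLM's startup logs:

```shell
# Rough upper bound on concurrent streams: total KV cache tokens divided by
# the worst-case tokens per stream (max-model-len). Placeholder values only.
kv_cache_tokens=500000   # hypothetical; read yours from the vLLM startup log
max_model_len=4000       # the 5-minute-chunk setting from the list above
echo $(( kv_cache_tokens / max_model_len ))   # prints 125
```

In practice the real numbers land below this bound (as my measurements above show), since not every stream sits at the full max-model-len at once.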
It is worth mentioning that with small audio chunks (10/20 seconds), the model starts missing the context of the audio and can transcribe the whole audio in a different language; in my case, German was fully replaced by Hindi.
@patrickvonplaten Please read that ->
Another thing worth mentioning is the wasted KV cache memory reported in the vLLM logs. This might be fixed in the future if it is really an issue, or it might just be misleading logging.
...
(EngineCore_DP0 pid=111) INFO 03-04 15:03:01 [gpu_worker.py:423] Available KV cache memory: 11.79 GiB
(EngineCore_DP0 pid=111) WARNING 03-04 15:03:01 [kv_cache_utils.py:1054] Add 6 padding layers, may waste at most 23.08% KV cache memory
...
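To put that warning in perspective, 23.08% of the 11.79 GiB available KV cache works out to roughly 2.7 GiB of potentially wasted memory:

```shell
# Worst-case KV cache waste implied by the padding-layers warning above.
awk 'BEGIN { printf "%.2f GiB\n", 11.79 * 0.2308 }'   # prints "2.72 GiB"
```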
TL;DR: to run Voxtral Realtime with many streams for a long time, you need something better than a single NVIDIA GeForce RTX 4090.
If you manage to get better results on the same or almost the same hardware, I would appreciate it if you tag me and share your idea :)