Setting for Throughput Experiments
Hi there, thank you very much for your fantastic work. I noticed the paper gives only a brief description of the throughput experiment:
"We use an input sequence length of 65536 and ask the models to generate 1024 output tokens. We use an initial Megatron-LM implementation for Nemotron-H inference and vLLM v0.7.3 for baselines. In these experiments, we try to maximize per-GPU inference throughput by using as large a batch size as possible, and we run all experiments on NVIDIA H100 GPUs."
When I used the NeMo Docker container (nemo:25.04.nemotron-h) to test inference throughput, I couldn't reproduce a similar result (I only get ~200 tokens/s for Nemotron-H-8B). May I request a more detailed description of the experimental setup, such as the batch size, warmup procedure, and any other acceleration methods used?
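For context, this is how I compute the ~200 tokens/s figure on my side (a minimal sketch with hypothetical timings; `generation_throughput` is my own helper, not something from the paper, NeMo, or vLLM):

```python
def generation_throughput(batch_size: int, output_tokens: int, elapsed_s: float) -> float:
    """Tokens generated per second across the whole batch:
    total generated tokens divided by wall-clock decode time."""
    return batch_size * output_tokens / elapsed_s

# Hypothetical example: a batch of 8 requests, 1024 output tokens each,
# finishing in 40 s gives 8 * 1024 / 40 = 204.8 tokens/s,
# roughly the ~200 tokens/s I observe.
print(generation_throughput(8, 1024, 40.0))
```

If the paper's numbers were computed differently (e.g. counting prefill tokens, or per-request rather than per-batch), that alone could explain part of the gap, so clarification on the metric definition would also help.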