vLLM comes with built-in benchmarks. I will compare LLM inference performance on different GPUs, and I will compare tensor parallelism against data parallelism to find a deployment method that gives better performance.
I will benchmark two larger models on multiple GPUs:
- Qwen/Qwen3-Coder-30B-A3B-Instruct
- openai/gpt-oss-120b
And smaller models on a single GPU (Nvidia RTX 4090, A100, H100, H200):
- google/gemma-3-4b-it
- meta-llama/Llama-3.1-8B-Instruct
- mistralai/Mistral-7B-Instruct-v0.3
- Qwen/Qwen2.5-Coder-7B-Instruct
Tensor Parallel vs Data Parallel
There are different deployment methods for LLM inference across multiple GPUs or multiple servers:
- Tensor Parallelism
- Data Parallelism
- Pipeline Parallelism
- ...
I will run inference on a single server, and I want to compare only Tensor Parallelism vs Data Parallelism. Tensor parallelism shards each model layer across GPUs; data parallelism runs a full copy of the model on each GPU (or group of GPUs) and splits requests between the copies.
I will also benchmark a mix of both methods:
- Tensor Parallelism + Data Parallelism
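For reference, the three setups differ only in the -tp/-dp values passed to vllm serve. A minimal sketch (the model name is a placeholder):
vllm serve <model> -tp=4          # tensor parallelism: one copy of the model, sharded across 4 GPUs
vllm serve <model> -dp=4          # data parallelism: 4 full replicas, requests split between them
vllm serve <model> -tp=2 -dp=2    # mixed: 2 replicas, each sharded across 2 GPUs (4 GPUs total)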
Main vLLM server for benchmarks
- OS: Rocky Linux 9.6
- Hardware: Dell PowerEdge XE9680
- Processors: 2 x Intel Xeon Platinum 8470
- RAM: 2TB
- GPUs: 8 x GPUs H100 80GB
- Server IP address: 10.1.x.x
TORCH_CUDA_ARCH_LIST
TORCH_CUDA_ARCH_LIST is an environment variable that should be set to the compute capability of the Nvidia GPU before starting the vLLM server.
More information here: https://en.wikipedia.org/wiki/CUDA
In my benchmarks, I will use:
- TORCH_CUDA_ARCH_LIST=8.0 # for A100
- TORCH_CUDA_ARCH_LIST=8.9 # for RTX 4090
- TORCH_CUDA_ARCH_LIST=9.0 # for H200, H100
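If you are not sure which value applies, the compute capability can be queried directly with nvidia-smi (the compute_cap query field is available in recent drivers); on an H100 it prints something like:
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
NVIDIA H100 80GB HBM3, 9.0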
Install vLLM server
I installed Python 3.12 because of flashinfer-python issues with default Python 3.9 on Rocky Linux 9.6.
dnf -y install python3.12 python3.12-devel
Create a dedicated user to run vLLM:
useradd vllm
Switch to that user and create a Python virtual environment for vLLM:
su - vllm
python3.12 -m venv venv-vllm
source venv-vllm/bin/activate
Find CUDA version:
nvidia-smi | grep CUDA
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
I have CUDA 12.9 installed. The CUDA version determines the cu129 suffix of the index URL in the next command.
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129
pip install flashinfer-python
vllm --version
0.10.1.1
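Optionally, a quick sanity check that PyTorch inside the virtual environment can see CUDA and the GPUs:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.device_count())"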
NOTE 1: flashinfer-python is not a critical component, but it improves performance.
NOTE 2: The Gemma, Llama, and Mistral models require accepting a license on Hugging Face and using an access token.
Install Huggingface CLI
pip install -U "huggingface_hub[cli]"
Log in with a token (create one at https://huggingface.co):
hf auth login
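Optionally, models can be downloaded ahead of time so the first vllm serve start does not wait for the download (hf download is part of the same CLI):
hf download Qwen/Qwen3-Coder-30B-A3B-Instruct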
Example: manually start vLLM on two GPUs with tensor-parallel 2:
export TORCH_CUDA_ARCH_LIST=9.0
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct -tp=2 --port 8001 --max-model-len 65536 --enable-auto-tool-choice --tool-call-parser qwen3_coder
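Once the server is up, it can be checked with plain OpenAI-compatible HTTP requests (port 8001 as above):
curl http://localhost:8001/v1/models
curl http://localhost:8001/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen3-Coder-30B-A3B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'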
I also want inference to start automatically on boot.
Create the systemd service file /etc/systemd/system/vllm-8001.service:
[Unit]
Description=vLLM Service 8001
After=network-online.target
[Service]
WorkingDirectory=/home/vllm/
ExecStart=/bin/sh -c 'cd /home/vllm && source venv-vllm/bin/activate && vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct -tp=2 -dp=1 --port 8001 --max-model-len 65536 --enable-auto-tool-choice --tool-call-parser qwen3_coder'
User=vllm
Group=vllm
Restart=always
RestartSec=3
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="VLLM_CONFIGURE_LOGGING=0"
Environment="TORCH_CUDA_ARCH_LIST=9.0"
[Install]
WantedBy=multi-user.target
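Then reload systemd, enable the service so it starts on boot, and follow the logs while the model loads:
systemctl daemon-reload
systemctl enable --now vllm-8001.service
journalctl -u vllm-8001.service -f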
Benchmark client
The benchmark can be started locally from the same server or from a remote server.
The remote server also runs Rocky Linux 9.6.
Install vLLM there in a Python virtual environment (venv-vllm), the same way as on the vLLM server.
Benchmarking
I will restart the vLLM server with different combinations, modifying only these 2 options:
- -tp=... (or --tensor-parallel-size=...)
- -dp=... (or --data-parallel-size=...)
The benchmark command will be the same every time: 10000 prompts and 1000 concurrent connections.
- option for a remote server: --host 10.1.8.1
- option for the local server: --host localhost
Example benchmark command with output:
source venv-vllm/bin/activate
vllm bench serve --endpoint-type vllm --host 10.1.8.1 --port 8001 --model Qwen/Qwen3-Coder-30B-A3B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
. . .
. . .
============ Serving Benchmark Result ============
Successful requests: 10000
Maximum request concurrency: 1000
Benchmark duration (s): 86.26
Total input tokens: 1276722
Total generated tokens: 1265234
Request throughput (req/s): 115.93
Output token throughput (tok/s): 14667.45
Total Token throughput (tok/s): 29468.07
---------------Time to First Token----------------
Mean TTFT (ms): 1126.10
Median TTFT (ms): 1258.56
P99 TTFT (ms): 1970.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 55.98
Median TPOT (ms): 54.04
P99 TPOT (ms): 74.32
---------------Inter-token Latency----------------
Mean ITL (ms): 55.85
Median ITL (ms): 48.65
P99 ITL (ms): 160.58
==================================================
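To keep results from many tp/dp combinations, one simple approach is to save the raw output of each run and pull out the headline metrics afterwards (the log file name here is just an example):
vllm bench serve --endpoint-type vllm --host 10.1.8.1 --port 8001 --model Qwen/Qwen3-Coder-30B-A3B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7 | tee bench-tp2-dp1.log
grep -E "Total Token throughput|Mean TTFT|Mean ITL" bench-tp2-dp1.log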
Commands to start inference for the smaller LLMs (a combined serve-and-benchmark script is sketched after the benchmark commands below):
TORCH_CUDA_ARCH_LIST=8.9 vllm serve Qwen/Qwen2.5-Coder-7B-Instruct -tp=1 --port 8001 --max-model-len 8192
TORCH_CUDA_ARCH_LIST=8.9 vllm serve meta-llama/Llama-3.1-8B-Instruct -tp=1 --port 8001 --max-model-len 8192
TORCH_CUDA_ARCH_LIST=8.9 vllm serve mistralai/Mistral-7B-Instruct-v0.3 -tp=1 --port 8001 --max-model-len 8192
TORCH_CUDA_ARCH_LIST=8.9 vllm serve google/gemma-3-4b-it -tp=1 --port 8001 --max-model-len 8192
Benchmark commands:
vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model Qwen/Qwen2.5-Coder-7B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model mistralai/Mistral-7B-Instruct-v0.3 --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model google/gemma-3-4b-it --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
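The four serve/benchmark pairs above can also be scripted. Below is a minimal sketch, assuming the virtual environment is already activated; it uses the vLLM server's /health endpoint to wait for startup, and the log file names and the 8.9 arch value (RTX 4090) are just examples:
export TORCH_CUDA_ARCH_LIST=8.9   # RTX 4090; adjust per GPU
for MODEL in Qwen/Qwen2.5-Coder-7B-Instruct meta-llama/Llama-3.1-8B-Instruct mistralai/Mistral-7B-Instruct-v0.3 google/gemma-3-4b-it; do
  vllm serve "$MODEL" -tp=1 --port 8001 --max-model-len 8192 &
  SERVER_PID=$!
  until curl -sf http://127.0.0.1:8001/health > /dev/null; do sleep 5; done   # wait until the server is ready
  vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model "$MODEL" --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7 | tee "bench-$(basename "$MODEL").log"
  kill "$SERVER_PID"; wait "$SERVER_PID" 2>/dev/null
done
In practice each benchmark is repeated because of warm-up, as noted below.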
Benchmark results - smaller models
NOTE: I ran each benchmark 3 times because of warm-up; the first benchmark after a server restart is always the slowest.
Nvidia RTX 4090 24GB
| | gemma... | Llama... | Mistral... | Qwen... |
|---|---|---|---|---|
| Throughput (tok/s) | 9845 | 6739 | 7428 | 7746 |
| Time to First Token (ms) | 18086 | 27982 | 22992 | 20940 |
| Inter-token Latency (ms) | 52 | 64 | 72 | 74 |
Nvidia A100 80GB
| | gemma... | Llama... | Mistral... | Qwen... |
|---|---|---|---|---|
| Throughput (tok/s) | 12877 | 12716 | 12459 | 13027 |
| Time to First Token (ms) | 8850 | 9561 | 8894 | 8143 |
| Inter-token Latency (ms) | 37 | 37 | 39 | 40 |
Nvidia H100 80GB
| | gemma... | Llama... | Mistral... | Qwen... |
|---|---|---|---|---|
| Throughput (tok/s) | 22172 | 29371 | 31611 | 35643 |
| Time to First Token (ms) | 714 | 1478 | 1032 | 671 |
| Inter-token Latency (ms) | 84 | 54 | 53 | 52 |
Nvidia H200 141GB
| | gemma... | Llama... | Mistral... | Qwen... |
|---|---|---|---|---|
| Throughput (tok/s) | 25008 | 32188 | 34569 | 36991 |
| Time to First Token (ms) | 933 | 995 | 817 | 596 |
| Inter-token Latency (ms) | 69 | 51 | 50 | 51 |
Benchmark results - larger models
Results with Qwen3-Coder-30B-A3B-Instruct (11 combinations) on server with 8 x H100 GPUs
| | GPUs=1 tp=1 | GPUs=2 tp=2 | GPUs=4 tp=4 | GPUs=8 tp=8 |
|---|---|---|---|---|
| Throughput (tok/s) | 17349 | 26266 | 25804 | 29226 |
| Time to First Token (ms) | 3815 | 803 | 631 | 775 |
| Inter-token Latency (ms) | 83 | 68 | 71 | 60 |
| | GPUs=2 dp=1 | GPUs=2 dp=2 | GPUs=4 dp=4 | GPUs=8 dp=8 |
|---|---|---|---|---|
| Throughput (tok/s) | 17419 | 25890 | 32364 | ERR |
| Time to First Token (ms) | 3838 | 1110 | 954 | ERR |
| Inter-token Latency (ms) | 82 | 65 | 51 | ERR |
| | GPUs=4 tp=2 dp=2 | GPUs=8 tp=4 dp=2 | GPUs=8 tp=2 dp=4 |
|---|---|---|---|
| Throughput (tok/s) | 29468 | 37681 | 35429 |
| Time to First Token (ms) | 1126 | 974 | 1055 |
| Inter-token Latency (ms) | 55 | 42 | 44 |
NOTE: The vLLM server failed to start with tp=1 and dp=8: RuntimeError: CUDA out of memory. I tried decreasing --max-model-len to 8192, but that did not help. With dp=8 and tp=1 every GPU has to hold a full copy of the model weights (roughly 60 GB in bf16 for a 30B-parameter model), which leaves very little of the 80 GB for the KV cache and runtime overhead.
Results with Qwen3-Coder-30B-A3B-Instruct (11 combinations) on server with 8 x H200 GPUs
| | GPUs=1 tp=1 | GPUs=2 tp=2 | GPUs=4 tp=4 | GPUs=8 tp=8 |
|---|---|---|---|---|
| Throughput (tok/s) | 29218 | 29318 | 28316 | 29916 |
| Time to First Token (ms) | 655 | 539 | 471 | 467 |
| Inter-token Latency (ms) | 62 | 63 | 65 | 61 |
| | GPUs=2 dp=1 | GPUs=2 dp=2 | GPUs=4 dp=4 | GPUs=8 dp=8 |
|---|---|---|---|---|
| Throughput (tok/s) | 28945 | 26541 | 34865 | 27411 |
| Time to First Token (ms) | 614 | 664 | 640 | 692 |
| Inter-token Latency (ms) | 63 | 69 | 51 | 66 |
| | GPUs=4 tp=2 dp=2 | GPUs=8 tp=4 dp=2 | GPUs=8 tp=2 dp=4 |
|---|---|---|---|
| Throughput (tok/s) | 34635 | 32665 | 38830 |
| Time to First Token (ms) | 474 | 459 | 581 |
| Inter-token Latency (ms) | 52 | 56 | 45 |
Results with openai/gpt-oss-120b (4 combinations) on server with 8 x H100 GPUs
| | GPUs=2 tp=1 dp=1 | GPUs=2 tp=2 dp=1 | GPUs=4 tp=4 dp=1 | GPUs=8 tp=8 dp=1 |
|---|---|---|---|---|
| Throughput (tok/s) | 10356 | 19513 | 24574 | 29917 |
| Time to First Token (ms) | 13923 | 555 | 1134 | 1187 |
| Inter-token Latency (ms) | 74 | 106 | 76 | 60 |
NOTE: openai/gpt-oss-120b does not support data parallelism in vLLM (yet?)
Conclusion:
- Nvidia H200 is the fastest in my benchmarks. However, Qwen3-Coder-30B does not scale significantly across multiple H200 GPUs.
- With both Qwen3-Coder-30B-A3B-Instruct and gpt-oss-120b, it makes sense to start vLLM with tensor parallel 2: the model weights take less VRAM per GPU, leaving more VRAM for the key-value (KV) cache.
- Benchmark every LLM on the GPUs you plan to use; the same model performs differently on different hardware.