vLLM comes with built-in benchmarks. I will compare LLM inference performance on different GPUs, and I will compare tensor parallelism against data parallelism to find a deployment method that gives better performance.
I will benchmark two larger models on multiple GPUs:
- Qwen/Qwen3-Coder-30B-A3B-Instruct
- openai/gpt-oss-120b
And smaller models on a single GPU (Nvidia RTX 4090, A100, H100, H200):
- google/gemma-3-4b-it
- meta-llama/Llama-3.1-8B-Instruct
- mistralai/Mistral-7B-Instruct-v0.3
- Qwen/Qwen2.5-Coder-7B-Instruct
Tensor Parallel vs Data Parallel
There are different deployment methods for LLM inference across multiple GPUs or multiple servers:
- Tensor Parallelism
- Data Parallelism
- Pipeline Parallelism
- ...
I will run inference on a single server, and I want to compare only Tensor Parallelism vs Data Parallelism. Tensor parallelism shards each model layer across GPUs; data parallelism runs a full copy of the model on each GPU (or group of GPUs) and splits requests between the copies.
I will also benchmark a mix of both methods:
- Tensor Parallelism + Data Parallelism
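For reference, the three setups differ only in the -tp/-dp values passed to vllm serve. A minimal sketch (the model name is a placeholder):
vllm serve <model> -tp=4          # tensor parallelism: one copy of the model, sharded across 4 GPUs
vllm serve <model> -dp=4          # data parallelism: 4 full replicas, requests split between them
vllm serve <model> -tp=2 -dp=2    # mixed: 2 replicas, each sharded across 2 GPUs (4 GPUs total)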
Main vLLM server for benchmarks
- OS: Rocky Linux 9.6
- Hardware: Dell PowerEdge XE9680
- Processors: 2 x Intel Xeon Platinum 8470
- RAM: 2TB
- GPUs: 8 x GPUs H100 80GB
- Server IP address: 10.1.x.x
TORCH_CUDA_ARCH_LIST
TORCH_CUDA_ARCH_LIST is an environment variable that should be set to the compute capability of the Nvidia GPU before starting the vLLM server.
More information here: https://en.wikipedia.org/wiki/CUDA
In my benchmarks, I will use:
- TORCH_CUDA_ARCH_LIST=8.0 # for A100
- TORCH_CUDA_ARCH_LIST=8.9 # for RTX 4090
- TORCH_CUDA_ARCH_LIST=9.0 # for H200, H100
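If you are not sure which value applies, the compute capability can be queried directly with nvidia-smi (the compute_cap query field is available in recent drivers); on an H100 it prints something like:
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
NVIDIA H100 80GB HBM3, 9.0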
Install vLLM server
I installed Python 3.12 because of flashinfer-python issues with default Python 3.9 on Rocky Linux 9.6.
dnf -y install python3.12 python3.12-devel
Create a dedicated user to run vLLM:
useradd vllm
Switch to that user and create a Python virtual environment for vLLM:
su - vllm
python3.12 -m venv venv-vllm
source venv-vllm/bin/activate
Find CUDA version:
nvidia-smi | grep CUDA
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
I have CUDA 12.9 installed. The CUDA version determines the cu129 suffix of the index URL in the next command.
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129
pip install flashinfer-python
vllm --version
0.10.1.1
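Optionally, a quick sanity check that PyTorch inside the virtual environment can see CUDA and the GPUs:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.device_count())"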
NOTE 1: flashinfer-python is not a critical component, but it improves performance.
NOTE 2: The Gemma, Llama, and Mistral models require accepting a license on Hugging Face and using an access token.
Install Huggingface CLI
pip install -U "huggingface_hub[cli]"
Log in with a token (create one at https://huggingface.co):
hf auth login
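Optionally, models can be downloaded ahead of time so the first vllm serve start does not wait for the download (hf download is part of the same CLI):
hf download Qwen/Qwen3-Coder-30B-A3B-Instruct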
Example: manually start vLLM on two GPUs with tensor-parallel 2:
export TORCH_CUDA_ARCH_LIST=9.0
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct -tp=2 --port 8001 --max-model-len 65536 --enable-auto-tool-choice --tool-call-parser qwen3_coder
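Once the server is up, it can be checked with plain OpenAI-compatible HTTP requests (port 8001 as above):
curl http://localhost:8001/v1/models
curl http://localhost:8001/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen3-Coder-30B-A3B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'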
I also want inference to start automatically on boot.
Create the systemd service file /etc/systemd/system/vllm-8001.service:
[Unit]
Description=vLLM Service 8001
After=network-online.target
[Service]
WorkingDirectory=/home/vllm/
ExecStart=/bin/sh -c 'cd /home/vllm && source venv-vllm/bin/activate && vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct -tp=2 -dp=1 --port 8001 --max-model-len 65536 --enable-auto-tool-choice --tool-call-parser qwen3_coder'
User=vllm
Group=vllm
Restart=always
RestartSec=3
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="VLLM_CONFIGURE_LOGGING=0"
Environment="TORCH_CUDA_ARCH_LIST=9.0"
[Install]
WantedBy=multi-user.target
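Then reload systemd, enable the service so it starts on boot, and follow the logs while the model loads:
systemctl daemon-reload
systemctl enable --now vllm-8001.service
journalctl -u vllm-8001.service -f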
Benchmark client
The benchmark can be started locally from the same server or from a remote server.
The remote server also runs Rocky Linux 9.6.
Install vLLM there in a Python virtual environment (venv-vllm), the same way as on the vLLM server.
Benchmarking
I will restart the vLLM server with different combinations, modifying only these 2 options:
- -tp=... (or --tensor-parallel-size=...)
- -dp=... (or --data-parallel-size=...)
The benchmark command will be the same every time: 10000 prompts and 1000 concurrent connections.
- option for a remote server: --host 10.1.8.1
- option for the local server: --host localhost
Example benchmark command with output:
source venv-vllm/bin/activate
vllm bench serve --endpoint-type vllm --host 10.1.8.1 --port 8001 --model Qwen/Qwen3-Coder-30B-A3B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
. . .
. . .
============ Serving Benchmark Result ============
Successful requests: 10000
Maximum request concurrency: 1000
Benchmark duration (s): 86.26
Total input tokens: 1276722
Total generated tokens: 1265234
Request throughput (req/s): 115.93
Output token throughput (tok/s): 14667.45
Total Token throughput (tok/s): 29468.07
---------------Time to First Token----------------
Mean TTFT (ms): 1126.10
Median TTFT (ms): 1258.56
P99 TTFT (ms): 1970.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 55.98
Median TPOT (ms): 54.04
P99 TPOT (ms): 74.32
---------------Inter-token Latency----------------
Mean ITL (ms): 55.85
Median ITL (ms): 48.65
P99 ITL (ms): 160.58
==================================================
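To keep results from many tp/dp combinations, one simple approach is to save the raw output of each run and pull out the headline metrics afterwards (the log file name here is just an example):
vllm bench serve --endpoint-type vllm --host 10.1.8.1 --port 8001 --model Qwen/Qwen3-Coder-30B-A3B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7 | tee bench-tp2-dp1.log
grep -E "Total Token throughput|Mean TTFT|Mean ITL" bench-tp2-dp1.log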
Commands to start inference for the smaller LLMs (a combined serve-and-benchmark script is sketched after the benchmark commands below):
TORCH_CUDA_ARCH_LIST=8.9 vllm serve Qwen/Qwen2.5-Coder-7B-Instruct -tp=1 --port 8001 --max-model-len 8192
TORCH_CUDA_ARCH_LIST=8.9 vllm serve meta-llama/Llama-3.1-8B-Instruct -tp=1 --port 8001 --max-model-len 8192
TORCH_CUDA_ARCH_LIST=8.9 vllm serve mistralai/Mistral-7B-Instruct-v0.3 -tp=1 --port 8001 --max-model-len 8192
TORCH_CUDA_ARCH_LIST=8.9 vllm serve google/gemma-3-4b-it -tp=1 --port 8001 --max-model-len 8192
Benchmark commands:
vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model Qwen/Qwen2.5-Coder-7B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model mistralai/Mistral-7B-Instruct-v0.3 --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model google/gemma-3-4b-it --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
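The four serve/benchmark pairs above can also be scripted. Below is a minimal sketch, assuming the virtual environment is already activated; it uses the vLLM server's /health endpoint to wait for startup, and the log file names and the 8.9 arch value (RTX 4090) are just examples:
export TORCH_CUDA_ARCH_LIST=8.9   # RTX 4090; adjust per GPU
for MODEL in Qwen/Qwen2.5-Coder-7B-Instruct meta-llama/Llama-3.1-8B-Instruct mistralai/Mistral-7B-Instruct-v0.3 google/gemma-3-4b-it; do
  vllm serve "$MODEL" -tp=1 --port 8001 --max-model-len 8192 &
  SERVER_PID=$!
  until curl -sf http://127.0.0.1:8001/health > /dev/null; do sleep 5; done   # wait until the server is ready
  vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model "$MODEL" --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7 | tee "bench-$(basename "$MODEL").log"
  kill "$SERVER_PID"; wait "$SERVER_PID" 2>/dev/null
done
In practice each benchmark is repeated because of warm-up, as noted below.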
Benchmark results - smaller models
NOTE: I ran each benchmark 3 times because of warm-up; the first benchmark after a server restart is always the slowest.
Nvidia RTX 4090 24GB
| | gemma... | Llama... | Mistral... | Qwen... |
|---|---|---|---|---|
| Throughput (tok/s) | 9845 | 6739 | 7428 | 7746 |
| Time to First Token (ms) | 18086 | 27982 | 22992 | 20940 |
| Inter-token Latency (ms) | 52 | 64 | 72 | 74 |
Nvidia A100 80GB
| | gemma... | Llama... | Mistral... | Qwen... |
|---|---|---|---|---|
| Throughput (tok/s) | 12877 | 12716 | 12459 | 13027 |
| Time to First Token (ms) | 8850 | 9561 | 8894 | 8143 |
| Inter-token Latency (ms) | 37 | 37 | 39 | 40 |
Nvidia H100 80GB
| | gemma... | Llama... | Mistral... | Qwen... |
|---|---|---|---|---|
| Throughput (tok/s) | 22172 | 29371 | 31611 | 35643 |
| Time to First Token (ms) | 714 | 1478 | 1032 | 671 |
| Inter-token Latency (ms) | 84 | 54 | 53 | 52 |
Nvidia H200 141GB
| | gemma... | Llama... | Mistral... | Qwen... |
|---|---|---|---|---|
| Throughput (tok/s) | 25008 | 32188 | 34569 | 36991 |
| Time to First Token (ms) | 933 | 995 | 817 | 596 |
| Inter-token Latency (ms) | 69 | 51 | 50 | 51 |
Benchmark results - larger models
Results with Qwen3-Coder-30B-A3B-Instruct (11 combinations) on server with 8 x H100 GPUs
| | GPUs=1 tp=1 | GPUs=2 tp=2 | GPUs=4 tp=4 | GPUs=8 tp=8 |
|---|---|---|---|---|
| Throughput (tok/s) | 17349 | 26266 | 25804 | 29226 |
| Time to First Token (ms) | 3815 | 803 | 631 | 775 |
| Inter-token Latency (ms) | 83 | 68 | 71 | 60 |
| | GPUs=2 dp=1 | GPUs=2 dp=2 | GPUs=4 dp=4 | GPUs=8 dp=8 |
|---|---|---|---|---|
| Throughput (tok/s) | 17419 | 25890 | 32364 | ERR |
| Time to First Token (ms) | 3838 | 1110 | 954 | ERR |
| Inter-token Latency (ms) | 82 | 65 | 51 | ERR |
| | GPUs=4 tp=2 dp=2 | GPUs=8 tp=4 dp=2 | GPUs=8 tp=2 dp=4 |
|---|---|---|---|
| Throughput (tok/s) | 29468 | 37681 | 35429 |
| Time to First Token (ms) | 1126 | 974 | 1055 |
| Inter-token Latency (ms) | 55 | 42 | 44 |
NOTE: The vLLM server failed to start with tp=1 and dp=8: RuntimeError: CUDA out of memory. I tried decreasing --max-model-len to 8192, but that did not help. With dp=8 and tp=1 every GPU has to hold a full copy of the model weights (roughly 60 GB in bf16 for a 30B-parameter model), which leaves very little of the 80 GB for the KV cache and runtime overhead.
Results with Qwen3-Coder-30B-A3B-Instruct (11 combinations) on server with 8 x H200 GPUs
| | GPUs=1 tp=1 | GPUs=2 tp=2 | GPUs=4 tp=4 | GPUs=8 tp=8 |
|---|---|---|---|---|
| Throughput (tok/s) | 29218 | 29318 | 28316 | 29916 |
| Time to First Token (ms) | 655 | 539 | 471 | 467 |
| Inter-token Latency (ms) | 62 | 63 | 65 | 61 |
| | GPUs=2 dp=1 | GPUs=2 dp=2 | GPUs=4 dp=4 | GPUs=8 dp=8 |
|---|---|---|---|---|
| Throughput (tok/s) | 28945 | 26541 | 34865 | 27411 |
| Time to First Token (ms) | 614 | 664 | 640 | 692 |
| Inter-token Latency (ms) | 63 | 69 | 51 | 66 |
| | GPUs=4 tp=2 dp=2 | GPUs=8 tp=4 dp=2 | GPUs=8 tp=2 dp=4 |
|---|---|---|---|
| Throughput (tok/s) | 34635 | 32665 | 38830 |
| Time to First Token (ms) | 474 | 459 | 581 |
| Inter-token Latency (ms) | 52 | 56 | 45 |
Results with openai/gpt-oss-120b (4 combinations) on server with 8 x H100 GPUs
| | GPUs=2 tp=1 dp=1 | GPUs=2 tp=2 dp=1 | GPUs=4 tp=4 dp=1 | GPUs=8 tp=8 dp=1 |
|---|---|---|---|---|
| Throughput (tok/s) | 10356 | 19513 | 24574 | 29917 |
| Time to First Token (ms) | 13923 | 555 | 1134 | 1187 |
| Inter-token Latency (ms) | 74 | 106 | 76 | 60 |
NOTE: openai/gpt-oss-120b does not support data parallelism in vLLM (yet?)
Conclusion:
- Nvidia H200 is the fastest in my benchmarks. However, Qwen3-Coder-30B does not scale significantly across multiple H200 GPUs.
- With both Qwen3-Coder-30B-A3B-Instruct and gpt-oss-120b, it makes sense to start vLLM with tensor parallel 2: the model weights take less VRAM per GPU, leaving more VRAM for the key-value (KV) cache.
- Benchmark every LLM on the GPUs you plan to use; the same model performs differently on different hardware.