LLM inferencing benchmark with vLLM benchmark script: Tensor Parallel vs Data Parallel

Posted on Sat 16 August 2025 by Pavlo Khmel

vLLM comes with a built-in benchmark tool. I will compare LLM inference performance on different GPUs, and I will compare tensor parallelism against data parallelism to find the deployment method that gives better performance.

I will benchmark two larger models on multiple GPUs:

  • Qwen/Qwen3-Coder-30B-A3B-Instruct
  • openai/gpt-oss-120b

And smaller models on a single GPU (Nvidia RTX 4090, A100, H100, H200):

  • google/gemma-3-4b-it
  • meta-llama/Llama-3.1-8B-Instruct
  • mistralai/Mistral-7B-Instruct-v0.3
  • Qwen/Qwen2.5-Coder-7B-Instruct

Tensor Parallel vs Data Parallel

There are several deployment methods for LLM inference across multiple GPUs or multiple servers:

  • Tensor Parallelism
  • Data Parallelism
  • Pipeline Parallelism
  • ...

I will run inference on a single server and compare only Tensor Parallelism vs Data Parallelism.

I will also benchmark a mix of both methods, as shown in the sketch below:

  • Tensor Parallelism + Data Parallelism
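
As a rough sketch of how these methods map to vLLM options (using the -tp/-dp shortcuts that appear later in this post; <model> is a placeholder for one of the model names above), the number of GPUs one server instance occupies is the tensor-parallel size times the data-parallel size, with pipeline parallelism left at its default of 1:

# Tensor parallelism only: one model replica sharded across 2 GPUs
vllm serve <model> -tp=2 -dp=1

# Data parallelism only: 2 independent replicas, one GPU each
vllm serve <model> -tp=1 -dp=2

# Mixed: 2 replicas, each sharded across 2 GPUs, 4 GPUs in total
vllm serve <model> -tp=2 -dp=2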

Main vLLM server for benchmarks

  • OS: Rocky Linux 9.6
  • Hardware: Dell PowerEdge XE9680
  • Processors: 2 x Intel Xeon Platinum 8470
  • RAM: 2TB
  • GPUs: 8 x Nvidia H100 80GB
  • Server IP address: 10.1.x.x

TORCH_CUDA_ARCH_LIST

TORCH_CUDA_ARCH_LIST is an environment variable that should be set to match the compute capability of the Nvidia GPU before starting the vLLM server.

More information here: https://en.wikipedia.org/wiki/CUDA

In my benchmarks, I will use:

  • TORCH_CUDA_ARCH_LIST=8.0 # for A100
  • TORCH_CUDA_ARCH_LIST=8.9 # for RTX 4090
  • TORCH_CUDA_ARCH_LIST=9.0 # for H200, H100
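
To check which compute capability a GPU reports, nvidia-smi can query it directly (the compute_cap query field is available in recent drivers, including the 575.x driver shown below):

nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
# Example output on an H100 node: NVIDIA H100 80GB HBM3, 9.0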

Install vLLM server

I installed Python 3.12 because of flashinfer-python issues with the default Python 3.9 on Rocky Linux 9.6.

dnf -y install python3.12 python3.12-devel

Create a dedicated user to run vLLM:

useradd vllm

Switch to the vllm user and create a Python virtual environment for vLLM:

su - vllm
python3.12 -m venv venv-vllm
source venv-vllm/bin/activate

Find CUDA version:

nvidia-smi | grep CUDA

| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |

CUDA 12.9 is installed. The CUDA version determines the suffix of the extra index URL in the next command (...cu129).

pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129
pip install flashinfer-python
vllm --version
0.10.1.1
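
A quick sanity check, inside the venv-vllm environment, that the PyTorch build pulled in by vLLM matches the CUDA version and can see the GPUs:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.device_count())"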

NOTE 1: flashinfer-python is not a critical component, but it improves performance.

NOTE 2: The Gemma, Llama, and Mistral models require accepting a license on Hugging Face and using an access token.

Install Huggingface CLI

pip install -U "huggingface_hub[cli]"

Log in with a token (create one on https://huggingface.co ):

hf auth login
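
Optionally, model weights can be pre-downloaded so the first vllm serve start does not block on the download; by default the files end up in the Hugging Face cache under ~/.cache/huggingface:

hf download Qwen/Qwen3-Coder-30B-A3B-Instruct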

Example: manually start vLLM on two GPUs with tensor-parallel 2:

export TORCH_CUDA_ARCH_LIST=9.0
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct -tp=2 --port 8001 --max-model-len 65536 --enable-auto-tool-choice --tool-call-parser qwen3_coder
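
Once the server is up, the OpenAI-compatible API it exposes can be used for a quick smoke test:

curl http://localhost:8001/v1/models
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-Coder-30B-A3B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'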

I also want the inference server to start automatically on boot.

Create the systemd service file /etc/systemd/system/vllm-8001.service:

[Unit]
Description=vLLM Service 8001
After=network-online.target

[Service]
WorkingDirectory=/home/vllm/
ExecStart=/bin/sh -c 'cd /home/vllm && source venv-vllm/bin/activate && vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct -tp=2 -dp=1 --port 8001 --max-model-len 65536 --enable-auto-tool-choice --tool-call-parser qwen3_coder'
User=vllm
Group=vllm
Restart=always
RestartSec=3
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="VLLM_CONFIGURE_LOGGING=0"
Environment="TORCH_CUDA_ARCH_LIST=9.0"

[Install]
WantedBy=multi-user.target
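
Reload systemd and enable the service so it starts on boot:

systemctl daemon-reload
systemctl enable --now vllm-8001.service
journalctl -u vllm-8001 -f   # follow the logs until the server is ready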

Benchmark client

The benchmark can be run locally on the same server or from a remote server.

The remote server also runs Rocky Linux 9.6.

Install vLLM there in a Python virtual environment venv-vllm, the same way as on the vLLM server (sketch below).
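
For reference, a minimal sketch of the client setup, mirroring the server install above (it assumes the same CUDA 12.9 wheels; the client only needs the vllm package for the vllm bench subcommand):

dnf -y install python3.12 python3.12-devel
python3.12 -m venv venv-vllm
source venv-vllm/bin/activate
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129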

Benchmarking

I will restart the vLLM server with different combinations, modifying only these two options (see the restart sketch after this list):

  • -tp=... (or --tensor-parallel-size=...)
  • -dp=... (or --data-parallel-size=...)
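
A minimal restart cycle, assuming the vLLM server runs as the systemd unit shown earlier: edit the -tp/-dp values in ExecStart (and CUDA_VISIBLE_DEVICES, so enough GPUs are visible), then reload and restart:

systemctl daemon-reload
systemctl restart vllm-8001.service
journalctl -u vllm-8001 -f   # watch the logs until the server is ready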

The benchmark command stays the same: 10000 prompts and 1000 concurrent connections.

  • --host 10.1.8.1 when benchmarking from the remote server
  • --host localhost when benchmarking locally on the vLLM server

Example benchmark command with output:

source venv-vllm/bin/activate

vllm bench serve --endpoint-type vllm --host 10.1.8.1 --port 8001 --model Qwen/Qwen3-Coder-30B-A3B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7

. . .
. . .
============ Serving Benchmark Result ============
Successful requests:                     10000    
Maximum request concurrency:             1000     
Benchmark duration (s):                  86.26    
Total input tokens:                      1276722  
Total generated tokens:                  1265234  
Request throughput (req/s):              115.93   
Output token throughput (tok/s):         14667.45 
Total Token throughput (tok/s):          29468.07 
---------------Time to First Token----------------
Mean TTFT (ms):                          1126.10  
Median TTFT (ms):                        1258.56  
P99 TTFT (ms):                           1970.82  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          55.98    
Median TPOT (ms):                        54.04    
P99 TPOT (ms):                           74.32    
---------------Inter-token Latency----------------
Mean ITL (ms):                           55.85    
Median ITL (ms):                         48.65    
P99 ITL (ms):                            160.58   
==================================================

Commands to start inference for the smaller models:

TORCH_CUDA_ARCH_LIST=8.9 vllm serve Qwen/Qwen2.5-Coder-7B-Instruct     -tp=1 --port 8001 --max-model-len 8192
TORCH_CUDA_ARCH_LIST=8.9 vllm serve meta-llama/Llama-3.1-8B-Instruct   -tp=1 --port 8001 --max-model-len 8192
TORCH_CUDA_ARCH_LIST=8.9 vllm serve mistralai/Mistral-7B-Instruct-v0.3 -tp=1 --port 8001 --max-model-len 8192 
TORCH_CUDA_ARCH_LIST=8.9 vllm serve google/gemma-3-4b-it               -tp=1 --port 8001 --max-model-len 8192

Benchmark commands:

vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model Qwen/Qwen2.5-Coder-7B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model mistralai/Mistral-7B-Instruct-v0.3 --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 --model google/gemma-3-4b-it --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 10000 --max-concurrency 1000 --temperature 0.7
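
A hypothetical helper loop that runs one of the benchmarks three times in a row and keeps each log (the bench-run-*.log file names are arbitrary), since the first run after a server restart is a warm-up, as noted below:

for i in 1 2 3; do
  vllm bench serve --endpoint-type vllm --host 127.0.0.1 --port 8001 \
    --model Qwen/Qwen2.5-Coder-7B-Instruct --dataset-name random \
    --random-input-len 128 --random-output-len 128 \
    --num-prompts 10000 --max-concurrency 1000 --temperature 0.7 \
    | tee bench-run-$i.log
done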

Benchmark results - smaller models

NOTE: I ran each benchmark 3 times because of warm-up; the first run after a server restart is always the slowest.

Nvidia RTX 4090 24GB

                          gemma-3-4b-it   Llama-3.1-8B   Mistral-7B-v0.3   Qwen2.5-Coder-7B
Throughput (tok/s)        9845            6739           7428              7746
Time to First Token (ms)  18086           27982          22992             20940
Inter-token Latency (ms)  52              64             72                74

Nvidia A100 80GB

                          gemma-3-4b-it   Llama-3.1-8B   Mistral-7B-v0.3   Qwen2.5-Coder-7B
Throughput (tok/s)        12877           12716          12459             13027
Time to First Token (ms)  8850            9561           8894              8143
Inter-token Latency (ms)  37              37             39                40

Nvidia H100 80GB

                          gemma-3-4b-it   Llama-3.1-8B   Mistral-7B-v0.3   Qwen2.5-Coder-7B
Throughput (tok/s)        22172           29371          31611             35643
Time to First Token (ms)  714             1478           1032              671
Inter-token Latency (ms)  84              54             53                52

Nvidia H200 141GB

                          gemma-3-4b-it   Llama-3.1-8B   Mistral-7B-v0.3   Qwen2.5-Coder-7B
Throughput (tok/s)        25008           32188          34569             36991
Time to First Token (ms)  933             995            817               596
Inter-token Latency (ms)  69              51             50                51

Benchmark results - larger models

Results with Qwen/Qwen3-Coder-30B-A3B-Instruct (11 combinations) on a server with 8 x H100 GPUs

Tensor parallelism only:

                          GPUs=1 tp=1   GPUs=2 tp=2   GPUs=4 tp=4   GPUs=8 tp=8
Throughput (tok/s)        17349         26266         25804         29226
Time to First Token (ms)  3815          803           631           775
Inter-token Latency (ms)  83            68            71            60

Data parallelism only:

                          GPUs=2 dp=1   GPUs=2 dp=2   GPUs=4 dp=4   GPUs=8 dp=8
Throughput (tok/s)        17419         25890         32364         ERR
Time to First Token (ms)  3838          1110          954           ERR
Inter-token Latency (ms)  82            65            51            ERR

Tensor parallelism + data parallelism:

                          GPUs=4 tp=2 dp=2   GPUs=8 tp=4 dp=2   GPUs=8 tp=2 dp=4
Throughput (tok/s)        29468              37681              35429
Time to First Token (ms)  1126               974                1055
Inter-token Latency (ms)  55                 42                 44

NOTE: The vLLM server failed to start with tp=1 and dp=8: RuntimeError: CUDA out of memory. Decreasing --max-model-len to 8192 did not help.

Results with Qwen/Qwen3-Coder-30B-A3B-Instruct (11 combinations) on a server with 8 x H200 GPUs

Tensor parallelism only:

                          GPUs=1 tp=1   GPUs=2 tp=2   GPUs=4 tp=4   GPUs=8 tp=8
Throughput (tok/s)        29218         29318         28316         29916
Time to First Token (ms)  655           539           471           467
Inter-token Latency (ms)  62            63            65            61

Data parallelism only:

                          GPUs=2 dp=1   GPUs=2 dp=2   GPUs=4 dp=4   GPUs=8 dp=8
Throughput (tok/s)        28945         26541         34865         27411
Time to First Token (ms)  614           664           640           692
Inter-token Latency (ms)  63            69            51            66

Tensor parallelism + data parallelism:

                          GPUs=4 tp=2 dp=2   GPUs=8 tp=4 dp=2   GPUs=8 tp=2 dp=4
Throughput (tok/s)        34635              32665              38830
Time to First Token (ms)  474                459                581
Inter-token Latency (ms)  52                 56                 45

Results with openai/gpt-oss-120b (4 combinations) on a server with 8 x H100 GPUs

                          GPUs=2 tp=1 dp=1   GPUs=2 tp=2 dp=1   GPUs=4 tp=4 dp=1   GPUs=8 tp=8 dp=1
Throughput (tok/s)        10356              19513              24574              29917
Time to First Token (ms)  13923              555                1134               1187
Inter-token Latency (ms)  74                 106                76                 60

NOTE: openai/gpt-oss-120b does not support data parallelism in vLLM (yet?)

Conclusion:

  • Nvidia H200 is the fastest GPU in my benchmarks. However, Qwen3-Coder-30B does not scale significantly across multiple H200 GPUs.
  • For both Qwen3-Coder-30B-A3B-Instruct and gpt-oss-120b, it makes sense to start vLLM with tensor parallel 2: the model weights take less VRAM per GPU, which leaves more VRAM for the key-value (KV) cache.
  • Benchmark every LLM on the GPUs you plan to use; different models perform differently on different hardware.