LLM throughput benchmark on 13 GPUs and 10 models

Posted on Sun 07 June 2026 by Pavlo Khmel

I benchmarked models with llama.cpp (only one parallel request). Goal was to see how GPUs are performing with the same models. And as a side quest I wanted to find what models are comfortable to use on hardware with low VRAM size.

What is comfortable?

I think tok/s per second perfectly aligned with FPS (Frames Per Second) in gaming:

  • below 30 tok/s is too slow. You can read faster.
  • 60 tok/s is OK for small things. But for competitive action shooter games is not great.
  • 120 tok/s and more is an elite level. You forget what reality is.

This is a result table that shows tokens per second (tok/s):

MODEL ID Size
(GB)
M1
8core
16GB
M5
10core
32GB
P100
PCIe
16GB
GTX
1080
8GB
V100
PCIe
32GB
A100
PCIe
40GB
A100
SXM
80GB
RTX
4080
16GB
RTX
4090
24GB
H200
PCIe
141GB
H100
SXM
80GB
H200
SXM
141GB
B300
SXM
288GB
LFM2.5-1.2B 0.7 67 158 167 194 388 485 524 614 741 737 764 796 801
Ministral-3-3B 1.8 21 50 65 65 151 184 211 210 271 308 305 312 319
Qwen3.5-4B 2.7 16 37 49 48 113 145 154 164 206 226 234 241 251
gemma-4-E4B-it 5.0 17 37 44 45 96 119 125 145 177 183 190 195 199
Qwen3.5-9B 5.7 9 22 32 28 85 113 123 105 137 184 190 197 213
gpt-oss-20b 11.6 21 48 65 x 138 180 189 210 266 276 286 292 327
Qwen3.6-27B 16.8 x 7 x x 31 41 46 x 47 71 74 76 82
gemma-4-26B-A4B-it 16.9 x 32 x x 84 108 116 x 149 178 181 188 185
gemma-4-31B-it 18.3 x 6 x x 28 37 43 x 35 65 68 71 75
Qwen3.6-35B-A3B 22.1 x 34 x x 92 120 128 x 180 189 194 204 210

LLM models are from hugginface and to make comparison fair:

  • models are from unsloth because they provide most of the quantization variants.
  • models are Q4_K_M as most popular recommendation because of trade-off accuracy vs speed.
  • models selected without optimizations and preferences for hardware so no MTP, MLX, NVFP4.

Benchmark

Small Python script (less than 100 lines). It sends requests and shows tok/s. Download: https://github.com/pavlokhmel/llm_throughput_benchmark

Prompt in the benchmark command written to force LLM model to write a long output. In average I was getting between 4000 - 7000 tokens.

$ python3 llm_throughput_benchmark.py \
--api-url http://127.0.0.1:8000/v1/chat/completions \
--model unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M \
--prompt "Provide a comprehensive, highly verbose technical guide on LLMs covering history, transformer architecture, tokenization, pre-training, reinforcement learning from human feedback, limitations, and future trends. Expand each section with extreme detail." \
--requests 1 \
--parallel 1

Output example on Mac mini with M1 processor:

MODEL                                     TOK/S  TIME(S)  IN_TOK  OUT_TOK  REQUESTS  PARALLEL  SUCCESS
unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M  65     74       55      4898     1         1         1   

These 2 options --parsable and --no-header will make oneline output with tab character as separator. Output example:

unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M 65 74 55 4898 1 1 1

NOTE: to start gpt-oss-20b on M1 with 16GB RAM I needed to use this command:

sudo sysctl iogpu.wired_limit_mb=14336

Because around 30% of unified memory is reserved on the macOS.

LLM inferencing

I selected llama.cpp as it can be started everywhere and it has useful command line options. I skipped vLLM, SGlang. They are faster is some cases but less support for different hardware and operation systems. And I also skipped LM Studio and Ollama as they use llama.cpp under the hood.

llama.cpp can be installed as pre-build software. Or it can be build from source code. This is example how to build on Linux with CUDA:

useradd llm
su - llm
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8
export CUDA_VISIBLE_DEVICES=0
cd ..

This is example how to build on macOS:

brew install cmake
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
cd ..

Options:

  • --host 0.0.0.0 # listen on all IP addreses
  • --port 8000 # listen on port 8000
  • --no-mmap # forces to load the entire LLM model weights directly into physical memory
  • -ngl all # store all layers to store in VRAM
  • -fit off # do not adjust memory. Out of Memory (OOM) if model does not fit in to the VRAM.
  • -c 32768 # context size 32k tokens is more than enough for this task.
  • -hf hugginface_model_name:quantization

Commands to start inferencing:

./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Ministral-3-3B-Instruct-2512-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Qwen3.5-4B-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Qwen3.5-9B-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/gpt-oss-20b-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/gemma-4-31B-it-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

Benchmarks command example:

python3 llm_throughput_benchmark.py --api-url http://10.1.0.5:8000/v1/chat/completions --model unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M --prompt "Provide a comprehensive, highly verbose technical guide on LLMs covering history, transformer architecture, tokenization, pre-training, reinforcement learning from human feedback, limitations, and future trends. Expand each section with extreme detail." --requests 1 --parallel 1 --parsable --no-header

Example output on RTX4090

# RTX 4090
unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M 741 5 55 4141 1 1 1
unsloth/Ministral-3-3B-Instruct-2512-GGUF:Q4_K_M 271 14 580 4073 1 1 1
unsloth/Qwen3.5-4B-GGUF:Q4_K_M 206 36 53 7430 1 1 1
unsloth/gemma-4-E4B-it-GGUF:Q4_K_M 177 23 59 4188 1 1 1
unsloth/Qwen3.5-9B-GGUF:Q4_K_M 137 44 53 6034 1 1 1
unsloth/gpt-oss-20b-GGUF:Q4_K_M 266 20 110 5561 1 1 1
unsloth/Qwen3.6-27B-GGUF:Q4_K_M 47 166 53 7902 1 1 1
unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M 149 24 59 3613 1 1 1
unsloth/gemma-4-31B-it-GGUF:Q4_K_M 35 93 59 3261 1 1 1
unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M 180 22 53 4043 1 1 1

Operation system

Benchmark on:

  • macOS Thahoe 26.5.1
  • Rocky Linux 9.7 with CUDA 12.9.2
  • Rocky Linux 10.1 with CUDA 13.3.0

Hardware

Hardware Description
M1 Macmini9,1 - M1 - 8 (4 Performance and 4 Efficiency) - 16GB RAM
M5 MacBook Pro - Mac17,2 - 10 (4 Super and 6 Efficiency) - 32GB RAM
P100 NVIDIA Tesla P100 PCIe 16 GB
GTX1080 NVIDIA GeForce GTX 1080 8GB
V100 PCIe NVIDIA Tesla V100 PCIe 32 GB
A100 PCIe NVIDIA A100 PCIe 40 GB
A100 SXM NVIDIA A100 SXM 80 GB
RTX4080 NVIDIA GeForce RTX 4080 16GB
RTX4090 NVIDIA GeForce RTX 4090 24GB
H200 PCIe NVIDIA H200 NVL
H100 SMX NVIDIA H100 SXM 80GB
H200 SMX NVIDIA H200 SXM 141GB
B300 SMX NVIDIA B300 SXM 288GB