LLM throughput benchmark on 13 GPUs and 10 models

I benchmarked models with llama.cpp (only one parallel request). Goal was to see how GPUs are performing with the same models. And as a side quest I wanted to find what models are comfortable to use on hardware with low VRAM size.

What is comfortable?

I think tok/s per second perfectly aligned with FPS (Frames Per Second) in gaming:

below 30 tok/s is too slow. You can read faster.
60 tok/s is OK for small things. But for competitive action shooter games is not great.
120 tok/s and more is an elite level. You forget what reality is.

This is a result table that shows tokens per second (tok/s):

MODEL ID	Size (GB)	M1 8core 16GB	M5 10core 32GB	P100 PCIe 16GB	GTX 1080 8GB	V100 PCIe 32GB	A100 PCIe 40GB	A100 SXM 80GB	RTX 4080 16GB	RTX 4090 24GB	H200 PCIe 141GB	H100 SXM 80GB	H200 SXM 141GB	B300 SXM 288GB
LFM2.5-1.2B	0.7	67	158	167	194	388	485	524	614	741	737	764	796	801
Ministral-3-3B	1.8	21	50	65	65	151	184	211	210	271	308	305	312	319
Qwen3.5-4B	2.7	16	37	49	48	113	145	154	164	206	226	234	241	251
gemma-4-E4B-it	5.0	17	37	44	45	96	119	125	145	177	183	190	195	199
Qwen3.5-9B	5.7	9	22	32	28	85	113	123	105	137	184	190	197	213
gpt-oss-20b	11.6	21	48	65	x	138	180	189	210	266	276	286	292	327
Qwen3.6-27B	16.8	x	7	x	x	31	41	46	x	47	71	74	76	82
gemma-4-26B-A4B-it	16.9	x	32	x	x	84	108	116	x	149	178	181	188	185
gemma-4-31B-it	18.3	x	6	x	x	28	37	43	x	35	65	68	71	75
Qwen3.6-35B-A3B	22.1	x	34	x	x	92	120	128	x	180	189	194	204	210

LLM models are from hugginface and to make comparison fair:

models are from unsloth because they provide most of the quantization variants.
models are Q4_K_M as most popular recommendation because of trade-off accuracy vs speed.
models selected without optimizations and preferences for hardware so no MTP, MLX, NVFP4.

Benchmark

Small Python script (less than 100 lines). It sends requests and shows tok/s. Download: https://github.com/pavlokhmel/llm_throughput_benchmark

Prompt in the benchmark command written to force LLM model to write a long output. In average I was getting between 4000 - 7000 tokens.

$ python3 llm_throughput_benchmark.py \
--api-url http://127.0.0.1:8000/v1/chat/completions \
--model unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M \
--prompt "Provide a comprehensive, highly verbose technical guide on LLMs covering history, transformer architecture, tokenization, pre-training, reinforcement learning from human feedback, limitations, and future trends. Expand each section with extreme detail." \
--requests 1 \
--parallel 1

Output example on Mac mini with M1 processor:

MODEL                                     TOK/S  TIME(S)  IN_TOK  OUT_TOK  REQUESTS  PARALLEL  SUCCESS
unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M  65     74       55      4898     1         1         1

These 2 options --parsable and --no-header will make oneline output with tab character as separator. Output example:

unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M 65 74 55 4898 1 1 1

NOTE: to start gpt-oss-20b on M1 with 16GB RAM I needed to use this command:

sudo sysctl iogpu.wired_limit_mb=14336

Because around 30% of unified memory is reserved on the macOS.

LLM inferencing

I selected llama.cpp as it can be started everywhere and it has useful command line options. I skipped vLLM, SGlang. They are faster is some cases but less support for different hardware and operation systems. And I also skipped LM Studio and Ollama as they use llama.cpp under the hood.

llama.cpp can be installed as pre-build software. Or it can be build from source code. This is example how to build on Linux with CUDA:

useradd llm
su - llm
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8
cd ..

This is example how to build on macOS:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
cd ..

Options:

--host 0.0.0.0 # listen on all IP addresses
--port 8000 # listen on port 8000
--no-mmap # forces to load the entire LLM model weights directly into physical memory
-ngl all # store all layers to store in VRAM
-fit off # do not adjust memory. Out of Memory (OOM) if model does not fit in to the VRAM.
-c 32768 # context size 32k tokens is more than enough for this task.
-hf hugginface_model_name:quantization

Commands to start inferencing:

./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Ministral-3-3B-Instruct-2512-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Qwen3.5-4B-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Qwen3.5-9B-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/gpt-oss-20b-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/gemma-4-31B-it-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

Benchmarks command example:

python3 llm_throughput_benchmark.py --api-url http://10.1.0.5:8000/v1/chat/completions --model unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M --prompt "Provide a comprehensive, highly verbose technical guide on LLMs covering history, transformer architecture, tokenization, pre-training, reinforcement learning from human feedback, limitations, and future trends. Expand each section with extreme detail." --requests 1 --parallel 1 --parsable --no-header

Example output on RTX4090

# RTX 4090
unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M 741 5 55 4141 1 1 1
unsloth/Ministral-3-3B-Instruct-2512-GGUF:Q4_K_M 271 14 580 4073 1 1 1
unsloth/Qwen3.5-4B-GGUF:Q4_K_M 206 36 53 7430 1 1 1
unsloth/gemma-4-E4B-it-GGUF:Q4_K_M 177 23 59 4188 1 1 1
unsloth/Qwen3.5-9B-GGUF:Q4_K_M 137 44 53 6034 1 1 1
unsloth/gpt-oss-20b-GGUF:Q4_K_M 266 20 110 5561 1 1 1
unsloth/Qwen3.6-27B-GGUF:Q4_K_M 47 166 53 7902 1 1 1
unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M 149 24 59 3613 1 1 1
unsloth/gemma-4-31B-it-GGUF:Q4_K_M 35 93 59 3261 1 1 1
unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M 180 22 53 4043 1 1 1

Operating systems

Benchmark on:

macOS Thahoe 26.5.1
Rocky Linux 9.7 with CUDA 12.9.2
Rocky Linux 10.1 with CUDA 13.3.0

Hardware

Hardware	Description
M1	Macmini9,1 - M1 - 8 cores (4 Performance and 4 Efficiency) - 16GB RAM
M5	MacBook Pro - Mac17,2 - 10 cores (4 Super and 6 Efficiency) - 32GB RAM
P100	NVIDIA Tesla P100 PCIe 16 GB
GTX1080	NVIDIA GeForce GTX 1080 8GB
V100 PCIe	NVIDIA Tesla V100 PCIe 32 GB
A100 PCIe	NVIDIA A100 PCIe 40 GB
A100 SXM	NVIDIA A100 SXM 80 GB
RTX4080	NVIDIA GeForce RTX 4080 16GB
RTX4090	NVIDIA GeForce RTX 4090 24GB
H200 PCIe	NVIDIA H200 NVL
H100 SMX	NVIDIA H100 SXM 80GB
H200 SMX	NVIDIA H200 SXM 141GB
B300 SMX	NVIDIA B300 SXM 288GB