I benchmarked models with llama.cpp (only one parallel request). Goal was to see how GPUs are performing with the same models. And as a side quest I wanted to find what models are comfortable to use on hardware with low VRAM size.
What is comfortable?
I think tok/s per second perfectly aligned with FPS (Frames Per Second) in gaming:
- below 30 tok/s is too slow. You can read faster.
- 60 tok/s is OK for small things. But for competitive action shooter games is not great.
- 120 tok/s and more is an elite level. You forget what reality is.
This is a result table that shows tokens per second (tok/s):
| MODEL ID | Size (GB) |
M1 8core 16GB |
M5 10core 32GB |
P100 PCIe 16GB |
GTX 1080 8GB |
V100 PCIe 32GB |
A100 PCIe 40GB |
A100 SXM 80GB |
RTX 4080 16GB |
RTX 4090 24GB |
H200 PCIe 141GB |
H100 SXM 80GB |
H200 SXM 141GB |
B300 SXM 288GB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LFM2.5-1.2B | 0.7 | 67 | 158 | 167 | 194 | 388 | 485 | 524 | 614 | 741 | 737 | 764 | 796 | 801 |
| Ministral-3-3B | 1.8 | 21 | 50 | 65 | 65 | 151 | 184 | 211 | 210 | 271 | 308 | 305 | 312 | 319 |
| Qwen3.5-4B | 2.7 | 16 | 37 | 49 | 48 | 113 | 145 | 154 | 164 | 206 | 226 | 234 | 241 | 251 |
| gemma-4-E4B-it | 5.0 | 17 | 37 | 44 | 45 | 96 | 119 | 125 | 145 | 177 | 183 | 190 | 195 | 199 |
| Qwen3.5-9B | 5.7 | 9 | 22 | 32 | 28 | 85 | 113 | 123 | 105 | 137 | 184 | 190 | 197 | 213 |
| gpt-oss-20b | 11.6 | 21 | 48 | 65 | x | 138 | 180 | 189 | 210 | 266 | 276 | 286 | 292 | 327 |
| Qwen3.6-27B | 16.8 | x | 7 | x | x | 31 | 41 | 46 | x | 47 | 71 | 74 | 76 | 82 |
| gemma-4-26B-A4B-it | 16.9 | x | 32 | x | x | 84 | 108 | 116 | x | 149 | 178 | 181 | 188 | 185 |
| gemma-4-31B-it | 18.3 | x | 6 | x | x | 28 | 37 | 43 | x | 35 | 65 | 68 | 71 | 75 |
| Qwen3.6-35B-A3B | 22.1 | x | 34 | x | x | 92 | 120 | 128 | x | 180 | 189 | 194 | 204 | 210 |
LLM models are from hugginface and to make comparison fair:
- models are from unsloth because they provide most of the quantization variants.
- models are Q4_K_M as most popular recommendation because of trade-off accuracy vs speed.
- models selected without optimizations and preferences for hardware so no MTP, MLX, NVFP4.
Benchmark
Small Python script (less than 100 lines). It sends requests and shows tok/s. Download: https://github.com/pavlokhmel/llm_throughput_benchmark
Prompt in the benchmark command written to force LLM model to write a long output. In average I was getting between 4000 - 7000 tokens.
$ python3 llm_throughput_benchmark.py \
--api-url http://127.0.0.1:8000/v1/chat/completions \
--model unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M \
--prompt "Provide a comprehensive, highly verbose technical guide on LLMs covering history, transformer architecture, tokenization, pre-training, reinforcement learning from human feedback, limitations, and future trends. Expand each section with extreme detail." \
--requests 1 \
--parallel 1
Output example on Mac mini with M1 processor:
MODEL TOK/S TIME(S) IN_TOK OUT_TOK REQUESTS PARALLEL SUCCESS
unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M 65 74 55 4898 1 1 1
These 2 options --parsable and --no-header will make oneline output with tab character as separator. Output example:
unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M 65 74 55 4898 1 1 1
NOTE: to start gpt-oss-20b on M1 with 16GB RAM I needed to use this command:
sudo sysctl iogpu.wired_limit_mb=14336
Because around 30% of unified memory is reserved on the macOS.
LLM inferencing
I selected llama.cpp as it can be started everywhere and it has useful command line options. I skipped vLLM, SGlang. They are faster is some cases but less support for different hardware and operation systems. And I also skipped LM Studio and Ollama as they use llama.cpp under the hood.
llama.cpp can be installed as pre-build software. Or it can be build from source code. This is example how to build on Linux with CUDA:
useradd llm
su - llm
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8
export CUDA_VISIBLE_DEVICES=0
cd ..
This is example how to build on macOS:
brew install cmake
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
cd ..
Options:
- --host 0.0.0.0 # listen on all IP addreses
- --port 8000 # listen on port 8000
- --no-mmap # forces to load the entire LLM model weights directly into physical memory
- -ngl all # store all layers to store in VRAM
- -fit off # do not adjust memory. Out of Memory (OOM) if model does not fit in to the VRAM.
- -c 32768 # context size 32k tokens is more than enough for this task.
- -hf hugginface_model_name:quantization
Commands to start inferencing:
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Ministral-3-3B-Instruct-2512-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Qwen3.5-4B-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Qwen3.5-9B-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/gpt-oss-20b-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/gemma-4-31B-it-GGUF:Q4_K_M
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8000 --no-mmap -ngl all -fit off -c 32768 -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
Benchmarks command example:
python3 llm_throughput_benchmark.py --api-url http://10.1.0.5:8000/v1/chat/completions --model unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M --prompt "Provide a comprehensive, highly verbose technical guide on LLMs covering history, transformer architecture, tokenization, pre-training, reinforcement learning from human feedback, limitations, and future trends. Expand each section with extreme detail." --requests 1 --parallel 1 --parsable --no-header
Example output on RTX4090
# RTX 4090
unsloth/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M 741 5 55 4141 1 1 1
unsloth/Ministral-3-3B-Instruct-2512-GGUF:Q4_K_M 271 14 580 4073 1 1 1
unsloth/Qwen3.5-4B-GGUF:Q4_K_M 206 36 53 7430 1 1 1
unsloth/gemma-4-E4B-it-GGUF:Q4_K_M 177 23 59 4188 1 1 1
unsloth/Qwen3.5-9B-GGUF:Q4_K_M 137 44 53 6034 1 1 1
unsloth/gpt-oss-20b-GGUF:Q4_K_M 266 20 110 5561 1 1 1
unsloth/Qwen3.6-27B-GGUF:Q4_K_M 47 166 53 7902 1 1 1
unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M 149 24 59 3613 1 1 1
unsloth/gemma-4-31B-it-GGUF:Q4_K_M 35 93 59 3261 1 1 1
unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M 180 22 53 4043 1 1 1
Operation system
Benchmark on:
- macOS Thahoe 26.5.1
- Rocky Linux 9.7 with CUDA 12.9.2
- Rocky Linux 10.1 with CUDA 13.3.0
Hardware
| Hardware | Description |
|---|---|
| M1 | Macmini9,1 - M1 - 8 (4 Performance and 4 Efficiency) - 16GB RAM |
| M5 | MacBook Pro - Mac17,2 - 10 (4 Super and 6 Efficiency) - 32GB RAM |
| P100 | NVIDIA Tesla P100 PCIe 16 GB |
| GTX1080 | NVIDIA GeForce GTX 1080 8GB |
| V100 PCIe | NVIDIA Tesla V100 PCIe 32 GB |
| A100 PCIe | NVIDIA A100 PCIe 40 GB |
| A100 SXM | NVIDIA A100 SXM 80 GB |
| RTX4080 | NVIDIA GeForce RTX 4080 16GB |
| RTX4090 | NVIDIA GeForce RTX 4090 24GB |
| H200 PCIe | NVIDIA H200 NVL |
| H100 SMX | NVIDIA H100 SXM 80GB |
| H200 SMX | NVIDIA H200 SXM 141GB |
| B300 SMX | NVIDIA B300 SXM 288GB |