I asked different flavours of Gemma 4 to create an LLM benchmark in Python 100 times.
Here are the results:
| Model | Success | Quantization | Software | Hardware |
|---|---|---|---|---|
| google/gemma-4-31B-it | 81/100 | none | vLLM | Nvidia H100 |
| RedHatAI/gemma-4-31B-it-FP8-block | 92/100 | FP8 | vLLM | Nvidia H100 |
| nvidia/Gemma-4-31B-IT-NVFP4 | 81/100 | NVFP4 | vLLM | Nvidia H100 |
| gemma-4-31B-it-Q4_K_M.gguf | 77/100 | Q4_K_M | LM Studio | Nvidia RTX4090 |
- https://huggingface.co/google/gemma-4-31B-it
- https://huggingface.co/RedHatAI/gemma-4-31B-it-FP8-block
- https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4
- https://huggingface.co/lmstudio-community/gemma-4-31B-it-GGUF
Gemma 4 inference parameters:
- reasoning disabled
- temperature=1.0
- top_p=0.95
- top_k=64
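For reference, here is a minimal sketch of how these sampling settings could be put into a chat completions request body. It assumes the server accepts `top_k` as an extra field next to the standard `temperature` and `top_p` (vLLM's OpenAI-compatible server does; other servers may ignore or reject it). How reasoning was disabled is not stated above, so it is left out. `build_payload` is a hypothetical helper name.

```python
import json


def build_payload(model, system_prompt, user_prompt):
    """Build a chat completions request body with the sampling settings listed above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 1.0,
        "top_p": 0.95,
        # Assumption: top_k is not part of the OpenAI schema and is sent as an extra field.
        "top_k": 64,
    }


payload = build_payload(
    "google/gemma-4-31B-it",
    "You are senior Python developer.",
    "Create a Python 3 script ...",  # placeholder, not the full prompt
)
print(json.dumps(payload, indent=2))
```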
Next I ran the same prompt only 10 times (because it takes forever) on Qwen3.6, Kimi-K2.6, GLM-4.7 and Devstral-Small-2.
Results:
| Model ID | success |
|---|---|
| RedHatAI/gemma-4-31B-it-FP8-block | 9/10 |
| moonshotai/Kimi-K2.6 | 7/10 |
| Qwen/Qwen3.6-27B-FP8 | 4/10 |
| zai-org/GLM-4.7-FP8 | 3/10 |
| mistralai/Devstral-Small-2-24B-Instruct-2512 | 0/10 |
Devstral-Small-2 did not implement all the methods, but at least it produced scripts with valid Python code.
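With only 10 runs per model, the success rates above are noisy. As a rough sanity check, the sketch below computes a 95% Wilson score interval for each count in the table. The choice of the Wilson interval is mine, not part of the original test, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, runs, z=1.96):
    """Return the (low, high) Wilson score interval for a success rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Success counts copied from the 10-run table above.
results = {
    "RedHatAI/gemma-4-31B-it-FP8-block": (9, 10),
    "moonshotai/Kimi-K2.6": (7, 10),
    "Qwen/Qwen3.6-27B-FP8": (4, 10),
    "zai-org/GLM-4.7-FP8": (3, 10),
    "mistralai/Devstral-Small-2-24B-Instruct-2512": (0, 10),
}

for model, (ok, n) in results.items():
    low, high = wilson_interval(ok, n)
    print(f"{model}: {ok}/{n} -> {low:.0%} to {high:.0%}")
```

The intervals are wide at this sample size, so neighbouring rows in the table should not be read as a strict ranking.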
Maybe my prompt is bad? I asked Kimi-K2.6 to create a better prompt:
"Improve this prompt so even small LLM models will be able to create benchmark script. New prompt start with NEW_PROMPT_START and it ends with NEW_PROMPT_END"
I iterated several times and got a new prompt that works with Devstral:
for i in {0..9}; do echo -n "TEST: ${i} "; bash llm_one_shot_improved_prompt.sh ${i} https://192.168.0.4/v1/chat/completions mistralai/Devstral-Small-2-24B-Instruct-2512 Qwen/Qwen3.6-27B-FP8; done
TEST: 0 Qwen/Qwen3.6-27B-FP8,57.39,152.65,160.40,234.96,6.53
TEST: 1 Qwen/Qwen3.6-27B-FP8,56.79,137.40,158.60,253.51,5.19
TEST: 2 Qwen/Qwen3.6-27B-FP8,57.92,144.04,148.19,167.43,6.15
TEST: 3 Traceback (most recent call last): . . . ValueError: too many values to unpack (expected 3)
TEST: 5 Qwen/Qwen3.6-27B-FP8,57.72,150.36,151.53,158.17,6.22
TEST: 6 Qwen/Qwen3.6-27B-FP8,56.59,166.46,156.3863787291997,200.48,4.63
TEST: 7 Qwen/Qwen3.6-27B-FP8,57.50,143.13,158.13,162.10,5.17
TEST: 8 Qwen/Qwen3.6-27B-FP8,56.73,174.12,158.51,189.34,3.95
TEST: 9 Qwen/Qwen3.6-27B-FP8,57.94,157.60,160.89,176.14,7.27
9 out of 10 runs succeeded, which is a great improvement over 0 out of 10.
Details about the test:
In my prompt I ask for 3 methods of measuring LLM throughput in tokens per second (tok/s). I deliberately avoid the correct method, which is to use tokenizer.json from the model, count the tokens and divide by the time.
- method 1: Use streaming and count lines, assuming one line is one token. (This is not a very valid way.)
- method 2: Count all output characters and divide by 4 (average characters per token), then divide by time.
- method 3: Run without streaming and, if the response contains a token count, divide it by time. (This is the better method.)
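The three methods can be sketched as small Python functions. This is my reading of the list above, not the code any of the models generated. It assumes that a "line" in method 1 means one server-sent-events `data:` line of the stream, and that the token count in method 3 is the `usage.completion_tokens` field of a non-streaming response. All function names are hypothetical.

```python
def tok_per_sec_method_1(stream_lines, elapsed_s):
    """Method 1: count streamed 'data:' lines, treating each line as one token."""
    count = sum(
        1
        for line in stream_lines
        if line.startswith("data: ") and line.strip() != "data: [DONE]"
    )
    return count / elapsed_s


def tok_per_sec_method_2(output_text, elapsed_s, chars_per_token=4):
    """Method 2: estimate tokens as output characters divided by 4."""
    return (len(output_text) / chars_per_token) / elapsed_s


def tok_per_sec_method_3(response_json, elapsed_s):
    """Method 3: use the token count reported by the API, if there is one."""
    usage = response_json.get("usage") or {}
    tokens = usage.get("completion_tokens")
    if tokens is None:
        return None  # the server did not report a token count
    return tokens / elapsed_s


# Tiny made-up inputs, only to show the call shapes.
lines = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " there"}}]}',
    'data: {"choices": [{"delta": {"content": "!"}}]}',
    "data: [DONE]",
]
print(tok_per_sec_method_1(lines, elapsed_s=1.5))
print(tok_per_sec_method_2("a" * 400, elapsed_s=2.0))
print(tok_per_sec_method_3({"usage": {"completion_tokens": 300}}, elapsed_s=2.0))
```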
I ran the benchmark against Qwen/Qwen3.6-27B-FP8, which I had also benchmarked with vLLM's built-in benchmark, so I know what throughput to expect. The vLLM benchmark command:
$ vllm bench serve --host 192.167.1.80 --port 8000 --model Qwen/Qwen3.6-27B-FP8 --dataset-name random --random-input-len 1000 --random-output-len 1000 --num-prompts 10 --max-concurrency 1
============ Serving Benchmark Result ============
. . .
Output token throughput (tok/s): 152.66
. . .
==================================================
So expect roughly 150 tok/s.
This is the script that creates the benchmark and tests it:
#!/bin/bash
### How to run
# export LLM_API_KEY=sk-v...API...KEY...Q
### one run
# bash llm_one_prompt.sh TEST https://192.168.0.5/v1/chat/completions google/gemma-4-31B-it Qwen/Qwen3.6-27B-FP8
### multiple runs
# for i in {0..99}; do echo -n "TEST: ${i} "; bash llm_one_prompt.sh ${i} https://192.168.0.5/v1/chat/completions google/gemma-4-31B-it Qwen/Qwen3.6-27B-FP8; done
file_num=$1
api_url=$2
model=$3
benchmark_model=$4
curl -sS "${api_url}" -H "Authorization: Bearer ${LLM_API_KEY}" -H "Content-Type: application/json" -d '{
"model": "'"${model}"'",
"messages": [
{ "role": "system", "content": "You are senior Python developer." },
{"role": "user", "content": "Create a Python 3 script llm_benchmark.py to measure LLM throughput via OpenAI compatible API. Use only Python 3 Standard Library. Script options: --api-url, --api-key, --model, --prompt, --requests, --concurrency, --one-line, --verbose. Option --api-url is a full URL of the chat completions endpoint. Option --verbose shows the prompt and full output. Script outputs: token throughput (tok/s) and Time to First Token TTFT (ms), total run time. Script will not use tokenizer. Use streaming. Implement 3 methods to count tokens. Method A: In streaming count output tokens. Assume one line is one token. Method B: Divide total amount of output characters by 4. Then divide by time. Method C: Do not use streaming. And run benchmark second time and if API returns token count, then use it. Display results of 3 methods. Option --one-line prints result in only one line: model name, output tok/s method A, output tok/s method B, output tok/s method C, TTFT method A, total time method A. Count content and reasoning_content tokens together."}
]
}' | awk -F '```python' '{print $2}' | awk -F '```' '{print $1}' | sed 's/\\n/\n/g' | sed 's/\\"/"/g' > llm_benchmark_v${file_num}.py
python3 "llm_benchmark_v${file_num}.py" --api-url "${api_url}" --api-key "${LLM_API_KEY}" --model "${benchmark_model}" --prompt "What is LLM? Write at least 50 words." --requests 1 --concurrency 1 --one-line
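The awk/sed pipeline in the script works on the raw JSON text and only unescapes `\n` and `\"`. A possible alternative, sketched below, is to decode the response with Python's `json` module and then pull out the first fenced Python block. It assumes the usual chat completions response shape (`choices[0].message.content`). `extract_python` is a hypothetical helper, not part of the original script.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime so this listing stays fence-safe
CODE_BLOCK = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python(response_body):
    """Return the first fenced Python block from a chat completions response, or None."""
    data = json.loads(response_body)  # json.loads handles every JSON escape
    content = data["choices"][0]["message"]["content"] or ""
    match = CODE_BLOCK.search(content)
    return match.group(1) if match else None


# Made-up response whose code contains a backslash line continuation.
answer = "Here you go:\n" + FENCE + "python\nx = 1 + \\\n    2\nprint(x)\n" + FENCE
body = json.dumps({"choices": [{"message": {"content": answer}}]})

code = extract_python(body)
print(code)
```

A backslash in the generated code arrives as `\\` in the JSON text, and the sed commands do not turn it back into a single backslash. The doubled backslash in the TEST: 69 traceback in the example output looks consistent with that, so this failure may come from the extraction step and not from the model.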
Example output:
% for i in {0..99}; do echo -n "TEST: ${i} "; bash llm_one_shot_prompt.sh ${i} http://localhost:1234/v1/chat/completions google/gemma-4-31B-it Qwen/Qwen3.6-27B-FP8; done
TEST: 0 Qwen/Qwen3.6-27B-FP8, 58.50, 147.08, 160.52, 79.26, 4.77
TEST: 1 Qwen/Qwen3.6-27B-FP8, 59.12, 154.28, 166.70, 62.59, 5.29
TEST: 2 Qwen/Qwen3.6-27B-FP8, 58.76, 178.33, 159.84, 63.28, 3.91
TEST: 3 Qwen/Qwen3.6-27B-FP8, 58.59, 147.04, 160.96, 78.49, 5.22
. . .
TEST: 68 Qwen/Qwen3.6-27B-FP8, 56.66, 144.87, 97.47, 0.28, 6.37
TEST: 69 File "./llm_benchmark_v69.py", line 135
total_wall_time = sum(r[3] for r in results) / len(results) if args.concurrency == 1 else \\
^
SyntaxError: unexpected character after line continuation character
TEST: 70 Qwen/Qwen3.6-27B-FP8, 59.86, 158.70, 265632.36, 0.06, 5.45
TEST: 71 Qwen/Qwen3.6-27B-FP8, 59.48, 150.68, 147.24, 158.70, 6.48
Output columns: model name, output tok/s method A, output tok/s method B, output tok/s method C, TTFT method A, total time method A.
Inference commands:
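To compare many runs, each one-line result can be parsed into named fields and checked against the vLLM reference throughput. The sketch below is only an illustration. The names `OneLineResult`, `parse_one_line` and `relative_error` are hypothetical, and no pass/fail threshold is implied because the post does not define one.

```python
from typing import NamedTuple


class OneLineResult(NamedTuple):
    model: str
    tok_s_a: float       # output tok/s, method A
    tok_s_b: float       # output tok/s, method B
    tok_s_c: float       # output tok/s, method C
    ttft_a: float        # time to first token, method A
    total_time_a: float  # total run time, method A


def parse_one_line(line):
    """Parse one comma-separated --one-line result into named fields."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}")
    model, *numbers = parts
    return OneLineResult(model, *(float(n) for n in numbers))


def relative_error(measured, reference):
    """Relative deviation of a measured throughput from the reference value."""
    return abs(measured - reference) / reference


REFERENCE_TOK_S = 152.66  # from the vLLM benchmark output above

good = parse_one_line("Qwen/Qwen3.6-27B-FP8, 58.50, 147.08, 160.52, 79.26, 4.77")
odd = parse_one_line("Qwen/Qwen3.6-27B-FP8, 59.86, 158.70, 265632.36, 0.06, 5.45")

print(f"{relative_error(good.tok_s_c, REFERENCE_TOK_S):.1%}")
print(f"{relative_error(odd.tok_s_c, REFERENCE_TOK_S):.1%}")
```

The second line is the TEST: 70 result from the example output. The script ran, but its method C number is far from the reference, so a check like this catches generated scripts that run yet measure wrongly.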
vllm serve google/gemma-4-31B-it --served-model-name google/gemma-4-31B-it --tensor-parallel-size 1 --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 --api-key sk-IDUN-NTNU-LLM-API-KEY --port 8016 --gpu-memory-utilization 0.97 --no-enable-prefix-caching --max-model-len 54768
vllm serve RedHatAI/gemma-4-31B-it-FP8-block --served-model-name google/gemma-4-31B-it --tensor-parallel-size 1 --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 --api-key sk-IDUN-NTNU-LLM-API-KEY --port 8016 --gpu-memory-utilization 0.90 --no-enable-prefix-caching --kv-cache-dtype fp8 --speculative-config '{"model": "gg-hf-am/gemma-4-31B-it-assistant", "num_speculative_tokens": 4}'
vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --served-model-name google/gemma-4-31B-it --tensor-parallel-size 1 --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 --api-key sk-IDUN-NTNU-LLM-API-KEY --port 8016 --gpu-memory-utilization 0.97 --no-enable-prefix-caching --quantization modelopt
Conclusion:
- Gemma 4 was the best at writing this small Python tool. The RedHatAI/gemma-4-31B-it-FP8-block flavour of gemma-4-31B-it showed the best results: 92 of 100 runs produced a working Python script that does what was requested.
- Devstral-Small-2 went from a 0% to a 90% success rate just by having a larger model improve my prompt.
- Qwen3.6-27B-FP8 got good public reviews but was not great in this test with a small Python tool.