I gave Gemma 4 the same prompt 100 times and compared it with Qwen3.6, Kimi-K2.6, GLM-4.7, Devstral-Small-2

I asked different flavours of Gemma 4 to create an LLM benchmark in Python, 100 times each.

These are the results:

Model ID	success	quantization	software	hardware
google/gemma-4-31B-it	81/100		vLLM	Nvidia H100
RedHatAI/gemma-4-31B-it-FP8-block	92/100	FP8	vLLM	Nvidia H100
nvidia/Gemma-4-31B-IT-NVFP4	81/100	NVFP4	vLLM	Nvidia H100
gemma-4-31B-it-Q4_K_M.gguf	77/100	Q4_K_M	LM Studio	Nvidia RTX4090

https://huggingface.co/google/gemma-4-31B-it
https://huggingface.co/RedHatAI/gemma-4-31B-it-FP8-block
https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4
https://huggingface.co/lmstudio-community/gemma-4-31B-it-GGUF

Gemma 4 inference parameters:

reasoning disabled
temperature=1.0
top_p=0.95
top_k=64
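
The generation script further down does not send any sampling parameters in the request, so the values above were presumably configured elsewhere (for example as server-side defaults). A minimal sketch of pinning them per request instead, assuming an OpenAI-compatible endpoint; the function name `build_chat_request` is made up for illustration, and how reasoning was disabled is not shown because the post does not say:

```python
import json


def build_chat_request(model, prompt, temperature=1.0, top_p=0.95, top_k=64):
    """Build a chat completions request body with explicit sampling settings."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        # top_k is not part of the OpenAI API itself. vLLM accepts it as an
        # extra sampling parameter; other servers may ignore or reject it.
        "top_k": top_k,
    }


body = build_chat_request("google/gemma-4-31B-it", "What is LLM?")
print(json.dumps(body, indent=2))
```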

Then I ran the same prompt only 10 times per model (because it takes forever) on Qwen3.6, Kimi-K2.6, GLM-4.7 and Devstral-Small-2.

Result:

Model ID	success
RedHatAI/gemma-4-31B-it-FP8-block	9/10
moonshotai/Kimi-K2.6	7/10
Qwen/Qwen3.6-27B-FP8	4/10
zai-org/GLM-4.7-FP8	3/10
mistralai/Devstral-Small-2-24B-Instruct-2512	0/10

Devstral-Small-2 did not implement all the methods, but at least it produced scripts with valid Python code.

Maybe my prompt is bad? I asked Kimi-K2.6 to create a better prompt:

"Improve this prompt so even small LLM models will be able to create benchmark script. New prompt start with NEW_PROMPT_START and it endrs with NEW_PROMPT_END . . . "

I iterated several times and got a new prompt that works with Devstral:

for i in {0..9}; do echo -n "TEST: ${i} "; bash llm_one_shot_improved_prompt.sh ${i} https://192.168.0.4/v1/chat/completions mistralai/Devstral-Small-2-24B-Instruct-2512 Qwen/Qwen3.6-27B-FP8; done
TEST: 0 Qwen/Qwen3.6-27B-FP8,57.39,152.65,160.40,234.96,6.53
TEST: 1 Qwen/Qwen3.6-27B-FP8,56.79,137.40,158.60,253.51,5.19
TEST: 2 Qwen/Qwen3.6-27B-FP8,57.92,144.04,148.19,167.43,6.15
TEST: 3 Traceback (most recent call last): . . . ValueError: too many values to unpack (expected 3)
TEST: 5 Qwen/Qwen3.6-27B-FP8,57.72,150.36,151.53,158.17,6.22
TEST: 6 Qwen/Qwen3.6-27B-FP8,56.59,166.46,156.3863787291997,200.48,4.63
TEST: 7 Qwen/Qwen3.6-27B-FP8,57.50,143.13,158.13,162.10,5.17
TEST: 8 Qwen/Qwen3.6-27B-FP8,56.73,174.12,158.51,189.34,3.95
TEST: 9 Qwen/Qwen3.6-27B-FP8,57.94,157.60,160.89,176.14,7.27

9 out of 10: that is a great improvement.

Details about the test:

In my prompt I ask for 3 methods of measuring LLM throughput in tokens per second (tok/s). I deliberately do not use the correct method, which is to load tokenizer.json from the model, count the tokens and divide by time.

method 1: Use streaming and count lines (assume one line is one token; this is not a very valid way)
method 2: Count all characters and divide by 4 (average characters per token), then divide by time
method 3: Run without streaming and, if the API response contains a token count, divide it by time (this is the better method)
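
The arithmetic behind the three methods can be sketched as follows. This is only an illustration, not the code any of the models produced: the function names are hypothetical, and it assumes the stream arrives as server-sent-event lines of the form `data: {...}` whose `delta` carries `content` and/or `reasoning_content`.

```python
import json


def summarize_stream(sse_lines):
    """Return (chunk_count, char_count) for an iterable of SSE lines."""
    chunks = 0
    chars = 0
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta") or {}
        # Count content and reasoning_content together, as the prompt asks.
        text = (delta.get("content") or "") + (delta.get("reasoning_content") or "")
        if text:
            chunks += 1
            chars += len(text)
    return chunks, chars


def estimate_tok_per_s(chunks, chars, stream_seconds,
                       usage_tokens=None, plain_seconds=None):
    """Return (method 1, method 2, method 3) throughput estimates in tok/s."""
    method_1 = chunks / stream_seconds        # one streamed line = one token
    method_2 = (chars / 4) / stream_seconds   # about 4 characters per token
    method_3 = None                           # needs a server-reported count
    if usage_tokens is not None and plain_seconds:
        method_3 = usage_tokens / plain_seconds
    return method_1, method_2, method_3
```

Method 3 stays `None` when the server reports no token count, which makes it easy to tell "not available" apart from a measured value.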

I ran the benchmark against Qwen/Qwen3.6-27B-FP8, which I had also benchmarked with vLLM's built-in benchmark, so I know what throughput to expect. The vLLM benchmark command:

$ vllm bench serve --host 192.167.1.80 --port 8000 --model Qwen/Qwen3.6-27B-FP8 --dataset-name random --random-input-len 1000 --random-output-len 1000 --num-prompts 10 --max-concurrency 1

============ Serving Benchmark Result ============
. . .
Output token throughput (tok/s):         152.66     
. . .  
==================================================

So about 153 tok/s.

This is the script that creates the benchmark and tests it:

#!/bin/bash

### How to run
# export LLM_API_KEY=sk-v...API...KEY...Q
### one run
# bash llm_one_prompt.sh TEST https://192.168.0.5/v1/chat/completions google/gemma-4-31B-it Qwen/Qwen3.6-27B-FP8
### multiple
# for i in {0..99}; do echo -n "TEST: ${i} "; bash llm_one_prompt.sh ${i} https://192.168.0.5/v1/chat/completions google/gemma-4-31B-it Qwen/Qwen3.6-27B-FP8; done

file_num=$1
api_url=$2
model=$3
benchmark_model=$4

curl -sS ${api_url} -H "Authorization: Bearer ${LLM_API_KEY}" -H "Content-Type: application/json" -d '{
    "model": "'${model}'",
    "messages": [
    { "role": "system", "content": "You are senior Python developer." },
    {"role": "user", "content": "Create a Python 3 script llm_benchmark.py to measure LLM throughput via OpenAI compatible API. Use only Python 3 Standard Library. Script options: --api-url, --api-key, --model, --prompt, --requests, --concurrency, --one-line, --verbose. Option --api-url is a full URL of the chat completions endpoint. Option --verbose shows the prompt and full output. Script outputs: token throughput (tok/s) and Time to First Token TTFT (ms), total run time. Script will not use tokenizer. Use streaming. Implement 3 methods to count tokens. Method A: In streaming count output tokens. Assume one line is one token. Method B: Divide total amount of output characters by 4. Then divide by time. Method C: Do not use streaming. And run benchmark second time and if API returns token count, then use it. Display results of 3 methods. Option --one-line prints result in only one line: model name, output tok/s method A, output tok/s method B, output tok/s method C, TTFT method A, total time method A. Count content and reasoning_content tokens together."}
    ]
  }' | awk -F '```python' '{print $2}' | awk -F '```' '{print $1}' | sed 's/\\n/\n/g' | sed 's/\\"/"/g' > llm_benchmark_v${file_num}.py

python3 llm_benchmark_v${file_num}.py --api-url ${api_url} --api-key ${LLM_API_KEY} --model ${benchmark_model} --prompt "What is LLM? Write at least 50 words." --requests 1 --concurrency 1 --one-line
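
One caveat about the awk/sed pipeline above: it unescapes the JSON response by hand. When the generated code contains a backslash line continuation, JSON encodes it as `\\` followed by `\n`, and the sed step turns only the `\n` into a newline, leaving a doubled backslash at the end of the line. That looks consistent with the SyntaxError in the example output below, so some failures may come from the extraction step rather than from the model. A minimal sketch of a more robust extraction, assuming the standard chat completions response shape (`choices[0].message.content`); the function name `extract_python` is hypothetical:

```python
import json
import re

# Built from single backticks so this snippet can itself sit inside a fence.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"python[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python(response_text):
    """Return the first python code block of a chat completion, or None."""
    reply = json.loads(response_text)["choices"][0]["message"]["content"]
    match = CODE_BLOCK.search(reply)
    return match.group(1) if match else None
```

Because `json.loads` decodes every escape sequence correctly, backslashes, quotes and newlines inside the generated code come out exactly as the model wrote them. Wrapped in a small script that reads the curl output from stdin, this could replace the awk and sed stages.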

Example output:

% for i in {0..99}; do echo -n "TEST: ${i} "; bash llm_one_shot_prompt.sh ${i} http://localhost:1234/v1/chat/completions google/gemma-4-31B-it Qwen/Qwen3.6-27B-FP8; done
TEST: 0 Qwen/Qwen3.6-27B-FP8, 58.50, 147.08, 160.52, 79.26, 4.77
TEST: 1 Qwen/Qwen3.6-27B-FP8, 59.12, 154.28, 166.70, 62.59, 5.29
TEST: 2 Qwen/Qwen3.6-27B-FP8, 58.76, 178.33, 159.84, 63.28, 3.91
TEST: 3 Qwen/Qwen3.6-27B-FP8, 58.59, 147.04, 160.96, 78.49, 5.22
. . .
TEST: 68 Qwen/Qwen3.6-27B-FP8, 56.66, 144.87, 97.47, 0.28, 6.37
TEST: 69   File "./llm_benchmark_v69.py", line 135
    total_wall_time = sum(r[3] for r in results) / len(results) if args.concurrency == 1 else \\
                                                                                               ^
SyntaxError: unexpected character after line continuation character
TEST: 70 Qwen/Qwen3.6-27B-FP8, 59.86, 158.70, 265632.36, 0.06, 5.45
TEST: 71 Qwen/Qwen3.6-27B-FP8, 59.48, 150.68, 147.24, 158.70, 6.48

Output is: model name, output tok/s method A, output tok/s method B, output tok/s method C, TTFT method A, total time method A.
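
Tallying 100 runs by eye is tedious, and some lines that parse fine are clearly off (test 70 above reports 265632.36 tok/s for method C). A minimal sketch of an automatic check, assuming the comma-separated one-line format shown above. The plausibility rule is a made-up criterion for illustration and not how the success counts in this post were produced; the names `parse_one_line` and `is_plausible` are hypothetical:

```python
def parse_one_line(line):
    """Parse 'model, A, B, C, TTFT, total' into a dict, or None if malformed."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 6:
        return None  # tracebacks and partial output end up here
    try:
        values = [float(p) for p in parts[1:]]
    except ValueError:
        return None
    keys = ("tok_s_a", "tok_s_b", "tok_s_c", "ttft", "total_time")
    return {"model": parts[0], **dict(zip(keys, values))}


def is_plausible(result, reference_tok_s, tolerance=0.5):
    """Hypothetical rule: method C must lie within +/- tolerance of the reference."""
    if result is None:
        return False
    low = reference_tok_s * (1 - tolerance)
    high = reference_tok_s * (1 + tolerance)
    return low <= result["tok_s_c"] <= high


ok = parse_one_line("Qwen/Qwen3.6-27B-FP8, 58.50, 147.08, 160.52, 79.26, 4.77")
print(is_plausible(ok, reference_tok_s=152.66))
```

The reference value is the vLLM benchmark figure from earlier in the post; the 50% tolerance is arbitrary and only meant to catch results that are off by orders of magnitude.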

Inference commands:

vllm serve google/gemma-4-31B-it --served-model-name google/gemma-4-31B-it --tensor-parallel-size 1 --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 --api-key sk-IDUN-NTNU-LLM-API-KEY --port 8016 --gpu-memory-utilization 0.97 --no-enable-prefix-caching --max-model-len 54768

vllm serve RedHatAI/gemma-4-31B-it-FP8-block --served-model-name google/gemma-4-31B-it --tensor-parallel-size 1 --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 --api-key sk-IDUN-NTNU-LLM-API-KEY --port 8016 --gpu-memory-utilization 0.90 --no-enable-prefix-caching --kv-cache-dtype fp8 --speculative-config '{"model": "gg-hf-am/gemma-4-31B-it-assistant", "num_speculative_tokens": 4}'

vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --served-model-name google/gemma-4-31B-it --tensor-parallel-size 1 --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 --api-key sk-IDUN-NTNU-LLM-API-KEY --port 8016 --gpu-memory-utilization 0.97 --no-enable-prefix-caching --quantization modelopt

Conclusion:

Gemma 4 was the best for a small Python tool. The RedHatAI/gemma-4-31B-it-FP8-block flavour of gemma-4-31B-it showed the best results: 92 of 100 runs produced a Python script that does what was requested.
Devstral-Small-2 went from a 0% to a 90% success rate just by having a larger model improve my prompt.
Qwen3.6-27B-FP8 got good public reviews but did not do well in this test with a small Python tool.