Benchmarks on a Dell Precision 7520 with an Nvidia RTX 2060 (12 GB VRAM) and Gemma-4-26B-A4B Q5


In earlier tests I had already figured out that --n-cpu-moe 18 is a good setting for running 2 parallel endpoints with a context of about 54k tokens each. But prompt processing was still slow, so I wanted to squeeze out a bit more performance.
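For reference, the roughly 54k per endpoint is just the total context split across the parallel slots; a quick sanity check with the numbers from the server settings in the conclusion below:

# llama-server divides --ctx-size across the --parallel slots,
# so each of the 2 endpoints gets:
#   108000 / 2 = 54000 tokens of context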

First, check for the best values of -b (logical batch size) and -ub (physical batch size).

adrian@bigdelli:~$ CUDA_VISIBLE_DEVICES=0 ~/llama.cpp/build/bin/llama-bench -hf bartowski/google_gemma-4-26B-A4B-it-GGUF:Q5_K_S --n-cpu-moe 18 -fa 1 -n 128 -b 512,1024,2048 -ub 128,256,512,1024
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11833 MiB):
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes, VRAM: 11833 MiB
| model | size | params | backend | ngl | n_cpu_moe | n_batch | n_ubatch | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | --: | -: | ---: | ---: |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 128 | 1 | pp512 | 69.90 ± 4.47 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 128 | 1 | tg128 | 24.14 ± 0.15 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 256 | 1 | pp512 | 119.58 ± 2.77 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 256 | 1 | tg128 | 24.17 ± 0.21 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 512 | 1 | pp512 | 208.32 ± 5.94 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 512 | 1 | tg128 | 24.24 ± 0.25 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 1024 | 1 | pp512 | 204.08 ± 6.59 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 1024 | 1 | tg128 | 24.31 ± 0.30 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 128 | 1 | pp512 | 70.91 ± 2.03 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 128 | 1 | tg128 | 24.23 ± 0.28 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 256 | 1 | pp512 | 118.23 ± 4.00 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 256 | 1 | tg128 | 24.34 ± 0.24 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 512 | 1 | pp512 | 203.31 ± 3.21 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 512 | 1 | tg128 | 24.21 ± 0.15 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 1024 | 1 | pp512 | 205.94 ± 9.56 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 1024 | 1 | tg128 | 24.24 ± 0.12 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 128 | 1 | pp512 | 71.78 ± 2.65 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 128 | 1 | tg128 | 24.23 ± 0.26 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 256 | 1 | pp512 | 124.82 ± 4.47 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 256 | 1 | tg128 | 24.22 ± 0.20 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 512 | 1 | pp512 | 205.93 ± 6.00 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 512 | 1 | tg128 | 24.27 ± 0.41 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 1024 | 1 | pp512 | 204.90 ± 4.04 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 1024 | 1 | tg128 | 24.31 ± 0.27 |
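The pattern is clear: pp512 scales with -ub up to 512 and then plateaus, while tg128 stays around 24 t/s regardless, so -b 1024 -ub 1024 is a safe choice. If you want to confirm the winner at longer prompts, llama-bench also takes -p for the prompt length; a quick sketch (the -p values here are my own addition, not part of the run above):

CUDA_VISIBLE_DEVICES=0 ~/llama.cpp/build/bin/llama-bench \
  -hf bartowski/google_gemma-4-26B-A4B-it-GGUF:Q5_K_S \
  --n-cpu-moe 18 -fa 1 -n 128 -b 1024 -ub 1024 -p 512,2048,4096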

Next, recheck the speeds with different numbers of MoE layers kept on the CPU.

adrian@bigdelli:~$ CUDA_VISIBLE_DEVICES=0 ~/llama.cpp/build/bin/llama-bench -hf bartowski/google_gemma-4-26B-A4B-it-GGUF:Q5_K_S --n-cpu-moe 14,15,16,17,18,19,20,21 -fa 1,0 -n 128 -b 1024 -ub 1024
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11833 MiB):
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes, VRAM: 11833 MiB
| model | size | params | backend | ngl | n_cpu_moe | n_batch | n_ubatch | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | --: | -: | ---: | ---: |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 14 | 1024 | 1024 | 1 | pp512 | 223.06 ± 9.02 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 14 | 1024 | 1024 | 1 | tg128 | 29.18 ± 0.41 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 14 | 1024 | 1024 | 0 | pp512 | 222.99 ± 4.69 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 14 | 1024 | 1024 | 0 | tg128 | 28.79 ± 0.29 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 15 | 1024 | 1024 | 1 | pp512 | 224.38 ± 6.31 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 15 | 1024 | 1024 | 1 | tg128 | 27.55 ± 0.27 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 15 | 1024 | 1024 | 0 | pp512 | 219.00 ± 5.54 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 15 | 1024 | 1024 | 0 | tg128 | 27.57 ± 0.38 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 16 | 1024 | 1024 | 1 | pp512 | 212.68 ± 3.87 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 16 | 1024 | 1024 | 1 | tg128 | 26.23 ± 0.24 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 16 | 1024 | 1024 | 0 | pp512 | 209.31 ± 7.52 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 16 | 1024 | 1024 | 0 | tg128 | 26.15 ± 0.27 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 17 | 1024 | 1024 | 1 | pp512 | 208.96 ± 3.50 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 17 | 1024 | 1024 | 1 | tg128 | 25.07 ± 0.25 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 17 | 1024 | 1024 | 0 | pp512 | 209.88 ± 9.27 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 17 | 1024 | 1024 | 0 | tg128 | 24.79 ± 0.33 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 1024 | 1 | pp512 | 207.52 ± 6.04 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 1024 | 1 | tg128 | 24.18 ± 0.20 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 1024 | 0 | pp512 | 209.03 ± 9.32 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 1024 | 0 | tg128 | 23.76 ± 0.22 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 19 | 1024 | 1024 | 1 | pp512 | 199.83 ± 5.87 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 19 | 1024 | 1024 | 1 | tg128 | 23.45 ± 0.24 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 19 | 1024 | 1024 | 0 | pp512 | 197.79 ± 3.33 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 19 | 1024 | 1024 | 0 | tg128 | 22.94 ± 0.20 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 20 | 1024 | 1024 | 1 | pp512 | 190.75 ± 3.30 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 20 | 1024 | 1024 | 1 | tg128 | 22.70 ± 0.22 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 20 | 1024 | 1024 | 0 | pp512 | 194.97 ± 6.65 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 20 | 1024 | 1024 | 0 | tg128 | 22.22 ± 0.15 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 21 | 1024 | 1024 | 1 | pp512 | 188.35 ± 3.29 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 21 | 1024 | 1024 | 1 | tg128 | 21.98 ± 0.14 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 21 | 1024 | 1024 | 0 | pp512 | 189.21 ± 2.23 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 21 | 1024 | 1024 | 0 | tg128 | 21.71 ± 0.17 |
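Each step down in --n-cpu-moe buys a little speed but leaves less VRAM for the KV cache. If you want to watch the headroom while trying different values, a generic nvidia-smi query in a second terminal is enough (this is not part of the benchmark itself):

watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv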

Conclusion

After these benchmarks and some real-life tests, I now stick with the following settings:

llama-server -hf bartowski/google_gemma-4-26B-A4B-it-GGUF:Q5_K_S --device CUDA0 \
  --jinja --no-mmproj-offload --reasoning off \
  -fa 1 -b 1024 -ub 1024 -ctk q8_0 -ctv q8_0 --no-mmap \
  --host 0.0.0.0 --port 8002 -ngl 99 --n-cpu-moe 18 --parallel 2 --ctx-size 108000
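
For a quick smoke test of this configuration, llama-server exposes an OpenAI-compatible API on the configured port; a request like the following should get an answer (the prompt and max_tokens are just placeholders):

curl http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'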

Each additional MoE layer I move onto the GPU costs me context, and because I am using hermes-agent I want to keep the 2 parallel slots, since the agent sometimes spawns sub-agents for different tasks. I still see some ReadTimeouts during context compaction, but that is a separate problem.
