In my earlier tests I had already found that --n-cpu-moe 18 is a good setting for running 2 parallel endpoints with a context size of about 54k each. But prompt processing was still slow, so I wanted to try to squeeze out a bit more performance.
First, check the best values for -b and -ub.
adrian@bigdelli:~$ CUDA_VISIBLE_DEVICES=0 ~/llama.cpp/build/bin/llama-bench -hf bartowski/google_gemma-4-26B-A4B-it-GGUF:Q5_K_S --n-cpu-moe 18 -fa 1 -n 128 -b 512,1024,2048 -ub 128,256,512,1024
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11833 MiB):
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes, VRAM: 11833 MiB

| model | size | params | backend | ngl | n_cpu_moe | n_batch | n_ubatch | fa | test | t/s |
| ---------------------- | --------: | ------: | ------- | --: | --------: | ------: | -------: | -: | ----: | ------------: |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 128 | 1 | pp512 | 69.90 ± 4.47 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 128 | 1 | tg128 | 24.14 ± 0.15 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 256 | 1 | pp512 | 119.58 ± 2.77 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 256 | 1 | tg128 | 24.17 ± 0.21 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 512 | 1 | pp512 | 208.32 ± 5.94 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 512 | 1 | tg128 | 24.24 ± 0.25 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 1024 | 1 | pp512 | 204.08 ± 6.59 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 512 | 1024 | 1 | tg128 | 24.31 ± 0.30 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 128 | 1 | pp512 | 70.91 ± 2.03 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 128 | 1 | tg128 | 24.23 ± 0.28 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 256 | 1 | pp512 | 118.23 ± 4.00 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 256 | 1 | tg128 | 24.34 ± 0.24 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 512 | 1 | pp512 | 203.31 ± 3.21 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 512 | 1 | tg128 | 24.21 ± 0.15 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 1024 | 1 | pp512 | 205.94 ± 9.56 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 1024 | 1 | tg128 | 24.24 ± 0.12 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 128 | 1 | pp512 | 71.78 ± 2.65 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 128 | 1 | tg128 | 24.23 ± 0.26 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 256 | 1 | pp512 | 124.82 ± 4.47 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 256 | 1 | tg128 | 24.22 ± 0.20 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 512 | 1 | pp512 | 205.93 ± 6.00 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 512 | 1 | tg128 | 24.27 ± 0.41 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 1024 | 1 | pp512 | 204.90 ± 4.04 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 2048 | 1024 | 1 | tg128 | 24.31 ± 0.27 |
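With this many rows it helps to filter the raw output. A minimal sketch, assuming the bench run was redirected to a file (bench-batch.md is my name for it, not from the run above):

```bash
# Keep only the prompt-processing rows and sort them by throughput (last column)
grep 'pp512' bench-batch.md | sort -t'|' -k12 -n
```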
Prompt processing plateaus around 205 t/s once -ub reaches 512, while -b and the generation speed barely change. Next, recheck the speeds with different numbers of MoE layers on the CPU.
adrian@bigdelli:~$ CUDA_VISIBLE_DEVICES=0 ~/llama.cpp/build/bin/llama-bench -hf bartowski/google_gemma-4-26B-A4B-it-GGUF:Q5_K_S --n-cpu-moe 14,15,16,17,18,19,20,21 -fa 1,0 -n 128 -b 1024 -ub 1024
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11833 MiB):
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes, VRAM: 11833 MiB

| model | size | params | backend | ngl | n_cpu_moe | n_batch | n_ubatch | fa | test | t/s |
| ---------------------- | --------: | ------: | ------- | --: | --------: | ------: | -------: | -: | ----: | ------------: |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 14 | 1024 | 1024 | 1 | pp512 | 223.06 ± 9.02 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 14 | 1024 | 1024 | 1 | tg128 | 29.18 ± 0.41 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 14 | 1024 | 1024 | 0 | pp512 | 222.99 ± 4.69 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 14 | 1024 | 1024 | 0 | tg128 | 28.79 ± 0.29 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 15 | 1024 | 1024 | 1 | pp512 | 224.38 ± 6.31 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 15 | 1024 | 1024 | 1 | tg128 | 27.55 ± 0.27 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 15 | 1024 | 1024 | 0 | pp512 | 219.00 ± 5.54 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 15 | 1024 | 1024 | 0 | tg128 | 27.57 ± 0.38 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 16 | 1024 | 1024 | 1 | pp512 | 212.68 ± 3.87 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 16 | 1024 | 1024 | 1 | tg128 | 26.23 ± 0.24 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 16 | 1024 | 1024 | 0 | pp512 | 209.31 ± 7.52 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 16 | 1024 | 1024 | 0 | tg128 | 26.15 ± 0.27 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 17 | 1024 | 1024 | 1 | pp512 | 208.96 ± 3.50 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 17 | 1024 | 1024 | 1 | tg128 | 25.07 ± 0.25 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 17 | 1024 | 1024 | 0 | pp512 | 209.88 ± 9.27 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 17 | 1024 | 1024 | 0 | tg128 | 24.79 ± 0.33 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 1024 | 1 | pp512 | 207.52 ± 6.04 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 1024 | 1 | tg128 | 24.18 ± 0.20 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 1024 | 0 | pp512 | 209.03 ± 9.32 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 18 | 1024 | 1024 | 0 | tg128 | 23.76 ± 0.22 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 19 | 1024 | 1024 | 1 | pp512 | 199.83 ± 5.87 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 19 | 1024 | 1024 | 1 | tg128 | 23.45 ± 0.24 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 19 | 1024 | 1024 | 0 | pp512 | 197.79 ± 3.33 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 19 | 1024 | 1024 | 0 | tg128 | 22.94 ± 0.20 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 20 | 1024 | 1024 | 1 | pp512 | 190.75 ± 3.30 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 20 | 1024 | 1024 | 1 | tg128 | 22.70 ± 0.22 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 20 | 1024 | 1024 | 0 | pp512 | 194.97 ± 6.65 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 20 | 1024 | 1024 | 0 | tg128 | 22.22 ± 0.15 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 21 | 1024 | 1024 | 1 | pp512 | 188.35 ± 3.29 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 21 | 1024 | 1024 | 1 | tg128 | 21.98 ± 0.14 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 21 | 1024 | 1024 | 0 | pp512 | 189.21 ± 2.23 |
| gemma4 ?B Q5_K - Small | 16.87 GiB | 25.23 B | CUDA | 99 | 21 | 1024 | 1024 | 0 | tg128 | 21.71 ± 0.17 |
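To see the speed/offload trade-off at a glance, the relevant columns can be pulled out of the saved output. A small sketch, again assuming the table was saved to a file of my own naming (bench-moe.md):

```bash
# Print n_cpu_moe next to the tg128 throughput for the flash-attention runs (fa = 1)
awk -F'|' '$11 ~ /tg128/ && $10+0 == 1 { gsub(/ /, "", $7); print $7, $12 }' bench-moe.md
```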
Conclusion
After these benchmarks and some real-life testing, I now stick with the following settings:
llama-server -hf bartowski/google_gemma-4-26B-A4B-it-GGUF:Q5_K_S --device CUDA0 \
--jinja --no-mmproj-offload --reasoning off \
-fa 1 -b 1024 -ub 1024 -ctk q8_0 -ctv q8_0 --no-mmap \
--host 0.0.0.0 --port 8002 -ngl 99 --n-cpu-moe 18 --parallel 2 --ctx-size 108000
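To sanity-check the server once it is up, a single request against llama-server's OpenAI-compatible endpoint works; the prompt and token limit here are just placeholders:

```bash
# One test request; with --parallel 2, two of these can run concurrently
curl http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}'
```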
Every additional MoE layer kept on the GPU costs VRAM that I would rather spend on context, and llama-server splits --ctx-size across its slots, so 108000 tokens with --parallel 2 works out to the roughly 54k per endpoint mentioned at the start. Since I am using hermes-agent, I want to keep those 2 parallel connections: the agent sometimes spawns sub-agents for different tasks. I still see some ReadTimeouts during context compaction, but that is a separate problem.