Such models are great, and the MoE architecture helps a lot if you don't have enough VRAM to fit the model onto your GPU entirely. But a few tweaks are still possible, so I played with the configuration a bit.
Not everything can be tested with llama-bench, especially settings like:
--no-mmap -> keeps all CPU-side experts in RAM instead of memory-mapping them from disk on demand. This may give you a small performance boost but will eat your RAM.
--mlock -> ensures that the kernel is not allowed to move the model into swap. In my opinion, if you use --no-mmap you should have enough RAM anyway; otherwise, let llama.cpp handle the memory.
--no-mmproj-offload -> keeps the mmproj file, used for image input, in RAM instead of VRAM. I do this constantly because I rarely use pictures and need the memory for context.
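Before combining --no-mmap with --mlock, it can help to check whether the kernel will even let you pin that much memory. A minimal Linux-only sketch (the 25 GiB model size is a placeholder; replace it with the size of your GGUF file):

```python
import resource

def can_mlock(model_bytes: int) -> bool:
    """Check whether the RLIMIT_MEMLOCK soft limit allows pinning model_bytes.

    Linux-only sketch; an unlimited limit is reported as RLIM_INFINITY.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY or soft >= model_bytes

def available_ram_bytes() -> int:
    """Read MemAvailable (reported in kB) from /proc/meminfo, return bytes."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemAvailable not found")

if __name__ == "__main__":
    size = 25 * 1024**3  # hypothetical Q5_K_S file size, adjust to yours
    print("mlock limit ok:", can_mlock(size))
    print("fits in free RAM:", available_ram_bytes() > size)
```

If `can_mlock` returns False, raise the limit (e.g. via ulimit -l or limits.conf) before starting the server, or the mlock call will fail.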
My final setup:
llama-server -m ~/models/cache/bartowski_Qwen_Qwen3.5-35B-A3B-GGUF_Qwen_Qwen3.5-35B-A3B-Q5_K_S.gguf --device CUDA0 --jinja --no-mmproj-offload --reasoning off --parallel 2 --n-cpu-moe 28 -sm layer -ngl 99 -fa 1 -ub 512 -b 512 -ctk q8_0 -ctv q8_0 --no-mmap --mmproj ~/models/cache/bartowski_Qwen_Qwen3.5-35B-A3B-GGUF_mmproj-Qwen_Qwen3.5-35B-A3B-f16.gguf --host 0.0.0.0 --port 8002
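With the server up, any OpenAI-compatible client can talk to it, since llama-server exposes /v1/chat/completions. A minimal stdlib-only sketch (host and port mirror the command above; adjust them to your setup):

```python
import json
import urllib.request

def build_chat_request(prompt: str, host: str = "localhost", port: int = 8002):
    """Build the URL and JSON payload for llama-server's chat endpoint."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return url, payload

def ask(prompt: str) -> str:
    """Send a single chat request and return the assistant's reply text."""
    url, payload = build_chat_request(prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With --parallel 2, two such requests can run concurrently; a third waits until a slot frees up.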
It gives me a shared context of roughly 65k tokens across 2 parallel connections. The speed usually drops by about half when both are in use, but that is still faster than having only one. My hermes-agent now runs a bit faster and more reliably. I get up to 28 tokens per second of text generation. This surprised me, as with other model providers I rarely got over 22, and now I am running Q5 quantization, which works much better than Q4.
The speed decreases with a bigger context, which is expected, but it still works well.
During implementation of a website, with about 45,632 tokens in context:
prompt eval time = 81623.01 ms / 6535 tokens ( 12.49 ms per token, 80.06 tokens per second)
eval time = 9194.38 ms / 99 tokens ( 92.87 ms per token, 10.77 tokens per second)
total time = 90817.39 ms / 6634 tokens
Single access via web with 16 tokens in context:
prompt eval time = 410.25 ms / 16 tokens ( 25.64 ms per token, 39.00 tokens per second)
eval time = 2506.55 ms / 64 tokens ( 39.16 ms per token, 25.53 tokens per second)
total time = 2916.80 ms / 80 tokens
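The tokens-per-second figures in those log lines follow directly from the raw millisecond and token counts, which makes it easy to recompute them when comparing runs. A small sketch that parses llama-server timing lines:

```python
import re

# Matches the "= <ms> ms / <n> tokens" part of a llama-server timing line,
# e.g. "prompt eval time = 81623.01 ms / 6535 tokens (...)"
TIMING = re.compile(r"=\s*([\d.]+) ms /\s*(\d+) tokens")

def tokens_per_second(line: str) -> float:
    """Recompute throughput from the ms and token counts in a timing line."""
    m = TIMING.search(line)
    if not m:
        raise ValueError(f"no timing found in: {line!r}")
    ms, tokens = float(m.group(1)), int(m.group(2))
    return tokens / ms * 1000.0
```

Running it on the first log line above (6535 tokens in 81623.01 ms) gives about 80.06 tokens per second, matching the server's own report.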