Qwen3-Coder-Next-IQ4_XS


Finding the sweet spot on my Precision 7520 with an RTX 2060 eGPU (12 GB VRAM) attached via an M.2-to-OCuLink connection.

I used llama.cpp and varied the -ncmoe flag (short for --n-cpu-moe, which keeps the MoE expert weights of the first N layers on the CPU) to find my sweet spot in tokens per second. To measure speed, this time I simply used the web page provided by the server and asked for 20 prime numbers on each run. The goal was the best speed while still keeping a huge context window.
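
For reference, a minimal sketch of how such a sweep could be scripted. It only builds the llama-server command lines, it does not start anything. The model file name and the -ngl 99 value are assumptions of mine, not settings taken from the runs above; -m, -ngl, -ncmoe and -c are regular llama.cpp options.

```python
import shlex


def build_server_command(model_path, n_cpu_moe, ctx_size=None, n_gpu_layers=99):
    """Return the argv list for one llama-server run.

    n_gpu_layers=99 is an assumed "offload everything" value, not a
    setting confirmed by the original runs.
    """
    cmd = [
        "llama-server",
        "-m", model_path,
        "-ngl", str(n_gpu_layers),
        "-ncmoe", str(n_cpu_moe),  # expert weights of the first N layers stay on the CPU
    ]
    if ctx_size is not None:
        cmd += ["-c", str(ctx_size)]  # optional explicit context size
    return cmd


if __name__ == "__main__":
    # Hypothetical file name; replace with the real path to the GGUF file.
    model = "Qwen3-Coder-Next-IQ4_XS.gguf"
    for n in (50, 40, 39, 38, 37):
        print(shlex.join(build_server_command(model, n)))
```

Each printed line can be pasted into a shell one at a time, followed by the same prompt in the web page.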

test  ncmoe  tk/s    ctx
1     50     16.5    262144
2     40     19.1    147456
3     30-35  failed  failed
4     36     oom     4096
5     37     20.2    4096
6     38     19.3    40448
7     39     19.1    93952
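
To make the speed versus context trade-off explicit, here is a small sketch that picks the fastest successful run for a given minimum context size. The numbers are copied from the table; the helper name best_run and the tie-break rule (prefer the larger context at equal speed) are my own choices.

```python
# (test, ncmoe, tokens_per_second, context_size) for the runs that completed
RESULTS = [
    (1, 50, 16.5, 262144),
    (2, 40, 19.1, 147456),
    (5, 37, 20.2, 4096),
    (6, 38, 19.3, 40448),
    (7, 39, 19.1, 93952),
]


def best_run(results, min_ctx):
    """Return the fastest run with at least min_ctx tokens of context.

    Ties on speed are broken in favour of the larger context.
    Returns None when no run satisfies the context requirement.
    """
    candidates = [r for r in results if r[3] >= min_ctx]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r[2], r[3]))


if __name__ == "__main__":
    for min_ctx in (0, 32768, 90000):
        print(min_ctx, best_run(RESULTS, min_ctx))
```

With no context requirement the winner is test 5, with at least 32768 tokens it is test 6, and with at least 90000 tokens it is test 2.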

Test 1
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | – CUDA0 (RTX 2060) | 11833 = 5843 + ( 5757 = 1358 + 3339 + 1060) + 231 |
llama_memory_breakdown_print: | – Host | 40969 = 40449 + 0 + 520 |
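
The breakdown reads as total = free + (self = model + context + compute) + unaccounted, all in MiB. The sketch below parses such a line and checks that arithmetic. The function names are mine, and the 2 MiB tolerance is an assumption to absorb rounding in the printed values.

```python
import re


def parse_breakdown(line):
    """Parse one llama_memory_breakdown_print data row into a dict of MiB values."""
    fields = line.split("|")
    device = fields[1].strip().lstrip("-– ").strip()
    numbers = [int(n) for n in re.findall(r"\d+", fields[2])]
    if len(numbers) == 7:  # GPU row: total = free + (self = ...) + unaccounted
        keys = ("total", "free", "self", "model", "context", "compute", "unaccounted")
    elif len(numbers) == 4:  # host row: self = model + context + compute
        keys = ("self", "model", "context", "compute")
    else:
        raise ValueError(f"unexpected row layout: {line!r}")
    row = dict(zip(keys, numbers))
    row["device"] = device
    return row


def is_consistent(row, tolerance_mib=2):
    """Check that the printed sums add up, allowing for MiB rounding."""
    parts = row["model"] + row["context"] + row["compute"]
    ok = abs(row["self"] - parts) <= tolerance_mib
    if "total" in row:
        whole = row["free"] + row["self"] + row["unaccounted"]
        ok = ok and abs(row["total"] - whole) <= tolerance_mib
    return ok


if __name__ == "__main__":
    gpu = parse_breakdown(
        "llama_memory_breakdown_print: | – CUDA0 (RTX 2060) | "
        "11833 = 5843 + ( 5757 = 1358 + 3339 + 1060) + 231 |"
    )
    print(gpu, is_consistent(gpu))
```

In test 1 only 1358 MiB of model weights sit on the GPU while the context takes 3339 MiB, which is why so much VRAM is still free.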

Test 2
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | – CUDA0 (RTX 2060) | 11833 = 1191 + (10417 = 7886 + 1911 + 620) + 223 |
llama_memory_breakdown_print: | – Host | 34031 = 33735 + 0 + 296 |

Tests 3 and 4 failed

Test 5
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | – CUDA0 (RTX 2060) | 11833 = 801 + (10802 = 10334 + 126 + 342) + 228 |
llama_memory_breakdown_print: | – Host | 31229 = 31213 + 0 + 16 |

Test 6
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | – CUDA0 (RTX 2060) | 11833 = 1095 + (10508 = 9518 + 579 + 411) + 229 |
llama_memory_breakdown_print: | – Host | 32140 = 32053 + 0 + 87 |

Test 7
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | – CUDA0 (RTX 2060) | 11833 = 1147 + (10463 = 8702 + 1245 + 515) + 222 |
llama_memory_breakdown_print: | – Host | 33085 = 32894 + 0 + 191 |
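
Comparing the model column on CUDA0 across tests 5, 6, 7 and 2 shows how much VRAM one step of -ncmoe is worth. The sketch below computes that step from the logged values and extrapolates to -ncmoe 36. The extrapolation assumes that context, compute and unaccounted memory stay as in test 5, which is my assumption and not something the logs confirm.

```python
# GPU "model" MiB per -ncmoe value, copied from the CUDA0 rows above
GPU_MODEL_MIB = {37: 10334, 38: 9518, 39: 8702, 40: 7886}
GPU_TOTAL_MIB = 11833


def per_layer_deltas(model_mib):
    """MiB of VRAM freed by each additional layer of experts kept on the CPU."""
    keys = sorted(model_mib)
    return [
        (lo, hi, (model_mib[lo] - model_mib[hi]) // (hi - lo))
        for lo, hi in zip(keys, keys[1:])
    ]


def predicted_model_mib(ncmoe, model_mib, step_mib):
    """Linear extrapolation from the lowest measured -ncmoe value."""
    base = min(model_mib)
    return model_mib[base] + (base - ncmoe) * step_mib


if __name__ == "__main__":
    deltas = per_layer_deltas(GPU_MODEL_MIB)
    print(deltas)
    step = deltas[0][2]
    model_36 = predicted_model_mib(36, GPU_MODEL_MIB, step)
    # Assumed unchanged from test 5: context 126, compute 342, unaccounted 228
    needed_36 = model_36 + 126 + 342 + 228
    print(model_36, needed_36, needed_36 > GPU_TOTAL_MIB)
```

Every step is 816 MiB, so -ncmoe 36 would need roughly 11846 MiB against the 11833 MiB the card reports, which fits the out-of-memory result in test 4. The gap between test 2 and test 1 (7886 - 1358 = 6528 MiB) equals 8 steps of 816 MiB rather than 10, which would be consistent with the model having 48 layers with experts, though that is only my reading of the numbers.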
