Qwen3-Coder-Next-IQ4_XS

Posted Mar 31, 06:46 PM

Finding the sweet spot on my Precision 7520 with an eGPU (RTX 2060, 12 GB VRAM) connected via an M.2-to-OcuLink adapter.

I used llama.cpp and experimented with the -ncmoe flag (--n-cpu-moe, which keeps the MoE expert weights of the first N layers on the CPU) to find the best tokens-per-second rate. To measure speed this time, I simply used the web page provided by the server and asked for 20 prime numbers on each run. The goal was the best possible speed while still keeping a huge context window.
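The same measurement could also be scripted instead of read off the web page. The following is only a minimal sketch, not what was used for the numbers in this post. It assumes that llama-server is listening on 127.0.0.1:8080 and that its /completion endpoint returns a "timings" object containing "predicted_n" and "predicted_ms"; the function names and the default values are made up for illustration.

```python
import json
import urllib.request


def tokens_per_second(timings: dict) -> float:
    """Generation speed: tokens generated divided by seconds spent generating."""
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)


def measure(prompt: str,
            url: str = "http://127.0.0.1:8080/completion",  # assumed address
            n_predict: int = 256) -> float:
    """Send one prompt to the server and return the generation speed in tokens/s."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: the response carries a "timings" object with these two fields.
    return tokens_per_second(body["timings"])


# Example call (needs a running server):
# print(measure("List the first 20 prime numbers."))
```

Sending the same prompt for every -ncmoe value keeps the runs comparable, which is the same idea as asking for 20 primes each time.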

test	-ncmoe	tk/s	ctx
1	50	16.5	262144
2	40	19.1	147456
3	30-35	failed	failed
4	36	oom	4096
5	37	20.2	4096
6	38	19.3	40448
7	39	19.1	93952

Tests 3 and 4 failed: the runs with 30-35 failed, and 36 ran out of memory (OOM) even with a context of 4096.
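Picking the sweet spot from the table comes down to "fastest run that still offers at least the context I need". A small sketch of that selection, using only the successful runs from the table; the names Run, RUNS and best_run are hypothetical helpers, and the minimum context values in the loop are arbitrary examples:

```python
from typing import NamedTuple, Optional, Sequence


class Run(NamedTuple):
    ncmoe: int
    tokens_per_s: float
    ctx: int


# Successful runs from the table (tests 1, 2, 5, 6 and 7).
RUNS = [
    Run(50, 16.5, 262144),
    Run(40, 19.1, 147456),
    Run(37, 20.2, 4096),
    Run(38, 19.3, 40448),
    Run(39, 19.1, 93952),
]


def best_run(runs: Sequence[Run], min_ctx: int) -> Optional[Run]:
    """Return the fastest run with at least min_ctx context, or None if none fits.

    Ties in speed are broken in favour of the larger context.
    """
    candidates = [r for r in runs if r.ctx >= min_ctx]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.tokens_per_s, r.ctx))


for min_ctx in (4096, 32768, 131072, 262144):
    print(min_ctx, best_run(RUNS, min_ctx))
```

With these numbers, -ncmoe 40 wins as soon as more than about 94k tokens of context are required, while 37 is only the fastest if a 4096-token context is enough.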

llama-bench

llama-bench -m ~/models/cache/unsloth_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-IQ4_XS.gguf --device CUDA0 -ncmoe 40
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 15868 MiB):
Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes, VRAM: 11833 MiB
Device 1: Quadro M1200, compute capability 5.0, VMM: yes, VRAM: 4035 MiB

model	size	params	backend	ngl	dev	test	t/s
qwen3next 80B.A3B IQ4_XS - 4.25 bpw	39.74 GiB	79.67 B	CUDA	99	CUDA0	pp512	74.88 ± 0.82
qwen3next 80B.A3B IQ4_XS - 4.25 bpw	39.74 GiB	79.67 B	CUDA	99	CUDA0	tg128	20.02 ± 0.18

llama-bench -m ~/models/cache/unsloth_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-IQ4_XS.gguf --device CUDA0 -ncmoe 39
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 15868 MiB):
Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes, VRAM: 11833 MiB
Device 1: Quadro M1200, compute capability 5.0, VMM: yes, VRAM: 4035 MiB

model	size	params	backend	ngl	dev	test	t/s
qwen3next 80B.A3B IQ4_XS - 4.25 bpw	39.74 GiB	79.67 B	CUDA	99	CUDA0	pp512	76.09 ± 0.88
qwen3next 80B.A3B IQ4_XS - 4.25 bpw	39.74 GiB	79.67 B	CUDA	99	CUDA0	tg128	20.32 ± 0.17
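To judge whether the gap between -ncmoe 39 and -ncmoe 40 is more than measurement noise, the two llama-bench results can be compared against their spread. This is a rough sketch only: it assumes the ± values are standard deviations of independent measurements, and the helper name difference_in_sigmas is made up.

```python
import math


def difference_in_sigmas(mean_a: float, std_a: float,
                         mean_b: float, std_b: float) -> tuple:
    """Return (difference, combined std, difference / combined std)."""
    diff = mean_a - mean_b
    combined = math.sqrt(std_a ** 2 + std_b ** 2)
    return diff, combined, diff / combined


# tg128 (token generation): -ncmoe 39 vs -ncmoe 40
tg = difference_in_sigmas(20.32, 0.17, 20.02, 0.18)
# pp512 (prompt processing): -ncmoe 39 vs -ncmoe 40
pp = difference_in_sigmas(76.09, 0.88, 74.88, 0.82)

for name, (diff, combined, ratio) in (("tg128", tg), ("pp512", pp)):
    print(f"{name}: diff={diff:.2f} t/s, combined std={combined:.2f}, ratio={ratio:.2f}")
```

Both differences come out at roughly one combined standard deviation, so under these assumptions the speed gain from 40 to 39 is small compared with the loss in context window seen in the first table.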

Author Adrian Höhne
Categories AI, Try to fit to VRAM

Comi's Kaese

An open reminder for my thoughts
