After ChatGPT mentioned my blog when I asked it for a performance comparison, I realized that this machine seems to be a rather unusual setup for running local LLMs. So I decided to run llama-bench with the models I currently have on it and post the results here.
The machine's specs:

- Intel® Core™ i7-6920HQ CPU @ 2.90 GHz
- 48 GB RAM @ 2400 MHz
- Nvidia Quadro M1200 dGPU with 4 GB VRAM
## Models that fit into the VRAM
`llama-bench -ngl 99 -fa 1 -m <model>`

Device 0: Quadro M1200, compute capability 5.0, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| granite-4.0-micro-Q6_K.gguf | |||||||
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | pp512 | 143.47 ± 0.60 |
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | tg128 | 8.03 ± 0.06 |
| LFM2-2.6B-Exp-Q6_K.gguf | |||||||
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | pp512 | 195.71 ± 0.25 |
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | tg128 | 10.54 ± 0.08 |
| LFM2.5-1.2B-Thinking-Q6_K.gguf | |||||||
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | pp512 | 461.74 ± 2.52 |
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | tg128 | 22.96 ± 0.17 |
| Qwen2.5-3B-Instruct-Q4_K_M.gguf | |||||||
| qwen2 3B Q4_K – Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 156.46 ± 0.64 |
| qwen2 3B Q4_K – Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 11.51 ± 0.07 |
| Qwen2.5-3B-Instruct-Q6_K_L.gguf | |||||||
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 165.89 ± 0.47 |
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 8.96 ± 0.02 |
| Qwen2.5-3B-Instruct-Q8_0.gguf | |||||||
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 158.52 ± 0.37 |
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 11.94 ± 0.02 |
| Qwen3-4B-Instruct-2507-Q6_K.gguf | |||||||
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 121.10 ± 0.35 |
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | tg128 | 7.11 ± 0.02 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | pp512 | 332.50 ± 1.28 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | tg128 | 21.12 ± 0.00 |
## Models that don't fit into the VRAM
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| glm-4.7-flash-claude-4.5-opus.q6_k.gguf | |||||||
| deepseek2 30B.A3B Q6_K | 22.92 GiB | 29.94 B | CUDA | 6 | 1 | pp512 | 51.14 ± 0.10 |
| deepseek2 30B.A3B Q6_K | 22.92 GiB | 29.94 B | CUDA | 6 | 1 | tg128 | 7.36 ± 0.04 |
| qwen2.5-32b-instruct-q8_0.gguf (do not try this, it takes ages!) | |||||||
| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 5 | 1 | pp512 | 14.21 ± 0.02 |
| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 5 | 1 | tg128 | 0.74 ± 0.00 |
| qwen2.5-14b-Q8_0.gguf | |||||||
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 10 | 1 | pp512 | 34.02 ± 0.16 |
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 10 | 1 | tg128 | 1.97 ± 0.00 |
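For these larger models, the `-ngl` value in the table is the number of layers offloaded to the GPU; the rest stay in system RAM. A sketch of the invocation (the layer count is taken from the ngl column above; pick it so the offloaded layers plus KV cache still fit into the 4 GB VRAM):

```shell
# Partial offload: only 10 of the model's layers go to the GPU,
# the remaining layers run on the CPU from system RAM.
llama-bench -ngl 10 -fa 1 -m qwen2.5-14b-Q8_0.gguf
```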
## Example via the llama-server web UI
Usually, I have two models loaded: one lives entirely on the GPU, and one entirely on the CPU. My currently loaded models are:
- CPU: qwen/Qwen2.5-7B-Instr-Q4_K_M.gguf
- GPU: granite/granite-4.0-micro-Q6_K.gguf
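A sketch of how such a split setup can be started (the port numbers are my own choice here, not from the original configuration):

```shell
# GPU instance: -ngl 99 offloads all layers to the Quadro M1200.
llama-server -m granite/granite-4.0-micro-Q6_K.gguf -ngl 99 --port 8080 &

# CPU instance: -ngl 0 keeps every layer in system RAM.
llama-server -m qwen/Qwen2.5-7B-Instr-Q4_K_M.gguf -ngl 0 --port 8081 &
```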
Question: Please explain “bit shifting” to me as if I were 5 years old.
The answer metrics:

| model | tokens | time | speed |
|---|---|---|---|
| **Fully loaded on GPU** | | | |
| LFM2.5-1.2B-Thinking-Q6_K.gguf | 1,275 | 59 s | 21.46 t/s |
| LFM2-2.6B-Exp-Q6_K.gguf | 318 | 30 s | 10.33 t/s |
| granite-4.0-micro-Q6_K.gguf | 258 | 33 s | 7.73 t/s |
| **Only loaded on CPU** | | | |
| Qwen2.5-7B-Instr-Q4_K_M.gguf | 283 | 47 s | 5.94 t/s |
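As a sanity check, the reported speed is roughly the generated token count divided by the wall-clock time; the small differences come from prompt processing and the one-second timer granularity. The numbers below are copied from the table:

```python
# Recompute tokens/second from the table above and compare with the
# speed reported by the llama-server web UI.
runs = [
    ("LFM2.5-1.2B-Thinking-Q6_K", 1275, 59, 21.46),
    ("LFM2-2.6B-Exp-Q6_K", 318, 30, 10.33),
    ("granite-4.0-micro-Q6_K", 258, 33, 7.73),
    ("Qwen2.5-7B-Instr-Q4_K_M", 283, 47, 5.94),
]
for name, tokens, seconds, reported in runs:
    print(f"{name}: {tokens / seconds:.2f} t/s (reported: {reported} t/s)")
```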