Benchmarks with a Dell Precision 7520 and an Nvidia Quadro M1200 with 4 GB VRAM


After ChatGPT mentioned my blog when I asked it for a performance comparison, I realized that this seems to be a fairly unusual setup for running local LLMs. So I decided to run llama-bench with the models I currently have on this machine and post the results here.

The machine’s specs:
- Intel® Core™ i7-6920HQ CPU at 2.90 GHz
- 48 GB RAM at 2400 MHz
- Nvidia Quadro M1200 dGPU with 4 GB VRAM

Models that fit into VRAM

llama-bench -ngl 99 -fa 1 -m <model>

Device 0: Quadro M1200, compute capability 5.0, VMM: yes

| file | model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| granite-4.0-micro-Q6_K.gguf | granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | pp512 | 143.47 ± 0.60 |
| granite-4.0-micro-Q6_K.gguf | granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | tg128 | 8.03 ± 0.06 |
| LFM2-2.6B-Exp-Q6_K.gguf | lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | pp512 | 195.71 ± 0.25 |
| LFM2-2.6B-Exp-Q6_K.gguf | lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | tg128 | 10.54 ± 0.08 |
| LFM2.5-1.2B-Thinking-Q6_K.gguf | lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | pp512 | 461.74 ± 2.52 |
| LFM2.5-1.2B-Thinking-Q6_K.gguf | lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | tg128 | 22.96 ± 0.17 |
| Qwen2.5-3B-Instruct-Q4_K_M.gguf | qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 156.46 ± 0.64 |
| Qwen2.5-3B-Instruct-Q4_K_M.gguf | qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 11.51 ± 0.07 |
| Qwen2.5-3B-Instruct-Q6_K_L.gguf | qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 165.89 ± 0.47 |
| Qwen2.5-3B-Instruct-Q6_K_L.gguf | qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 8.96 ± 0.02 |
| Qwen2.5-3B-Instruct-Q8_0.gguf | qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 158.52 ± 0.37 |
| Qwen2.5-3B-Instruct-Q8_0.gguf | qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 11.94 ± 0.02 |
| Qwen3-4B-Instruct-2507-Q6_K.gguf | qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 121.10 ± 0.35 |
| Qwen3-4B-Instruct-2507-Q6_K.gguf | qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | tg128 | 7.11 ± 0.02 |
|  | qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | pp512 | 332.50 ± 1.28 |
|  | qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | tg128 | 21.12 ± 0.00 |

Models that don’t fit into VRAM

| file | model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| glm-4.7-flash-claude-4.5-opus.q6_k.gguf | deepseek2 30B.A3B Q6_K | 22.92 GiB | 29.94 B | CUDA | 6 | 1 | pp512 | 51.14 ± 0.10 |
| glm-4.7-flash-claude-4.5-opus.q6_k.gguf | deepseek2 30B.A3B Q6_K | 22.92 GiB | 29.94 B | CUDA | 6 | 1 | tg128 | 7.36 ± 0.04 |
| qwen2.5-32b-instruct-q8_0.gguf | qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 5 | 1 | pp512 | 14.21 ± 0.02 |
| qwen2.5-32b-instruct-q8_0.gguf | qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 5 | 1 | tg128 | 0.74 ± 0.00 |
| qwen2.5-14b-Q8_0.gguf | qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 10 | 1 | pp512 | 34.02 ± 0.16 |
| qwen2.5-14b-Q8_0.gguf | qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 10 | 1 | tg128 | 1.97 ± 0.00 |

Do not even try the 32B model — benchmarking it takes ages.
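The `-ngl` values for these oversized models can be ballparked rather than found by pure trial and error: offload roughly the fraction of layers that fits in the remaining VRAM budget. A minimal sketch of that arithmetic, where the 64-layer count for the 32B model and the 1 GiB of headroom for the KV cache and CUDA buffers are assumptions for illustration, not numbers from the benchmark:

```shell
# Ballpark: layers that fit ≈ n_layers * (usable VRAM / model size).
# n_layers = 64 for Qwen2.5-32B and 1 GiB headroom are assumptions.
fit=$(awk 'BEGIN {
  vram_gib  = 4.0    # Quadro M1200
  headroom  = 1.0    # KV cache + CUDA buffers (guess)
  model_gib = 32.42  # qwen2.5-32b q8_0
  n_layers  = 64
  print int(n_layers * (vram_gib - headroom) / model_gib)
}')
echo "$fit layers should fit"
```

Under these assumptions this prints 5 layers, which happens to match the `-ngl 5` used for the 32B run above.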

Example via the web UI that llama-server provides

Usually, I have two models loaded: one lives entirely on the GPU, and one lives entirely on the CPU. The currently loaded models are:
- CPU: qwen/Qwen2.5-7B-Instr-Q4_K_M.gguf
- GPU: granite/granite-4.0-micro-Q6_K.gguf
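One way such a two-model setup can be launched is with one llama-server instance per model on separate ports; this is a sketch, not the post's exact setup, and the port numbers are my assumption:

```shell
# Hypothetical launch script: one llama-server instance per model.
# -ngl 99 offloads all layers to the GPU; -ngl 0 keeps everything on the CPU.
llama-server -m granite/granite-4.0-micro-Q6_K.gguf -ngl 99 --port 8080 &
llama-server -m qwen/Qwen2.5-7B-Instr-Q4_K_M.gguf -ngl 0 --port 8081 &
wait
```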

Question: Please explain “bit shifting” to me as if I were 5 years old.

The metrics for the answers:

| model | placement | tokens | time | t/s |
| --- | --- | --- | --- | --- |
| LFM2.5-1.2B-Thinking-Q6_K.gguf | fully on GPU | 1,275 | 59 s | 21.46 |
| LFM2-2.6B-Exp-Q6_K.gguf | fully on GPU | 318 | 30 s | 10.33 |
| granite-4.0-micro-Q6_K.gguf | fully on GPU | 258 | 33 s | 7.73 |
| Qwen2.5-7B-Instr-Q4_K_M.gguf | CPU only | 283 | 47 s | 5.94 |
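As a quick sanity check on these numbers (my own arithmetic, not output from the UI), tokens divided by wall time should land near the reported t/s; any small gap is presumably prompt processing and other overhead:

```shell
# LFM2.5 run: 1,275 tokens in 59 s, vs. a reported 21.46 t/s.
tps=$(awk 'BEGIN { printf "%.2f", 1275 / 59 }')
echo "~$tps t/s from raw tokens/seconds"
```

This yields about 21.61 t/s, close enough to the 21.46 t/s the UI reports.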

Categories: Linux, AI