The machine’s specs:

- Intel® Core™ i7-6920HQ CPU at 2.90 GHz
- 48 GB RAM at 2400 MHz
- Nvidia GTX 970 eGPU with 4 GB VRAM
- eGPU connected via an OCuLink-to-M.2 adapter
Preparation
```shell
adrian@bigdelli:~$ llama-bench --list-devices
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce GTX 970, compute capability 5.2, VMM: yes
  Device 1: Quadro M1200, compute capability 5.0, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce GTX 970 (4030 MiB, 3966 MiB free)
  CUDA1: Quadro M1200 (4035 MiB, 4001 MiB free)
```
Models that fit into VRAM
Every model was benchmarked with the same command, only the model path varied:

```shell
llama-bench -ngl 99 -fa 1 -m models/granite/granite-4.0-micro-Q6_K.gguf --device CUDA0
```

All runs used Device 0: NVIDIA GeForce GTX 970, compute capability 5.2, VMM: yes.
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| **granite-4.0-micro-Q6_K.gguf** | | | | | | | |
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | pp512 | 389.80 ± 1.11 |
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | tg128 | 24.97 ± 0.04 |
| **LFM2-2.6B-Exp-Q6_K.gguf** | | | | | | | |
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | pp512 | 546.39 ± 1.52 |
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | tg128 | 32.72 ± 0.15 |
| **LFM2.5-1.2B-Thinking-Q6_K.gguf** | | | | | | | |
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | pp512 | 1281.22 ± 10.11 |
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | tg128 | 69.75 ± 0.35 |
| **Qwen2.5-3B-Instruct-Q4_K_M.gguf** | | | | | | | |
| qwen2 3B Q4_K – Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 441.62 ± 1.16 |
| qwen2 3B Q4_K – Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 33.44 ± 0.02 |
| **Qwen2.5-3B-Instruct-Q6_K_L.gguf** | | | | | | | |
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 459.30 ± 1.93 |
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 27.27 ± 0.05 |
| **Qwen2.5-3B-Instruct-Q8_0.gguf** | | | | | | | |
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 143.61 ± 0.21 |
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 34.15 ± 0.01 |
| **Qwen3-4B-Instruct-2507-Q6_K.gguf** | | | | | | | |
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 100.84 ± 0.31 |
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | tg128 | 21.97 ± 0.02 |
| **qwen2.5-1.5b-q8_0.gguf** | | | | | | | |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | pp512 | 961.78 ± 5.83 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | tg128 | 57.89 ± 0.03 |
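A quick way to predict which GGUF files will fit: compare the file size plus a working-memory margin against the free VRAM that `llama-bench --list-devices` reports. A minimal sketch in Python; the 15 % overhead factor for KV cache and compute buffers is my own rough assumption, not a llama.cpp figure:

```python
# Rough VRAM-fit check: model weights plus an estimated overhead margin
# must fit into the free VRAM reported by llama-bench --list-devices.
# The 15% overhead is an assumption, not a llama.cpp constant.

def fits_in_vram(model_gib: float, free_vram_mib: float, overhead: float = 0.15) -> bool:
    """Return True if the GGUF file plus estimated overhead fits in free VRAM."""
    needed_mib = model_gib * 1024 * (1 + overhead)
    return needed_mib <= free_vram_mib

# The GTX 970 reports 3966 MiB free.
FREE_MIB = 3966
print(fits_in_vram(2.60, FREE_MIB))   # granite 3B Q6_K  -> True
print(fits_in_vram(16.88, FREE_MIB))  # deepseek2 30B.A3B -> False
```

This matches the split above: everything up to ~3 GiB fits fully, while the 16.88 GiB model can only be partially offloaded.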
Models that don’t fit into VRAM
| model | size | params | backend | ngl | fa | dev | test | t/s |
|---|---|---|---|---|---|---|---|---|
| deepseek2 30B.A3B Q4_K – Medium | 16.88 GiB | 29.94 B | CUDA | 6 | 1 | CUDA0 | pp512 | 65.14 ± 0.26 |
| deepseek2 30B.A3B Q4_K – Medium | 16.88 GiB | 29.94 B | CUDA | 6 | 1 | CUDA0 | tg128 | 11.85 ± 0.02 |
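A back-of-the-envelope way to pick a starting `-ngl` value is to divide free VRAM by the approximate per-layer size. The sketch below assumes a uniform layer size derived from the file size and an illustrative layer count of 48 (not the model's real count); KV cache and compute buffers eat further into the budget, which is why the usable value in practice (6 here) ends up well below the estimate:

```python
# Estimate how many transformer layers (-ngl) of a partially offloaded
# model fit into free VRAM. Layer count and per-layer size are rough
# illustrative assumptions, not values read from the GGUF file.

def max_offload_layers(model_gib: float, n_layers: int, free_vram_mib: float) -> int:
    """Upper-bound estimate ignoring KV cache and compute buffers."""
    per_layer_mib = model_gib * 1024 / n_layers
    return min(n_layers, int(free_vram_mib // per_layer_mib))

# deepseek2 30B.A3B Q4_K is 16.88 GiB; assume 48 layers for illustration.
print(max_offload_layers(16.88, 48, 3966))  # prints 11
```

Treat the result as an upper bound and step `-ngl` down until `llama-bench` stops running out of memory.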
Example via the llama-server web UI
Question: Please explain “bit shifting” to me as if I were 5 years old.
The answer metrics for each model (all fully loaded on the GPU):

| model | tokens | time | t/s |
|---|---|---|---|
| LFM2.5-1.2B-Thinking-Q6_K.gguf | 1,487 | 22 s | 65.85 |
| LFM2-2.6B-Exp-Q6_K.gguf | 1,174 | 37 s | 31.51 |
| granite-4.0-micro-Q6_K.gguf | 215 | 8.8 s | 24.34 |
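As a sanity check, tokens divided by wall time should land near the reported t/s figure; the small gaps come from rounded wall times and from prompt processing being included in the clock but not in the generation rate:

```python
# Cross-check reported throughput against tokens / wall time.

def throughput(tokens: int, seconds: float) -> float:
    """Average tokens per second over the whole request."""
    return tokens / seconds

print(round(throughput(1487, 22), 2))   # 67.59 vs reported 65.85 t/s
print(round(throughput(1174, 37), 2))   # 31.73 vs reported 31.51 t/s
print(round(throughput(215, 8.8), 2))   # 24.43 vs reported 24.34 t/s
```

All three are within a few percent of the `llama-bench` tg128 numbers for the same models, so the web-UI figures are consistent with the benchmark.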