Benchmarks on a Dell Precision 7520 with an Nvidia GTX 970 eGPU (4 GB VRAM)

The machine’s specs:
Intel® Core™ i7-6920HQ CPU at 2.90GHz
48 GB RAM at 2400 MHz
Nvidia GTX 970 eGPU with 4 GB VRAM
eGPU -> OCuLink -> M.2 adapter
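Before benchmarking, it is worth confirming which PCIe link the card actually negotiated over the OCuLink/M.2 path, since a narrow or downgraded link mostly hurts prompt processing. A minimal sketch using nvidia-smi's standard query properties:

```shell
# Query the negotiated PCIe generation and lane width of each GPU;
# falls back to a note when nvidia-smi is not installed.
link=$(nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current \
        --format=csv 2>/dev/null || echo "nvidia-smi not available")
echo "$link"
```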

Preparation

adrian@bigdelli:~$ llama-bench --list-devices
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce GTX 970, compute capability 5.2, VMM: yes
  Device 1: Quadro M1200, compute capability 5.0, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce GTX 970 (4030 MiB, 3966 MiB free)
  CUDA1: Quadro M1200 (4035 MiB, 4001 MiB free)

Models that fit into VRAM

llama-bench -ngl 99 -fa 1 -m models/granite/granite-4.0-micro-Q6_K.gguf --device CUDA0

NVIDIA GeForce GTX 970, compute capability 5.2, VMM: yes

| file | model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | ---: | ---: | --- | --: | --: | --- | ---: |
| granite-4.0-micro-Q6_K.gguf | granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | pp512 | 389.80 ± 1.11 |
| | granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | tg128 | 24.97 ± 0.04 |
| LFM2-2.6B-Exp-Q6_K.gguf | lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | pp512 | 546.39 ± 1.52 |
| | lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | tg128 | 32.72 ± 0.15 |
| LFM2.5-1.2B-Thinking-Q6_K.gguf | lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | pp512 | 1281.22 ± 10.11 |
| | lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | tg128 | 69.75 ± 0.35 |
| Qwen2.5-3B-Instruct-Q4_K_M.gguf | qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 441.62 ± 1.16 |
| | qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 33.44 ± 0.02 |
| Qwen2.5-3B-Instruct-Q6_K_L.gguf | qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 459.30 ± 1.93 |
| | qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 27.27 ± 0.05 |
| Qwen2.5-3B-Instruct-Q8_0.gguf | qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 143.61 ± 0.21 |
| | qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 34.15 ± 0.01 |
| Qwen3-4B-Instruct-2507-Q6_K.gguf | qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 100.84 ± 0.31 |
| | qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | tg128 | 21.97 ± 0.02 |
| qwen2.5-1.5b-q8_0.gguf | qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | pp512 | 961.78 ± 5.83 |
| | qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | tg128 | 57.89 ± 0.03 |
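All of the runs above use the same flags; only the model file changes. A small loop saves retyping (MODELS_DIR is an assumption for your model directory; the script skips gracefully when llama-bench or the files are missing):

```shell
# Benchmark every GGUF file in a directory with full GPU offload.
MODELS_DIR=models            # hypothetical path; adjust to your layout
ran=0
for m in "$MODELS_DIR"/*.gguf; do
  [ -f "$m" ] || continue                      # glob matched nothing
  command -v llama-bench >/dev/null 2>&1 || break
  llama-bench -ngl 99 -fa 1 -m "$m" --device CUDA0
  ran=$((ran + 1))
done
echo "benchmarked $ran model(s)"
```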

Models that don’t fit into VRAM

| model | size | params | backend | ngl | fa | dev | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --- | --- | ---: |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 6 | 1 | CUDA0 | pp512 | 65.14 ± 0.26 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 6 | 1 | CUDA0 | tg128 | 11.85 ± 0.02 |
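This run offloads only part of the model. A command along these lines produces it (the model filename is hypothetical; substitute the actual Q4_K_M file):

```shell
# Offload just 6 layers to the 4 GB card; the remaining layers run
# from system RAM, which caps throughput well below the fully
# offloaded models above. Skips when llama-bench is not installed.
if command -v llama-bench >/dev/null 2>&1; then
  llama-bench -ngl 6 -fa 1 -m models/your-30b-a3b-Q4_K_M.gguf --device CUDA0 || true
fi
done_msg="partial offload sketch complete"
echo "$done_msg"
```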

Example via the llama-server web UI

Question: Please explain “bit shifting” to me as if I were 5 years old.
The answer metrics, with each model fully loaded on the GPU:

| model | tokens | time | t/s |
| --- | ---: | ---: | ---: |
| LFM2.5-1.2B-Thinking-Q6_K.gguf | 1,487 | 22 s | 65.85 |
| LFM2-2.6B-Exp-Q6_K.gguf | 1,174 | 37 s | 31.51 |
| granite-4.0-micro-Q6_K.gguf | 215 | 8.8 s | 24.34 |
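As a sanity check, throughput can be recomputed from tokens and wall time; for the first row, 1,487 tokens in 22 s:

```shell
# tokens / seconds; the displayed time is rounded, so this lands
# close to, not exactly on, the UI's 65.85 t/s figure.
rate=$(awk 'BEGIN { printf "%.2f", 1487 / 22 }')
echo "$rate t/s"
```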

Categories: Linux, AI