The machine’s specs:

- Intel® Core™ i7-6920HQ CPU at 2.90 GHz
- 48 GB RAM at 2400 MHz
- Nvidia GTX 970 eGPU with 4 GB VRAM
- eGPU connected via an OCuLink-to-M.2 adapter
Preparation
```shell
adrian@bigdelli:~$ llama-bench --list-devices
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce GTX 970, compute capability 5.2, VMM: yes
  Device 1: Quadro M1200, compute capability 5.0, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce GTX 970 (4030 MiB, 3966 MiB free)
  CUDA1: Quadro M1200 (4035 MiB, 4001 MiB free)
```
Models that fit into VRAM
Every model was benchmarked with the same command, only the model path varied:

```shell
llama-bench -ngl 99 -fa 1 -m models/granite/granite-4.0-micro-Q6_K.gguf --device CUDA0
```

All runs used Device 0: NVIDIA GeForce GTX 970, compute capability 5.2, VMM: yes.
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| **granite-4.0-micro-Q6_K.gguf** | | | | | | | |
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | pp512 | 389.80 ± 1.11 |
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | tg128 | 24.97 ± 0.04 |
| **LFM2-2.6B-Exp-Q6_K.gguf** | | | | | | | |
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | pp512 | 546.39 ± 1.52 |
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | tg128 | 32.72 ± 0.15 |
| **LFM2.5-1.2B-Thinking-Q6_K.gguf** | | | | | | | |
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | pp512 | 1281.22 ± 10.11 |
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | tg128 | 69.75 ± 0.35 |
| **Qwen2.5-3B-Instruct-Q4_K_M.gguf** | | | | | | | |
| qwen2 3B Q4_K – Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 441.62 ± 1.16 |
| qwen2 3B Q4_K – Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 33.44 ± 0.02 |
| **Qwen2.5-3B-Instruct-Q6_K_L.gguf** | | | | | | | |
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 459.30 ± 1.93 |
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 27.27 ± 0.05 |
| **Qwen2.5-3B-Instruct-Q8_0.gguf** | | | | | | | |
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 143.61 ± 0.21 |
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 34.15 ± 0.01 |
| **Qwen3-4B-Instruct-2507-Q6_K.gguf** | | | | | | | |
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 100.84 ± 0.31 |
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | tg128 | 21.97 ± 0.02 |
| **qwen2.5-1.5b-q8_0.gguf** | | | | | | | |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | pp512 | 961.78 ± 5.83 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | tg128 | 57.89 ± 0.03 |
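A quick way to predict which GGUF files will fit: compare the file size plus a working-memory margin against the free VRAM that `llama-bench --list-devices` reports. A minimal sketch in Python; the 15 % overhead factor for KV cache and compute buffers is my own rough assumption, not a llama.cpp figure:

```python
# Rough VRAM-fit check: model weights plus an estimated overhead margin
# must fit into the free VRAM reported by llama-bench --list-devices.
# The 15% overhead is an assumption, not a llama.cpp constant.

def fits_in_vram(model_gib: float, free_vram_mib: float, overhead: float = 0.15) -> bool:
    """Return True if the GGUF file plus estimated overhead fits in free VRAM."""
    needed_mib = model_gib * 1024 * (1 + overhead)
    return needed_mib <= free_vram_mib

# The GTX 970 reports 3966 MiB free.
FREE_MIB = 3966
print(fits_in_vram(2.60, FREE_MIB))   # granite 3B Q6_K  -> True
print(fits_in_vram(16.88, FREE_MIB))  # deepseek2 30B.A3B -> False
```

This matches the split above: everything up to ~3 GiB fits fully, while the 16.88 GiB model can only be partially offloaded.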
Models that don’t fit into VRAM
| model | size | params | backend | ngl | fa | dev | test | t/s |
|---|---|---|---|---|---|---|---|---|
| deepseek2 30B.A3B Q4_K – Medium | 16.88 GiB | 29.94 B | CUDA | 6 | 1 | CUDA0 | pp512 | 65.14 ± 0.26 |
| deepseek2 30B.A3B Q4_K – Medium | 16.88 GiB | 29.94 B | CUDA | 6 | 1 | CUDA0 | tg128 | 11.85 ± 0.02 |
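A back-of-the-envelope way to pick a starting `-ngl` value is to divide free VRAM by the approximate per-layer size. The sketch below assumes a uniform layer size derived from the file size and an illustrative layer count of 48 (not the model's real count); KV cache and compute buffers eat further into the budget, which is why the usable value in practice (6 here) ends up well below the estimate:

```python
# Estimate how many transformer layers (-ngl) of a partially offloaded
# model fit into free VRAM. Layer count and per-layer size are rough
# illustrative assumptions, not values read from the GGUF file.

def max_offload_layers(model_gib: float, n_layers: int, free_vram_mib: float) -> int:
    """Upper-bound estimate ignoring KV cache and compute buffers."""
    per_layer_mib = model_gib * 1024 / n_layers
    return min(n_layers, int(free_vram_mib // per_layer_mib))

# deepseek2 30B.A3B Q4_K is 16.88 GiB; assume 48 layers for illustration.
print(max_offload_layers(16.88, 48, 3966))  # prints 11
```

Treat the result as an upper bound and step `-ngl` down until `llama-bench` stops running out of memory.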
Example via the llama-server web UI
Question: Please explain “bit shifting” to me as if I were 5 years old.
The answer metrics for each model (all fully loaded on the GPU):

| model | tokens | time | t/s |
|---|---|---|---|
| LFM2.5-1.2B-Thinking-Q6_K.gguf | 1,487 | 22 s | 65.85 |
| LFM2-2.6B-Exp-Q6_K.gguf | 1,174 | 37 s | 31.51 |
| granite-4.0-micro-Q6_K.gguf | 215 | 8.8 s | 24.34 |
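As a sanity check, tokens divided by wall time should land near the reported t/s figure; the small gaps come from rounded wall times and from prompt processing being included in the clock but not in the generation rate:

```python
# Cross-check reported throughput against tokens / wall time.

def throughput(tokens: int, seconds: float) -> float:
    """Average tokens per second over the whole request."""
    return tokens / seconds

print(round(throughput(1487, 22), 2))   # 67.59 vs reported 65.85 t/s
print(round(throughput(1174, 37), 2))   # 31.73 vs reported 31.51 t/s
print(round(throughput(215, 8.8), 2))   # 24.43 vs reported 24.34 t/s
```

All three are within a few percent of the `llama-bench` tg128 numbers for the same models, so the web-UI figures are consistent with the benchmark.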