After ChatGPT mentioned my blog when I asked it for a performance comparison, I realized that this machine seems to be a rather unusual setup for running local LLMs. So I decided to run llama-bench with the models I currently have on it and post the results here.
The machine's specs:

- Intel® Core™ i7-6920HQ CPU @ 2.90 GHz
- 48 GB RAM @ 2400 MHz
- Nvidia Quadro M1200 dGPU with 4 GB VRAM
## Models that fit into the VRAM
`llama-bench -ngl 99 -fa 1 -m <model>`

Device 0: Quadro M1200, compute capability 5.0, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| granite-4.0-micro-Q6_K.gguf | |||||||
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | pp512 | 143.47 ± 0.60 |
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | tg128 | 8.03 ± 0.06 |
| LFM2-2.6B-Exp-Q6_K.gguf | |||||||
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | pp512 | 195.71 ± 0.25 |
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | tg128 | 10.54 ± 0.08 |
| LFM2.5-1.2B-Thinking-Q6_K.gguf | |||||||
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | pp512 | 461.74 ± 2.52 |
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | tg128 | 22.96 ± 0.17 |
| Qwen2.5-3B-Instruct-Q4_K_M.gguf | |||||||
| qwen2 3B Q4_K – Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 156.46 ± 0.64 |
| qwen2 3B Q4_K – Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 11.51 ± 0.07 |
| Qwen2.5-3B-Instruct-Q6_K_L.gguf | |||||||
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 165.89 ± 0.47 |
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 8.96 ± 0.02 |
| Qwen2.5-3B-Instruct-Q8_0.gguf | |||||||
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 158.52 ± 0.37 |
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 11.94 ± 0.02 |
| Qwen3-4B-Instruct-2507-Q6_K.gguf | |||||||
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 121.10 ± 0.35 |
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | tg128 | 7.11 ± 0.02 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | pp512 | 332.50 ± 1.28 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | tg128 | 21.12 ± 0.00 |
## Models that don't fit into the VRAM
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| glm-4.7-flash-claude-4.5-opus.q6_k.gguf | |||||||
| deepseek2 30B.A3B Q6_K | 22.92 GiB | 29.94 B | CUDA | 6 | 1 | pp512 | 51.14 ± 0.10 |
| deepseek2 30B.A3B Q6_K | 22.92 GiB | 29.94 B | CUDA | 6 | 1 | tg128 | 7.36 ± 0.04 |
| qwen2.5-32b-instruct-q8_0.gguf (do not try this, it takes ages!) | |||||||
| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 5 | 1 | pp512 | 14.21 ± 0.02 |
| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 5 | 1 | tg128 | 0.74 ± 0.00 |
| qwen2.5-14b-Q8_0.gguf | |||||||
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 10 | 1 | pp512 | 34.02 ± 0.16 |
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 10 | 1 | tg128 | 1.97 ± 0.00 |
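For these larger models, the `-ngl` value in the table is the number of layers offloaded to the GPU; the rest stay in system RAM. A sketch of the invocation (the layer count is taken from the ngl column above; pick it so the offloaded layers plus KV cache still fit into the 4 GB VRAM):

```shell
# Partial offload: only 10 of the model's layers go to the GPU,
# the remaining layers run on the CPU from system RAM.
llama-bench -ngl 10 -fa 1 -m qwen2.5-14b-Q8_0.gguf
```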
## Example via the llama-server web UI
Usually, I have two models loaded: one lives entirely on the GPU, and one entirely on the CPU. My currently loaded models are:
- CPU: qwen/Qwen2.5-7B-Instr-Q4_K_M.gguf
- GPU: granite/granite-4.0-micro-Q6_K.gguf
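A sketch of how such a split setup can be started (the port numbers are my own choice here, not from the original configuration):

```shell
# GPU instance: -ngl 99 offloads all layers to the Quadro M1200.
llama-server -m granite/granite-4.0-micro-Q6_K.gguf -ngl 99 --port 8080 &

# CPU instance: -ngl 0 keeps every layer in system RAM.
llama-server -m qwen/Qwen2.5-7B-Instr-Q4_K_M.gguf -ngl 0 --port 8081 &
```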
Question: Please explain “bit shifting” to me as if I were 5 years old.
The answer metrics:

| model | tokens | time | speed |
|---|---|---|---|
| **Fully loaded on GPU** | | | |
| LFM2.5-1.2B-Thinking-Q6_K.gguf | 1,275 | 59 s | 21.46 t/s |
| LFM2-2.6B-Exp-Q6_K.gguf | 318 | 30 s | 10.33 t/s |
| granite-4.0-micro-Q6_K.gguf | 258 | 33 s | 7.73 t/s |
| **Only loaded on CPU** | | | |
| Qwen2.5-7B-Instr-Q4_K_M.gguf | 283 | 47 s | 5.94 t/s |
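As a sanity check, the reported speed is roughly the generated token count divided by the wall-clock time; the small differences come from prompt processing and the one-second timer granularity. The numbers below are copied from the table:

```python
# Recompute tokens/second from the table above and compare with the
# speed reported by the llama-server web UI.
runs = [
    ("LFM2.5-1.2B-Thinking-Q6_K", 1275, 59, 21.46),
    ("LFM2-2.6B-Exp-Q6_K", 318, 30, 10.33),
    ("granite-4.0-micro-Q6_K", 258, 33, 7.73),
    ("Qwen2.5-7B-Instr-Q4_K_M", 283, 47, 5.94),
]
for name, tokens, seconds, reported in runs:
    print(f"{name}: {tokens / seconds:.2f} t/s (reported: {reported} t/s)")
```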