Posted

Once upon a time I had the idea to use an old Dell Precision to learn something about AI. It worked, but I wanted more: more speed and more context, because what I had was still not enough. So I thought it would be a good idea to use an old GTX 970 to speed it up, and I finally got that to work. The results were good, but the VRAM was still too small for a bigger context.
I decided to buy an RTX 2060 with 12 GB of VRAM via eBay Kleinanzeigen for a small amount of money. And the drama began. Nothing worked. Nothing. But I did not want to give up, so I started my journey to fix it.

During the journey, the card was sometimes visible in the PCIe tree (lspci -tv) and sometimes even in nvidia-smi. But after running nvidia-smi once, it fell off the bus.

The connection

The EXP GDC OCuLink came with the dock, an OCuLink cable, and an M.2 adapter. This combination did not work well, even with my GTX 970. I think the connector of the adapter was very fragile and the signaling was not as stable as I wished; at the time, though, I didn't know that. So I bought another dock with another cable and another adapter and started replacing the parts one by one. Nothing seemed to work: the new equipment behaved no better, and the card was still either invisible or fell off the bus. Then I bought yet another adapter and tinkered a bit with the system, so I am not sure whether it was the new adapter or my tinkering that finally brought the RTX to visibility. In any case, the new adapter (GINTOOYUN PCI-E 4.0 M.2 NVMe to OCuLink SFF-8611/8612) made the cable easier to handle, even though it makes the run longer, which is not ideal when you already have connection issues. But the card was visible in the lspci tree.

The performance

As a rule, the faster your link is, the more it is affected by the environment. At the time, I thought my old GTX card did not support PCIe 3.0, since my old PC only supported 2.0. I was wrong, but that didn't matter, because it put me on the right track. I wanted to use the card for AI, so a fast connection was not important, but stability was. More on that later.

The driver

During my research I found a GitHub issue comment that mentioned something interesting: the card falls off the bus only when CUDA is used. This was super important. I checked my configuration and tried to reproduce it. The first attempts failed, but then I saw it. I had forgotten to disable the llama service that already worked and was assigned to the GTX; it therefore also started directly on the RTX. That was why the card fell off the bus some time after system startup. Now I was able to reproduce it. As long as I did not touch nvidia-smi or any other CUDA call, the card stayed put, and dmesg did not show the error. But when I called nvidia-smi, the card fell off. So I started blacklisting the drivers at system start and loading them one by one. And it worked, sometimes, but it was more stable than before. Even nvidia-smi started working. CUDA still did not: when I loaded nvidia_uvm and called nvidia-smi, the connection died again.
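For reference, one standard way to do that blacklisting and step-by-step loading looks like this (a sketch; the file name is my choice, the module names are the stock NVIDIA driver modules):

```shell
# /etc/modprobe.d/blacklist-nvidia-debug.conf -- keep the modules from auto-loading:
#   blacklist nvidia
#   blacklist nvidia_modeset
#   blacklist nvidia_drm
#   blacklist nvidia_uvm

# After a reboot, load them one by one and check dmesg between steps:
sudo modprobe nvidia
sudo dmesg | tail
sudo modprobe nvidia_modeset
sudo dmesg | tail
sudo modprobe nvidia_uvm    # in my case this was the step that killed the link
```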

The power management trap

From my experience in IT, I know that power management is not your friend; when it comes to connection issues, it is often your worst enemy. If it works, fine, but if you want to disable it, good luck. Back to the story.
One of the first proposed solutions was to disable ASPM. Yes, power management. So I did: I wrote pcie_aspm=off into my GRUB file. I had actually done this before, because the GTX also didn't behave well in this setup, and since it helped there, at least to get rid of the AER errors, I expected it to work. Guess what? It did only part of the job. While fiddling with the drivers, the performance, the testing, and so on, I saw it in the lspci output (3d:00.0 is the address of the RTX). Can you see it?

root@bigdelli:/home/adrian# lspci -vv -s 3d:00.0 | grep -iE 'LnkCap|LnkCtl|LnkSta|LnkCtl2|LnkSta2'
	LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
	LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
	LnkSta:	Speed 2.5GT/s (downgraded), Width x4 (downgraded)
	LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
	LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
	LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
	LnkCtl3: LnkEquIntrruptEn- PerformEqu-

Wait, I will show you.

LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+

It was still on.

Despite some research, I was unable to get it disabled. But I remembered from previous problems that setpci might be helpful here. So, even though ChatGPT had not really been helpful so far (it is good if the solution is already in the training data, but bad when it comes to unresolved problems), I gave it a try, and it came back with a correct command:
setpci -s 3d:00.0 CAP_EXP+10.w=0140

CAP_EXP is the PCI Express Capability structure
+10 -> offset 0x10, the Link Control register (which holds the ASPM bits)
0140 -> keep bits 6 and 8 set but clear bit 1: 0000 0001 0100 0000

It basically disables ASPM. 0140 is hexadecimal, i.e. 0x0140.

My initial state was 0x0142, binary 0000 0001 0100 0010. That means ASPM L1 enabled (bit 1), Common Clock Configuration enabled (bit 6), and Clock Power Management enabled (bit 8).
I got it with this command:

sudo setpci -s 3d:00.0 CAP_EXP+10.w
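The bit arithmetic can be sanity-checked without touching the hardware; a small sketch with my values hard-coded (0x0142 before, 0x0140 after):

```shell
# Link Control: bits 1:0 = ASPM control, bit 6 = Common Clock, bit 8 = Clock PM
current=0x0142                 # value read back from CAP_EXP+10
aspm=$(( current & 0x3 ))      # 2 -> binary 10 -> ASPM L1 enabled
no_aspm=$(( current & ~0x3 ))  # clear the ASPM field, keep everything else
printf 'ASPM field: %d, value to write: %04x\n' "$aspm" "$no_aspm"
# -> ASPM field: 2, value to write: 0140
```

Computing the new value from the read-back value (instead of hard-coding 0140) also keeps the other bits intact if your card reports a different initial state.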

I put that into a systemd service and restarted everything. And it worked!

The happy end

I did much more than I have described; it was a constant back and forth, and at some points I thought the card was broken, although it worked in my gaming PC. I am now removing, one by one, the other workarounds I tried, to make the root cause visible.
It turned out that the GTX also supports PCIe 3.0, so even though the speed question pointed me in a good direction, I removed the manual speed downgrade, and it is still stable so far.

I am super happy that it now works very stably, and the speed is impressive. Depending on the model, the range is between 20 t/s for bigger models and over 200 t/s for smaller models.

My current grub line is:

GRUB_CMDLINE_LINUX="consoleblank=30 pcie_aspm=off pci=realloc pcie_port_pm=off systemd.unit=multi-user.target"

I also have the following option disabled:

adrian@bigdelli:~$ cat /etc/modprobe.d/nvidia-graphics-drivers-kms.conf
# Nvidia modesetting support. Set to 0 or comment to disable kernel modesetting
# and framebuffer console support. This must be disabled in case of Mosaic or SLI.
# This was the default:
#options nvidia-drm modeset=1

The service that disables ASPM (remember that the PCIe address (3d:00.0) is most likely different on your system):

cat /etc/systemd/system/egpu-link-aspm-off.service
[Unit]
Description=Disable ASPM on eGPU root port and device
DefaultDependencies=no
After=local-fs.target
Before=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c '/usr/bin/setpci -s 3d:00.0 CAP_EXP+10.w=0140; sleep 1'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
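After creating the unit file, it still has to be registered and enabled; the standard systemd steps are:

```shell
sudo systemctl daemon-reload
sudo systemctl enable --now egpu-link-aspm-off.service
systemctl status egpu-link-aspm-off.service   # verify it ran
```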


Posted

The machine’s specs:
Intel® Core™ i7-6920HQ CPU at 2.90GHz
48 GB RAM at 2400 MHz
Nvidia GTX 970 eGPU with 4 GB VRAM
eGPU -> OCuLink -> M.2 adapter

Prepare

adrian@bigdelli:~$ llama-bench --list-devices
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce GTX 970, compute capability 5.2, VMM: yes
  Device 1: Quadro M1200, compute capability 5.0, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce GTX 970 (4030 MiB, 3966 MiB free)
  CUDA1: Quadro M1200 (4035 MiB, 4001 MiB free)

Models that fit into the vRAM

llama-bench -ngl 99 -fa 1 -m models/granite/granite-4.0-micro-Q6_K.gguf --device CUDA0

NVIDIA GeForce GTX 970, compute capability 5.2, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | ---- | --- |

granite-4.0-micro-Q6_K.gguf
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | pp512 | 389.80 ± 1.11 |
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | tg128 | 24.97 ± 0.04 |

LFM2-2.6B-Exp-Q6_K.gguf
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | pp512 | 546.39 ± 1.52 |
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | tg128 | 32.72 ± 0.15 |

LFM2.5-1.2B-Thinking-Q6_K.gguf
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | pp512 | 1281.22 ± 10.11 |
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | tg128 | 69.75 ± 0.35 |

Qwen2.5-3B-Instruct-Q4_K_M.gguf
| qwen2 3B Q4_K – Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 441.62 ± 1.16 |
| qwen2 3B Q4_K – Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 33.44 ± 0.02 |

Qwen2.5-3B-Instruct-Q6_K_L.gguf
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 459.30 ± 1.93 |
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 27.27 ± 0.05 |

Qwen2.5-3B-Instruct-Q8_0.gguf
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 143.61 ± 0.21 |
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 34.15 ± 0.01 |

Qwen3-4B-Instruct-2507-Q6_K.gguf
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 100.84 ± 0.31 |
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | tg128 | 21.97 ± 0.02 |

qwen2.5-1.5b-q8_0.gguf
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | pp512 | 961.78 ± 5.83 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | tg128 | 57.89 ± 0.03 |

Models that don’t fit into the vRAM

| model | size | params | backend | ngl | fa | dev | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | --- | ---- | --- |
| deepseek2 30B.A3B Q4_K – Medium | 16.88 GiB | 29.94 B | CUDA | 6 | 1 | CUDA0 | pp512 | 65.14 ± 0.26 |
| deepseek2 30B.A3B Q4_K – Medium | 16.88 GiB | 29.94 B | CUDA | 6 | 1 | CUDA0 | tg128 | 11.85 ± 0.02 |

Example via the llama-server provided website

Question: Please explain “bit shifting” to me as if I were 5 years old.
The answer metrics for the models (fully loaded on the GPU):

| model | tokens | time | t/s |
| ----- | ------ | ---- | --- |
| LFM2.5-1.2B-Thinking-Q6_K.gguf | 1,487 | 22s | 65.85 |
| LFM2-2.6B-Exp-Q6_K.gguf | 1,174 | 37s | 31.51 |
| granite-4.0-micro-Q6_K.gguf | 215 | 8.8s | 24.34 |

Categories Linux, AI

Posted

The Dell Precision 7520 has performed well so far with the Quadro M1200 and smaller models. But I wanted a little more. 4GB is a bit on the low side, and while the speed is fine for casual use, it’s not enough for anything more demanding.

Years ago, I built a PC, and since I enjoyed playing Elite Dangerous back then, I treated myself to a GTX 970. Yes, I know the GTX 970 also only has 4GB of VRAM, but it has a higher bandwidth, which benefits performance.

So I wanted to see how this card performs and whether it’s worth investing in a card with more RAM.

Since eGPU-to-Thunderbolt adapters are insanely expensive, I decided to go the DIY route with an eGPU-to-Oculink-to-M.2 adapter.

So I ordered the “EXP GDC OCuLink High Speed GPU Dock PCIe 4.0 ×4 Mini PC Notebook Laptop to External Graphics Card Adapter M.2 Mkey to OCuLink.”

First off, it’s not plug-and-play! But it was worth it.

How to get it running

After installing it, it did not run.

lspci -tv

did not show an additional tree entry for the GTX 970, so the handshake with the M.2 adapter was not successful.
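Before reseating everything, one generic thing worth trying (not specific to this dock) is forcing a PCIe re-enumeration via sysfs and checking again:

```shell
# Ask the kernel to rescan the PCIe bus (run as root);
# with a flaky link this sometimes makes the card appear without a reboot.
echo 1 > /sys/bus/pci/rescan
lspci -tv | grep -i nvidia
```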

After reading some articles about the problem, I decided to bridge some pins so the ATX power supply can start by itself, because the order in which the devices power up matters. The adapter can manage the on/off procedure together with the Dell, but that still did not work. The EXP adapter has a switch for this, but unfortunately you have to open the case to use it, which I did. I turned it on, but then the setup was not usable at all; I don't know whether that was because I moved the setup around. I also measured the pins, and the switch seemed to do the right thing, but to eliminate one more source of problems, I disabled it and bridged the pins again the old way.

Starting the EXP with the GTX 970 did the trick, and it worked. Until it didn't. After reseating all the cables and the M.2 adapter, it worked again. During analysis I found a lot of AER messages and communication errors between the computer and the graphics card. Some disappeared after I disabled ASPM, which stands for Active-State Power Management. To disable it, I added:

GRUB_CMDLINE_LINUX="consoleblank=30 pcie_aspm=off pci=realloc"

to my GRUB configuration. In my case,

/etc/default/grub

then

sudo update-grub

and restart.

This helped get rid of the AER messages, but the communication errors persisted, and the setup is brittle as hell. If it doesn't run, check the cables. From my experience with this setup, the cables are very error-prone. I have meanwhile ordered a new cable, which will hopefully be better.

What lspci should look like:

-[0000:00]-+-00.0  Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers
           +-01.0-[01]--+-00.0  NVIDIA Corporation GM107GLM [Quadro M1200 Mobile]
           |            \-00.1  NVIDIA Corporation GM107 High Definition Audio Controller [GeForce 940MX]
           +-02.0  Intel Corporation HD Graphics 530
           +-04.0  Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem
           +-14.0  Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller
           +-14.2  Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem
           +-15.0  Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0
           +-15.1  Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1
           +-16.0  Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1
           +-17.0  Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode]
           +-1c.0-[02]----00.0  Intel Corporation Wireless 8265 / 8275
           +-1c.2-[03]----00.0  Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader
           +-1c.4-[04-3c]--
           +-1d.0-[3d]--+-00.0  NVIDIA Corporation GM204 [GeForce GTX 970]
           |            \-00.1  NVIDIA Corporation GM204 High Definition Audio Controller
           +-1f.0  Intel Corporation CM238 Chipset LPC/eSPI Controller
           +-1f.2  Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller
           +-1f.4  Intel Corporation 100 Series/C230 Series Chipset Family SMBus
           \-1f.6  Intel Corporation Ethernet Connection (5) I219-LM

Running M1200 and GTX in parallel

nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro M1200                   On  |   00000000:01:00.0 Off |                  N/A |
| N/A   35C    P8            N/A  /  200W |    2399MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 970         On  |   00000000:3D:00.0 Off |                  N/A |
|  0%   37C    P8             12W /  163W |    1319MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             997      C   ...ma.cpp/build/bin/llama-server       2393MiB |
|    1   N/A  N/A             996      C   ...ma.cpp/build/bin/llama-server       1314MiB |
+-----------------------------------------------------------------------------------------+

Pictures of the setup

position of m2 adapter



exp connected

Note

Don't be surprised if the fans aren't running. They are temperature-controlled, and surprisingly, AI inference doesn't produce much heat on this card.

For this setup I use independent llama-server instances. To pin each one to a specific card, its service gets an environment variable:

Environment="CUDA_VISIBLE_DEVICES=1"

The number doesn't necessarily match the ID visible in nvidia-smi, so you have to try it out. In my case, this example routes to the M1200 even though nvidia-smi shows the GTX 970 as ID 1.
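If the mismatch bothers you, CUDA's documented CUDA_DEVICE_ORDER variable is worth a try: with PCI_BUS_ID, CUDA enumerates devices in PCI bus order, which usually matches nvidia-smi. A sketch of a service drop-in (the unit name is hypothetical):

```ini
# /etc/systemd/system/llama-m1200.service.d/override.conf (hypothetical unit name)
[Service]
# Enumerate in PCI bus order so the IDs line up with nvidia-smi
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
Environment="CUDA_VISIBLE_DEVICES=0"
```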

Is it worth it?

Totally, yes. Overall I see a speedup of about a factor of 3. It's not super fast, but way more than I expected. I use a factor here because the speed depends heavily on the settings, the context window, and the size of the model. Old GTX cards are very cheap, and even though this is not a hassle-free setup, you can get good results with an investment of about €100. Now I am trying to make the cabling more robust so that I can try another card with more VRAM.


Posted

After ChatGPT mentioned my blog when I asked for a performance comparison, I realized that this seems to be a rather unique setup for running local LLMs. Therefore, I decided to run llama-bench with the models I currently have on this machine and post the results here.

The machine’s specs:
Intel® Core™ i7-6920HQ CPU at 2.90GHz
48 GB RAM at 2400 MHz
Nvidia Quadro M1200 dGPU with 4GB vRAM

Models that fit into the vRAM

llama-bench -ngl 99 -fa 1 -m <model>
Device 0: Quadro M1200, compute capability 5.0, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | ---- | --- |

granite-4.0-micro-Q6_K.gguf
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | pp512 | 143.47 ± 0.60 |
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | tg128 | 8.03 ± 0.06 |

LFM2-2.6B-Exp-Q6_K.gguf
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | pp512 | 195.71 ± 0.25 |
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | tg128 | 10.54 ± 0.08 |

LFM2.5-1.2B-Thinking-Q6_K.gguf
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | pp512 | 461.74 ± 2.52 |
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | tg128 | 22.96 ± 0.17 |

Qwen2.5-3B-Instruct-Q4_K_M.gguf
| qwen2 3B Q4_K – Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 156.46 ± 0.64 |
| qwen2 3B Q4_K – Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 11.51 ± 0.07 |

Qwen2.5-3B-Instruct-Q6_K_L.gguf
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 165.89 ± 0.47 |
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 8.96 ± 0.02 |

Qwen2.5-3B-Instruct-Q8_0.gguf
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 158.52 ± 0.37 |
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 11.94 ± 0.02 |

Qwen3-4B-Instruct-2507-Q6_K.gguf
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 121.10 ± 0.35 |
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | tg128 | 7.11 ± 0.02 |

| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | pp512 | 332.50 ± 1.28 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | tg128 | 21.12 ± 0.00 |

Models that don’t fit into the vRAM

| model | size | params | backend | ngl | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | ---- | --- |

glm-4.7-flash-claude-4.5-opus.q6_k.gguf
| deepseek2 30B.A3B Q6_K | 22.92 GiB | 29.94 B | CUDA | 6 | 1 | pp512 | 51.14 ± 0.10 |
| deepseek2 30B.A3B Q6_K | 22.92 GiB | 29.94 B | CUDA | 6 | 1 | tg128 | 7.36 ± 0.04 |

qwen2.5-32b-instruct-q8_0.gguf
! Do not try, this takes ages !
| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 5 | 1 | pp512 | 14.21 ± 0.02 |
| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 5 | 1 | tg128 | 0.74 ± 0.00 |

qwen2.5-14b-Q8_0.gguf
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 10 | 1 | pp512 | 34.02 ± 0.16 |
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 10 | 1 | tg128 | 1.97 ± 0.00 |

Example via the llama-server provided website

Usually, I have 2 models loaded: one lives completely on the GPU, and one completely on the CPU. My currently loaded models are
- CPU: qwen/Qwen2.5-7B-Instr-Q4_K_M.gguf
- GPU: granite/granite-4.0-micro-Q6_K.gguf

Question: Please explain “bit shifting” to me as if I were 5 years old.
The answer metrics:

| model | tokens | time | t/s |
| ----- | ------ | ---- | --- |
| LFM2.5-1.2B-Thinking-Q6_K.gguf | 1,275 | 59s | 21.46 |
| LFM2-2.6B-Exp-Q6_K.gguf | 318 | 30s | 10.33 |
| granite-4.0-micro-Q6_K.gguf | 258 | 33s | 7.73 |

Only loaded on CPU:
| Qwen2.5-7B-Instr-Q4_K_M.gguf | 283 | 47s | 5.94 |

Categories Linux, AI

Posted

Goal:

LLM

  • runs locally and uses the GPU.
  • (optional) can be used for vibe coding with IntelliJ.
  • is accessible via web browser.
  • has a WhatsApp relay (a separate phone number is recommended).

Prepare

  • Disable all radio devices on the Dell in the BIOS.
  • Attach a wired network.
    I first tried the installation without a network attached, but that just made my life harder.
  • Download Ubuntu 22.04 Server, even though a newer version is available.
    The error I ran into with 24.04 was: fail curtin command block-meta dev/pve/data not an existing file or block device…
    I tried several things, from partitioning by hand to vgremove, pvremove, wipefs, and other solutions, all unsuccessfully.
    The older installer is just not as picky as the one from 24.04.

Installation

Install Ubuntu LTS 22.04 Server:

  • Choose Ubuntu Server with the HWE Kernel at boot
  • Check “Ubuntu Server (minimized)”
  • Check “Search for third-party drivers”
  • NO LVM; I needed to disable it.
    • I also edited the automatically created partitions, reducing the size of the main partition in favor of a 16G swap partition.
  • Check “Install OpenSSH Server”

Install GPU drivers

  • update repository
    sudo apt update
  • check drivers
    sudo ubuntu-drivers devices

vendor : NVIDIA Corporation
model : GM107GLM [Quadro M1200 Mobile]
driver : nvidia-driver-535-server – distro non-free
driver : nvidia-driver-470 – distro non-free
driver : nvidia-driver-450-server – distro non-free
driver : nvidia-driver-535 – distro non-free
driver : nvidia-driver-580 – distro non-free recommended
driver : nvidia-driver-580-server – distro non-free
driver : nvidia-driver-470-server – distro non-free
driver : nvidia-driver-390 – distro non-free
driver : nvidia-driver-418-server – distro non-free
driver : nvidia-driver-545 – distro non-free
driver : nvidia-driver-570 – distro non-free
driver : nvidia-driver-570-server – distro non-free
driver : xserver-xorg-video-nouveau – distro free builtin

Install the recommended one:

  • remove all old nvidia packages.
    sudo apt purge 'nvidia*'
  • get sources to build the dkms
    sudo apt install build-essential linux-headers-$(uname -r)
  • get new gcc to prevent later compilation problems with nvcc and llama.cpp
    sudo apt install gcc-12 g++-12
  • install the driver
    sudo apt install nvidia-driver-580

reboot.

Then check the installation:
nvidia-smi -> It should show your card.

Sun Feb  1 09:05:38 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro M1200                   Off |   00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8            N/A  /  200W |       2MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |

Install llama.cpp

Prepare

  • Get necessary software
    sudo apt install git
    sudo apt install cmake
    sudo apt install dialog
    sudo apt install openssl
  • Install a newer version of the CUDA toolkit (we need version 12, but the repo has version 11)
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
    sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
    sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
    sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
    sudo apt update
    sudo apt install cuda-toolkit-12-5
    This will take a while; grab a coffee or clean your room.
  • Assign CUDA to the paths (consider putting these exports into your .bashrc as well):
export CUDA_HOME=/usr/local/cuda-12.5
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
  • Test the installed version

which nvcc

/usr/local/cuda-12.5/bin/nvcc

nvcc --version

nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0

  • Clone the repository
    git clone https://github.com/ggerganov/llama.cpp
  • Check the following sites for the actual CUDA installation method:
    https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md
  • Build a llama
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12
    cmake --build build --config Release -j 8
    -j 8 is the number of parallel jobs; it speeds up the build a lot.

Go cooking or make something meaningful. This will take a long time!

Install Model

  • Switch to the home folder
    cd
  • Create models folder
    mkdir -p models
  • Download the 7b model
    curl -L -o openhermes-2.5-mistral-7b.Q4_K_M.gguf \
      https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf

Start LLM

My Dell Precision 7520 has:
  • 32 GB RAM dual channel
  • 16 GB RAM single channel

Depending on the context size, the communication will be faster or slower.
--ctx-size 4096

  • On 4 GB VRAM (Quadro M1200), this was a good starting point.
    --n-gpu-layers 32
~/llama.cpp/build/bin/llama-server -m ~/models/openhermes-2.5-mistral-7b.Q4_K_M.gguf \
   --n-gpu-layers -1 \
   --host 0.0.0.0 \
   --port 8080

All the settings depend heavily on the model that is used. I tried different models with more or less success. My approach is to run 2 services, one for a smaller model and one for a bigger one. 3B models run well on the GPU, meaning most of their layers fit into VRAM.
For the bigger ones, I set --n-gpu-layers 0 so they don't use the GPU; they live in RAM and are processed on the CPU.

Apart from the one used for this article, I currently have the following models running:
  • cpu_model.gguf -> glm-4.7-flash-claude-4.5-opus.q4_k_m.gguf
  • gpu_model.gguf -> OwlLM2-e2b.Q8_0.gguf

I recommend putting the server into a service and using systemctl to start it.
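A minimal sketch of such a unit, assuming the build and model paths from this post (the unit name and file locations are my choice):

```ini
# /etc/systemd/system/llama-gpu.service (hypothetical name and paths)
[Unit]
Description=llama.cpp server (GPU model)
After=network.target

[Service]
User=adrian
ExecStart=/home/adrian/llama.cpp/build/bin/llama-server \
  -m /home/adrian/models/gpu_model.gguf \
  --ctx-size 4096 --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with sudo systemctl enable --now llama-gpu.service.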

Llama.cpp brings its own web UI with it. It is available on the specified port.

I get speeds of about 6-12 tokens per second on both.

I also tried https://github.com/openclaw/openclaw for the WhatsApp relay, but it doesn't work well here. OpenClaw needs a context size of at least 16k, which this system cannot manage, at least not for the GPU model; depending on the model, I had to wait at least 3 minutes for just a "Hello." As an alternative, I now use https://github.com/HKUDS/nanobot, which does not need such a big context size.

In the end, I achieved all the open points. The IntelliJ part is still ongoing.

Categories Linux, AI