GPU has fallen off the bus, and how I fixed it


Once upon a time I had the idea to use an old Dell Precision to learn something about AI. It worked, but I wanted more: more speed and more context, since it was still not enough. So I thought it was a good idea to use an old GTX 970 to speed things up, and I finally got that to work. The results were good, but the VRAM was still too small for a bigger context.
I decided to buy an RTX 2060 with 12 GB of VRAM via eBay Kleinanzeigen for a small amount of money. And the drama began. Nothing worked, nothing. But I did not want to give up, so I started my journey to fix it.

During the journey, the card was sometimes visible in the PCIe tree (lspci -tv) and sometimes even in nvidia-smi. But after running nvidia-smi once, it fell off the bus.

The connection

The EXP GDC OCuLink dock came with an OCuLink cable and an M.2 adapter. This combination did not work well, even with my GTX 970. I think it was because the connector of the adapter was very fragile and the signaling was not as stable as I wished; at the time, however, I didn't know that. So I bought another dock with another cable and another adapter and started replacing the parts one by one. Yet nothing seemed to work: the other equipment did not work well either, and the card was still either not visible or had fallen off the bus. Then I bought yet another adapter and tinkered a bit with the system, so I am not sure whether the new adapter or my tinkering was what finally brought the RTX to visibility. In any case, with the new adapter (GINTOOYUN PCI-E 4.0 M.2 NVMe to OCuLink SFF-8611/8612) the cable was easier to handle, even though it makes the run longer, which is not ideal if you have connection issues. But the card was visible in the lspci tree.

The performance

As usual, the faster your link is, the more it is affected by the environment. At the time I thought my old GTX card did not support PCIe 3.0, since my old PC only supported 2.0. I was wrong, but that didn't matter, because it put me on the right track. I wanted to use the card for AI, so a fast connection was not that relevant, but stability was. More on that later.

The driver

During my research I found a GitHub issue comment that mentioned something interesting: the card falls off the bus only when CUDA is used. This was super important. I checked my configuration and tried to reproduce it. The first attempts had no success, but then I could see it. I had forgotten to disable the Llama service, which already worked and was assigned to the GTX; therefore, it also started directly on the RTX. That was why the card fell off the bus some time after system startup. Now I was able to reproduce it: as long as I did not touch nvidia-smi or make any other CUDA call, the card stayed put, and dmesg did not show the message. But when I called nvidia-smi, the card fell off. So I started to blacklist the drivers at system start and to load them one by one. And it worked, sometimes; at least it was more stable than before. Even nvidia-smi started working. But CUDA still did not: when I loaded nvidia_uvm and called nvidia-smi, the connection died again.
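The blacklisting itself is plain modprobe configuration. A sketch of what such a file can look like (the file name is my choice, not an official one); with this in place, the modules stay unloaded at boot and can be brought up manually, one by one, with modprobe:

```
# /etc/modprobe.d/blacklist-nvidia.conf (file name is an assumption)
# Keep all NVIDIA modules out of the automatic boot-time load so they
# can be loaded manually, one by one, to find which one kills the link.
blacklist nvidia
blacklist nvidia_modeset
blacklist nvidia_drm
blacklist nvidia_uvm
```

In my case the link only died once nvidia_uvm (the CUDA memory management module) was loaded and a CUDA call was made.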

The power management trap

From my experience in IT, I know that power management is not your friend, and when it comes to connection issues, it is often your worst enemy. If it works, fine; but if you want to disable it, good luck. Back to the story.
One of the first proposed solutions was to disable ASPM. Yes, power management. So I did: I wrote pcie_aspm=off into my GRUB config. I had actually done that before, because the GTX also didn't react well in this setup, and since it helped with the GTX, at least to get rid of the AER errors, I expected it to work. Guess what? It did only part of the job. While hassling with the drivers, the performance, the testing, and so on, I saw it in the lspci output (3d:00.0 is the address of the RTX), wtf. Can you see it?

root@bigdelli:/home/adrian# lspci -vv -s 3d:00.0 | grep -iE 'LnkCap|LnkCtl|LnkSta|LnkCtl2|LnkSta2'
	LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
	LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
	LnkSta:	Speed 2.5GT/s (downgraded), Width x4 (downgraded)
	LnkCap2:	Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
	LnkCtl2:	Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
	LnkSta2:	Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
	LnkCtl3:	LnkEquIntrruptEn- PerformEqu-

Wait, I will show you.

LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+

It was still on.

After some research, I was still unable to get it disabled. But I remembered from previous problems that setpci might be helpful here. So, even though ChatGPT had not really been helpful so far (it is good if the solution is already in the training data, but bad when it comes to unresolved problems), I wanted to give it a try, and it came back with a correct command:
setpci -s 3d:00.0 CAP_EXP+10.w=0140

CAP_EXP -> the PCI Express capability structure
+10 -> offset 0x10 within it: the Link Control register
0140 -> keeps bits 6 and 8 set but clears bit 1: 0000 0001 0100 0000

It basically disables ASPM. 0140 here is hexadecimal 0x0140.

My initial state was 0x0142, i.e. 0000 0001 0100 0010. That means ASPM L1 enabled (bit 1), Common Clock enabled (bit 6), and Clock Power Management enabled (bit 8).
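The bit positions come from the PCIe Link Control register layout; a quick shell sanity check of my initial value (pure arithmetic, no hardware needed):

```shell
# Decode the Link Control word by hand (0x0142 was my initial state).
# Bits 1:0 = ASPM control, bit 6 = Common Clock Configuration,
# bit 8 = Clock Power Management enable.
val=0x0142
echo $(( val & 0x3 ))        # 2 -> binary 10 -> ASPM L1 enabled
echo $(( (val >> 6) & 1 ))   # 1 -> Common Clock enabled
echo $(( (val >> 8) & 1 ))   # 1 -> Clock Power Management enabled
```

Writing 0x0140 instead clears bits 1:0 and leaves the other two untouched, which is exactly what the setpci command above does.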
I got it with this command:

sudo setpci -s 3d:00.0 CAP_EXP+10.w

I put that into a systemd service and restarted everything. And it worked!

The happy end

I did much more than I have described; it was back and forth the whole time, and at some points I thought the card was broken, although it worked in my gaming PC. I am also still removing some of the other workarounds I tried, to make the root issue visible.
It turned out that the GTX supported PCIe 3.0 after all, so even though the speed theory guided me in a good direction, I removed the manual speed downgrade, and it has stayed stable so far.

I am super happy that it now works very stably, and the speed is impressive. Depending on the model, the range is between 20 tokens/s for bigger models and over 200 tokens/s for smaller ones.

My current GRUB line (in /etc/default/grub) is:

GRUB_CMDLINE_LINUX="consoleblank=30 pcie_aspm=off pci=realloc pcie_port_pm=off systemd.unit=multi-user.target"

For now, I have also disabled the following option:

adrian@bigdelli:~$ cat /etc/modprobe.d/nvidia-graphics-drivers-kms.conf
# Nvidia modesetting support. Set to 0 or comment to disable kernel modesetting
# and framebuffer console support. This must be disabled in case of Mosaic or SLI.
# This was the default:
#options nvidia-drm modeset=1

The service that disables ASPM (please remember that my PCIe address (3d:00.0) is most likely different from yours):

cat /etc/systemd/system/egpu-link-aspm-off.service
[Unit]
Description=Disable ASPM on eGPU root port and device
DefaultDependencies=no
After=local-fs.target
Before=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c '/usr/bin/setpci -s 3d:00.0 CAP_EXP+10.w=0140; sleep 1'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
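After a reboot you can read the word back with setpci and check the ASPM bits. A tiny helper makes that check explicit (the function name is my own, nothing official; it only does shell arithmetic on the value setpci prints):

```shell
# aspm_bits: extract bits 1:0 (ASPM control) from a Link Control word
# as printed by `setpci -s <addr> CAP_EXP+10.w`.
aspm_bits() { echo $(( 0x$1 & 0x3 )); }

aspm_bits 0142   # prints 2 -> ASPM L1 still enabled
aspm_bits 0140   # prints 0 -> ASPM disabled
```

On the live system, aspm_bits "$(sudo setpci -s 3d:00.0 CAP_EXP+10.w)" should print 0 once the service has run.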
