Usually, I drink a lot. I won’t say how much — my doctor might be reading this.
But if you want to buy me a drink, I’d love that.


Author Adrian Höhne
Posted
After ChatGPT mentioned my blog when I asked it for a performance comparison, I realized that this seems to be a rather unique setup for running local LLMs. So I decided to run llama-bench with the models I currently have on this machine and post the results here.
The machine’s specs:
Intel® Core™ i7-6920HQ CPU at 2.90GHz
48 GB RAM at 2400 MHz
Nvidia Quadro M1200 dGPU with 4 GB VRAM
llama-bench -ngl 99 -fa 1 -m <model>
Device 0: Quadro M1200, compute capability 5.0, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| granite-4.0-micro-Q6_K.gguf | |||||||
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | pp512 | 143.47 ± 0.60 |
| granite 3B Q6_K | 2.60 GiB | 3.40 B | CUDA | 99 | 1 | tg128 | 8.03 ± 0.06 |
| LFM2-2.6B-Exp-Q6_K.gguf | |||||||
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | pp512 | 195.71 ± 0.25 |
| lfm2 2.6B Q6_K | 2.07 GiB | 2.70 B | CUDA | 99 | 1 | tg128 | 10.54 ± 0.08 |
| LFM2.5-1.2B-Thinking-Q6_K.gguf | |||||||
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | pp512 | 461.74 ± 2.52 |
| lfm2 1.2B Q6_K | 915.96 MiB | 1.17 B | CUDA | 99 | 1 | tg128 | 22.96 ± 0.17 |
| Qwen2.5-3B-Instruct-Q4_K_M.gguf | |||||||
| qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 156.46 ± 0.64 |
| qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 11.51 ± 0.07 |
| Qwen2.5-3B-Instruct-Q6_K_L.gguf | |||||||
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 165.89 ± 0.47 |
| qwen2 3B Q6_K | 2.43 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 8.96 ± 0.02 |
| Qwen2.5-3B-Instruct-Q8_0.gguf | |||||||
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | pp512 | 158.52 ± 0.37 |
| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CUDA | 99 | 1 | tg128 | 11.94 ± 0.02 |
| Qwen3-4B-Instruct-2507-Q6_K.gguf | |||||||
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 121.10 ± 0.35 |
| qwen3 4B Q6_K | 3.07 GiB | 4.02 B | CUDA | 99 | 1 | tg128 | 7.11 ± 0.02 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | pp512 | 332.50 ± 1.28 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | CUDA | 99 | 1 | tg128 | 21.12 ± 0.00 |
These bigger models don't fit into the 4 GB of VRAM, so only a few of their layers are offloaded to the GPU (see the ngl column):

| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| glm-4.7-flash-claude-4.5-opus.q6_k.gguf | |||||||
| deepseek2 30B.A3B Q6_K | 22.92 GiB | 29.94 B | CUDA | 6 | 1 | pp512 | 51.14 ± 0.10 |
| deepseek2 30B.A3B Q6_K | 22.92 GiB | 29.94 B | CUDA | 6 | 1 | tg128 | 7.36 ± 0.04 |
| qwen2.5-32b-instruct-q8_0.gguf ! Do not try, this takes ages ! | |||||||
| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 5 | 1 | pp512 | 14.21 ± 0.02 |
| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | CUDA | 5 | 1 | tg128 | 0.74 ± 0.00 |
| qwen2.5-14b-Q8_0.gguf | |||||||
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 10 | 1 | pp512 | 34.02 ± 0.16 |
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 10 | 1 | tg128 | 1.97 ± 0.00 |
Usually, I have two models loaded: one lives completely on the GPU, and one lives completely on the CPU. My currently loaded models are
- CPU: qwen/Qwen2.5-7B-Instr-Q4_K_M.gguf
- GPU: granite/granite-4.0-micro-Q6_K.gguf
Question: Please explain “bit shifting” to me as if I were 5 years old.
The answer metrics for the question above:

| model | tokens | time | t/s |
|---|---|---|---|
| Fully loaded on GPU ||||
| LFM2.5-1.2B-Thinking-Q6_K.gguf | 1,275 tokens | 59s | 21.46 t/s |
| LFM2-2.6B-Exp-Q6_K.gguf | 318 tokens | 30s | 10.33 t/s |
| granite-4.0-micro-Q6_K.gguf | 258 tokens | 33s | 7.73 t/s |
| Only loaded on CPU ||||
| Qwen2.5-7B-Instr-Q4_K_M.gguf | 283 tokens | 47s | 5.94 t/s |
Author
Adrian Höhne
Categories
Linux, AI
Posted
sudo apt update
sudo ubuntu-drivers devices
vendor : NVIDIA Corporation
model : GM107GLM [Quadro M1200 Mobile]
driver : nvidia-driver-535-server - distro non-free
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-535 - distro non-free
driver : nvidia-driver-580 - distro non-free recommended
driver : nvidia-driver-580-server - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-390 - distro non-free
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-driver-545 - distro non-free
driver : nvidia-driver-570 - distro non-free
driver : nvidia-driver-570-server - distro non-free
driver : xserver-xorg-video-nouveau – distro free builtin
sudo apt purge 'nvidia*'
sudo apt install build-essential linux-headers-$(uname -r)
sudo apt install gcc-12 g++-12
sudo apt install nvidia-driver-580
reboot
Then check the installation:
nvidia-smi -> It should show your card.
Sun Feb 1 09:05:38 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Quadro M1200 Off | 00000000:01:00.0 Off | N/A |
| N/A 46C P8 N/A / 200W | 2MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
sudo apt install git
sudo apt install cmake
sudo apt install dialog
sudo apt install openssl
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
sudo apt update
sudo apt install cuda-toolkit-12-5
This will take a while; grab a coffee or clean your room.
export CUDA_HOME=/usr/local/cuda-12.5
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
which nvcc
/usr/local/cuda-12.5/bin/nvcc
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12
cmake --build build --config Release -j 8
The build is documented at https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md. The -j 8 flag sets the number of parallel build jobs and speeds up the build a lot. Go cooking or do something meaningful; this will take a long time!
cd
mkdir -p models
curl -L -o models/openhermes-2.5-mistral-7b.Q4_K_M.gguf \
https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf
Depending on the context size, the communication will be faster or slower.
--ctx-size 4096
--n-gpu-layers 32

~/llama.cpp/build/bin/llama-server -m ~/models/openhermes-2.5-mistral-7b.Q4_K_M.gguf \
--n-gpu-layers -1 \
--host 0.0.0.0 \
--port 8080
All of these settings depend heavily on the model that is used. I tried different models with more or less success. My approach is to run two services: one for a smaller model and one for a bigger model. 3B models run well on the GPU, meaning most of their layers fit into VRAM.
For the bigger ones, I set --n-gpu-layers 0 to keep them off the GPU. They then live in RAM and are processed on the CPU.
I recommend putting the server into a service and using systemctl to start it.
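A minimal sketch of such a service; the unit name, file paths, and user below are my assumptions, so adjust them to your setup (saved, e.g., as /etc/systemd/system/llama-gpu.service):

```ini
[Unit]
Description=llama.cpp server (GPU model)
After=network.target

[Service]
# User and paths are assumptions; adjust to your setup.
User=adrian
ExecStart=/home/adrian/llama.cpp/build/bin/llama-server \
  -m /home/adrian/models/openhermes-2.5-mistral-7b.Q4_K_M.gguf \
  --n-gpu-layers -1 --host 0.0.0.0 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After that, sudo systemctl daemon-reload followed by sudo systemctl enable --now llama-gpu starts the server and keeps it running across reboots.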
Llama.cpp brings its own web UI with it. It is available on the specified port.
I get speeds of about 6 to 12 tokens per second with both models.
I also tried https://github.com/openclaw/openclaw for the WhatsApp relay, but it doesn't work well. OpenClaw needs a context size of at least 16k, which this system cannot manage, at least not for the GPU model. Depending on the model, I had to wait at least 3 minutes for just a "Hello." As an alternative, I now use https://github.com/HKUDS/nanobot, which does not need such a big context size.
In the end, I achieved all the open points. The IntelliJ part is still ongoing.
Author
Adrian Höhne
Categories
Linux, AI
Posted
First of all, it has some strange issues. But if you want to give it a try, go ahead.
xRDP creates its own X11 instance and, as far as I know, brings its own X11 server. However, this comes with a caveat: every program I tried could only be started either on the host or on the client, but never on both at the same time.
Although COSMIC and XWayland ignore the Xorg configuration files, the X server from xRDP may use them. So I will try it and extend this article with my findings.
Host:
Remote System:
Update System (optional):
sudo apt update
sudo apt list --upgradable
sudo apt upgrade
Install xRDP
sudo apt install xrdp
Configure xRDP
echo "cosmic-session" > ~/.xsession
sudo systemctl enable xrdp
sudo systemctl start xrdp
Author Adrian Höhne
Posted
Use this formula for pixel movement, they told me. It will work, they told me. Yep, no.
In hindsight, it makes total sense why this cannot work, but more on that later.
import java.lang.Math.toRadians
import kotlin.math.cos
import kotlin.math.sin

data class Coordinate(val x: Double, val y: Double)

// Rotate `point` around `center` by `angleDegrees` (counterclockwise).
fun rotateCoordinate(point: Coordinate, center: Coordinate, angleDegrees: Double): Coordinate {
    val angleRadians = toRadians(angleDegrees)
    val dx = point.x - center.x
    val dy = point.y - center.y
    val rotatedX = dx * cos(angleRadians) - dy * sin(angleRadians)
    val rotatedY = dx * sin(angleRadians) + dy * cos(angleRadians)
    return Coordinate(center.x + rotatedX, center.y + rotatedY)
}
This is probably the wrong formula for the task, but for now, I just want to write down my notes.
However, after it just looked horrible in my game, I started making the problem easier to see. What do we developers do in such cases? Yes, we write some more code. And this is the result.

On the left side, I took the pixel coordinate of the previously rotated point and rotated it by angle 1.0.
On the right side, I took the start coordinate and rotated it by the full range.
So, whereas the rotation on the left side is, e.g., from degree 53 to 54, on the right side it is from 0 to 54.
So, what is happening here? With every rotation, the rotated point must be snapped back to the pixel grid, so it loses precision. It then sits somewhere other than where it should be, and the next calculation starts from that slightly wrong position and loses some precision again. Keeping the starting point and calculating the full angle avoids this accumulation and solves the issue.
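The accumulation can be demonstrated with a small, self-contained sketch. Assumption on my part: the renderer snaps coordinates to the pixel grid by rounding, modeled here by a `snap` helper.

```kotlin
import java.lang.Math.toRadians
import kotlin.math.cos
import kotlin.math.hypot
import kotlin.math.roundToInt
import kotlin.math.sin

data class Coordinate(val x: Double, val y: Double)

fun rotateCoordinate(point: Coordinate, center: Coordinate, angleDegrees: Double): Coordinate {
    val a = toRadians(angleDegrees)
    val dx = point.x - center.x
    val dy = point.y - center.y
    return Coordinate(
        center.x + dx * cos(a) - dy * sin(a),
        center.y + dx * sin(a) + dy * cos(a)
    )
}

// Snap to the integer pixel grid, as a renderer would before drawing.
fun snap(p: Coordinate) = Coordinate(p.x.roundToInt().toDouble(), p.y.roundToInt().toDouble())

fun main() {
    val center = Coordinate(0.0, 0.0)
    val start = Coordinate(100.0, 0.0)

    // Left side of the picture: rotate the snapped result by 1 degree, 90 times.
    var incremental = start
    repeat(90) { incremental = snap(rotateCoordinate(incremental, center, 1.0)) }

    // Right side: rotate the original point by the full 90 degrees in one step.
    val absolute = snap(rotateCoordinate(start, center, 90.0))

    val drift = hypot(incremental.x - absolute.x, incremental.y - absolute.y)
    println("incremental=$incremental absolute=$absolute drift=$drift px")
}
```

The incremental variant rounds 90 times and tends to drift away from the exact result, while the single full-angle rotation rounds only once.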
But still—it could be that I am using the wrong formula, so I will continue researching it.
Author
Adrian Höhne
Categories
Programming