Homelab AI Server Multi GPU Benchmarks – Dual 4090s + 1070ti added in (CRAZY Results!)
“Digital Spaceport”
We are benchmarking the killer Proxmox-based LXC and Docker homelab running Open WebUI and Ollama. We are continuing our AI GPU Llama 3.1 testing at several parameter levels and with mixed GPU generations, and today we are looking at dual 4090s while tossing in a real wildcard extra GPU, an older 1070 Ti.
Thank you for videos, I have always wondered if it is possible to combine the RTX 4090 and RTX 3090 to run a 70B LLM, and it turns out it is possible!
Tried it with llama3.1:8b-instruct-fp16, back on dual 3090s again; here are the results:
Q1 – 46.35 tokens/s, Q2 – 45.06, Q3 – 44.19, Q4 – 42.12, AVG – 44.43.
Still via the Windows ollama CLI.
Just as a heads-up: for Q3, the eval duration was 19.915007 seconds, for a total duration of 19.9740072 seconds.
In other words, with my setup, with a 6700K and dual 3090s only running at PCIe 3.0 x16, it takes under 20 seconds to write the 4000-word story about a cat with a pirate theme.
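The "eval rate" Ollama prints is just tokens generated divided by eval time. A minimal sketch of that arithmetic, using the field names Ollama's API reports (eval_count and eval_duration in nanoseconds); the token count below is a made-up illustrative number, not from the run above:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Reproduce Ollama's printed 'eval rate': tokens generated
    divided by eval duration (the API reports nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)

# Hypothetical token count against the 19.915007 s eval duration above
print(round(tokens_per_second(333, 19_915_007_000), 2))
```

The same ratio applies to prompt_eval_count / prompt_eval_duration for the prompt eval rate.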
Update:
So I am re-running your tests with the llama3.1:70b model, but now with FOUR 3090s, albeit only connected to the same Asus Z170-E motherboard via PCIe 3.0 x1 adapter cards that go to x16 "riser cards" (leftover from my GPU mining days).
As such, the GPUs are starved for bandwidth, and the results currently end up being quite poor, despite having four cards to share the load:
None of the four 3090s exceeds 30% GPU utilisation. They all use about 11 GB out of the 24 GB of VRAM each (for a total of ~44 GB used).
Because they are starved for bandwidth, none of the GPUs draws more than ~80 W either (way less than the two 3090s I had earlier).
Results are: 5.54 tokens/s, 4.47, 4.26, 6.25, avg: 5.13
Everything else is the same, other than I added an EVGA XC3 3090 and an MSI Ventus 3X 3090. Since the Z170-E motherboard only has three physical PCIe 3.0 x16 slots, and the 2nd GPU blocks the 3rd slot, I can't really use it; that is why I ended up trying the x1 adapters.
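The bandwidth penalty of those x1 risers is easy to put a number on. A back-of-the-envelope sketch (PCIe 3.0 runs 8 GT/s per lane with 128b/130b encoding; this ignores protocol overhead, so real throughput is a bit lower):

```python
def pcie3_bandwidth_gbps(lanes: int) -> float:
    """Approximate usable PCIe 3.0 bandwidth in GB/s:
    8 GT/s per lane * 128/130 encoding efficiency, / 8 bits per byte."""
    return lanes * 8 * (128 / 130) / 8

print(f"x1:  {pcie3_bandwidth_gbps(1):.3f} GB/s")   # mining-style riser
print(f"x16: {pcie3_bandwidth_gbps(16):.2f} GB/s")  # full-length slot
```

Roughly a 16x difference, which lines up with the GPUs sitting below 30% utilisation when inter-GPU traffic has to cross those x1 links.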
Here are my results:
Hardware:
CPU: Core i7 6700K
Motherboard: Asus Z170-E
RAM: 4x 16 GB DDR4-2400 (I think it's Kingston Fury — not really sure)
SSD: Intel 670p 1 TB NVMe 3.0 x4 SSD
PSU: HP 660185-001 900W 1U PSUs
Parallel Miners ZSX Breakout board
GPUs: 2x Gigabyte Vision RTX 3090 OC 24 GB (both at PCIe 3.0 x8 speeds, instead of the PCIe 3.0 x16 that the motherboard can provide to these otherwise PCIe 4.0 x16 GPUs)
(Technically I also have an HGST 3 TB SATA 6 Gbps 7200 rpm HDD in there, but that's for my games; this was/is my gaming rig, although I don't really use it for gaming much anymore.)
Software:
OS: Win10 22H2
Ollama: running their Windows executable. Ubuntu 22.04 LTS via WSL had issues using both GPUs for this model, for some strange reason, but the "native" Windows version was able to, so I had to run this via the Windows command prompt rather than through open-webui. Go figure. (codestral:22b is still able to run from open-webui, though, so I am not really sure what the issue is with the llama3.1:70b model.)
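One common first debugging step for "only one GPU gets used" problems is to pin the visible devices explicitly before launching the server. A sketch, assuming you start `ollama serve` yourself; CUDA_VISIBLE_DEVICES is a standard CUDA environment variable, but whether it resolves this particular WSL issue is untested:

```python
import os

def ollama_env(gpu_ids):
    """Build an environment dict for launching `ollama serve`
    restricted to specific GPUs via CUDA_VISIBLE_DEVICES."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env

env = ollama_env([0, 1])
print(env["CUDA_VISIBLE_DEVICES"])
# subprocess.Popen(["ollama", "serve"], env=env)  # needs ollama installed
```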
Q1 – warmup
prompt eval rate: 25.24 tokens/s, eval rate: 17.39 tokens/s
Q2 – write me a program that generated fractals
prompt eval rate: 442.09 tokens/s, eval rate: 16.99 tokens/s
Q3 – Tell me a story about a cat in 4 thousand words, not characters or spaces. Words. Theme it like a pirate.
prompt eval rate: 1352.27 tokens/s, eval rate: 16.74 tokens/s
Q4 – write me a sentence and tell me how many words are in that sentence after you write it out.
prompt eval rate: 13186.53 tokens/s, eval rate: 16.57 tokens/s
Average tokens/s: 16.9225
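The average quoted above is the plain mean of the four eval rates; a one-liner to check the arithmetic:

```python
# Per-question eval rates from the four runs above (tokens/s)
rates = [17.39, 16.99, 16.74, 16.57]
avg = sum(rates) / len(rates)
print(round(avg, 4))  # 16.9225
```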
>>> /show info
Model
parameters 70.6B
quantization Q4_0
arch llama
context length 131072
embedding length 8192
Parameters
stop "<|start_header_id|>"
stop "<|end_header_id|>"
stop "<|eot_id|>"
License
LLAMA 3.1 COMMUNITY LICENSE AGREEMENT
Llama 3.1 Version Release Date: July 23, 2024
Power: ~285-300 W per GPU. It depends on whether I let it take a breather between prompts or submit them in rapid succession. If I give it a bit of a breather, the GPUs can cool down a little before responding to the next prompt, which lowers power consumption slightly, because the fans don't have to ramp up to 100% (fan power comes out of the 370 W max that my 3090s are permitted to draw).
VRAM usage: Also ~90%. (~22100 MB out of 24576 MB)
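Those power and VRAM figures are easy to log per GPU with `nvidia-smi --query-gpu=index,power.draw,memory.used,memory.total --format=csv,noheader,nounits`. A small parsing sketch; the sample line is fabricated to match the numbers reported above, not captured from a real run:

```python
def parse_smi_line(line: str):
    """Parse one CSV line from the nvidia-smi query above into
    (gpu index, power draw in W, percent of VRAM used)."""
    idx, power, used, total = (s.strip() for s in line.split(","))
    return int(idx), float(power), 100 * float(used) / float(total)

# Sample line echoing the ~290 W / ~22100 MB of 24576 MB reported above
idx, watts, pct = parse_smi_line("0, 292.5, 22100, 24576")
print(f"GPU {idx}: {watts} W, VRAM {pct:.1f}% used")
```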
It IS interesting to see that the 3090s, at least for this kind of LLM, aren't really all that much slower than dual 4090s, despite the cost and power differences. I thought the 4090s would have outperformed the 3090s by a wider margin.
Apparently I was wrong.
Can you please test the Mistral Large model at Q5 or Q6? I understand it won't fit completely inside VRAM, but I'm still curious how partial offloading will affect performance. With such expensive cards it is a shame to use them with small models or low quantization, so everything below 70b Q6 is a waste of money in my opinion. Thank you.
For the crazier scenario, are you going to get a couple of L40S or higher?
Strix cards are big but if you have a chance, get the aorus XDDD
Maybe if you mount the CPU radiator and fans on the other side, you could even drill custom holes so the radiator sits high enough that you can still get a GPU in and tighten the screws for the GPUs. I would just use thumb screws for the GPUs so you don't need a screwdriver and can keep the radiator as low as possible.
Edit: so I just watched the build vid and see how none of what I said would work, lol. What I would do instead: take the top brace that the radiator is leaning against, drill some holes in it, and mount it between the fans and the radiator, just enough so you're not blocking any air. Then flip it all over and mount it to the other side, so the top where you mounted the metal brace is now the bottom; you can then screw that bracket wherever you want above the GPU mounting holes, and move the brace blocking the long GPUs to the top of the frame, replacing the one mounted to the radiator that's now on the other side. Maybe it might work, or it's just my spectrum brain running wild, hahah.
Great video. I'm using a 4090 but still can't decide if I should go fp16 or int8. The Ada generation has been optimized for fp16, but I'm not seeing any performance gain.
Right on. I'm surprised the 1070 did as well as it did! I'll need to check out your benchmarks. Most users can only afford a single card, but it's clear that with an RTX 3090 24 GB card you can do a great job with the 8b models. I wonder how far they can go now.
Do you offer consulting services?
Is there a plan to benchmark the AMD RX 7600 XT? It has 16 GB of VRAM and seems like great value for running LLMs.
I think it's hilarious that people worry about how many watts a paragraph of helpful text costs.
Meanwhile in gaming, bashing a monster takes 350 W.
Couldn't we just use a crypto rig? This has all been done before.
We switched from Bitcoin and Ethereum over to AI.
It looked like the 8b models fit and run on only a single 4090 card in your tests.
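That matches the usual rule of thumb: weight memory is roughly parameter count times bits per weight, with KV cache and runtime overhead on top. A rough sketch (treating Q4_0 as ~4.5 bits per weight to account for the quantization scales, which is an approximation):

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough VRAM needed just for the model weights
    (ignores KV cache, activations, and runtime overhead)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"8B @ fp16: ~{weight_vram_gb(8, 16):.0f} GB")     # fits on one 24 GB card
print(f"8B @ Q4_0: ~{weight_vram_gb(8, 4.5):.1f} GB")    # fits easily
print(f"70B @ Q4_0: ~{weight_vram_gb(70, 4.5):.0f} GB")  # needs 2+ cards
```

The ~39 GB weight footprint for 70B at Q4_0 is consistent with the ~44 GB total VRAM usage reported across the cards above, once cache and overhead are added.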
Excellent videos. Love the ollama related content.