
Homelab AI Server Multi GPU Benchmarks – Dual 4090s + 1070ti added in (CRAZY Results!)

“Digital Spaceport”

We are benchmarking the killer Proxmox-based LXC and Docker homelab running Open WebUI and Ollama. We are continuing our AI GPU Llama 3.1 testing at several parameter levels and mixed GPU generations, and today we are looking at dual 4090s and tossing in a real wildcard extra GPU, an older 1070 Ti…
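The per-prompt "prompt eval rate" and "eval rate" figures quoted in the comments below come straight from Ollama's timing counters (the same ones the verbose CLI output prints). For anyone who wants to reproduce them outside the CLI, here is a minimal sketch, assuming a local Ollama instance on the default port 11434 and a model that has already been pulled; the model tag and prompt are just examples:

```python
# Minimal sketch: reproduce Ollama's "prompt eval rate" / "eval rate" numbers
# via the REST API. Assumes Ollama listens on localhost:11434 and the model
# has already been pulled (e.g. `ollama pull llama3.1:70b`).
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def benchmark(model: str, prompt: str) -> dict:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # All durations in the response are reported in nanoseconds.
    prompt_rate = data["prompt_eval_count"] / (data["prompt_eval_duration"] / 1e9)
    eval_rate = data["eval_count"] / (data["eval_duration"] / 1e9)
    return {
        "prompt_eval_rate_tok_s": round(prompt_rate, 2),
        "eval_rate_tok_s": round(eval_rate, 2),
        "total_duration_s": round(data["total_duration"] / 1e9, 2),
    }

if __name__ == "__main__":
    print(benchmark("llama3.1:70b", "Write me a program that generates fractals."))
```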

source

 


16 Comments

  1. Tried it with the llama3.1:8b-instruct-fp16, back to dual 3090s again — here are the results:

    Q1 – 46.35 tokens/s, Q2 – 45.06, Q3 – 44.19, Q4 – 42.12, AVG – 44.43.

    Still via the Windows ollama CLI.

    Just as a heads-up: for Q3, the eval duration was 19.915007 seconds, for a total duration of 19.9740072 seconds.

    In other words, with my setup, with a 6700K and dual 3090s only running at PCIe 3.0 x16, it takes less than 20 seconds to write the 4000-word story about a cat with a pirate theme.
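For reference, the averages quoted in these comments are simple arithmetic means of the four per-question eval rates; a one-line check in Python, using the numbers from the comment above:

```python
# Arithmetic mean of the per-question eval rates quoted in the comment above
# (dual 3090s, llama3.1:8b-instruct-fp16).
rates = [46.35, 45.06, 44.19, 42.12]
print(round(sum(rates) / len(rates), 2))  # -> 44.43
```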

  2. Update:
    So I am re-running your tests with the llama3.1:70b model, but now with FOUR 3090s, albeit they're only connected to the same Asus Z170-E motherboard via PCIe 3.0 x1 adapter cards that go to x16 "riser cards" (left over from my GPU mining days).

    As such, the GPUs are starved for bandwidth, and the results currently end up being quite poor, despite now having four cards to share the load:

    None of the four 3090s exceeds 30% GPU utilisation. They all use about 11 GB out of the 24 GB of VRAM each (for a total of ~44 GB used).

    Because they are starved for bandwidth, none of the GPUs exceeds ~80 W of power usage either (so way less than the two 3090s I had earlier).

    Results are: 5.54 tokens/s, 4.47, 4.26, 6.25, avg: 5.13

    Everything else is the same, except that I added an EVGA XC3 3090 and an MSI Ventus 3X 3090. The Z170-E motherboard only has three PCIe 3.0 x16 (physical) slots, and the 2nd GPU blocks the 3rd slot so I can't really use it, which is why I ended up trying the x1 adapters.
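To sanity-check numbers like the ones above (per-GPU utilisation, VRAM and power draw) while a prompt is running, a small NVML polling loop is enough. A minimal sketch, assuming the nvidia-ml-py (pynvml) package and an NVIDIA driver with NVML support are installed:

```python
# Poll utilisation, VRAM and power for every NVIDIA GPU once per second.
# Assumes `pip install nvidia-ml-py` and a working NVIDIA driver.
import time
import pynvml

pynvml.nvmlInit()
try:
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    while True:
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu      # percent
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)                 # bytes
            power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0    # mW -> W
            print(f"GPU{i}: {util:3d}% util, "
                  f"{mem.used / 2**30:5.1f}/{mem.total / 2**30:5.1f} GiB, "
                  f"{power_w:6.1f} W")
        print("-" * 40)
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()
```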

  3. Here are my results:

    Hardware:
    CPU: Core i7 6700K
    Motherboard: Asus Z170-E
    RAM: 4x 16 GB DDR4-2400 (I think it's Kingston Fury — not really sure)
    SSD: Intel 670p 1 TB NVMe 3.0 x4 SSD
    PSU: HP 660185-001 900W 1U PSUs
    Parallel Miners ZSX Breakout board
    GPUs: 2x Gigabyte Vision RTX 3090 OC 24 GB (both at PCIe 3.0 x8 speeds, instead of the PCIe 3.0 x16 that the motherboard can provide to these otherwise PCIe 4.0 x16 GPUs)
    (and technically I have a HGST 3 TB SATA 6 Gbps 7200 rpm HDD in there as well, but that's for my games and this was/is my gaming rig (at least for a little while, although I don't really use it much as a gaming rig anymore))

    Software:
    OS: Win10 22H2
    Ollama: running with their Windows executable. Had issues with Ubuntu 22.04 LTS via WSL not being able to use both GPUs for this, for some strange reason, but the Windows ("native") version was able to, so I had to run this via the Windows command prompt rather than through open-webui. Go figure. (But codestral:22b is still able to run from open-webui, so I am not really sure what the issue is with the llama3.1:70b model.)

    Q1 – warmup
    prompt eval rate: 25.24 tokens/s, eval rate: 17.39 tokens/s
    Q2 – write me a program that generated fractals
    prompt eval rate: 442.09 tokens/s, eval rate: 16.99 tokens/s
    Q3 – Tell me a story about a cat in 4 thousand words, not characters or spaces. Words. Theme it like a pirate.
    prompt eval rate: 1352.27 tokens/s, eval rate: 16.74 tokens/s
    Q4 – write me a sentence and tell me how many words are in that sentence after you write it out.
    prompt eval rate: 13186.53 tokens/s, eval rate: 16.57 tokens/s

    Average tokens/s: 16.9225

    >>> /show info
    Model
    parameters 70.6B
    quantization Q4_0
    arch llama
    context length 131072
    embedding length 8192

    Parameters
    stop "<|start_header_id|>"
    stop "<|end_header_id|>"
    stop "<|eot_id|>"

    License
    LLAMA 3.1 COMMUNITY LICENSE AGREEMENT
    Llama 3.1 Version Release Date: July 23, 2024

    Power: ~285-300 W per GPU. It depends on whether I let it take a breather between the prompts or whether I am submitting the prompts in rapid succession. If I give it a bit of a breather, then the GPUs can cool down a little before they have to respond to the next prompt, which lowers power consumption a bit because they don't have to ramp the fans up to 100% (which eats into the 370 W max power that my 3090s are permitted to draw).

    VRAM usage: Also ~90%. (~22100 MB out of 24576 MB)

    It IS interesting to see that the 3090s, at least for this kind of LLM, aren't really all that much slower than dual 4090s, despite the cost and power differences. I thought that the 4090s would have outperformed the 3090s by a wider margin.

    Apparently I'm wrong.
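The ~44 GB of total VRAM use reported for llama3.1:70b lines up roughly with the quantization shown by /show info: GGUF Q4_0 stores 32 weights in 18 bytes (4-bit values plus an fp16 scale per block), i.e. about 4.5 bits per weight, with the remainder going to KV cache and compute buffers. A back-of-envelope sketch (the split between weights and overhead is an estimate, not a measurement):

```python
# Back-of-envelope VRAM estimate for llama3.1:70b at Q4_0 (70.6B parameters,
# per the /show info output above). GGUF Q4_0 packs 32 weights into 18 bytes,
# i.e. ~4.5 bits per weight.
params = 70.6e9
bits_per_weight = 4.5
weights_gib = params * bits_per_weight / 8 / 2**30
print(f"quantized weights: ~{weights_gib:.0f} GiB")  # ~37 GiB
# Observed usage was ~2 x 22.1 GB (about 44 GB total); the difference is
# KV cache, CUDA context and scratch buffers on each card.
```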

  4. Can you please test the Mistral Large model at Q5 or Q6? I understand it won't fit completely inside VRAM, but still, I'm curious how partial offloading will affect performance. Having such expensive cards, it is a shame to use them with small models or low quantization, so everything below 70b Q6 is a waste of money in my opinion. Thank you.
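On the partial-offload question above: Ollama will spill layers to the CPU automatically when a model doesn't fit in VRAM, and you can also cap the number of GPU-resident layers yourself with the num_gpu option. A minimal sketch via the REST API; the model tag and layer count here are illustrative placeholders, not tested values:

```python
# Sketch: partially offload a large model by capping the number of layers
# sent to the GPUs via Ollama's `num_gpu` option. Model tag and layer count
# are illustrative placeholders.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral-large:123b-instruct-2407-q6_K",  # hypothetical tag
        "prompt": "Warmup prompt.",
        "stream": False,
        "options": {"num_gpu": 40},  # keep 40 layers on GPU, rest on CPU
    },
    timeout=1200,
)
data = resp.json()
print(round(data["eval_count"] / (data["eval_duration"] / 1e9), 2), "tokens/s")
```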

  5. Maybe if you mount the CPU radiator and fans on the other side, you could even drill custom holes so the radiator sits high enough that you can still get a GPU in and tighten the screws for the GPUs. I would just use thumb screws for the GPUs so you don't need a screwdriver and can keep the radiator as low as possible.
    Edit: so I just watched the build vid and see how none of what I said would work lol. What I would do is take the top brace that the radiator is leaning against, drill some holes in it, and mount it between the fans and the radiator, just enough so you're not blocking any air. Then I would flip it all over and mount it to the other side, so the top where you mounted the metal brace is now the bottom. You can then screw that bracket wherever you want above the GPU mounting holes, and move the brace blocking the long GPUs to the top of the frame, replacing the one mounted to the radiator that's now on the other side. Maybe it might work, or it's just my spectrum brain running wild hahah

  6. Right on. I'm surprised the 1070 did as well as it did! I'll need to check out your benchmarks. Most users can only afford a single card, but it's clear that with an RTX 3090 24 GB card you can do a great job with the 8b models. I wonder how far they can go now.
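On the "how far can a single 24 GB card go" question: weight size scales with parameter count times bits per weight, so an 8b model at fp16 (~16 GB) fits with room to spare, while 4-bit quantization stretches a 24 GB card to models in the low-to-mid 30B range. A rough sketch of that arithmetic, where the 20% headroom allowance for KV cache and buffers is an assumption rather than a measured figure:

```python
# Rule-of-thumb: largest parameter count whose quantized weights fit on a
# 24 GB card, leaving ~20% headroom for KV cache, context and CUDA buffers.
VRAM_BYTES = 24 * 2**30
HEADROOM = 0.20  # assumption, not a measured figure

for name, bits_per_weight in [("fp16", 16), ("Q8_0", 8.5), ("Q4_0", 4.5)]:
    budget = VRAM_BYTES * (1 - HEADROOM)
    max_params = budget / (bits_per_weight / 8)
    print(f"{name:5s}: ~{max_params / 1e9:.1f}B parameters")
```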
