
LLaMA 3 “Hyper Speed” is INSANE! (Best Version Yet)


Matthew Berman

What happens when you power LLaMA with the fastest inference speeds on the market? Let’s test it and find out!

Try Llama 3 on TuneStudio – The ultimate playground for LLMs:
Referral Code – BERMAN (First month free)

Join My Newsletter for Regular AI Updates…


25 Comments

  1. Hey Matthew, not sure you saw my latest video, but I have corrections to your two math problems that REALLY help out the LLMs. The errors are a big reason they're having so many issues getting the right answer. Check out my latest video to see the corrections, and thanks for everything you do. I love your videos! (I'd put a link to the video here, but YouTube will hide my comment if I do, so just go to my latest video instead. Sorry about that extra friction.)

  2. The marble thing is probably just the result of reflection. Models often get stuff wrong because an earlier, more-or-less-random token pushes them down the wrong path. Models cannot self-correct during inference, but they can on a second iteration. So it probably spotted the incorrect reasoning of the first iteration and never generated the early tokens that pushed it down the wrong path again.

  3. For the microwave marble problem, would it be helpful if you were explicit in stating that the cup has no lid? Is it possible it doesn't quite understand that the cup is open?

  4. I tried dolphin llama3 and found it not really 'uncensored'. I asked it to tell me some dirty jokes and it wouldn't, and it kept insisting that it was there to be informative, not an entertainer.

  5. So, Matthew, I love your videos, thanks so much. But I think you need a small reality check with respect to Groq. The Groq machine you're using needs 576 Groq chips. In addition, in the current implementation, those 576 (expensive, power-hungry) chips need 144 CPUs (Xeons, expensive and power-hungry, at 4 Groq boards per Xeon; see the SemiAnalysis article). Those systems use (I don't know if "need" is the right word) 144 TB of DRAM in total (1 TB per Xeon), adding to the expense and power.

     Let's be generous and say each Groq chip/board costs $1,000 and each of those 144 servers costs $5,000 (more like $10,000, but we're being generous). That system ($576,000 + $720,000) is more than $1 million, probably pushing $2 million if you or I had to purchase the components, but let's round down to $1 million. It also needs HUGE amounts of power: 4 x 175 W of Groq boards plus the Xeon is roughly 1 kW per server, times 144 servers is at least 144 kW. (A rough version of this arithmetic is sketched after the comments.) For this cost and power you get, maybe, 10x the tokens per second of a single Nvidia Grace+Hopper system that costs $40,000 and dissipates maybe 2 kW (look on Nvidia's site).

     Yes, 800 tokens/sec is great, and there will be applications for that extreme inference speed, but it's completely out of reach for most use cases. It just is. Yes, you could multiplex this $1 million system across, say, 100 users, but you can multiplex the Grace+Hopper too. And we're not even talking about Blackwell here. No, I'm not an Nvidia employee, nor am I an Nvidia fanboy; I think even Nvidia Grace+Hopper, let alone Blackwell, isn't the right solution for inference: it's overpowered for inference. Take a look at what Cerebras is doing: they have partnered with Qualcomm for their inference solution.

     Unless Groq can demonstrate very good performance for training, I don't think they have a good value proposition in the market. They might sell some systems to the government, where staying ahead of bad guys is important regardless of cost. On the other hand, if Groq implements their chips in 3 nm, the comparison changes somewhat in their favor: a 4x density improvement would mean 576/4 = 144 chips instead of 576. That's still not a good value proposition (144 Groq chips vs. 1 Nvidia chip?). They lose to Nvidia and other rational inference platforms.

     Have you run models on a MacBook M3 Max with 64 GB or more of memory? I can get over 30 tokens/sec on an M3 Max with 64 GB of memory running 4-bit quantized Mixtral-8x7b-instruct. As you know, Mixtral is an excellent model, and that inference speed is very, very good. For LLaMa-3-70b at Q4 I can get 8.5 tokens/sec today (a Xeon+A100 gives me 26 tokens/sec). All of this is with llama.cpp. That's quite usable. And the M4 with 128 GB of memory will run the LLaMa-3-405b model at Q4; I can't predict how well, but it will probably run at a few tokens/second. I suspect fine-tuned 70B or even MoE models will keep getting better until they are good enough for edge purposes (you and me). Smartphones, at least iPhones (since they are essentially cut-down MacBooks), can run LLaMa-8b and models like Phi-2 and stablelm-2-1.6b right now. Android phones won't be left behind.

     Groq might find its niche, but it's a niche as of now. I know it's fun to watch responses virtually leap from your screen, but we must be reasonable: you cannot afford to use Groq if they actually charge you what it costs. 🙂 Unless I'm wrong. If I am, I very much invite someone to enlighten me! Thanks.

  6. Could the multi-inference output options be serving you a random version of any one of its answers? That doesn't, however, explain why it's inconsistent when it explains the physics of what the marble does. Very bizarre…

  7. A new logic/reasoning question for you to test that is very hard for LLMs (a brute-force check is sketched after the comments):

    Solve this puzzle:
    Puzzle: There are three piles of matches on a table – Pile A with 7 matches, Pile B with 11 matches, and Pile C with 6 matches. The goal is to rearrange the matches so that each pile contains exactly 8 matches.
    Rules:
    1. You can only add to a pile the exact number of matches it already contains.
    2. All added matches must come from one other single pile.
    3. You have only three moves to achieve the goal.
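
For anyone who wants to verify that the puzzle in comment 7 is solvable within its rules, here is a minimal brute-force search in Python. It is only a sketch: the pile labels, the move encoding, and the solve() helper are illustrative choices, not part of the original question.

```python
# Brute-force check for the match-pile puzzle in comment 7 (piles of 7, 11, 6;
# goal is 8 in each pile). A legal move doubles a pile by taking exactly that
# pile's current count from one other pile, and only three moves are allowed.

from itertools import permutations

def solve(piles=(7, 11, 6), goal=(8, 8, 8), max_moves=3):
    """Return a list of (source, destination, amount) moves, or None."""
    def search(state, moves):
        if state == goal:
            return moves
        if len(moves) == max_moves:
            return None
        for dst, src in permutations(range(3), 2):
            amount = state[dst]              # must add exactly what dst holds
            if amount == 0 or state[src] < amount:
                continue                     # source pile can't cover the move
            nxt = list(state)
            nxt[src] -= amount
            nxt[dst] += amount
            found = search(tuple(nxt),
                           moves + [("ABC"[src], "ABC"[dst], amount)])
            if found:
                return found
        return None

    return search(piles, [])

print(solve())  # [('B', 'A', 7), ('A', 'C', 6), ('C', 'B', 4)]
```

It finds the three-move solution B→A (add 7), A→C (add 6), C→B (add 4), leaving 8 matches in every pile.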
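
The cost and power figures in comment 5 are back-of-the-envelope estimates, so here is a small Python sketch that simply redoes that arithmetic. Every constant (chip count, prices, wattages, the per-server overhead) is taken from, or assumed to match, the commenter's numbers rather than verified hardware specs.

```python
# Reproduces the rough cost/power estimate from comment 5.
# All constants are the commenter's assumptions, not verified specifications.

GROQ_CHIPS = 576                 # chips in the deployment described
BOARDS_PER_XEON = 4              # Groq boards per host CPU
SERVERS = GROQ_CHIPS // BOARDS_PER_XEON   # 144 host servers

CHIP_COST = 1_000                # assumed $ per Groq chip/board ("generous")
SERVER_COST = 5_000              # assumed $ per host server ("generous")

BOARD_WATTS = 175                # assumed W per Groq board
SERVER_OVERHEAD_WATTS = 300      # assumed Xeon + DRAM, to reach ~1 kW/server

chip_total = GROQ_CHIPS * CHIP_COST        # $576,000
server_total = SERVERS * SERVER_COST       # $720,000
system_cost = chip_total + server_total    # ~ $1.3M, i.e. "more than $1 million"

watts_per_server = BOARDS_PER_XEON * BOARD_WATTS + SERVER_OVERHEAD_WATTS
system_power_kw = SERVERS * watts_per_server / 1_000   # ~144 kW

print(f"Estimated system cost:  ${system_cost:,}")
print(f"Estimated system power: {system_power_kw:.0f} kW")
```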
