
LLaMA 3 “Hyper Speed” is INSANE! (Best Version Yet)


Matthew Berman

What happens when you power LLaMA with the fastest inference speeds on the market? Let’s test it and find out!

Try Llama 3 on TuneStudio – The ultimate playground for LLMs:
Referral Code – BERMAN (First month free)

Join My Newsletter for Regular AI Updates…


25 Comments

  1. Hey Matthew, not sure you saw my latest video, but I have corrections to your two math problems that REALLY help out the LLMs. The errors are a big reason they're having so many issues getting the right answer. Check out my latest video to see the corrections, and thanks for everything you do. I love your videos! (I'd put a link to the video here, but YouTube will hide my comment if I do, so just go to my latest video instead. Sorry about that extra friction.)

  2. The marble thing is probably just the result of reflection. Models often get stuff wrong because an earlier, more-or-less-random token pushes them down the wrong path. Models cannot self-correct during inference, but they can on a second iteration. So it probably spotted the incorrect reasoning of the first iteration and never generated the early tokens that pushed it down the wrong path again.

  3. For the microwave marble problem, would it be helpful if you were explicit in stating that the cup has no lid? Is it possible it doesn't quite understand that the cup is open?

  4. I tried dolphin llama3 and found it not really 'uncensored'. I asked it to tell me some dirty jokes and it wouldn't, and it kept insisting that it was there to be informative, not an entertainer.

  5. So, Matthew, I love your videos, thanks so much. But I think you need a small reality check with respect to Groq. The Groq machine you're using needs 576 Groq chips. In addition, in the current implementation, those 576 (expensive, power-hungry) chips need 144 CPUs (Xeons, expensive and power-hungry, at 4 Groq boards per Xeon; see the SemiAnalysis article). Those systems use (I don't know if "need" is the right word) 144 TB of DRAM in total (1 TB per Xeon), adding to the expense and power.

     Let's be generous and say each Groq chip/board costs $1,000 and each of those 144 servers costs $5,000 (more like $10,000, but we're being generous). That system ($576,000 + $720,000) is more than $1 million, probably pushing $2 million if you or I had to purchase the components, but let's round down to $1 million. It also needs HUGE amounts of power: 4 x 175 W of Groq boards plus the Xeon is roughly 1 kW per server, times 144 servers is at least 144 kW. (A rough version of this arithmetic is sketched after the comments.) For this cost and power you get, maybe, 10x the tokens per second of a single Nvidia Grace+Hopper system that costs $40,000 and dissipates maybe 2 kW (look on Nvidia's site).

     Yes, 800 tokens/sec is great, and there will be applications for that extreme inference speed, but it's completely out of reach for most use cases. It just is. Yes, you could multiplex this $1 million system across, say, 100 users, but you can multiplex the Grace+Hopper too. And we're not even talking about Blackwell here. No, I'm not an Nvidia employee, nor am I an Nvidia fanboy; I think even Nvidia Grace+Hopper, let alone Blackwell, isn't the right solution for inference: it's overpowered for inference. Take a look at what Cerebras is doing: they have partnered with Qualcomm for their inference solution.

     Unless Groq can demonstrate very good performance for training, I don't think they have a good value proposition in the market. They might sell some systems to the government, where staying ahead of bad guys is important regardless of cost. On the other hand, if Groq implements their chips in 3 nm, the comparison changes somewhat in their favor: a 4x density improvement would mean 576/4 = 144 chips instead of 576. That's still not a good value proposition (144 Groq chips vs. 1 Nvidia chip?). They lose to Nvidia and other rational inference platforms.

     Have you run models on a MacBook M3 Max with 64 GB or more of memory? I can get over 30 tokens/sec on an M3 Max with 64 GB of memory running 4-bit quantized Mixtral-8x7b-instruct. As you know, Mixtral is an excellent model, and that inference speed is very, very good. For LLaMa-3-70b at Q4 I can get 8.5 tokens/sec today (a Xeon+A100 gives me 26 tokens/sec). All of this is with llama.cpp. That's quite usable. And the M4 with 128 GB of memory will run the LLaMa-3-405b model at Q4; I can't predict how well, but it will probably run at a few tokens/second. I suspect fine-tuned 70B or even MoE models will keep getting better until they are good enough for edge purposes (you and me). Smartphones, at least iPhones (since they are essentially cut-down MacBooks), can run LLaMa-8b and models like Phi-2 and stablelm-2-1.6b right now. Android phones won't be left behind.

     Groq might find its niche, but it's a niche as of now. I know it's fun to watch responses virtually leap from your screen, but we must be reasonable: you cannot afford to use Groq if they actually charge you what it costs. 🙂 Unless I'm wrong. If I am, I very much invite someone to enlighten me! Thanks.

  6. Could the multi-inference output options be serving you a random version of any one of its answers? That doesn't, however, explain why it's inconsistent when it explains the physics of what the marble does. Very bizarre…

  7. A new logic/reasoning question for you to test that is very hard for LLMs (a brute-force check is sketched after the comments):

    Solve this puzzle:
    Puzzle: There are three piles of matches on a table – Pile A with 7 matches, Pile B with 11 matches, and Pile C with 6 matches. The goal is to rearrange the matches so that each pile contains exactly 8 matches.
    Rules:
    1. You can only add to a pile the exact number of matches it already contains.
    2. All added matches must come from one other single pile.
    3. You have only three moves to achieve the goal.
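
For anyone who wants to verify that the puzzle in comment 7 is solvable within its rules, here is a minimal brute-force search in Python. It is only a sketch: the pile labels, the move encoding, and the solve() helper are illustrative choices, not part of the original question.

```python
# Brute-force check for the match-pile puzzle in comment 7 (piles of 7, 11, 6;
# goal is 8 in each pile). A legal move doubles a pile by taking exactly that
# pile's current count from one other pile, and only three moves are allowed.

from itertools import permutations

def solve(piles=(7, 11, 6), goal=(8, 8, 8), max_moves=3):
    """Return a list of (source, destination, amount) moves, or None."""
    def search(state, moves):
        if state == goal:
            return moves
        if len(moves) == max_moves:
            return None
        for dst, src in permutations(range(3), 2):
            amount = state[dst]              # must add exactly what dst holds
            if amount == 0 or state[src] < amount:
                continue                     # source pile can't cover the move
            nxt = list(state)
            nxt[src] -= amount
            nxt[dst] += amount
            found = search(tuple(nxt),
                           moves + [("ABC"[src], "ABC"[dst], amount)])
            if found:
                return found
        return None

    return search(piles, [])

print(solve())  # [('B', 'A', 7), ('A', 'C', 6), ('C', 'B', 4)]
```

It finds the three-move solution B→A (add 7), A→C (add 6), C→B (add 4), leaving 8 matches in every pile.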
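
The cost and power figures in comment 5 are back-of-the-envelope estimates, so here is a small Python sketch that simply redoes that arithmetic. Every constant (chip count, prices, wattages, the per-server overhead) is taken from, or assumed to match, the commenter's numbers rather than verified hardware specs.

```python
# Reproduces the rough cost/power estimate from comment 5.
# All constants are the commenter's assumptions, not verified specifications.

GROQ_CHIPS = 576                 # chips in the deployment described
BOARDS_PER_XEON = 4              # Groq boards per host CPU
SERVERS = GROQ_CHIPS // BOARDS_PER_XEON   # 144 host servers

CHIP_COST = 1_000                # assumed $ per Groq chip/board ("generous")
SERVER_COST = 5_000              # assumed $ per host server ("generous")

BOARD_WATTS = 175                # assumed W per Groq board
SERVER_OVERHEAD_WATTS = 300      # assumed Xeon + DRAM, to reach ~1 kW/server

chip_total = GROQ_CHIPS * CHIP_COST        # $576,000
server_total = SERVERS * SERVER_COST       # $720,000
system_cost = chip_total + server_total    # ~ $1.3M, i.e. "more than $1 million"

watts_per_server = BOARDS_PER_XEON * BOARD_WATTS + SERVER_OVERHEAD_WATTS
system_power_kw = SERVERS * watts_per_server / 1_000   # ~144 kW

print(f"Estimated system cost:  ${system_cost:,}")
print(f"Estimated system power: {system_power_kw:.0f} kW")
```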
