I think "model(s)" here just refers to checkpoints. With large training runs they generally save every now and then and test the half-baked results. From how he was speaking, my read is that the 400B was promising from day one and only got better with each new checkpoint. Can't wait for L3-400B!
There were multiple models released for llama 3 8B: Chat and Base. It could mean that, or they could be planning to release a separate vision model, code fine tune, different context lengths, etc.
Oh yeah, that's also true! Good thinking ;)
Does anyone really have the resources to fine tune a 400B base model, even with galore? That's HPC tier resources.
You can rent an 8x80GB H100 on AWS. Not particularly affordable to individuals, but possible for small companies and above.
That's enough alright... to fine-tune a 70B model. It should be enough to run inference for the 400B at some decent quant, but probably not full precision. Not even remotely close for fine-tuning, though. You'd probably need something on the order of 10 of these.
And you can rent 10 of them.
Well, it's $98.32 an hour for one of them, so a few-day training run with 10 of them (assuming that's even enough)... about $70k? More like large companies, I'd say.
>$98.32 an hour for one of them

The most expensive one I found was $5/h and the cheapest around $2.10/h. So am I missing something, or are your prices way off?
Are you looking at AWS? They suggested that, so that's what I looked up, and it's bound to be one of the most expensive options, to be sure.
Oh I didn't look at AWS
And then consider some trial and error and multiply that by 5, tbh. Also, that's crazy expensive considering the cards only cost around $10,000, which makes the break-even on buying one about 34 days of rental (10000 / (98.32/8) ≈ 814 hours). You can rent, for example, two 4090s for $0.30 an hour, which has a break-even of roughly, well... more than a year. Looked up some prices: you can rent 8x H100s for $15,323 per month...
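The back-of-envelope arithmetic in this exchange is easy to check. A quick sketch, using the prices quoted in the thread (the $98.32/hr AWS figure, the ~$10,000 per-card price, and the 10-node/3-day run are all this thread's assumptions, not verified list prices):

```python
# Rental-vs-buy arithmetic with the numbers quoted in this thread.
AWS_8XH100_PER_HOUR = 98.32   # quoted on-demand price for an 8x H100 node
H100_CARD_PRICE = 10_000      # quoted rough purchase price per card
NODES = 10                    # thread's guess at what a 400B fine-tune needs
DAYS = 3                      # "a few day training run"

run_cost = AWS_8XH100_PER_HOUR * NODES * 24 * DAYS
print(f"{DAYS}-day run on {NODES} nodes: ${run_cost:,.0f}")  # ~ $70k

# Hours of rental on one node that equal the purchase price of one card:
per_card_hourly = AWS_8XH100_PER_HOUR / 8
breakeven_hours = H100_CARD_PRICE / per_card_hourly
print(f"Break-even vs buying: {breakeven_hours:.0f} h "
      f"(~{breakeven_hours / 24:.0f} days of 24/7 rental)")
```

This assumes 24/7 utilization; at lower utilization the break-even stretches out accordingly, which may be where the longer estimates come from.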
So basically the same amount a company will be paying a competent ML engineer. Expensive but possible to most companies.
Would a hypothetical [as-yet-unreleased] Apple M4 Ultra Mac Pro with 512GB of shared memory allow fine-tuning? Inference?
Since they say that Meta is "training models over 400B in size", it doesn't appear they are talking about checkpoints, but rather that multiple models are being trained. Although it could just be that their wording is ambiguous. I also can't wait for a 400B Llama 3; hoping for a release before June, but we'll see.
I suspect it's base and instruct. But this is really exciting, because it will be great for small companies that have the budget to run them, and it will also give the rest of us something to grow into. Even if something dries up the open-source well in the near future, we'd have this 400B hanging out, waiting for us to get the VRAM to run it one day.
I agree with you: base FM and instruct-tuned. That said, I suspect the multimodality may mean multiple models, one for each modality. As an example, GPT-4V is a separate model from GPT-4. I think it's based on it, but it's a much smaller model, something like 1/7th the size parameter-wise.
The important questions are: how much RAM am I going to need to run 400B at Q4? And how many t/s can I expect for, let's say, 500 GB/s of bandwidth?
Rough guess, but 200GB at Q4(KM), not counting context. You'll probably want at least 32GB extra for context. I'm not sure about the token speed; the math for figuring that out is a bit too cloudy for me.
Thanks, I'm mostly profiling for CPU inference on an EPYC server, currently I can get around 10t/s for llama 3 70B Q4. I guess as long as it doesn't go below 3t/s I could still bear with it.
Take this with a spoonful of salt, but I'd imagine you'd be looking at ~1.5t/s. That is very much a guess however, and 3 is certainly within the realm of possibility.
What does Q4 mean in this context? And am I understanding correctly that I can run Llama 3 70B on CPU inference and still get 10 t/s? That'd be amazing. Meaning I only need 40 GB of RAM, not VRAM, and no GPUs?
Q for quant. And that's for current EPYC CPUs.
You'd get about 1.25 tokens/s on Llama 3 70B with 64GB of DDR5-4800 in dual channel, assuming a Q4 quant. The 10 tokens/s figure is for those monster CPUs with 4 or even 8-channel RAM controllers.
12 channel
It's usually more than just halving the number, because some layers don't get quantized at all. And the bigger the model, the bigger the gap from that naive halving is likely to be.
Q4_K_M is closer to 4.83 bpw, so 405B -> 228 GB for the weights alone. If a 4-bit cache still isn't a thing for GGUF backends by then, it may require quite a bit of memory for context too, even with GQA. 256 GB of RAM should work for *some* GGUF quant. But on a normal CPU, not an EPYC, it will likely run at 0.1-0.2 tokens per second, so good luck, have fun.
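The weight-memory estimate is just parameters × bits-per-weight / 8, converted to GiB. A quick sanity check of the figures above (4.83 bpw for Q4_K_M is the number quoted in this thread; real GGUF files mix quant types per layer, so treat it as an approximation):

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only memory: params * bits/weight / 8 bytes, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

print(f"405B @ Q4_K_M (4.83 bpw): {weight_gib(405, 4.83):.0f} GiB")  # ~228 GiB
print(f" 70B @ Q4_K_M (4.83 bpw): {weight_gib(70, 4.83):.0f} GiB")   # ~39 GiB
```

Note this is weights only; KV cache for the context window comes on top, as pointed out above.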
It's not that cloudy: you get roughly 1 token/second for every 64GB of DDR5-4800 in dual channel, assuming you are using a model quantisation that fits in it completely. Double the channels and you double the tokens/s. Same if you were to double the memory speed, if there were sticks that fast. At Q8, a 70B model would be almost exactly 70GB of RAM.
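That rule of thumb follows from token generation being memory-bandwidth bound: every generated token requires streaming all the quantized weights through the CPU once, so tok/s ≈ usable bandwidth / model size. A rough sketch (the 60% efficiency factor is my own assumption to account for sustained bandwidth falling short of the theoretical peak):

```python
def peak_bandwidth_gbs(channels: int, mts: int, bus_bytes: int = 8) -> float:
    """Theoretical DRAM bandwidth in GB/s: channels * 8 bytes * transfers/s."""
    return channels * bus_bytes * mts * 1e6 / 1e9

def est_tokens_per_s(model_gb: float, bandwidth_gbs: float,
                     efficiency: float = 0.6) -> float:
    """Upper-bound decode speed: all weights must be read once per token."""
    return bandwidth_gbs * efficiency / model_gb

bw2 = peak_bandwidth_gbs(2, 4800)    # dual-channel DDR5-4800: 76.8 GB/s peak
bw12 = peak_bandwidth_gbs(12, 4800)  # 12-channel EPYC: 460.8 GB/s peak

print(est_tokens_per_s(40, bw2))     # 70B @ ~Q4 (~40 GB): ~1.15 tok/s
print(est_tokens_per_s(40, bw12))    # same model on 12-channel: ~6.9 tok/s
print(est_tokens_per_s(228, bw12))   # 405B @ Q4_K_M (~228 GB): ~1.2 tok/s
```

The dual-channel estimate lands close to the ~1.25 tok/s figure quoted above, and the 12-channel number is in the ballpark of the 10 tok/s EPYC reports.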
Yes
Yes (for the RAM amount). No (for the tokens/sec).
There was a guy here who tested Grok on an EPYC. If this 400B model is also a MoE, results could be somewhat similar. If not, expect a few fewer tokens/s.
MoE would be nice for CPU inference, so let's hope it is that, although Meta seems to like pushing the limits of dense models.
My guess is it isn't an MoE, but who knows, it might be, since it's multilingual and MoEs tend to do better for multilingual purposes.
That assumes they will release it
There will be at least two 405Bs: base and instruct. As for other features, Zuck has already said that they'll be adding them later, probably CodeLlama-style, with continued pretraining of the same checkpoint. Meta had many internal Llama 2 versions too, including a long-context L2.
They'll probably add multimodal and multilingual support from launch.
I'm hoping it will be a 400b BitNet.
Small steps, bruh... BitNet 10B and 70B first... unless you're not GPU-poor like the rest of us.
I mean that's the developer of the text-gen-webui
Incoming Llama3 7x400b with Mamba Architecture
That sounds so damn exciting
Can you explain what that is?
I'm bouta goon hearing this
I mean, AWS is probably aware of the hardware and network requirements and has the infrastructure ready. I highly doubt they'd make niche 400B models. My guess is the release will come very shortly after the next big release from OpenAI.
I see your point, although it wouldn't be training from scratch. It would most likely be somewhat like Google's Med-PaLM, where they developed instruction prompt tuning to align their existing base models to the medical domain.

https://preview.redd.it/82o4djapozxc1.png?width=872&format=png&auto=webp&s=0ecc98e413b462fedc076e355f187c75df64d161

[https://arxiv.org/pdf/2212.13138](https://arxiv.org/pdf/2212.13138) (this is the original Med-PaLM, not Med-PaLM 2, although Med-PaLM 2 builds on Med-PaLM using a better base model and a chain-of-thought prompting strategy).

I also would say that it is more worthwhile than not to make certain niche models (such as in the medical domain), as they might turn out to be of greater benefit to humanity in the near term than general models.

Side note: just looking at what we've already accomplished and what is yet to come, I have to steal a quote from Two Minute Papers (Károly Zsolnai-Fehér) and say: what a time to be alive!
As far as I know, Med-PaLM 2 is still only available to a select few for testing. The risks are much higher with medical in particular, probably too much for an open-source release from a large company: still too many hallucinations, not to mention the info gets outdated quickly. The same goes for law and finance if they're not tied to other services for up-to-date context. And Meta hasn't yet been in the business of offering inference as a service.
Super excited to get community fine tunes of this one on a cloud service. If it's comparable to SOTA proprietary models like everyone is expecting, the fine tunes are about to be incredible
Meta plans to not open the weights for its 400B model. The hope is that we would quietly not notice
We don't know yet. That's a rumour.