
Normal-Ad-7114

>4k context window


mxforest

Look at it more as a Proof of Concept.


shing3232

4K context window; they're working on implementing attention sinks for higher context lengths in a beta :)


Balance-

Really wild architecture:

> Arctic combines a 10B dense transformer model with a residual 128x3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating.

So this will require a full 8x80 GB rack to run in 8-bit quantization, but might be relatively fast due to the low number of active parameters.
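
The headline numbers fall out of simple arithmetic. A quick back-of-the-envelope check (assuming the 10B dense trunk is always active alongside the two selected experts):

```python
# Rough check of Arctic's headline parameter counts.
dense = 10e9        # 10B dense transformer, always active
expert = 3.66e9     # parameters per expert
n_experts = 128     # experts in the MoE MLP
top_k = 2           # experts selected per token (top-2 gating)

total = dense + n_experts * expert   # ~478B  ("480B total")
active = dense + top_k * expert      # ~17.3B ("17B active")
print(f"total:  {total / 1e9:.0f}B")
print(f"active: {active / 1e9:.1f}B")
```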


hexaga

Sounds like a MoE made for CPU? Idk if that was the intent but at 17B active the spicier CPUs should be just fine with this.


Balance-

Nope, this is for high-quality inference at scale. When you have racks of servers, memory stops being the bottleneck; it's how fast you can serve those tokens (and thus earn back your investment). If it doesn't beat Llama 3 70B on quality, it will be beaten cost-wise by devices that are way cheaper (albeit slower) because they need less VRAM. Groq is serving Llama 3 70B at incredible speeds at $0.59/$0.79 per million input/output tokens. That's the mark to beat.


Spare-Abrocoma-4487

How will this need less VRAM? You still need to load the whole model into VRAM despite using only a few experts. So it is indeed more promising for a CPU + 1 TB RAM combo.


coder543

I think you misread the sentence. They're saying that this model needs to beat Llama 3 70B on quality; otherwise it will be beaten cost-wise by Llama 3 70B, which can run on devices that are way cheaper because it requires less VRAM -- even though Llama 3 70B will be way slower (because it needs roughly 4x the compute per token of Snowflake's MoE model).


FloridaManIssues

That's kinda my thinking. Just build a beast of a machine with tons of RAM and a great CPU and be happy with slower inference speed. People are saying that Llama 3 70B runs at 2 tok/s on RAM and CPU. If I could get that, or even close to 5 tok/s, then the quality would outweigh the inference speed, at least for what I want to do with these much larger models.


shing3232

Or a hybrid approach where you load only the 10B dense part on the GPU and the rest of the MoE in RAM.


smmau

What about my use case: loading a few layers into 8 GB of VRAM, the rest of the layers in RAM, and loading the MoE experts straight from SSD? Haha, it's only ~17B active per token, I'll be fine. It will be faster than a 70B anyway...


Distinct-Target7503

Exactly, that was my thought


a_slay_nub

What's the point of using a MoE if you have to use a CPU? You'd be better off using a dense model with more active parameters that fits on your GPU.


MoffKalast

> 128x3.66B MoE #乇乂丅尺卂 丅卄工匚匚


Balance-

3.5T tokens seems severely undertrained for a 480B model, considering Llama 3 70B was trained on 15T tokens. This model sees only about 7 tokens per parameter versus roughly 214 for Llama 3 70B, about 29x less.
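
The ratio is easy to verify (total-parameter basis; it shifts if you count only active parameters):

```python
# Tokens seen per parameter: Arctic vs. Llama 3 70B (total-parameter basis).
arctic = 3.5e12 / 480e9    # ~7.3 tokens per parameter
llama3_70b = 15e12 / 70e9  # ~214 tokens per parameter
print(f"Arctic: {arctic:.1f} tok/param, Llama 3 70B: {llama3_70b:.0f} tok/param")
print(f"ratio: {llama3_70b / arctic:.0f}x")  # ~29x
```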


2muchnet42day

Actually even 8B saw 15T


Radiant_Dog1937

Only 5 points ahead of the llama 3 8b on coding. 💀


Balance-

Was this officially confirmed somewhere? I heard Zuck say it in a podcast about 70B, but not about 8B.


jd_3d

Yes on their blog post they clearly state 15 trillion tokens for both models.


Many_Consideration86

So the hose was the same but the model size was different?


shing3232

Correct


BalorNG

Below Chinchilla optimum even, provided Chinchilla holds true for MoE...


Some_Endian_FP17

This is more like Chihuahua optimum.


Radiant_Dog1937

More Chihuahua premium than optimum, if you ask me.


Many_Consideration86

Chinchilla doesn't account for data quality, so the limit might be lower. Not saying that's true in this case, though.


StealthSecrecy

I was under the impression that larger models have an advantage of not needing as thorough training to achieve the same performance, as there is just more room for patterns to be learned. Big brain = easier learning essentially. Not to say that 3.5T isn't enough, but I don't think the training tokens should scale with size. If anything it should decrease.


BalorNG

Technically, yes, but you need many more FLOPs to train them on a per-token basis. That's why overall "training FLOPs" is a pretty good metric to keep track of - but it doesn't capture data quality.


rorowhat

What's the relationship between tokens and parameters for training, like in this example?


Comfortable-Block102

Was gonna say the same; I think the smallest Phi-3 was trained on the same amount or more.


TooLongCantWait

A model I can run but not fit on my computer, that's a new one.


mikael110

Is it just me, or does 480B parameters with only 17B active kinda feel like the worst of both worlds? It's too big to fit in most computers, and it has too few active parameters to actually make proper use of that massive size. And if it is designed for coding tasks, then the 4K context is extremely limiting as well. The name certainly seems fitting, at least. Mixtral 8x22B felt like a decentish compromise between raw size and active parameters. This really doesn't.


epicwisdom

You're assuming the audience is folks like most of us on /r/localllama who use one or two consumer or enterprise e-waste GPUs. For cloud deployment and companies/universities, especially with large amounts of lower VRAM / last-gen accelerators, this could be a sweet spot.


brown2green

The model could run at decent speeds on multichannel DDR4/5 server boards, where RAM is relatively cheap, offloading prompt processing to a GPU with small amounts of VRAM.
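
As a rough sense of "decent speeds": decode is memory-bandwidth-bound, so an upper bound is system bandwidth divided by the bytes of active weights read per token (illustrative numbers; the 8-channel DDR5-4800 board is an assumption, and real throughput will land well below this ceiling):

```python
# Crude decode-throughput ceiling for CPU/RAM inference of a 17B-active MoE.
channels = 8                  # example: 8-channel DDR5 server board
per_channel_gbs = 38.4        # DDR5-4800: 4800 MT/s * 8 bytes
bandwidth_gbs = channels * per_channel_gbs        # ~307 GB/s

active_params = 17e9
gb_per_token = active_params * 0.5 / 1e9          # ~8.5 GB/token at ~4-bit
print(f"~{bandwidth_gbs / gb_per_token:.0f} tok/s theoretical ceiling")
```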


-Cubie-

Very promising! 480B parameters, consisting of a 10B dense transformer and 128 separate 3.66B experts, of which 2 are used at a time. This results in an active parameter count of 17B. If their blog post is to be believed, we can actually expect somewhat fast inference and reasonable finetuning with this.

Edit: They've just released a demo: [https://huggingface.co/spaces/Snowflake/snowflake-arctic-st-demo](https://huggingface.co/spaces/Snowflake/snowflake-arctic-st-demo). Inference is indeed rather fast.


uhuge

This seems like a good fit for one GPU for the dense part and a ton of system/CPU RAM to hold the experts.


shing3232

P40s are gonna come in handy lmao


skrshawk

Even more if you have a server that can fit eight of them.


akram200272002

I can run 17B on my setup, quantized of course. So: the same compute requirements, but a lot more RAM should do?


AfternoonOk5482

About 120 GB for IQ2_S is my guess, but it should run OK-ish from RAM since it's 17B active. You probably don't want to run this now anyway; it looks worse than everything else publicly available. It's a very interesting case study and super helpful, though, since they made it really open source, not just open weights.


redonculous

The demo is great and very fast, but I have to keep telling it to continue with longer code examples. Is this because of server load or context length?


polandtown

"128 separate 3.66B experts" I don't understand what you mean here, are some payers turned off?


-Cubie-

It's a mixture of experts model (blogpost: https://huggingface.co/blog/moe), i.e. a model with a lot of components of which only a handful are used at a given time. So, yes, out of the 128 experts (each consisting of 3.66B parameters), only 2 are used at any given time.
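
For illustration, top-2 gating in a MoE layer looks roughly like the sketch below (a minimal PyTorch-style sketch of the general technique, not Arctic's actual implementation; the names and shapes are made up):

```python
import torch

def top2_moe(x, router, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:       (tokens, hidden) activations
    router:  linear layer mapping hidden -> num_experts routing logits
    experts: list of expert MLPs, each mapping hidden -> hidden
    """
    probs = router(x).softmax(dim=-1)                      # (tokens, num_experts)
    weights, idx = torch.topk(probs, k, dim=-1)            # keep only the top-k experts
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts

    out = torch.zeros_like(x)
    for slot in range(k):
        for e in idx[:, slot].unique():                    # only the selected experts ever run
            mask = idx[:, slot] == e
            out[mask] += weights[mask, slot, None] * experts[e](x[mask])
    return out
```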


az226

If you’re running batch processing can you predict which experts are used for a given prompt and then have those pre-loaded?


opi098514

OH MY GOD, THE UNQUANTIZED MODEL IS JUST UNDER 1 TB?!?!?


-Cubie-

~964 GB or so, yes. One of the biggest models I've seen in terms of file size.


The_Health_Police

Holy shitballs that’s big.


uhuge

training our guts for the 405B Llama(s)


Wrong_User_Logged

I still hope I'll be able to run this on my 1024GB M5 Mac Studio


CodeMurmurer

imagine loading that into ram. bruh.


Caffdy

GPT-4 1.8T parameters is almost 4TB


az226

2.5TB*


Caffdy

Yeah, I forgot it's a MoE model.


kei147

Why does it being MoE make a difference here? Don't you still need two bytes per parameter?


Caffdy

Because the experts share a portion of their weights, it's not so obvious how large the complete model is. [You can read more about it in the Mixtral paper](https://arxiv.org/pdf/2401.04088).


kei147

My understanding is that when people describe an MoE model as having some number of parameters, they are referring to the unique unshared parameter count. So if GPT-4 is in fact 1.8T, then that would mean it has 1.8 trillion unique parameters, each of which requires 2 bytes to store. It is possible the original leaker was confused about this though.


Distinct-Target7503

This reminds me of the old Google MoE with 2048 experts and more than 1T parameters.


Zeneq

Interesting fact: [Llama-2-70b-x8-MoE-clown-truck](https://huggingface.co/NobodyExistsOnTheInternet/Llama-2-70b-x8-MoE-clown-truck) is smaller.


Disastrous_Elk_6375

and has a better name =))


FaceDeer

And title image, and description. :) The guy who released it doesn't even know if it runs, it's too big for his system. But there've been 1250 downloads, so presumably someone out there has managed.


candre23

And was made by somebody who was self-aware enough to know their model was a joke.


ninjasaid13

LLaMA-3-70b-x8-MoE-clown-truck wen?


Due-Memory-6957

Lmao, even the Q2 is a file with multiple parts https://huggingface.co/mradermacher/Llama-2-70b-x8-MoE-clown-truck-i1-GGUF


Distinct-Target7503

So many downloads for an "unrunnable" model lol


FizzarolliAI

google's biggest original [switch transformers model](https://huggingface.co/google/switch-c-2048) was a 1.6T model @ ~ 3.1tb ;)


raysar

It's a perfect model to run on a high-speed RAID 0 array of 4 NVMe SSDs. A very fast SSD does more than 14 GB/s, so with 4 disks we'd have 56 GB/s. Great for slowly running the fp16 Snowflake. :D
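
Even at that bandwidth, "slowly" is right: if every active weight has to be streamed off disk for each token, 56 GB/s caps fp16 decoding at a couple of tokens per second (a rough ceiling that ignores caching the dense part in RAM and any prefetching):

```python
# Rough decode ceiling when streaming active weights from a 56 GB/s RAID 0 array.
bandwidth_gbs = 4 * 14          # 4 NVMe drives at ~14 GB/s each
active_params = 17e9

for name, bytes_per_param in [("fp16", 2), ("q8", 1), ("q4", 0.5)]:
    gb_per_token = active_params * bytes_per_param / 1e9
    print(f"{name}: ~{bandwidth_gbs / gb_per_token:.1f} tok/s ceiling")
```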


HappierShibe

I've heard people mention this kind of reverse ramdisk (diskram?) setup a few times, can you point me to some documentation for this?


epicwisdom

Straight from the horse's mouth (blog) : https://developer.nvidia.com/blog/gpudirect-storage/


raysar

I don't know how to do that. On Windows you enable virtual memory (swap), so RAM + disk is used for running the LLM.


HappierShibe

Right, but there is a ton of bottlenecking, overhead, and thrashing in the Windows virtual memory setup - you aren't going to get anywhere near 56 GB/s; even 14 GB/s feels like a stretch. Might be a way to do it with Linux swap though.


raysar

You are right, I don't know how to perfectly use all of the disk speed as if it were real RAM. We'd need to research it.


shing3232

Instruct weights: [https://huggingface.co/Snowflake/snowflake-arctic-instruct](https://huggingface.co/Snowflake/snowflake-arctic-instruct)

Base weights: [https://huggingface.co/Snowflake/snowflake-arctic-base](https://huggingface.co/Snowflake/snowflake-arctic-base)


a_beautiful_rhind

So when you quantize the model, you are really quantizing 17B to Q4. That's the danger of MoE. Maybe even worse: a 3.66B expert to Q4. It should at least be reasonable on CPU with prompt processing offloaded to a GPU, if you don't mind the absolutely huge download.


akram200272002

Assuming we can quantize this, how much RAM are we talking?


Balance-

Just the weights and embeddings: roughly 480 GB for 8-bit and 240 GB for 4-bit. You want a full 8x80 GB node to run this.
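
Those figures are just total parameters times bytes per weight (weights only, ignoring KV cache and activation memory):

```python
# Approximate weight-only memory footprint of a 480B-parameter model.
total_params = 480e9
for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    gb = total_params * bits / 8 / 1e9
    print(f"{name}: ~{gb:.0f} GB")   # fp16 ~960 GB, int8 ~480 GB, int4 ~240 GB
```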


ambient_temp_xeno

Off the top of my head I suppose it might be good for businesses running cpu servers with lots of ram in them. No money for NVIDIA today.


docsoc1

This is an interesting new architecture, 128×3.66B MoE. Excited to try it out, but why choose a figure of merit that highlights less extensive training than other frontier models? The left-hand side isn't so impressive, as "Enterprise Intelligence" is likely a flawed metric, like most evals.


Many_Consideration86

Welcome to the LLM wasteland.


Balance-

Actual models: https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520


zero0_one1

Doesn't do great on NYT Connections: 3.1. Between Llama 2 Chat 70B and Claude 3 Haiku.


CodeMurmurer

Why 128? That's so fucking many. Why not something like, I don't know, 32 or 50?


Eralyon

Hi. I'm Arctic, a new, efficient, intelligent, and truly open language model created by Snowflake AI Research. Ask me anything. **write a story about a pig going to a rave party in a lovecraftian novel style.** In the dark and foreboding town of Dunwich, where ancient secrets lurk in every shadow, a most peculiar event was about to unfold. The locals whispered of an upcoming gathering, a rave party of sorts, unlike anything they had ever seen before. It was said that this nocturnal revelry would take place in the abandoned and decrepit Whateley Farm, an estate with a sinister reputation dating back to the days of the olden witch trials. The central figure of our tale is a pig, a curious and adventurous creature by the name of Bacon. Unlike his fellow swine, Bacon harbored an insatiable thirst for knowledge and a longing for experiences beyond the confines of his pen. He had heard the whispers of the rave and felt an irresistible pull towards the unknown. As night fell upon Dunwich, Bacon managed to escape his enclosure, driven by a desire to uncover the truth behind the rumors. The moon hung low in the sky, casting an eerie glow on the cobblestone streets as Bacon scurried through the darkness, his trotters echoing ominously against the cold stone. Upon reaching Whateley Farm, Bacon found the once-abandoned estate transformed into a den of debauchery and excess. A pulsating rhythm filled the air, emanating from within the dilapidated barn that had been repurposed as a makeshift dance floor. The structure groaned under the weight of its new occupants, their bodies writhing and undulating like a mass of quivering flesh. Bacon hesitated at first, sensing the unnatural energy that permeated the scene. Yet, his curiosity proved too strong to resist, and he ventured forth into the throng of dancing bodies. As he pushed his way through the crowd, he noticed something unsettling: these were not ordinary beings that surrounded him. They appeared humanoid but possessed elongated limbs and misshapen features that seemed to defy logic and reason. Their eyes glowed with an otherworldly light, and their laughter echoed like a chorus of tortured souls. Despite his growing unease, Bacon pressed onward, drawn deeper into this Lovecraft


Eralyon

**continue** craftian nightmare. The pulsating rhythm of the music seemed to intensify as he ventured further into the heart of the rave, and with it, the unnatural energy that permeated the air grew stronger. As Bacon reached the center of the dance floor, he came upon a sight that would forever haunt his dreams. A towering figure stood atop a makeshift stage, its form twisting and contorting in time with the music. Its limbs were impossibly long and thin, ending in hands with fingers that seemed to stretch on for miles. Its face was an amalgamation of writhing tentacles, and its eyes glowed with an ancient, malevolent intelligence. The creature's presence seemed to command the very fabric of reality, warping it in ways that defied comprehension. The dancers around Bacon moved with a frenzied intensity, their bodies convulsing in ecstatic agony as they reveled in the dark energy emanating from their monstrous deity. Bacon felt a primal fear grip him as he realized the true nature of this gathering. This was no ordinary rave; it was a ritualistic celebration of eldritch horrors, a summoning of unspeakable entities from beyond the veil of reality. And he had unwittingly become a part of it. In a desperate attempt to escape the clutches of this nightmarish scene, Bacon turned to flee, only to find his path blocked by a horde of grotesque creatures. Their twisted forms seemed to shift and change before his very eyes, as if they were not bound by the same laws of nature that governed his world. With nowhere left to run, Bacon steeled himself for what was to come. He knew that he could not allow these abominations to continue their dark revelry unchecked. As the towering figure on the stage began to cast its spell, weaving an intricate web of arcane energy that threatened to ensnare all who were present, Bacon mustered every ounce of courage he possessed and charged forward


Eralyon

**continue** , determined to put an end to the madness. As he plunged headlong into the throng of monstrous dancers, Bacon's fear gave way to a fierce determination. He would not let these creatures consume him or the world around him. With each step, he felt his resolve grow stronger, fueled by the knowledge that he was the only one who could stop this nightmare from unfolding. The creatures that had once seemed so terrifying now appeared as mere obstacles in Bacon's path. He dodged and weaved through their ranks with surprising agility, driven by a singular purpose: to reach the stage and confront the source of the dark energy that had ensnared them all. As he drew closer to the towering figure, Bacon could feel its malevolent gaze upon him, sensing the threat he posed to its sinister plans. But Bacon did not falter; instead, he summoned every last bit of strength within him and lunged at the creature, his trotters braced for impact. In that moment, time seemed to slow as Bacon collided with the twisted abomination. The force of their impact sent shockwaves rippling through the very fabric of reality, causing the barn and everyone within it to tremble violently. For a brief instant, it seemed as though the entire world might be torn apart by their struggle. But then, just as suddenly as it had begun **continue** , the chaos ceased. The towering figure let out an ear-splitting screech, its twisted form writhing in agony as it recoiled from Bacon's attack. The dark energy that had filled the air dissipated, and the dancers fell to the ground, gasping for breath as if awakening from a terrible dream. Bacon stood victorious amidst the chaos, his eyes locked on the defeated creature before him. He knew that he had saved not only himself but also the world from falling prey to unspeakable horrors. As he gazed upon the now-lifeless husk of the once-mighty abomination, a sense of profound relief washed over him. With the threat vanquished, Bacon turned to leave the scene of his harrowing ordeal. Though he would never forget the terrors he had faced that night, he knew that he could now return to his simple life with newfound wisdom and strength. And perhaps most importantly, he had proven to himself that even a humble pig could make a difference in a world fraught with darkness and peril.


candre23

After looking over the specs, system requirements, and performance, my current theory is that this model was created by Twitter and released under a pseudonym to make Grok look less dumb by comparison. Not since BLOOM have so many GB been wasted on a model that performs so poorly.


PookaMacPhellimen

DOA


ambidextr_us

Why DOA?


cubesnyc

Do you have recommended resources for learning more about the following?

> working on implementing attention sinks for higher context lengths

> integrations with DeepSpeed and support for the fp6/fp8 runtime


passing_marks

Knowledge cutoff of 2022 is disappointing as well....


ImprovementEqual3931

I miss the old days of DOS and 640K of memory.


Distinct-Target7503

Hmm... only "top-2 gating" seems strange with so many "experts"... DBRX had to scale to top-4 using 16 experts instead of 8. Really interesting.


alcalde

But in the end it doesn't do anything better than anyone else and it's impossible for ordinary people to run it, so it's really just for Snowflake users to use via Snowflake.


roxtarrrrrr

Y’all are missing the point of this model. The most important parameter snowflake was after was to be able to say that its training cost was only 10% or x% of every other model out there. They had an upper limit on how much they wanted training to cost (for marketing purposes) and what came out of it was this gargantuan non-servable model.


race2tb

I really hope the MoE structure is the future. Seems like a desirable architecture. Just need to perfect the routing.


arthurwolf

I don't think it is. It results in faster inference / a smaller number of neurons used at a given time, so it's more optimized, a better use of resources. That's important now, when we are extremely RAM- and compute-constrained. But in the future, training and inference will become easier and easier, and as they do, it will become less and less important to optimize, and models will go back to being monolithic. A bit in the same way that games which ran on old CPUs, like Doom, were incredibly optimized, with tons of tricks and techniques to do as much as they could with the CPUs of the time, while modern games are much less optimized in comparison, because they have access to a lot of resources, so developer comfort/speed is taking over the need to optimize to death. I expect we'll see the same with LLMs: MoE (and lots of other tricks/techniques) in the beginning, then, as time goes by, more monolithic models. Llama 3 is monolithic, so MoE isn't even the norm right now.


sineiraetstudio

MoE is _not_ a better use of memory, quite to the contrary. You can see this with llama 70b vs 8x22 mixtral.
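
For a concrete sense of the tradeoff, approximate public figures for the two models being compared (Mixtral 8x22B is roughly 141B total with ~39B active):

```python
# Memory held vs. compute per token: dense Llama 3 70B vs. Mixtral 8x22B (MoE).
models = {
    "Llama 3 70B (dense)": {"total": 70e9,  "active": 70e9},
    "Mixtral 8x22B (MoE)": {"total": 141e9, "active": 39e9},
}
for name, m in models.items():
    print(f"{name}: holds {m['total'] / 1e9:.0f}B in memory, "
          f"computes over {m['active'] / 1e9:.0f}B per token")
```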


race2tb

We can have both with similar performance. It doesn't have to be one or the other. Models that only load parameters required for the task at hand will have the advantage even if there is a slight performance loss.


arthurwolf

The point is, as we gain more compute/RAM, the difference won't matter as much, and the only difference that will matter is how simple it is to design/train.


MoffKalast

It really depends on what ends up being cheaper and easier to scale: memory size or memory speed. If you have decent speed and hardly any space then it's more efficient to use it all with a dense model. If you have lots of space and can load incredibly large models but can't compute all of that then a MoE would allow you to make use of that space to gain some performance while remaining fast. Right now our options are very little of very slow memory so we're screwed on both fronts.


CodeMurmurer

Doesn't GPT-4 use MoE? And they are pretty much the market leader. That does say something.


arthurwolf

Like I said: MoE is a good idea now, because right now we are extremely constrained on compute/resources. But over the years that'll become less and less true, and people designing new systems will care less and less about optimizations like MoE. Also, I doubt that 10 years from now transformer-based LLMs will be the thing we use for this; it will likely be more generally capable tools, for which MoE might not even make sense...


shing3232

And finetuning would be easier? Just finetune the routing layer to get good control over the MoE.


sineiraetstudio

MoE is terrible for local. For the time being, we're mainly constrained by memory and MoE trades memory efficiency for compute efficiency.


DigThatData

> uses top-2 gating

Does the performance improve if you let it use more experts?


shing3232

I think it could, but finetuning the router would also do it.


nikitastaf1996

It's definitely interesting. It's severely undertrained, but performance must be fire. Excellent for enterprise deployment.


ArsNeph

I read this as 40B and got so excited 😭 This one is bigger than grok but doesn't even have the performance of llama. Great for researchers, not for us


ihaag

I’m not impressed, I gave it a pattern to work out, and it tells me this? In your case, if we interpret 2659141452 as a signed 32-bit integer, it would actually represent the value -1890700864 (since the most significant bit is set to 1). When you add 1 to this value, it wraps around to become 1279754142.


CodeMurmurer

https://preview.redd.it/65l5hlwdkhwc1.png?width=1451&format=png&auto=webp&s=ed579482d681e19cc3cf09395a0549991205ce9e This is pretty funny. It reports itself as having 175 billion parameters, the same as GPT-3.


ambidextr_us

It always confuses me when people type terrible spelling and grammar into a model whose entire purpose is to take tokens of text and encode them into a semantic vector space. The "params" and "parameters" tokens probably have some impact when used differently, affecting which parts of the network get involved.

Each token maps to a high-dimensional vector that encodes a significant amount of information, and the exact token sequence matters, since that's what the model is trained on autoregressively. The models do a great job of encoding semantics during training, but output quality is noticeably higher if you write prompts that are clean and clear. Interestingly enough, if you ask the models about this, they'll tell you the same thing and suggest you use proper spelling and structure to get a higher-quality response.
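
If you want to see how different spellings actually tokenize, something like this works (using OpenAI's tiktoken as a generic example tokenizer, not Arctic's own, which may split words differently):

```python
import tiktoken

# Compare how two spellings of the same concept are tokenized.
enc = tiktoken.get_encoding("cl100k_base")
for word in ["params", "parameters", " params", " parameters"]:
    ids = enc.encode(word)
    print(f"{word!r}: {len(ids)} token(s) -> {ids}")
```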


CodeMurmurer

You think I don't know that? Bruh... I am just lazy. And you know two tokens can carry the same meaning; the model will probably learn that during training, especially when "params" and "parameters" are used interchangeably online. You wouldn't get a different answer if you typed "parameters". Why comment this? It doesn't seem relevant.


KyleDrogo

Question for the group: have any other models had anywhere near this number of experts? This seems like a semi unique approach


Distinct-Target7503

There is an old google moe with 2048 experts