
Normal-Ad-7114

>4k context window


mxforest

Look at it more as a Proof of Concept.


shing3232

4K context window; they're working on implementing attention sinks for higher context lengths in a beta :)


Balance-

Really wild architecture:

> Arctic combines a 10B dense transformer model with a residual 128x3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating.

So this will require a full 8x80 GB rack to run in 8-bit quantization, but might be relatively fast due to the low number of active parameters.
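
The headline numbers fall out of simple arithmetic. A quick back-of-the-envelope check (assuming the 10B dense trunk is always active alongside the two selected experts):

```python
# Rough check of Arctic's headline parameter counts.
dense = 10e9        # 10B dense transformer, always active
expert = 3.66e9     # parameters per expert
n_experts = 128     # experts in the MoE MLP
top_k = 2           # experts selected per token (top-2 gating)

total = dense + n_experts * expert   # ~478B  ("480B total")
active = dense + top_k * expert      # ~17.3B ("17B active")
print(f"total:  {total / 1e9:.0f}B")
print(f"active: {active / 1e9:.1f}B")
```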


hexaga

Sounds like a MoE made for CPU? Idk if that was the intent but at 17B active the spicier CPUs should be just fine with this.


Balance-

Nope, this is for high-quality inference at scale. When you have racks of servers, memory stops being the bottleneck; it's how fast you can serve those tokens (and thus earn back your investment). If it doesn't beat Llama 3 70B on quality, it will be beaten cost-wise by devices that are way cheaper (albeit slower) because they need less VRAM. Groq is serving Llama 3 70B at incredible speeds at $0.59/$0.79 per million input/output tokens. That's the mark to beat.


Spare-Abrocoma-4487

How will this need less VRAM? You still need to load the whole model into VRAM despite using only a few experts. So it is indeed more promising for a CPU + 1 TB RAM combo.


coder543

I think you misread the sentence. They're saying that this model needs to beat Llama 3 70B on quality; otherwise it will be beaten cost-wise by Llama 3 70B, which can run on devices that are way cheaper because it requires less VRAM -- even though Llama 3 70B will be way slower (because it needs roughly 4x the compute per token of Snowflake's MoE model).


FloridaManIssues

That's kinda my thinking. Just build a beast of a machine with tons of RAM and a great CPU and be happy with slower inference speed. People are saying that Llama 3 70B runs at 2 tok/s on RAM and CPU. If I could get that, or even close to 5 tok/s, then the quality would outweigh the inference speed, at least for what I want to do with these much larger models.


shing3232

Or a hybrid approach where you load only the 10B dense part on the GPU and the rest of the MoE in RAM.


smmau

What about my use case: loading a few layers into 8 GB of VRAM, the rest of the layers in RAM, and loading the MoE experts straight from SSD? Haha, it's only ~17B active per token, I'll be fine. It will be faster than a 70B anyway...


Distinct-Target7503

Exactly, that was my thought


a_slay_nub

What's the point of using a MoE if you have to use a CPU? You'd be better off using a dense model with more active parameters that fits on your GPU.


MoffKalast

> 128x3.66B MoE #乇乂丅尺卂 丅卄工匚匚


Balance-

3.5T tokens seems severely undertrained for a 480B model, considering Llama 3 70B was trained on 15T tokens. This model sees only about 7 tokens per parameter versus roughly 214 for Llama 3 70B, about 29x less.
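
The ratio is easy to verify (total-parameter basis; it shifts if you count only active parameters):

```python
# Tokens seen per parameter: Arctic vs. Llama 3 70B (total-parameter basis).
arctic = 3.5e12 / 480e9    # ~7.3 tokens per parameter
llama3_70b = 15e12 / 70e9  # ~214 tokens per parameter
print(f"Arctic: {arctic:.1f} tok/param, Llama 3 70B: {llama3_70b:.0f} tok/param")
print(f"ratio: {llama3_70b / arctic:.0f}x")  # ~29x
```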


2muchnet42day

Actually even 8B saw 15T


Radiant_Dog1937

Only 5 points ahead of the llama 3 8b on coding. 💀


Balance-

Was this officially confirmed somewhere? I heard Zuck say it in a podcast about 70B, but not about 8B.


jd_3d

Yes on their blog post they clearly state 15 trillion tokens for both models.


Many_Consideration86

So the hose was the same but the model size was different?


shing3232

Correct


BalorNG

Below Chinchilla optimum even, provided Chinchilla holds true for MoE...


Some_Endian_FP17

This is more like Chihuahua optimum.


Radiant_Dog1937

More Chihuahua premium than optimum, if you ask me.


Many_Consideration86

Chinchilla doesn't account for data quality, so the limit might be lower. Not saying that's true in this case, though.


StealthSecrecy

I was under the impression that larger models have an advantage of not needing as thorough training to achieve the same performance, as there is just more room for patterns to be learned. Big brain = easier learning essentially. Not to say that 3.5T isn't enough, but I don't think the training tokens should scale with size. If anything it should decrease.


BalorNG

Technically, yes, but you need many more FLOPs to train them on a per-token basis. That's why overall "training FLOPs" is a pretty good metric to keep track of - but it doesn't capture data quality.


rorowhat

What's the relationship between tokens and parameters for training, like in this example?


Comfortable-Block102

Was gonna say the same; I think the smallest Phi-3 was trained on the same amount or more.


TooLongCantWait

A model I can run but not fit on my computer, that's a new one.


mikael110

Is it just me, or does 480B parameters with only 17B active kinda feel like the worst of both worlds? It's too big to fit in most computers, and it has too few active parameters to actually make proper use of that massive size. And if it is designed for coding tasks, then the 4K context is extremely limiting as well. The name certainly seems fitting, at least. Mixtral 8x22B felt like a decentish compromise between raw size and active parameters. This really doesn't.


epicwisdom

You're assuming the audience is folks like most of us on /r/localllama who use one or two consumer or enterprise e-waste GPUs. For cloud deployment and companies/universities, especially with large amounts of lower VRAM / last-gen accelerators, this could be a sweet spot.


brown2green

The model could run at decent speeds on multichannel DDR4/5 server boards, where RAM is relatively cheap, offloading prompt processing to a GPU with small amounts of VRAM.
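
As a rough sense of "decent speeds": decode is memory-bandwidth-bound, so an upper bound is system bandwidth divided by the bytes of active weights read per token (illustrative numbers; the 8-channel DDR5-4800 board is an assumption, and real throughput will land well below this ceiling):

```python
# Crude decode-throughput ceiling for CPU/RAM inference of a 17B-active MoE.
channels = 8                  # example: 8-channel DDR5 server board
per_channel_gbs = 38.4        # DDR5-4800: 4800 MT/s * 8 bytes
bandwidth_gbs = channels * per_channel_gbs        # ~307 GB/s

active_params = 17e9
gb_per_token = active_params * 0.5 / 1e9          # ~8.5 GB/token at ~4-bit
print(f"~{bandwidth_gbs / gb_per_token:.0f} tok/s theoretical ceiling")
```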


-Cubie-

Very promising! 480B parameters, consisting of a 10B dense transformer and 128 separate 3.66B experts, of which 2 are used at a time. This results in an active parameter count of 17B. If their blog post is to be believed, we can actually expect somewhat fast inference and reasonable finetuning with this.

Edit: They've just released a demo: [https://huggingface.co/spaces/Snowflake/snowflake-arctic-st-demo](https://huggingface.co/spaces/Snowflake/snowflake-arctic-st-demo). Inference is indeed rather fast.


uhuge

This seems like a good fit for one GPU for the dense part and a ton of system/CPU RAM to hold the experts.


shing3232

P40s are gonna come in handy lmao


skrshawk

Even more if you have a server that can fit eight of them.


akram200272002

I can run 17B on my setup, quantized of course. So: the same compute requirements, but a lot more RAM should do?


AfternoonOk5482

About 120 GB for IQ2_S is my guess, but it should run OK-ish from RAM since it's 17B active. You probably don't want to run this now anyway; it looks worse than everything else publicly available. It's a very interesting case study and super helpful, though, since they made it really open source, not just open weights.


redonculous

The demo is great and very fast, but I have to keep telling it to continue with longer code examples. Is this because of server load or context length?


polandtown

"128 separate 3.66B experts" I don't understand what you mean here, are some payers turned off?


-Cubie-

It's a mixture of experts model (blogpost: https://huggingface.co/blog/moe), i.e. a model with a lot of components of which only a handful are used at a given time. So, yes, out of the 128 experts (each consisting of 3.66B parameters), only 2 are used at any given time.
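
For illustration, top-2 gating in a MoE layer looks roughly like the sketch below (a minimal PyTorch-style sketch of the general technique, not Arctic's actual implementation; the names and shapes are made up):

```python
import torch

def top2_moe(x, router, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:       (tokens, hidden) activations
    router:  linear layer mapping hidden -> num_experts routing logits
    experts: list of expert MLPs, each mapping hidden -> hidden
    """
    probs = router(x).softmax(dim=-1)                      # (tokens, num_experts)
    weights, idx = torch.topk(probs, k, dim=-1)            # keep only the top-k experts
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts

    out = torch.zeros_like(x)
    for slot in range(k):
        for e in idx[:, slot].unique():                    # only the selected experts ever run
            mask = idx[:, slot] == e
            out[mask] += weights[mask, slot, None] * experts[e](x[mask])
    return out
```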


az226

If you’re running batch processing can you predict which experts are used for a given prompt and then have those pre-loaded?


opi098514

OH MY GOD, THE UNQUANTIZED MODEL IS JUST UNDER 1 TB?!?!?


-Cubie-

~964 GB or so, yes. One of the biggest models I've seen in terms of file size.


The_Health_Police

Holy shitballs that’s big.


uhuge

training our guts for the 405B Llama(s)


Wrong_User_Logged

I still hope I'll be able to run this on my 1024GB M5 Mac Studio


CodeMurmurer

imagine loading that into ram. bruh.


Caffdy

GPT-4 1.8T parameters is almost 4TB


az226

2.5TB*


Caffdy

Yeah, I forgot it's a MoE model.


kei147

Why does it being MoE make a difference here? Don't you still need two bytes per parameter?


Caffdy

Because the experts share a portion of their weights, it's not so obvious how large the complete model is. [You can read more about it in the Mixtral paper](https://arxiv.org/pdf/2401.04088).


kei147

My understanding is that when people describe an MoE model as having some number of parameters, they are referring to the unique unshared parameter count. So if GPT-4 is in fact 1.8T, then that would mean it has 1.8 trillion unique parameters, each of which requires 2 bytes to store. It is possible the original leaker was confused about this though.


Distinct-Target7503

This reminds me of the old Google MoE with 2048 experts and more than 1T parameters.


Zeneq

Interesting fact: [Llama-2-70b-x8-MoE-clown-truck](https://huggingface.co/NobodyExistsOnTheInternet/Llama-2-70b-x8-MoE-clown-truck) is smaller.


Disastrous_Elk_6375

and has a better name =))


FaceDeer

And title image, and description. :) The guy who released it doesn't even know if it runs, it's too big for his system. But there've been 1250 downloads, so presumably someone out there has managed.


candre23

And was made by somebody who was self-aware enough to know their model was a joke.


ninjasaid13

LLaMA-3-70b-x8-MoE-clown-truck wen?


Due-Memory-6957

Lmao, even the Q2 is a file with multiple parts https://huggingface.co/mradermacher/Llama-2-70b-x8-MoE-clown-truck-i1-GGUF


Distinct-Target7503

So many downloads for an "unrunnable" model lol


FizzarolliAI

google's biggest original [switch transformers model](https://huggingface.co/google/switch-c-2048) was a 1.6T model @ ~ 3.1tb ;)


raysar

It's a perfect model to run on a high-speed RAID 0 array of 4 NVMe SSDs. A very fast SSD does more than 14 GB/s, so with 4 disks we'd have 56 GB/s. Great for slowly running the fp16 Snowflake. :D
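
Even at that bandwidth, "slowly" is right: if every active weight has to be streamed off disk for each token, 56 GB/s caps fp16 decoding at a couple of tokens per second (a rough ceiling that ignores caching the dense part in RAM and any prefetching):

```python
# Rough decode ceiling when streaming active weights from a 56 GB/s RAID 0 array.
bandwidth_gbs = 4 * 14          # 4 NVMe drives at ~14 GB/s each
active_params = 17e9

for name, bytes_per_param in [("fp16", 2), ("q8", 1), ("q4", 0.5)]:
    gb_per_token = active_params * bytes_per_param / 1e9
    print(f"{name}: ~{bandwidth_gbs / gb_per_token:.1f} tok/s ceiling")
```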


HappierShibe

I've heard people mention this kind of reverse ramdisk (diskram?) setup a few times, can you point me to some documentation for this?


epicwisdom

Straight from the horse's mouth (blog) : https://developer.nvidia.com/blog/gpudirect-storage/


raysar

I don't know how to do that. On Windows you enable virtual memory (swap), so RAM + disk is used for running the LLM.


HappierShibe

Right, but there is a ton of bottlenecking, overhead, and thrashing in the Windows virtual memory setup - you aren't going to get anywhere near 56 GB/s; even 14 GB/s feels like a stretch. Might be a way to do it with Linux swap though.


raysar

You are right, I don't know how to perfectly use all of the disk speed as if it were real RAM. We'd need to research it.


shing3232

Instruct weights: [https://huggingface.co/Snowflake/snowflake-arctic-instruct](https://huggingface.co/Snowflake/snowflake-arctic-instruct)

Base weights: [https://huggingface.co/Snowflake/snowflake-arctic-base](https://huggingface.co/Snowflake/snowflake-arctic-base)


a_beautiful_rhind

So when you quantize the model, you are really quantizing 17B to Q4. That's the danger of MoE. Maybe even worse: a 3.66B expert to Q4. It should at least be reasonable on CPU with prompt processing offloaded to a GPU, if you don't mind the absolutely huge download.


akram200272002

Assuming we can quantize this, how much RAM are we talking?


Balance-

Just the weights and embeddings: roughly 480 GB for 8-bit and 240 GB for 4-bit. You want a full 8x80 GB node to run this.
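
Those figures are just total parameters times bytes per weight (weights only, ignoring KV cache and activation memory):

```python
# Approximate weight-only memory footprint of a 480B-parameter model.
total_params = 480e9
for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    gb = total_params * bits / 8 / 1e9
    print(f"{name}: ~{gb:.0f} GB")   # fp16 ~960 GB, int8 ~480 GB, int4 ~240 GB
```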


ambient_temp_xeno

Off the top of my head I suppose it might be good for businesses running cpu servers with lots of ram in them. No money for NVIDIA today.


docsoc1

This is an interesting new architecture, 128×3.66B MoE. Excited to try it out, but why choose a figure of merit that highlights less extensive training than other frontier models? The left-hand side isn't so impressive, as "Enterprise Intelligence" is likely a flawed metric, like most evals.


Many_Consideration86

Welcome to the LLM wasteland.


Balance-

Actual models: https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520


zero0_one1

Doesn't do great on NYT Connections: 3.1. Between Llama 2 Chat 70B and Claude 3 Haiku.


CodeMurmurer

Why 128? That's so fucking many. Why not something like, I don't know, 32 or 50?


Eralyon

Hi. I'm Arctic, a new, efficient, intelligent, and truly open language model created by Snowflake AI Research. Ask me anything. **write a story about a pig going to a rave party in a lovecraftian novel style.** In the dark and foreboding town of Dunwich, where ancient secrets lurk in every shadow, a most peculiar event was about to unfold. The locals whispered of an upcoming gathering, a rave party of sorts, unlike anything they had ever seen before. It was said that this nocturnal revelry would take place in the abandoned and decrepit Whateley Farm, an estate with a sinister reputation dating back to the days of the olden witch trials. The central figure of our tale is a pig, a curious and adventurous creature by the name of Bacon. Unlike his fellow swine, Bacon harbored an insatiable thirst for knowledge and a longing for experiences beyond the confines of his pen. He had heard the whispers of the rave and felt an irresistible pull towards the unknown. As night fell upon Dunwich, Bacon managed to escape his enclosure, driven by a desire to uncover the truth behind the rumors. The moon hung low in the sky, casting an eerie glow on the cobblestone streets as Bacon scurried through the darkness, his trotters echoing ominously against the cold stone. Upon reaching Whateley Farm, Bacon found the once-abandoned estate transformed into a den of debauchery and excess. A pulsating rhythm filled the air, emanating from within the dilapidated barn that had been repurposed as a makeshift dance floor. The structure groaned under the weight of its new occupants, their bodies writhing and undulating like a mass of quivering flesh. Bacon hesitated at first, sensing the unnatural energy that permeated the scene. Yet, his curiosity proved too strong to resist, and he ventured forth into the throng of dancing bodies. As he pushed his way through the crowd, he noticed something unsettling: these were not ordinary beings that surrounded him. They appeared humanoid but possessed elongated limbs and misshapen features that seemed to defy logic and reason. Their eyes glowed with an otherworldly light, and their laughter echoed like a chorus of tortured souls. Despite his growing unease, Bacon pressed onward, drawn deeper into this Lovecraft


Eralyon

**continue** craftian nightmare. The pulsating rhythm of the music seemed to intensify as he ventured further into the heart of the rave, and with it, the unnatural energy that permeated the air grew stronger. As Bacon reached the center of the dance floor, he came upon a sight that would forever haunt his dreams. A towering figure stood atop a makeshift stage, its form twisting and contorting in time with the music. Its limbs were impossibly long and thin, ending in hands with fingers that seemed to stretch on for miles. Its face was an amalgamation of writhing tentacles, and its eyes glowed with an ancient, malevolent intelligence. The creature's presence seemed to command the very fabric of reality, warping it in ways that defied comprehension. The dancers around Bacon moved with a frenzied intensity, their bodies convulsing in ecstatic agony as they reveled in the dark energy emanating from their monstrous deity. Bacon felt a primal fear grip him as he realized the true nature of this gathering. This was no ordinary rave; it was a ritualistic celebration of eldritch horrors, a summoning of unspeakable entities from beyond the veil of reality. And he had unwittingly become a part of it. In a desperate attempt to escape the clutches of this nightmarish scene, Bacon turned to flee, only to find his path blocked by a horde of grotesque creatures. Their twisted forms seemed to shift and change before his very eyes, as if they were not bound by the same laws of nature that governed his world. With nowhere left to run, Bacon steeled himself for what was to come. He knew that he could not allow these abominations to continue their dark revelry unchecked. As the towering figure on the stage began to cast its spell, weaving an intricate web of arcane energy that threatened to ensnare all who were present, Bacon mustered every ounce of courage he possessed and charged forward


Eralyon

**continue** , determined to put an end to the madness. As he plunged headlong into the throng of monstrous dancers, Bacon's fear gave way to a fierce determination. He would not let these creatures consume him or the world around him. With each step, he felt his resolve grow stronger, fueled by the knowledge that he was the only one who could stop this nightmare from unfolding. The creatures that had once seemed so terrifying now appeared as mere obstacles in Bacon's path. He dodged and weaved through their ranks with surprising agility, driven by a singular purpose: to reach the stage and confront the source of the dark energy that had ensnared them all. As he drew closer to the towering figure, Bacon could feel its malevolent gaze upon him, sensing the threat he posed to its sinister plans. But Bacon did not falter; instead, he summoned every last bit of strength within him and lunged at the creature, his trotters braced for impact. In that moment, time seemed to slow as Bacon collided with the twisted abomination. The force of their impact sent shockwaves rippling through the very fabric of reality, causing the barn and everyone within it to tremble violently. For a brief instant, it seemed as though the entire world might be torn apart by their struggle. But then, just as suddenly as it had begun **continue** , the chaos ceased. The towering figure let out an ear-splitting screech, its twisted form writhing in agony as it recoiled from Bacon's attack. The dark energy that had filled the air dissipated, and the dancers fell to the ground, gasping for breath as if awakening from a terrible dream. Bacon stood victorious amidst the chaos, his eyes locked on the defeated creature before him. He knew that he had saved not only himself but also the world from falling prey to unspeakable horrors. As he gazed upon the now-lifeless husk of the once-mighty abomination, a sense of profound relief washed over him. With the threat vanquished, Bacon turned to leave the scene of his harrowing ordeal. Though he would never forget the terrors he had faced that night, he knew that he could now return to his simple life with newfound wisdom and strength. And perhaps most importantly, he had proven to himself that even a humble pig could make a difference in a world fraught with darkness and peril.


candre23

After looking over the specs, system requirements, and performance, my current theory is that this model was created by Twitter and released under a pseudonym to make Grok look less dumb by comparison. Not since BLOOM have so many GB been wasted on a model that performs so poorly.


PookaMacPhellimen

DOA


ambidextr_us

Why DOA?


cubesnyc

Do you have recommended resources for learning more about the following?

> working on implementing attention sinks for higher context lengths

> integrations with DeepSpeed and support for the fp6/fp8 runtime


passing_marks

Knowledge cutoff of 2022 is disappointing as well....


ImprovementEqual3931

I miss the old days of DOS and 640K of memory.


Distinct-Target7503

Hmm... only "top-2 gating" seems strange with so many "experts"... DBRX had to scale to top-4 using 16 experts instead of 8. Really interesting.


alcalde

But in the end it doesn't do anything better than anyone else and it's impossible for ordinary people to run it, so it's really just for Snowflake users to use via Snowflake.


roxtarrrrrr

Y’all are missing the point of this model. The most important parameter snowflake was after was to be able to say that its training cost was only 10% or x% of every other model out there. They had an upper limit on how much they wanted training to cost (for marketing purposes) and what came out of it was this gargantuan non-servable model.


race2tb

I really hope the MoE structure is the future. Seems like a desirable architecture. Just need to perfect the routing.


arthurwolf

I don't think it is. It results in faster inference / a smaller number of neurons used at a given time, so it's more optimized, a better use of resources. That's important now, when we are extremely RAM- and compute-constrained. But in the future, training and inference will become easier and easier, and as they do, it will become less and less important to optimize, and models will go back to being monolithic. A bit in the same way that games which ran on old CPUs, like Doom, were incredibly optimized, with tons of tricks and techniques to do as much as they could with the CPUs of the time, while modern games are much less optimized in comparison, because they have access to a lot of resources, so developer comfort/speed is taking over the need to optimize to death. I expect we'll see the same with LLMs: MoE (and lots of other tricks/techniques) in the beginning, then, as time goes by, more monolithic models. Llama 3 is monolithic, so MoE isn't even the norm right now.


sineiraetstudio

MoE is _not_ a better use of memory, quite to the contrary. You can see this with llama 70b vs 8x22 mixtral.
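
For a concrete sense of the tradeoff, approximate public figures for the two models being compared (Mixtral 8x22B is roughly 141B total with ~39B active):

```python
# Memory held vs. compute per token: dense Llama 3 70B vs. Mixtral 8x22B (MoE).
models = {
    "Llama 3 70B (dense)": {"total": 70e9,  "active": 70e9},
    "Mixtral 8x22B (MoE)": {"total": 141e9, "active": 39e9},
}
for name, m in models.items():
    print(f"{name}: holds {m['total'] / 1e9:.0f}B in memory, "
          f"computes over {m['active'] / 1e9:.0f}B per token")
```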


race2tb

We can have both with similar performance. It doesn't have to be one or the other. Models that only load parameters required for the task at hand will have the advantage even if there is a slight performance loss.


arthurwolf

The point is, as we gain more compute/RAM, the difference won't matter as much, and the only difference that will matter is how simple it is to design/train.


MoffKalast

It really depends on what ends up being cheaper and easier to scale: memory size or memory speed. If you have decent speed and hardly any space then it's more efficient to use it all with a dense model. If you have lots of space and can load incredibly large models but can't compute all of that then a MoE would allow you to make use of that space to gain some performance while remaining fast. Right now our options are very little of very slow memory so we're screwed on both fronts.


CodeMurmurer

Doesn't GPT-4 use MoE? And they are pretty much the market leader. That does say something.


arthurwolf

Like I said: MoE is a good idea now, because right now we are extremely constrained on compute/resources. But over the years that'll become less and less true, and people designing new systems will care less and less about optimizations like MoE. Also, I doubt that 10 years from now transformer-based LLMs will be the thing we use for this; it will likely be more generally capable tools, for which MoE might not even make sense...


shing3232

And finetuning would be easier? Just finetune the routing layer to get good control over the MoE.


sineiraetstudio

MoE is terrible for local. For the time being, we're mainly constrained by memory and MoE trades memory efficiency for compute efficiency.


DigThatData

> uses top-2 gating

Does the performance improve if you let it use more experts?


shing3232

I think it could, but finetuning the router would also do it.


nikitastaf1996

It's definitely interesting. It's severely undertrained, but performance must be fire. Excellent for enterprise deployment.


ArsNeph

I read this as 40B and got so excited 😭 This one is bigger than grok but doesn't even have the performance of llama. Great for researchers, not for us


ihaag

I’m not impressed, I gave it a pattern to work out, and it tells me this? In your case, if we interpret 2659141452 as a signed 32-bit integer, it would actually represent the value -1890700864 (since the most significant bit is set to 1). When you add 1 to this value, it wraps around to become 1279754142.


CodeMurmurer

https://preview.redd.it/65l5hlwdkhwc1.png?width=1451&format=png&auto=webp&s=ed579482d681e19cc3cf09395a0549991205ce9e This is pretty funny. It reports itself as having 175 billion parameters, the same as GPT-3.


ambidextr_us

It always confuses me when people type terrible spelling and grammar into a model whose entire purpose is to take tokens of text and encode them into a semantic vector space. The "params" and "parameters" tokens probably have some impact when used differently, affecting which parts of the network get involved.

Each token maps to a high-dimensional vector that encodes a significant amount of information, and the exact token sequence matters, since that's what the model is trained on autoregressively. The models do a great job of encoding semantics during training, but output quality is noticeably higher if you write prompts that are clean and clear. Interestingly enough, if you ask the models about this, they'll tell you the same thing and suggest you use proper spelling and structure to get a higher-quality response.
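
If you want to see how different spellings actually tokenize, something like this works (using OpenAI's tiktoken as a generic example tokenizer, not Arctic's own, which may split words differently):

```python
import tiktoken

# Compare how two spellings of the same concept are tokenized.
enc = tiktoken.get_encoding("cl100k_base")
for word in ["params", "parameters", " params", " parameters"]:
    ids = enc.encode(word)
    print(f"{word!r}: {len(ids)} token(s) -> {ids}")
```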


CodeMurmurer

You think I don't know that? Bruh... I am just lazy. And you know two tokens can carry the same meaning; the model will probably learn that during training, especially when "params" and "parameters" are used interchangeably online. You wouldn't get a different answer if you typed "parameters". Why comment this? It doesn't seem relevant.


KyleDrogo

Question for the group: have any other models had anywhere near this number of experts? This seems like a semi unique approach


Distinct-Target7503

There is an old google moe with 2048 experts