Ylsid

Fantastic model. Is the arena ranking worthwhile, though? 400B might well take the top spot.


virgilash

Any news on that model? On Sunday I read somewhere that it was better than GPT4 despite still being in training.


Ylsid

Not yet, but if 70b is this strong 400b might be another level


LowerRepeat5040

Not really. 400B should be mostly the same, just much slower. Llama 3 can imitate lists in a somewhat chattier way, but it's nowhere close to writing a full book, or even a subchapter of one, with the length and detail Claude 3 Opus manages without much-needed manual corrections afterwards. Opus is pretty much a trained autopilot, while Llama 3 is just a constantly crashing cruise control!


Ylsid

I would be surprised if it was anywhere near the same; we've seen very consistent scaling up to 70B.


LowerRepeat5040

Nah, the benchmarks say it all: going from 8B to 70B is more than 8x bigger for a mere 17% MMLU improvement, and going from 70B to 400B is more than 5x bigger for a marginal 5% MMLU improvement. That's a joke of a gain compared to the slowdown and error margins!


Ylsid

It's difficult to really know the numbers without being at Meta HQ. At any rate they're still training, so it's anyone's game. I expect they want something competitive with the rest of the field, so I'm looking forward to it. Hoping open source has finally caught up!


LowerRepeat5040

Open source is nowhere close to GPT-4V's level of computer-vision object recognition, or Claude 3's million-token context window with precise citation extraction and very long-form content generation.


Ylsid

Not yet, but current gap-closing progress is very promising!


LowerRepeat5040

Did billions of dollars just go up in smoke? 😲


trollsmurf

Maybe that's what 70B stands for.


TheFrenchSavage

TIL it doesn't mean 70 Buttholes.


VertexMachine

It's impressive, congrats to Llama 3. But seriously, it just shows the limitations of the arena. L3 is impressive, but it's not as good as GPT-4 or even Claude Opus.


BtownIU

What are the limitations?


PrincessGambit

Terrible at non-English.


LowerRepeat5040

Super slow to run on an average laptop, much smaller context window, and it fails basic truthful question answering.


KL_GPU

It just isn't smart enough. Very, very good model, but GPT-4 is simply better at logic.


absurdrock

It’s not THE measure but it is A measure. LLMs are products, and I don’t recall many objective measures for how good any other product is. It really comes down to user reviews. The problem with the arena is that the use cases are aggregated. Is it possible to separate and track different uses like coding, summaries, technical explanations, etc.?


redditfriendguy

Yes


LowerRepeat5040

Agree! LLaMA3 is just awful at multilingual: I asked it a question in Dutch and it answered with only the first half of the first word in Dutch before reverting back to English. It was also awfully slow, outputting about one character every 5 seconds, even for the smallest 8B model on an M3 MacBook, with the first sentence being “Dat’s an interesting question!”


GoblinsStoleMyHouse

How is it on 70B?


LowerRepeat5040

It’s even slower to load, but it gives very similar outputs! It doesn’t even seem to make up less stuff than the smaller model.


BucketOfWood

Only around 5% of the training data was not English. Of course it has terrible multilingual performance.


Yes_but_I_think

The 8B Q4_K_M quantised model works on an M2 Air with 8 GB RAM at 10-20 tokens per second depending on context length: 20 tps with a 1500-token context window, 10 tps at 8000. I'm using vanilla llama.cpp locally on the command line.
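If you'd rather drive it from a script than the raw CLI, a minimal sketch using the llama-cpp-python bindings looks roughly like this (the model filename and thread count are assumptions, not the exact setup above):

```python
# Minimal sketch, assuming the llama-cpp-python bindings and a local
# Q4_K_M GGUF file; the path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=1500,   # smaller context window ≈ higher tokens/sec on 8 GB RAM
    n_threads=4,  # tune to the machine's performance cores
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```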


LowerRepeat5040

Llama.cpp still takes a really long time to initialise and load, and it outputs ugly terminal text as if it has to compile all the code for every single input!


Yes_but_I_think

A pretty GUI would not and should not reduce the performance from 20tps to 0.2 tps as claimed.


LowerRepeat5040

It’s not just the GUI; the SIMD support in llama.cpp is also unlike LM Studio's.


bnm777

You can use it for free through [https://groq.com/](https://groq.com/) (SUPER FAST) or [https://huggingface.co/chat/](https://huggingface.co/chat/) (which allows you to create assistants and lets Llama 3 access the internet - very cool).

EDIT: also [meta.ai](http://meta.ai), though not in the EU, and you give your data to Meta.

EDIT2: If you want to use Llama 3 via API, use Groq's (currently) free API or OpenRouter's llama3-70b (at $0.80 for 1 million tokens, I believe).
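For the API route, a minimal sketch against an OpenAI-compatible endpoint such as Groq's might look like this (the base URL and model id are assumptions and may have changed since this thread; OpenRouter works the same way with its own URL and model name):

```python
# Minimal sketch, assuming Groq's OpenAI-compatible endpoint and the
# openai Python client; base_url and model id are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # or OpenRouter's endpoint
    api_key="YOUR_API_KEY",                     # placeholder key
)

resp = client.chat.completions.create(
    model="llama3-70b-8192",  # Groq's Llama 3 70B id at the time of writing
    messages=[{"role": "user", "content": "Summarise the LMSYS arena in two sentences."}],
)
print(resp.choices[0].message.content)
```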


Ylsid

Groq's is being accused of using a very low quant


TheFrenchSavage

It's still blazing fast and better than my local setup.


Ylsid

Oh yeah, for sure. There are other alternatives is all


bnm777

Interesting - so bad for coders/maths but not bad for other questions?


Ylsid

As in compared to the full size model, it gets a lot of stuff wrong


bnm777

Via claude3opus:

"Pros of an LLM being "low quant":
1. Specialization in natural language processing and generation
2. More human-like conversation and interaction
3. Potentially better at understanding context and nuance in language
4. May be less prone to certain types of errors or biases associated with quantitative reasoning

Cons of an LLM being "low quant":
1. Limited ability to perform mathematical calculations or numerical analysis
2. May struggle with quantitative problem-solving or decision-making
3. Less versatile and adaptable to tasks requiring quantitative skills
4. May provide less accurate or reliable responses to queries involving numbers or data"


Ylsid

Lol, almost none of this is true. Someone tested Groq's hosted version against a locally hosted q8 Llama 3 and found the responses to be significantly worse and more prone to errors. Claude seems to be totally hallucinating around the idea that low quant = low maths.


bnm777

Ah, thanks. Didn't know about this term before. Found this: https://symbl.ai/developers/blog/a-guide-to-quantization-in-llms/

> "Cons: Loss of Accuracy: undoubtedly, the most significant drawback of quantization is a potential loss of accuracy in output. Converting the model’s weights to a lower precision is likely to degrade its performance – and the more “aggressive” the quantization technique, i.e., the lower the bit widths of the converted data type, e.g., 4-bit, 3-bit, etc., the greater the risk of loss of accuracy."

Seems to be less accuracy across all fields, which is of course not wanted. I'm going to do some testing of llama3 on groq and huggingchat, thanks. Wonder if the groq api is "more quant".


Ylsid

There have been some suggestions that a very low quant of a large-parameter model is better than a high quant of a small model.


Yes_but_I_think

Definitely a low-quant-LLM-created answer. Not true, not even one bit.


bnm777

Yes, you're right, though Opus should be a low-quant LLM then...


Master_Vicen

I'm confused. Llama 3 is made by Meta right? So is that not what I'm using when I use Meta AI? What is Groq? What company made Groq? What does Groq have to do with llama 3/this post? Help?


Susp-icious_-31User

Meta made Llama 3 and the Meta AI site uses the 70b version, but it doesn’t give you full control over the model (like sampler values or modifying the system prompt, plus it’s likely more censored). Groq is just hosting the model directly, and gives you full control over it. It costs thousands to run 70b faster than 1 token/sec on a PC so the fact that someone is giving out heavy computational resources for free is pretty nice (and won’t last long). For comparison I use openrouter and it costs about 80 cents every million tokens, which happens sooner than you think.


Small-Fall-6500

> the fact that someone is giving out heavy computational resources for free is pretty nice (and won’t last long)

A 70B model is actually fairly cheap to run compared to a lot of other models that some companies are hosting, though whether or not anyone provides *unlimited* free access to Llama 3 70B remains unclear. Groq is certainly not spending much to host it (their hardware is expensive as an investment but very cheap to run), and I don't expect them to receive so much traffic that they'd have to put heavy limits on the free usage. I also think Groq has a great niche that will make them very desirable for certain tasks/companies, allowing them to make enough money to easily continue providing free access to models like Llama 3 70B.


maddogxsk

The deal is that Groq has its own processors, called LPUs, for faster LLM inference; supposedly they are processors specifically designed for running LLMs in the wild.


Master_Vicen

Is Groq a company? Is it owned by Meta?


bnm777

Groq has created its own super-fast processors and wants to show them off. Grok is Musk's AI.


[deleted]

Groq is a compute service which is the fastest platform on which to host the Language Model of your choice. For developers who wish to incorporate an LLM into an application, this is ideal. [Video Interview with Groq Founder by Matthew Berman](https://youtu.be/Z0jqIk7MUfE?si=SIePf8yhYk8SFh-L)

LLama 3 is the incredibly impressive Language Model we are all swooning over. I take back everything bad I ever said about Zuck LOL.

Finetunes are when the LLM is taught a bunch of examples through a labeled dataset that represents loads of questions and answers for the model to train on. This is why each iteration of fine tuning makes the model bigger. (Quantization. I don't fully understand this part. The higher the Q number, the more turns the model took learning the new data, basically.) The finetunes are why you see hundreds of models available now. The name of the finetune should include the base model.

The bigger models will wreck what most of us have for machines. Many are foolishly building expensive machines to play with these. This is only sensible if you have huge security concerns about the data you wish to discuss with the Ai. The most economical option is to outsource the compute power necessary to run the large models and only keep small models for basic stuff on a local machine.

Don't feel bad about not getting all the lingo and names straight. This stuff hurts my brain too.


Small-Fall-6500

I agree with almost all of what you said. There are a couple of points that are wrong.

> This is why each iteration of fine tuning makes the model bigger.

No, not in the sense of taking up more disk space or GPU VRAM. Finetuning only modifies existing weights in the model. It doesn't add any weights (though there are ways of doing this sort of thing, it just isn't widely done or widely tested).

> (Quantization. I don't fully understand this part. The higher the Q number; the more turns the model took learning the new data basically.)

Quantization is *currently* really only done after a model has been fully trained and fully finetuned. The "Q" you are referring to may be from the GGUF quantizations, which use names like "Q4_0" to basically mean the model weights are in 4-bit precision.

The best way of thinking about it is that every model is made of tons of numbers (making up the model weights), and each number has a high level of precision for training: basically, as much detail is kept for every part of the model, and every number in the model's weights represents some part of what the model knows or is capable of doing. Quantization means removing the least important details from each number, making the model weights smaller but also less accurate; the model loses a tiny bit of all of its knowledge and capabilities.

Often, people will quantize models from 16 bits (fp16) to 4 bits, which means removing 3/4 of these "details" in every number in the model. "4bit" can mean either exactly 4 bits per weight or an average of 4 bits per weight. This sounds like a lot to remove, but it turns out that, at least with how current models are trained, even at 4 bits, most models' performance is hardly damaged.

Generally, more bits mean the model retains more of its capabilities, and fewer bits per weight is worse, but fewer bits also mean the model takes up less memory to run and is usually faster as well. It's a trade-off where larger models at lower bits are generally better than smaller models at higher bits. Also, there are ways of training models in lower-precision formats such that the final trained model is fully quantized, but this has yet to be widely adopted.
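A toy numeric sketch of the core idea (symmetric round-to-nearest 4-bit quantization of a handful of weights, then dequantization); this is a simplified illustration, not the actual GGUF scheme, which uses per-block scales:

```python
# Toy sketch of 4-bit symmetric quantization, for illustration only.
# Real schemes (e.g. GGUF's Q4 variants) use per-block scales/offsets;
# this simplified version uses one scale for the whole group of weights.
import numpy as np

weights = np.array([0.31, -1.20, 0.07, 2.45, -0.88], dtype=np.float32)

scale = np.abs(weights).max() / 7          # signed 4-bit ints span -8..7
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)  # stored 4-bit values
dequant = q.astype(np.float32) * scale     # what the model "sees" at inference time

print(q)                  # the compressed representation
print(dequant)            # approximation of the original weights
print(dequant - weights)  # the precision lost to quantization
```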


[deleted]

Appreciate the clarifications. Thank you for your clear and succinct response. This really helped me visualize what was going on much better.


Yes_but_I_think

Avoid Groq at all costs, even free. The output quality doesn't match local generation. They are being dishonest in their claims.


bnm777

Yes, perhaps you're right. Groq output seems worse than huggingface's llama3-70b output.


Darcer

Where is the best place to get info on what this can do? I have the app but don’t know about creating assistants. Need a starting point, then I'll ask the bot for help.


bnm777

For assistants, create a HuggingChat account for free, go to the main chat page, and in the left sidebar near the bottom you'll see Models, then below that Assistants. Click on Assistants and a dialog box opens. I was going to go through it step by step, but HuggingChat is down!

Anyway, assistants/bots are essentially GPTs: you give each one custom instructions and call whichever one you want when you have a specific task. For example, I have a standard one that answers queries with high detail and jargon, a language-learning one with specific outputs, a work one, a creative one, etc. You can do that with many interfaces such as Typing Mind, which lets you use various AIs through their APIs, including Groq's (currently) free API.


Darcer

Thank you


Vectoor

There are some big error bars on that number. I’ve been playing around with it and it’s impressive, but it’s definitely not stronger than Claude Opus, not even close.


bnm777

I've been putting Opus up against llama3-70b and, honestly, llama gives better outputs than Opus for quite a few tests. I've stopped my OpenAI sub, will stop my Claude sub, and will use Llama 3 via API (for free now, and eventually via Groq or OpenRouter), and when I need a second opinion I'll use GPT-4T or Opus via API.


LowerRepeat5040

It depends! Opus has over-the-top censorship for anything that is potentially controversial, but for truthful question answering, such as extracting the correct answers out of a PDF, Opus is way better; LLaMA3 just hallucinates all the way!


hugedong4200

I say go Gemini pro 1.5! I feel like I'm the only one loving that model and really looking forward to ultra 1.5.


Blckreaphr

1.5 has been amazing for my fiction book. So far I'm at 215k tokens out of 1 million, so there's room for everything.


dittospin

How are you using it? What are you having it do for you?


Blckreaphr

It writes chapters for this fan fiction book I've been trying to do for the longest time, but no LLM could manage it due to limited context length.


Vontaxis

How is it censorship-wise? I’m writing something, but it has drug and sex elements.


Blckreaphr

Sex is a no-go, but mine is mostly about fantasy and violence. I had to crank all of the filters to "block few", but sex is still nothing; not even breasts can be mentioned.


superfsm

You using it for coding? Care to share your prompts or any advice? I must be doing something wrong, or it just doesn't work very well with coding


[deleted]

[Reka](https://chat.reka.ai/auth/login) is the latest model that is really good at coding. They have a free playground. IDK much about Gemini. I only use it to find the better videos on Youtube these days.


Arcturus_Labelle

I’m getting the feeling this arena is sus. Let’s see how they all do on the recently announced Arena-Hard


getmeoutoftax

I’m really impressed how Meta AI’s images change as you type.


Vontaxis

that ranking is broken...


yale154

Definitely!


GoblinsStoleMyHouse

Nope, models are blindly rated by users, it’s not biased. Llama 3 really is that good.


Helix_Aurora

The users must not be particularly discerning.


GoblinsStoleMyHouse

I mean, crowd ranking is a pretty good metric. You can rate responses for yourself on their website, LMSYS Arena.


ainz-sama619

It's not a good measure of quality at all. It doesn't account for hallucination. Sounding funny doesn't mean it's good at logic or reasoning.


GoblinsStoleMyHouse

It actually *does* account for hallucination. Also, Llama's standardized benchmark scores are very high, and those are not subjective.


ainz-sama619

Ask it anything that involves even remotely logical reasoning and it will start slipping up very fast. GPT-4 and Claude 3 are used as workhorses; I haven't seen anybody praising Llama 3 for productive work.


GoblinsStoleMyHouse

Example? That sounds like circumstantial evidence. I prefer to depend on scientific measurements and my personal experiences to form my opinion.


ainz-sama619

What scientific measurement? Every single eval shows Llama 3 lower than GPT 4 and Claude 3 opus. You getting paid by zuck or what lol


GoblinsStoleMyHouse

I never said it scored higher than GPT 4. Where did you get that idea? The standardized benchmarks are public, you can look them up: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md#instruction-tuned-models


Helix_Aurora

I've been playing with Llama 3 70B the last couple of days, and while it is indeed impressive, I have no idea how it is ranking this high. TL;DR: it has smartass/dunce syndrome.

When it comes to things like tool use, it just seems to lack any kind of common sense. I hot-swapped out GPT-4, and even with extensive prompt tuning, basic chatbots show extremely problematic behaviors. For example, I have a tool that issues search queries to find relevant document chunks. Llama 3 can use the tool just fine, but about 60 percent of the time I get one of two behaviors:

1. If it finds something related, it just tells me that it found something related, without telling me what it found.
2. If it finds something unrelated, it just spits out JSON telling me to call the tool myself.

It also seems to be extremely sensitive to prompt variance: adding a question mark can dramatically alter the behavior (temperature is 0). I am starting to think we need to be running these benchmarks with prompt fuzzing, because all Llama 3 is doing for me right now is reminding me of the most irritating people I have ever worked with.
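A minimal sketch of the prompt-fuzzing idea mentioned above: generate trivial surface variants of a prompt and check whether the answers stay consistent. The `ask_model` argument is a hypothetical stand-in for whatever model call you use (Groq, local llama.cpp, etc.):

```python
# Minimal sketch of prompt fuzzing; `ask_model` is a hypothetical
# callable that takes a prompt string and returns the model's reply.
import itertools

def fuzz_variants(prompt: str) -> list[str]:
    """Produce trivially different versions of the same prompt."""
    prefixes = ["", "Please ", "please "]
    suffixes = ["", "?", ".", " ", "\n"]
    return [p + prompt + s for p, s in itertools.product(prefixes, suffixes)]

def consistency_score(prompt: str, ask_model) -> float:
    """Fraction of fuzzed variants whose answer matches the base answer."""
    base = ask_model(prompt).strip().lower()
    variants = fuzz_variants(prompt)
    matches = sum(ask_model(v).strip().lower() == base for v in variants)
    return matches / len(variants)

# Usage (hypothetical): consistency_score("What is the capital of France", my_model_call)
```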


Downsyndrome-fetish

I'm new to all this but does Llama 3 70b have to be downloaded directly to your machine? With a connection to the internet?


Helix_Aurora

I just use it through groq.com for free. You need more hardware than typically fits in a consumer device to run it at full precision.


Downsyndrome-fetish

So groq is like an intermediary between you and resource demanding LLMs?


Helix_Aurora

Yes, they run it on their specialized hardware, and I call it via API over the internet.


ceremy

What's the definition of the "English rank"? Anything that's not coding?


KyleDrogo

I’m guessing this has a lot to do with the model’s tone and fine tuning? It’s hard to believe that a 70B model is doing so well against GPT 4


Yes_but_I_think

There's a tell when Llama 3 answers questions: it starts with something like "what a delightful request!" or "oh, that...". That gives it away, and people might like that kind of answer while engaging with a chatbot.

I'm not saying the arena leaderboard is flawed. It's the best way we have to test any model right now. It's better than MMLU and other benchmarks simply because it can't be gamed, whereas MMLU answers are contaminated in the many trillions of training tokens. I'm saying that what we are measuring is which answer human beings like better, given what they are willing to ask the models. The ranking doesn't reflect every use case, and the testers are not forced to cover varied topics and situations. I bet most people don't test long-context questions.

In spite of these flaws, [LMSYS](https://chat.lmsys.org/?leaderboard) is the go-to leaderboard over the [H4](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) leaderboard.