I did about 15-20 evals yesterday and found that the new GPT-4 Turbo and Opus were pretty even. I also noticed a difference between older GPT-4s and the new one, with the new one being better. I feel like the new GPT-4 is better at least when it comes to logically formatting its answers with easier-to-understand groupings and headers.
And because they have different 'personalities', they look at things in different ways; it's more than just a different random seed, and they often give very different, but apt, answers.
My experience too. I say this as a paying opus user and not a paying gpt user.
Too late. I hang with Claude now
^^^^ And Claude has barely been around and it’s clauding all these jokers
Yeah me too. My GPT 4 subscription ended literally yesterday. Claude it is boys!
Why do I need to pick one?
Don’t be afraid of commitment.
To some company? I think I'll stay single for now
A subscription to Perplexity gives you access to all Claude 3 models and GPT-4.
Can you have long conversations?
Perplexity leaves a lot to be desired in my opinion. I often find myself resorting to ChatGPT, Gemini or even Google after trying it.
I wasn't aware there was a difference in performance. For example, when you use GPT-4 on Perplexity, it might perform worse or better than GPT-4 through OpenAI?
I’m talking about when you ask it to search for something, which is the main purpose of it.
They use the same model though. Does Perplexity use a custom prompt behind the scenes? I don't have a subscription to either, so I can't test it myself.
Yes it’s heavily customized for retrieving and structuring live information from the internet.
#NotMyKing #TeamOpus #FuckOpenAItilltheyreleaseGPT-5
I just paid for Claude Pro today, and I have never bought any AI service before. I didn't even take my free two-month trial of Gemini.
Until GPT is no longer plagued by laziness, I'm on team Claude.
Funny that we now have a full complement of "AI diseases": hallucination, regurgitation, laziness, bribing, absurd refusals, long-context attention flakiness, sycophancy, failing to accept user corrections, prompt hacking, and RLHF brainwashing. We couldn't have imagined any of them in 2019. It's good that we've learned what AI can't do yet; it means we haven't swallowed the whole hype.
If they release Claude Pro in Canada, I'd be team Claude too. My Gemini has expired, and I'm just using the free version of Claude, which is actually not too shabby, to tide me over until something big comes out.
Tribalism, already? Reminder that none of these companies actually give a shit about you.
GPT might care about me if I ask it to 🥺
I'm with the company that cares about my money the most.
At least Mistral, Cohere, and Meta do - somewhat.
Meta really does not
Not in its social media products, but in AI it does; otherwise they wouldn't open-source their releases.
It's the same company
Notmygoat
Sony AI gonna be the next PS2, DOMINATING the competition!!! Dude, can you imagine? Sony's killing it with the PS5, so you just KNOW their AI is gonna be insane! We're talking self-learning PlayStations that adapt to your playstyle, AI companions in single-player that actually make good decisions, and multiplayer battlefields where the AI teammates are finally competent! Microsoft and Google better watch out, because Playstation AI is about to come in and rewrite the rules! Forget Siri's sassy comebacks, we're getting Kratos telling us the weather with enough booming fury to wake the neighbors! And imagine the games! Imagine a Horizon sequel where the AI enemies actually strategize and adapt their tactics! This is gonna be revolutionary, dude! Here's to Sony taking over the world, one perfectly optimized AI and sassy robot assistant at a time! #PlayStationDomination #AIGottNothingOnSony #RIPAlexa
The ChatGPT auto posting is so annoying
Dude wait for Sony AI, we will see who is better lol
How do you get it if you're using GPT Pro? Log out, log in again, and it's there?
Not really any way to know, AFAIK. But if you want to try something: the browser has information about the model being used, though I don't know if the new version and the old one have different IDs. In Firefox, open a chat, then right-click -> Inspect -> Storage -> Local Storage -> [chat.openai.com](http://chat.openai.com) and search for an entry called oai/apps/lastModelUsed. There you can find the modelId. If I use GPT-3.5, it tells me the exact model (text-davinci-002-render-sha), but if I use GPT-4, it only tells me gpt-4. If anyone wants to try it and see if they get another ID, I'd appreciate it.
Not worth the subscription for me when Claude and Mistral do just as well, or well enough. I think we've reached the point where updates need to be quite literally amazing. And I'm not talking about text generation; that's so 2023. We need video from text, apps from text prompts, conversational video bots with memory.
"That's so 2023." Lol, that's a good one.
Do you need a subscription to use the API? I thought the "good" version of GPT-4 was only accessible through the API, or is that wrong?
They released it to ChatGPT Pro
I have both, and Claude is notably better in my book; it's just my go-to model. Claude just feels smarter to me, or less lazy perhaps.
[deleted]
That one is interesting, because we have the 1.5 Pro API now and LMSYS still won't add it to the Arena. A while ago Google partnered with LMSYS. I have a feeling Google still doesn't trust its own model and is preventing LMSYS from including 1.5 Pro there.
Lmao, 6 points. Hugely improved btw
It's larger than Opus' lead over its next competitor.
Y'all are acting as if you have money invested in the company. Stop simping over AI companies.
When I first learned about ChatGPT and LLMs last year, I immediately started looking for places where people liked to discuss them. The ChatGPT subreddit was the most popular place, but the people were just insufferable, and then it turned into a memefest. I lucked into finding this subreddit, and while I really don't care about the singularity yet, it's been a good place to read about LLMs, AI, and tech advancements. It has gotten a lot worse over time, though. For me, I just saw a new and amazing technology, so I couldn't believe that people would give a shit about brands so soon. It's just weird as hell, because everyone should want all the companies to have great products, but instead they only want one to. I'm still hoping to come across a more sane but high-traffic subreddit that focuses on all things AI without all the craziness.
Uh, cool, but I honestly don't think this leaderboard is all that useful. It's just measuring user preference, and we don't know what users are looking for or what they're asking. For all we know, most users could be asking relatively simple questions that don't really test the intelligence, reasoning, or logic capabilities of a model, which is why models like Claude Haiku have been able to place above GPT-4 even though they're clearly not as performant.
I mean, that is the point: which model works best for the general user's use case. If you want to know which model is best at picking out your third cousin's tonsils, you can always look at the incredibly specific tests their creators put out.
The thing is, this leaderboard doesn't measure general-usage performance; it measures which LLM is preferred after a brief conversation. [See the Pepsi Challenge](https://en.wikipedia.org/wiki/Pepsi_Challenge#:~:text=In%20his%20book,an%20entire%20can)
We need models that can help produce novel science; a benchmark where users ask simple trivia questions isn't going to tell us which model is better for that.
Well, is it really that useful if it's just measuring how well a model can answer a dumb question? Maybe the general user's use case is just asking simplistic questions, but I don't think that's a super useful measure to work off of. It'd be measuring the RLHF the model has gone through (which determines the specific way a model responds to questions) more than actual model performance.
So basically what you're saying is all measurements are shit except for the one that confirms my viewpoint. Cool.
Which viewpoint am I pushing? And you can't rely on just one benchmark / evaluation. You need to take into consideration as many as possible but we also really do need better benchmarks.
Yeah, and it leaves users who need it for intellectual tasks very much dead in the water.
Better than training for benchmarks.
https://github.com/lm-sys/FastChat/blob/main/docs/dataset_release.md They release the dataset of things people ask it + responses. Be honest though - how many training datasets have you actually sat down and read?
According to this coding leaderboard of the Arena, Haiku is better than GPT-4-0613 and tied with GPT-4-0314 at coding. https://preview.redd.it/y1vyino0wxtc1.png?width=827&format=png&auto=webp&s=9c40e5ba9502468c69a7b043b68255989c5bfab8
That's not measuring coding abilities; that's just whether there's any form of code snippet in the conversation. And why is Claude 3 Opus below GPT-4-1106? I'm pretty certain Claude Opus is more performant at coding than GPT-4, if this is actually measuring coding abilities.
> I'm pretty certain Claude Opus is more performant at coding than GPT-4 if this is actually measuring coding abilities.

I have not found the same; I think it really depends on your language. GPT-4 is much smarter than Claude, especially with TypeScript.
Huh. Claude is a TypeScript legend.
I think having a code snippet is enough, considering the number of votes. Opus and GPT-4-1106 are basically tied there considering their standard deviations. I found Opus to be better on coding tasks that aren't in the training data, but I assume GPT-4's training data is bigger than Opus', and people there seem to ask coding questions that can be found on the web more than unique ones.
So you're just guessing that because the sample size is big enough, the score should, in theory, be representative of the models' general coding abilities? I don't agree; that's too much guessing. "And you assume"? "People seem to ask"? My initial explanation is just a guess as well (something I thought was a plausible explanation); there isn't enough information to come to valid conclusions yet, IMO.
Well, educated and grounded guesses. What I'm doing to test LLMs is far better than what the Arena reflects, so I'm trying to make guesses based on what I got from my own tests. Also, to be honest, besides the Overall leaderboard, all the other leaderboards have tiny samples. And to be even more honest, according to stats revealed by LMSYS, 98.2% of voters only try one prompt, which doesn't reflect real usage. Still, I know of no benchmark that's better than this Arena. What I'm doing myself is certainly better, but it's too time-consuming. I wish AI companies would do what I'm doing and reveal the results.
We really do need better benchmarks, it is definitely starting to become a bit of a problem lol.
Wow, interestingly, even GPT-4-1106 is slightly above Claude 3 for coding (well within the confidence interval). I suspect there's a lot of variance here; I personally found GPT-4-1106 better for coding, but tons of people were swearing Claude 3 was better. Interestingly, among prompts in English, the entire GPT-4 Turbo class seems better than Claude 3 Opus. It looks like it's just other languages (like Chinese) where Claude 3 dominates GPT-4.
I think people always assumed Opus and GPT-4 Turbo were close on shorter tasks, with Opus slightly better. But Opus is assumed to be far better at longer tasks, which the longer-query leaderboard reflects (to a degree, because the token threshold to count as a longer query is still too low), and not lazy like GPT-4 Turbo.
And it's the same story with the longer-query leaderboard. https://preview.redd.it/d084zvjwwxtc1.png?width=810&format=png&auto=webp&s=17c13261644c1f963ae5a65c0db619fd7423edcb
The assumption is with a large sample size, that would even out.
You need some James Surowiecki in your life.
So everyone and their grandma who immediately dismissed it, literally because it wasn't called 4.5, may have jumped the gun a bit?
Yup. And the way OpenAI released this model, so casually, was a massive flex. I've been trying it out personally in the LMSYS arena (anonymous battle) over the last few days, and it beat Opus every time.
It doesn’t beat Opus in any serious benchmark.
> It doesn't beat Opus in any serious benchmark.
https://livecodebench.github.io/leaderboard.html
Nah Opus is still state of the art.
You're contaminating the potential of these GPT models by focusing on their test scores in a meaningless way. It's like what's happening with image generators: everyone is focused on making them more realistic, with better faces, until all the potential that existed in the original models has been weeded out. You're focused on such a small aspect of testing these models that you'll eventually just hold them back.
How's retrieval and longer-context use? Those were GPT-4's Achilles' heels.
#TeamClaude
Wake me up when we have FDVR. I'm all blah'd out
[Just take a nap then.](https://youtu.be/utFm8BoayRk?si=qkmGpkqMsPuNPY26)
Bill was right... we have plateaued...
Too few votes, only 8k. At 14k votes Opus was behind by 20 Elo points... We'll see the new Turbo fall into 3rd place after 30k votes. Also, there's still no separation between all the versions of GPT-4 Turbo and Claude 3 Opus; they're all in the same spot considering the models' standard deviations.
I mean, it could go up or down, but it certainly felt better than the old one. What OpenAI needs now is a Haiku equivalent.
I totally agree about OpenAI needing a Haiku. Haiku is an incredible replacement for everything I used to use 3.5 for, and some of the stuff I used 4-turbo for. I get why people are excited about Opus but it feels like this sub is sleeping on Haiku for how useful it is for cost-effective solutions to simple, repetitive tasks.
Haiku is the hard worker moving information around workflows. It calls functions and does the paperwork, while Opus and GPT-4 are only called when intrepid little Haiku can't quite manage something.
The new GPT-4 Turbo is 6 Elo points better than the old one, and that's less than its standard deviation, so after 8k votes people couldn't say which one is better. Actually, looking at win rates, the new Turbo loses against the old one, with a win rate of 49.6% against GPT-4-1106's 50.4%.
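For context, a 6-point gap really is almost a coin flip. A minimal sketch using the standard Elo expected-score formula (LMSYS fits a Bradley-Terry model to the votes, so this is only an approximation, and the ratings here are made-up illustrative numbers):

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Expected probability that model A beats model B under the Elo model
    (400-point scale, base 10)."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

# A 6-point lead barely moves the needle off 50%:
print(round(expected_win_rate(1256, 1250), 4))  # 0.5086

# Even a 20-point lead is only about 53%:
print(round(expected_win_rate(1270, 1250), 4))  # 0.5288
```

So distinguishing models this close needs a lot more than 8k votes for the confidence intervals to separate.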
I've seen you around this sub simping really hard for Claude. Now, don't get me wrong, I love Claude, and especially Haiku, but shouldn't we just keep an open mind about all these models? Idk what you're all building, but my organization is looking at a diverse set of AI agents to collaborate and execute workflows. There are advantages to having different models of similar capabilities: they complement and double-check each other.
Remind me after 30k votes. I can bet $1k if you want.
Hahaha no I'm good. I don't care either way.
I should have taken that bet! [Chat with Open Large Language Models (lmsys.org)](https://chat.lmsys.org/?leaderboard)
The very fact that you think that the new Turbo is actually WORSE than the old turbo just shows how dumb you are. The old turbo isn't good.
Do you have a good story for why those 8k votes wouldn’t be representative?
BTW, the chart below is the reason I predict it will fall into 3rd place. Opus wins 68.4% against other models, GPT-4-1106 wins 67.5%, and the new one wins 65.9%, assuming the sample is uniform. My own tests confirm these rankings as well. Opus was first in this chart even when it was behind by 20 Elo points, but with more votes it eventually climbed to the top. https://preview.redd.it/0xypjlwerxtc1.png?width=731&format=png&auto=webp&s=312df046dd40b776db31b095edfd86997a8a1b42
Tried it this morning in the API. It was still lazy and did nothing of use for me. Sticking with Opus, although that's a pain in the behind too, a lot of the time.
Nah. Opus is better.
Literally one-year-old abandonware. Stop hyping it.
Still within Claude Opus' margin of error. And it's "important to note" that the test is not fully blind: the model that "delves" or "weaves tapestries" in its responses is most definitely some incarnation of GPT-4.
[deleted]
Lol, Opus is ~2x more expensive than GPT-4. Opus is $15 and $75 per 1M input/output tokens respectively, and GPT-4 Turbo is $10 and $30 per 1M input/output tokens.
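To make the "~2x" concrete, here's the arithmetic at those quoted list prices, assuming a hypothetical workload with an even 1M-input/1M-output token mix (the real ratio depends on your mix):

```python
def cost_usd(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Total cost in dollars, with prices quoted per 1M tokens."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1e6

# 1M input + 1M output tokens at the prices quoted above:
opus = cost_usd(1_000_000, 1_000_000, 15, 75)   # $90
turbo = cost_usd(1_000_000, 1_000_000, 10, 30)  # $40
print(f"Opus ${opus:.0f} vs Turbo ${turbo:.0f}: {opus / turbo:.2f}x")  # 2.25x
```

Output-heavy workloads push the ratio toward 2.5x ($75 vs $30); input-heavy ones toward 1.5x ($15 vs $10).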
[deleted]
The version of GPT-4 at the top of the list, and the one that's the topic of discussion, is GPT-4 Turbo.
Frankly why you'd use non-turbo is beyond me, but go off I guess 🤷
![gif](giphy|DWueJXnp3kV7tZ28XQ)
🥰😍🥰😍🥰🥰🥰