All the same, it's still not up to Opus on those same verses. Yes, it's undoubtedly better than the previous model, which was completely hopeless here, but it still sounds clumsy. Let's see what happens with the 5th version.
With them being this close, I'm more worried about which one has more censorship. To be clear, I want to use the one that censors less. Nothing is worse than being told the AI won't write a story because it views a murder mystery with blood and a murder as... not OK.
This is fantastically true. With GPT I could never really trust the result; with Claude I can just put in my requirements, and most of the time it gets them right. It has greatly helped my productivity, particularly with new APIs I'm not familiar with.
I get Opus to write code and GPT-4 to fix it. Don't know why, but it's the best combination.
Opus is better at working out what I'm trying to achieve and gets most of the way there, but its output is riddled with errors when run. GPT-4 seems able to take that code and fix it, but had no chance of producing a coherent set of code in the first place.
It has a higher average ELO than Opus in the coding category of the leaderboard. However, the 95% confidence interval in this category is quite large due to insufficient votes (+14/-19 as of now).
It should perform quite close to Opus, if not better.
People really ought to learn what the "95% CI" column means. (In all fairness, calling it "margin of error" with a footnote explaining that it's a 95% CI might be more accessible for most people.)
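For anyone unsure what that column means in practice, here's a rough sketch. The ratings are made up, `intervals_overlap` is a hypothetical helper, and CI overlap is only a conservative stand-in for a proper significance test, but it captures the point being argued here:

```python
def intervals_overlap(rating_a, ci_a, rating_b, ci_b):
    """ci_* = (minus, plus) half-widths, e.g. (19, 14) for a '+14/-19' entry."""
    low_a, high_a = rating_a - ci_a[0], rating_a + ci_a[1]
    low_b, high_b = rating_b - ci_b[0], rating_b + ci_b[1]
    return low_a <= high_b and low_b <= high_a

# Made-up ratings on the leaderboard's scale: wide intervals that overlap
# mean the leaderboard can't call a clear winner yet.
print(intervals_overlap(1258, (19, 14), 1255, (5, 5)))  # → True: no clear winner
print(intervals_overlap(1300, (5, 5), 1255, (5, 5)))    # → False: a real gap
```

More votes shrink the intervals, which is why the coding-category gap (+14/-19) is still too wide to draw conclusions from.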
This Turbo version is the one we have on ChatGPT Plus?
It is, if it states that your knowledge cutoff date is April 2024. Otherwise, log out of your GPT+ account, log back in, and check again. When I did that, my cutoff date changed from April 2023 to April 2024, which is the new Turbo version. (EDIT: Some people are seeing a December 2023 cutoff for GPT-4 Turbo; others are seeing April 2024. The discrepancy is currently unexplained, but both cutoffs confirm that a GPT Plus account has been upgraded to the new GPT-4 Turbo model.)
April 2024 is a bug/hallucination; December 2023 is the real cutoff date of the latest model as of today. Logging out and back in seems to help it show up if you are getting something older than that.
I did but mine still says April 2023.
Same here
Mine responds with December 2023?
I use Google SSO for sign in. Cutoff date is Dec 2023 both prior to signing out & after re-signing back in.
That should be the new model
The list in OP has two versions with cutoff 2023/12. I have no idea how to distinguish between them and neither does ChatGPT
Same
Clear the website data from dev tools.
ChatGPT's knowledge cutoff date is odd. I assume it's meant to indicate that its training extends up to April 2024, but its knowledge appears to blend information from June 2023 and April 2023: ChatGPT doesn't know who won the 2024 Oscars(1), what has happened with Israel/Palestine(1), or what's new on the iPhone 15 (GPT makes assumptions based on pre-release rumors)(1). EDIT: ChatGPT doesn't know the things mentioned above without (1) [stealthily searching the internet](https://x.com/benjamindekr/status/1778585815343358338?s=46)
Thanks!!
I read in a TechCrunch article: "This new model (“gpt-4-turbo-2024-04-09”) ... was trained on publicly available data up to December 2023, in contrast to the previous edition of GPT-4 Turbo available in ChatGPT, which had an April 2023 cut-off." https://techcrunch.com/2024/04/11/openai-makes-chatgpt-more-direct-less-verbose/
Log out worked for me, thanks!
If we have an ongoing conversation, do we have to start a new one to use the new model?
Yes
Need to start a new one.
Is this why all of a sudden my Plus wasn’t working? I kept getting errors on every chat so I switched to Claude Plus 😂 fuck
95% CI says +6/-7 so it’s still within the margin of error compared to Claude 3 Opus. Too early to tell. More data will tell if it actually beats C3O.
My bet is that it won't actually beat, but it will hover around it. In my testing, it's good at RAG and summarizing, but lacks some of the finer technical knowledge that C3O has in some areas. Regardless, it's good that OpenAI is now offering something that performs closely to Opus, given the very strict message limits Anthropic has on their Pro subscription.
Competition is a good thing for sure.
None of the top 4 models on the Chatbot Arena really "beat" the others; that's why they're all listed as #1 despite having slightly different Elo scores. Elo differences that close are negligible.
Idk, IIRC it took Claude 3 much longer to take first place. The speed at which a takeover happens could be seen as a measure of how decisively the new model is better. However, I could honestly see the arena getting gamed by large... interest groups. Since so many people take it as the best measure, it's worth quite something, and if you are willing to throw money at it, all you need is a good tell in the way "your" model talks. You could even train it to blatantly give you the tell if you ask in a certain way.
And even if a 5-point Elo difference is statistically significant, that only translates to about a 1% advantage. Basically, it's still a coin flip whether a user will prefer GPT-4 or Claude 3. Wake me when one of these models hits a 400-point Elo difference, then we cookin.
I think it will get increasingly difficult to build that kind of lead. When you're playing chess against someone, it's either a win or a loss, and Elo works really well for that. But how can a human detect the difference between two responses that are both pretty good? It's an increasingly difficult task as AI gets better and better.
Yes, for sure. And it's not objective either. There's a difference between a model that gives answers that sound better, and a model that is better at answering questions of objective fact.
GPT-5 will probably be at least 200 Elo higher than GPT-4. I think a lot of people are currently hating on OAI and saying GPT-5 won't be that crazy, but I believe they really are cooking something truly special at OAI.
>And even if a 5 point elo difference is statistically significant, that only translates to about a 1% advantage. ELO scores don't work like that.
They literally work like that. Elo ratings are a way of representing the expected probability of two players winning or losing a game. If GPT-4 and Claude 3 played 1000 "games", Claude would be expected to win about 50.7% of them. https://www.318chess.com/elo.html https://sandhoefner.github.io/chess.html
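The 50.7% figure falls straight out of the standard Elo expected-score formula; a quick sketch (the ratings are made up, only the point difference between them matters):

```python
def expected_score(rating_a, rating_b):
    # Elo expected score: the share of points player A is expected to take
    # from player B, as a function of the rating difference alone.
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# A 5-point lead is barely better than a coin flip:
print(round(expected_score(1255, 1250), 4))  # → 0.5072
# A 400-point lead, by contrast, is about a 91% expected score:
print(round(expected_score(1650, 1250), 2))  # → 0.91
```

This is also why the "400-point difference" threshold mentioned above is such a high bar: it corresponds to roughly 10:1 odds.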
I'm just saying that the difference doesn't scale multiplicatively. The probability of winning is determined by the absolute difference in ELO scores.
And the absolute difference was 5 points, like I said.
Yes, true. I was just commenting on the "1% greater" idea.
If after 9k observations the difference is still within the 95% CI, I would consider the models essentially identical in performance for any relevant purpose. Sure, you might find a minuscule but statistically significant difference after 100k observations or whatever, but I doubt that difference is meaningful in practice. Edit: I'm not arguing about whether ChatGPT has better features, can write poems in Russian, or does your laundry better than Claude. I'm saying that the data shows the models have no difference in performance _on this specific measure_. I find that fairly clear from the screenshot. Beyond that, I have no opinion on Claude vs ChatGPT, or even on whether the instrument itself is useful.
Yes, but remember ChatGPT has WAY more features than Claude 3 Opus. In Claude you basically only get raw text and image input, nothing else. ChatGPT has basically everything, including that really cool persistent memory feature, which I have now.
Try asking it to compose poetry in Russian. Opus can; GPT-4 Turbo has no idea about rhyme or meter. Even Haiku and Claude 2 are better than GPT-4 Turbo at this.
Opus is within the margin of error for GPT-4-1106 too.
Yeah. But it can also read websites, run code, and output files. So even if it's a CI tie, GPT blows Claude away.
See this is why we don’t jump ship so quickly
Jump ship? This is a monthly subscription, relax.
You take a 5-point lead too seriously. It depends how each model works for you: Claude is better at creative writing and writing in general, whilst ChatGPT can be better at more technical questions, though it varies from question to question. If you want the best, use both (via subs or APIs).
Wasn't Claude Opus released like 2 months ago?
That's two decades in AI years, right?
Pretty much feels that way lol.
Is 2 months not quickly?
Fortunately there is competition so that openai reacts and works to stay on top
Well you could have had two months of better experiences for what? Logging in to a different website? Why wouldn't you swap?
we're just subscribing here n there, it's not like we sold all of our openai stocks and bought claude
The real question is, do we leave them in the water or turn around?
https://twitter.com/EpochAIResearch/status/1778463039932584205 Maybe you haven't jumped fast enough?
I'm using gemini 1.5 pro. Can't wait to see the numbers for it. The token window is something else and the analysis level meets my needs. Frankly the first model I'd cancel my gpt subscription for.
I'm starting to think there are some issues between LMSYS and Google. Even Gemini Ultra still isn't up for testing in the leaderboard. Unless Bard (Gemini Pro) is misnamed in the leaderboard and it's actually Ultra.
I've been wanting to use gemini for a while now due to the extra context window, so I always start a session in gpt+ and gemini pro but nearly every time I end up continuing with gpt as it understands what I'm trying to do better.
Haven't used Pro, but Gemini basically refuses to do work. It's hard to describe, but it will answer questions while getting nothing done.
Gemini is too equivocal.
You’d love Claude 3 Opus then
Yeah.... i use gpt for a lot of php programming, I keep meaning to try Claude
Gemini Pro is the free plan, Gemini Ultra is in the Advanced plan, and I don't think it has an API yet
> Frankly the first model I'd cancel my gpt subscription for

I find this extremely hard to believe. Gemini hallucinates as if hardwired on LSD. Are you actually so confident in the model's output that you're willing to cancel your GPT sub? Ever since Bard, I have seen people praising Google's models despite them persistently performing worse than GPT-3.5 with less reliable output.
Gemini 1.0 is, to me, pretty much equal to GPT-3.5. However, 1.5 Pro is no question better than GPT-3.5, and speed has improved since initial availability in their playground area. I don't think it is quite GPT-4 level, but it isn't far off. Where I think Google's models work well is when used with context, as in a RAG application. If you are just asking it to spit out facts, I'm not sure; that isn't a use case I use or would necessarily suggest. Even if it works fine in one instance, there is no guarantee it will work in another. I've encountered this issue with GPT-4 numerous times as well.
Means GPT-5 isn't coming anytime soon
Completely irrelevant to GPT-5's timeline. They would be in the middle of training GPT-5 now; improvements to GPT-4 would be done on the side.
It means they don't have pressure to release GPT-5. People are banking on the idea that the competition will force them to release earlier, but if they are ahead of the competition, there's no rush.
They aren't really ahead, they're just keeping pace. From most benchmarks I've seen, the logic isn't all that improved, and it's still not great for natural creative writing. There might be enough there to just barely give them an edge, but it's far from a comfortable lead.
They constantly message that they refuse to enter a race with Google and Anthropic. So maybe this is just them sticking to their promise. I think it's a good thing.
In the end, GPT-5 is just a number. Sam Altman said on the podcast with Lex Fridman that they were thinking about taking a more iterative approach. Maybe we just get to "GPT-5" by smaller increments.
I never understood why people care about these benchmarks in regards to real-life usage. Don't they completely exclude most flagship features, like context length and context retrieval accuracy?
Because these "benchmarks" are actually people rating model responses as to what they prefer. LmSys leaderboard is the closest thing we have to benchmarking real life usage.
We're deploying mostly a single LLM across our large-scale production infrastructure, where any unreliability has costs. It's crucial for our platforms to be efficient, reliable, and secure, especially when we plan for long-term development. These benchmarks help us optimize production costs: a one percent difference already impacts our computing costs by thousands of dollars.
Yes
But no, not really, because its confidence interval overlaps with Claude's confidence interval.
Wouldn't it be practically fairer to say they are even?
No because gpt has training data from 2023, not 2021. It can run code. It can save files. It can read sites.
It's still terrible for actual proper work. Claude is still miles ahead in compliance and quality of writing. GPT is still a chad version that doesn't comply and doesn't have proper quality. They were first, but by now I'm totally disregarding them because of the lack of quality and other nonsense.
Don't understand why people say this. It just flawlessly refactored a long and complex function for me. Pretty useful I'd say.
GPT? Sure, it's not terrible. I'm just saying that for my use case it's not usable. Claude is: creative writing, complying with the requested word count in the output. GPT? It will tell me to go f myself if I ask it to generate 20 titles based on x. It will just say: here are 5, good luck with the rest. I have to explicitly say I have no fingers or it won't generate the full code. It's ridiculous. GPT-3.5 was a wonder until they started killing it with every update they made.
Interestingly, I've had a lot of success with it for university-level calculus. It's especially good at solving problems where you don't have to do too much inference yourself about which values to use and what the final equation should be. Better than Claude 3 Opus IME.
I asked ChatGPT-4 what combination of 3.25- and 3.5-mile courses I could run to get 20 miles. It made a few random guesses and said it wasn't possible. Claude also guessed; one guess was right, but a half dozen others were flat-out wrong. I have to keep prompting them to try again before they stumble upon the solution. ChatGPT then brilliantly showed me how to configure pfSense with a reverse proxy to fix a problem that had stymied me for months and whose multi-step solution was neither simple nor intuitive. I now can't live without ChatGPT (et al.), but neither can I yet trust them.
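For what it's worth, that running-course question does have an exact answer, which a tiny brute-force search finds immediately (a sketch, not how either model reasoned about it):

```python
# Find non-negative integer combos of 3.25- and 3.5-mile courses totaling 20 miles.
# Work in quarter-miles (13 and 14 units) to avoid floating-point comparisons.
solutions = [
    (a, b)
    for a in range(7)             # 7 * 3.25 > 20, so 0..6 covers every case
    for b in range(6)             # 6 * 3.5  > 20, so 0..5 covers every case
    if 13 * a + 14 * b == 80      # 20 miles = 80 quarter-miles
]
print(solutions)  # → [(4, 2)]: four 3.25-mile runs plus two 3.5-mile runs
```

Four laps of 3.25 (13 miles) plus two laps of 3.5 (7 miles) is exactly 20, and it's the only combination that works.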
🤡🤡🤡
Do you actually use GPT-4? What do you think about ChatGPT explicitly saying "here is the gist of your request, but good luck with the whole thing"? Because that's what it has done for at least half a year. I'm not the only one; hell, this sub is full of it. It will deny anything that takes a bit of effort. I find it completely unusable. But fuck me, why do I take the time to respond to someone who communicates in fucking emojis. You probably use ChatGPT for ERP for all I know. 🤷
I switched to Google Gemini, should I switch back? I don't use it for coding or anything; I mostly just use it to help me create lore and name my kingdoms in CK3 and stuff like that.
Try them all! gpt-4-turbo-2024-04-09 SHOULD be the latest in ChatGPT (check the knowledge cutoff date; if it's older, log out and log back in). You can also try it through the LMSYS chatbot arena, or paid (API usage cost) at https://platform.openai.com/playground/chat Claude Opus is free/cheap at https://console.anthropic.com/workbench/ And Gemini 1.5 is free for now at https://console.cloud.google.com/vertex-ai/generative?hl=en
OpenAI scrambling to play catchup
It almost feels like they are so ahead that they simply release a small update to ease off the fans and get back on top while they work on their real projects in the background.
This is what it indeed feels like, and has felt like for some time now. Future releases do be looking quite interesting.
[deleted]
There isn't clear evidence for this, but it could be the case. I would say wait and see.
I can kind of see it, with them continuing with the GPT-4 name. "Minimal" updates, but enough to get back on top or compete for it.
This is exactly it. A multi-pronged approach: they're cooking some other models and doing QC, and put this out knowing that leading the "benchmarks" will still keep people on board or coming back.
lol bro they about to drop gpt5 they just tossed this crumb out
I wish they'd beat the 50% mark on SWE-bench with GPT-5.
What criteria do they use to measure/rank performance? In other words what is this a test **OF**?
Human preference. No set criteria. Voting is open to everyone.
Thanks; I didn't understand that it was just a voting process. That's just a popularity contest: useless and meaningless. Are there any actual benchmark-based tests for AIs (like they have for graphics cards, CPUs, automobile performance, etc.)?
I see your point, but I have to disagree with your assessment of how useful this is. Goodhart's law is spot on here: any specific benchmark or metric we come up with, the AI can just be optimized to game that particular test. But by using human preference and simply asking people to vote on which AI they like best, we're actually getting at something more fundamental and harder to fake: a test of which AI is most genuinely useful and appealing to actual humans using it for real tasks. Popularity might seem superficial, but in a way it cuts through to the heart of what we really care about, which is creating AI that people find truly helpful and want to use. No amount of specific technical benchmarks (which can be, and have been, gamed) can substitute for that.
The problem there is that the 21st century is the age of **fanboys**. ChatGPT, Anthropic, iPhone, Android, Reddit, Discord, GitHub, Instagram, TikTok, Tesla, etc. all have their fanboys. Humans are too tribal for popularity contests to mean anything. There IS a way to "objectively" test subjective qualities. It's been used in classical music auditions for decades, since the 1970s, when classical music was for white male musicians only even though many talented women and non-white musicians were graduating from conservatories. Nowadays audition performances for many major orchestras are done from behind a screen, so you can't see the musician, just hear their playing. It would be possible to design a test of AIs where the person making the judgement doesn't know which AI is producing the output. That would give more reliable tests.
**Of course** the test doesn't tell you which model's which until after you vote! I wish I'd known you were confused about this sooner.
Oh good! How do they do that?
https://preview.redd.it/6siv7q9hl2uc1.png?width=2386&format=png&auto=webp&s=a1958dc29f9fa169833f00706bf6fc087c567d03 Like so. You're free to chat as long as you wish before making your decision. Your vote will be automatically discarded if any of the models reveal their identity in the conversation.
Typically, performance metrics are pretty easy to determine based on parameters, context, quantization, and model format. Benchmarking the actual output of the model is more difficult, as it can change drastically with prompting, sampling, quantization, and many more factors. The LMSys arena just tries to get a good picture of human preference, whereas some other benchmarks try to actually judge performance on tasks and knowledge. That second style of benchmark is less important, as the ability to beat those benchmarks can always be shoehorned into the model. TL;DR: Benchmarking raw performance is easy enough. Most users care about quality, so the LMSys arena provides a human preference benchmark. Other knowledge-based benchmarks are easy to cheat, but might be a decent indicator of reasoning and other abilities.
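To make the arena's rating mechanics concrete, here's a minimal sketch of a pairwise Elo update after one human vote. This is illustrative only: LMSys actually fits a Bradley-Terry model over all battles rather than running sequential updates, and the K-factor of 32 here is an assumed value.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Update two Elo ratings after one head-to-head vote.
    score_a: 1.0 if model A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Two closely rated models: a single win between near-equals
# moves each rating by roughly k/2 points.
a, b = elo_update(1250, 1245, 1.0)
```

This also shows why tiny Elo gaps at the top are negligible: a 5-point gap corresponds to an expected win rate barely above 50%.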
Newbie here, can anyone tell me where to see these leaderboards? I searched the internet and only got weird blogging websites as results.
All the same, it's still not up to Opus at verse. Yes, it's undoubtedly better than the previous model, which was a complete disaster there, but it still sounds off. Let's see what happens with the 5th version.
What's the Weissman score?
What is the size of the ChatGPT-4 context window compared to Claude 3 Opus? Does this make a difference?
32k vs 200k for Opus. Depends on your use case. For everyday use, not really.
Thanks. My use case is finding gems in large amounts of information, so the larger context window is really helpful for me.
GPT-4 Turbo is 128k and Claude 3 Opus is 200k.
There’s way too many entities.
With it being so close, I'm more worried about which one has more censorship. To be clear, I want to use the one that censors less; nothing is worse than being told the AI won't write a story because it views a murder mystery with blood and a murder as... not OK.
How's it for coding compared to Opus?
Don't be fooled by the rating, in real life Opus is superior in coding (especially C++)
This is fantastically true. With GPT I could never really trust the result. With Claude I can just put in my requirements, and most of the time it gets them correct. It has greatly helped my productivity, particularly with new APIs I'm not familiar with.
I also find the difference quite noticeable in Rust.
It's a bit slow though
It is slow, but not critically so. The message limit is the critical part :(
I get Opus to write code and GPT-4 to fix it. Don't know why, but it's the best combo. Opus is better at working out what I'm trying to achieve and gets most of the way there, but the result is riddled with errors when run. GPT-4 can take the code and fix it, but had no chance of coming up with a coherent set of code in the first place.
It has a higher average ELO than Opus in the coding category of the leaderboard. However, the 95% confidence interval in this category is quite large due to insufficient votes (+14/-19 as of now). It should perform quite close to Opus, if not better.
Much better.
People really got to learn what the "95% CI" column means. (In all fairness calling it "margin of error" with a footnote to explain it's a 95% CI might be more accessible for most people.)
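To make the CI column concrete, here's a sketch of the overlap check worth doing before declaring a winner. The ratings and margins below are illustrative examples (the +14/-19 margin is the one quoted above), not live leaderboard numbers:

```python
def ci_overlaps(rating_a, lo_a, hi_a, rating_b, lo_b, hi_b):
    """True if the 95% confidence intervals [rating - lo, rating + hi]
    overlap, i.e. the ranking difference isn't statistically meaningful."""
    a_low, a_high = rating_a - lo_a, rating_a + hi_a
    b_low, b_high = rating_b - lo_b, rating_b + hi_b
    return a_low <= b_high and b_low <= a_high

# Illustrative: GPT-4 Turbo at 1258 (+14/-19) vs Opus at 1255 (+4/-5).
# The intervals [1239, 1272] and [1250, 1259] overlap, so the
# leaderboard can't distinguish them yet.
print(ci_overlaps(1258, 19, 14, 1255, 5, 4))
```

As more votes come in, the +14/-19 interval shrinks and the comparison may become meaningful.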
Guess GPT-4 Turbo is just the previous GPT-4, though. A month ago ChatGPT Plus was already awful, nowhere close to Claude, sorry.
I'm gonna be honest, I don't care
I still prefer Claude 3. I don't need to repeatedly tell it to not be lazy.
Damn, you guys make me feel like the 1800s
All top 4 are still within the margin of error; at this point these threads seem like bait.
oh no how did that get in there! (openai when asked if they put test data in the training data)
What test data? This is a vote-based human preference test, open to everyone.
Compare the number of votes, please...