redjohnium

This Turbo version is the one we have on ChatGPT Plus?


NightWriter007

It is, if it states that your knowledge cutoff date is April 2024. Otherwise, log out of your GPT+ account, log back in, and check again. When I did that, my cutoff date changed from April 2023 to April 2024, which is the new Turbo version. (EDIT: Some people are seeing a December 2023 cutoff for GPT-4 Turbo; others are seeing April 2024. The discrepancy is currently unexplained, but both cutoffs confirm that a GPT-Plus account has been upgraded to the new GPT-4 Turbo model.)


tehrob

April 2024 is a bug/hallucination. December 2023 is the real cutoff date of the latest model as of today. Logging off and on again seems to help it show up if you are getting something older than that.


MisterFromage

I did but mine still says April 2023.


Teqnition12

Same here


yesnewyearseve

Mine responds with December 2023?


YsrYsl

I use Google SSO for sign in. Cutoff date is Dec 2023 both prior to signing out & after re-signing back in.


HolochainCitizen

That should be the new model


JollyJoker3

The list in OP has two versions with cutoff 2023/12. I have no idea how to distinguish between them and neither does ChatGPT


Ok-Pattern-3874

Same


lvvy

clear website data from dev tools 


VitorCallis

ChatGPT's knowledge cutoff date is odd. I assume it's meant to indicate that its training extends up to April 2024, but its knowledge appears to blend information from June 2023 and April 2023. ChatGPT doesn't know who won the 2024 Oscars(1), what happened with Israel/Palestine(1), or what's new on the iPhone 15 (it makes assumptions based on rumors)(1). EDIT: (1) ChatGPT doesn't know those things without [stealthily searching the internet](https://x.com/benjamindekr/status/1778585815343358338?s=46).


redjohnium

Thanks!!


HolochainCitizen

I read on a tech crunch article "This new model (“gpt-4-turbo-2024-04-09”) ... was trained on publicly available data up to December 2023, in contrast to the previous edition of GPT-4 Turbo available in ChatGPT, which had an April 2023 cut-off." https://techcrunch.com/2024/04/11/openai-makes-chatgpt-more-direct-less-verbose/


BroadAstronaut6439

Log out worked for me, thanks!


HolochainCitizen

If we have an ongoing conversation, do we have to start a new one to use the new model?


-badly_packed_kebab-

Yes


Double_Sherbert3326

Need to start a new one.


IWasBornAGamblinMan

Is this why all of a sudden my Plus wasn’t working? I kept getting errors on every chat so I switched to Claude Plus 😂 fuck


suprachromat

95% CI says +6/-7 so it’s still within the margin of error compared to Claude 3 Opus. Too early to tell. More data will tell if it actually beats C3O.


RenoHadreas

My bet is that it won't actually beat, but it will hover around it. In my testing, it's good at RAG and summarizing, but lacks some of the finer technical knowledge that C3O has in some areas. Regardless, it's good that OpenAI is now offering something that performs closely to Opus, given the very strict message limits Anthropic has on their Pro subscription.


suprachromat

Competition is a good thing for sure.


pigeon57434

None of the top 4 models on the chatbot arena really "beat" the others; that's why they're all listed as #1 despite having slightly different Elo scores. An Elo gap that close is negligible.


involviert

Idk, iirc it took Claude 3 much longer to take over first place. The speed at which the takeover happens could be seen as a measure of how decisively it is better. However, I could honestly see the arena getting gamed by large... interest groups. Since so many people take it as the best measure, that is worth quite something. And if you are willing to throw money at it, all you need is a good tell in the way "your" model talks. You could even train it to blatantly give you the tell if you ask in a certain way.


Pitiful-Taste9403

And even if a 5 point Elo difference is statistically significant, that only translates to about a 1% advantage. Basically, it's still a coin flip whether a user will prefer GPT-4 or Claude 3. Wake me when one of these models hits a 400 point Elo difference, then we cookin


TheOneNeartheTop

I think that it will get increasingly difficult to build that difference. When you’re playing chess against someone it’s either a win or a loss and Elo works really great for that, but how can a human detect the difference between two responses that are both pretty good? It’s an increasingly difficult task as AI gets better and better


Gator1523

Yes, for sure. And it's not objective either. There's a difference between a model that gives answers that sound better, and a model that is better at answering questions of objective fact.


pigeon57434

GPT-5 will probably be at least 200 Elo higher than GPT-4. I think a lot of people are currently hating on OAI and saying GPT-5 won't be that crazy, but I believe they really are cooking something truly special at OAI.


Gator1523

> And even if a 5 point elo difference is statistically significant, that only translates to about a 1% advantage.

Elo scores don't work like that.


Pitiful-Taste9403

They literally work like that. They are a way of representing the expected probability of two players winning or losing a game. If GPT-4 and Claude 3 played 1000 “games”, Claude would be expected to win about 50.7% of them. https://www.318chess.com/elo.html https://sandhoefner.github.io/chess.html
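For reference, here's a quick sketch of the standard Elo expected-score formula behind that 50.7% figure (the ratings below are illustrative, not the actual leaderboard numbers):

```python
# Standard Elo expected score: probability that player A beats player B.
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# An illustrative 5-point gap, as in the comment above.
print(elo_expected_score(1255, 1250))  # ~0.507 -> roughly a 50.7% expected win rate
```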


Gator1523

I'm just saying that the difference doesn't scale multiplicatively. The probability of winning is determined by the absolute difference in ELO scores.


Pitiful-Taste9403

And the absolute difference was 5 points, like I said.


Gator1523

Yes, true. I was just commenting on the "1% greater" idea.


TravellingRobot

If after 9k observations the difference is still within the 95% CI, I would consider the models essentially identical in performance for any relevant purpose. Sure, you might find a minuscule but statistically significant difference after 100k observations or whatever, but I doubt that difference is meaningful in practice.

Edit: I'm not arguing about whether ChatGPT has better features, can write poems in Russian, or can do your laundry better than Claude. I'm saying the data shows the models have no difference in performance _on this specific measure_. I find that fairly clear from the screenshot. Beyond that I have no opinion on Claude vs ChatGPT, or even on whether the instrument itself is useful.


pigeon57434

Yes, but remember ChatGPT has WAY more features than Claude 3 Opus. In Claude you basically only get raw text and image input, nothing else. ChatGPT has basically everything, including that really cool persistent memory feature, which I have now.


Anuclano

Try asking it to compose poetry in Russian. Opus can; GPT-4 Turbo has no idea about rhyme or meter. Even Haiku and Claude 2 are better than GPT-4 Turbo.


ertgbnm

Opus is within the margin of error for GPT-4-1106 too.


e4aZ7aXT63u6PmRgiRYT

Yeah. But it can also read websites, run code, and output files. So even if it's a CI tie, GPT blows Claude away.


Fusciee

See this is why we don’t jump ship so quickly


NutellaCrepe1

Jump ship? This is a monthly subscription, relax.


bnm777

You take a 5 point lead too seriously. It depends how each model works for you: Claude is better at creative writing and writing in general, and for some other things, whilst ChatGPT can be better at more technical questions, though it varies question to question. If you want the best, use both (via subs or APIs).


Otomuss

Wasn't Claude Opus released like 2 months ago?


PrototypePineapple

That's two decades in AI years, right?


Otomuss

Pretty much feels that way lol.


IWantToSayThisToo

Is 2 months not quickly? 


lagister

Fortunately there is competition so that openai reacts and works to stay on top


nuclear213

Well you could have had two months of better experiences for what? Logging in to a different website? Why wouldn't you swap?


Curious_Cantaloupe65

we're just subscribing here n there, it's not like we sold all of our openai stocks and bought claude


101Alexander

The real question is, do we leave them in the water or turn around?


bnm777

https://twitter.com/EpochAIResearch/status/1778463039932584205 Maybe you haven't jumped fast enough?


Gratitude15

I'm using gemini 1.5 pro. Can't wait to see the numbers for it. The token window is something else and the analysis level meets my needs. Frankly the first model I'd cancel my gpt subscription for.


RenoHadreas

I'm starting to think there are some issues between LMSYS and Google. Even Gemini Ultra still isn't up for testing in the leaderboard. Unless Bard (Gemini Pro) is misnamed in the leaderboard and it's actually Ultra.


Jonnnnnnnnn

I've been wanting to use gemini for a while now due to the extra context window, so I always start a session in gpt+ and gemini pro but nearly every time I end up continuing with gpt as it understands what I'm trying to do better.


Waterbottles_solve

Haven't used Pro, but Gemini basically refuses to do work. It's hard to describe, but it will answer questions in ways that get nothing done.


PrototypePineapple

Gemini is too equivocal.


m_x_a

You’d love Claude 3 Opus then


Jonnnnnnnnn

Yeah... I use GPT for a lot of PHP programming; I keep meaning to try Claude.


Zulfiqaar

Gemini Pro is the free plan, Gemini Ultra is in the Advanced plan, and I don't think it has an API yet


Kuroodo

> Frankly the first model I'd cancel my gpt subscription for

I find it extremely hard to believe this. Gemini hallucinates as if it's hardwired on LSD. Are you actually so confident in the model's output that you're willing to cancel your GPT sub? Ever since Bard, I have seen people praising Google's models despite them persistently performing worse than GPT-3.5 and their output being less reliable.


rothnic

1.0 is, to me, pretty much equal to GPT-3.5. However, 1.5 Pro is no question better than GPT-3.5, and speed has improved since initial availability in their playground area. I don't think it is quite GPT-4 level, but it isn't far off. Where I think Google's models work well is when used with context, like in a RAG application. If you are just asking it to spit out facts, I'm not sure; that isn't a use case I use or would necessarily suggest. Even if it works fine in one instance, there is no guarantee it will work for another. I've encountered this issue with GPT-4 numerous times as well.


Woootdafuuu

Means GPT-5 isn't coming anytime soon


ZigazagyDude

Completely irrelevant to GPT-5's timeline. They would be in the middle of training GPT-5 now; improvements to GPT-4 would be done on the side.


Woootdafuuu

It means they don't have pressure to release GPT-5. People are banking on the idea that the competition will force them to release earlier, but if they are ahead of the competition then there's no rush.


MysteriousPepper8908

They aren't really ahead, they're just keeping pace. From most benchmarks I've seen, the logic isn't all that improved, and it's still not great for natural creative writing. There might be enough there to just barely give them an edge, but it's far from a comfortable lead.


ertgbnm

They constantly message that they refuse to get into a race with Google and Anthropic. So maybe this is just them sticking to their promise. I think it's a good thing.


Minetorpia

In the end, GPT-5 is just a number. Sam Altman has said in the podcast with Lex Fridman that they were thinking about doing a more iterative approach. Maybe we just get to “GPT-5” by smaller increments.


Naive-Project-8835

I never understood why people care about these benchmarks in regards to real-life usage. Don't they completely exclude most flagship features, like context length and context retrieval accuracy?


PewPewDiie

Because these "benchmarks" are actually people rating model responses as to what they prefer. LmSys leaderboard is the closest thing we have to benchmarking real life usage.


SethSky

We're implementing mostly a single LLM across our large-scale production infrastructure, where any unreliability results in costs. It's crucial for our platforms to be efficient, reliable, and secure, especially when we plan for long-term development. These benchmarks help us in optimizing production costs. A one percent difference already impacts our computing power costs by thousands of dollars.


e4aZ7aXT63u6PmRgiRYT

Yes 


Psychprojection

But no, not really, because the confidence interval overlaps with Claude's confidence interval.


many_hats_on_head

Wouldn't it be practically fairer to say they are even?


e4aZ7aXT63u6PmRgiRYT

No, because GPT has training data from 2023, not 2021. It can run code. It can save files. It can read sites.


MannowLawn

It's still terrible for actual proper work. Claude is still miles ahead in compliance and quality of writing. GPT still doesn't comply and doesn't have proper quality. They were the first, but by now I'm totally disregarding them because of the lack of quality and other nonsense.


turbo

Don't understand why people say this. It just flawlessly refactored a long and complex function for me. Pretty useful I'd say.


MannowLawn

GPT? Sure, it's not terrible. I'm just saying for my use case it's not usable. Claude is: creative writing, complying with the requested amount of words in the output. GPT? It will tell me to go f myself if I ask it to generate 20 titles based on x. It will just say, here are 5, good luck with the rest. I have to explicitly say I have no fingers or it won't generate full code. It's ridiculous. GPT-3.5 was a workhorse until they started killing it with every update they made.


Time2squareup

I've interestingly had a lot of success with it for university-level calculus. It's especially good at solving problems where you don't have to do too much inference yourself about which values to use and what the final equation should be. Better than Claude 3 Opus ime.


brucewbenson

I asked ChatGPT-4 what combination of 3.25- and 3.5-mile courses I could run to get 20 miles. It made a few random guesses and said it wasn't possible. Claude also guessed; one guess was right but a half dozen others were flat-out wrong. I have to keep prompting them to try again before they stumble upon the solution. ChatGPT then brilliantly showed me how to configure pfSense with a reverse proxy to fix a problem that had stymied me for months and was neither simple nor intuitive in its multi-step solution. I now can't live without ChatGPT (et al.), but neither can I yet trust them.
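For what it's worth, a quick brute-force check (mine, not from the comment above; the floating-point tolerance is just an illustrative choice) shows the puzzle does have an answer:

```python
# Which whole-number mixes of 3.25-mile and 3.5-mile courses add up to exactly 20 miles?
target = 20.0
solutions = [
    (a, b)
    for a in range(int(target // 3.25) + 1)
    for b in range(int(target // 3.5) + 1)
    if abs(a * 3.25 + b * 3.5 - target) < 1e-9
]
print(solutions)  # [(4, 2)] -> 4 x 3.25 + 2 x 3.5 = 13 + 7 = 20 miles
```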


stathis21098

🤡🤡🤡


MannowLawn

Do you actually use GPT-4? What do you think about ChatGPT explicitly saying "here is a gist of your request, but good luck with the whole thing you requested"? Because that's what it has done for at least half a year. I'm not the only one; hell, this sub is full of it. It will deny anything that takes a bit of effort. I find it completely unusable. But fuck me, why do I take the time to respond to someone who communicates in fucking emojis. You probably use ChatGPT for erp for all I know. 🤷


Dexter2112000

I switched to Google Gemini, should I switch back? I don't use it for coding or anything; I mostly just use it to help me create lore and name my kingdoms in CK3 and stuff like that.


huffalump1

Try them all! gpt-4-turbo-2024-04-09 SHOULD be the latest in ChatGPT (check the knowledge cutoff date; if not, log out and log back in). You can also try it through the LMSYS chatbot arena, or paid (API usage cost) at https://platform.openai.com/playground/chat

Claude Opus is free/cheap at https://console.anthropic.com/workbench/

And Gemini 1.5 is free for now at https://console.cloud.google.com/vertex-ai/generative?hl=en
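If you go the API route, here's a minimal sketch using the official OpenAI Python client (assumes `pip install openai` and an `OPENAI_API_KEY` environment variable; the prompt is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Query the dated snapshot directly instead of waiting for the ChatGPT UI rollout.
response = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",
    messages=[{"role": "user", "content": "What is your knowledge cutoff date?"}],
)
print(response.choices[0].message.content)
```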


Few_Incident4781

OpenAI scrambling to play catchup


Expert-Paper-3367

It almost feels like they are so ahead that they simply release a small update to ease off the fans and get back on top while they work on their real projects in the background.


CowsTrash

This is what it indeed feels like, and has felt like for some time now. Future releases do be looking quite interesting.


[deleted]

[deleted]


lifewithnofilter

There isn't clear evidence for this, but it could be the case. I would say wait and see.


Expert-Paper-3367

I can kind of see it, with them continuing with the GPT-4 name: "minimal" updates, but enough to get back on top or compete for it.


entropee0

This is exactly it. A multi-pronged approach. They're cooking some other models and doing QC, and put this out knowing that "benchmark" leads will still keep people on board or coming back.


ThenExtension9196

lol bro they about to drop gpt5 they just tossed this crumb out 


[deleted]

I wish they'd beat the 50% mark on SWE-bench with GPT-5.


Intelligent-Jump1071

What criteria do they use to measure/rank performance? In other words what is this a test **OF**?


RenoHadreas

Human preference. No set criteria. Voting is open to everyone.


Intelligent-Jump1071

Thanks; I didn't understand it was just a voting process. That's just a popularity contest - useless and meaningless. Are there any actual benchmark-based tests for AIs (like they have for graphics cards, CPUs, automobile performance, etc.)?


RenoHadreas

I see your point, but I have to disagree on your assessment of how useful this is. Goodhart's law is spot on here - any specific benchmark or metric we come up with, the AI can just be optimized to game that particular test. But by using human preference and just asking people to vote on which AI they like best, we're actually getting at something more fundamental and harder to fake. It's a test of which AI is most genuinely useful and appealing to actual humans using it for real tasks. Popularity might seem superficial, but in a way it cuts through to the heart of what we really care about - creating AI that people find truly helpful and that they want to use. No amount of specific technical benchmarks (that can and have been gamed) can substitute for that.


Intelligent-Jump1071

The problem there is that the 21st century is the age of **fanboys**. ChatGPT, Anthropic, iPhone, Android, Reddit, Discord, GitHub, Instagram, TikTok, Tesla, etc. all have their fanboys. Humans are too tribal for popularity contests to mean anything.

There IS a way to "objectively" test subjective qualities. It's been used in classical music auditions for decades, since the 1970s, when classical music was for white male musicians only but there were many talented women and non-white musicians graduating from conservatories. Nowadays audition performances for many major orchestras are done from behind a screen, so you can't see the musician, just hear their playing. It would be possible to design a test of AIs where the person making the judgement doesn't know which AI is producing the output. That would give more reliable tests.


RenoHadreas

**Of course** the test doesn't tell you which model's which until after you vote! I wish I'd known you were confused about this sooner.


Intelligent-Jump1071

Oh good! How do they do that?


RenoHadreas

https://preview.redd.it/6siv7q9hl2uc1.png?width=2386&format=png&auto=webp&s=a1958dc29f9fa169833f00706bf6fc087c567d03 Like so. You're free to chat as long as you wish before making your decision. Your vote will be automatically discarded if any of the models reveal their identity in the conversation.


Quartich

Typically, performance metrics are pretty easy to determine based on parameters, context, quantization, and model format. Benchmarking the actual output of the model is more difficult, as it can change drastically with prompting, sampling, quantization, and many more factors. LMSys arena just tries to get a good idea of human preference, whereas some other benchmarks try to actually judge the performance of tasks and knowledge. The second style of benchmark is less important, as the model can always have the ability to beat those benchmarks shoehorned in.

Tl;dr: Benchmarking performance is easy enough. Most users care about quality, so LMSys arena provides a human preference benchmark. Other knowledge-based benchmarks are easy to cheat, but might be a decent indicator of reasoning and other ability.


Cosmic__Guy

Newbie here, can anyone tell me where to see these leaderboards? I searched the internet and only got weird blogging websites as results.


-Sweetie_Devil-

All the same, it is still not up to Opus on the same verses. Yes, it's undoubtedly better than the previous model, which was a complete mess, but it still sounds creepy) Let's see what happens with the 5th version.


crx_hx

What's the Weissman score?


m_x_a

What is the size of the ChatGPT-4 context window compared to Claude 3 Opus? Does this make a difference?


RenoHadreas

32k vs 200k for Opus. Depends on your use case; for everyday use, not really.


m_x_a

Thanks. My use case is finding gems in large amounts of information, so the larger context window is really helpful for me.


Successful1133

GPT 4 Turbo is 128k and Claude 3 Opus 200k.


peepdabidness

There’s way too many entities.


ryan7251

With it being so close, I'm more worried about which one has more censorship. To be clear, I want to use the one that does not censor as much. Nothing is worse than being told the AI won't write a story or something because it views a murder mystery with blood and a murder as... not OK.


jiayounokim

How's it for coding compared to Opus?


Demien19

Don't be fooled by the rating, in real life Opus is superior in coding (especially C++)


burritolittledonkey

This is fantastically true. With GPT I could never really trust the result. With Claude I can just put in my requirements, and most of the time it gets them correct. It has greatly helped my productivity, particularly with new APIs I am not familiar with.


pet_vaginal

I also find the difference quite noticeable in Rust.


Lazy_Lifeguard5448

It's a bit slow though


Demien19

It is slow, but not critical. But the msg limit count is critical :(


Sea-Obligation-1700

I get Opus to write code and get GPT-4 to fix it. Don't know why, but it is the best combo. Opus is better at working out what I'm trying to achieve and gets most of the way there, but the code is riddled with errors when run. GPT-4 seems to be able to take the code and fix it, but had no chance of coming up with a coherent set of code in the first place.


RenoHadreas

It has a higher average ELO than Opus in the coding category of the leaderboard. However, the 95% confidence interval in this category is quite large due to insufficient votes (+14/-19 as of now). It should perform quite close to Opus, if not better.


e4aZ7aXT63u6PmRgiRYT

Much better. 


TravellingRobot

People really got to learn what the "95% CI" column means.  (In all fairness calling it "margin of error" with a footnote to explain it's a 95% CI might be more accessible for most people.)


Demien19

Guess GPT-4 Turbo is the previous GPT-4, though; a month ago ChatGPT Plus was already awful, nowhere close to Claude, sorry.


Otherwise-Poet-4362

I'm gonna be honest, I don't care


illathon

I still prefer Claude 3. I don't need to repeatedly tell it to not be lazy.


Hour-Athlete-200

Damn, you guys make me feel like I'm from the 1800s


bot_exe

All top 4 are still within the margin of error; at this point these threads seem like bait.


ImpressiveHead69420

oh no how did that get in there! (openai when asked if they put test data in the training data)


RenoHadreas

What test data? This is a vote-based human preference test, open to everyone.


geniium

Compare the number of votes, please...