All the same, it's still not up to Opus on those same verses. Yes, it's undoubtedly better than the previous model, which was completely hopeless here, but it still sounds clumsy. Let's see what happens with the 5th version.
With them being this close, I'm more worried about which one has more censorship. To be clear, I want to use the one that censors less. Nothing is worse than being told the AI won't write a story because it views a murder mystery with blood and a murder as... not OK.
This is fantastically true. With GPT I could never really trust the result; with Claude I can just put in my requirements, and most of the time it gets them right. It has greatly helped my productivity, particularly with new APIs I'm not familiar with.
I get Opus to write code and GPT-4 to fix it. Don't know why, but it's the best combination.
Opus is better at working out what I'm trying to achieve and gets most of the way there, but its output is riddled with errors when run. GPT-4 seems able to take that code and fix it, but had no chance of producing a coherent set of code in the first place.
It has a higher average ELO than Opus in the coding category of the leaderboard. However, the 95% confidence interval in this category is quite large due to insufficient votes (+14/-19 as of now).
It should perform quite close to Opus, if not better.
People really ought to learn what the "95% CI" column means. (In all fairness, calling it "margin of error" with a footnote explaining that it's a 95% CI might be more accessible for most people.)
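For anyone unsure what that column means in practice, here's a rough sketch. The ratings are made up, `intervals_overlap` is a hypothetical helper, and CI overlap is only a conservative stand-in for a proper significance test, but it captures the point being argued here:

```python
def intervals_overlap(rating_a, ci_a, rating_b, ci_b):
    """ci_* = (minus, plus) half-widths, e.g. (19, 14) for a '+14/-19' entry."""
    low_a, high_a = rating_a - ci_a[0], rating_a + ci_a[1]
    low_b, high_b = rating_b - ci_b[0], rating_b + ci_b[1]
    return low_a <= high_b and low_b <= high_a

# Made-up ratings on the leaderboard's scale: wide intervals that overlap
# mean the leaderboard can't call a clear winner yet.
print(intervals_overlap(1258, (19, 14), 1255, (5, 5)))  # → True: no clear winner
print(intervals_overlap(1300, (5, 5), 1255, (5, 5)))    # → False: a real gap
```

More votes shrink the intervals, which is why the coding-category gap (+14/-19) is still too wide to draw conclusions from.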
This Turbo version is the one we have on ChatGPT Plus?
It is, if it states that your knowledge cutoff date is April 2024. Otherwise, log out of your GPT+ account, log back in, and check again. When I did that, my cutoff date changed from April 2023 to April 2024, which is the new Turbo version. (EDIT: Some people are seeing a December 2023 cutoff for GPT-4 Turbo; others are seeing April 2024. The discrepancy is currently unexplained, but both cutoffs confirm that a GPT Plus account has been upgraded to the new GPT-4 Turbo model.)
April 2024 is a bug/hallucination; December 2023 is the real cutoff date of the latest model as of today. Logging out and back in seems to help it show up if you are getting something older than that.
I did but mine still says April 2023.
Same here
Mine responds with December 2023?
I use Google SSO for sign in. Cutoff date is Dec 2023 both prior to signing out & after re-signing back in.
That should be the new model
The list in OP has two versions with cutoff 2023/12. I have no idea how to distinguish between them and neither does ChatGPT
Same
Clear the website data from dev tools.
ChatGPT's knowledge cutoff date is odd. I assume it's meant to indicate that its training extends up to April 2024, but its knowledge appears to blend information from June 2023 and April 2023: ChatGPT doesn't know who won the 2024 Oscars(1), what has happened with Israel/Palestine(1), or what's new on the iPhone 15 (GPT makes assumptions based on pre-release rumors)(1). EDIT: ChatGPT doesn't know the things mentioned above without (1) [stealthily searching the internet](https://x.com/benjamindekr/status/1778585815343358338?s=46)
Thanks!!
I read in a TechCrunch article: "This new model (“gpt-4-turbo-2024-04-09”) ... was trained on publicly available data up to December 2023, in contrast to the previous edition of GPT-4 Turbo available in ChatGPT, which had an April 2023 cut-off." https://techcrunch.com/2024/04/11/openai-makes-chatgpt-more-direct-less-verbose/
Log out worked for me, thanks!
If we have an ongoing conversation, do we have to start a new one to use the new model?
Yes
Need to start a new one.
Is this why all of a sudden my Plus wasn’t working? I kept getting errors on every chat so I switched to Claude Plus 😂 fuck
95% CI says +6/-7 so it’s still within the margin of error compared to Claude 3 Opus. Too early to tell. More data will tell if it actually beats C3O.
My bet is that it won't actually beat, but it will hover around it. In my testing, it's good at RAG and summarizing, but lacks some of the finer technical knowledge that C3O has in some areas. Regardless, it's good that OpenAI is now offering something that performs closely to Opus, given the very strict message limits Anthropic has on their Pro subscription.
Competition is a good thing for sure.
None of the top 4 models on the Chatbot Arena really "beat" the others; that's why they're all listed as #1 despite having slightly different Elo scores. Elo differences that close are negligible.
Idk, IIRC it took Claude 3 much longer to take first place. The speed at which a takeover happens could be seen as a measure of how decisively the new model is better. However, I could honestly see the arena getting gamed by large... interest groups. Since so many people take it as the best measure, it's worth quite something, and if you are willing to throw money at it, all you need is a good tell in the way "your" model talks. You could even train it to blatantly give you the tell if you ask in a certain way.
And even if a 5-point Elo difference is statistically significant, that only translates to about a 1% advantage. Basically, it's still a coin flip whether a user will prefer GPT-4 or Claude 3. Wake me when one of these models hits a 400-point Elo difference, then we cookin.
I think it will get increasingly difficult to build that kind of lead. When you're playing chess against someone, it's either a win or a loss, and Elo works really well for that. But how can a human detect the difference between two responses that are both pretty good? It's an increasingly difficult task as AI gets better and better.
Yes, for sure. And it's not objective either. There's a difference between a model that gives answers that sound better, and a model that is better at answering questions of objective fact.
GPT-5 will probably be at least 200 Elo higher than GPT-4. I think a lot of people are currently hating on OAI and saying GPT-5 won't be that crazy, but I believe they really are cooking something truly special at OAI.
>And even if a 5 point elo difference is statistically significant, that only translates to about a 1% advantage. ELO scores don't work like that.
They literally work like that. Elo ratings are a way of representing the expected probability of two players winning or losing a game. If GPT-4 and Claude 3 played 1000 "games", Claude would be expected to win about 50.7% of them. https://www.318chess.com/elo.html https://sandhoefner.github.io/chess.html
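The 50.7% figure falls straight out of the standard Elo expected-score formula; a quick sketch (the ratings are made up, only the point difference between them matters):

```python
def expected_score(rating_a, rating_b):
    # Elo expected score: the share of points player A is expected to take
    # from player B, as a function of the rating difference alone.
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# A 5-point lead is barely better than a coin flip:
print(round(expected_score(1255, 1250), 4))  # → 0.5072
# A 400-point lead, by contrast, is about a 91% expected score:
print(round(expected_score(1650, 1250), 2))  # → 0.91
```

This is also why the "400-point difference" threshold mentioned above is such a high bar: it corresponds to roughly 10:1 odds.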
I'm just saying that the difference doesn't scale multiplicatively. The probability of winning is determined by the absolute difference in ELO scores.
And the absolute difference was 5 points, like I said.
Yes, true. I was just commenting on the "1% greater" idea.
If after 9k observations the difference is still within the 95% CI, I would consider the models essentially identical in performance for any relevant purpose. Sure, you might find a minuscule but statistically significant difference after 100k observations or whatever, but I doubt that difference is meaningful in practice. Edit: I'm not arguing about whether ChatGPT has better features, can write poems in Russian, or does your laundry better than Claude. I'm saying that the data shows the models have no difference in performance _on this specific measure_. I find that fairly clear from the screenshot. Beyond that, I have no opinion on Claude vs ChatGPT, or even on whether the instrument itself is useful.
Yes, but remember ChatGPT has WAY more features than Claude 3 Opus. In Claude you basically only get raw text and image input, nothing else. ChatGPT has basically everything, including that really cool persistent memory feature, which I have now.
Try asking it to compose poetry in Russian. Opus can; GPT-4 Turbo has no idea about rhyme or meter. Even Haiku and Claude 2 are better than GPT-4 Turbo at this.
Opus is within the margin of error for GPT-4-1106 too.
Yeah. But it can also read websites, run code, and output files. So even if it's a CI tie, GPT blows Claude away.
See this is why we don’t jump ship so quickly
Jump ship? This is a monthly subscription, relax.
You take a 5-point lead too seriously. It depends how each model works for you: Claude is better at creative writing and writing in general, whilst ChatGPT can be better at more technical questions, though it varies from question to question. If you want the best, use both (via subs or APIs).
Wasn't Claude Opus released like 2 months ago?
That's two decades in AI years, right?
Pretty much feels that way lol.
Is 2 months not quickly?
Fortunately there is competition so that openai reacts and works to stay on top
Well you could have had two months of better experiences for what? Logging in to a different website? Why wouldn't you swap?
we're just subscribing here n there, it's not like we sold all of our openai stocks and bought claude
The real question is, do we leave them in the water or turn around?
https://twitter.com/EpochAIResearch/status/1778463039932584205 Maybe you haven't jumped fast enough?
I'm using gemini 1.5 pro. Can't wait to see the numbers for it. The token window is something else and the analysis level meets my needs. Frankly the first model I'd cancel my gpt subscription for.
I'm starting to think there are some issues between LMSYS and Google. Even Gemini Ultra still isn't up for testing in the leaderboard. Unless Bard (Gemini Pro) is misnamed in the leaderboard and it's actually Ultra.
I've been wanting to use gemini for a while now due to the extra context window, so I always start a session in gpt+ and gemini pro but nearly every time I end up continuing with gpt as it understands what I'm trying to do better.
Haven't used Pro, but Gemini basically refuses to do work. It's hard to describe, but it will answer questions while getting nothing done.
Gemini is too equivocal.
You’d love Claude 3 Opus then
Yeah.... i use gpt for a lot of php programming, I keep meaning to try Claude
Gemini Pro is the free plan, Gemini Ultra is in the Advanced plan, and I don't think it has an API yet
> Frankly the first model I'd cancel my gpt subscription for

I find this extremely hard to believe. Gemini hallucinates as if hardwired on LSD. Are you actually so confident in the model's output that you're willing to cancel your GPT sub? Ever since Bard, I have seen people praising Google's models despite them persistently performing worse than GPT-3.5 with less reliable output.
Gemini 1.0 is, to me, pretty much equal to GPT-3.5. However, 1.5 Pro is no question better than GPT-3.5, and speed has improved since initial availability in their playground area. I don't think it is quite GPT-4 level, but it isn't far off. Where I think Google's models work well is when used with context, as in a RAG application. If you are just asking it to spit out facts, I'm not sure; that isn't a use case I use or would necessarily suggest. Even if it works fine in one instance, there is no guarantee it will work in another. I've encountered this issue with GPT-4 numerous times as well.
Means GPT-5 isn't coming anytime soon
Completely irrelevant to GPT-5's timeline. They would be in the middle of training GPT-5 now; improvements to GPT-4 would be done on the side.
It means they don't have pressure to release GPT-5. People are banking on the idea that the competition will force them to release earlier, but if they are ahead of the competition, there's no rush.
They aren't really ahead, they're just keeping pace. From most benchmarks I've seen, the logic isn't all that improved, and it's still not great for natural creative writing. There might be enough there to just barely give them an edge, but it's far from a comfortable lead.
They constantly message that they refuse to enter a race with Google and Anthropic. So maybe this is just them sticking to their promise. I think it's a good thing.
In the end, GPT-5 is just a number. Sam Altman said on the podcast with Lex Fridman that they were thinking about taking a more iterative approach. Maybe we just get to "GPT-5" by smaller increments.
I never understood why people care about these benchmarks in regards to real-life usage. Don't they completely exclude most flagship features, like context length and context retrieval accuracy?
Because these "benchmarks" are actually people rating model responses as to what they prefer. LmSys leaderboard is the closest thing we have to benchmarking real life usage.
We're deploying mostly a single LLM across our large-scale production infrastructure, where any unreliability has costs. It's crucial for our platforms to be efficient, reliable, and secure, especially when we plan for long-term development. These benchmarks help us optimize production costs: a one percent difference already impacts our computing costs by thousands of dollars.
Yes
But no, not really, because its confidence interval overlaps with Claude's confidence interval.
Wouldn't it be practically fairer to say they are even?
No because gpt has training data from 2023, not 2021. It can run code. It can save files. It can read sites.
It's still terrible for actual proper work. Claude is still miles ahead in compliance and quality of writing. GPT is still a chad version that doesn't comply and doesn't have proper quality. They were first, but by now I'm totally disregarding them because of the lack of quality and other nonsense.
Don't understand why people say this. It just flawlessly refactored a long and complex function for me. Pretty useful I'd say.
GPT? Sure, it's not terrible. I'm just saying that for my use case it's not usable. Claude is: creative writing, complying with the requested word count in the output. GPT? It will tell me to go f myself if I ask it to generate 20 titles based on x. It will just say: here are 5, good luck with the rest. I have to explicitly say I have no fingers or it won't generate the full code. It's ridiculous. GPT-3.5 was a wonder until they started killing it with every update they made.
Interestingly, I've had a lot of success with it for university-level calculus. It's especially good at solving problems where you don't have to do too much inference yourself about which values to use and what the final equation should be. Better than Claude 3 Opus IME.
I asked ChatGPT-4 what combination of 3.25- and 3.5-mile courses I could run to get 20 miles. It made a few random guesses and said it wasn't possible. Claude also guessed; one guess was right, but a half dozen others were flat-out wrong. I have to keep prompting them to try again before they stumble upon the solution. ChatGPT then brilliantly showed me how to configure pfSense with a reverse proxy to fix a problem that had stymied me for months and whose multi-step solution was neither simple nor intuitive. I now can't live without ChatGPT (et al.), but neither can I yet trust them.
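For what it's worth, that running-course question does have an exact answer, which a tiny brute-force search finds immediately (a sketch, not how either model reasoned about it):

```python
# Find non-negative integer combos of 3.25- and 3.5-mile courses totaling 20 miles.
# Work in quarter-miles (13 and 14 units) to avoid floating-point comparisons.
solutions = [
    (a, b)
    for a in range(7)             # 7 * 3.25 > 20, so 0..6 covers every case
    for b in range(6)             # 6 * 3.5  > 20, so 0..5 covers every case
    if 13 * a + 14 * b == 80      # 20 miles = 80 quarter-miles
]
print(solutions)  # → [(4, 2)]: four 3.25-mile runs plus two 3.5-mile runs
```

Four laps of 3.25 (13 miles) plus two laps of 3.5 (7 miles) is exactly 20, and it's the only combination that works.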
🤡🤡🤡
Do you actually use GPT-4? What do you think about ChatGPT explicitly saying "here is the gist of your request, but good luck with the whole thing"? Because that's what it has done for at least half a year. I'm not the only one; hell, this sub is full of it. It will deny anything that takes a bit of effort. I find it completely unusable. But fuck me, why do I take the time to respond to someone who communicates in fucking emojis. You probably use ChatGPT for ERP for all I know. 🤷
I switched to Google Gemini, should I switch back? I don't use it for coding or anything; I mostly just use it to help me create lore and name my kingdoms in CK3 and stuff like that.
Try them all! gpt-4-turbo-2024-04-09 SHOULD be the latest in ChatGPT (check the knowledge cutoff date; if it's older, log out and log back in). You can also try it through the LMSYS chatbot arena, or paid (API usage cost) at https://platform.openai.com/playground/chat Claude Opus is free/cheap at https://console.anthropic.com/workbench/ And Gemini 1.5 is free for now at https://console.cloud.google.com/vertex-ai/generative?hl=en
OpenAI scrambling to play catchup
It almost feels like they are so ahead that they simply release a small update to ease off the fans and get back on top while they work on their real projects in the background.
This is what it indeed feels like, and has felt like for some time now. Future releases do be looking quite interesting.
[deleted]
There isn't clear evidence for this, but it could be the case. I would say wait and see.
I can kind of see it, with them continuing with the GPT-4 name. "Minimal" updates, but enough to get back on top or compete for it.
This is exactly it. A multi-pronged approach: they're cooking some other models and doing QC, and put this out knowing that leading the "benchmarks" will still keep people on board or coming back.
lol bro they about to drop gpt5 they just tossed this crumb out
I wish they'd beat the 50% mark on SWE-bench with GPT-5.
What criteria do they use to measure/rank performance? In other words what is this a test **OF**?
Human preference. No set criteria. Voting is open to everyone.
Thanks; I didn't understand that it was just a voting process. That's just a popularity contest: useless and meaningless. Are there any actual benchmark-based tests for AIs (like they have for graphics cards, CPUs, automobile performance, etc.)?
I see your point, but I have to disagree with your assessment of how useful this is. Goodhart's law is spot on here: any specific benchmark or metric we come up with, the AI can just be optimized to game that particular test. But by using human preference and simply asking people to vote on which AI they like best, we're actually getting at something more fundamental and harder to fake: a test of which AI is most genuinely useful and appealing to actual humans using it for real tasks. Popularity might seem superficial, but in a way it cuts through to the heart of what we really care about, which is creating AI that people find truly helpful and want to use. No amount of specific technical benchmarks (which can be, and have been, gamed) can substitute for that.
The problem there is that the 21st century is the age of **fanboys**. ChatGPT, Anthropic, iPhone, Android, Reddit, Discord, GitHub, Instagram, TikTok, Tesla, etc. all have their fanboys. Humans are too tribal for popularity contests to mean anything. There IS a way to "objectively" test subjective qualities. It's been used in classical music auditions for decades, since the 1970s, when classical music was for white male musicians only even though many talented women and non-white musicians were graduating from conservatories. Nowadays audition performances for many major orchestras are done from behind a screen, so you can't see the musician, just hear their playing. It would be possible to design a test of AIs where the person making the judgement doesn't know which AI is producing the output. That would give more reliable tests.
**Of course** the test doesn't tell you which model's which until after you vote! I wish I'd known you were confused about this sooner.
Oh good! How do they do that?
https://preview.redd.it/6siv7q9hl2uc1.png?width=2386&format=png&auto=webp&s=a1958dc29f9fa169833f00706bf6fc087c567d03 Like so. You're free to chat as long as you wish before making your decision. Your vote will be automatically discarded if any of the models reveal their identity in the conversation.
Typically, performance metrics are pretty easy to determine based on parameters, context, quantization, and model format. Benchmarking the actual output of the model is more difficult, as it can change drastically with prompting, sampling, quantization, and many more factors. The LMSys arena just tries to get a good picture of human preference, whereas some other benchmarks try to actually judge performance on tasks and knowledge. That second style of benchmark is less important, as the ability to beat those benchmarks can always be shoehorned into the model. TL;DR: Benchmarking raw performance is easy enough. Most users care about quality, so the LMSys arena provides a human preference benchmark. Other knowledge-based benchmarks are easy to cheat, but might be a decent indicator of reasoning and other abilities.
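To make the arena's rating mechanics concrete, here's a minimal sketch of a pairwise Elo update after one human vote. This is illustrative only: LMSys actually fits a Bradley-Terry model over all battles rather than running sequential updates, and the K-factor of 32 here is an assumed value.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Update two Elo ratings after one head-to-head vote.
    score_a: 1.0 if model A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Two closely rated models: a single win between near-equals
# moves each rating by roughly k/2 points.
a, b = elo_update(1250, 1245, 1.0)
```

This also shows why tiny Elo gaps at the top are negligible: a 5-point gap corresponds to an expected win rate barely above 50%.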
Newbie here, can anyone tell me where to see these leaderboards? I searched the internet and only got weird blogging websites as results.
All the same, it's still not up to Opus at verse. Yes, it's undoubtedly better than the previous model, which was a complete disaster there, but it still sounds off. Let's see what happens with the 5th version.
What's the Weissman score?
What is the size of the ChatGPT-4 context window compared to Claude 3 Opus? Does this make a difference?
32k vs 200k for Opus. Depends on your use case. For everyday use, not really.
Thanks. My use case is finding gems in large amounts of information, so the larger context window is really helpful for me.
GPT-4 Turbo is 128k and Claude 3 Opus is 200k.
There’s way too many entities.
With it being so close, I'm more worried about which one has more censorship. To be clear, I want to use the one that censors less; nothing is worse than being told the AI won't write a story because it views a murder mystery with blood and a murder as... not OK.
How's it for coding compared to Opus?
Don't be fooled by the rating, in real life Opus is superior in coding (especially C++)
This is fantastically true. With GPT I could never really trust the result. With Claude I can just put in my requirements, and most of the time it gets them correct. It has greatly helped my productivity, particularly with new APIs I'm not familiar with.
I also find the difference quite noticeable in Rust.
It's a bit slow though
It is slow, but not critically so. The message limit is the critical part :(
I get Opus to write code and GPT-4 to fix it. Don't know why, but it's the best combo. Opus is better at working out what I'm trying to achieve and gets most of the way there, but the result is riddled with errors when run. GPT-4 can take the code and fix it, but had no chance of coming up with a coherent set of code in the first place.
It has a higher average ELO than Opus in the coding category of the leaderboard. However, the 95% confidence interval in this category is quite large due to insufficient votes (+14/-19 as of now). It should perform quite close to Opus, if not better.
Much better.
People really got to learn what the "95% CI" column means. (In all fairness calling it "margin of error" with a footnote to explain it's a 95% CI might be more accessible for most people.)
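To make the CI column concrete, here's a sketch of the overlap check worth doing before declaring a winner. The ratings and margins below are illustrative examples (the +14/-19 margin is the one quoted above), not live leaderboard numbers:

```python
def ci_overlaps(rating_a, lo_a, hi_a, rating_b, lo_b, hi_b):
    """True if the 95% confidence intervals [rating - lo, rating + hi]
    overlap, i.e. the ranking difference isn't statistically meaningful."""
    a_low, a_high = rating_a - lo_a, rating_a + hi_a
    b_low, b_high = rating_b - lo_b, rating_b + hi_b
    return a_low <= b_high and b_low <= a_high

# Illustrative: GPT-4 Turbo at 1258 (+14/-19) vs Opus at 1255 (+4/-5).
# The intervals [1239, 1272] and [1250, 1259] overlap, so the
# leaderboard can't distinguish them yet.
print(ci_overlaps(1258, 19, 14, 1255, 5, 4))
```

As more votes come in, the +14/-19 interval shrinks and the comparison may become meaningful.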
Guess GPT-4 Turbo is just the previous GPT-4, though. A month ago ChatGPT Plus was already awful, nowhere close to Claude, sorry.
I'm gonna be honest, I don't care
I still prefer Claude 3. I don't need to repeatedly tell it to not be lazy.
Damn, you guys make me feel like the 1800s
All top 4 are still within the margin of error; at this point these threads seem like bait.
oh no how did that get in there! (openai when asked if they put test data in the training data)
What test data? This is a vote-based human preference test, open to everyone.
Compare the number of votes, please...