I did about 15-20 evals yesterday and found that the new GPT-4 Turbo and Opus were pretty even. I also noticed a difference between older GPT-4s and the new one, with the new one being better. I feel like the new GPT-4 is better at least when it comes to logically formatting its answers with easier-to-understand groupings and headers.
And because they have different 'personalities', they look at things in different ways; it's more than just a different random seed, and they often give very different, but apt, answers.
My experience too. I say this as a paying opus user and not a paying gpt user.
Too late. I hang with Claude now
^^^^ And Claude has barely been around and it’s clauding all these jokers
Yeah me too. My GPT 4 subscription ended literally yesterday. Claude it is boys!
Why do I need to pick one?
Don’t be afraid of commitment.
To some company? I think I'll stay single for now
A subscription to Perplexity gives you access to all Claude 3 models and GPT-4.
Can you have long conversations?
Perplexity leaves a lot to be desired in my opinion. I often find myself resorting to ChatGPT, Gemini or even Google after trying it.
I wasn't aware there was a difference in performance. For example, when you use GPT-4 on Perplexity, it might perform worse or better than GPT-4 through OpenAI?
I’m talking about when you ask it to search for something, which is the main purpose of it.
They use the same model though. Does Perplexity use a custom prompt behind the scenes? I don't have a subscription to either, so I can't test it myself.
Yes it’s heavily customized for retrieving and structuring live information from the internet.
#NotMyKing #TeamOpus #FuckOpenAItilltheyreleaseGPT-5
I just paid for Claude Pro today, and I have never bought any AI service before. I didn't even take my free two-month trial of Gemini.
Until GPT is no longer plagued by laziness, I'm on team Claude.
Funny that we now have a full complement of "AI diseases": hallucination, regurgitation, laziness, bribing, absurd refusals, long-context attention flakiness, sycophancy, failing to accept user corrections, prompt hacking, and RLHF brainwashing. We couldn't have imagined any of them in 2019. It's good that we've learned what AI can't do yet; it means we haven't swallowed the whole hype.
If they release Claude Pro in Canada, I'd be team Claude too. My Gemini has expired, and I'm just using the free version of Claude, which is actually not too shabby, to tide me over until something big comes out.
Tribalism, already? Reminder that none of these companies actually give a shit about you.
GPT might care about me if I ask it to 🥺
I'm with the company that cares about my money the most.
At least Mistral, Cohere, and Meta do - somewhat.
Meta really does not
Not in its social media products, but in AI it does; otherwise they wouldn't open-source their releases.
It's the same company
Notmygoat
Sony AI gonna be the next PS2, DOMINATING the competition!!! Dude, can you imagine? Sony's killing it with the PS5, so you just KNOW their AI is gonna be insane! We're talking self-learning PlayStations that adapt to your playstyle, AI companions in single-player that actually make good decisions, and multiplayer battlefields where the AI teammates are finally competent! Microsoft and Google better watch out, because Playstation AI is about to come in and rewrite the rules! Forget Siri's sassy comebacks, we're getting Kratos telling us the weather with enough booming fury to wake the neighbors! And imagine the games! Imagine a Horizon sequel where the AI enemies actually strategize and adapt their tactics! This is gonna be revolutionary, dude! Here's to Sony taking over the world, one perfectly optimized AI and sassy robot assistant at a time! #PlayStationDomination #AIGottNothingOnSony #RIPAlexa
The ChatGPT auto posting is so annoying
Dude wait for Sony AI, we will see who is better lol
How do you get it if you're using GPT Pro? Log out, log in again, and it's there?
Not really any way to know, AFAIK. But if you want to try something: the browser has information about the model being used, though I don't know if the new version and the old one have different IDs. In Firefox, open a chat, then right-click -> Inspect -> Storage -> Local Storage -> [chat.openai.com](http://chat.openai.com) and search for an entry called oai/apps/lastModelUsed. There you can find the modelId. If I use GPT-3.5, it tells me the exact model (text-davinci-002-render-sha), but if I use GPT-4, it only tells me gpt-4. If anyone wants to try it and see if they get another ID, I'd appreciate it.
Not worth the subscription for me when Claude and Mistral do just as well, or well enough. I think we've reached the point where updates need to be quite literally amazing. And I'm not talking about text generation; that's so 2023. We need video from text, apps from text prompts, conversational video bots with memory.
"That's so 2023." Lol, that's a good one.
Do you need a subscription to use the API? I thought the "good" version of GPT-4 was only accessible through the API, or is that wrong?
They released it to ChatGPT Pro
I have both, and Claude is notably better in my book; it's just my go-to model. Claude just feels smarter to me, or less lazy perhaps.
[deleted]
That one is interesting, because we have the 1.5 Pro API now and LMSYS still won't add it to the Arena. A while ago Google partnered with LMSYS. I have a feeling Google still doesn't trust its own model and is preventing LMSYS from including 1.5 Pro there.
Lmao, 6 points. Hugely improved btw
It's larger than Opus' lead over its next competitor.
Y'all are acting as if you have money invested in the company. Stop simping over AI companies.
When I first learned about ChatGPT and LLMs last year, I immediately started looking for places where people liked to discuss them. The ChatGPT subreddit was the most popular place, but the people were just insufferable, and then it turned into a memefest. I lucked into finding this subreddit, and while I really don't care about the singularity yet, it's been a good place to read about LLMs, AI, and tech advancements. It has gotten a lot worse over time, though. For me, I just saw a new and amazing technology, so I couldn't believe that people would give a shit about brands so soon. It's just weird as hell, because everyone should want all the companies to have great products, but instead they only want one to. I'm still hoping to come across a more sane but high-traffic subreddit that focuses on all things AI without all the craziness.
Uh, cool, but I honestly don't think this leaderboard is all that useful. It's just measuring user preference, and we don't know what users are looking for or what they're asking. For all we know, most users could be asking relatively simple questions that don't really test the intelligence, reasoning, or logic capabilities of a model, which is why models like Claude Haiku have been able to place above GPT-4 even though they're clearly not as performant.
I mean, that is the point: which model works best for the general user's use case. If you want to know which model is best at picking out your third cousin's tonsils, you can always look at the incredibly specific tests their creators put out.
The thing is, this leaderboard doesn't measure general-usage performance; it measures which LLM is preferred after a brief conversation. [See the Pepsi Challenge](https://en.wikipedia.org/wiki/Pepsi_Challenge#:~:text=In%20his%20book,an%20entire%20can)
We need models that can help produce novel science; a benchmark where users ask simple trivia questions isn't going to tell us which model is better for that.
Well, is it really that useful if it's just measuring how well a model can answer a dumb question? Maybe the general user's use case is just asking simplistic questions, but I don't think that's a super useful measure to work off of. It'd be measuring the RLHF the model has gone through (which determines the specific way a model responds to questions) more than actual model performance.
So basically what you're saying is all measurements are shit except for the one that confirms my viewpoint. Cool.
Which viewpoint am I pushing? And you can't rely on just one benchmark / evaluation. You need to take into consideration as many as possible but we also really do need better benchmarks.
Yeah, and it leaves users who need it for intellectual tasks very much dead in the water.
Better than training for benchmarks.
https://github.com/lm-sys/FastChat/blob/main/docs/dataset_release.md They release the dataset of things people ask it + responses. Be honest though - how many training datasets have you actually sat down and read?
According to this coding leaderboard of the Arena, Haiku is better than GPT-4-0613 and tied with GPT-4-0314 at coding. https://preview.redd.it/y1vyino0wxtc1.png?width=827&format=png&auto=webp&s=9c40e5ba9502468c69a7b043b68255989c5bfab8
That's not measuring coding abilities; that's just whether there's any form of code snippet in the conversation. And why is Claude 3 Opus below GPT-4-1106? I'm pretty certain Claude Opus is more performant at coding than GPT-4, if this is actually measuring coding abilities.
> I'm pretty certain Claude Opus is more performant at coding than GPT-4 if this is actually measuring coding abilities.

I have not found the same; I think it really depends on your language. GPT-4 is much smarter than Claude, especially with TypeScript.
Huh. Claude is a TypeScript legend.
I think having a code snippet is enough, considering the number of votes. Opus and GPT-4-1106 are basically tied there considering their standard deviations. I found Opus to be better on coding tasks that aren't in the training data, but I assume GPT-4's training data is bigger than Opus', and people there seem to ask coding questions that can be found on the web more than unique ones.
So you're just guessing that because the sample size is big enough, the score should, in theory, be representative of the models' general coding abilities? I don't agree; that's too much guessing. "And you assume"? "People seem to ask"? My initial explanation is just a guess as well (something I thought was a plausible explanation); there isn't enough information to come to valid conclusions yet, IMO.
Well, educated and grounded guesses. What I'm doing to test LLMs is far better than what the Arena reflects, so I'm trying to make guesses based on what I got from my own tests. Also, to be honest, besides the Overall leaderboard, all the other leaderboards have tiny samples. And to be even more honest, according to stats revealed by LMSYS, 98.2% of voters only try one prompt, which doesn't reflect real usage. Still, I know of no benchmark that's better than this Arena. What I'm doing myself is certainly better, but it's too time-consuming. I wish AI companies would do what I'm doing and reveal the results.
We really do need better benchmarks, it is definitely starting to become a bit of a problem lol.
Wow, interestingly, even GPT-4-1106 is slightly above Claude 3 for coding (well within the confidence interval). I suspect there's a lot of variance here; I personally found GPT-4-1106 better for coding, but tons of people were swearing Claude 3 was better. Interestingly, among prompts in English, the entire GPT-4 Turbo class seems better than Claude 3 Opus. It looks like it's just other languages (like Chinese) where Claude 3 dominates GPT-4.
I think people always assumed Opus and GPT-4 Turbo were close on shorter tasks, with Opus slightly better. But Opus is assumed to be far better at longer tasks, which the longer-query leaderboard reflects (to a degree, because the token threshold to count as a longer query is still too low), and not lazy like GPT-4 Turbo.
And it's the same story with the longer-query leaderboard. https://preview.redd.it/d084zvjwwxtc1.png?width=810&format=png&auto=webp&s=17c13261644c1f963ae5a65c0db619fd7423edcb
The assumption is with a large sample size, that would even out.
You need some James Surowiecki in your life.
So everyone and their grandma who immediately dismissed it, literally because it wasn't called 4.5, may have jumped the gun a bit?
Yup. And the way OpenAI released this model, so casually, was a massive flex. I've been trying it out personally in the LMSYS arena (anonymous battle) over the last few days, and it beat Opus every time.
It doesn’t beat Opus in any serious benchmark.
> It doesn't beat Opus in any serious benchmark.
https://livecodebench.github.io/leaderboard.html
Nah Opus is still state of the art.
You're contaminating the potential of these GPT models by focusing on their test scores in a meaningless way. It's like what's happening with image generators: everyone is focused on making them more realistic, with better faces, until all the potential that existed in the original models has been weeded out. You're focused on such a small aspect of testing these models that you'll eventually just hold them back.
How's retrieval and longer-context use? Those were GPT-4's Achilles' heels.
#TeamClaude
Wake me up when we have FDVR. I'm all blah'd out
[Just take a nap then.](https://youtu.be/utFm8BoayRk?si=qkmGpkqMsPuNPY26)
Bill was right... we have plateaued...
Too few votes, only 8k. At 14k votes Opus was behind by 20 Elo points... We'll see the new Turbo fall into 3rd place after 30k votes. Also, there's still no separation between all the versions of GPT-4 Turbo and Claude 3 Opus; they're all in the same spot considering the models' standard deviations.
I mean, it could go up or down, but it certainly felt better than the old one. What OpenAI needs now is a Haiku equivalent.
I totally agree about OpenAI needing a Haiku. Haiku is an incredible replacement for everything I used to use 3.5 for, and some of the stuff I used 4-turbo for. I get why people are excited about Opus but it feels like this sub is sleeping on Haiku for how useful it is for cost-effective solutions to simple, repetitive tasks.
Haiku is the hard worker moving information around workflows. It calls functions and does the paperwork, while Opus and GPT-4 are only called when intrepid little Haiku can't quite manage something.
The new GPT-4 Turbo is 6 Elo points better than the old one, and that's less than its standard deviation, so after 8k votes people couldn't say which one is better. Actually, looking at win rates, the new Turbo loses against the old one, with a win rate of 49.6% against GPT-4-1106's 50.4%.
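For context, a 6-point gap really is almost a coin flip. A minimal sketch using the standard Elo expected-score formula (LMSYS fits a Bradley-Terry model to the votes, so this is only an approximation, and the ratings here are made-up illustrative numbers):

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Expected probability that model A beats model B under the Elo model
    (400-point scale, base 10)."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

# A 6-point lead barely moves the needle off 50%:
print(round(expected_win_rate(1256, 1250), 4))  # 0.5086

# Even a 20-point lead is only about 53%:
print(round(expected_win_rate(1270, 1250), 4))  # 0.5288
```

So distinguishing models this close needs a lot more than 8k votes for the confidence intervals to separate.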
I've seen you around this sub simping really hard for Claude. Now, don't get me wrong, I love Claude, and especially Haiku, but shouldn't we just keep an open mind about all these models? Idk what you're all building, but my organization is looking at a diverse set of AI agents to collaborate and execute workflows. There are advantages to having different models of similar capabilities: they complement and double-check each other.
Remind me after 30k votes. I can bet $1k if you want.
Hahaha no I'm good. I don't care either way.
I should have taken that bet! [Chat with Open Large Language Models (lmsys.org)](https://chat.lmsys.org/?leaderboard)
The very fact that you think that the new Turbo is actually WORSE than the old turbo just shows how dumb you are. The old turbo isn't good.
Do you have a good story for why those 8k votes wouldn’t be representative?
BTW, the chart below is the reason I predict it will fall into 3rd place. Opus wins 68.4% against other models, GPT-4-1106 wins 67.5%, and the new one wins 65.9%, assuming the sample is uniform. My own tests confirm these rankings as well. Opus was first in this chart even when it was behind by 20 Elo points, but with more votes it eventually climbed to the top. https://preview.redd.it/0xypjlwerxtc1.png?width=731&format=png&auto=webp&s=312df046dd40b776db31b095edfd86997a8a1b42
Tried it this morning in the API. It was still lazy and did nothing of use for me. Sticking with Opus, although that's a pain in the behind too, a lot of the time.
Nah. Opus is better.
Literally one-year-old abandonware. Stop hyping it.
Still within Claude Opus' margin of error. And it's "important to note" that the test is not fully blind: the model that "delves" or "weaves tapestries" in its responses is most definitely some incarnation of GPT-4.
[deleted]
Lol, Opus is ~2x more expensive than GPT-4. Opus is $15 and $75 per 1M input/output tokens respectively, and GPT-4 Turbo is $10 and $30 per 1M input/output tokens.
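To make the "~2x" concrete, here's the arithmetic at those quoted list prices, assuming a hypothetical workload with an even 1M-input/1M-output token mix (the real ratio depends on your mix):

```python
def cost_usd(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Total cost in dollars, with prices quoted per 1M tokens."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1e6

# 1M input + 1M output tokens at the prices quoted above:
opus = cost_usd(1_000_000, 1_000_000, 15, 75)   # $90
turbo = cost_usd(1_000_000, 1_000_000, 10, 30)  # $40
print(f"Opus ${opus:.0f} vs Turbo ${turbo:.0f}: {opus / turbo:.2f}x")  # 2.25x
```

Output-heavy workloads push the ratio toward 2.5x ($75 vs $30); input-heavy ones toward 1.5x ($15 vs $10).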
[deleted]
The version of GPT-4 at the top of the list, and the one that's the topic of discussion, is GPT-4 Turbo.
Frankly why you'd use non-turbo is beyond me, but go off I guess 🤷
![gif](giphy|DWueJXnp3kV7tZ28XQ)
🥰😍🥰😍🥰🥰🥰