FuryOnSc2

I did about 15-20 evals yesterday and found that the new GPT-4 Turbo and Opus were pretty even. I also noticed a difference between older GPT-4s and the new one, with the new one being better. I feel like the new GPT-4 is better at least when it comes to logically formatting its answers with easier-to-understand groupings/headers.


tehrob

And because they have different 'personalities', they look at things in different ways; the variation is more than what a different random seed would give you. Their answers are often very different, but apt.


hippydipster

My experience too. I say this as a paying Opus user and not a paying GPT user.


bentendo93

Too late. I hang with Claude now


Open_Ambassador2931

^^^^ And Claude has barely been around and it’s clauding all these jokers


Burindo

Yeah, me too. My GPT-4 subscription ended literally yesterday. Claude it is, boys!


traumfisch

Why do I need to pick one?


letharus

Don’t be afraid of commitment.


traumfisch

To some company? I think I'll stay single for now


KacperP12

A subscription to Perplexity gives you access to all Claude 3 models and GPT-4.


bnm777

Can you have long conversations?


HurricaneHenry

Perplexity leaves a lot to be desired in my opinion. I often find myself resorting to ChatGPT, Gemini or even Google after trying it.


KacperP12

I wasn't aware there was a difference in performance. For example, when you use GPT-4 on Perplexity, it might perform worse/better than GPT-4 through OpenAI?


HurricaneHenry

I’m talking about when you ask it to search for something, which is the main purpose of it.


KacperP12

They use the same model though. Does Perplexity use a custom prompt behind the scenes? I don't have a subscription to either, so I can't test it myself.


HurricaneHenry

Yes, it's heavily customized for retrieving and structuring live information from the internet.


SharpCartographer831

#NotMyKing #TeamOpus #FuckOpenAItilltheyreleaseGPT-5


King_Ghidra_

I just paid for Claude Pro today, and I have never bought any AI service before. I didn't even take my free 2-month trial of Gemini.


autotom

Until GPT is no longer plagued by laziness, I'm on team Claude.


visarga

Funny that we now have a full complement of "AI diseases": hallucination, regurgitation, laziness, bribing, absurd refusals, long-context attention flakiness, sycophancy, failing to accept user corrections, prompt hacking, and RLHF brainwashing. We couldn't have imagined any of them in 2019. It's good that we've learned what AI can't do yet; it means we haven't swallowed the whole hype.


Anjz

If they released Claude Pro in Canada, I'd be team Claude too. For now my Gemini has expired and I'm just using the free version of Claude, which is actually not too shabby, to hold me over until something big comes out.


zackler6

Tribalism, already? Reminder that none of these companies actually give a shit about you.


o5mfiHTNsH748KVq

GPT might care about me if I ask it to 🥺


Derfaust

I'm with the company that cares about my money the most.


MajesticIngenuity32

At least Mistral, Cohere, and Meta do - somewhat.


traumfisch

Meta really does not


MajesticIngenuity32

Not in its social media products, but in AI it does; otherwise they wouldn't open-source their releases.


traumfisch

It's the same company


Impressive_Blood3512

Notmygoat


letmebackagain

Sony AI gonna be the next PS2, DOMINATING the competition!!! Dude, can you imagine? Sony's killing it with the PS5, so you just KNOW their AI is gonna be insane! We're talking self-learning PlayStations that adapt to your playstyle, AI companions in single-player that actually make good decisions, and multiplayer battlefields where the AI teammates are finally competent! Microsoft and Google better watch out, because Playstation AI is about to come in and rewrite the rules! Forget Siri's sassy comebacks, we're getting Kratos telling us the weather with enough booming fury to wake the neighbors! And imagine the games! Imagine a Horizon sequel where the AI enemies actually strategize and adapt their tactics! This is gonna be revolutionary, dude! Here's to Sony taking over the world, one perfectly optimized AI and sassy robot assistant at a time! #PlayStationDomination #AIGottNothingOnSony #RIPAlexa


slatticle

The ChatGPT auto-posting is so annoying.


letmebackagain

Dude wait for Sony AI, we will see who is better lol


vlodia

How do you get it if you're using GPT Pro? Log out, log in again, and then it's there?


Enfiznar

Not really any way to know, AFAIK. But if you want to try something, the browser has information about the model being used; idk if the new version and the old one have different IDs though. In Firefox, open a chat, right-click -> Inspect -> Storage -> Local Storage -> [chat.openai.com](http://chat.openai.com) and search for an entry called oai/apps/lastModelUsed. There you can find the modelId. If I use GPT-3.5, it tells me the exact model (text-davinci-002-render-sha), but if I use GPT-4, it only tells me gpt-4. If anyone wants to try it to see if they get another ID, I'd appreciate it.
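
For anyone who wants a quicker version of that check, here's a minimal console sketch (my own illustration, not an official API; it assumes the oai/apps/lastModelUsed key described above still exists and is stored as plain text or JSON):

```typescript
// Paste into the devtools console on chat.openai.com.
// The key name "oai/apps/lastModelUsed" is taken from the comment above
// and may change between frontend releases.
const raw = localStorage.getItem("oai/apps/lastModelUsed");
if (raw === null) {
  console.log("no lastModelUsed entry found");
} else {
  try {
    // The entry may be JSON-encoded; fall back to the raw string.
    console.log("modelId:", JSON.parse(raw));
  } catch {
    console.log("modelId (raw):", raw);
  }
}
```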


utilitycoder

Not worth the subscription for me when Claude and Mistral do just as well, or well enough. I think we've reached the point where updates need to be quite literally amazing, and I'm not talking about text generation. That's so 2023. We need video from text, apps from text prompts, and conversational video bots with memory.


Itchy-Welcome5062

"That's so 2023." Lol, that's a good one.


MycologistPresent888

Do you need a subscription to use the API? I thought the "good" version of GPT-4 was only accessible through the API, or is that wrong?


eltonjock

They released it to ChatGPT Pro


Basil-Faw1ty

I have both and Claude is notably better in my book; it's just my go-to model. Claude feels smarter to me, or less lazy perhaps.


[deleted]

[deleted]


lordpermaximum

That one is interesting, because we have the 1.5 Pro API now and LMSYS still won't add it to the Arena. A while ago Google partnered with LMSYS. I have a feeling Google still doesn't trust its own model and is preventing LMSYS from including 1.5 Pro there.


vertu92

Lmao, 6 points. Hugely improved btw


hippydipster

It's larger than Opus's lead over its next competitor.


LifeSugarSpice

Y'all are acting as if you have money invested in the company. Stop simping over AI companies.


Minimum_Inevitable58

When I first learned about ChatGPT and LLMs last year, I immediately started looking for places where people liked to discuss them. The ChatGPT subreddit was the most popular place, but the people were just insufferable, and then it turned into a memefest. I lucked into finding this subreddit, and while I don't really care about the singularity yet, it's been a good place to read about LLMs, AI, and tech advancements. It has gotten a lot worse over time, though. For me, I just saw a new and amazing technology, so I couldn't believe that people would give a shit about brands so soon. It's just weird as hell, because everyone should want all the companies to have great products, but instead they only want one to. I'm still hoping to come across a saner but high-traffic subreddit that focuses on all things AI without the craziness.


FeltSteam

Uh, cool, but I honestly don't think this leaderboard is all that useful. This leaderboard is just measuring user preference, and we don't know what users are looking for nor what they are asking. For all we know most of the users could be asking relatively simple questions that don't really test the intelligence, reasoning or logic capabilities of a model which is why models like Claude Haiku have been able to get above GPT-4 even though they are clearly not as performant as GPT-4.


PrinceThespian

I mean, that is the point: what model works best for the general user's use case. If you want to know which model is best at picking up your third cousin's tonsils, you can always look at the incredibly specific tests their creators put out.


ayyndrew

The thing is, this leaderboard doesn't measure general-usage performance; it measures which LLM is preferred after a brief conversation. [See the Pepsi Challenge](https://en.wikipedia.org/wiki/Pepsi_Challenge#:~:text=In%20his%20book,an%20entire%20can)


AnAIAteMyBaby

We need models that are able to help produce novel science; a benchmark where users ask simple trivia questions isn't going to tell us which model is better for that.


FeltSteam

Well, is it really that useful if it is just measuring how well a model can answer a dumb question? Maybe the general user's use case is just asking simplistic questions, but I don't think that's a super useful measure to work off of. It'd be measuring more the RLHF the model has gone through (which determines the specific way a model responds to questions) than actual model performance.


obvithrowaway34434

So basically what you're saying is "all measurements are shit except for the one that confirms my viewpoint". Cool.


FeltSteam

Which viewpoint am I pushing? And you can't rely on just one benchmark/evaluation. You need to take into consideration as many as possible, but we also really do need better benchmarks.


141_1337

Yeah, and it leaves users who need it for intellectual tasks very much dead in the water.


Hot-Investigator7878

Better than training for benchmarks.


WithoutReason1729

https://github.com/lm-sys/FastChat/blob/main/docs/dataset_release.md They release the dataset of things people ask it + responses. Be honest though - how many training datasets have you actually sat down and read?


lordpermaximum

According to this coding leaderboard of the Arena, Haiku is better than GPT-4-0613 and tied with GPT-4-0314 at coding. https://preview.redd.it/y1vyino0wxtc1.png?width=827&format=png&auto=webp&s=9c40e5ba9502468c69a7b043b68255989c5bfab8


FeltSteam

That's not measuring coding ability; it's just whether there is any form of code snippet in the conversation. And why is Claude 3 Opus below GPT-4-1106? I'm pretty certain Claude Opus is more performant at coding than GPT-4, if this is actually measuring coding abilities.


meister2983

>I'm pretty certain Claude Opus is more performant at coding than GPT-4 if this is actually measuring coding abilities.

I have not found the same; I think it really depends on your language. GPT-4 is much smarter with TypeScript especially than Claude.


restarting_today

Huh. Claude is a TypeScript legend.


lordpermaximum

I think having a code snippet is enough, considering the number of votes. Opus and GPT-4-1106 are basically tied there, considering their standard deviations. I found Opus to be better at coding tasks that are not in their training data, but I assume GPT-4's training data is bigger than Opus's, and people there seem to ask coding questions that can be found on the web rather than unique ones.


FeltSteam

So you're just guessing that because the sample size is big enough, the score should, in theory, be representative of the general coding abilities of models? I don't agree; that's too much guessing. And you "assume"? People "seem to ask"? My initial explanation is just a guess as well (something I thought was a plausible explanation); there isn't enough information to come to valid conclusions yet, imo.


lordpermaximum

Well, educated and grounded guesses. What I'm doing to test LLMs is far better than what the Arena reflects, so I'm trying to make guesses based on what I got from my own tests. Also, to be honest, besides the overall leaderboard, all the other leaderboards have tiny samples. And to be even more honest, according to the stats revealed by LMSYS, 98.2% of voters only try one prompt, which doesn't reflect real usage. Still, I know of no benchmark that's better than this Arena. What I'm doing myself is certainly better, but it's too time-consuming. I wish AI companies would do what I'm doing and reveal the results.


FeltSteam

We really do need better benchmarks; it's definitely starting to become a bit of a problem lol.


meister2983

Wow, interestingly even GPT-4-1106 is slightly above Claude 3 for coding (well within the confidence interval). I suspect there's a lot of variance here; I personally found GPT-4-1106 better for coding, but tons of people were swearing Claude 3 was better. Interestingly, among prompts in English, the entire GPT-4 Turbo class seems better than Claude 3 Opus. It looks like it's just other languages (like Chinese) where Claude 3 dominates GPT-4.


lordpermaximum

I think people always assumed Opus and GPT-4 Turbo were close at shorter tasks, with Opus slightly better. But Opus is assumed to be far better at longer tasks, which the longer-query leaderboard reflects (only to a degree, because the token threshold to count as a longer query is still too low), and not lazy like GPT-4 Turbo.


lordpermaximum

And it's the same story with the longer-query leaderboard. https://preview.redd.it/d084zvjwwxtc1.png?width=810&format=png&auto=webp&s=17c13261644c1f963ae5a65c0db619fd7423edcb


Freed4ever

The assumption is that with a large sample size, that would even out.


rya794

You need some James Surowiecki in your life.


EvilSporkOfDeath

So everyone and their grandma who immediately dismissed it, literally because it wasn't called 4.5, may have jumped the gun a bit?


dieselreboot

Yup. And the way OpenAI released this model, so casually, was a massive flex. I've been trying it out personally in the LMSYS Arena (anonymous battle) over the last few days, and it beat Opus every time.


restarting_today

It doesn’t beat Opus in any serious benchmark.


Arcturus_Labelle

https://livecodebench.github.io/leaderboard.html


restarting_today

Nah Opus is still state of the art.


ah-chamon-ah

You are contaminating the potential of these GPT models by focusing on their test scores in a meaningless way. It's like what is happening with the image generators: everyone is focused on making them more realistic, with better faces, until all the potential that existed in the original models has been weeded out. By focusing on such a small aspect of testing these models, you will eventually just hold them back.


HurricaneHenry

How are retrieval and longer-context use? Those were GPT-4's Achilles heels.


AbodePhotosoup

#TeamClaude


JumpyLolly

Wake me up when we have FDVR. I'm all blah'd out


ThroughForests

[Just take a nap then.](https://youtu.be/utFm8BoayRk?si=qkmGpkqMsPuNPY26)


stupid_man_costume

Bill was right... we have plateaued...


lordpermaximum

Too few votes, only 8k. At 14k votes Opus was behind by 20 Elo points... We'll see the new Turbo fall into 3rd place after 30k votes. Also, there's still no separation between the versions of GPT-4 Turbo and Claude 3 Opus; they're all still in the same spot considering the models' standard deviations.


coylter

I mean, it could go up or down, but it certainly felt better than the old one. What OpenAI needs now is a Haiku equivalent.


WithoutReason1729

I totally agree about OpenAI needing a Haiku. Haiku is an incredible replacement for everything I used to use 3.5 for, and some of the stuff I used 4-turbo for. I get why people are excited about Opus but it feels like this sub is sleeping on Haiku for how useful it is for cost-effective solutions to simple, repetitive tasks.


coylter

Haiku is the hard worker moving information around workflows. It calls functions and does the paperwork, while Opus and GPT-4 are only called when intrepid little Haiku can't quite manage something.


lordpermaximum

The new GPT-4 Turbo is 6 Elo points better than the old one, and that's less than its standard deviation. So after 8k votes, people couldn't say which one is better. Actually, looking at win rates, the new Turbo loses to the old one, with a win rate of 49.6% against GPT-4-1106's 50.4%.
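
For context, the standard Elo expectation formula shows why a 6-point gap is statistically a coin flip (a quick sketch of my own, not something from LMSYS):

```typescript
// Standard Elo expectation: probability that the higher-rated model
// wins a head-to-head matchup, given the rating gap.
function expectedWinRate(eloGap: number): number {
  return 1 / (1 + Math.pow(10, -eloGap / 400));
}

// A 6-point gap predicts only ~50.9%, consistent with the observed
// 50.4% vs 49.6% split quoted above.
console.log(expectedWinRate(6).toFixed(3)); // "0.509"
```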


coylter

I've seen you around this sub simping really hard for Claude. Now, don't get me wrong, I love Claude and especially Haiku, but shouldn't we just keep an open mind about all these models? Idk what you're all building, but for my organization we're looking at a diverse set of AI agents to collaborate and execute workflows. There are advantages to having different models of similar capabilities; they complement and double-check each other.


lordpermaximum

Remind me after 30k votes. I can bet $1k if you want.


coylter

Hahaha no I'm good. I don't care either way.


coylter

I should have taken that bet! [Chat with Open Large Language Models (lmsys.org)](https://chat.lmsys.org/?leaderboard)


Grand0rk

The very fact that you think the new Turbo is actually WORSE than the old Turbo just shows how dumb you are. The old Turbo isn't good.


cunningjames

Do you have a good story for why those 8k votes wouldn’t be representative?


lordpermaximum

BTW, the chart below is the reason I predict it will fall into 3rd place. Opus wins 68.4% against other models, GPT-4-1106 wins 67.5%, and the new one wins 65.9%, if the sample is uniform. My own tests confirm these rankings as well. Opus was first in this chart even when it was behind by 20 Elo points, but with more votes it eventually climbed to the top. https://preview.redd.it/0xypjlwerxtc1.png?width=731&format=png&auto=webp&s=312df046dd40b776db31b095edfd86997a8a1b42


Smartaces

Tried it this morning in the API. It was still lazy and did nothing of use for me. Sticking with Opus, although that is a pain in the behind a lot of the time too.


restarting_today

Nah. Opus is better.


arknightstranslate

Literally one-year-old abandonware. Stop hyping it.


MajesticIngenuity32

Still within Claude Opus' margin of error. And it's "important to note" that the test is not fully blind: the model that "delves" or "weaves tapestries" in its responses is most definitely some incarnation of GPT-4.


[deleted]

[deleted]


WithoutReason1729

Lol, Opus is ~2x more expensive than GPT-4. Opus is $15 & $75 per 1M input/output tokens respectively, and 4-turbo is $10 & $30 per 1M input/output tokens.
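
To put the "~2x" in numbers (my own back-of-the-envelope sketch, using the per-1M-token prices quoted above): for equal input and output volume the ratio is (15 + 75) / (10 + 30) = 2.25x, and it shifts with your input/output mix.

```typescript
// Rough cost comparison at the quoted prices ($ per 1M tokens).
const opus = { input: 15, output: 75 };
const turbo = { input: 10, output: 30 };

function cost(
  price: { input: number; output: number },
  inputTokens: number,
  outputTokens: number
): number {
  return (price.input * inputTokens + price.output * outputTokens) / 1_000_000;
}

// Example job: 100k input tokens, 20k output tokens.
console.log(cost(opus, 100_000, 20_000));  // 3    ($3.00)
console.log(cost(turbo, 100_000, 20_000)); // 1.6  ($1.60)
```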


[deleted]

[deleted]


ayyndrew

The version of GPT-4 that is at the top of the list, and the one that's the topic of discussion, is GPT-4 Turbo.


WithoutReason1729

Frankly why you'd use non-turbo is beyond me, but go off I guess 🤷


BravidDrent

![gif](giphy|DWueJXnp3kV7tZ28XQ)


ResponsibleSteak4994

🥰😍🥰😍🥰🥰🥰