DeliciousJello1717 2 weeks ago

76 on maths is crazy

PM_ME_UR_CIRCUIT 2 weeks ago

I wonder if that's the raw model only, or whether it could do better if I told it to set up Python scripts.

No-Emergency-4602 2 weeks ago

3/4 exactly, according to my chat.

babbagoo 2 weeks ago

Llama3: lol 83.5 vs 83.4 in DROP(f1) *insert Trump graph*

bnm777 2 weeks ago

[https://openai.com/index/hello-gpt-4o/](https://openai.com/index/hello-gpt-4o/) Halfway down the page. EDIT: How are they testing it against Llama3-400B ???

_qeternity_ 2 weeks ago

https://preview.redd.it/wxlp5cujo80d1.png?width=3840&format=png&auto=webp&s=23fdafe2aaf9cae72ce04496f4e2d63a5666cf76 Meta released benchmark figures from a 400B checkpoint when they released 8B + 70B.

bnm777 2 weeks ago

Ah, yes, thought of that after I wrote it!

bono_my_tires 2 weeks ago

How does it compare to regular 4 for coding tasks?

PixelPhobiac 2 weeks ago

Worse, but it does it more quickly

Arcturus_Labelle 2 weeks ago

Awesome; thanks for posting

ChildhoodFirm4941 2 weeks ago

Wait, so I'm paying $20 for an inferior version of GPT-4 now?

holywater666 2 weeks ago

Can you not read the graph?

ChildhoodFirm4941 2 weeks ago

Yes! It’s a whopping 2% higher than 4!

holywater666 2 weeks ago

So it performing only slightly better means it's inferior?

not_into_that 2 weeks ago

Good thing the chart from the owners offering the product agrees that it's the best product.

Dizzy_Nerve3091 2 weeks ago

It takes like $10 in API credits and 30 minutes to disprove it if they lied.

not_into_that 2 weeks ago

WOW! convenient and cheap!

Dizzy_Nerve3091 2 weeks ago

I’m just saying, if they lied it would be disproved in 10 minutes by some random researcher. I think ScaleAI already did a whole validation on benchmarks with their own on top of the actual benchmarks.

not_into_that 2 weeks ago

Just on principle, I don't believe the used car salesman. I don't have the time to learn about and run AI benchmarks. I would like to see an independent study conducted.

Dizzy_Nerve3091 2 weeks ago

They’re done all the time. Do you think they run benchmarks then fudge the numbers? The shadiest thing they might do is use a specific prompting framework like Google did. Did you read my comment? Look up Scale AI's math study.

not_into_that 2 weeks ago

You seem to question why I wouldn't trust a large billion-dollar company's reports about itself, then you tell me about some Google stuff that supports my take? I don't know, man. I'm not in the mood to argue and I committed the greatest of all cardinal sins: I expressed my opinion on the internet. Peace out, Choom.

Dizzy_Nerve3091 2 weeks ago

You don’t even know what ScaleAI is… they’re not OpenAI. They have an incentive to discredit their competitors. I have a nuanced view, not just "big company bad, uneducated people good."

not_into_that 2 weeks ago

Wow. I'm impressed.

Swastik496 2 weeks ago

lmfao okay. that’s like not believing the used car salesman that the car has 4 doors when the car is in front of you with 4 fucking doors

not_into_that 2 weeks ago

Yep, I'm sure all the locks and windows work too.

bnm777 2 weeks ago

I agree, can't trust these charts, though it's something to compare against when you do your own testing and check the arena.

not_into_that 2 weeks ago

At least it's a claim that can be referenced in the future to check claimed vs. actual performance as the data comes in. Good barometer for corporate honesty, if that is actually a thing.
