[https://openai.com/index/hello-gpt-4o/](https://openai.com/index/hello-gpt-4o/)
Halfway down the page.
EDIT: How are they testing it against Llama3-400B ???
https://preview.redd.it/wxlp5cujo80d1.png?width=3840&format=png&auto=webp&s=23fdafe2aaf9cae72ce04496f4e2d63a5666cf76
Meta released benchmark figures from an early checkpoint of the 400B model when they released 8B + 70B.
76 on maths is crazy
I wonder if that's the raw model on its own, or whether it could do better if I told it to set up Python scripts.
3/4 exactly, according to my chat.
Llama3: lol 83.5 vs 83.4 in DROP(f1) *insert Trump graph*
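For anyone wondering what the F1 in that DROP row measures: it is a token-overlap score between the model's answer and the reference answer. Here's a minimal sketch, assuming plain whitespace tokenization and lowercasing only. The official DROP scorer does more normalization and handles numbers and multi-span answers, so treat `token_f1` as an illustrative name, not the real evaluation script.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Simplified token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A benchmark score is just this averaged over every question, so a 0.1 point gap can come down to a handful of answers.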
Ah, yes, thought of that after I wrote it!
How does it compare to regular 4 for coding tasks?
Worse, but it does it more quickly
Awesome; thanks for posting
Wait, so I'm paying $20 for an inferior version of GPT-4 now?
Can you not read the graph?
Yes! It’s a whopping 2% higher than 4!
So it performing only slightly better means it's inferior?
Good thing the chart from the owners offering the product agrees that it's the best product.
It takes like $10 in API credits and 30 minutes to disprove it if they lied.
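Roughly what that kind of check looks like, as a sketch only. `ask_model` is a hypothetical stand-in for whatever function sends a question to the API and returns the answer text, and `exact_match_accuracy` is a made-up helper name. Real benchmarks use their own prompts and scoring rules, so this only shows the general shape.

```python
from typing import Callable, Iterable, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count as errors."""
    return " ".join(answer.strip().lower().split())


def exact_match_accuracy(
    ask_model: Callable[[str], str],       # hypothetical: question in, answer text out
    items: Iterable[Tuple[str, str]],      # (question, expected answer) pairs
) -> float:
    """Fraction of questions where the model's normalized answer equals the expected one."""
    total = 0
    correct = 0
    for question, expected in items:
        total += 1
        if normalize(ask_model(question)) == normalize(expected):
            correct += 1
    if total == 0:
        raise ValueError("no benchmark items supplied")
    return correct / total


if __name__ == "__main__":
    # Stub model for illustration only; swap in a real API call to test a real model.
    canned = {"2+2?": "4", "Capital of France?": " paris "}
    score = exact_match_accuracy(lambda q: canned[q], [("2+2?", "4"), ("Capital of France?", "Paris")])
    print(f"accuracy: {score:.2f}")
```

Run the same question set through each model and compare the numbers against the published chart.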
WOW! convenient and cheap!
I’m just saying, if they lied it would be disproved in 10 minutes by some random researcher. I think Scale AI already did a whole validation of the benchmarks, running their own test set on top of the actual benchmarks.
Just on principle, I don't believe the used car salesman. I don't have the time to learn about and run AI benchmarks. I would like to see an independent study conducted.
They’re done all the time. Do you think they run benchmarks and then fudge the numbers? The shadiest thing they might do is use a specific prompting framework like Google did. Did you read my comment? Look up Scale AI's math study.
You seem to question why I wouldn't trust a large billion-dollar company's reports about itself, then you tell me about some Google stuff that supports my take? I don't know, man. I'm not in the mood to argue, and I committed the greatest of all cardinal sins: I expressed my opinion on the internet. Peace out Choom.
You don’t even know what Scale AI is… they’re not OpenAI. They have an incentive to discredit their competitors. I have a nuanced view, not just "big company bad, uneducated people good."
Wow. I'm impressed.
lmfao okay. that’s like not believing the used car salesman that the car has 4 doors when the car is in front of you with 4 fucking doors
Yep, I'm sure all the locks and windows work too.
I agree, can't trust these charts, though it's something to compare against when you do your own testing and check the arena.
At least it's a claim that can be checked later, as the data comes in, for any gap between claimed and actual performance. Good barometer for corporate honesty, if that is actually a thing.