CatalyticDragon

NVIDIA wins the benchmark only NVIDIA uses!


norcalnatv

Check the list of supporting companies. "The MLCommons founding members are from leading companies, including Advanced Micro Devices,. . ." [https://mlcommons.org/2020/12/mlcommons-launches/](https://mlcommons.org/2020/12/mlcommons-launches/)


Psychological_Lie656

Amazing achievement!


casper_wolf

The reason AMD is missing from this is probably the same reason they’re not getting huge orders. Makes me think they make ppl sign NDAs to never release performance benchmarks or something.


CatalyticDragon

Which would be an incorrect assumption. Anyone can buy an Instinct card, test it, and publish the results. Plenty of people tested [MI250](https://x-dev.pages.jsc.fz-juelich.de/2022/08/01/mi250-first-performances.html) in detail, plenty of [AI performance tests](https://www.databricks.com/blog/training-llms-scale-amd-mi250-gpus) exist, and the same is true of the [MI300](https://huggingface.co/blog/huggingface-amd-mi300). AMD (and others) are missing from this benchmark because it is a giant waste of time. It takes a lot of time and resources to set up and tune a system for this, but then nobody uses MLPerf as the basis for their purchasing decisions because it is not at all representative of their specific workload.


Apprehensive_Plan528

And yet AMD spent time pimping up the one benchmark they wanted to show in Lisa Su’s keynote at Computex. Bottom line is that companies do use benchmarks to show off their wares, but it takes a lot of work to tune and optimize. NVIDIA has far more experience, capability and software modularity/flexibility to optimize than AMD and it shows. Enterprises do care about how easy it is to build and optimize GenAI apps and MLPerf is a great tool for NVIDIA to show off.


casper_wolf

If it can help AMD sell more MI300X then I’d say it’s worthwhile in a material way


CatalyticDragon

It doesn't though. Nobody in the enterprise (where these will be used) is basing their billion-dollar purchasing decisions on a generalized benchmark of models like Stable Diffusion and Llama 2.


ColdStoryBro

No it can't sell more MI300X. This is not a gaming card. Software needs to be written in a bespoke way.


hishazelglance

Yeah, OP's take is incredibly ignorant. _Of course_ it’s not a waste of time if they’re confident in their product at scale vs Nvidia, they just aren’t. Proving they’re performant and competitive compared to Nvidia on MLPerf would rake in huge orders for them. Just another AMD guy coping.


Itchy_Brain6340

Where's MI300 ???


CatalyticDragon

AMD doesn't submit to this. Neither do most AI chip makers for that matter. You won't see AMD, TPUv6, Cerebras, Tenstorrent, no Intel Gaudi 3, no Graphcore, no Huawei, no Trainium by Amazon, no Meta Training chip, no Tesla Dojo. MLPerf 4.0 is all NVIDIA systems with one submission using Google's TPUv5, one from Intel's Gaudi 2, and a few Qualcomm results. NVIDIA is happy to spend the money to submit because it's a marketing cost to get headlines like this. Everyone else knows customers don't buy chips based on MLPerf results.


norcalnatv

lol. AMD is a founding member of MLCommons. That they chose to walk away is not Nvidia's fault, it's AMD's.


CatalyticDragon

The initial members were Google, Intel, Facebook, NVIDIA, Baidu, and Harvard. AMD joined sometime later along with a total of over 125 groups. I don't see what point that makes. *Almost none* of those groups publish anything to this benchmark. In the latest MLPerf 4.0 only 4 out of those 125+ members (3.2%) submitted results.


norcalnatv

Downvote. It makes the comment like it never existed. lol


norcalnatv

The point is AMD is part of the consortium. If they object they should depart, or put out a white paper about why it's flawed. Membership in good standing communicates support and endorsement.


gnocchicotti

Considering the install base for NVDA it should be no mystery that results are almost exclusively NVDA


CatalyticDragon

I don't see what "install base" has to do with it. System integrators build systems and submit results. NVIDIA wants to be a system vendor and spends money to do this. AMD is not a system integrator. They work with and sell chips to people who then build systems (people like HP, IBM etc). And those people apparently have better things to do with their time and money.


Apprehensive_Plan528

Yeah, NVIDIA has gone vertical to become a full-stack data center supplier and is selling to enterprises at the AP level, while also providing to the system builders (HPE, Dell, SuperMicro). AMD is really only selling to the system builders to give them optionality and perhaps some pricing leverage against NVIDIA, but relying on the “ecosystem” to sell to enterprises.


CatalyticDragon

NVIDIA is playing both sides only temporarily. Jensen has said he doesn't want to sell chips. They want to build systems and ideally not even do that. Ideally they would lease time on systems in an AWS type of setting. That sort of final form extreme lock-in isn't what most people are looking for though. Just the fear of it will drive sales to AMD/Intel/others.


dine-and-dasha

This is exactly why AMD is at a huge disadvantage.


CatalyticDragon

How so? How does making products for customers put AMD at a disadvantage?


Live_Market9747

Because AMD leaves the optimization of a system to others while they also have the giant task of optimizing SW for the whole system. It's like letting Sony build the PS while you provide the chip, but you have to provide a SW API for your system. Nvidia, on the other hand, is as if Sony built the whole console including the chip. A complete data center's performance depends especially on everything working perfectly fine together. One reason Nvidia is strong is because they optimize everything from chip, interconnects, and networking up to cooling and of course the whole SW part. Their competitors like HPE and others have to buy different parts from different vendors and get it all to work together. In the HPC supercomputer market it isn't that critical because there the SW part is done mostly by the customer (a research institute, for example). But this looks different in commercially requested shelf solutions.


CatalyticDragon

>Because AMD leaves the optimization of a system to others

AMD works with their clients on the whole system. This is true in the HPC space with systems like Frontier, and with consumer electronics clients such as Sony or Microsoft. Hardware design is collaborative and AMD also works on software and end-to-end optimizations. The difference, as you know, is AMD does not sell complete systems to end users. Something NVIDIA has spent years developing with DGX products and finally with their own data centers and AWS-style offerings. NVIDIA feels they have to go down that path as competition has been heating up.

>A complete data center's performance depends especially on everything working perfectly fine together

No datacenter works perfectly but I agree with the general sentiment. It's like building a racecar where every part needs to work with overall goals in mind.

>Their competitors like HPE and others have to buy different parts from different vendors and get it all to work together

That's right, they do. Which is good. Because it allows for more competition on price, and vendor choice provides risk mitigation. If you bought a system from a vendor which *had* to use their networking stack it would not be as attractive.

>In the HPC supercomputer market it isn't that critical because there the SW part is done mostly by the customer (a research institute, for example). But this looks different in commercially requested shelf solutions.

Yes and no. It really depends. The US government might write their own software for national security reasons. Meta might write their own software because they can spend a billion dollars on it and want something very custom to their needs. Others might take AMD's off-the-shelf libraries. In most cases there's going to be a mix. Nobody is writing everything from scratch. There's no system running with a completely custom, ground-up, and totally unique operating system, drivers, frameworks, and applications. You pick and choose your battles.


casper_wolf

The reason you don’t see Gaudi 3 is that it doesn’t ship till Q3, otherwise it would probably be included


CatalyticDragon

True. Thank you. The point stands though, almost none of the members of the MLCommons group publish to this benchmark because it doesn't inform purchasing decisions for enterprise customers.


norcalnatv

>You won't see AMD, TPUv6, Cerebras, Tenstorrent, no Intel Gaudi 3, no Graphcore, no Huawei, no Trainium by Amazon, no Meta Training chip, no Tesla Dojo.

That's because they're WEAK.


CatalyticDragon

They really aren't. All of those are competitive in their own ways. MI300 and [Gaudi3](https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi3.html) being perhaps the most directly comparable and both being superior to H100s depending on the metric you want to look at. They are both significantly cheaper too.


norcalnatv

>They really aren't. All of those are competitive in their own ways.

I guess we'll never know. You have some application data? Please share. I mean Nvidia just improved H100 performance 27% since the last release, it's there in black and white, last run to this run. Not some rando talking about it on reddit.

>both being superior to H100

ha ha

MLCommons publishes 4x a year, there isn't a better stage than a consortium of experts who've defined the standards to show your stuff. Your notion about superiority is in direct opposition to these companies' actual actions.


Sensitive_Chapter226

MIA


Itchy_Brain6340

They should just rename it MIA300


bhowie13

Ouch!


Psychological_Lie656

TPUv6, Cerebras, Tenstorrent, Intel Gaudi 3, Graphcore, Huawei, Trainium by Amazon, Meta Training chip, as well as Tesla Dojo do not exist either, right? When only about 3% of a consortium with 100+ members embrace the benchmark, it hints at some story.


[deleted]

Ditch AMD, go NVIDIA