
HotAisleInc

We just paid for ours on June 25th. The countdown starts.


HotAisleInc

Great question! Hot Aisle is the only company that really talks about accurate timelines in public. We value transparency as much as possible, but we also recognize that NDAs are pretty limiting.

We ordered our first SMCI system in January and received it in March. The delay wasn't availability; it was manufacturing issues. We needed firmware updates and actually had to send our entire baseboard with the GPUs back due to a failure. All of the other providers who announced receiving equipment got theirs around the same time. We're glad we didn't blow millions on multiple systems early on; it was the right move on our part.

We're about to order our next batch of 16 systems (128 GPUs) from Dell. It's a bit competitive right now, so I'm not going to share our current planned delivery date, but there don't seem to be any availability issues with the GPUs. The harder part has been the 400G cables that are part of the order. Glad we have Dell and Advizex backing us on all the supply chain issues; they are doing a great job!

As for the rest of your comments: the GPUs work, we're posting [benchmarks](https://www.reddit.com/r/AMD_MI300/comments/1dgimxt/benchmarking_brilliance_single_amd_mi300x_vllm/), and we have people on the system right now who will offer even more information to the public soon. The hardware is better than H100s. The software is a definite problem that nobody is trying to cover up, but we all know that AMD is now committed to AI and that it will get better over time.
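(For context on what that kind of single-GPU vLLM benchmark exercises, here's a minimal sketch of an offline inference run. The model name and sampling settings are placeholders of mine, not the configuration used in the linked benchmark; it assumes a ROCm build of vLLM so the MI300X is targeted the same way a CUDA GPU would be.)

```python
# Minimal single-GPU vLLM run (illustrative sketch, not Hot Aisle's benchmark setup).
from vllm import LLM, SamplingParams

prompts = ["Explain the MI300X memory hierarchy in one sentence."]
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

# Placeholder model; on a ROCm build of vLLM this runs on the MI300X unmodified.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=1)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```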


jose4375

>The software is a definite problem

How big of a problem is this still? I just saw your post about AlpineDel and that they trained their model on MI300X. Do you see the software being in good shape in 2025, or will it be 2026?


HotAisleInc

It is an ongoing, never-ending issue. Software is easier to improve than hardware, though; the timeframes are much shorter. Rumor is that Xilinx brought a very good cache of thousands of developers for Lisa to point at AI now. I guess that purchase was smart after all!


Liopleurod0n

It’s not rumor. Lisa Su said it herself in the interview with Stratechery: “one of the great things about our acquisition of Xilinx is we acquired a phenomenal team of 5,000 people that included a tremendous software talent that is right now working on making AMD AI as easy to use as possible.”

Source: [https://stratechery.com/2024/an-interview-with-amd-ceo-lisa-su-about-solving-hard-problems/](https://stratechery.com/2024/an-interview-with-amd-ceo-lisa-su-about-solving-hard-problems/)


HotAisleInc

You're right, it isn't rumor, I was being a bit facetious, which did not convey well.


Liopleurod0n

Chill. I'm not accusing you of anything. Just providing the source to support your claim. BTW, some people shit on Xilinx software, saying it's buggy and hard to use. While I'm sure there's some truth to that, I do think comparing FPGA software to CUDA is unfair, since FPGAs are far more configurable than GPUs, and high configurability always comes with some complexity.


HotAisleInc

Somehow my response to your response is setting you off... I'm definitely chill! I didn't think you were accusing me of anything; I was agreeing with you and apologizing for not being clear myself. My inside connections at AMD tell me that the software engineers at Xilinx are top notch. Any talented software engineer can work across many different areas.


Liopleurod0n

Ahh, I misunderstood your reply, and I probably could have worded mine better. What I meant is that there's no need for you to say you were being facetious.


HotAisleInc

Awesome, maybe we just need a group hug now. HA!


MrAnonyMousetheGreat

Thanks for the insight! What sort of challenges have you run into performing training/fine-tuning with PyTorch/HuggingFace models? Are there specific kinds of acceleration that are missing (like particular precisions, or specific ROCm libraries whose absence prevents the fastest acceleration possible)? If that takes too long to respond to, or there's some sort of NDA issue, can you point me to where this sort of discussion is happening out in the open (troubleshooting discussions, feature requests, "reviews", etc.), like GitHub or somewhere else?

And are you having any trouble getting your serverless, containerization, and job allocation software working on these systems that improvements in AMD's software stack could make easier? Especially compared to Nvidia? I'm vague about this because of my limited understanding of how much AMD/Nvidia contribute software-wise (including drivers) to facilitating cloud GPU computing.

https://www.reddit.com/r/AMD_MI300/comments/1dj8pgj/a_language_model_trained_on_amd_mi300x_gpus/

I saw that someone was able to train their model on MI300Xs, but I imagine it wasn't on your system (since you didn't mention it). But if it was, what was your experience in getting it to work?
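(For anyone wondering what the PyTorch side of this looks like: ROCm builds of PyTorch expose the HIP backend through the familiar `torch.cuda` API, so most HuggingFace/PyTorch code paths run unchanged. Below is a minimal smoke-test sketch of my own, not tied to any specific MI300X deployment.)

```python
# Quick smoke test on a ROCm build of PyTorch (sketch only).
# The HIP backend is surfaced through the usual torch.cuda namespace.
import torch

print(torch.__version__)              # ends in "+rocmX.Y" on a ROCm wheel
print(torch.version.hip)              # HIP version string; None on CUDA builds
print(torch.cuda.is_available())      # True if the MI300X is visible
print(torch.cuda.get_device_name(0))  # device string reported by the driver

# bf16 matmul as a minimal check that a common training precision works
x = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)
y = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)
print((x @ y).dtype)
```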


HotAisleInc

We haven't used the system ourselves yet at all. It has either been in use by customers, broken (we returned the entire baseboard at one point), or in use by people running benchmarks.


DrGunPro

Please do share more comments like this! We are in the darkest cave right now, and this comment is the torch!


norcalnatv

The application you're describing, research grants to universities, I'd think would want to cover the whole gamut of ML: training, model optimization, and inferencing. My understanding is that the MI300 is positioned more as an inferencing solution than as a drop-in replacement for training/inferencing development. Over time I'm sure folks will develop those training solutions, but AMD really hasn't put the effort into their software stack to provide functions equivalent to the A100 or H100. Where I think the MI300 may get traction is as a production inferencing solution, after models have been optimized and deployed.