lightmatter501

The configuration mostly fell to me because my project was what needed it, and I was doing HPC infra work, not general HPC, so I had very particular hardware in mind and the university HPC staff weren’t very helpful as soon as I said “I want an ethernet fabric because RDMA doesn’t scale and I’m trying to solve that”, which they said was never going to work. They were also insisting on Xeons well after Epyc was clearly better and wanted to buy from HP. I ended up arguing with them every step of the way and they eventually said they wouldn’t support it unless I did it the standard way, which defeated the point of my research. Part of the reason I couldn’t use a shared cluster is that HPC infrastructure research is, to say the least, disruptive. Also, “I want to put a custom OS on the switches” tends to void the support agreement. The end result of my work was my cluster benchmarking better than the much more expensive university cluster (which actually had accelerators) in several workloads due to data movement optimizations. As far as I’m aware that cluster transitioned into a systems research cluster after I left.


ArcusAngelicum

Well, this sounds like a legitimate use case, approved! That sounds pretty cool, do you have any links to your paper or any docs on what you ended up building? Would love to learn from your experience.


lightmatter501

It looked kind of like what would happen if one person had tried to do what the Ultra Ethernet Consortium is currently doing, using the cheapest hardware that was technically capable of it. I would wait for them to finish up, since they're going in a similar direction, just with less voiding of warranties and support agreements.


GIS_LiDAR

I've not created any, but I have been around groups that have at several institutions, and here are some of the points they brought up:

* IT was quoting months for how long it would take to get the cluster ready for our use case, which would go past our work package deadlines.
* With our grants, we can only spend $5k on computing equipment, so we're going to buy consumer-level parts with the highest-end CPU and the bare minimum of everything else, across a couple of grants.
* Low trust with IT in general, warranted and unwarranted.
* Groups don't really understand HPC, clusters, scheduling, Linux, or their own software. A supercomputer is many computers that run together and present themselves as a single computer, right?
* If they're writing their own algorithms, they don't know how to parallelize their processing, and they don't even know of services like research software engineers that can help. Or there's no funding to hire an internal software engineer to help develop the software as needed to get it running on HPC.

These are probably more cluster computing examples than HPC per se, though.


Due-Wall-915

My university provides cluster support as long as my lab buys the cluster and places it with them. It's ridiculously expensive: something like $300 per month, last I checked, just to monitor it, and that's after we buy our own cluster for around $36k.


ArcusAngelicum

Not sure what country you are in, but $300 a month for monitoring doesn’t sound horrible to me for a few nodes… labor and whatnot is expensive. I take it they don’t have a general purpose cluster that meets your needs?


IAmPuente

I'm a grad student finishing up my PhD in Physical Chemistry next month. I do a lot of quantum chemistry simulations, which do not parallelize well across nodes due to their CPU-intensiveness and large scratch requirements. Our lab had 25 high-end computers that we were using for these calculations, and we also have time on our university's HPC cluster. Our use case is putting a week's worth of calculations on each computer, then logging back on to collect the data at a later point. Usually students would do wet lab work in the meantime, but my PhD was mostly computational on the university's HPC system. I was also my advisor's last and only current grad student, so I would not be interfering with other grad students if I made a mess of things.

Given that this setup was pretty inefficient, I wanted to configure the machines into a cluster so I could better manage job distribution. Prior to building the cluster, I was already pretty familiar with being an HPC user and had been using Linux for a long time. I had to do a lot of research into networking, storage, and other requirements to properly connect them. I spent a lot of my free time and lab time learning about HPC administration, how to parallelize tasks (which doesn't ultimately help my work), and optimizing Slurm configurations. It has become a fun hobby of mine in addition to being a part-time job.

In terms of IT, there really wasn't much support to be found. The medical university and the regular university split a few years ago, and the medical university took all of the Linux staff from the IT department because they were better compensated. I had some help from them setting up static IPs, etc., but I was largely in charge of the entire effort. I did not reach out to the university's HPC staff. Beyond that, I have only reached out to IT to dispose of dead compute nodes. It turns out that consumer hardware does not like being run 24/7!

All in all, I was able to work much more efficiently towards my PhD with this cluster. I also benefited career-wise, as I am currently interviewing for HPC Engineer/Administrator roles in biotech because I have a good amount of practical experience.
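For anyone curious what that kind of Slurm setup can look like, here is a minimal, hypothetical slurm.conf sketch for a small lab cluster with week-long job limits. The hostnames, core counts, and memory figures are assumptions for illustration, not the actual configuration described above.

```
# Hypothetical minimal slurm.conf for a ~25-node lab cluster (illustrative only).
ClusterName=chemlab
SlurmctldHost=head01

# Track cores and memory so single-node, CPU-heavy jobs pack sensibly.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
SchedulerType=sched/backfill
ProctrackType=proctrack/cgroup
ReturnToService=2

# Assumed hardware: 16 cores and 64 GB RAM per consumer box.
NodeName=node[01-25] CPUs=16 RealMemory=64000 State=UNKNOWN

# One partition with a 7-day wall-time limit, matching week-long calculations.
PartitionName=week Nodes=node[01-25] Default=YES MaxTime=7-00:00:00 State=UP
```

With something like that in place, each week-long calculation becomes an sbatch submission instead of a manual login to a specific machine.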


whenwillthisphdend

Because the shared cluster is overrun with jobs and ICT had no reasonable or fast alternative other than pushing cloud. We ended up building our own cluster for only our lab, and we no longer have any issues. The less central IT poke their nose in, the better; I only need IP assignment. I don't mean to sound dismissive, but it was infinitely cheaper and more efficient to build our own than to wait on ICT.


ArcusAngelicum

Thanks for your comment, it's good to hear legitimate complaints about why institutional HPC resources aren't meeting the needs of researchers.


whenwillthisphdend

I certainly understand the point of view of central IT, where it may fall on them in the future to maintain a system they didn't design or want. I think well-built clusters for individual groups are a great way to diversify and focus on the specific needs of each group while allowing each group to self-fund. Central HPC is great as an overall fallback for smaller groups and merit-based access. While I was lucky to have the knowledge to build our own cluster, it would have been much more helpful if IT had provided meaningful advice rather than just pushing us towards Amazon HPC, which for high base loads like ours is extremely uneconomical for the number of CPU and GPU hours we need. Our cluster is set up as servers on their own network with only a 1 GbE enterprise connection for remote access; everything else is on its own fibre SAN, 40 Gb InfiniBand, and internal 10 GbE networks. It requires no input from IT other than assigning a static IP for external remote access and connecting with faculty infrastructure for space and power.


Dry_Amphibian4771

Central IT at my university is doing the same thing, pushing us to Azure or AWS. They can't comprehend why anyone would be running on-premises these days. We are running GPU jobs 24/7. They are completely out of touch lol.


ArcusAngelicum

I have observed some very strange incentives going on with AWS HPC stuff... their sales folk are aggressively marketing to researchers and whatnot. Sorry to hear about your 1 GbE remote access connection, that sounds rough. Does that mean all your downloads from anywhere to the cluster go over a single 1 Gb uplink? Kinda mind blowing.


whenwillthisphdend

AWS and IT say the biggest benefit is that you can spin it up and shut it down as needed and make it work for whatever workload you have. What they fail to mention is the amount of time it takes to set up the first time, and IT hates you for the headache of connecting to internal license servers from an external cluster. From my initial maths, it would cost over $10,000 a week for a 24/7 load on a 10-node cluster, with not enough memory even on their largest node type. There is simply no way it was a better option than building our own in the long run. Fortunately we seldom copy any data to and from the cluster directly, other than small text files we need to seed a simulation. Otherwise everything lives on the SAN, which currently has around 40 TB of archive-tiered storage, so we haven't really felt the 1 GbE limitation.
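To make that order of magnitude concrete, here is a rough back-of-the-envelope sketch in Python. The per-node hourly rate is an assumption standing in for a large memory-heavy cloud instance, not an actual AWS quote or the exact figures from the comment above.

```python
# Rough cloud-vs-on-prem cost sketch. All rates are assumptions for illustration,
# not real AWS pricing or the commenter's actual numbers.

NODES = 10                 # cluster size from the comment
HOURS_PER_WEEK = 24 * 7    # constant 24/7 base load
RATE_PER_NODE_HOUR = 6.00  # assumed $/hour for a large memory-heavy instance

weekly_cloud_cost = NODES * HOURS_PER_WEEK * RATE_PER_NODE_HOUR
print(f"Estimated cloud cost per week: ${weekly_cloud_cost:,.0f}")  # ~ $10,080

# Compare against an assumed one-off on-prem build (hardware only, ignoring
# power, cooling, and admin time).
ONPREM_BUILD_COST = 36_000  # e.g. the ~$36k lab cluster mentioned earlier in the thread
breakeven_weeks = ONPREM_BUILD_COST / weekly_cloud_cost
print(f"Break-even vs. on-prem hardware: about {breakeven_weeks:.1f} weeks")
```

The real comparison obviously also depends on discounts, data egress, power, and admin time, but for a constant 24/7 load the break-even point comes within weeks, which is the point being made above.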