For most companies there is currently no way around AI services. But the challenges are considerable: simply buying AI assistants from cloud providers is rarely the best option – especially with regard to data protection, but also because of the potentially high costs. So just plug a few Nvidia GPUs into a server and get started? That is rarely enough, explains Daniel Menzel in an interview on the cover topic of the new iX 4/2024: AI in your own data center.
Daniel Menzel is managing director of Menzel IT GmbH from Berlin and builds HPC, ML and private cloud computing clusters with his team.
The common assumption is that machine learning simply requires more computing power. Why is this wrong?
There are two reasons for this: First, machine learning (at least during training) is only really efficient on GPUs; equipping classic servers with more powerful CPUs will not get you to a truly high-performance ML infrastructure. Second, machine learning requires a network with very high throughput and low latency – based on Ethernet or even InfiniBand – and regularly also very powerful central storage. A classic Fibre Channel SAN is usually nowhere near sufficient for this.
Companies can learn a lot from HPC operations. Why is that?
What is HPC today becomes enterprise IT tomorrow. High-density systems, 100, 200 and 400G, RDMA – all of these technologies were first “tested” in HPC and ML before they arrived in enterprise IT. We are currently seeing this transition very clearly with water cooling: a niche topic just five years ago, but today almost everyone who has more than two racks in their basement is talking about it.
Which components of their infrastructure do IT departments particularly tend to forget?
Definitely the network. In 2024 it not only has to deliver high throughput but, in particular, low latency. Since in many infrastructures – especially cloud-like ones – east-west traffic often exceeds north-south traffic due to storage synchronization, microservices and general machine-to-machine communication, a redesign or upgrade is also a good moment to think about switching from a classic three-tier to a modern spine-leaf architecture.
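The capacity math behind such a spine-leaf redesign can be illustrated with a small back-of-the-envelope sketch (all port counts and link speeds here are hypothetical examples, not figures from the interview):

```python
# Back-of-the-envelope oversubscription check for a leaf switch in a
# spine-leaf fabric: aggregate downlink bandwidth to the servers divided
# by aggregate uplink bandwidth to the spines. A ratio of 1.0 means the
# fabric is non-blocking; higher values mean contention under load.

def oversubscription_ratio(server_ports: int, server_gbps: int,
                           uplink_ports: int, uplink_gbps: int) -> float:
    downlink = server_ports * server_gbps   # server-facing bandwidth (Gbit/s)
    uplink = uplink_ports * uplink_gbps     # spine-facing bandwidth (Gbit/s)
    return downlink / uplink

# Hypothetical leaf: 48 servers at 25G, 4 uplinks at 100G.
ratio = oversubscription_ratio(48, 25, 4, 100)
print(f"oversubscription: {ratio:.1f}:1")  # 3.0:1 – acceptable for many
# workloads, but too high for east-west-heavy ML and RDMA traffic
```

For latency-sensitive ML traffic, operators typically aim for a ratio at or close to 1:1 between the leaf and spine layers.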
In addition, there is no ML storage that fits every scenario. The data is sometimes structured, sometimes unstructured, sometimes stored as files, sometimes in databases. Before the storage can be designed, it must be clarified with the users what their data looks like. A few very basic rules often apply nonetheless. First: “a lot helps a lot” – training and test data can reach the upper terabyte or even petabyte range. Second, ML storage must be very high-performance, above all low-latency. And workloads typically read far more data than they write.
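The read-heavy nature of training translates directly into a bandwidth requirement for the central storage. A minimal sketch, using hypothetical GPU counts and sample sizes (none of these numbers come from the interview):

```python
# Estimate the sustained read bandwidth central ML storage must deliver
# so that the GPUs never starve while streaming training data.

def required_read_gbit_per_s(num_gpus: int,
                             samples_per_gpu_per_s: float,
                             bytes_per_sample: int) -> float:
    bytes_per_s = num_gpus * samples_per_gpu_per_s * bytes_per_sample
    return bytes_per_s * 8 / 1e9  # convert bytes/s to Gbit/s

# Hypothetical cluster: 8 GPUs, each consuming 2000 samples/s of ~150 KB.
demand = required_read_gbit_per_s(8, 2000, 150_000)
print(f"required read bandwidth: {demand:.1f} Gbit/s")  # 19.2 Gbit/s
```

Even this modest hypothetical cluster already exceeds what a single 16G Fibre Channel link can sustain, which illustrates why classic SANs fall short for ML training.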
Mr. Menzel, thank you very much for the answers! All information about what can be learned from HPC operations for machine learning in your own data center can be found in the new iX 4/2024. The April issue also shows how companies can correctly set up their network and storage for their own AI services.
In the “Three Questions and Answers” series, iX wants to get to the heart of today's IT challenges – whether from the perspective of the user in front of the PC, the manager's point of view, or an administrator's everyday work. Do you have suggestions from your daily practice or that of your users? Whose tips on which topic would you like to read, briefly and to the point? Then feel free to write to us or leave a comment in the forum.
(fo)