In April 2022, the Cloud Native Computing Foundation (CNCF) accepted Volcano into its incubator as the first batch computing project. Among other things, the software is intended to make it possible to run cloud-native applications for high-performance computing (HPC), big data and artificial intelligence (AI). The recently released Volcano 1.6 adds new features such as elastic job management, dynamic scheduling and rescheduling based on actual resource utilization, and a job plugin for the Message Passing Interface (MPI).
Scheduling for elastic jobs
Elastic job scheduling can work with both Volcano jobs and PyTorch jobs. This new feature is designed to accelerate AI training and big data analysis while also reducing costs by using Spot instances in the cloud.
The number of replicas allowed for an elastic job lies in the range [min, max], where min is the job’s minAvailable value and max is the job’s number of replicas. The elastic scheduling engine allocates resources preferentially to the minAvailable pods, thereby ensuring that their minimum resource requirements are met.
When resources are idle, the scheduler allocates them to the elastic pods to speed up processing. However, if the cluster runs short of resources, the scheduler reclaims the resources of the elastic pods first, triggering a scale-in. The scheduler also balances resource allocation based on priorities. For example, a high-priority job can preempt the resources held by the elastic pods of a low-priority job. Developers can find a very detailed article on this technique on GitHub.
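The two-phase behaviour described above can be illustrated with a small Python sketch. It is not Volcano’s implementation: the `ElasticJob` class, the `allocate` function and the idea of counting capacity in whole pod slots are assumptions made purely for illustration, as is the all-or-nothing admission of the minAvailable pods.

```python
from dataclasses import dataclass


@dataclass
class ElasticJob:
    """Hypothetical stand-in for an elastic job, not a Volcano type."""
    name: str
    min_available: int  # pods that must run (the "min" bound)
    replicas: int       # upper bound on pods (the "max" bound)
    priority: int = 0


def allocate(jobs, capacity):
    """Return {job name: pod count}, assuming each pod needs one slot."""
    ordered = sorted(jobs, key=lambda j: j.priority, reverse=True)
    allocation = {job.name: 0 for job in ordered}
    admitted = []
    remaining = capacity

    # Phase 1: serve the minAvailable pods first, highest priority first.
    # Assumption: a job is admitted only if all of its minimum pods fit.
    for job in ordered:
        if job.min_available <= remaining:
            allocation[job.name] = job.min_available
            remaining -= job.min_available
            admitted.append(job)

    # Phase 2: hand idle slots to the elastic pods of admitted jobs.
    for job in admitted:
        extra = min(job.replicas - job.min_available, remaining)
        allocation[job.name] += extra
        remaining -= extra

    return allocation


jobs = [
    ElasticJob("training", min_available=2, replicas=6, priority=10),
    ElasticJob("analysis", min_available=2, replicas=4, priority=1),
]
print(allocate(jobs, capacity=10))  # plenty of idle capacity
print(allocate(jobs, capacity=5))   # scarce capacity: elastic pods shrink
```

With ten slots both jobs reach their maximum replica count. With five slots each job keeps its two minimum pods, and the single spare slot goes to the higher-priority job, which mirrors the scale-in of elastic pods under resource pressure.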
Prometheus supports the scheduling mechanism
The scheduling mechanism Volcano has used so far is based on the resources that pods request and that have been allocated to them. However, the development team behind the software has found that this practice can result in unbalanced utilization of node resources. For example, a pod may be scheduled onto a node whose actual resource usage is already extremely high, causing a node exception, while other nodes in the cluster are only lightly used.
Therefore, in version 1.6.0, Volcano works with the observability tool Prometheus to make scheduling decisions. Prometheus collects data on the resource usage of the cluster nodes, and Volcano uses this data to balance node resource usage as evenly as possible. Developers can also configure CPU and memory limits for each node. This prevents node exceptions caused by pods using too many resources.
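As a rough illustration of usage-based placement, the following Python sketch picks a node from measured utilization figures rather than from requested resources. The `pick_node` function, the 80 percent limits and the sample figures are hypothetical; in a real cluster the numbers would come from Prometheus, and Volcano’s actual scoring logic may differ.

```python
def pick_node(usage, cpu_limit=0.8, mem_limit=0.8):
    """Pick the least-loaded node whose usage is below both limits.

    usage maps a node name to (cpu_fraction, mem_fraction), both in 0..1.
    Returns None when every node is at or above one of the limits.
    """
    eligible = {
        node: (cpu, mem)
        for node, (cpu, mem) in usage.items()
        if cpu < cpu_limit and mem < mem_limit
    }
    if not eligible:
        return None
    # Rank nodes by their most heavily used resource.
    return min(eligible, key=lambda node: max(eligible[node]))


# Hard-coded sample data standing in for metrics scraped by Prometheus.
measured = {
    "node-a": (0.95, 0.40),  # CPU above the limit, so excluded
    "node-b": (0.30, 0.50),
    "node-c": (0.60, 0.20),
}
print(pick_node(measured))  # node-b
```

Here node-a is skipped even though its memory is mostly free, because placing another pod there would risk exactly the kind of node exception described above.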
Plug-in for Message Passing Interface jobs
With this release, programmers can use Volcano Jobs to run Message Passing Interface (MPI) jobs. The plugins built into Volcano Jobs, such as svc, env and ssh, automatically set up passwordless communication and inject environment variables for the masters and workers of MPI jobs. According to the project, the MPI plug-in makes running MPI jobs much easier: developers no longer need to worry about shell syntax, the communication between master and worker or manual SSH authentication, and can start an MPI job in a much simpler way.
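To give an idea of what such a job definition involves, the following Python sketch assembles a manifest with one master task and several worker tasks and enables the svc, env and ssh plugins named above. The field names follow the Volcano Job format as commonly documented and should be checked against the project’s own reference. The helper names, the task names and the container image are hypothetical, and the arguments of the dedicated MPI plug-in are left out because the article does not specify them.

```python
import json


def pod_template(image):
    """Minimal pod template; container name and restart policy are illustrative."""
    return {
        "spec": {
            "containers": [{"name": "mpi", "image": image}],
            "restartPolicy": "OnFailure",
        }
    }


def mpi_job_manifest(name, image, workers):
    """Build a Volcano Job manifest with one master and `workers` workers."""
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",
            # The master and all workers must be schedulable together.
            "minAvailable": workers + 1,
            # Built-in plugins named in the article; empty lists mean defaults.
            "plugins": {"svc": [], "env": [], "ssh": []},
            "tasks": [
                {"name": "mpimaster", "replicas": 1,
                 "template": pod_template(image)},
                {"name": "mpiworker", "replicas": workers,
                 "template": pod_template(image)},
            ],
        },
    }


# "example/mpi:latest" is a placeholder image name.
manifest = mpi_job_manifest("mpi-demo", "example/mpi:latest", workers=2)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with kubectl, which accepts JSON as well as YAML manifests.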
Interested parties can find more information, including a Quick Start Guide, on GitHub.