Hardware Efficiency
Scaling Out, Not Just Up
The journey of computing has always been a quest for greater efficiency. From hypervisors carving up physical servers to containers offering even more granular control, the pattern is clear. Now, with AI/ML and High-Performance Computing (HPC) taking center stage, a new frontier in resource optimization is opening up, especially around specialized hardware such as high-performance network interfaces.
This is where solutions like DraNet, a Kubernetes network driver, are making significant strides. By cleverly using Kubernetes’ Dynamic Resource Allocation (DRA), DraNet offers a declarative, Kubernetes-native method to manage and assign advanced network interfaces, including those powerful RDMA-capable NICs, directly to Pods. This isn’t merely about network connectivity; it’s a more intelligent approach to utilizing the potent, and often costly, hardware that underpins today’s distributed applications.
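To make this concrete, here is a rough sketch of the kind of DRA objects involved. The `rdma-nic` DeviceClass name, the `dra.net` driver string, and the `resource.k8s.io` API version are illustrative assumptions, not DraNet's published manifests:

```yaml
# Illustrative sketch only: a DeviceClass and ResourceClaimTemplate for a
# DraNet-managed NIC. Names, the "dra.net" driver string, and the API
# version are assumptions and may differ in a real cluster.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: rdma-nic
spec:
  selectors:
  - cel:
      expression: device.driver == "dra.net"   # assumed driver name
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: rdma-nic-template
spec:
  spec:
    devices:
      requests:
      - name: nic
        deviceClassName: rdma-nic
```

A Pod that references this template gets its own ResourceClaim, and the scheduler only places it on a node where a matching interface can actually be allocated.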
Scaling Up and Its Downsides
For a long time, tackling demanding workloads—think large-scale AI model training—meant “scaling up.” The answer was bigger machines, more GPUs, more NICs, often with an entire high-spec server dedicated to a single, massive job. While this method has its place, it can breed inefficiency. What happens if jobs don’t perfectly saturate all those allocated resources? Or if several smaller, yet still demanding, tasks could have run in parallel instead?
Embracing the Scale-Out Philosophy
DraNet adopts a “scale-out” method for managing specialized hardware. This approach involves dividing resources precisely, which offers several clear advantages.
Dedicated Network Performance for Tasks
Different parts of a complex application, such as a compute worker or a data server, can obtain exclusive access to specific network hardware. For instance, a task requiring very fast data transfer can be assigned its own high-speed RDMA network interface. This separation ensures that critical data communications proceed without interference from other activities on the same node, so network performance becomes predictable and tailored to each task’s requirements.
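As a sketch of how that exclusivity can be expressed (the `rdma` attribute under the `dra.net` domain is an assumption about what the driver publishes, not documented behavior), a claim’s request can add a CEL selector so that only RDMA-capable interfaces are eligible:

```yaml
# Hypothetical refinement of the earlier template: filter the request down to
# RDMA-capable interfaces via a device attribute (attribute name assumed).
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: fast-path-nic
spec:
  spec:
    devices:
      requests:
      - name: nic
        deviceClassName: rdma-nic
        selectors:
        - cel:
            expression: device.attributes["dra.net"].rdma == true
```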
Smarter Hardware Allocation
Consider a server equipped with 8 GPUs and 8 RDMA network interfaces. Instead of a single large job claiming all these resources, DraNet enables a more flexible assignment. For example, a distributed training task might need two workers, each using one GPU and one dedicated RDMA interface.
This precise assignment leaves the remaining 6 GPUs and 6 RDMA interfaces on that server available for other tasks. Such detailed control is central to scaling out: it allows for running multiple, smaller, concurrent workloads, each with the appropriate hardware. This marks a change from older models where an entire powerful server might be largely underused if a job didn’t need its full capacity.
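A minimal sketch of one such worker Pod, assuming the `rdma-nic-template` from the earlier sketch, a recent Kubernetes release with DRA enabled, and the conventional `nvidia.com/gpu` extended resource for the GPU (the image name is a placeholder):

```yaml
# Hypothetical worker Pod: one GPU plus one dedicated RDMA NIC claimed via DRA.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
spec:
  resourceClaims:
  - name: rdma-nic
    resourceClaimTemplateName: rdma-nic-template   # each Pod gets its own claim
  containers:
  - name: worker
    image: example.com/training:latest             # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1                          # one GPU for this worker
      claims:
      - name: rdma-nic                             # the dedicated RDMA interface
```

The second worker is an identical Pod with its own claim; because each claim is allocated a distinct device, the two workers never contend for the same interface, and the remaining GPUs and NICs on the node stay free for other workloads.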
Running More Jobs and Improving Cost-Effectiveness
Several demanding, independent jobs can operate simultaneously on the same physical infrastructure. AI teams, for example, can conduct numerous training experiments in parallel, each with its own GPU and network portion. Similarly, a scientific computing simulation, a distributed database, and inference services can all function together without impeding each other’s network access.
This ability to run multiple, correctly-sized jobs concurrently leads to better overall hardware utilization. For expensive systems used in AI and HPC, this approach can significantly improve cost-effectiveness compared to dedicating entire servers for extended durations. It helps ensure that valuable GPU and RDMA resources are more consistently put to good use.
Beyond GPU-Centric AI/ML
While the advantages for AI/ML training (such as running NCCL over RDMA) are immediately apparent, the principle of providing dedicated, high-performance network paths via DraNet benefits a much broader set of distributed workloads:
- Distributed Databases: Systems needing rapid inter-node synchronization, replication, and query processing can gain significantly from dedicated RDMA links.
- High-Performance Storage: Applications can achieve faster access to distributed file systems (think Lustre or BeeGFS with RDMA) or interact with NVMe-oF storage with performance nearing local speeds.
- In-Memory Data Grids & Caches: Distributed caches can dramatically reduce latency for remote data access, enhancing application responsiveness.
- Financial Services Applications: The quest for ultra-low latency messaging in high-frequency trading and real-time risk analytics becomes more attainable and predictable.
- Traditional HPC (MPI-based applications): Scientific simulations can benefit from dedicated RDMA paths for their MPI communication, all configured and managed within the Kubernetes ecosystem.