Composable Infrastructure allows job schedulers to dynamically map any number of CPU nodes to the optimum number of NVIDIA® Tesla® V100 GPU accelerators and NVMe storage resources required by data center applications. This flexibility increases resource utilization in mixed-use data centers, where AI, deep learning, data science, Monte Carlo simulation, and image processing applications may run on the same hardware.
Data center architectures have seen significant advances with the advent of the cloud but have remained shackled by the motherboard-chassis paradigm, which locks resource allocation at the point of purchase. For many, that means hyperconverged servers with large numbers of GPUs and NVMe drives sized for peak application utilization, which leaves many resources under-utilized when non-peak applications are running. For others, that means under-powered nodes that can't handle heavily compute-intensive workloads. Liqid Composable connects disaggregated resource pools to a fabric, freeing users from the restrictions of the motherboard-chassis configuration that has remained one of the final physical limitations of the digital world. Through innovations in low-latency fabrics and intelligent software, Liqid Composable addresses the painful and costly limitations of static architectures by interconnecting pools of compute, networking, data storage, and graphics processing devices over a PCI Express (PCIe) fabric.
One Stop Systems’ expansion technology works well with the Liqid composable infrastructure software because the OSS hardware provides industry-leading GPU and NVMe hardware density, increasing resource availability across the data center. Liqid’s technology platform allows users to manage, scale out, configure and even automate physical, bare-metal server systems. With the ability to treat GPUs as a disaggregated, shared resource for the first time, scaled via OSS expansion systems, composable solutions from Liqid deliver the infrastructure to meet today’s most taxing HPC challenges, such as peer-to-peer transfers and memory access for AI and machine learning.
PCIe Device Lending - Composable Infrastructure made easy
Dolphin eXpressWare SmartIO software offers a flexible way to make PCIe I/O devices (NVMe drives, FPGAs, GPUs, etc.) accessible within a PCIe network. Devices can be borrowed over the PCIe network without any software overhead and operate at native PCI Express performance. Device Lending is a simple way to reconfigure systems and reallocate resources. GPUs, NVMe drives, or FPGAs can be added to or removed from a system without being physically installed in that particular system on the network. The result is a flexible, simple method of creating a pool of devices that maximizes usage.
Since this solution uses standard PCIe, it does not add any software overhead to the communication path. Standard PCIe transactions are used between the systems. Dolphin's eXpressWare software manages the connection and is responsible for setting up the PCIe Non-Transparent Bridge (NTB) mappings.
Two types of functions are implemented with device lending: the lending function, which makes a device available to the network, and the borrowing function, which lets another system use it.
The Dolphin Device Lending software enables this process to be controlled using a set of command line tools and options. These tools can be used directly or integrated into any other higher level resource management system. The device lending software is very flexible and does not require any boot order or power on sequencing. PCIe devices borrowed from a remote system can be used as if they were local devices until they are given back. The Device Lending software does not require any changes to standard device drivers or to the Linux kernel.
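The lend, borrow, and give-back cycle described above can be pictured with a small bookkeeping model. This is only an illustrative sketch: the class and method names (LendingRegistry, lend, borrow, give_back) are hypothetical and do not correspond to Dolphin's command line tools or any eXpressWare API, and the rule that a device has one borrower at a time is an assumption of the model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LendingRegistry:
    """Toy bookkeeping model of device lending (hypothetical, not Dolphin's API)."""

    # device id -> host that physically holds the device and lends it
    lenders: Dict[str, str] = field(default_factory=dict)
    # device id -> host that is currently borrowing it
    borrowers: Dict[str, str] = field(default_factory=dict)

    def lend(self, device: str, owner: str) -> None:
        """Lending function: the owning host makes a device available."""
        self.lenders[device] = owner

    def borrow(self, device: str, host: str) -> None:
        """Borrowing function: a host claims a device that has been lent."""
        if device not in self.lenders:
            raise KeyError(f"{device} has not been made available for lending")
        if device in self.borrowers:
            raise RuntimeError(f"{device} is already borrowed by {self.borrowers[device]}")
        self.borrowers[device] = host

    def give_back(self, device: str) -> None:
        """Return a borrowed device so another host can borrow it."""
        self.borrowers.pop(device, None)

    def user_of(self, device: str) -> Optional[str]:
        """Host currently borrowing the device, or None if it is free."""
        return self.borrowers.get(device)


if __name__ == "__main__":
    registry = LendingRegistry()
    registry.lend("gpu0", owner="hostA")
    registry.borrow("gpu0", host="hostB")
    print(registry.user_of("gpu0"))
    registry.give_back("gpu0")
    print(registry.user_of("gpu0"))
```

A higher-level resource manager could keep a table like this and call the vendor's real tools at each state change; the model itself says nothing about how the NTB mappings are set up.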
Device lending also enables an SR-IOV device to be shared as an MR-IOV device. SR-IOV virtual functions can be borrowed by any system in the PCIe network, so a single device can be shared by multiple systems. This maximizes the use of SR-IOV devices such as 100 Gbit Ethernet cards.
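Before lending out virtual functions, it can help to see how many a device exposes. On Linux, the kernel publishes this through the standard sysfs attributes sriov_numvfs and sriov_totalvfs under each PCI device directory. The sketch below reads those two files; it is independent of the Dolphin software, the function name is illustrative, and the PCI address in the example is made up.

```python
from pathlib import Path
from typing import Optional, Tuple


def sriov_vf_counts(
    bdf: str, sysfs_root: str = "/sys/bus/pci/devices"
) -> Optional[Tuple[int, int]]:
    """Return (enabled_vfs, total_vfs) for a PCI device.

    bdf is the PCI address in domain:bus:device.function form.
    Returns None when the device is absent or does not expose SR-IOV.
    """
    device_dir = Path(sysfs_root) / bdf
    num_file = device_dir / "sriov_numvfs"
    total_file = device_dir / "sriov_totalvfs"
    if not (num_file.is_file() and total_file.is_file()):
        return None
    enabled = int(num_file.read_text().strip())
    total = int(total_file.read_text().strip())
    return enabled, total


if __name__ == "__main__":
    # "0000:3b:00.0" is a placeholder address, not taken from the text above.
    print(sriov_vf_counts("0000:3b:00.0"))
```

The sysfs_root parameter exists only so the function can be pointed at a test directory; on a real host the default path applies.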
The GPUltima-CI is a power-optimized rack that can be configured with up to 32 dual Intel Xeon Scalable Architecture compute nodes, 64 network adapters, 48 NVIDIA® Volta™ GPUs, and 32 NVMe drives on a 128Gb PCIe switched fabric, and can support tens of thousands of composable server configurations per rack. Using one or many racks, the OSS solution contains the necessary resources to compose any combination of GPU, NIC and storage resources as may be required in today’s mixed workload data center.
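To make the idea of composing servers from a rack-level pool concrete, here is a minimal sketch that tracks free resources using the maximum counts quoted above (32 compute nodes, 64 network adapters, 48 GPUs, 32 NVMe drives). The RackPool class and its methods are hypothetical and are not part of any OSS or Liqid software; the model also ignores real constraints such as per-node device limits and fabric topology.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RackPool:
    """Free resources in one rack, defaulting to the maximum configuration."""

    nodes: int = 32
    nics: int = 64
    gpus: int = 48
    nvme: int = 32

    def compose(self, gpus: int = 0, nics: int = 0, nvme: int = 0) -> Dict[str, int]:
        """Attach the requested devices to one free compute node."""
        request = {"gpus": gpus, "nics": nics, "nvme": nvme}
        if self.nodes < 1:
            raise RuntimeError("no free compute node")
        # Check everything first so a failed request leaves the pool unchanged.
        for kind, count in request.items():
            if count > getattr(self, kind):
                raise RuntimeError(f"not enough free {kind}: want {count}, have {getattr(self, kind)}")
        self.nodes -= 1
        for kind, count in request.items():
            setattr(self, kind, getattr(self, kind) - count)
        return request

    def release(self, server: Dict[str, int]) -> None:
        """Return a composed server's node and devices to the pool."""
        self.nodes += 1
        for kind, count in server.items():
            setattr(self, kind, getattr(self, kind) + count)


if __name__ == "__main__":
    pool = RackPool()
    training = pool.compose(gpus=8, nics=2, nvme=4)
    print(pool)
    pool.release(training)
    print(pool)
```

A job scheduler could call compose when a job starts and release when it ends, so that the same GPUs serve different nodes over time.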