
Nvidia DGX cluster serves as cloud training ground


C.D. Poon with DGX

Over the past year, ITS Research Computing staff and about 30 of its customers have become more conversant with Kubernetes and Docker through the use of a recently expanded Nvidia DGX cluster.

Learning Kubernetes and Docker is another opportunity for Carolina to leverage cloud capabilities for research computing.

Kubernetes is a Google-created, open-source technology that runs containerized applications: software packages that contain everything needed to run an application, including specific versions of code libraries. A container can run anywhere, on any machine, whether that is one's home computer or any of the three major cloud computing platforms: Google Cloud Platform, Amazon Web Services or Microsoft Azure. Docker, which created the industry standard for containers, is one brand of container technology.
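As a rough illustration of what such a package contains, a Dockerfile is the recipe Docker uses to build a container image. The sketch below is hypothetical (the file name, base image and library version are illustrative, not from Research Computing's setup) but shows how code and pinned library versions travel together:

```dockerfile
# Hypothetical example: package an application with pinned library versions
FROM python:3.6-slim            # base image with a specific interpreter version
RUN pip install numpy==1.16.0   # pin the exact library version the app needs
COPY app.py /app/app.py         # the application code itself
CMD ["python", "/app/app.py"]   # what runs when the container starts
```

Once built with `docker build`, the resulting image runs the same way on a laptop or on any of the cloud platforms above, because everything the application needs is inside the image.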

Started as a pilot

In spring 2018, Research Computing deployed one Nvidia DGX Station desktop machine and three Nvidia DGX-1 servers as a pilot project to study GPU computing, container technology and cloud computing. The DGX Station serves as the master node while the DGX-1 servers function as compute nodes, forming a small cluster that uses Kubernetes as the job manager and scheduler.
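A sketch of what a Kubernetes-scheduled GPU job can look like: the manifest below is a generic Kubernetes Job that requests one GPU through the standard `nvidia.com/gpu` resource. The job name and image are hypothetical examples, not Research Computing's actual configuration:

```yaml
# Hypothetical Kubernetes Job requesting a single GPU
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-example                          # hypothetical job name
spec:
  template:
    spec:
      containers:
      - name: worker
        image: nvcr.io/nvidia/cuda:10.0-base # example CUDA image from NGC
        command: ["nvidia-smi"]              # list visible GPUs, then exit
        resources:
          limits:
            nvidia.com/gpu: 1                # ask the scheduler for one GPU
      restartPolicy: Never
```

Submitted with `kubectl apply -f job.yaml`, Kubernetes places the pod on a node that has a free GPU; on a cluster like this one, that would be one of the DGX-1 compute nodes.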

Along with the DGX systems, Nvidia provides access to its NGC service, also known as the Nvidia GPU Cloud. Nvidia curates software applications into NGC in the form of Docker images. Likewise, DGX users can add their own software applications in the form of Docker images to the Nvidia GPU Cloud and keep them private or make them available for others to use.
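Pulling an NGC image follows the ordinary Docker workflow. The commands below are a sketch (they require a Docker installation and a free NGC account, and the image tag is illustrative, since NGC tags change over time):

```shell
# Log in to the NGC registry with an NGC API key
docker login nvcr.io

# Pull a curated application image, e.g. an Nvidia-built TensorFlow container
docker pull nvcr.io/nvidia/tensorflow:19.05-py3

# Run it with GPU access (Docker 19.03+ syntax)
docker run --rm --gpus all nvcr.io/nvidia/tensorflow:19.05-py3 nvidia-smi
```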

Fengping Hu

“To me, this is the future computer,” said C.D. Poon, the Research Computing scientist who is responsible for exploring what can be done with the DGX cluster and helping University researchers use it.

Using containerized applications simplifies the requirements for one's computer and broadly expands the options for where one can compute. "There is no requirement," he added, "only that your machine is equipped with Docker."

Nowadays, many applications — especially artificial intelligence and GPU applications — are packaged in container images.

“Our platform provides users with a very easy and convenient way to run these kinds of applications,” said Research Computing’s Fengping Hu, the system administrator of the DGX cluster.

Protecting customers’ data, of course, is critical too. For that reason, Research Computing developed code to enable users to run container applications safely without compromising system security.

Preparing for the cloud

Others within Research Computing are making their own forays into the cloud. Poon, meanwhile, has not moved the DGX cluster, Kubernetes or Docker into the cloud. Using Nvidia GPU Cloud is his first step.

So far, Poon said, everything he needs to do he can do in DGX. Later, when he needs more resources and wants to achieve that by moving to the cloud, he said, “I will have no problem migrating to the cloud.”

DGX has value

DGX Station

Poon and Hu no longer consider DGX a pilot project for Research Computing. Sure, they and campus researchers are still expanding their knowledge of what’s possible with DGX, Kubernetes and Docker, but they know the setup works and is adding value.

DGX, Poon said, is “very solid” and “gets things done very fast.”

In fact, in October 2018, the Kubernetes/Docker cluster became so popular that it was necessary to move two GPU-equipped Dell compute nodes from the Longleaf cluster over to the DGX cluster. Now the DGX cluster has 36 graphics processing units, or GPUs. In comparison, Longleaf has 96 GPUs. GPUs serve as computing co-processors to central processing units (CPUs). More GPUs mean more and faster processing power.

Most jobs are AI

Poon, who has a doctorate in chemistry, uses the DGX cluster for molecular dynamics computations. While Research Computing doesn’t require details of users’ specific research topics, Poon does know that the majority of the jobs on DGX are based on artificial intelligence.

DGX and the Longleaf GPU nodes are no different in terms of the computations they can run. “They are the same thing. They can do the same job,” Poon said. The small DGX cluster has only six machines while Longleaf, for example, has 14 machines with GPUs.

Researchers can receive access to the DGX cluster if they already know how to use Docker, or if Poon has time to instruct them.

Understanding Docker and Kubernetes is key to running many research applications in the cloud. Experimenting with the DGX cluster and Nvidia GPU Cloud is a great way to prepare for moving research to the major cloud service providers.
