Nebula GPU for Instances


The Nebula system can leverage Nvidia GPU cards in two ways: sharing a card directly with Instances-cn machines, or using the Nvidia vGPU management software to partition the GPU into multiple vGPU mdev devices, which can then be assigned to Instances-xvm machines.

Officially, Nvidia RTX/T/V Series GPUs and the Nvidia vGPU Linux_KVM Driver 17+ are supported. However, there are reports that older devices and driver versions can also work, provided you can install the driver and configure mdev device slicing on the Sky Node's Linux distribution. For details, refer to the Nvidia documentation.


GPU for Instances-cn machines


Containers can use the Nvidia GPU installed in a Sky Node after you install additional packages on the node:

  • Debian Linux
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list;
    sudo apt-get update;
    sudo apt-get install -y nvidia-container-toolkit;

  • RHEL-based distributions
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo;
    sudo yum install -y nvidia-container-toolkit;

When provisioning an Instances-cn machine, select "Shared Nvidia" to enable the GPU for your instance, allowing you to use it with GPU-enabled software, commonly for AI operations.
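The two install routes above differ only by package manager. A minimal sketch of picking the route from the distro's os-release ID (the pick_route helper is hypothetical; on a real Sky Node you would set ID via `. /etc/os-release`):

```shell
# Hypothetical helper: map an os-release ID to the matching install route
# from the steps above (on a real node, ". /etc/os-release" sets ID).
pick_route() {
  case "$1" in
    debian|ubuntu) echo "apt" ;;                # Debian route: apt-get commands
    rhel|centos|rocky|almalinux) echo "yum" ;;  # RHEL route: yum commands
    *) echo "unknown" ;;
  esac
}

pick_route debian   # -> apt
pick_route rocky    # -> yum
```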


GPU for Instances-xvm machines


Using Nvidia vGPU management software to partition the GPU into multiple vGPU mdev devices for use by Instances-xvm machines involves a more complex setup and requires basic Linux skills. After installing the Nvidia KVM driver on the Sky Node, follow these steps:

  1. Verify loaded kernel modules:
    lsmod | grep nvidia;
    nvidia_vgpu_vfio       49152  9
    nvidia              14393344  229 nvidia_vgpu_vfio
    mdev                   20480  2 vfio_mdev,nvidia_vgpu_vfio
    vfio                   32768  6 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1

  2. Print the GPU device status with the nvidia-smi command; the output should look similar to the following:
    nvidia-smi;
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.63       Driver Version: 470.63       CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A40          Off  | 00000000:84:00.0 Off |                    0 |
    |  0%   46C    P0    39W / 300W |      0MiB / 45634MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

  3. Verify that the driver has created the mdev_supported_types directory, for example:
    ls /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/
    nvidia-105  nvidia-106  nvidia-107  nvidia-108  nvidia-109  nvidia-110 [...]

    Each nvidia-* directory represents an mdev-supported device type and contains information about its function, profile, and other details. The Q profiles (Q1, Q2, Q4, ...) are the important ones here: they determine how the GPU card can be sliced into vGPU mdev devices.

    For example, if you have an Nvidia RTX 6000 card with 24GB of memory and use the Q1 profile, you can create 24 vGPU devices for 24 Instances-xvm machines. With the Q2 profile, you would create 12 vGPU devices, and with the Q4 profile, you would create 6 vGPU devices, and so on.
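    The arithmetic behind these counts is simply the card's memory divided by the profile's per-device framebuffer. A quick sketch, assuming a QN profile allocates N GB per vGPU as in the example above:

```shell
# Sketch: vGPU device count per Q profile for a 24 GB card
# (a QN profile is assumed to allocate N GB of framebuffer per device).
MEM_GB=24
for q in 1 2 4; do
  echo "Q${q}: $((MEM_GB / q)) devices"
done
```

    This prints "Q1: 24 devices", "Q2: 12 devices", and "Q4: 6 devices", matching the example.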

  4. Once you have chosen your profile, create a script, for example "/nvidia/vgpu.sh", that partitions the Nvidia card so that the Nebula system can detect the vGPU devices. Create the file and make it executable:
    mkdir -p /nvidia;
    touch /nvidia/vgpu.sh;
    chmod +x /nvidia/vgpu.sh;

  5. Now, use the vi or nano editor to add the following contents. In this example, we are using the Q2 profile to partition the RTX 6000 into 12 mdev devices. Please note the ###Q2 tag.
    #!/bin/bash
    ###Q2
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd01" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd02" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd03" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd04" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd05" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd06" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd07" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd08" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd09" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd10" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd11" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd12" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
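    The twelve create lines can also be generated with a loop. A sketch that only prints the UUIDs (on a real node, redirect each line to the profile's create node as above):

```shell
# Sketch: generate the 12 zero-padded UUIDs used above for the Q2 profile.
# seq -w pads numbers to equal width, giving 01..12.
for i in $(seq -w 1 12); do
  echo "abcdefab-abcd-abcd-abcd-abcdefabcd${i}"
done
```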

  6. Edit the root user's crontab to add the following entry:
    @reboot /nvidia/vgpu.sh;
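    The entry can also be added without opening an editor. A sketch that builds the line as a variable; on the Sky Node you would then install it with `( crontab -l 2>/dev/null; echo "$ENTRY" ) | crontab -`, which preserves any existing entries:

```shell
# Sketch: the @reboot entry for root's crontab, built as a variable so it
# can be appended with: ( crontab -l 2>/dev/null; echo "$ENTRY" ) | crontab -
ENTRY="@reboot /nvidia/vgpu.sh"
echo "$ENTRY"
```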

  7. Finally, reboot and deploy Nebula. When provisioning Instances-xvm machines, you'll have the option to select one of the available vGPU devices for your instance.
