Nebula GPU for Instances


The Nebula system can leverage Nvidia GPU cards in two ways: sharing a card directly with Instances-cn machines, or using the Nvidia vGPU management software to partition the GPU into multiple vGPU mdev devices, which can then be assigned to Instances-xvm machines.

Officially, Nvidia RTX/T/V Series GPUs and the Nvidia vGPU Linux_KVM Driver 17+ are supported. However, there are reports that older devices and driver versions can also work, provided you can install the driver and configure mdev device slicing on the Sky Node's Linux distribution. For details, refer to the Nvidia documentation.


GPU for Instances-cn machines


Containers can use the Nvidia GPU installed in a Sky Node after you install additional packages on the node:

  • Debian Linux
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list;
    sudo apt-get update;
    sudo apt-get install -y nvidia-container-toolkit;

  • RHEL-based distributions
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo;
    sudo yum install -y nvidia-container-toolkit;

When provisioning an Instances-cn machine, select "Shared Nvidia" to enable the GPU for your instance, allowing you to use it with GPU-enabled software, commonly for AI operations.
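The two install routes above differ only by package manager. A minimal sketch of picking the route from the distro's os-release ID (the pick_route helper is hypothetical; on a real Sky Node you would set ID via `. /etc/os-release`):

```shell
# Hypothetical helper: map an os-release ID to the matching install route
# from the steps above (on a real node, ". /etc/os-release" sets ID).
pick_route() {
  case "$1" in
    debian|ubuntu) echo "apt" ;;                # Debian route: apt-get commands
    rhel|centos|rocky|almalinux) echo "yum" ;;  # RHEL route: yum commands
    *) echo "unknown" ;;
  esac
}

pick_route debian   # -> apt
pick_route rocky    # -> yum
```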


GPU for Instances-xvm machines


Using Nvidia vGPU management software to partition the GPU into multiple vGPU mdev devices for use by Instances-xvm machines involves a more complex setup and requires basic Linux skills. After installing the Nvidia KVM driver on the Sky Node, follow these steps:

  1. Verify loaded kernel modules:
    lsmod | grep nvidia;
    nvidia_vgpu_vfio       49152  9
    nvidia              14393344  229 nvidia_vgpu_vfio
    mdev                   20480  2 vfio_mdev,nvidia_vgpu_vfio
    vfio                   32768  6 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1

  2. Print the GPU device status with the nvidia-smi command; the output should look similar to the following:
    nvidia-smi;
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.63       Driver Version: 470.63       CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A40          Off  | 00000000:84:00.0 Off |                    0 |
    |  0%   46C    P0    39W / 300W |      0MiB / 45634MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

  3. Verify that the driver has created the mdev_supported_types directory, for example:
    ls /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/
    nvidia-105  nvidia-106  nvidia-107  nvidia-108  nvidia-109  nvidia-110 [...]

    Each nvidia-* directory represents an mdev-supported device type and contains information about its function, profile, and other details. The Q profiles (Q1, Q2, Q4, ...) are the important ones here: they determine how the GPU card can be sliced into vGPU mdev devices.

    For example, if you have an Nvidia RTX 6000 card with 24GB of memory and use the Q1 profile, you can create 24 vGPU devices for 24 Instances-xvm machines. With the Q2 profile, you would create 12 vGPU devices, and with the Q4 profile, you would create 6 vGPU devices, and so on.
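    The arithmetic behind these counts is simply the card's memory divided by the profile's per-device framebuffer. A quick sketch, assuming a QN profile allocates N GB per vGPU as in the example above:

```shell
# Sketch: vGPU device count per Q profile for a 24 GB card
# (a QN profile is assumed to allocate N GB of framebuffer per device).
MEM_GB=24
for q in 1 2 4; do
  echo "Q${q}: $((MEM_GB / q)) devices"
done
```

    This prints "Q1: 24 devices", "Q2: 12 devices", and "Q4: 6 devices", matching the example.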

  4. Once you have chosen your profile, create a script, for example "/nvidia/vgpu.sh", that partitions the Nvidia card so that the Nebula system can detect the vGPU devices. Create the file and make it executable:
    mkdir -p /nvidia;
    touch /nvidia/vgpu.sh;
    chmod +x /nvidia/vgpu.sh;

  5. Now, use the vi or nano editor to add the following contents. In this example, we are using the Q2 profile to partition the RTX 6000 into 12 mdev devices. Please note the ###Q2 tag.
    #!/bin/bash
    ###Q2
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd01" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd02" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd03" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd04" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd05" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd06" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd07" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd08" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd09" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd10" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd11" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
    echo "abcdefab-abcd-abcd-abcd-abcdefabcd12" > /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/nvidia-376/create;
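    The twelve create lines can also be generated with a loop. A sketch that only prints the UUIDs (on a real node, redirect each line to the profile's create node as above):

```shell
# Sketch: generate the 12 zero-padded UUIDs used above for the Q2 profile.
# seq -w pads numbers to equal width, giving 01..12.
for i in $(seq -w 1 12); do
  echo "abcdefab-abcd-abcd-abcd-abcdefabcd${i}"
done
```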

  6. Edit the root user's crontab to add the following entry:
    @reboot /nvidia/vgpu.sh;
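    The entry can also be added without opening an editor. A sketch that builds the line as a variable; on the Sky Node you would then install it with `( crontab -l 2>/dev/null; echo "$ENTRY" ) | crontab -`, which preserves any existing entries:

```shell
# Sketch: the @reboot entry for root's crontab, built as a variable so it
# can be appended with: ( crontab -l 2>/dev/null; echo "$ENTRY" ) | crontab -
ENTRY="@reboot /nvidia/vgpu.sh"
echo "$ENTRY"
```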

  7. Finally, reboot and deploy Nebula. When provisioning Instances-xvm machines, you'll have the option to select one of the available vGPU devices for your instance.
