GPU cloud: architecture and mechanism


As the world enters the era of artificial intelligence, demand for parallel computing power, most of which comes from GPUs, is growing at an unprecedented rate.

However, powerful GPUs are expensive and cannot be fitted to every device. The industry is therefore moving GPU compute from client machines into data centers and the cloud.

Major cloud providers such as AWS and GCP already offer GPU instances in their virtual machine rental services. NVIDIA has also developed cloud services for GPU-specific workloads such as gaming and deep learning.

GPU cloud architecture

The core value of the cloud is sharing and renting resources. Hardware can be assigned according to need, which improves utilization and reduces cost.

The virtual machine (VM) is the essential building block for dividing resources, and a GPU cloud is no different. Unlike other hardware such as CPU, memory and storage, however, extra effort is needed to bring a GPU into a VM. The rest of this article explains in detail how GPU resources are virtualized.

The following diagram shows the architecture of a typical GPU cloud: gpu_cloud_architecture

The diagram depicts the structure of one specific zone. The admin host manages the entire data center. The GPU hosts provide GPU-powered VM instances to customers.

Linux virtualization stack

Linux’s virtualization stack contains several layers:



The top layer consists of the users of virtualization: command-line tools, GUI applications, and the various cloud service providers.


libvirt is the API, provided by Red Hat, that unifies management of the different virtualization back-ends.


QEMU is a hardware emulator. It provides a set of emulated hardware and device models for the virtual machine.

KVM runs in kernel space and contains only the core virtualization logic, such as vCPU and memory mapping, so QEMU complements it by providing peripheral support.

When the guest OS operates on emulated hardware, QEMU translates the operation and executes it on the host.


KVM is the virtualization module shipped with the Linux kernel. Since its release it has steadily gained popularity.

Large cloud service providers such as AWS are actively migrating from existing virtualization solutions such as Xen to KVM.


KVM itself is a Linux kernel module. Its interaction with the guest and the host is as follows: kvm_overview

Project Structure

The KVM source code lives in the mainline Linux tree under /virt/kvm. Most of the logic is in kvm_main.c.

Process of creating a VM

Creating a VM is one of the most typical KVM workflows. Let’s walk through the process:

1. Open KVM device

KVM is exposed as a device in Linux, /dev/kvm. Before it can manage a VM, the device has to be opened, which yields the file descriptor (kvm_fd) used in the following steps.
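A minimal sketch of this step, assuming a Linux host with the kernel's <linux/kvm.h> header installed. The helper name open_kvm and its path parameter are hypothetical; KVM_GET_API_VERSION and KVM_API_VERSION come from that header.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Hypothetical helper: opens the KVM device node at device_path
 * (normally "/dev/kvm") and checks the API version.
 * Returns the KVM file descriptor, or -1 if KVM is unavailable. */
int open_kvm(const char *device_path)
{
    assert(device_path != NULL);

    int kvm_fd = open(device_path, O_RDWR);
    if (kvm_fd < 0) {
        perror("open KVM device");
        return -1;
    }

    /* A mismatch means the kernel speaks a different KVM API than the
     * one these headers describe, so it is safer to give up early. */
    int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
    if (version != KVM_API_VERSION) {
        fprintf(stderr, "unexpected KVM API version %d\n", version);
        close(kvm_fd);
        return -1;
    }
    return kvm_fd;
}
```

Calling open_kvm("/dev/kvm") on a host without KVM support, or without permission to access the device, simply returns -1.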


2. Create VM

Once the KVM device is open, we use ioctl to ask it to create a VM:

int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);  /* returns a file descriptor for the new VM */

3. Create virtual CPU for the VM

The core components of a VM are its CPUs and memory. A vCPU is also represented as a file descriptor and is created with ioctl on the VM:

int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);  /* the last argument is the vCPU id */

4. Set up memory for the VM

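The VM’s memory is allocated in user space and then mapped into the guest’s physical address space.

void *mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
/* userspace_addr is a 64-bit integer field, so the pointer has to be cast */
struct kvm_userspace_memory_region region = {.slot = 0, .guest_phys_addr = 0,
    .memory_size = mem_size, .userspace_addr = (uint64_t)(uintptr_t)mem};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);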

5. Run VM

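After everything is set up, we run the vCPU and handle each exit back to user space:

while (true) {
    ioctl(vcpu_fd, KVM_RUN, 0);  /* returns whenever the guest exits to user space */
    /* inspect the exit reason in the vCPU's shared kvm_run structure and handle it here */
}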


The KVM API is mainly used by VM frameworks such as QEMU. Users usually control a VM’s behavior through QEMU command-line parameters and seldom call the KVM API directly.

GPU resource sharing


The core value of the cloud is sharing resources. For CPU, memory and network this is relatively easy to implement: we can assign any fraction of the resource to a VM.

For GPUs, things are different. Most consumer-grade GPUs don’t support dividing GPU resources, so a GPU can only be assigned to one VM exclusively, as a whole. We could fill the PCIe slots with many GPUs and assign one per VM, but that is cumbersome to manage and maintain.


NVIDIA’s vGPU technology addresses this problem. Professional GPUs such as the Quadro and Tesla lines support this feature.

Mediated device framework (mdev)

This is the mechanism used to implement vGPU. It is compatible with the VFIO user API, but the devices are managed by the vendor driver, which controls the resource-sharing logic.

mdev can create multiple virtual devices, each holding a fraction of the GPU’s resources. Each virtual device appears under /sys/bus/mdev/devices/ and is ready to be attached to a VM.
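As a small illustration, the sketch below lists whatever mediated devices currently exist by reading that sysfs directory. The function name list_mdev_devices is hypothetical, and the directory is passed as a parameter so the sketch also behaves sensibly on hosts without mdev support.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/* Hypothetical helper: prints and counts the entries of an mdev sysfs
 * directory such as "/sys/bus/mdev/devices".
 * Returns -1 if the directory cannot be opened. */
int list_mdev_devices(const char *sysfs_dir)
{
    assert(sysfs_dir != NULL);

    DIR *dir = opendir(sysfs_dir);
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue;                       /* skip "." and ".." */
        /* each entry is named after the UUID of one mediated device */
        printf("mdev device: %s\n", entry->d_name);
        count++;
    }
    closedir(dir);
    return count;
}
```

A return value of 0 means the mdev bus exists but no virtual device has been created yet.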

GPU sharing for container

Containerization keeps gaining popularity, so demand for sharing a GPU among containers arises naturally.

As of 2020, it remains an open question whether sharing one GPU among multiple containers is feasible.

The root cause is that most non-datacenter GPUs offer no way to partition GPU resources (CUDA cores, memory) or even to assign priorities. Until NVIDIA provides hardware support for resource sharing, there is little we can do about it🙃

GPU Passthrough: access GPU from VM

A GPU is seen as a PCIe device. It is normally bound to, and accessed through, a host driver. For a VM to use it, the GPU must instead be bound to a special driver, the vfio driver, which lets the device be controlled from inside the VM.

How to enable GPU passthrough

Bind vfio driver to GPU

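Adding kernel parameters through GRUB is a convenient way to bind the vfio driver at boot. Typically this means enabling the IOMMU (for example intel_iommu=on on Intel hosts) and passing the GPU’s vendor:device ID to vfio-pci through its ids option (vfio-pci.ids=<vendor>:<device>).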


Attach GPU to VM

This can be done with a command-line tool or a GUI application. The GPU is added to the VM as a PCIe device.

The client calls the libvirt library, and libvirt starts the qemu process with a parameter that identifies the attached device. The result looks like this:

$ ps -ef|grep qemu
libvirt+ 1351222       1 99 16:14 ?        00:02:21 /usr/bin/qemu-system-x86_64 ... -device vfio-pci,host=01:00.0

vfio driver


In the past, specific device-assignment code had to be written for each kind of PCI device.

The VFIO driver framework was developed to unify how VMs access devices.


In general, vfio acts as a re-mapper of device addresses. When the guest accesses the device, vfio is responsible for remapping the address to the real, physical device address. The host driver is thus bypassed and the guest can access the physical PCIe device directly.
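From user space, the entry point to VFIO is a container device node, normally /dev/vfio/vfio. The sketch below only probes whether VFIO and its Type1 IOMMU backend are available; it assumes <linux/vfio.h> is installed, and the function name vfio_type1_supported is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/* Hypothetical helper: probes the VFIO container node at container_path
 * (normally "/dev/vfio/vfio").
 * Returns 1 if the Type1 IOMMU backend is supported, 0 if VFIO is present
 * without it, and -1 if VFIO is unavailable or the API version differs. */
int vfio_type1_supported(const char *container_path)
{
    assert(container_path != NULL);

    int container = open(container_path, O_RDWR);
    if (container < 0)
        return -1;

    int result = -1;
    if (ioctl(container, VFIO_GET_API_VERSION) == VFIO_API_VERSION) {
        /* VFIO_CHECK_EXTENSION returns a positive value when supported */
        int supported = ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU);
        result = supported > 0 ? 1 : 0;
    }
    close(container);
    return result;
}
```

A VM monitor would go on to add the device's IOMMU group to the container and request a device file descriptor; those steps are left out here because they need a device already bound to vfio-pci.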

VFIO operates on devices at the granularity of an IOMMU group. The IOMMU connects a physical device’s I/O bus to main memory and translates the addresses the device uses for DMA.

An IOMMU group gathers multiple devices (usually logically related) into one set. A typical example is the video and audio devices of an NVIDIA card.

You can use the lspci command to inspect the IOMMU information of a PCI device:

$ lspci -vvv
08:00.0 VGA compatible controller: NVIDIA Corporation Device 2484 (rev a1) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation Device 146b
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 79
    IOMMU group: 13
    Region 0: Memory at f5000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at f0000000 (64-bit, prefetchable) [size=32M]
    Region 5: I/O ports at d000 [size=128]
    Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
    Capabilities: <access denied>
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
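The group number is also exposed through sysfs, where each PCI device directory holds an iommu_group symlink. The following sketch resolves that link; the function name iommu_group_of is hypothetical, and the PCI address is expected in the full domain:bus:device.function form (for example "0000:08:00.0").

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: returns the IOMMU group number of a PCI device,
 * or -1 if the device does not exist or has no IOMMU group. */
int iommu_group_of(const char *pci_address)
{
    assert(pci_address != NULL);

    char link[256];
    char target[256];
    snprintf(link, sizeof(link),
             "/sys/bus/pci/devices/%s/iommu_group", pci_address);

    ssize_t len = readlink(link, target, sizeof(target) - 1);
    if (len < 0)
        return -1;
    target[len] = '\0';                     /* readlink does not terminate */

    /* the link target ends in ".../iommu_groups/<number>" */
    const char *group = strrchr(target, '/');
    return atoi(group != NULL ? group + 1 : target);
}
```

For the device in the listing above, iommu_group_of("0000:08:00.0") would be expected to match the "IOMMU group" line printed by lspci.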

Source code analysis

The VFIO PCI driver’s source code is located at /drivers/vfio/pci, with the main logic in vfio_pci.c.

A VFIO device is treated just like a file that we can open, release, read and write. Its operations are:

static const struct vfio_device_ops vfio_pci_ops = {
    .name		= "vfio-pci",
    .open		= vfio_pci_open,
    .release	= vfio_pci_release,
    .ioctl		= vfio_pci_ioctl,
    .read		= vfio_pci_read,
    .write		= vfio_pci_write,
    .mmap		= vfio_pci_mmap,
    .request	= vfio_pci_request,
    .match		= vfio_pci_match,
};

VFIO read and write share the same backend, vfio_pci_rw. An abridged version of its code:

static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
               size_t count, loff_t *ppos, bool iswrite)
{
    /* the region index is encoded in the upper bits of the file offset */
    unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
    struct vfio_pci_device *vdev = device_data;

    switch (index) {
    case VFIO_PCI_CONFIG_REGION_INDEX:
        return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);

    case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
        return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);

    default:
        /* device-specific regions registered by the driver */
        index -= VFIO_PCI_NUM_REGIONS;
        return vdev->region[index].ops->rw(vdev, buf,
                           count, ppos, iswrite);
    }

    return -EINVAL;
}

We can see that each region index is dispatched to a different handler.
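The index lookup works because vfio-pci packs the region index into the upper bits of the file offset. The sketch below reproduces that encoding in plain C so it can be tried without any hardware. The SKETCH_ names are local stand-ins, and the values (a 40-bit shift, index 7 for the configuration region) are assumptions based on the kernel headers rather than something user space should hard-code.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel's vfio-pci definitions (assumed values). */
#define SKETCH_OFFSET_SHIFT 40
#define SKETCH_OFFSET_MASK ((UINT64_C(1) << SKETCH_OFFSET_SHIFT) - 1)
#define SKETCH_CONFIG_REGION_INDEX 7u

/* File offset at which the given region starts. */
uint64_t region_index_to_offset(unsigned int index)
{
    assert(index < (1u << 24));     /* must fit in the upper 24 bits */
    return (uint64_t)index << SKETCH_OFFSET_SHIFT;
}

/* Region selected by a file offset (what VFIO_PCI_OFFSET_TO_INDEX computes). */
unsigned int offset_to_region_index(uint64_t offset)
{
    return (unsigned int)(offset >> SKETCH_OFFSET_SHIFT);
}

/* Position inside the selected region. */
uint64_t offset_within_region(uint64_t offset)
{
    return offset & SKETCH_OFFSET_MASK;
}
```

Under these assumptions, reading configuration-space offset 0x10 means reading the device file at region_index_to_offset(SKETCH_CONFIG_REGION_INDEX) + 0x10. Real clients should obtain each region's offset from the VFIO_DEVICE_GET_REGION_INFO ioctl instead of relying on the shift.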

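vfio_pci_bar_rw generally takes two steps: it first sets up the I/O mapping, then performs the actual read or write.

ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
            size_t count, loff_t *ppos, bool iswrite)
{
    /* ... bar, pos, res, io, x_start and x_end are derived here (elided) ... */
    int ret = vfio_pci_setup_barmap(vdev, bar);       /* step 1: map the BAR */
    /* ... */
    do_io_rw(vdev, res->flags & IORESOURCE_MEM, io, buf, pos,
            count, x_start, x_end, iswrite);          /* step 2: do the access */
    /* ... */
}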

Data is transferred between user space and the PCIe device, with the VFIO driver as the intermediary:

static ssize_t do_io_rw(struct vfio_pci_device *vdev, bool test_mem,
            void __iomem *io, char __user *buf,
            loff_t off, size_t count, size_t x_start,
            size_t x_end, bool iswrite)
{
    ssize_t done = 0;
    int ret;

    while (count) {
        size_t fillable, filled;

        /* ... fillable = bytes that can be accessed before reaching the
         * excluded range [x_start, x_end) (computation elided) ... */

        if (fillable >= 4 && !(off % 4)) {
            u32 val;

            if (iswrite) {
                /* user buffer -> device */
                if (copy_from_user(&val, buf, 4))
                    return -EFAULT;
                ret = vfio_pci_iowrite32(vdev, test_mem,
                             val, io + off);
            } else {
                /* device -> user buffer */
                ret = vfio_pci_ioread32(vdev, test_mem,
                            &val, io + off);
                if (copy_to_user(buf, &val, 4))
                    return -EFAULT;
            }
            filled = 4;
        }
        /* ... narrower accesses and error handling elided ... */

        count -= filled;
        done += filled;
        off += filled;
        buf += filled;
    }

    return done;
}
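The excerpt shows only the 4-byte case. In the kernel source the same pattern continues with 2-byte and 1-byte accesses, so a transfer is split into the widest naturally aligned accesses available. The sketch below models just that splitting rule in user space; both function names are hypothetical, and the excluded range (x_start, x_end) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Width in bytes of the next access: the widest of 4, 2 and 1 that both
 * fits in the remaining byte count and is naturally aligned at off. */
size_t next_access_width(uint64_t off, size_t remaining)
{
    assert(remaining > 0);
    if (remaining >= 4 && off % 4 == 0)
        return 4;
    if (remaining >= 2 && off % 2 == 0)
        return 2;
    return 1;
}

/* Number of device accesses needed to transfer count bytes starting at off. */
size_t count_accesses(uint64_t off, size_t count)
{
    size_t accesses = 0;
    while (count > 0) {
        size_t width = next_access_width(off, count);
        off += width;
        count -= width;
        accesses++;
    }
    return accesses;
}
```

For example, 7 bytes starting at offset 1 are split into a 1-byte, a 2-byte and a 4-byte access, while 8 bytes starting at offset 0 need only two 4-byte accesses.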